In this lecture, we’re going to be talking about complex data types like struct and array. You can think of array as a list in a column, and you can think of struct as almost like a table within a table.
These two concepts are very important when you’re trying to build more compact data sets. The power of this cannot be understated, but it also has very specific use cases because if you’re using these complex data types, they have usability concerns. They’re harder to query, they’re harder to work with, but they are more compact. So knowing who your customer is and knowing who is going to be using this data when you’re modeling data in this way is also critically important.
What is a Dimension?
A lot of times, like when you hear the word dimension, at least like outside of data, you think of like 3D or 2D or 4D, right? You have like the three dimensions, like x, y, and z, and those things kind of define a space or an area or something like that, right? And that’s partially like how this works as well.
In the data realm, Dimensions are usually attributes of an entity. That might be my birthday, it might be my favorite food, it might be the city I live in, it might be like my name.
- Some of these dimensions are called identifier dimensions. Identifier dimensions are what uniquely identify an entity. Usually, this is like a user ID, or it might be something like a social security number, or like maybe a device ID if you’re tracking your phones.
- Other dimensions are just attributes. And attributes, they aren’t critical to the identification of that entity, but they help you do other sorts of analyses. And these attributes, generally speaking, come in two flavors:
- Slowly Changing. These attributes evolve over time. For example, someone’s favorite food may change from lasagna to spicy curry as their preferences shift. Since the value changes over time, it is considered a slowly changing dimension. Modeling this requires capturing the time-dependent nature of the attribute to reflect its history accurately.
- Fixed. A These attributes remain constant over time. For instance, a person’s birthday or the manufacturer of a phone (e.g., Apple or Samsung) cannot change. Fixed dimensions are easier to model because their values are permanent.
When modeling dimensions, distinguishing between these two types is essential, as slowly changing dimensions often require more effort to track and update over time, while fixed dimensions are straightforward.
Knowing your Consumer
The first step in data modeling is understanding who will use the data and tailoring it to their needs:
- Data Analysts and Scientists:
Datasets for analysts and scientists should be simple and easy to work with. Columns should have clear names, use primitive data types like decimals or strings, and allow for straightforward calculations like sums or averages. Avoid making datasets overly complex, as this will hinder their usability. - Data Engineers:
Engineers often handle complex data structures, so using nested types like structs and arrays is acceptable. These formats can improve compactness and efficiency. In advanced roles, engineers may create master datasets—centralized, reusable datasets that multiple pipelines depend on. Engineers are expected to handle complex data formats in their workflows. - Machine Learning Models:
Models typically require flattened data, with an identifier column and feature values in simple formats like decimals or categories. While some models can handle more complex structures, most prefer primitive data types that are straightforward to process. Naming conventions are less critical for models, as they focus on the structure and content rather than column names. - Customers:
Customers generally prefer consuming insights visually through charts and graphs. These visualizations should be annotated and easy to interpret. Only in rare cases, such as with highly analytical customers, might you deliver raw or semi-processed data directly to them.
Compactness vs. Usability Tradeoff
When designing datasets, balancing compactness and usability is a critical consideration. For example, using arrays or structs can significantly reduce the size of a dataset, as seen in scenarios like modeling Airbnb listing availability, where data size was reduced by 95%. However, these compact formats can make querying and analysis harder. Always consider the tradeoff based on your audience’s technical expertise.
Avoiding Costly Mistakes
Modeling errors can lead to inefficiencies, wasted time, or financial loss. By understanding how the data will be used and by whom, you can avoid such issues. Always think about the downstream impact of your decisions when designing data models.
OLTP vs Mater Data vs OLAP
Types of Data Modeling
In data engineering, there are three main types of data modeling, existing along a continuum. Each type serves a specific purpose and comes with unique considerations.
1. OLTP (Online Transaction Processing)
OLTP is mostly outside of the realm of data engineering because it’s commonly used by software engineers to power transactional systems, focusing on efficiency and speed. Key characteristics include:
- Normalization: Data is structured to minimize duplication, often following rules like third normal form.
- Constraints: Extensive use of primary keys, foreign keys, and other constraints ensures data consistency.
- Complex Queries: Accessing data often involves numerous joins across linker tables.
- Tools: Typically implemented in databases like MySQL or PostgreSQL.
- Single-entity Analysis: OLTP systems are designed for operations focusing on individual entities, such as processing a single user action.
These systems are highly efficient for real-time applications but are not optimized for large-scale analytics.
2. OLAP (Online Analytical Processing)
OLAP is the most common data model for data engineering, optimized for analytics rather than transactional operations. Characteristics include:
- Denormalization: Data duplication is acceptable if it improves query performance.
- Efficiency: The goal is to reduce the need for complex joins, enabling fast queries on large datasets.
- Population-Based Analysis: Designed for analyzing broad datasets or significant subsets, rather than focusing on individual entities.
For example, in an OLAP system, data might be pre-aggregated or grouped to eliminate the need for frequent joins, ensuring speed and efficiency for analytical tasks. While OLTP systems focus on real-time operations, OLAP systems support decision-making processes that require processing large amounts of data.
3. Master Data (Middle Ground)
Master data serves as a bridge between OLTP and OLAP systems, combining characteristics of both:
- It unifies multiple transactional datasets into a single, cohesive table.
- Data is normalized and structured for clarity but remains sufficiently comprehensive for analytical use.
- Master data helps avoid inefficiencies when integrating transactional and analytical systems.
For example, at Airbnb, transactional data such as hosts, guests, listings, and pricing was consolidated into a master dataset for pricing and availability. Without this layer, analysts would need to join dozens of tables manually, making analysis cumbersome and inefficient.
Mismatched needs = Less business value!!
Incorrectly applying a data model can lead to inefficiencies:
- Treating OLAP as OLTP (optimize for the population, but you really only need the single row):
- Pulling too much unnecessary data slows down performance.
- Leads to bloated datasets and redundant information due to duplicated information across tables.
- Treating OLTP as OLAP:
- Excessive joins make queries slow and expensive (due to shuffle).
- Causes frustration and inefficiency, especially for analytics.
Master data plays a critical role in avoiding these mismatches. By acting as an intermediary, it ensures smooth integration and supports both operational and analytical needs. Understanding these types of data modeling and their roles in the continuum is essential for designing systems that are efficient, scalable, and tailored to the needs of end users.
The Continuum of Data Layers
The continuum of data modeling consists of four distinct layers:
/attachments/Pasted-image-20241120193531.png)
- Transactional Data (OLTP): Highly normalized and split into many tables, ideal for real-time operations.
- Master Data: A consolidated view of key entities, serving as the bridge between OLTP and OLAP systems.
- OLAP Cube: Denormalized and flattened, designed for slicing and dicing data in analytics (group by).
- Metrics: Highly aggregated data, often distilled into single values or key performance indicators (KPIs).
Practical Example: Airbnb
At Airbnb, production data was transactional and included details about hosts, guests, listings, and prices. This data was:
- Consolidated into master data tables for pricing and availability.
- Transformed into OLAP cubes to allow analysts to perform efficient groupings (e.g., by country, age group, or device).
- Aggregated further into metrics, such as the average listing price, providing top-level insights.
Each layer serves a distinct purpose, from powering the app’s operations to providing actionable insights for business decisions.
Cumulative Table Design
Let’s shift gears and dive into cumulative table design, a foundational approach for managing historical data. The essence of this design is to maintain a complete historical record of key dimensions up until a specific point in time. This enables you to track user activity or status changes over time effectively.
One challenge when building master data is that not every user or entity may appear in daily data consistently—some users might only be active once a week, for instance. However, the cumulative table ensures no one is left out; it captures all historical dimensions, even if there are gaps in daily activity. The design revolves around combining today’s data and yesterday’s cumulative data to create a seamless historical view.
How It Works
There are two ways that cumulative table works:
- Combining two data tables. You start with two data tables: today’s data (representing the most recent activity or updates) and yesterday’s cumulative data (which holds all historical data up until yesterday). These are merged using a full outer join. Why full outer? Because there might be records that exist in one table but not the other and will full outer join we can get the whole set:
- Some entities may only exist in yesterday’s table (they didn’t show up today).
- Conversely, some may be new today and absent in yesterday’s cumulative view.
- Handling Missing Values
After merging, you’ll need to coalesce values. This ensures you handle situations where a value exists in one table but not the other—by prioritizing today’s data when available. - Cumulative Output
The resulting table merges historical and current data into a single cumulative view. This becomes the input for tomorrow’s cumulative table, creating a self-sustaining loop.
If this feels abstract, don’t worry! You’ll see this in action during the lab session, where we’ll apply these concepts directly.
Practical Examples: Facebook and Beyond
One powerful use case for cumulative tables is growth analytics. At Facebook, we had a table called dim_all_users, which tracked every user’s activity daily. This table was pivotal for calculating metrics like daily, weekly, or monthly active users.
- The cumulative table let us “pull history forward” by incorporating previous data and today’s updates through the full outer join.
- It served as the source of truth for all user data, feeding into 10,000 downstream pipelines. If any team needed user information, they relied on
dim_all_users.
Modeling State Transitions
Cumulative tables also enable detailed tracking of state transitions. For instance, Facebook used cumulative designs for “growth accounting” to model user states between today and yesterday. These transitions included:
- Churned Users: Active yesterday but not today.
- Resurrected Users: Inactive yesterday but active today.
- New Users: Nonexistent yesterday but active today.
This framework provides powerful insights into user behavior. Tracking transitions like these is a key component of the analytics track and will be covered extensively.
By analyzing these transitions, we could better understand user behaviors over time.
Diagram of Cumulative Table design
Here’s the kind of diagram for how cumulative table design works:
You have today and yesterday. Today is just the data for today, and yesterday might be all the accumulated history up until that point, or it might be null. If you’re just starting the accumulation process today, then yesterday will be null and won’t have any values.
A critical part of this design is deciding how many users to hold on to and for how long. For example, Facebook has a couple of billion active users, but the number of inactive users—those who have made an account and left—is in the tens of billions. Many of those users are never coming back, so they aren’t relevant to your analysis. There needs to be a line where you prune out stale users. At Facebook, they used a rule: if a user hadn’t been active in 180 days, they were kicked out of the table. But that depends on the company, the business, and the requirements of what your analysis needs.
As this table grows daily, it can get unwieldy. You need to design it carefully, with filtration or pruning processes to keep it efficient. Otherwise, these tables can grow too large to manage.
In cumulative table design, the process starts with a full outer join of yesterday’s and today’s data. You then coalesce the IDs. Yesterday and today will both have a user ID, but one might be null while the other exists. If someone exists in both tables, you handle that. If they only exist in one, you handle that too. Coalescing IDs helps you get back to just one ID.
Once that’s done, you compute cumulative metrics. With all of history in your table, you can see patterns like how long it’s been since a user was last active. For example, if it’s been seven days since a user was last active, you can increment that counter by one if they still aren’t active. This enables you to track all sorts of interesting cumulative metrics. You can also collect additional values or changes into arrays. This way, you can see all the history of a user in one row, and you can add new data points to that row as time goes on.
At the end of the process, today’s cumulative output becomes yesterday’s input for the next day’s accumulation. This is how the cumulative process works continuously.
Strengths and Drawbacks of Cumulative Table Design
Strengths
- It allows historical analysis without needing a
GROUP BYoperation. For example, you can track a user’s last 30 days of activity in one row as an array, making queries much faster. - Transition analysis is straightforward. You can easily track states like churned, resurrected, or active/inactive transitions.
- Scalability: Since all historical data is already in the table, querying just the latest row gives you cumulative insights without scanning many rows.
Drawbacks
- Backfilling can only be done sequentially because each day depends on the previous day’s data. You cannot process days in parallel, which can slow down backfilling.
- Handling PII and inactive users can be messy. You often need additional tables to filter out deleted users or remove stale data, which adds complexity.
Compactness vs. Usability Tradeoff
There is a tradeoff between compactness and usability.
The most usable tables have an identifier and simple, flat attributes like strings or numbers. These are easy to query and group by, making them great for analytics.
The most compact tables, on the other hand, might have only an identifier and a single blob of bytes. This approach compresses data as much as possible, which is useful for production systems with high I/O constraints. However, it’s not user-friendly for analysts.
The middle ground involves using arrays, maps, and structs. This allows some level of compactness while still being queryable. For example, you could have a table with an identifier and an array of structs, making it efficient for storage but slightly harder to query.
When would you use each type of table?
The most compact approach is all about online systems. If you’re serving an app with thousands or millions of users, you want to minimize data as much as possible.
The middle ground is the master data layer. Here, you can use more complex data types, like structs and arrays.
Finally, the most usable is the OLAP cube layer, favored by analysts.
Let’s talk a little bit more about the middle ground layer. It’s particularly useful if your downstream consumers are data engineers who join tables and create additional downstream datasets. In such cases, using more complex data types is highly beneficial. Structs, arrays, and maps all have their trade-offs.
Struct vs Array vs Map
A struct can be thought of as a table within a table, with defined keys and values. The data types of the values can vary, which is an advantage. In contrast, maps require all values to be the same type, which can lead to casting problems. However, maps allow for an unlimited number of keys (up to a framework-specific limit, often 32,000 or 65,000). Arrays are ideal for ordered data sets, where each element in the array must be the same data type. These types can be nested, such as arrays of structs or arrays of maps.
Temporal Cardinality Explosions of Dimensions
At Airbnb, this approach was critical for modeling data with a temporal component. For example, a listing has a calendar, which contains nights. The question becomes: do you model this as six million listings or two billion nights? Compressing data effectively can be challenging. Using an array of nights within a listing ID versus exploding the data into rows presents trade-offs.
Compression techniques, like run-length encoding used in Parquet file format, are essential. When data is sorted, duplicate values can be compressed effectively. For instance, if a listing ID has 365 nights sorted in order, the ID can be stored once with a count, significantly reducing data size. However, if sorting is disrupted, such as during a join in Spark, compression is less effective, leading to larger datasets.
Badness of denormalized termporal dimensions
Let’s delve into how denormalized temporal dimensions work and the challenges they can present. When data is organized at the “listing night” level and is nicely sorted, issues arise if a join operation disrupts that sorting. Specifically, Spark Shuffle can break the sorting, which has a significant impact on data compression. This happens because the effectiveness of run-length encoding relies on sorted data. When sorting is disrupted, compression becomes less efficient, leading to larger datasets.
Run-length encoding compression
Let’s consider this dataset as example:
/attachments/Pasted-image-20241121221858.png)
Run-length encoding works by nullifying duplicate data and only storing the unique value once, accompanied by a count of occurrences. For instance, if a dataset contains five consecutive rows with the value “AC Green,” run-length encoding replaces those rows with a single instance of “AC Green” and a count of five. Let’s see visually:
/attachments/Pasted-image-20241121221936.png)
This drastically reduces storage requirements, especially for datasets with a temporal component, where duplicate values are common.
However, when sorting is broken—such as during a join operation—compression efficiency decreases. Consider the flat table of player names and seasons that we’ve seen above and that is initially sorted. After a join in Spark, the table might become unsorted, scattering previously contiguous data like “AC Green” across different parts of the dataset:
/attachments/Pasted-image-20241121222117.png)
As a result, run-length encoding can only compress small segments of the data (only the last row in this specific example), leading to a much larger final dataset.
This problem is often overlooked. Engineers may not realize that their output dataset has grown significantly due to broken sorting. The solution is twofold:
- one option is to re-sort the data after every join operation, but this approach is inefficient and prone to errors.
- I advocate for modeling data in a way that maintains sorting automatically, without requiring downstream consumers to re-sort it.
One effective strategy is to use arrays for complex data modeling. For example, rather than representing player names and seasons as separate rows, you could store a single row for each player, with an array containing all their seasons. This way, when a join is performed on the player name, the array structure preserves the sorting of seasons. After the join, the array can be exploded back into individual rows, but the original sorting is retained. This approach ensures efficient compression and eliminates the need for re-sorting.
Using complex data types like arrays for master data has numerous advantages. Downstream data engineers benefit from pre-sorted datasets, reducing the likelihood of errors and oversized datasets. Properly modeled data simplifies their work and prevents inefficiencies caused by broken sorting.
Spark Shuffle, while optimizing join performance, inherently disrupts sorting. Understanding and mitigating this challenge is crucial for maintaining efficient data pipelines. By leveraging complex data types and thoughtful modeling practices, you can avoid the pitfalls of sorting disruptions and create more robust systems.