Is it a fact or dimension?
When trying to distinguish between facts and dimensions, it’s not always clear-cut. One thing that came up frequently at Facebook, where I worked in growth, is how similar some definitions can seem, even when they represent very different ideas. For example, let’s consider two dimensions used in Facebook’s user data model: dim_is_active and dim_is_activated. These two are conceptually distinct, yet they often caused confusion.
Example 1: dim_is_active
dim_is_active was a dimension representing user activity. It was based on whether the user had any engagement on the platform. For instance:
- A user might show up for at least a minute on the app, or
- They might engage in some way, like posting a comment, liking, or sharing content.
This is where things get tricky. If this dimension is derived from event data—like liking, commenting, or sharing—then is it really a dimension? Or is it more of an aggregation of facts? The answer is that it can be both. This dual nature of dimensions and facts is a key challenge in data modeling, as they often overlap and blur together.
Example 2: dim_is_activated
By contrast, dim_is_activated represents whether a user account is in an activated state. This dimension is based on an attribute of the user, not on their activity or event-based engagement. For example:
- If a user deactivates their account, this state changes from activated to deactivated.
- It’s a flag that directly reflects the account’s status, not an aggregation of multiple events.
This distinction is clearer because dim_is_activated is directly tied to the user object. It’s not derived from events or activity—it’s just a property of the account.
Complex Scenarios: Overlap Between Dimensions
Things get more complex in edge cases. For instance, at Facebook, a user could deactivate their Facebook account but keep Messenger active. This meant they could have:
dim_is_activated = false(Facebook account deactivated)dim_is_active = true(still engaging on Messenger).
This created a unique group of “Messenger-only” users—those who didn’t use the Facebook feed but still used the DMs functionality. These overlapping states illustrate how dimensions and facts can become entangled.
Bucketization: Aggregating Facts into Dimensions
Another important consideration in modeling is how we handle aggregations of facts when creating dimensions. For example, let’s say you wanted to analyze users based on the number of likes they gave in a day. If one user liked 17 posts, do we want a dim_likes_per_day = 17? Probably not. Instead, we might bucketize this dimension into ranges, such as 10–20 likes per day.
Bucketization helps reduce cardinality, which is crucial for creating informative groupings. Without it, the data can look strange. For example:
- A user who liked 1,700 posts in a day (because they were highly active) might end up in a bucket by themselves. This is not helpful for analysis.
- Instead, you’d create broader buckets, such as “500+ likes,” to capture these outliers in a meaningful way.
When designing buckets, aim for 5 to 10 distinct groups. For instance, you could base them on statistical distribution—like quartiles or quintiles:
- 0–10%
- 10–25%
- 25–50%
- 50–75%
- 75%+
This approach creates a balanced set of groups informed by the data, rather than arbitrarily choosing ranges.
Cardinality and Grouping
High cardinality—where every group has only one member—is another problem to watch out for. Grouping implies forming clusters, and if each “group” contains just one user, it undermines the purpose of aggregation. Additionally, high-cardinality data can affect analysis and predictive modeling:
- Small group sizes lack statistical power. You generally need at least 30 members in a group to assume normality in your data.
- With fewer data points in each group, your results may become less reliable.
Balancing Flexibility and Compression
While bucketization reduces cardinality and improves analysis, it comes at a trade-off. Unbucketed data is more flexible because you can slice it however you like, but the higher cardinality can impact storage and query performance.
The Blurry Line Between Facts and Dimensions
To illustrate how blurry this line can get, let’s revisit dim_is_activated. The state (dim_is_activated) is clearly a dimension. However, the event of changing that state—deactivating your account—would be a fact, complete with a timestamp. The fact represents the action, while the dimension reflects the resulting state.
This overlap highlights an important point: dimensions and facts are often interdependent. Dimensions can be derived from facts, and facts can represent changes in dimensions. The delineation between the two is often not absolute, and this is where thoughtful modeling comes into play.
Final Takeaways
- Dimension vs. Fact: Dimensions are typically attributes (e.g., account state), while facts are events or actions (e.g., deactivating an account). However, they often overlap, especially when dimensions are derived from aggregated facts.
- Bucketization: When creating dimensions from aggregated facts, use informed bucketization to manage cardinality and maintain meaningful groupings. Statistical methods like quartiles or quintiles are a good guide.
- Cardinality Considerations: High cardinality can reduce the effectiveness of groupings and complicate analyses. Aim for groups with sufficient members to ensure statistical reliability.
- Trade-offs: Balancing flexibility (keeping raw facts) and performance (reducing cardinality through bucketization) is key in designing efficient data models.
By carefully considering these points, you can create data models that are both performant and insightful, while avoiding common pitfalls like poorly defined buckets or overly complex dimensions.
Properties of Facts vs Dimensions
Dimensions
- Definition: Dimensions are the attributes or fields used to group data when performing analyses, particularly in SQL queries. They show up in the
GROUP BYclause. Examples include:- User ID
- Country
- Device
- Gender
- Scoring class
- Cardinality:
- Dimensions can be high-cardinality, such as
user_id, where there are many unique values. - They can also be low-cardinality, such as
gender, where there are only a few unique values (e.g., male, female, non-binary, etc.). - Some, like
country, fall somewhere in the middle, with a moderate number of unique values.
- Dimensions can be high-cardinality, such as
- Source: Dimensions typically come from a snapshot of state. For example:
- At companies like Facebook, Netflix, or Airbnb, production databases are periodically captured to create a snapshot of the system at a specific point in time.
- The values in the snapshot represent the dimensions for that particular date.
Facts
- Definition: Facts are the measurable, event-driven data points that you aggregate in your analyses. They are what you sum, average, count, or use in similar aggregation functions. Examples include:
- The number of logins a user performs.
- The number of videos watched.
- The number of likes, comments, or shares.
- Key Characteristics:
- Inside Aggregations: Facts go inside the aggregation functions (by things like
SUM,AVG, andCOUNT), while dimensions are outside as part of theGROUP BY. - Higher Cardinality: Facts often represent user activities, which are inherently more numerous than the users themselves. For example:
- A single user can perform multiple actions—watching videos, liking posts, or making purchases.
- Even if a user performs the same action repeatedly (e.g., watching multiple videos), those instances will all contribute to the fact count.
- Inside Aggregations: Facts go inside the aggregation functions (by things like
- Source: Facts are generated through logs of events. For example:
- Each time a user interacts with an app, such as by clicking a button or watching a video, a log entry is created.
- These event logs are aggregated to calculate the facts.
The Blurry Line Between Dimensions and Facts
The distinction between facts and dimensions isn’t always clear-cut. Here are some scenarios where they overlap or interact:
- Aggregated Facts Turning into Dimensions:
- Aggregated facts, such as the total number of videos watched by a user, can become dimensions for further analysis.
- For instance, you could group users by ranges of total videos watched (e.g., 0–10, 11–50, 51+).
- Facts Changing Dimensions:
- Facts often lead to updates in dimensions. For example:
- When a user deactivates their account, that event (fact) changes the dimension
account_statusfromactivetoinactive.
- When a user deactivates their account, that event (fact) changes the dimension
- Facts often lead to updates in dimensions. For example:
- Change Data Capture (CDC):
- What is CDC? Change Data Capture is a system that logs changes in the state of dimensions as discrete events. For example:
- A dimension like
account_status(active/inactive) can be recreated at any point in time by analyzing a series of change events.
- A dimension like
- Why is CDC Blurry? It sits in the middle of the fact-dimension divide:
- The changes are logged as events (facts).
- These events can then be used to reconstruct the state of the dimension at any time.
- What is CDC? Change Data Capture is a system that logs changes in the state of dimensions as discrete events. For example:
Summary
- Dimensions are attributes used for grouping data, often derived from snapshots of state, and can have low, medium, or high cardinality.
- Facts are measurable data points, generated from logs of events, and are typically aggregated in analyses.
- The line between the two can blur, especially with systems like CDC, where facts and dimensions influence and transform each other. Understanding this interplay is crucial for effective data modeling.
Airbnb Example
When I worked at Airbnb in the Pricing and Availability team, we often dealt with questions about how to model data attributes, particularly price. Price is an interesting case because, on the surface, it seems like it might be a fact. For example, the price for a specific night on an Airbnb listing could be summed, averaged, or counted—common operations for facts. Since prices are typically represented as doubles (decimal numbers), and it’s rare for such values to be dimensions, you might naturally assume that price is a fact.
However, price also behaves like a dimension. It represents the state of a specific night for a listing—the price set for that night. State-based attributes are generally considered dimensions, as they provide a snapshot of something at a particular moment in time.
This dual nature becomes clearer when you consider how prices are derived. Hosts can adjust settings, such as applying a last-minute discount or an early-bird discount. Each time they make such a change, an event is logged—this is a fact. However, the price itself is calculated based on these settings, which are part of the host’s state. As a result, price is better modeled as a dimension, even though it might feel like a fact at first glance.
This distinction highlights the subtle and sometimes surprising nuances of data modeling. Facts and dimensions can overlap or even transform into one another depending on how they are defined and derived, which is what makes data engineering such an intricate and fascinating puzzle.
Boolean/Existence-based Fact/Dimensions
Let’s delve deeper into dimensions that are derived from facts. We’ve already talked about dim_is_active, which makes sense as a dimension since it’s derived from an aggregation of user events, such as logging in or engaging with the app. Similarly, there are other dimensions like dim_bought_something, which could indicate whether a user has ever made a purchase. This dimension is essentially a rollup—it checks if there’s at least one entry in a facts table, such as a purchases table, where the count is greater than zero.
Other examples include dim_has_ever_booked, or even more nuanced ones like dim_ever_labeled_fake, which I worked on at Facebook. This dimension was derived from a machine learning model that flagged an account as fake. Even if the label was later removed, the fact that the account was ever flagged as fake could be a powerful signal for predicting future behavior. These types of dimensions are usually binary and irreversible—they flip from false to true and stay there.
However, some dimensions are more dynamic, like dim_is_monthly_active, which can toggle back and forth based on user activity. For example, a user might be active one month, inactive the next, and active again later. This introduces a time component, adding complexity to how dimensions are modeled.
Another powerful concept is “days since”, which we’ll explore in detail during week five of the analytics track. This is foundational to retention analytics and cohort analysis. It’s used extensively at companies like Facebook. For instance, you might look at users who signed up on a specific day and track how many of them remain active over time. Typically, this follows a retention pattern that starts at 100% (on day one) and then declines, eventually flattening out at an “asymptote” where a smaller percentage of users—say 10–20%—remain active long-term. This is referred to as a J-curve or retention analytical pattern.
Days since dimensions add another layer of aggregation. For example, you might calculate how many days it’s been since a user last engaged with the app or how many days since they signed up. These can help track trends over time and are essentially dimensions on top of dimensions, built on aggregated facts.
As we’ll see in today’s lab, dim_is_active will be a primary dimension to explore. These derived dimensions, though often intricate and layered, are essential tools for understanding user behavior and building actionable insights.
Categorical Fact/Dimensions
- Simple vs. Complex Bucketing:
- Simple bucketing involves grouping users or entities based on a single column or value, like “points” from Week 1.
- Complex bucketing involves evaluating multiple columns or conditions. For example, at Airbnb, determining if a host is a Superhost requires checking criteria like ratings, bookings, revenue, and cancellations. These conditions collectively define whether someone meets the requirements.
- Strategic Importance of Dimensions:
- Dimensions like “Superhost” are key to a company’s goals, often serving as a North Star metric. For Airbnb, increasing the number of Superhosts is a priority.
- Definitions must be meaningful and non-arbitrary to avoid “gaming the system.” For instance, making everyone a Superhost devalues the metric.
Challenges in Dimension Definitions
- Hard-to-Change Dimensions:
- Once a dimension is widely used across the company, changing its definition can have cascading effects, making revisions costly and time-consuming.
- Example 1: Facebook’s 5,000-friend limit prevents reverting to a lower cap once raised, as users can’t retroactively “unfriend” thousands of connections.
- Example 2: LinkedIn’s 30,000-connection limit, which users often criticize as arbitrary but remains unchanged due to system limitations or user behavior anchoring.
- Case Study: Airbnb’s
dim_is_available:- The old definition of availability was: “Did the host set a rule that blocks this night?”
- The new definition became: “Can a trip be booked that contains this night?”
- The difference (~3–4%) may seem minor but introduced complexities like sandwich nights (nights blocked because they don’t meet minimum booking rules).
- Impact: Changing this definition took two years, requiring updates to systems and teams across Airbnb.
Key Takeaways for Categorical Dimensions
- Involve Stakeholders Early:
- Dimension definitions should be created thoughtfully, with input from all relevant teams, to reduce the need for revisions later.
- Account for Edge Cases:
- Complex dimensions often have subtle edge cases (e.g., sandwich nights). Understanding these upfront prevents downstream pain.
- Changing Dimensions Is Expensive:
- Once dimensions are ingrained in systems and analytics, even small changes can take years and significant effort to implement.
- Before finalizing a definition, assess how it aligns with business goals and how changes will impact the organization.
- Set Definitions That Add Value Without Arbitrary Limits:
- Make sure dimensions like thresholds or limits (e.g., Superhost criteria, connection limits) are meaningful and don’t harm user experience or business goals.
Should you use dimensions or facts to analyze users?
- The answer is both.
- There’s an intuition that active users are probably more valuable, but activated users matter too because they give a different view of the user lifecycle.
- One powerful analysis could be dividing active users by activated users, which gives you a percentage of users who signed up and then became active.
- Example: This helps you understand how many users who signed up are actually interacting with the product.
Signups vs. Growth
- Signups focus on new users who’ve completed the initial steps (e.g., creating an account).
- Growth focuses on how those users continue to engage with the product (active users).
Key Takeaway
Depending on the question you want to answer, both metrics are valuable. You need to consider both activation (how many sign up) and engagement (how many stay active) when analyzing users.
The Extremely efficient Date List data structure (do it again the last part)
At Facebook, a common question is tracking Monthly Active Users (MAU), Weekly Active Users (WAU), and Daily Active Users (DAU). The problem arises when trying to calculate these metrics because to determine if a user is monthly active, you need to process the last 30 days of data. This becomes inefficient, especially with a large user base. For example, if there are 2 billion users each performing 50 actions per day, you’d end up with 3 trillion rows of data to process just for MAU.
To tackle this, Facebook used a Cumulative Table Design. This design processes 30 days of data at once, and each new day, it drops the oldest day’s data (the 31st day) and keeps everything else. The table accumulates the data over time, so you don’t need to constantly recalculate the last 30 days from scratch.
However, storing dates as an array of active days can still be inefficient. The solution was to use a thing called “datelist_int”. Instead of storing the actual dates, this format stores the offset of the active days, which reduces the data size. For example, a user might be active on January 1st, and that can be stored as “1,” with later active days stored as integers representing their offsets (e.g., “7” for December 25th of the previous year).
The benefit of this approach is that it stores the same 30 days of activity history in a much more compressed and efficient manner, using integer data types which compress well. This system was critical in handling massive volumes of user data efficiently.