Today, we are discussing slowly changing dimensions (SCDs).
A slowly changing dimension refers to an attribute that evolves over time. For example, consider your favorite food. In the early 2000s, my favorite food was lasagna. Over time, as I developed lactose intolerance, I began to enjoy lasagna less. After moving to the Bay Area, my preference shifted to curry. This illustrates how certain attributes, like favorite foods, can change over time.
These changes are tied to specific timeframes. For instance, from 2000 to 2008, it was lasagna, and since then, it has been curry. In dimensional modeling, it’s important to track such changes accurately. Dimensions like these have associated timeframes, while others, such as your birthdate, remain static. Birthdates are examples of dimensions that don’t change.
Modeling slowly changing dimensions correctly is crucial because improper handling can affect idempotence. Idempotence is the ability of data pipelines to produce consistent results, whether running in production or during backfill. This is a key property for maintaining data quality. Idempotence is essential for reliable data engineering. Understanding and applying this concept will significantly enhance your ability to build robust pipelines.
Idempotent pipelines are CRITICAL
Idempotent pipelines are essential for maintaining reliable data workflows. If a data pipeline is not idempotent, it can lead to significant issues and inefficiencies.
By definition, idempotence refers to an operation that produces the same result when applied multiple times. In the context of data pipelines, this means the pipeline should yield consistent outcomes regardless of whether it runs once, multiple times, or during a backfill.
This concept is critical in data engineering because it ensures data quality and consistency across production and backfill runs. Understanding and applying idempotence in your pipelines will help you avoid errors and build robust, maintainable systems. Let’s explore this idea further to clarify its importance in the data engineering process.
Pipelines should produce the same results
Pipelines must consistently produce the same results, regardless of when or how many times they are run. If all necessary inputs are available and signals are ready, the pipeline should generate identical outputs whether executed today, a year from now, or a decade later. This consistency is the core principle of idempotence, ensuring that a function with the same inputs always produces the same outputs.
In data engineering, failing to achieve idempotence can result in non-reproducible data, leading to significant issues. Imagine running a pipeline today, then backfilling it a week later, only to discover different results despite unchanged code. This discrepancy can confuse analytics teams, cause data inconsistencies, and undermine trust in datasets.
Non-reproducibility can manifest in several ways:
- Inconsistent Data: Running the same pipeline at different times yields different outputs.
- Troubleshooting Challenges: Pipelines don’t visibly fail but instead produce inconsistent data, making debugging difficult.
- Downstream Impacts: Non-idempotent pipelines lead to non-idempotent downstream tables and processes, propagating inconsistencies throughout the data warehouse.
These issues can frustrate analytics teams, as mismatched numbers and unexplained discrepancies lead to a loss of confidence in the data. Following best practices to ensure idempotence in pipelines helps prevent these problems, supports reliable analytics, and maintains data integrity.
What can make a pipeline not idempotent
First, why is it hard to troubleshoot? If you have a pipeline that’s not idempotent, it doesn’t fail—it just produces different data. I prefer not to use terms like “incorrect” or “different” and instead call it “non-reproducible.” A pipeline that doesn’t produce the same data when run at different times is problematic. This usually shows up when an analyst says, “Why does the data look weird? Why do these numbers not match?” This could happen because you backfilled recently, and the pipeline isn’t idempotent.
It could also happen in the opposite direction: if your pipeline depends on a non-idempotent upstream process or data table, you’re at the mercy of that pipeline’s inconsistencies. This is critical because non-idempotent data bleeds inconsistencies throughout the data warehouse due to the transitive property. This frustrates analytics teams, causes them to lose trust in datasets, and prompts questions like, “Why can’t these numbers match?” Let’s break down what causes these problems.
So, let’s explore some motivations make a pipeline not idempotent:
- One major issue is
INSERT INTOwithoutTRUNCATE. Say you’re processing data for a day, and you useINSERT INTO. If you run the pipeline once, great. But if you run it twice, you end up with double the data because nothing clears the table before the insert. Without aTRUNCATEstep, the data duplicates every time you run it, breaking idempotence. My rule as a data engineer is: never useINSERT INTOunless you’re usingTRUNCATE. Generally, I avoidINSERT INTOentirely. Instead, useMERGEorINSERT OVERWRITE.MERGEworks by taking new data and merging it with the old data. If you run the pipeline again, it matches the data and avoids duplication, keeping the dataset consistent. Similarly,INSERT OVERWRITEclears the target partition and replaces it with new data, ensuring consistency regardless of how many times the pipeline runs. - Another common mistake is using start_date > without a corresponding end_date <. For example, if your pipeline has a
WHEREclause likedate > yesterday, the data grows with each run. Running it today processes one day’s data, running it tomorrow processes two, and so on. This accumulates data unnecessarily and breaks idempotence. Instead, define a fixed window, likedate > yesterday AND date < today. This creates a manageable data range, avoiding overlaps or unbounded data growth. - A third issue is not using a full set of partition sensors. If the pipeline runs before all input data is ready, it processes incomplete inputs. This might happen if partition checks aren’t implemented properly. For example, a pipeline could execute too early in production, missing one input dataset, but run correctly during a backfill. This discrepancy creates inconsistencies between production and backfill outputs.
- Another critical problem is cumulative pipelines without sequential processing. Some pipelines depend on yesterday’s data to generate today’s outputs. These pipelines must run sequentially, but running them in parallel during backfills can fail if yesterday’s data hasn’t been processed yet. Without enforcing sequential execution, such pipelines might “start over,” processing only today’s data and skipping the accumulation of previous days.
- Finally, pipelines can suffer from depends-on-past issues. This happens when cumulative pipelines try to process data in parallel but fail due to unmet dependencies. For instance, if yesterday’s data hasn’t been created yet and today’s pipeline runs, it might reset the cumulative dataset, leading to incorrect results.
The key takeaway is this: production behavior and backfill behavior must be the same. That’s the beauty of idempotent pipelines.
Example at Facebook
Let’s explore some other things that can make a pipeline not idempotent. Non-idempotent pipelines can cause significant issues, as demonstrated by one of the most frustrating problems I encountered at Facebook. I was tasked with building a data model to track fake accounts. Fake accounts can transition through various states: they may be labeled fake, unlabeled after passing a challenge, and potentially relabeled as fake if they exhibit suspicious behavior again. These state transitions required tracking inflows and outflows of fake accounts, leading to the creation of a table called Dim All Fake Accounts.
The Problem with Non-Idempotence
The Dim All Fake Accounts table was non-idempotent because it relied on the latest data from another table, Dim All Users. This design aimed to reduce data latency and make the fake account data available as quickly as possible. However, this shortcut came with significant trade-offs.
Some days, Dim All Users would complete on time, allowing Dim All Fake Accounts to use today’s data. Other days, Dim All Users would lag, causing Dim All Fake Accounts to use yesterday’s data. This inconsistency meant that sometimes the fake accounts table reflected current data and other times outdated data. While this approach might be acceptable in certain contexts, it’s generally unsuitable for analytical purposes, especially when tracking metrics like state transitions.
The Consequences
When I built charts and tracked fake account state transitions using this data, the analytics team noticed discrepancies. They asked, “Why are your numbers not reproducible? Why don’t they match Dim All Users?” Despite reviewing all my SQL code and logic, I couldn’t identify the problem for weeks. It turned out that the issue wasn’t with my work but with the non-idempotent upstream pipeline (Dim All Users) that my table relied on.
This problem was a significant source of frustration, as I felt like I was doing everything correctly but still couldn’t achieve reliable results.
Lessons Learned
- Don’t Prioritize Latency Over Data Quality. Relying on the latest partitions in production to reduce landing time is a dangerous shortcut. While it might make the data available faster, it sacrifices consistency and quality. This approach can lead to discrepancies that are difficult to trace and resolve.
- The Exception for Backfilling. The only acceptable scenario for relying on the latest partitions is during backfilling, assuming you have a properly modeled Slowly Changing Dimension (SCD) table. In production, this approach should be avoided.
- Cumulative Pipelines and Non-Idempotent Dependencies. If a cumulative pipeline depends on non-idempotent data, it carries the inconsistencies forward daily, compounding the issue. Fixing such pipelines often requires starting over, which can be extremely time-consuming and painful.
Takeaways for Data Engineering
Non-idempotent pipelines can cause:
- Inconsistent Data: Outputs vary depending on whether data was processed in production or during a backfill.
- Persistent Bugs: Errors in upstream pipelines propagate downstream, leading to incorrect or misleading results.
- Engineering Burnout: The frustration of working with unreliable data pipelines can negatively affect morale and productivity.
To avoid these problems:
- Always prioritize idempotence by designing pipelines that produce consistent results regardless of runtime or conditions.
- Avoid shortcuts like relying on the latest data in production.
- Ensure that cumulative pipelines depend only on idempotent inputs to maintain accuracy over time.
Non-idempotent pipelines might seem like a quick fix for latency, but the long-term costs in terms of data quality, engineering time, and trust in the system far outweigh any immediate benefits.
The pains of not having idempotent pipelines
Idempotence has been a recurring theme, and for good reason. Here are a few more critical points to highlight its importance and the challenges that arise when pipelines aren’t idempotent.
- Backfilling and Production Inconsistencies. If your pipeline isn’t idempotent, backfill and production runs will not produce the same data. Instead of creating consistent outputs, backfilled data will overwrite or differ from production data. This inconsistency leads to hard-to-troubleshoot bugs. Typically, these bugs only surface when an analytics team dives deeply into the SQL and uncovers discrepancies. By that point, the issue has likely caused significant frustration and wasted time.
- Unit Testing Limitations. Unit testing alone cannot replicate production behavior if the pipeline is non-idempotent. Data engineers may think, “Well, my unit tests pass, so the pipeline is fine.” However, if the pipeline isn’t idempotent, unit tests are incapable of catching problems related to non-reproducibility. On the other hand, if the pipeline is idempotent, unit tests become far more effective and meaningful, as they can reliably validate the pipeline’s correctness and guarantee consistent results.
- Silent Failures. Non-idempotent pipelines lead to silent failures, particularly during backfills or restatements. These failures often don’t cause the pipeline to break outright but instead result in incorrect or inconsistent data. Silent failures are among the most time-consuming and frustrating problems in data engineering because they are harder to detect and fix. They waste time, erode trust in the data, and complicate debugging efforts.
The steps to ensure idempotence might seem minor, but they make a substantial difference. Addressing these issues not only improves the reliability of pipelines but also saves significant time and effort in the long run. Silent failures, mismatches between backfill and production, and ineffective unit tests can all be avoided with proper idempotence.
Should you model as Slowly Changing Dimensions?
We’re shifting gears to talk about slowly changing dimensions (SCDs). First, let’s define what they are. A slowly changing dimension is a dimension that changes over time. For example, age is a classic case: I’m 29 now, but I’ll be 30 next year. Another example is the type of phone you use—today I’m an iPhone user, but I used to use Android. Even country can be a slowly changing dimension: someone might have lived in the Dominican Republic but later moved to the U.S.
Not all dimensions change. Some, like your birthday or eye color, are essentially fixed. Conversely, some dimensions change too quickly to be modeled as slowly changing, like heart rate, which fluctuates minute to minute. For dimensions that don’t change rapidly, SCDs are effective. The slower the dimension changes, the more efficient the modeling becomes.
Max’s Criticism of Slowly Changing Dimensions
Max, a former colleague who created Apache Airflow, strongly opposes SCDs. He argues they are inherently non-idempotent and violate principles of functional data engineering. Max advocates treating pipelines as functions, where inputs and outputs align predictably.
Instead of modeling slowly changing dimensions, Max suggests using daily snapshots, where the value for every dimension is stored each day. His reasoning is that storage is cheap, while fixing data quality issues caused by SCDs is costly. By recording the value every day, idempotence is preserved, and pipelines avoid introducing errors during backfills or restatements.
Three Approaches to Modeling Dimensions
- Latest Snapshot. This method stores only the most recent value for a dimension. For instance, if I’m 29 today, the latest snapshot simply records “age = 29.” The problem arises when backfilling. If we backfill data from 2012, it will pull the current value (29) instead of the historical value (18). This could lead to inaccuracies, like associating my teenage posts with my current age, potentially creating serious misinterpretations.
- Daily Snapshot. Max’s preferred method is the daily snapshot, where each day’s value is explicitly recorded. In this case, backfilling data from 2012 would correctly associate my age as 18 at the time. This approach avoids errors during backfill or restatements because each snapshot is tied to a specific date. The trade-off is increased storage, as every dimension value is duplicated daily.
- Slowly Changing Dimensions. SCDs aim to compress data by collapsing daily snapshots into ranges. For example, instead of 365 rows indicating “age = 18” for a year, an SCD would store one row: “age = 18, valid from January 1 to December 31.” This method works best for dimensions that change infrequently. The slower the changes, the greater the storage efficiency.
For rapidly changing dimensions, SCDs offer limited compression. If a dimension changes weekly, you’re only collapsing a few rows, making it less advantageous. At that point, daily snapshots might be more practical.
Zach’s Perspective
Max and I differ here. While I respect his argument, I see value in SCDs for dimensions that change slowly. The compression they provide can be significant, especially when changes occur rarely. For dimensions with infrequent updates, SCDs reduce storage needs while maintaining accuracy.
When deciding how to model a dimension, consider how often it changes. If it changes rarely, SCDs can be effective. If it changes frequently, daily snapshots might be the better option. Balancing these considerations ensures both efficient storage and reliable data.
Why do dimensions change?
Dimensions change for all sorts of reasons. Preferences change, countries change, phones change—there are many examples. For instance, when I worked at Netflix, almost everyone had a sticker on their laptop. It was either a Catflix sticker or a Dogflix sticker. Back then, I was on Team Catflix, but over time I changed. Now, looking back, I’m emphatically Team Dogflix. That’s just another example of a slowly changing dimension for me. Preferences like that can flow back and forth; who knows, maybe in the future I’ll be a cat person again.
How Can You Model Changing Dimensions?
There are three main ways to model dimensions that change:
- Latest Snapshot
- This method uses only the latest value for the dimension. However, if you backfill with the latest snapshot, all old dimensional values will pull in the latest value, which may not be correct for historical data. For example, if you backfill with only the latest snapshot, you’ll overwrite past data with the most recent value, losing the historical context.
- Never do this. You’ll make Zach really sad if you backfill using only the latest snapshot. It’s a major data engineering mistake.
- Daily Partition Snapshots
- This method stores a value for each day. It’s straightforward and easy to understand: every day has its own value for the dimension.
- Slowly Changing Dimension Modeling
- This method has three subcategories: Type 1, Type 2, Type 3.
The types of Slowly Changing Dimensions
Let’s go over the types of Slowly Changing Dimensions (SCDs) as described.
SCD Type 0
There’s also a Type 0 SCD, technically speaking. These aren’t actually slowly changing dimensions because they don’t change at all. If you’re sure a dimension won’t change—if it’s fixed in stone—you can model it as Type 0. This is fine and won’t bring “the wrath of Zach.” With Type 0, the table simply has an identifier and the dimension value, with no temporal component since the value doesn’t change.
SCD Type 1
Type 1 is the one I hate the most. In this approach, the value changes, but only the latest value is kept. Historical values are overwritten, and no record of changes is maintained. Don’t use this.
Type 1 might be acceptable in online transactional processing (OLTP) systems, where apps only care about the current value. However, for analytics—which is the focus here—Type 1 makes your pipeline non-idempotent and creates many issues. Avoid it, or you’ll end up dealing with a lot of pain and suffering.
SCD Type 2
Type 2 is my favorite and what Airbnb calls the “gold standard” of slowly changing dimensions. Instead of overwriting values, Type 2 tracks changes with a start date and an end date. For example, if my favorite food was lasagna from 2000 to 2010, there would be one record for that period. If it changed to curry from 2010 to 2015, there would be a second record for that period.
Type 2 ensures idempotence because you can always retrieve the correct value based on the date. For the current value, the end date might be set to something far into the future, like 9999-12-31, or left as NULL. Some implementations also include an is_current boolean column to indicate whether a record is active.
This method preserves all history and maintains clarity, making it ideal for analytics.
SCD Type 3
Type 3 only retains two values: the original value and the current value. For example, you might have columns for the original value, the current value, the date of the original change, and the date of the current change. This keeps one row per dimension.
While this simplifies querying (no need to filter by date ranges), it has major limitations. If the dimension changes more than once, you lose the intermediate history. This makes it non-idempotent, as backfills can produce incorrect results. Type 3 sacrifices clarity and historical accuracy, so it’s not recommended for most cases.
Which types are idempotent?
Let’s recap the key points about Slowly Changing Dimensions (SCDs).
Type 0 and Type 2 are idempotent:
- Type 0 is only idempotent if the dimension is truly unchanging. If the value can never change, then this method works without issues.
- Type 2 is idempotent but requires careful handling of the start and end dates to ensure accuracy.
Type 1 and Type 3 are problematic:
- Type 1 is not idempotent because backfills will incorrectly pull the latest values into old data, overwriting historical information.
- Type 3 is somewhat in the middle. It doesn’t fully capture historical changes and won’t give you the clarity or reliability needed in most scenarios.
The main takeaway here is that you should focus on Type 0 and Type 2 for modeling slowly changing dimensions. These are the most reliable and widely applicable options in data engineering.
While there are other types—such as Type 4, Type 5, and Type 6—they are rarely used in practice and aren’t as relevant for most data engineering contexts. It’s better to hone your skills in Type 0 and Type 2 as they cover the vast majority of use cases.
SCD2 Loading
We’re shifting gears now to focus specifically on Slowly Changing Dimension (SCD) Type 2 and the two main ways to load these tables.
- Full Historical Load
This method processes all the historical data in one giant query. It takes all daily data and crunches it into a single output, recalculating everything from the start of the dataset. - Incremental Load
In this method, the data is processed cumulatively. You start with data from the previous day, add new records for the current day, and update the table incrementally. This method processes only one new day at a time.
Which Method Should You Use? For production, it’s generally better to use the incremental method. This avoids reprocessing the entire history every time the pipeline runs, saving computational resources. However, if the dataset is small, there may not be a noticeable difference between the two methods, as processing the full history daily might still be efficient enough.
A Real-World Example: Airbnb Unit Economics At Airbnb, I worked on a pipeline for unit economics, which used an SCD table. For example, imagine a customer pays a fee, and that amount is recorded as profit. Later, they receive a refund, which changes the value of the transaction. Using start and end dates, we could track how the value of line items changed over time.
Although I dreamed of making the pipeline load incrementally, we continued processing the entire transaction history every day, going back to 2016. It felt inefficient, but it worked, and the priority was to deliver value to the business rather than perfecting that particular pipeline.
Key Lesson for Data Engineers Not every pipeline has to be a “Ferrari”—perfectly efficient and optimized. Spending time making a pipeline incrementally load when the historical method already works may not always be the best use of resources. Instead, focus on what adds the most value to the business.
In the Airbnb example, I prioritized projects like pricing and availability, which were more impactful than optimizing the incremental load for unit economics. As you progress in your career, remember to weigh the opportunity cost of making pipelines more efficient against other, higher-value work. Sometimes, “good enough” is the right choice.