Fact Data Overview

Fact data represents the largest type of data in data engineering. During my time at Netflix, I worked with fact data streams that processed 2 petabytes per day.

Fact data is massive because it includes every event generated by users. For example, Facebook has 2 billion users, each performing 1,000 to 2,000 events daily, resulting in trillions of records every day. This scale means that careful modeling is essential to avoid skyrocketing cloud costs.

Course Overview: Managing Fact Data

This 5-hour course covers fact data modeling, focusing on small- and large-scale data management.

On Day 1, we’ll cover fact data modeling fundamentals, including defining facts, modeling them, and ensuring fact-to-dimension joins work efficiently. The lecture focuses on Kimball-style data modeling, followed by a lab where we’ll build a fact data model using the NBA game details table in Postgres.

On Day 2, we’ll explore the blurry line between facts and dimensions. When aggregated, facts can behave like dimensions, making it tricky to distinguish between the two. We’ll also cover the Dateless Data Structure, a technique used at Facebook that compresses user activity into a single integer for efficient monthly active user (MAU) calculations. The lab will involve building a dateless data structure using Postgres and performing bit-level operations.

On Day 3, we’ll dive into minimizing shuffle with reduced facts. Shuffle operations can slow down analytics in Spark and Trino, so we’ll explore Reduced Facts, a way to compress fact data while keeping essential details. At Facebook, I built the Long-Term Analysis Framework, which supported decade-long analyses, reducing pipeline run-times from weeks to hours. In the lab, we’ll build a reduced fact framework to optimize analytics by eliminating shuffle.

What is a Fact?

A fact is typically understood as something that happened, an event, or an action that took place. For example:

  • A user logs into an app
  • You Venmo someone $30
  • You run a mile on your Fitbit

However, not all examples are straightforward. Consider running a mile: Is it a fact? Technically, a mile is an aggregation of smaller events — individual steps. Each step is the smallest atomic unit that cannot be further divided, making it the true fact in this context. This highlights a critical point: Facts should be atomic — something that cannot be broken down further.

Facts Across Granularities

Facts can exist at various granularity levels:

  • Atomic Level: Individual steps while running.
  • Aggregated Level: Total distance (e.g., a mile).

This ability to aggregate up or break down facts is a key feature when modeling data.

Facts Are Immutable

One of the best things about facts is that they don’t change. If you took a step at a specific time, that fact cannot be altered. You can’t undo logging into Facebook 30 times today (even if it makes you reconsider your phone habits). Facts represent immutable events from the past.

Comparison with Dimensions

Unlike dimensions, which can change over time (e.g., a user’s status or subscription level), facts are fixed. This makes them easier to handle in some ways. However, facts also come with unique challenges due to their potential volume and frequency, which we will explore in upcoming sections.

What Makes Fact Modeling Hard?

Fact modeling is challenging due to the sheer volume of data involved. Consider the number of steps a person takes daily—around 10,000 steps. Instead of having one row per person, you end up with 10,000 rows per day per person. This highlights why fact data scales exponentially compared to dimensional data.

Data Volume Explosion

  • Fact data often generates 10x to 100x more rows than dimensions.
  • Example: At Facebook, 2 billion users receiving 25-30 notifications daily leads to 50+ billion notifications logged each day.

Context Dependency

  • Isolated facts are often meaningless.
    Example: Knowing a notification was sent isn’t useful unless you also know if the user clicked it or made a purchase afterward.
  • Conversion Funnels: Tracking sequences like sent → clicked → bought is key to business models for companies like Google and Facebook.
  • Data Integration: Adding dimensions (e.g., user location) can reveal deeper insights, such as users in Canada having a better click-through rate than those in the US.

Duplicate Records

  • Common Issue: Duplicates are far more common in fact data due to various reasons:
    • Software Bugs: Incorrect logging can cause duplicate records.
    • Genuine Duplicates: A user may click the same notification multiple times at different intervals.
  • Metric Distortion: Duplicate records can inflate metrics like click-through rates, leading to misleading analytics. For example if an user clicks on a notification and clicks on the same notification again after 1 hours, we’ll have a click rate of 200%. This makes deduplication a crucial and challenging task when working with fact data.

How Does Fact Modeling Work?

Fact modeling works by organizing data into normalized or denormalized formats, each with advantages and challenges. Understanding these structures is crucial for efficient data processing.

1. Normalized Facts

  • Definition: They don’t have any dimensional attributes. They store facts with minimal duplication by using unique IDs that reference related data in other tables.
  • Example: A fact record might say: User ID 17 logged in at this time. To find more about the user, you’d join this record with a separate user dimension table storing details like: User ID 17 → Zach, 29-year-old male, lives in California.
  • Advantage: Saves storage by avoiding redundant data.
  • Disadvantage: Requires joins, which can be computationally expensive and slow for large datasets.

2. Denormalized Facts

  • Definition: Embed related dimensional data directly into the fact table, eliminating the need for joins.
  • Example: Instead of using a user ID, the fact record might say: Zach, 29-year-old male, lives in California, logged in at this time.
  • Advantage: Speeds up queries by avoiding joins.
  • Disadvantage: Data duplication becomes significant if multiple facts reference the same user or dimension. Example: If Zach logs in 50 times a day, his details are duplicated 50 times, consuming more storage.

Both approaches have trade-offs, balancing storage efficiency and query speed. In the upcoming lab, the focus will be on denormalized facts and how they can cause issues, followed by exploring challenges related to normalized facts in later sections.

The key thing to remember is that normalization works better at a smaller scale. With normalized facts, duplicates are removed, data integrity is improved, and there’s no need to worry about restating data.

This makes normalization a strong option, especially when working with smaller datasets. While both normalized and denormalized models have their trade-offs, normalization can provide significant advantages in maintaining clean, consistent, and efficient data storage at smaller scales.

A common misconception is that raw logs and fact data are the same. While they are closely related, they serve different purposes. Raw logs capture events in their most basic form, while fact data is structured, clean, and optimized for analysis. Here’s the key differences:

  • Ownership: Raw logs are often managed by software engineers focused on keeping online systems running. They may not prioritize data quality or structure. Fact data, however, is owned by data teams who enforce strict data quality standards.
  • Data Quality: Raw logs typically lack quality guarantees, may contain duplicates, and often have messy schemas. Fact data undergoes cleaning, deduplication, and formatting processes, ensuring higher reliability.
  • Retention and Usability:
    • Raw Logs: Short-term retention, messy column names, and lower trust.
    • Fact Data: Longer retention, well-defined schemas, and stronger data trust.

As a data engineer, collaborating with software engineers to ensure proper logging formats can simplify fact data modeling. By transforming raw logs into high-quality fact data, you create a trusted data foundation critical for effective data-driven decisions.

Fact modeling involves defining specific aspects of events in a structured way. These aspects include Who, Where, When, What, and How. Let’s explore each in detail:

1. Who

  • The Who field identifies the participants in the event, usually represented by IDs.
  • Examples:
    • User IDs: Identifying specific users in the system.
    • Device IDs: Tracking devices like phones or cars.
    • System Identifiers: Such as operating system versions.
  • For instance, in a Tesla logging system, the Who could be the user ID, car ID, or the operating system ID, helping to pinpoint who was involved.

2. Where

  • Where indicates the event’s location, which can be both physical and virtual.
  • Physical Location: Countries, cities, or states where events take place.
  • Virtual Location: Where within an app the action occurred, such as the homepage, profile page, or settings screen.
  • Example:
    • If a user clicks a button on a webpage, the Where might be “Profile Page” or “Homepage.”
  • In some models, “Where” is represented with IDs; in others, it’s described directly with labels or page URLs.

3. What

  • What describes the action or event that took place.
  • Examples include:
    • Sending notifications
    • Making purchases
    • Taking steps logged by a fitness app
  • Events should be atomic—representing the smallest meaningful action.
    • Example: “Taking a step” is atomic, while “Running a marathon” is not, as it consists of many steps.

Important Note: Aggregations of events can represent larger activities. For instance, “Burning Man” as an event involves multiple actions like event registration, logistics management, and attendance, each of which could be tracked separately as individual facts.

4. How

  • How captures the method or means by which an event occurred.
  • Examples:
    • Device Used: An iPhone, Android phone, or a specific Tesla model.
    • Action Method: Clicking a button, making a voice command, or using a specific feature in the app.

Complex Scenarios: Sometimes How and Who can overlap. For example:

  • A Tesla’s car ID could be seen as both Who (identifying the car owner) and How (indicating the vehicle used for the action). These fields can become tricky to distinguish.

5. When

  • When defines the precise moment the event happened, usually logged as a timestamp or date.
  • Best Practices:
    • Use UTC Time Zone for all event logs to avoid timezone-related problems.
    • Client-side logging often defaults to the user’s local time zone, which can cause inconsistencies when data is sent to the server. Logging in UTC prevents these issues.

Example Scenario: If a user in New York clicks a notification at 8 PM, but the system logs events using the client’s timezone, it might create confusion when aggregating data across multiple time zones. UTC timestamps ensure uniformity.

Fact Data Quality Guarantees

Fact data sets should have clear quality guarantees to ensure accuracy and usability. Some essential guarantees include:

  • No Duplicates: Fact data should be deduplicated so that when analysts work with it, the data is already flattened and easy to understand.
  • Mandatory Fields: Certain fields should never be null, including:
    • What Field: The event that happened.
    • When Field: The timestamp indicating when the event occurred.
    • Who Field: While not as critical as “what” and “when,” the “who” field should ideally also be not null to ensure meaningful analysis.

Fact Data vs. Raw Logs

  • Fact Data is Smaller: Fact data should be smaller than raw logs. Raw logs often include extra data used for system diagnostics or troubleshooting that isn’t relevant for business analysis.
  • Raw Logs Have Extra Details: Software engineers may log additional information needed to monitor and maintain servers. This data isn’t necessary for understanding business performance and should be excluded from fact data models.

Simplify Complex Columns

Fact data should minimize hard-to-understand columns. Software engineers sometimes pass columns as JSON strings or other complex formats, making them difficult to query and analyze. Avoiding these complex types is essential to create clean and easy-to-use fact data. Here’s some best practices:

  • Use simple data types like:
    • Strings
    • Integers
    • Decimals
    • Enumerations
  • Limit Complex Data Types: While complex data types should be rare, they can be valuable in specific use cases. For example:
    • At Facebook, they used an array of strings to track which experiment groups a notification belonged to. This helped slice and dice conversion rates effectively.

Data Engineer’s Responsibility

Your job as a data engineer is to simplify data and make it delightful for analysts to use. Avoid leaving messy JSON blobs or passing the burden of data parsing to analysts. If fact data is well-modeled, analysts should never have to deal with raw logs or complex data types directly.

If you’re not ensuring data usability, you’re not fulfilling your role as a data engineer. Consider these points carefully when designing fact data models to create an efficient, user-friendly system.

When Should You Model in Dimensions?

Case Study: Network Logs Pipeline at Netflix

One of the largest data pipelines I ever worked on was the Network Logs Pipeline at Netflix. This pipeline processed two petabytes of data every day, making it the most extensive pipeline in my career.

The source data came from every network request Netflix received. The primary goal of this pipeline was to understand how Netflix’s microservices communicated with one another.

Why Was This Important?

Netflix is famous for pioneering the microservice architecture, a powerful way to scale development by splitting responsibilities into smaller, manageable services. However, this approach creates security challenges because instead of securing one large application, you now have to secure thousands of microservices. If one service gets hacked, it can potentially compromise every other service it communicates with.

The only way to track these communication patterns was by analyzing the network traffic generated by every microservice interaction. This was a massive data engineering challenge, and the only way we could solve it was by joining the network traffic data with a small IP address-to-service mapping table.

Broadcast Join: Why It Worked (At First)

We used a broadcast join in Spark, an efficient join method where one side of the join is small enough (under 5-6 GB). This small side of the join contained the IP address mappings for Netflix’s microservices.

The join allowed us to match every incoming network request’s IP address with its corresponding app or microservice, enabling us to track which services were talking to which. After completing the join, we could aggregate the results to create a communication map.

The IPv6 Challenge: Scaling Problems

Later, Netflix decided to upgrade from IPv4 to IPv6, significantly increasing the number of IP addresses we needed to track. This made the IP address mapping dataset too large for a broadcast join, forcing us to try a shuffle join instead.

The result? Costs ballooned 10x. The pipeline couldn’t run efficiently anymore.

Fact Data Modeling Solution

We faced a critical fact modeling challenge: How do we continue answering essential security questions without blowing up costs?

Our initial solution worked because of a denormalized fact model using a small IP address table. However, the scaling challenge showed that denormalization wouldn’t work anymore because of the increasing data volume.

Takeaway: When to Use Dimensions in Fact Modeling

The Netflix example highlights when to model in dimensions:

  • If the dimension table is small (under 5-6 GB), consider using a broadcast join. This approach can be efficient and cost-effective.
  • If the dimension table is large, rethink the data model. Consider alternative joins, data aggregation, or even restructuring the entire pipeline to avoid skyrocketing costs.

By modeling dimensions carefully, you can keep pipelines efficient, reduce costs, and scale your data systems even when faced with massive data growth.

The Solution: Denormalization Through Logging

To solve the challenge of joining IP addresses with app names, we removed the join entirely. Instead, we required each app to log its own app name whenever it received a network request. This allowed the app name to be logged directly in the network request data, eliminating the need for a costly join.

The Sidecar Proxy Approach

We introduced a system called a Sidecar Proxy, a lightweight service running alongside each app. This proxy would automatically log the app name whenever a request came in, embedding the app’s identity directly in the logs.

By doing this, we effectively denormalized the network request data, shifting the app information from an external lookup table into the fact data itself. This approach avoided the massive shuffle join and made the pipeline scalable again.

The Cost of Implementation

While this technically solved the problem, implementing it wasn’t easy. We needed to have conversations with 3,000 app owners at Netflix, convincing them to adopt the Sidecar Proxy in their services. Every new app would also need to integrate this system from the start.

Engineering Trade-Off: Pipeline vs. Conversations

From an engineering perspective, which sounds easier?

  • Owning a 100 terabyte-per-hour pipeline that processes enormous data volumes using a shuffle join?
  • Or having 3,000 conversations with app owners to change how their services log network requests?

Most data engineers would prefer the pipeline, as it seems like a purely technical challenge. But in reality, solving problems upstream by influencing how data is logged often has a much larger impact than just building a complex data pipeline.

The Lesson: Solving Problems at the Source

This experience taught me that data engineering isn’t just about writing pipelines or optimizing queries. It’s about collaborating with teams, advocating for better data practices, and solving problems at the source before they even reach the pipeline stage.

Takeaway: Denormalization as a Solution

In this case, denormalization wasn’t the problem—it was the solution.

By embedding app names directly in the network request logs, we bypassed a complex join, enabling massive scalability while reducing costs. Sometimes, denormalization is the best answer to large-scale data engineering problems, even when it seems counterintuitive.

How Does Logging Fit into Fact Data?

Logging Should Cover Essential Columns

Logging should capture all necessary columns except for dimensional columns, which should be referenced using IDs. Analysts can later join the relevant dimensions using those IDs. This ensures efficiency and keeps the logged data clean.

Collaboration with Online System Engineers

Fact data relies heavily on collaboration with online system engineers, as they understand event generation better. They know when and how events are triggered in the app, making them critical partners in designing proper logging schemas.

Log Only What You Need

Be cautious about logging only essential data. Logging everything “just in case” can lead to massive cloud costs. This anti-pattern can easily inflate storage and increase expenses. Data engineers should focus on efficiency and relevance when deciding what to log.

Data Conformance and Schema Contracts

When logging data, schema contracts ensure data consistency across different teams and services. Without this, systems can drift and produce junk data over time.

A Real-World Example: Airbnb’s Cross-Language Schema

At Airbnb, the app was built using Ruby, while the backend data pipelines were built in Scala. Since these languages are incompatible, a shared schema layer was introduced using Thrift. Here’s the benefits of Thrift:

  • Language-Agnostic: Both Ruby and Scala could reference the same schema.
  • Automatic Updates: If the Ruby team added a new column for price calculations, the Scala-based pipeline would detect the change.
  • Integration Tests: If a schema change broke the data pipeline, the tests would fail, forcing both teams to collaborate and adjust.

Why Schema Contracts Matter

This shared schema approach ensured that changes in the app were synchronized with the data pipeline, preventing data drift. Without this system, fact data would become inconsistent and unreliable. A shared schema creates a common language between teams and keeps data pipelines trustworthy and robust.

Potential options when working with high volume fact data

1. Filter and Sample the Data

Sometimes the best solution for working with high-volume fact data is to reduce the data size through filtering or sampling. This can drastically lower computational and storage costs while still providing meaningful insights.

Example: Netflix’s Infrastructure Impact Metric

  • Use Case: Measuring whether an A/B test caused a higher AWS or S3 bill.
  • Solution: They sampled 1% of network requests, which provided enough accuracy for cost comparisons.
  • Why It Worked: Infrastructure metrics don’t require full data coverage, only directional metrics.

When Sampling Fails

  • Security Applications: Full data coverage is necessary due to needle-in-the-haystack problems.
  • Rare Events: When detecting low-probability events, sampling could miss critical cases.

Law of Large Numbers

  • As you gather more data, your sample’s statistics will approach a normal distribution.
  • Minimum Sample Size: Even 30 data points can be enough for directional metrics.
  • Best Use Cases: Sampling works well for metrics indicating directionality, not when you need exact row-level details.

2. Bucketing the Data

Bucketing involves dividing data into manageable chunks, often based on user IDs or other identifiers, making joins and queries more efficient.

Why Use Bucketing?

  • Minimizes Shuffle: Joins only happen within buckets, reducing the need to shuffle data across the entire dataset.
  • Efficient Processing: Queries process only relevant buckets instead of the entire dataset.

Bucket Joins

  • Standard Bucket Joins: Require data to be partitioned and bucketed by a shared key.
  • Sorted Merge Bucket (SMB) Joins:
    • Both datasets must be sorted and bucketed on the same key.
    • The join works like a zipper, aligning rows without any shuffle.

Real-World Use Case: Facebook

  • Why It Worked: Facebook used SMB Joins to avoid shuffling and handle massive data volumes efficiently.
  • Why It’s Less Common Today: Spark’s improved shuffling has reduced the reliance on SMB Joins, but they are still valuable in specific use cases, especially for large-scale data.

Key Takeaways

  • Sampling: Great for metrics and monitoring but fails for rare-event detection.
  • Bucketing: Best for reducing shuffle costs and enabling fast joins.
  • Sorted Merge Bucket Joins: Highly efficient when dealing with large-scale, sorted datasets.

These techniques help manage large fact datasets, making analysis scalable and cost-effective.

How long should you hold onto fact data

Retention Challenges with High-Volume Fact Data

Retention is a critical consideration when managing high-volume fact data. Unlike dimensional data, which can often be stored indefinitely if legally permissible, fact data’s sheer size makes indefinite storage impractical. Managing retention effectively means balancing storage costs, processing demands, and analytical value.

In big tech companies, tables exceeding 100 terabytes are often referred to as “whale tables.” These tables usually have very short retention periods, typically one to two weeks, due to their substantial storage and processing costs. Despite representing only 2% to 3% of all data tables, these whale tables significantly impact storage budgets and infrastructure requirements.

For tables smaller than 10 terabytes, retention policies tend to be more flexible. At this scale, concerns often shift from storage efficiency to legal compliance, such as ensuring data is anonymized to minimize legal risks. Even when a table’s storage costs are manageable, companies must still comply with privacy regulations and ensure that sensitive data is properly protected.

Custom Retention Thresholds and Balancing Needs

Retention thresholds vary depending on a company’s size, storage capacity, and budget. In smaller organizations, the definition of a “large” table might be closer to 1 terabyte or even 100 gigabytes. Each company must define its own limits based on its technical infrastructure and financial constraints.

Data engineers must continuously evaluate the return on investment (ROI) of retaining more data. Storage costs can grow rapidly, while the analytical value of older fact data often diminishes over time. While data scientists may request longer retention “just in case” they need historical data for future analysis, such scenarios rarely materialize. It’s crucial for data engineers to push back against indefinite storage requests and ensure that data retention policies remain cost-effective and focused on real business needs.

Ultimately, managing fact data retention requires balancing storage expenses, legal obligations, and data utility. Limiting retention for large tables, ensuring compliance for smaller datasets, and challenging unnecessary storage requests can help control costs while preserving essential data for meaningful analysis.

Deduplication of fact data

Deduplicating fact data is inherently challenging due to the complexity of distinguishing meaningful records from irrelevant duplicates. Fact data can contain both actual duplicates caused by system or logging errors and legitimate records that might appear as duplicates depending on context. For example, consider Facebook notifications: if a user clicks on a notification today and then clicks on it again a year later, should both clicks be treated as duplicates? In most cases, the answer is no. The first click might be relevant to current engagement, while the second represents a different interaction altogether.

This raises an important consideration—time-based deduplication windows. Defining a meaningful time frame for deduplication can help focus on relevant records while ignoring duplicates that occur outside the relevant window. This decision should be driven by data analysis, examining the frequency and distribution of duplicates to identify the appropriate cutoff for deduplication.

Case Study: Facebook Notification Deduplication

While working on deduplication for Facebook’s notification event dataset, we encountered significant processing delays. The deduplication process was initially implemented as a massive Hive query that took over nine hours to run. This pipeline generated master data that many other data sets depended on, creating a substantial bottleneck. Downstream pipelines could only begin processing after this lengthy deduplication process was complete.

To reduce latency, we explored two primary solutions:

  1. Microbatch Processing: Running deduplication jobs on an hourly basis allowed us to reduce the time between data collection and availability. While still batch-oriented, this approach offered a balance between latency and processing efficiency.
  2. Streaming Processing: Implementing real-time deduplication through streaming minimized delays even further by processing events as they arrived. This approach is more complex but offers near-instant updates and reduced end-to-end latency.

Ultimately, selecting the right deduplication strategy depends on the volume of data, latency requirements, and available processing infrastructure. Balancing these factors can transform deduplication from a time-consuming bottleneck into an efficient, scalable data processing pipeline.

Streaming to deduplicate facts

Streaming offers an efficient way to capture duplicates within a defined time window. By tracking events such as notification IDs, you can detect and eliminate duplicates that occur within a short time after the initial event. For example, if you monitor incoming notifications over a 15, 20, or 30-minute window, you can catch most duplicates since a large percentage of them tend to happen within that time frame. This makes streaming particularly appealing for real-time data pipelines, as it can handle duplicates dynamically with minimal delay.

Challenges in Long Retention Deduplication

However, streaming isn’t always the right solution, especially when dealing with high-volume data or extended deduplication windows. When I worked on Facebook’s notification system, the requirements were far more demanding. We needed to track every notification ID for an entire day, not just for a brief time window. This meant holding billions of records in memory throughout the day, which quickly became unsustainable due to the enormous memory overhead.

I spent six weeks trying to make streaming work for this scenario, exploring different architectures and optimizations, but ultimately, the sheer data volume—around 50 billion records per day—was too much for a streaming-based approach to handle effectively. It became clear that streaming wasn’t the right tool for this specific job.

When Streaming Works

Despite this limitation, streaming remains a highly viable solution for many deduplication scenarios, especially when:

  • Data Volumes Are Moderate: If your data volume is smaller than the 50 billion records we dealt with, streaming could work exceptionally well.
  • Short Deduplication Windows: If you only need to deduplicate within a short time frame—say a few minutes or even a few hours—streaming is likely the simplest and most effective approach.
  • Real-Time Requirements: When real-time processing is a must, streaming offers unparalleled low-latency deduplication.

In conclusion, while streaming wasn’t suitable for my high-scale use case, it could be the perfect solution for more moderate data volumes or shorter deduplication windows. Don’t let one extreme example deter you from considering streaming as a powerful tool for deduplicating fact data.

Hourly Microbatch Dedupe

To deduplicate Facebook’s notification data, I implemented an hourly microbatch deduplication (DDop) process. This approach reduced a nine-hour deduplication pipeline to a process that completed one hour after midnight, significantly improving availability. Here’s how the process worked, broken down step by step:

Step 1: Hourly Data Aggregation

The first step involved collecting all data for one-hour intervals and performing a group by operation.

For example, for each hour from hour 0 to hour 23, the data was grouped by relevant fields like product ID, event type, and other necessary keys. This aggregation eliminated duplicates within each hour. After this step, we had a deduplicated data set within each hour.

Step 2: Full Outer Join Across Hours

The next step was performing full outer joins between consecutive hour batches:

  • Join hour 0 with hour 1, hour 2 with hour 3, hour 4 with hour 5, and so on.

This cross-hour join removed duplicates across two-hour intervals. If the same event occurred in hour 0 and hour 1, the join would deduplicate that event.

Step 3: Tree-like Merging

The process continued in a tree-like structure, merging progressively larger time intervals:

  • After merging hours 0-1 and hours 2-3, the results were joined into a four-hour window (hours 0-3).
  • This pattern repeated across all intervals until the entire 24-hour window was processed.

Final Output: Daily Deduplicated Data Set

By the end, the process merged everything into a daily deduplicated data set. The entire system worked incrementally and efficiently, making the pipeline both resilient and scalable.

Here’s a diagram of the entire workflow:

Why This Method Works

  • Batch-Oriented but Low Latency: Despite being a batch process, the pipeline achieved low-latency deduplication, thanks to hourly processing.
  • Arbitrary Scalability: The tree-like merging structure made the solution scale horizontally. Each step could process data independently and merge incrementally.

This approach proved highly effective and can be adapted to handle deduplication at arbitrary scales, making it a recommended best practice for large-scale fact data processing systems.