Dimensional Data Modeling Day Two Lab

Today’s lab builds on the Day One lab, where we created a cumulative table that tracks NBA players’ seasons and their stats throughout their careers. This time, we’ll take those datasets and convert them into Slowly Changing Dimension (SCD) Type 2 tables.

Why SCD Type 2?

Remember, SCD Type 2 is considered the gold standard for slowly changing dimensions because:

  • It’s the only purely idempotent SCD type.
  • It retains complete historical records while tracking changes over time.

This lab focuses on applying SCD Type 2 modeling using PostgreSQL.

LAB

Let’s drop and create again the players table with a new boolean column is_active. Then we’ll populated it with this query:

CREATE TABLE players (  
	player_name TEXT,  
	height TEXT,  
	college TEXT,  
	country TEXT,  
	draft_year TEXT,  
	draft_round TEXT,  
	draft_number TEXT,  
	season_stats season_stats[],  
	scoring_class scoring_class,  
	years_since_last_season INTEGER,  
	current_season INTEGER,  
	is_active BOOLEAN,  
	PRIMARY KEY (player_name, current_season)  
)

Then, we’ll use the following query that creates the same table that we had from 02. Data Modeling - Cumulative Dimensions, Struct and Array - Day 1 Lab:

INSERT INTO players  
WITH years AS (  
    SELECT *  
    FROM GENERATE_SERIES(1996, 2022) AS season  
),  
p AS (  
    SELECT  
        player_name,  
        MIN(season) AS first_season  
    FROM player_seasons  
    GROUP BY player_name  
),  
players_and_seasons AS (  
    SELECT *  
    FROM p  
    JOIN years y  
        ON p.first_season <= y.season  
),  
windowed AS (  
    SELECT  
        pas.player_name,  
        pas.season,  
        ARRAY_REMOVE(  
            ARRAY_AGG(  
                CASE  
                    WHEN ps.season IS NOT NULL  
                        THEN ROW(  
                            ps.season,  
                            ps.gp,  
                            ps.pts,  
                            ps.reb,  
                            ps.ast  
                        )::season_stats  
                END)  
            OVER (PARTITION BY pas.player_name ORDER BY COALESCE(pas.season, ps.season)),  
            NULL  
        ) AS seasons  
    FROM players_and_seasons pas  
    LEFT JOIN player_seasons ps  
        ON pas.player_name = ps.player_name  
        AND pas.season = ps.season  
    ORDER BY pas.player_name, pas.season  
), static AS (  
    SELECT  
        player_name,  
        MAX(height) AS height,  
        MAX(college) AS college,  
        MAX(country) AS country,  
        MAX(draft_year) AS draft_year,  
        MAX(draft_round) AS draft_round,  
        MAX(draft_number) AS draft_number  
    FROM player_seasons  
    GROUP BY player_name  
)  
SELECT  
    w.player_name,  
    s.height,  
    s.college,  
    s.country,  
    s.draft_year,  
    s.draft_round,  
    s.draft_number,  
    seasons AS season_stats,  
    CASE  
        WHEN (seasons[CARDINALITY(seasons)]::season_stats).pts > 20 THEN 'star'  
        WHEN (seasons[CARDINALITY(seasons)]::season_stats).pts > 15 THEN 'good'  
        WHEN (seasons[CARDINALITY(seasons)]::season_stats).pts > 10 THEN 'average'  
        ELSE 'bad'  
    END::scoring_class AS scoring_class,  
    w.season - (seasons[CARDINALITY(seasons)]::season_stats).season as years_since_last_active,  
    w.season,  
    (seasons[CARDINALITY(seasons)]::season_stats).season = season AS is_active  
FROM windowed w  
JOIN static s  
    ON w.player_name = s.player_name;

We essentially want to track changes int two columns (because with SCD tables you can track multiple columns at once). To do that let’s create a new table called players_scd:

CREATE TABLE players_scd (  
	player_name TEXT,  
	scoring_class scoring_class,  
	is_active BOOLEAN,  
	start_season INTEGER,  
	end_season INTEGER,  
	current_season INTEGER,  
	PRIMARY KEY(player_name, start_season)  
)
  • scoring_class and is_active are the columns we want to track;
  • since we want to build a type 2 SCD, we need start_season and end_season.

This is essentially a properly modelled SCD table.

We’re going to do the following things:

  • write the query that looks at all of history and creates one SCD record from all of history;
  • take and existing SCD table and then build on top of it incrementally. (because you can do it either way. TODO: che significa? either way a cosa si riferisce. Di seguito il riassunto di chatgpt

Now that the table has been populated with historical data, we can handle updates incrementally as new seasons are added:

  1. Compare New Data with Existing Records: Check for changes in scoring_class and is_active.*
  2. Extend or Create New Records: If the attributes are unchanged, update the end_season of the last record. Otherwise, create a new record. )

Query

Now, we want to calculate the streak of how many seasons a player has been in a particular state, such as scoring class or active status. To do this, we need to look at the previous season’s data to compare changes over time.

Using Window Functions to Analyze Changes

We can use window functions, specifically LAG, to track previous values for scoring_class and is_active

SELECT
	player_name,
	current_season,
	scoring_class,
	is_active,
	LAG(scoring_class, 1) OVER (PARTITION BY player_name ORDER BY current_season) as previous_scoring_class,
	LAG(is_active, 1) OVER (PARTITION BY player_name ORDER BY current_season) as previous_is_active
FROM players

This will allow us to compare the current season with the previous season.

Output:

Adding Change Indicators

Next, create a change indicator to show when a player’s scoring class or active status has changed. Use a CASE statement to check for differences between the current and previous values:

SELECT  
	*,  
	CASE  
		WHEN scoring_class <> previous_scoring_class THEN 1  
		WHEN is_active <> previous_is_active THEN 1  
		ELSE 0  
	END as change_indicator  
FROM with_previous

This will create flags to indicate when changes occur.

Output:

If we run this, we can observe changes like:

  • Player: Aaron Brooks
    • 2008: Scoring class changes from “Bad” to “Average.”
    • 2009: Scoring class changes from “Average” to “Good.”

This player improved significantly in back-to-back seasons but later regressed from “Good” to “Average.” These indicators help track such transitions.

Creating a Streak for Changes

Now, we want to calculate a streak for each change. This will assign a sequential number to each change a player makes:

  1. When the player enters the table for the first time, their streak starts at 0.
  2. For each subsequent change, increment the streak.

This can be done by creating a new CTE (Common Table Expression) called with_indicators that incorporates the change indicators and calculates streaks incrementally.

This process helps analyze not just when changes happen but also how frequently players transition between different states, providing deeper insights into their performance over time.

Calculating Streaks

Next, we track streaks for players based on their changes. To do this, we sum up the change_indicator using a window function. Here’s how:

  1. Partition the data by player_name to handle each player’s data independently.
  2. Order by current_season to process changes chronologically.

The query looks like this:

SELECT
	*,
	SUM(change_indicator) OVER (PARTITION BY player_name ORDER BY current_season) as streak_identifier
FROM with_indicators

The streak_identifier will increment each time there’s a change, creating a sequential identifier for each streak.

Output:

Collapsing Rows Using Streak Identifiers

Now that we have the streak identifiers, we can use them to aggregate the data. By performing a MIN and MAX on the season column for each streak identifier, we can collapse all rows with the same streak into a single record. This works because the streak identifier ensures that all grouped rows have the same values for the tracked attributes. This is the full query:

WITH with_previous as (  
	SELECT  
			player_name,  
			current_season,  
			scoring_class,  
			is_active,  
			LAG(scoring_class, 1) OVER (PARTITION BY player_name ORDER BY current_season) as previous_scoring_class,  
			LAG(is_active, 1) OVER (PARTITION BY player_name ORDER BY current_season) as previous_is_active  
	FROM players  
),  
with_indicators as (  
	SELECT  
			*,  
			CASE  
					WHEN scoring_class <> previous_scoring_class THEN 1  
					WHEN is_active <> previous_is_active THEN 1  
					ELSE 0  
			END as change_indicator  
	FROM with_previous  
),  
with_streaks as (  
	SELECT  
			*,  
			SUM(change_indicator) OVER (PARTITION BY player_name ORDER BY current_season) as streak_identifier  
	FROM with_indicators  
)  
SELECT  
	player_name,  
	streak_identifier,  
	is_active,  
	scoring_class,  
	MIN(current_season) as start_season,  
	MAX(current_season) as end_season  
FROM with_streaks  
GROUP BY player_name, streak_identifier, is_active, scoring_class  
ORDER BY player_name, streak_identifier

Output (look at A.C. Green and Aaron Brooks):

Observations

This approach effectively identifies streaks for each player:

  • Consistency: Players like A.C. Green have long periods with no changes, as reflected in their streak identifiers: 0 for the first streak and 1 for the second streak.
  • Frequent Changes: Players like Aaron Brooks are more dynamic. He starts as “bad” and active, transitions to “average” and active, and then “good” and active, changing almost every season. He even has a year of inactivity before becoming consistent for two years. For players like this, the streak identifier highlights frequent transitions across seasons.

The streak identifier increments each time there is a change, allowing us to trace continuous values for different players over time.

This aggregated table serves as the Slowly Changing Dimension (SCD) representation. It tracks:

  1. When changes occurred (start and end seasons).
  2. The state during each streak (e.g., scoring class, active status).

This method allows for a concise, accurate history of changes while preserving the temporal context for analysis.

Filtering for Incremental Build and Setting Current Season

To prepare for the incremental build, let’s filter the data to include only rows where the current season is less than or equal to 2021. This allows us to reserve 2022 for testing the incremental logic. After running the query, the data now ends at 2021, as expected:

with_indicators as (  
SELECT  
	*,  
	CASE
		WHEN scoring_class <> previous_scoring_class THEN 1
		WHEN is_active <> previous_is_active THEN 1  
		ELSE 0  
	END as change_indicator  
FROM with_previous  
WHERE current_season <= 2021  
)

To simulate this as part of a pipeline (e.g., in Airflow), we can hardcode the current_season to 2021. This acts as a parameter you would typically inject during pipeline execution:

SELECT
	player_name,
	scoring_class,
	is_active,
	MIN(current_season) as start_season,
	MAX(current_season) as end_season,
	2021 AS current_season
FROM with_streaks
GROUP BY player_name, streak_identifier, is_active, scoring_class
ORDER BY player_name, streak_identifier

Inserting Data into the SCD Table

Now we finalize the Slowly Changing Dimension (SCD) table:

INSERT INTO players_scd  
WITH with_previous as (
...
...
ORDER BY player_name, streak_identifier

SCD Table Output

The final players_scd table captures:

  • Each player’s streak.
  • The start and end seasons for each streak.
  • Relevant attributes such as scoring class and active status.

This table accurately reflects changes in player states over time.

Flexibility of SCD Modeling

The same logic can apply to different temporal dimensions:

  • Instead of seasons, you could use dates, years, or months as the temporal component.
  • This flexibility makes SCD modeling powerful for tracking historical changes in various datasets.

By leveraging this approach, you can process historical and incremental data efficiently, ensuring a comprehensive and accurate history of changes.

Limitations of the Current Pattern

Before diving into the incremental build, let’s review some limitations of the current approach to better understand its trade-offs.

  1. Expensive Query Operations
    • This query relies heavily on window functions, which are computationally expensive. For example, we perform:
      • Two window functions over the entire dataset.
      • A final aggregation to collapse the data.
    • Data is only reduced at the very end during the GROUP BY, meaning window functions process the full dataset without any intermediate reduction.
  2. Scalability Concerns
    • While the approach works well for moderate-sized datasets, processing the entire history daily can be inefficient, especially as data scales. This method is powerful but computationally heavy.
    • For smaller scales, like at Airbnb, this method worked effectively even when applied to line items and other data. Technologies like Spark efficiently handled these operations. However, at much larger scales, like Facebook, where there are billions of users, this approach may not be viable.
  3. Handling of Outliers
    • In datasets with inconsistent change rates, certain dimensions or users may generate many streaks, increasing the cardinality and slowing the query. For example, a player who changes scoring class every season generates more streaks than others, creating potential skew.

Benefits of the Current Pattern

Despite its limitations, this method has distinct advantages:

  • Simplicity: It regenerates the entire historical view daily, ensuring consistency and allowing for changes in logic to propagate across all records.
  • Comprehensive Data: The resulting SCD table provides a detailed, complete history that captures every change.
  • Applicability to Smaller Scales: For dimensional data, which is typically smaller than fact data, this method often suffices. Dimensional data is manageable in size, even at larger companies, compared to fact data, which can grow exponentially.

Considerations for Scale

  • At Airbnb, processing dimensional data for millions of users was practical with this approach, leveraging tools like Spark to manage historical scans.
  • At Facebook, where user data scales to billions, such queries are less feasible due to the added zeros in dataset size. The approach might lead to out-of-memory errors, skew, and performance bottlenecks.
  • As data scales or includes dimensions with high cardinality or frequent changes, this approach becomes less practical. In such cases, alternative methods like incremental builds are better suited to manage efficiency and scalability. Understanding when to use each method is a critical skill as you work with dimensional data.

Incremental Approach

Let’s move to an incremental approach for building the Slowly Changing Dimension (SCD) table. This method focuses on processing only the new and changed data, rather than reprocessing the entire history.

We’ll structure the query into several components.

Historical Records Are Static. Historical records are complete and unaffected by current changes, so they will never change:

WITH historical_scd AS (  
    SELECT  
        player_name,  
        scoring_class,  
        is_active,  
        start_season,  
        end_season  
    FROM players_scd  
    WHERE current_season = 2021  
    AND end_season < 2021  
),

Last Season’s Records. This component contains records for players who were active in the most recent season (2021). These records may be extended or updated if changes occur in 2022:

last_season_scd AS (  
   SELECT * FROM players_scd  
   WHERE current_season = 2021  
   AND end_season = 2021  
),

Current Season Records. The component brings in data for 2022, identifying any changes that require new records:

this_season_data AS (  
	SELECT * FROM players  
	WHERE current_season = 2022  
),

Let’s see how last_season_scd is just to get a grapple of what’s going on: This query shows the most recent records, specifically from the 2021 season. Each player has exactly one record, representing their current state for that year.

Incremental Update Logic. The incremental update process works by:

  • For records that remain unchanged in 2022, the end_season field will simply be incremented to include 2022.
  • For records with changes in 2022, a new record will be created in the SCD table, capturing the updated state.

The following one is a possible procedure to implement this incremental update logic.

Identifying Unchanged Records

We will approach this by analyzing and processing the records from the last season (last_season_scd) and the current season (this_season_data). The goal is to identify unchanged records and extend them to the new season. Here’s the query:

unchanged_records AS (  
	SELECT  
		ts.player_name,  
		t.scoring_class,  
		ls.scoring_class,  
		ls.is_active,  
		ls.scoring_class,  
		ls.is_active  
	FROM last_season_scd ls  
	JOIN  this_season_data ts  
	ON ls.player_name = ts.player_name  
	WHERE ts.scoring_class = ls.scoring_class  
	AND ts.is_active = ls.is_active  
),
  • Use a LEFT JOIN because new players may appear in the current season but not in the last season.
  • For records that did not change between seasons:
    • scoring_class remains the same.
    • is_active remains the same. Add a WHERE condition to filter these records:
  • The unchanged_records table now contains rows where players’ states remained the same. For these rows:
    • the start_season remains as it was in last_season_scd.
    • the end_season is extended by one (i.e., to the current season).

Identifying Changed Records

Next, we handle the changed records, which are key parts of this process. The challenge with changed records is that we need to create two rows for each change:

  1. One row to “close” the previous record (the old record that has ended).
  2. Another row for the new state (the updated record).

Here’s the approach:

  1. Identify changes. Compare the scoring_class and is_active fields between the last season (last_season_scd) and the current season (this_season_data).
  2. Use an array of structs. For changed records, generate both the old and new rows together.
  3. Handle PostgreSQL-specific requirements. Create a custom type (scd_type) to define the structure of the rows being generated.

Here’s the definition of scd_type:

CREATE TYPE scd_type AS (  
    scoring_class scoring_class,  
    is_active BOOLEAN,  
    start_season INTEGER,  
    end_season INTEGER  
)

Here’s the query:

changed_records AS (  
    SELECT  
        ts.player_name,  
        UNNEST(  
        ARRAY[  
            ROW(  
                ls.scoring_class,  
                ls.is_active,  
                ls.start_season,  
                ls.end_season  
                )::scd_type,  
            ROW(  
                ts.scoring_class,  
                ts.is_active,  
                ts.current_season,  
                ts.current_season  
                )::scd_type  
        ]  
        ) AS records
    FROM last_season_scd ls  
    LEFT JOIN  this_season_data ts  
    ON ls.player_name = ts.player_name  
    WHERE (ts.scoring_class <> ls.scoring_class  
    OR ts.is_active <> ls.is_active)  
),

The query generates an array with two rows for each change:

  • The old record, using fields from last_season_scd.
  • The new record, using fields from this_season_data.

For example: A player with a changed scoring_class might show:

  • Old record: scoring_class = "average", is_active = true, start_season = 2020, end_season = 2021.
  • New record: scoring_class = "good", is_active = true, start_season = 2022, end_season = 2022.

Here’s the result of the query:

After generating the changed_records with its array of rows, the next step is to unnest this array to separate the structured data into individual columns. This process effectively flattens the array into separate columns:

unnested_changed_records AS (  
    SELECT  
        player_name,  
        (records::scd_type).scoring_class,  
        (records::scd_type).is_active,  
        (records::scd_type).start_season,  
        (records::scd_type).end_season  
        FROM changed_records  
),

Here’s the output:

Result:

  • Each record from changed_records is now expanded into its individual components.
  • This effectively “flattens” the data, making it easier to work with in subsequent queries.

Identifying New Records

The final step is generating new records, which are records for players who did not exist in the previous season (last_season_scd). This is straightforward since we only need to filter players that are missing from the last season.

We perform a LEFT JOIN between this_season_data (ts) and last_season_scd (ls). The WHERE clause filters for records where ls.player_name is NULL, meaning the player did not exist in the previous season:

new_records AS (  
    SELECT  
        ts.player_name,  
        ts.scoring_class,  
        ts.is_active,  
        ts.current_season AS start_season,  
        ts.current_season AS end_season  
    FROM this_season_data ts  
    LEFT JOIN last_season_scd ls  
    ON ts.player_name = ls.player_name  
    WHERE ls.player_name IS NULL  
)

Final Query - Incremental SCD Table Creation

Now that we have all the components, we can UNION ALL the records together. This includes:

  1. Historical Records (historical_scd): Records with a closed end_season, meaning they won’t change.
  2. Unchanged Records (unchanged_records): Records that continue unchanged into the new season.
  3. Changed Records (unnested_changed_records): Records where something about the player changed.
  4. New Records (new_records): Players appearing for the first time.

Here’e the entire query:

CREATE TYPE scd_type AS (  
    scoring_class scoring_class,  
    is_active BOOLEAN,  
    start_season INTEGER,  
    end_season INTEGER  
)  
  
WITH last_season_scd AS (  
    SELECT * FROM players_scd  
    WHERE current_season = 2021  
    AND end_season = 2021  
),  
historical_scd AS (  
    SELECT  
        player_name,  
        scoring_class,  
        is_active,  
        start_season,  
        end_season  
    FROM players_scd  
    WHERE current_season = 2021  
    AND end_season < 2021  
),  
this_season_data AS (  
    SELECT * FROM players  
    WHERE current_season = 2022  
),  
unchanged_records AS (  
    SELECT  
            ts.player_name,  
            ts.scoring_class,  
            ts.is_active,  
            ls.start_season,  
            ts.current_season as end_season  
    FROM last_season_scd ls  
    JOIN  this_season_data ts  
    ON ls.player_name = ts.player_name  
    WHERE ts.scoring_class = ls.scoring_class  
    AND ts.is_active = ls.is_active  
),  
changed_records AS (  
    SELECT  
        ts.player_name,  
        UNNEST(  
        ARRAY[  
            ROW(  
                ls.scoring_class,  
                ls.is_active,  
                ls.start_season,  
                ls.end_season  
                )::scd_type,  
            ROW(  
                ts.scoring_class,  
                ts.is_active,  
                ts.current_season,  
                ts.current_season  
                )::scd_type  
        ]  
        ) AS records  
    FROM last_season_scd ls  
    LEFT JOIN  this_season_data ts  
    ON ls.player_name = ts.player_name  
    WHERE (ts.scoring_class <> ls.scoring_class  
    OR ts.is_active <> ls.is_active)  
),  
unnested_changed_records AS (  
    SELECT  
        player_name,  
        (records::scd_type).scoring_class,  
        (records::scd_type).is_active,  
        (records::scd_type).start_season,  
        (records::scd_type).end_season  
        FROM changed_records  
),  
new_records AS (  
    SELECT  
        ts.player_name,  
        ts.scoring_class,  
        ts.is_active,  
        ts.current_season AS start_season,  
        ts.current_season AS end_season  
    FROM this_season_data ts  
    LEFT JOIN last_season_scd ls  
    ON ts.player_name = ls.player_name  
    WHERE ls.player_name IS NULL  
)  
SELECT * FROM historical_scd  
UNION ALL  
SELECT * FROM unchanged_records  
UNION ALL  
SELECT * FROM unnested_changed_records  
UNION ALL  
SELECT * FROM new_records  
ORDER BY player_name, start_season, end_season

Why This Works:

  • Reduced Data Processing: We only process compacted records from 2021 and records from 2022, reducing the amount of data significantly (probably 20 times less).
  • Efficient Updates: Instead of rebuilding the entire table, we only process the latest season incrementally.

Critical Considerations:

  1. Null Handling:
    • We assumed scoring_class and is_active would never be NULL. If they were, comparisons like NULL <> NULL would break the logic.
    • Using IS DISTINCT FROM could fix this, but we chose = and <> operators for simplicity.
  2. Sequential Dependency:
    • The query assumes yesterday’s data is always available and correct. This sequential dependency makes backfilling more difficult since we can’t process a season without its predecessor.

Final Thought:

While this query might seem complex, it’s powerful and processes a much smaller amount of data compared to a full-table rebuild. This incremental approach is more scalable and faster in most production environments. However, its reliance on prior season data introduces dependencies that require careful management in real-world pipelines.