Dimensional Data Modeling Day Two Lab
Today’s lab builds on the Day One lab, where we created a cumulative table that tracks NBA players’ seasons and their stats throughout their careers. This time, we’ll take those datasets and convert them into Slowly Changing Dimension (SCD) Type 2 tables.
Why SCD Type 2?
Remember, SCD Type 2 is considered the gold standard for slowly changing dimensions because:
- It’s the only purely idempotent SCD type.
- It retains complete historical records while tracking changes over time.
This lab focuses on applying SCD Type 2 modeling using PostgreSQL.
LAB
Let’s drop and create again the players table with a new boolean column is_active. Then we’ll populated it with this query:
CREATE TABLE players (
player_name TEXT,
height TEXT,
college TEXT,
country TEXT,
draft_year TEXT,
draft_round TEXT,
draft_number TEXT,
season_stats season_stats[],
scoring_class scoring_class,
years_since_last_season INTEGER,
current_season INTEGER,
is_active BOOLEAN,
PRIMARY KEY (player_name, current_season)
)Then, we’ll use the following query that creates the same table that we had from 02. Data Modeling - Cumulative Dimensions, Struct and Array - Day 1 Lab:
INSERT INTO players
WITH years AS (
SELECT *
FROM GENERATE_SERIES(1996, 2022) AS season
),
p AS (
SELECT
player_name,
MIN(season) AS first_season
FROM player_seasons
GROUP BY player_name
),
players_and_seasons AS (
SELECT *
FROM p
JOIN years y
ON p.first_season <= y.season
),
windowed AS (
SELECT
pas.player_name,
pas.season,
ARRAY_REMOVE(
ARRAY_AGG(
CASE
WHEN ps.season IS NOT NULL
THEN ROW(
ps.season,
ps.gp,
ps.pts,
ps.reb,
ps.ast
)::season_stats
END)
OVER (PARTITION BY pas.player_name ORDER BY COALESCE(pas.season, ps.season)),
NULL
) AS seasons
FROM players_and_seasons pas
LEFT JOIN player_seasons ps
ON pas.player_name = ps.player_name
AND pas.season = ps.season
ORDER BY pas.player_name, pas.season
), static AS (
SELECT
player_name,
MAX(height) AS height,
MAX(college) AS college,
MAX(country) AS country,
MAX(draft_year) AS draft_year,
MAX(draft_round) AS draft_round,
MAX(draft_number) AS draft_number
FROM player_seasons
GROUP BY player_name
)
SELECT
w.player_name,
s.height,
s.college,
s.country,
s.draft_year,
s.draft_round,
s.draft_number,
seasons AS season_stats,
CASE
WHEN (seasons[CARDINALITY(seasons)]::season_stats).pts > 20 THEN 'star'
WHEN (seasons[CARDINALITY(seasons)]::season_stats).pts > 15 THEN 'good'
WHEN (seasons[CARDINALITY(seasons)]::season_stats).pts > 10 THEN 'average'
ELSE 'bad'
END::scoring_class AS scoring_class,
w.season - (seasons[CARDINALITY(seasons)]::season_stats).season as years_since_last_active,
w.season,
(seasons[CARDINALITY(seasons)]::season_stats).season = season AS is_active
FROM windowed w
JOIN static s
ON w.player_name = s.player_name;We essentially want to track changes int two columns (because with SCD tables you can track multiple columns at once). To do that let’s create a new table called players_scd:
CREATE TABLE players_scd (
player_name TEXT,
scoring_class scoring_class,
is_active BOOLEAN,
start_season INTEGER,
end_season INTEGER,
current_season INTEGER,
PRIMARY KEY(player_name, start_season)
)scoring_classandis_activeare the columns we want to track;- since we want to build a type 2 SCD, we need
start_seasonandend_season.
This is essentially a properly modelled SCD table.
We’re going to do the following things:
- write the query that looks at all of history and creates one SCD record from all of history;
- take and existing SCD table and then build on top of it incrementally. (because you can do it either way. TODO: che significa? either way a cosa si riferisce. Di seguito il riassunto di chatgpt
Now that the table has been populated with historical data, we can handle updates incrementally as new seasons are added:
- Compare New Data with Existing Records: Check for changes in
scoring_classandis_active.* - Extend or Create New Records: If the attributes are unchanged, update the
end_seasonof the last record. Otherwise, create a new record. )
Query
Now, we want to calculate the streak of how many seasons a player has been in a particular state, such as scoring class or active status. To do this, we need to look at the previous season’s data to compare changes over time.
Using Window Functions to Analyze Changes
We can use window functions, specifically LAG, to track previous values for scoring_class and is_active
SELECT
player_name,
current_season,
scoring_class,
is_active,
LAG(scoring_class, 1) OVER (PARTITION BY player_name ORDER BY current_season) as previous_scoring_class,
LAG(is_active, 1) OVER (PARTITION BY player_name ORDER BY current_season) as previous_is_active
FROM playersThis will allow us to compare the current season with the previous season.
Output:
/attachments/Pasted-image-20241202224354.png)
Adding Change Indicators
Next, create a change indicator to show when a player’s scoring class or active status has changed. Use a CASE statement to check for differences between the current and previous values:
SELECT
*,
CASE
WHEN scoring_class <> previous_scoring_class THEN 1
WHEN is_active <> previous_is_active THEN 1
ELSE 0
END as change_indicator
FROM with_previousThis will create flags to indicate when changes occur.
Output:
/attachments/Pasted-image-20241202224812.png)
If we run this, we can observe changes like:
- Player: Aaron Brooks
- 2008: Scoring class changes from “Bad” to “Average.”
- 2009: Scoring class changes from “Average” to “Good.”
This player improved significantly in back-to-back seasons but later regressed from “Good” to “Average.” These indicators help track such transitions.
Creating a Streak for Changes
Now, we want to calculate a streak for each change. This will assign a sequential number to each change a player makes:
- When the player enters the table for the first time, their streak starts at
0. - For each subsequent change, increment the streak.
This can be done by creating a new CTE (Common Table Expression) called with_indicators that incorporates the change indicators and calculates streaks incrementally.
This process helps analyze not just when changes happen but also how frequently players transition between different states, providing deeper insights into their performance over time.
Calculating Streaks
Next, we track streaks for players based on their changes. To do this, we sum up the change_indicator using a window function. Here’s how:
- Partition the data by
player_nameto handle each player’s data independently. - Order by
current_seasonto process changes chronologically.
The query looks like this:
SELECT
*,
SUM(change_indicator) OVER (PARTITION BY player_name ORDER BY current_season) as streak_identifier
FROM with_indicatorsThe streak_identifier will increment each time there’s a change, creating a sequential identifier for each streak.
Output:
/attachments/Pasted-image-20241202231011.png)
Collapsing Rows Using Streak Identifiers
Now that we have the streak identifiers, we can use them to aggregate the data. By performing a MIN and MAX on the season column for each streak identifier, we can collapse all rows with the same streak into a single record. This works because the streak identifier ensures that all grouped rows have the same values for the tracked attributes. This is the full query:
WITH with_previous as (
SELECT
player_name,
current_season,
scoring_class,
is_active,
LAG(scoring_class, 1) OVER (PARTITION BY player_name ORDER BY current_season) as previous_scoring_class,
LAG(is_active, 1) OVER (PARTITION BY player_name ORDER BY current_season) as previous_is_active
FROM players
),
with_indicators as (
SELECT
*,
CASE
WHEN scoring_class <> previous_scoring_class THEN 1
WHEN is_active <> previous_is_active THEN 1
ELSE 0
END as change_indicator
FROM with_previous
),
with_streaks as (
SELECT
*,
SUM(change_indicator) OVER (PARTITION BY player_name ORDER BY current_season) as streak_identifier
FROM with_indicators
)
SELECT
player_name,
streak_identifier,
is_active,
scoring_class,
MIN(current_season) as start_season,
MAX(current_season) as end_season
FROM with_streaks
GROUP BY player_name, streak_identifier, is_active, scoring_class
ORDER BY player_name, streak_identifierOutput (look at A.C. Green and Aaron Brooks):
/attachments/Pasted-image-20241202232205.png)
Observations
This approach effectively identifies streaks for each player:
- Consistency: Players like
A.C. Greenhave long periods with no changes, as reflected in their streak identifiers:0for the first streak and1for the second streak. - Frequent Changes: Players like Aaron Brooks are more dynamic. He starts as “bad” and active, transitions to “average” and active, and then “good” and active, changing almost every season. He even has a year of inactivity before becoming consistent for two years. For players like this, the streak identifier highlights frequent transitions across seasons.
The streak identifier increments each time there is a change, allowing us to trace continuous values for different players over time.
This aggregated table serves as the Slowly Changing Dimension (SCD) representation. It tracks:
- When changes occurred (start and end seasons).
- The state during each streak (e.g., scoring class, active status).
This method allows for a concise, accurate history of changes while preserving the temporal context for analysis.
Filtering for Incremental Build and Setting Current Season
To prepare for the incremental build, let’s filter the data to include only rows where the current season is less than or equal to 2021. This allows us to reserve 2022 for testing the incremental logic. After running the query, the data now ends at 2021, as expected:
with_indicators as (
SELECT
*,
CASE
WHEN scoring_class <> previous_scoring_class THEN 1
WHEN is_active <> previous_is_active THEN 1
ELSE 0
END as change_indicator
FROM with_previous
WHERE current_season <= 2021
)To simulate this as part of a pipeline (e.g., in Airflow), we can hardcode the current_season to 2021. This acts as a parameter you would typically inject during pipeline execution:
SELECT
player_name,
scoring_class,
is_active,
MIN(current_season) as start_season,
MAX(current_season) as end_season,
2021 AS current_season
FROM with_streaks
GROUP BY player_name, streak_identifier, is_active, scoring_class
ORDER BY player_name, streak_identifierInserting Data into the SCD Table
Now we finalize the Slowly Changing Dimension (SCD) table:
INSERT INTO players_scd
WITH with_previous as (
...
...
ORDER BY player_name, streak_identifierSCD Table Output
The final players_scd table captures:
- Each player’s streak.
- The start and end seasons for each streak.
- Relevant attributes such as scoring class and active status.
This table accurately reflects changes in player states over time.
Flexibility of SCD Modeling
The same logic can apply to different temporal dimensions:
- Instead of seasons, you could use dates, years, or months as the temporal component.
- This flexibility makes SCD modeling powerful for tracking historical changes in various datasets.
By leveraging this approach, you can process historical and incremental data efficiently, ensuring a comprehensive and accurate history of changes.
Limitations of the Current Pattern
Before diving into the incremental build, let’s review some limitations of the current approach to better understand its trade-offs.
- Expensive Query Operations
- This query relies heavily on window functions, which are computationally expensive. For example, we perform:
- Two window functions over the entire dataset.
- A final aggregation to collapse the data.
- Data is only reduced at the very end during the
GROUP BY, meaning window functions process the full dataset without any intermediate reduction.
- This query relies heavily on window functions, which are computationally expensive. For example, we perform:
- Scalability Concerns
- While the approach works well for moderate-sized datasets, processing the entire history daily can be inefficient, especially as data scales. This method is powerful but computationally heavy.
- For smaller scales, like at Airbnb, this method worked effectively even when applied to line items and other data. Technologies like Spark efficiently handled these operations. However, at much larger scales, like Facebook, where there are billions of users, this approach may not be viable.
- Handling of Outliers
- In datasets with inconsistent change rates, certain dimensions or users may generate many streaks, increasing the cardinality and slowing the query. For example, a player who changes scoring class every season generates more streaks than others, creating potential skew.
Benefits of the Current Pattern
Despite its limitations, this method has distinct advantages:
- Simplicity: It regenerates the entire historical view daily, ensuring consistency and allowing for changes in logic to propagate across all records.
- Comprehensive Data: The resulting SCD table provides a detailed, complete history that captures every change.
- Applicability to Smaller Scales: For dimensional data, which is typically smaller than fact data, this method often suffices. Dimensional data is manageable in size, even at larger companies, compared to fact data, which can grow exponentially.
Considerations for Scale
- At Airbnb, processing dimensional data for millions of users was practical with this approach, leveraging tools like Spark to manage historical scans.
- At Facebook, where user data scales to billions, such queries are less feasible due to the added zeros in dataset size. The approach might lead to out-of-memory errors, skew, and performance bottlenecks.
- As data scales or includes dimensions with high cardinality or frequent changes, this approach becomes less practical. In such cases, alternative methods like incremental builds are better suited to manage efficiency and scalability. Understanding when to use each method is a critical skill as you work with dimensional data.
Incremental Approach
Let’s move to an incremental approach for building the Slowly Changing Dimension (SCD) table. This method focuses on processing only the new and changed data, rather than reprocessing the entire history.
We’ll structure the query into several components.
Historical Records Are Static. Historical records are complete and unaffected by current changes, so they will never change:
WITH historical_scd AS (
SELECT
player_name,
scoring_class,
is_active,
start_season,
end_season
FROM players_scd
WHERE current_season = 2021
AND end_season < 2021
),Last Season’s Records. This component contains records for players who were active in the most recent season (2021). These records may be extended or updated if changes occur in 2022:
last_season_scd AS (
SELECT * FROM players_scd
WHERE current_season = 2021
AND end_season = 2021
),Current Season Records. The component brings in data for 2022, identifying any changes that require new records:
this_season_data AS (
SELECT * FROM players
WHERE current_season = 2022
),Let’s see how last_season_scd is just to get a grapple of what’s going on:
This query shows the most recent records, specifically from the 2021 season. Each player has exactly one record, representing their current state for that year.
Incremental Update Logic. The incremental update process works by:
- For records that remain unchanged in 2022, the
end_seasonfield will simply be incremented to include 2022. - For records with changes in 2022, a new record will be created in the SCD table, capturing the updated state.
The following one is a possible procedure to implement this incremental update logic.
Identifying Unchanged Records
We will approach this by analyzing and processing the records from the last season (last_season_scd) and the current season (this_season_data). The goal is to identify unchanged records and extend them to the new season. Here’s the query:
unchanged_records AS (
SELECT
ts.player_name,
t.scoring_class,
ls.scoring_class,
ls.is_active,
ls.scoring_class,
ls.is_active
FROM last_season_scd ls
JOIN this_season_data ts
ON ls.player_name = ts.player_name
WHERE ts.scoring_class = ls.scoring_class
AND ts.is_active = ls.is_active
),- Use a LEFT JOIN because new players may appear in the current season but not in the last season.
- For records that did not change between seasons:
scoring_classremains the same.is_activeremains the same. Add aWHEREcondition to filter these records:
- The
unchanged_recordstable now contains rows where players’ states remained the same. For these rows:- the
start_seasonremains as it was inlast_season_scd. - the
end_seasonis extended by one (i.e., to thecurrent season).
- the
Identifying Changed Records
Next, we handle the changed records, which are key parts of this process. The challenge with changed records is that we need to create two rows for each change:
- One row to “close” the previous record (the old record that has ended).
- Another row for the new state (the updated record).
Here’s the approach:
- Identify changes. Compare the
scoring_classandis_activefields between the last season (last_season_scd) and the current season (this_season_data). - Use an array of structs. For changed records, generate both the old and new rows together.
- Handle PostgreSQL-specific requirements. Create a custom type (
scd_type) to define the structure of the rows being generated.
Here’s the definition of scd_type:
CREATE TYPE scd_type AS (
scoring_class scoring_class,
is_active BOOLEAN,
start_season INTEGER,
end_season INTEGER
)Here’s the query:
changed_records AS (
SELECT
ts.player_name,
UNNEST(
ARRAY[
ROW(
ls.scoring_class,
ls.is_active,
ls.start_season,
ls.end_season
)::scd_type,
ROW(
ts.scoring_class,
ts.is_active,
ts.current_season,
ts.current_season
)::scd_type
]
) AS records
FROM last_season_scd ls
LEFT JOIN this_season_data ts
ON ls.player_name = ts.player_name
WHERE (ts.scoring_class <> ls.scoring_class
OR ts.is_active <> ls.is_active)
),The query generates an array with two rows for each change:
- The old record, using fields from
last_season_scd. - The new record, using fields from
this_season_data.
For example: A player with a changed scoring_class might show:
- Old record:
scoring_class = "average", is_active = true, start_season = 2020, end_season = 2021. - New record:
scoring_class = "good", is_active = true, start_season = 2022, end_season = 2022.
Here’s the result of the query:
/attachments/Pasted-image-20241206220642.png)
After generating the changed_records with its array of rows, the next step is to unnest this array to separate the structured data into individual columns. This process effectively flattens the array into separate columns:
unnested_changed_records AS (
SELECT
player_name,
(records::scd_type).scoring_class,
(records::scd_type).is_active,
(records::scd_type).start_season,
(records::scd_type).end_season
FROM changed_records
),Here’s the output:
/attachments/Pasted-image-20241207112027.png)
Result:
- Each record from
changed_recordsis now expanded into its individual components. - This effectively “flattens” the data, making it easier to work with in subsequent queries.
Identifying New Records
The final step is generating new records, which are records for players who did not exist in the previous season (last_season_scd). This is straightforward since we only need to filter players that are missing from the last season.
We perform a LEFT JOIN between this_season_data (ts) and last_season_scd (ls). The WHERE clause filters for records where ls.player_name is NULL, meaning the player did not exist in the previous season:
new_records AS (
SELECT
ts.player_name,
ts.scoring_class,
ts.is_active,
ts.current_season AS start_season,
ts.current_season AS end_season
FROM this_season_data ts
LEFT JOIN last_season_scd ls
ON ts.player_name = ls.player_name
WHERE ls.player_name IS NULL
)Final Query - Incremental SCD Table Creation
Now that we have all the components, we can UNION ALL the records together. This includes:
- Historical Records (
historical_scd): Records with a closedend_season, meaning they won’t change. - Unchanged Records (
unchanged_records): Records that continue unchanged into the new season. - Changed Records (
unnested_changed_records): Records where something about the player changed. - New Records (
new_records): Players appearing for the first time.
Here’e the entire query:
CREATE TYPE scd_type AS (
scoring_class scoring_class,
is_active BOOLEAN,
start_season INTEGER,
end_season INTEGER
)
WITH last_season_scd AS (
SELECT * FROM players_scd
WHERE current_season = 2021
AND end_season = 2021
),
historical_scd AS (
SELECT
player_name,
scoring_class,
is_active,
start_season,
end_season
FROM players_scd
WHERE current_season = 2021
AND end_season < 2021
),
this_season_data AS (
SELECT * FROM players
WHERE current_season = 2022
),
unchanged_records AS (
SELECT
ts.player_name,
ts.scoring_class,
ts.is_active,
ls.start_season,
ts.current_season as end_season
FROM last_season_scd ls
JOIN this_season_data ts
ON ls.player_name = ts.player_name
WHERE ts.scoring_class = ls.scoring_class
AND ts.is_active = ls.is_active
),
changed_records AS (
SELECT
ts.player_name,
UNNEST(
ARRAY[
ROW(
ls.scoring_class,
ls.is_active,
ls.start_season,
ls.end_season
)::scd_type,
ROW(
ts.scoring_class,
ts.is_active,
ts.current_season,
ts.current_season
)::scd_type
]
) AS records
FROM last_season_scd ls
LEFT JOIN this_season_data ts
ON ls.player_name = ts.player_name
WHERE (ts.scoring_class <> ls.scoring_class
OR ts.is_active <> ls.is_active)
),
unnested_changed_records AS (
SELECT
player_name,
(records::scd_type).scoring_class,
(records::scd_type).is_active,
(records::scd_type).start_season,
(records::scd_type).end_season
FROM changed_records
),
new_records AS (
SELECT
ts.player_name,
ts.scoring_class,
ts.is_active,
ts.current_season AS start_season,
ts.current_season AS end_season
FROM this_season_data ts
LEFT JOIN last_season_scd ls
ON ts.player_name = ls.player_name
WHERE ls.player_name IS NULL
)
SELECT * FROM historical_scd
UNION ALL
SELECT * FROM unchanged_records
UNION ALL
SELECT * FROM unnested_changed_records
UNION ALL
SELECT * FROM new_records
ORDER BY player_name, start_season, end_seasonWhy This Works:
- Reduced Data Processing: We only process compacted records from 2021 and records from 2022, reducing the amount of data significantly (probably 20 times less).
- Efficient Updates: Instead of rebuilding the entire table, we only process the latest season incrementally.
Critical Considerations:
- Null Handling:
- We assumed
scoring_classandis_activewould never be NULL. If they were, comparisons likeNULL <> NULLwould break the logic. - Using
IS DISTINCT FROMcould fix this, but we chose=and<>operators for simplicity.
- We assumed
- Sequential Dependency:
- The query assumes yesterday’s data is always available and correct. This sequential dependency makes backfilling more difficult since we can’t process a season without its predecessor.
Final Thought:
While this query might seem complex, it’s powerful and processes a much smaller amount of data compared to a full-table rebuild. This incremental approach is more scalable and faster in most production environments. However, its reliance on prior season data introduces dependencies that require careful management in real-world pipelines.