The main table we’re working with is player_seasons. Each row in this table represents one NBA player in a specific season, along with their stats. Here’s a preview:

The table has a temporal problem because each row corresponds to a specific season, leading to data redundancy. If this table is joined with a downstream table, it can cause shuffling and loss of compression. To address this, we aim to restructure the table so that:

Each player has one row in the table.
All seasons are stored in an array, removing the temporal component from the main structure.

Let’s review the attributes in the table to determine what’s consistent (unchanging) versus what varies by season:

Unchanging Player Attributes: player_name, height, college, country, draft_year, draft_round, draft_number. These attributes remain constant for a player and are duplicated in the current table.
Seasonal Attributes: Attributes like gp (games played), pts, reb, ast, and the season itself vary by year and represent the temporal component.

Designing the New Schema

The table currently has a lot of duplicate data in the unchanging attributes, which we can remove by restructuring. We need to create a new struct called season_stats to hold the seasonal data. In PostgreSQL, we can define a new type using CREATE TYPE. For this lab, we’ll focus on key attributes:

CREATE TYPE season_stats AS (
    season INTEGER,
    gp INTEGER,
    pts REAL,
    reb REAL,
    ast REAL
);

We’re simplifying the struct by including only essential attributes. Some additional attributes, which are “super nerdy,” are excluded to avoid making the SQL overly complex.

Now, we want to consider creating a new table. This table will include all the player-level attributes—those that don’t change between seasons—and an array of season stats to capture season-specific information. We’ll create a table called players:

CREATE TABLE players (
        player_name TEXT,
        height TEXT,
        college TEXT,
        country TEXT,
        draft TEXT,
        draft_round TEXT,
        draft_number TEXT,
        season_stats season_stats[],
        current_season INTEGER,
        PRIMARY KEY (player_name, current_season)
)

season_stats Array. This column will hold an array of season_stats, the type we previously created. It allows us to aggregate all the player’s seasonal stats into a single array.
current_season. We add a column called current_season (integer). This column represents the most recent season for a player. As we process data cumulatively, current_season helps track the latest season value and ensures updates are managed correctly.
Primary Key. The primary key for this table will be a combination of player_name and current_season. This ensures uniqueness while allowing for cumulative updates.

Some columns like age and weight are omitted:

age: It’s a property of the season, not the player. It can be derived later based on the season data.
weight: This value can change, so it’s excluded from the player-level attributes.

The current_season column ensures that as we build the table incrementally (e.g., via full outer joins), we can track and update the latest season for each player. This column plays a critical role in the cumulative development of the table and will make more sense as the table evolves.

We start by creating the players table, which we’ve already done. Next, we think about how to implement the full outer join logic for cumulating data year by year.

Building the Cumulative Query

The first step is to identify the earliest year in the dataset:

SELECT MIN(season) FROM player_seasons;

This query gives us the starting year, which is 1996. Now, let’s build the logic to combine data from today and yesterday for this first year.

We create two temporary tables:

yesterday: Contains data from the players table for the previous season (1995). Since we haven’t processed earlier years yet, this will return null for now.
today: Contains data from the player_seasons table for the current season (1996).

WITH yesterday AS (
    SELECT * FROM players WHERE current_season = 1995
),
today AS (
    SELECT * FROM player_seasons WHERE season = 1996
)

We join the today and yesterday tables using a full outer join on player_name. This allows us to handle three cases:

Players in today but not in yesterday (new players).
Players in both today and yesterday (continuing players).
Players in yesterday but not in today (retired players).

SELECT *
FROM today t
FULL OUTER JOIN yesterday y
ON t.player_name = y.player_name;

When you run this query, everything from yesterday will show as null because yesterday is empty for this first year. This is expected, as we’re starting with the seed query: Next, we coalesce non-temporal attributes (e.g., player_name, height, college) to ensure a single consistent value for each player:

SELECT
    COALESCE(t.player_name, y.player_name) AS player_name,
    COALESCE(t.height, y.height) AS height,
    COALESCE(t.college, y.college) AS college,
    COALESCE(t.country, y.country) AS country,
    COALESCE(t.draft_year, y.draft_year) AS draft_year,
    COALESCE(t.draft_round, y.draft_round) AS draft_round,
    COALESCE(t.draft_number, y.draft_number) AS draft_number
FROM today T
FULL OUTER JOIN yesterday Y
ON T.player_name = Y.player_name;

This step ensures that attributes that don’t change between seasons are carried forward correctly.

We then handle the season_stats array to accumulate seasonal data:

If y.season_stats is null, initialize a new array with the current year’s data (today values).
If today is not null (WHEN t.season IS NOT NULL), we create the new value
otherwise we carry forward the history (retired players who don’t have any new seasons).

CASE  
	WHEN y.season_stats IS NULL THEN ARRAY[  
		ROW(t.season, t.gp, t.pts, t.reb, t.ast)::season_stats  
	]  
	WHEN t.season IS NOT NULL THEN y.season_stats || ARRAY[ROW(t.season, t.gp, t.pts, t.reb, t.ast)::season_stats]  
	ELSE y.season_stats  
END AS season_stats,

Finally, we calculate the current_season column. If there’s a value in today, use it. Otherwise, increment y.current_season by one:

COALESCE(t.season, y.current_season + 1) AS current_season

Combining all the steps, the query looks like this:

WITH yesterday AS (
    SELECT * FROM players WHERE current_season = 1995
),
today AS (
    SELECT * FROM player_seasons WHERE season = 1996
)
SELECT
    COALESCE(t.player_name, y.player_name) AS player_name,
    COALESCE(t.height, y.height) AS height,
    COALESCE(t.college, y.college) AS college,
    COALESCE(t.country, y.country) AS country,
    COALESCE(t.draft_year, y.draft_year) AS draft_year,
    COALESCE(t.draft_round, y.draft_round) AS draft_round,
    COALESCE(t.draft_number, y.draft_number) AS draft_number,
    CASE
        WHEN y.season_stats IS NULL THEN ARRAY[
            ROW(t.season, t.gp, t.pts, t.reb, t.ast)::season_stats
        ]
        WHEN t.season IS NOT NULL THEN y.season_stats || ARRAY[ROW(t.season, t.gp, t.pts, t.reb, t.ast)::season_stats]
        ELSE y.season_stats
    END AS season_stats,
    COALESCE(t.season, y.current_season + 1) AS current_season
FROM today t
FULL OUTER JOIN yesterday y
ON t.player_name = y.player_name;

Here’s a preview:

Results (TODO: Not said by Zach. confirm!):

Players Active in 1996: Their season_stats arrays are initialized with the 1996 data.
Retired Players: Their data from yesterday is carried forward unchanged.
New Players: They’re added with their first year’s stats.

We’re going to turn the cumulative query into a pipeline. First, we use INSERT INTO to add data to the players table:

INSERT INTO players  
WITH yesterday AS (  
    SELECT * FROM players WHERE current_season = 1995  
),  
today AS (  
    SELECT * FROM player_seasons WHERE season = 1996  
)  
SELECT  
    ...

The result, that you can look by running SELECT * FROM players; is, of course, the same of the latest picture.

Next, we modify the cumulative query for the next year. We change yesterday to 1996 and today to 1997:

INSERT INTO players
WITH yesterday AS (
    SELECT * FROM players WHERE current_season = 1996
),
today AS (
    SELECT * FROM player_seasons WHERE season = 1997
)
SELECT
	...

Now the table shows:

Players active in 1996 and 1997 with an array of two seasons.
New players, such as rookies, with only their 1997 stats.
Retired players with only their 1996 stats.

We repeat the process for later years. For example:

Update yesterday to 1997 and today to 1998, then update to 1998 and 1999, then 1990 and 2000, then 2000 and 2001
Insert the results into the players table.

After several iterations, the table accumulates all players and their histories. Some players have arrays with multiple seasons, while others have only one or two.

Special Cases and Flexibility

Special cases, like Michael Jordan, are handled naturally:

If a player skips a year, their historical data remains intact.
When they return, their new stats are added to the array.

To check Michael Jordan’s data:

SELECT * FROM players
WHERE current_season = 2001
AND player_name = 'Michael Jordan';

Result: The result shows a gap for the years he was retired.

Now we can use UNNEST to revert the season_stats array into individual rows. This is useful for downstream processes or joins:

SELECT
	player_name,
	UNNEST(season_stats) AS season_stats
FROM players
WHERE current_season = 2001
AND player_name = 'Micheal Jordan'

Output: This query shows:

Each season as its own row.
Player-level attributes (like height and college) repeated for each row.

To break out the season_stats struct into columns:

SELECT  
	PLAYER_NAME,  
	(UNNEST(SEASON_STATS)::SEASON_STATS).*  
FROM  
	PLAYERS  
WHERE  
	CURRENT_SEASON = 2001  
	AND PLAYER_NAME = 'Michael Jordan';

Output:

This is how you get back to the old schema. You can easily go back and forth between the UNNEST data and the original cumulative table. That flexibility is one of the powerful aspects of these queries and syntaxes.

You get the versatility of knowing all sorts of other facts about a player while maintaining the ability to go back to the cumulative players table. This table is already at the appropriate grain, making it highly efficient for certain types of queries.

For example, let’s remove Michael Jordan from the query real quick:

SELECT  
	PLAYER_NAME,  
	(UNNEST(SEASON_STATS)::SEASON_STATS).*  
FROM  
	PLAYERS  
WHERE  
	CURRENT_SEASON = 2001

Output: You’ll notice that everything remains sorted. Each player’s records are grouped together, and their temporal data (like seasons) is kept intact. This approach addresses the run-length encoding problem by ensuring that all players’ names and their associated seasons are kept neatly grouped.

This is the power of cumulative table design. When you have a temporal component—in this case, the season—you can:

Perform a join to add new data or attributes for a player.
After the join, use UNNEST to expand the data back into individual rows for detailed analysis.

Everything remains compressed and sorted. The temporal pieces stay together, so you don’t need to worry about losing the order or disrupting the grouping.

I highly recommend exploring this design further. It ensures efficient joins, preserves sorting, and keeps your data organized, which is a powerful approach for handling temporal components in data modeling.

Enhancing the Schema

Let’s drop the players table (with DROP TABLE players;) because we want to create a new one. Then we’ll create a type for the scoring_class column as an ENUM:

CREATE TYPE scoring_class AS ENUM ('star', 'good', 'average', 'bad');

Now, let’s create a new players table with two more columns:

CREATE TABLE players (
        player_name TEXT,
        height TEXT,
        college TEXT,
        country TEXT,
        draft_year TEXT,
        draft_round TEXT,
        draft_number TEXT,
        season_stats season_stats[],
        scoring_class scoring_class,
        years_since_last_season INTEGER,
        current_season INTEGER,
        PRIMARY KEY (player_name, current_season)
)

Now, we need to compute these new two columns in the cumulative query. Let’s start with years_since_last_season:

CASE  
	WHEN t.season IS NOT NULL THEN 0  
	ELSE y.years_sunce_last_season + 1  
	END as years_since_last_season

Explanation:

If the current season (t.season) exists, that means the player is active, so the value is 0.
Otherwise, take the previous year’s value (y.years_since_last_season) and add 1. This will keep incrementing for retired players. If the player comes back, it resets to 0.

For the scoring_class, we have four possible values: star, good, average, and bad. We will use the points column to determine the classification. The logic is:

CASE
	WHEN t.season IS NOT NULL THEN  
		CASE WHEN t.pts > 20 THEN 'star'  
				WHEN t.pts > 15 THEN 'good'  
				WHEN t.pts > 10 THEN 'average'  
				ELSE 'bad'  
		END::scoring_class  
		ELSE y.scoring_class  
    END as scoring_class,

The scoring logic will vary depending on whether the player is active in the current season:

Active Players (t.season IS NOT NULL): Compute the scoring_class based on the current season’s points using the logic above.
Retired Players (t.season IS NULL): Carry forward the scoring_class from their most recent season.

Let’s populate the players table as we did before by updating yesterday to 1997 and today to 1998, then update to 1998 and 1999, then 1990 and 2000, then 2000 and 2001.

We begin by querying the players table for the current season of 2001:

SELECT * FROM players WHERE current_season = 2001;

When you look at the results:

Some players, like those who haven’t played in several years, show a higher value in the years_since_last_season column.
For example, there’s a player who last played in 1997, so their years_since_last_season is now 4.

You’ll also notice many players who only appear for one or two years in the dataset and then disappear. These retired players still retain their previous scoring_class and history.

To drill down into a specific player, filter by player_name:

SELECT * FROM players WHERE current_season = 2001  
and player_name = 'Michael Jordan';

In the 2001 season:

years_since_last_season: This is 0, as Michael Jordan is playing that year.
scoring_class: He’s classified as a “star” because he averaged 22.9 points in 2001, which meets the threshold for the “star” category.

If you query Michael Jordan for 2000:

SELECT * FROM players WHERE current_season = 2000
and player_name = 'Michael Jordan';

You’ll see:

years_since_last_season: This is 3, reflecting the three years he took off between 1997 and 2001.
scoring_class: Since he didn’t play in 2000, his classification carries over from his most recent active season.

This process illustrates how you can build the table incrementally. With each new season:

Active players get updated with their latest stats and a recalculated scoring_class.
Retired players retain their scoring_class and have their years_since_last_season incremented.

This approach allows for efficient queries and powerful ways of analyzing player histories.

Analyzing Player Improvement Over Time

We’re now going to run an analysis to see which player had the biggest improvement from their first season to their most recent season. This can be done by comparing the first and last entries in the season_stats array.

Step 1: Extract Points from First and Latest Seasons

Use array indexing to get the first and latest seasons:

SELECT
        player_name,
        (season_stats[1]::season_stats).pts as first_season,
        (season_stats[CARDINALITY(season_stats)]::season_stats).pts as latest_season
FROM players
WHERE current_season = 2001

Output:

This gives you:

Each player’s first and most recent seasons, stored in two separate columns.

Step 3: Calculate Improvement Ratio and Find the Most Improved Player

To measure improvement, divide the latest season points by the first season points. Ensure no division by zero by handling cases where first_season_points = 0. Then order the results by improvement_ratio in descending order to find the player with the most significant improvement:

SELECT  
        player_name,  
        (season_stats[CARDINALITY(season_stats)]::season_stats).pts /  
        CASE
	        WHEN (season_stats[1]::season_stats).pts=0 THEN 1
	        ELSE (season_stats[1]::season_stats).pts
	    END AS improvement_ratio
FROM players  
WHERE current_season = 2001  
ORDER BY 2 DESC

Output:

Key Observations

Efficiency Without Group By: Unlike typical queries requiring GROUP BY for aggregations, this query avoids it entirely. Instead, the cumulative table design enables direct access to the first and last seasons via array indexing.
Speed of Query Execution: Since no grouping or shuffling occurs, the query is extremely fast. The ORDER BY step is the slowest part, but without that, the query runs in a map-only step and is highly parallelizable.

Benefits of Cumulative Tables

No Grouping Required: Access to historical and latest data is built into the cumulative table design, avoiding the need for complex groupings or shuffles.
Performance: The query runs exceptionally fast because all operations are local to each row. Without shuffling, it’s “infinitely parallelizable.”
Historical Analysis: Enables quick comparisons and insights into player performance over time.

Quartz 4

Explorer

02. Data Modeling - Cumulative Dimensions, Struct and Array - Day 1 Lab

Designing the New Schema

Building the Cumulative Query

Special Cases and Flexibility

Enhancing the Schema

Analyzing Player Improvement Over Time

Step 1: Extract Points from First and Latest Seasons

Step 3: Calculate Improvement Ratio and Find the Most Improved Player

Key Observations

Benefits of Cumulative Tables

Graph View

Table of Contents

Backlinks