So today’s lab is all about cleaning and transforming the game_details table from our NBA dataset. This table contains key information about games, teams, players, and their stats, but it’s a bit messy, and we need to fix some issues before we can use it properly.
Let’s start by identifying the grain of the table. When working with fact data, the grain is really important because it defines the unique identifier for each row. For game_details, the grain is a combination of game_id, team_id, and player_id. Each row should represent a unique player’s contribution to a specific game for a specific team.
Now, one of the first problems with this table is that it contains duplicates. We can identify this by running a query that counts rows grouped by the grain of the table. When we group by game_id, team_id, and player_id and check for counts greater than one, we find that nearly every unique combination is duplicated:
SELECT
game_id
, team_id
, player_id
, COUNT(*)
FROM public.game_details
GROUP BY 1,2,3
having COUNT(*)>1Here’s the output:
/attachments/Pasted-image-20250105224848.png)
To fix this, we can use a common deduplication approach. First, we use the ROW_NUMBER() function to assign a unique number to each row within a group of duplicates. We partition the data by game_id, team_id, and player_id, and then assign a row number to each entry within those groups. This allows us to identify which rows are duplicates:
WITH deduped AS (
SELECT *,
ROW_NUMBER() OVER (PARTITION BY game_id, team_id, player_id) AS row_num
FROM game_details
)
SELECT *
FROM deduped
WHERE row_num = 1;Some point to highlight in this query:
- This query adds a
row_numcolumn that starts at 1 for each group of duplicates. By filteringWHERE row_num = 1, we keep only the first row from each group and drop all the others. - Furthermore, another issue with the dataset is that it lacks an ordering column. This is important because, in many cases, we’d want to decide which duplicate to keep based on some priority (by adding an
ORDER BYfunction afterPARTITION BYinside theOVER()function)—like a timestamp or a specific field value. However, thegame_detailstable doesn’t have such a column. This means that our deduplication process is somewhat arbitrary, and we’re just keeping the first row based on how the data is processed.
So now that we have our deduplicated game data with the row_num column, things are looking better. However, we still have a lot to clean up. For instance, there’s a bunch of data in this table that we don’t really need, and some important data is missing.
If you look at the table, you’ll notice it’s heavily denormalized. For example, it includes columns like team_id, team_abbreviation, and team_city. Similarly, for players, we have player_id, player_name, and other columns. Many of these columns, while useful in some contexts, don’t belong in a fact table.
One of the major principles of fact data modeling is that we should only include what’s necessary for analysis. If something can be joined cheaply later, it doesn’t need to be in the fact table. Let’s consider team_abbreviation and team_city—they’re completely unnecessary here. There are only 30 teams in the NBA, and even 250 years from now, we might have 100 teams. That’s still manageable data. You could fit all the teams in an Excel file easily, so there’s no need to include this data in the fact table.
On the other hand, some critical information is missing from this table, like the game date. A “when” column is essential in fact tables for proper time-based analysis, but it’s completely absent here. Fortunately, we can get this information from the games table by performing a join.
Here’s how we approach this:
- We join
game_detailswith thegamestable ongame_id. - From
games, we include thegame_date_estcolumn. - We update our deduplication logic to account for
game_date_est. When ordering rows for deduplication, we can now includegame_date_ESTin theORDER BYclause to ensure consistency.
For readability, we can alias the tables (gd for game_details and g for games) and write the join query:
with deduped as (
select
gd.*
, g.game_date_est
, row_number() over(partition by gd.game_id, team_id, player_id order by g.game_date_est) as row_num
from game_details gd
join games g on gd.game_id = g.game_id
)
select * from deduped
where row_num = 1Once we run this, the results will look much cleaner. We’ll see columns like game_date_est, game_id, team_id, and player_id, but we’ll also notice that unnecessary columns like team_abbreviation and team_city are still present, which we’ll remove in the next step.
Here’s why this matters: including things like game_date_est in the fact table is crucial because the games table is dynamic and will continue to grow over time. In 250 years, the number of games played will be in the hundreds of thousands, and this table will become massive. If we had to repeatedly join with the games table to get the game date for every record, our queries would get slower and slower. You can already see this issue in play—our current query takes about 7 seconds for just 10 years of data. Imagine running this on 250 years of data—it wouldn’t scale linearly but would grow even slower due to the increasing complexity.
This is why it’s so important to include essential columns like the game date in the fact table itself while stripping out unnecessary ones like team metadata that can easily be joined later without performance hits.
Let’s walk through the thought process behind the creation of this final query step by step. We’ll identify the columns that matter, why they were chosen, and how they are transformed to serve a clear analytical purpose.
We start with the games table, focusing on essential columns. While game_id might seem critical initially, it becomes unnecessary once other columns are incorporated:
game_date_est: Provides the game’s exact date in EST, a key timestamp for analysis.season: Ensures the context of the game within a specific season.home_team_id: Critical for determining whether a player is playing at home or away. Instead of retaining bothhome_team_idandvisitor_team_id, onlyhome_team_idis used since any team not at home is logically the away team. This simplifies the query without losing functionality.
We derive the column dim_is_playing_at_home using:
team_id = home_team_id as dim_is_playing_at_homeThis Boolean indicator shows whether the team is playing at home. Including such derived values reduces the need for additional joins or manual logic during analysis.
Moving to the game_details table, the focus shifts to player-level and game-level granular data. Here’s the reasoning behind the selected columns:
-
Team and Player Identifiers:
team_idandplayer_idare critical for identifying the team and player for each record.- Including
player_namealongsideplayer_idmakes the data more interpretable, allowing analysts to immediately understand who the player is without requiring a separate join.
-
Player Role:
start_position: Captures the player’s starting position for a specific game (e.g., small forward, power forward). This column is dynamic as player roles can change game-to-game, making it a core fact in the model.
-
Player Availability (
commentParsing): Thecommentcolumn holds complex, high-cardinality data. Parsing it into distinct Boolean dimensions improves usability:dim_did_not_play: Indicates the player was available but didn’t enter the game (e.g., “DNP”).dim_did_not_dress: Indicates the player was present but not in uniform (e.g., “DND”).dim_not_with_team: Indicates the player was not present (e.g., “NWT”).
These derived columns simplify analysis while preserving the original intent of the raw data. Example logic for parsing:
coalesce(position('DNP' in comment), 0) > 0 as dim_did_not_play -
Game Metrics: All fundamental performance statistics are retained:
- Shooting Metrics:
fgm,fga,fg3m,fg3a,ftm,fta. - Rebounding:
oreb,dreb,reb. - Playmaking:
ast(assists),stl(steals),blk(blocks). - Fouls and Turnovers:
pf(personal fouls),TO(renamed toturnoversto avoid using SQL keywords). - Scoring:
pts(points) andplus_minus.
- Shooting Metrics:
-
Minutes Played:
- The raw
mincolumn is a poorly formatted string. To improve usability, it’s transformed into a decimal representing fractional minutes:cast(split_part(min, ':', 1) as real) + cast(split_part(min, ':', 2) as real) / 60 as minutes - This allows analysts to calculate metrics like points per minute or rebounds per minute efficiently.
- The raw
Here’s how the final query looks:
with deduped as (
select
g.game_date_est,
g.season,
g.home_team_id,
gd.*
, row_number() over(partition by gd.game_id, team_id, player_id order by g.game_date_est) as row_num
from game_details gd
join games g on gd.game_id = g.game_id
)
select
game_date_est
, season
, team_id = home_team_id as dim_is_playing_at_home
, player_id
, player_name
, start_position
, coalesce (position('DNP' in comment), 0) > 0 as dim_did_not_play
, coalesce (position('DND' in comment), 0) > 0 as dim_did_not_dress
, coalesce (position('NWT' in comment), 0) > 0 as dim_not_with_team
, cast(split_part(min, ':', 1) as real) + cast(split_part(min, ':', 2) as real)/60 as minutes
, fgm
, fga
, fg3m
, fg3a
, ftm
, fta
, oreb
, dreb
, reb
, ast
, stl
, blk
, "TO" as turnovers
, pf
, pts
, plus_minus
from deduped
where row_num = 1This query strikes a balance between storing raw data and creating derived columns, enabling both straightforward analysis and flexibility for more complex use cases.
Let’s start by creating the DDL for our table. We’ll name the table fact_game_details. For the first column, since it’s a date, we don’t need to worry about it being in Eastern time, so we’ll call it game_date, but more appropriately, it should follow dimensional naming conventions as dim_game_date. Here’s the DDL query:
CREATE TABLE fct_game_details (
dim_game_date date,
dim_season integer,
dim_team_id integer,
dim_player_id integer,
dim_player_name text,
dim_start_position text,
dim_is_playing_at_home boolean,
dim_did_not_play boolean,
dim_did_not_dress boolean,
dim_not_with_team boolean,
m_minutes real,
m_fgm integer,
m_fga integer,
m_fg3m integer,
m_fg3a integer,
m_ftm integer,
m_fta integer,
m_oreb integer,
m_dreb integer,
m_reb integer,
m_ast integer,
m_stl integer,
m_blk integer,
m_turnovers integer,
m_pf integer,
m_pts integer,
m_plus_minus integer,
PRIMARY KEY (dim_game_date, dim_team_id, dim_player_id)
)
Fact data often benefits from distinguishing dimensions (fields to filter or group by) and measures (fields used for aggregations). For example, dim_season will be an INTEGER, and dim_team_id will also be an INTEGER. While team_id might grow large, an INTEGER should suffice unless it approaches the range limit. We also have dim_is_playing_at_home, which will be a BOOLEAN.
Considering this DDL query, we need to change a little bit the previous final query by fixing the order and the name of some columns:
with deduped as (
select
g.game_date_est,
g.season,
g.home_team_id,
gd.*
, row_number() over(partition by gd.game_id, team_id, player_id order by g.game_date_est) as row_num
from game_details gd
join games g on gd.game_id = g.game_id
)
select
game_date_est AS dim_game_date
, season AS dim_season
, team_id AS dim_team_id
, player_id AS dim_player_id
, player_name AS dim_player_name
, start_position AS dim_start_position
, team_id = home_team_id as dim_is_playing_at_home
, coalesce (position('DNP' in comment), 0) > 0 as dim_did_not_play
, coalesce (position('DND' in comment), 0) > 0 as dim_did_not_dress
, coalesce (position('NWT' in comment), 0) > 0 as dim_not_with_team
, cast(split_part(min, ':', 1) as real) + cast(split_part(min, ':', 2) as real)/60 as m_minutes
, fgm AS m_fgm
, fga AS m_fga
, fg3m AS m_fg3m
, fg3a AS m_fg3a
, ftm AS m_ftm
, fta AS m_fta
, oreb AS m_oreb
, dreb AS m_dreb
, reb AS m_reb
, ast AS m_ast
, stl AS m_stl
, blk AS m_blk
, "TO" AS m_turnovers
, pf AS m_pf
, pts AS m_pts
, plus_minus AS m_plus_minus
from deduped
where row_num = 1When we move to measures, naming conventions typically prefix them with m to clarify their purpose. For example:
m_minutes(minutes played) asREAL.m_fgm(field goals made) andm_fga(field goals attempted) asINTEGER.- Similarly,
m_fg3m,m_fg3a,m_ftm(free throws made), and others follow this pattern.
This naming approach ensures clarity—dimensions are for filtering and grouping, while measures are for calculations.
For the primary key, we combine dim_game_date, dim_player_id, and optionally dim_team_id. While team_id might seem redundant, including it could improve indexing, especially for filtering. However, logically, a player can’t be on two teams in one game, so its necessity is debatable.
Once the table is ready, we move on to inserting data. We use an INSERT INTO statement and ensure all fields match the column names and order. This setup supports querying efficiently. For example:
- A query to identify players with the highest “bail percentage” (e.g., games where they didn’t play) can be written using
CASEstatements and aggregations likeCOUNT:SELECT dim_player_name , count(*) AS num_games , count(CASE WHEN dim_not_with_team THEN 1 END) AS bailed_num , cast(count(CASE WHEN dim_not_with_team THEN 1 END) AS REAL)/count(*) AS bail_pct FROM fct_game_details GROUP BY 1 ORDER BY 4 desc - Aggregations like total points (
SUM(m_points)) or minutes played can also be easily retrieved:SELECT dim_player_name , count(*) AS num_games , sum(m_pts) AS total_points , count(CASE WHEN dim_not_with_team THEN 1 END) AS bailed_num , cast(count(CASE WHEN dim_not_with_team THEN 1 END) AS REAL)/count(*) AS bail_pct FROM fct_game_details GROUP BY 1
Renaming columns during transformation is critical, especially in fact data modeling, where the original names are often unclear or inconsistent. Following conventions for dimensions (dim_) and measures (m_) makes queries more intuitive and tables easier to use.
This approach supports efficient joins, like adding team data by joining the table teams on team_id=dim_team_id, and enables complex analysis with minimal effort. For example, as we have just seen:
- Finding the player with the highest bail percentage.
- Aggregating stats like points scored or assists while grouping by home/away games.
Ultimately, the goal of this modeling approach is to create tables that are easy to query, performant, and capable of answering diverse business questions without requiring excessive manipulation.