Create a PostgreSQL table with a million rows (Lab)

To create a PostgreSQL table with a large number of rows, you can use a combination of Docker and the generate_series function. This process allows you to quickly set up a database and populate it with millions of rows for testing purposes.

Step-by-Step Guide

Here’s how to create a PostgreSQL table with a million rows:

  1. Run a PostgreSQL Docker Container: Start a new PostgreSQL instance using Docker. This command sets the POSTGRES_PASSWORD and names the container pg1.
    docker run -e POSTGRES_PASSWORD=postgres -d --name pg1 postgres
  2. Access the Database Shell: Connect to the running container and access the psql command-line tool.
    docker exec -it pg1 psql -U postgres
    You are now in the PostgreSQL shell, ready to execute commands.
  3. Create a Table: Create a new table. In this example, the table is named temp and has a single column t_value of type INTEGER.
    CREATE TABLE temp (t_value INTEGER);
  4. Populate the Table with Data: Use the INSERT INTO statement combined with the built-in generate_seriesfunction to populate the table. The generate_series function creates a sequence of numbers, which you can use in a SELECT statement.
    INSERT INTO temp (t_value)
    SELECT
        (random() * 100)::int
    FROM
        generate_series(0, 1000000);
    • generate_series(0, 1000000) produces a list of numbers from 0 to 1 million.
    • random() generates a random floating-point number between 0.0 and 1.0.
    • random() * 100 scales the random number to be between 0 and 100.
    • ::int casts the resulting float to an integer. This gives you a million random integer values to insert into the table.

Verifying the Data

After the command finishes, you can verify that the table has been populated correctly:

  • View some rowsSELECT t_value FROM temp LIMIT 10;
  • Count all rowsSELECT count(*) FROM temp;

This method provides a quick and efficient way to create a large dataset for testing performance, indexing, and other database features. You can easily modify the generate_series function to create more rows or adjust the randomfunction to create different types of data.

Getting started with Indexing (Lab)

Indexing is a fundamental concept every backend engineer working with databases must understand. An index dramatically affects query performance, transforming queries from taking seconds to executing in milliseconds. However, indexes aren’t automatic solutions—they require careful design and understanding of when and how databases use them.

What Is Indexing?

An index is a separate data structure built on top of a table that creates shortcuts for finding data. The analogy of a phone book is apt: rather than scanning every entry sequentially, the phone book has tabs for each letter, allowing you to jump directly to the ‘Z’ section when searching for “Zebra Company.”

Indexes typically use two data structures: B-trees and LSM trees. B-trees organize data hierarchically, enabling logarithmic search times. To find row 2000, the B-tree compares: “Is 2000 between 1 and 4000? Yes. Is it between 1000 and 3000? Yes.” This process continues, narrowing the search space until the exact row is located.

Demonstration: Queries With and Without Indexes

Consider a table named employees with 11 million rows, containing two columns: id (an auto-incrementing primary key) and name (random strings). The id column has an index by default because it’s a primary key. The name column has no index initially.

Query 1: Select by Indexed Column (ID)

EXPLAIN ANALYZE SELECT id FROM employees WHERE id = 2000;

Execution time: 0.6 milliseconds

The database performs an index scan on employees_pkey (the primary key index). It doesn’t scan 11 million rows—it searches the B-tree index, which is far smaller and more structured.

Critically, the query doesn’t access the heap (the main table storage). Since only id is requested and id exists in the index, the database retrieves the value directly from the index—an “inline query.” This is the fastest possible scenario: finding data without touching the main table. This is called Index Only Scan.

Query 2: Select Different Column by Indexed Column

EXPLAIN ANALYZE SELECT name FROM employees WHERE id = 5000;

Execution time: 2.5 milliseconds

The database still uses the index to find row 5000 quickly. However, the name column isn’t in the index—it’s in the heap. After locating the row via the index, the database must jump to the heap, read the page containing that row, and extract the name value.

This additional heap access (“heap fetch”) adds latency. While 2.5 milliseconds is still fast, it’s roughly 4x slower than the inline query. The difference between 0.6ms and 2.5ms may seem trivial for a single query, but at scale—thousands of queries per second—this overhead compounds significantly.

Query 3: Select by Non-Indexed Column

EXPLAIN ANALYZE SELECT id FROM employees WHERE name = 'xyz';

Execution time: ~3 seconds

Without an index on name, the database has no choice but to perform a full table scan. It reads every row in the 11 million row table, checking whether name equals ‘xyz’. This is the worst-case scenario.

PostgreSQL attempts to mitigate this with parallel sequential scans, spawning multiple worker threads to divide the table among them. Even with parallelism, scanning 11 million rows takes roughly 3 seconds—orders of magnitude slower than indexed queries.

Query 4: Creating an Index on Name

CREATE INDEX employees_name_idx ON employees(name);

Creating the index takes time—the database must scan all 11 million rows, extract name values, and build a B-tree. Once complete, the same query executes much faster:

EXPLAIN ANALYZE SELECT id FROM employees WHERE name = 'xyz';

Execution time: ~47 milliseconds

The database now performs a bitmap index scan on employees_name_idx, locating matching rows via the index rather than scanning the entire table. 47ms is dramatically better than 3 seconds.

Query 5: When Indexes Aren’t Used (The LIKE Operator)

Consider this query:

EXPLAIN ANALYZE SELECT id FROM employees WHERE name LIKE '%xyz%';

Execution time: ~1+ seconds (full table scan)

Despite having an index on name, the database performs a sequential scan. Why? The LIKE '%xyz%' pattern with wildcards on both ends cannot use an index. The database must examine every row to check if ‘xyz’ appears anywhere within the name field.

Indexes work with exact matches or prefix matches (name LIKE 'xyz%'), not arbitrary substring searches. The query planner detects that the index can’t satisfy this expression and falls back to a sequential scan.

This highlights a critical principle: having an index doesn’t guarantee the database will use it. The query planner evaluates whether the index helps. If the query involves expressions or patterns the index can’t handle, the index is ignored.

Index Design Principles

Select Only Necessary Columns

SELECT * is generally discouraged because it forces the database to fetch all columns from the heap, even if only one or two are needed. If a query can be satisfied entirely from the index (inline query), it avoids heap access entirely.

Consider:

SELECT id FROM employees WHERE id = 1000;  -- Fast: inline from index
SELECT * FROM employees WHERE id = 1000;   -- Slower: requires heap fetch

Multi-Column Indexes (Covering Indexes)

Indexes can include multiple columns, allowing more queries to be satisfied inline. For example:

CREATE INDEX employees_id_name_idx ON employees(id, name);

Now a query like SELECT id, name FROM employees WHERE id = 1000 retrieves both columns from the index without accessing the heap. This technique—called a “covering index”—is one of the most powerful optimizations available.

Index Overhead

Indexes aren’t free. They consume disk space and must be updated whenever rows are inserted, updated, or deleted. Too many indexes slow down writes. The trade-off is query speed versus write throughput and storage overhead.

Query Planning

The query planner decides whether to use an index based on cost estimates. Sometimes, if the planner believes a sequential scan is faster (for example, when querying most rows in the table), it ignores indexes. Understanding EXPLAIN ANALYZE output helps diagnose why queries aren’t using expected indexes.

Primary Key Indexes

Primary keys automatically have indexes in most databases. These indexes enforce uniqueness and provide fast lookups. In many databases (like MySQL InnoDB), the primary key is a clustered index—the table is physically organized by primary key order. In PostgreSQL, the primary key is a regular B-tree index, and the table (heap) remains unordered.

Other indexes in MySQL store the primary key value rather than a direct heap pointer. This means querying a secondary index for the primary key is fast—the primary key value is already in the index, avoiding a heap fetch.

Practical Takeaways

  • Indexes enable fast lookups by providing structured shortcuts through data, reducing query times from seconds to milliseconds
  • Not all queries benefit from indexes—expressions, wildcards, and certain patterns force full table scans despite indexes existing
  • Inline queries are fastest—if the index contains all requested columns, the heap isn’t accessed
  • Heap fetches add latency—retrieving non-indexed columns requires additional I/O to the main table
  • Sequential scans are slow—without indexes, the database must examine every row, which scales poorly
  • Query planning matters—use EXPLAIN ANALYZE to understand which indexes are used and why

Indexing is both powerful and nuanced. Properly designed indexes can accelerate queries by orders of magnitude, but poorly chosen indexes or misunderstood query patterns can negate their benefits. Mastering indexing requires understanding not just how to create indexes, but how databases use them and when they don’t.

Understanding the SQL Query Planner and Optimizer with EXPLAIN

The EXPLAIN command in PostgreSQL reveals how the database plans to execute a SQL statement without actually running it. Understanding EXPLAIN output is essential for diagnosing slow queries and optimizing database performance.

Basic EXPLAIN Output

Consider a simple query:

EXPLAIN SELECT * FROM grades;

This single line contains critical information about query execution:

Sequential Scan

A sequential scan is equivalent to a full table scan—the database reads every row from the table (the “heap”). PostgreSQL chose this approach because the query has no WHERE clause and requests all columns, making a table scan the only option.

Sometimes PostgreSQL uses a “parallel sequential scan,” spawning multiple worker threads to divide the table and scan portions concurrently, improving performance on large tables.

Cost: Startup and Total

The cost notation (cost=0.00..289000.00) contains two numbers separated by ..:

Startup cost (first number: 0.00): The time in milliseconds required to retrieve the first page of results. In this case, it’s zero because PostgreSQL immediately accesses the table and begins returning rows. If you execute SELECT * FROM grades LIMIT 1, results appear almost instantly.

Startup cost can increase when PostgreSQL must perform work before returning any rows—sorting with ORDER BY, aggregating with GROUP BY, or joining tables. These operations require processing data before the first result can be returned.

Total cost (second number: 289000.00): The estimated total execution time in milliseconds to complete the query. This estimates approximately 289 seconds to scan all 200 million rows.

Important: These are estimates, not actual measurements. EXPLAIN doesn’t execute the query—it predicts behavior based on table statistics. For actual timing, use EXPLAIN ANALYZE, which executes the query and reports real durations.

Rows

rows=200000000 is PostgreSQL’s estimate of how many rows the query will return. This number is extremely valuable for understanding query impact.

For counting rows, consider using EXPLAIN instead of SELECT COUNT(*) when an approximation suffices. SELECT COUNT(*) scans the entire table, which is expensive for billions of rows. EXPLAIN provides an instant estimate based on table statistics. For applications like social media showing “1.2M likes,” exact counts aren’t necessary—approximations are sufficient and orders of magnitude faster.

Width

width=31 represents the average row size in bytes—the sum of all column sizes. For a table with three columns (integer ID, double-precision grade, text name), the width indicates the typical row occupies 31 bytes.

Width matters for network transfer. Larger widths mean more data transferred to the application. If a query returns millions of rows with large widths (such as including blob columns), network and memory overhead become significant. This is why SELECT * is discouraged—returning all columns when only one is needed wastes bandwidth and processing.

ORDER BY: Increased Startup Cost

EXPLAIN SELECT * FROM grades ORDER BY g;

The startup cost increased to 0.43 milliseconds. PostgreSQL must perform sorting before returning the first row. However, because an index exists on column g, and indexes are inherently sorted, PostgreSQL can use the index to satisfy the ORDER BY efficiently. The work is minimal but non-zero, explaining the slight startup cost increase.

Ordering by Non-Indexed Column

EXPLAIN SELECT * FROM grades ORDER BY name;

Startup cost jumps to 1,000,000 milliseconds (about 1000 seconds). Without an index on name, PostgreSQL must:

  1. Perform a parallel sequential scan to retrieve all rows
  2. Load all rows into memory
  3. Sort them by name

Reading EXPLAIN output from bottom to top clarifies execution flow. The inner operation (parallel sequential scan) fetches data. The outer operation (sort) processes that data before returning results. The enormous startup cost reflects the sorting overhead for 200 million unsorted rows.

The query also shows Parallel Workers Planned: 2, indicating two threads will cooperate on the scan. After scanning completes, a “gather merge” operation combines sorted results from both workers.

Width Varies by Selected Columns

EXPLAIN SELECT id FROM grades;

Width is now 4 bytes because only the id column (an integer) is selected. Integers occupy 4 bytes.

Caution: Using integer for auto-incrementing primary keys limits tables to approximately 2 billion rows (2^31). If a table might exceed this, use bigint (8 bytes), which supports up to 9 quintillion rows. The social media platform Parler famously exceeded the integer limit, causing their notification table to crash.

EXPLAIN SELECT name FROM grades;

Width is now 19 bytes. The name column is text with variable length. PostgreSQL estimates an average of 19 bytes per name based on table statistics.

Large widths impact performance. Selecting blob columns or many large text fields significantly increases data transfer over the network and memory consumption in the application. Only select columns actually needed to minimize width and improve performance.

EXPLAIN SELECT g FROM grades;

Width is 8 bytes because g is a double-precision floating-point number.

Index Scans

EXPLAIN SELECT * FROM grades WHERE id = 10;

PostgreSQL uses an index scan because a unique index exists on id. The startup cost is 0.43 milliseconds—slightly higher than a sequential scan’s zero because the database must traverse the B-tree index. However, the total cost is only 8 milliseconds compared to 289,000 for a full table scan.

After finding the matching row in the index, PostgreSQL performs a “heap fetch”—jumping to the table to retrieve all columns (since SELECT * was specified). This extra I/O explains why the cost isn’t closer to zero.

Index-Only Scan

EXPLAIN SELECT id FROM grades WHERE id = 10;

An index-only scan is faster because the query requests only id, which exists in the index. PostgreSQL doesn’t need to access the heap at all—it returns the value directly from the index. The total cost drops from 8 to 4 milliseconds, and the width is 4 bytes instead of 31.

Index-only scans are the fastest queries possible. When all requested columns exist in the index, no heap access is required. This is why covering indexes (indexes containing all columns a query needs) are so powerful.

Reading EXPLAIN Output

  1. Start from the innermost (bottom) operation and work upward. Inner operations execute first, and their output feeds outer operations.
  2. Focus on startup cost for latency-sensitive queries. If users wait for the first result (such as paginated queries with LIMIT), startup cost matters more than total cost.
  3. Compare total costs when evaluating different query plans or index strategies. Lower total cost generally indicates better performance.
  4. Check row estimates to understand whether the query processes thousands, millions, or billions of rows. Wildly inaccurate estimates (compared to actual row counts) indicate outdated statistics—run ANALYZE on the table to update them.
  5. Monitor width to understand data transfer sizes. Large widths suggest the query retrieves more data than necessary.

EXPLAIN vs. EXPLAIN ANALYZE

EXPLAIN provides estimates without executing the query. It’s fast and safe to run on production databases because it doesn’t modify data or consume significant resources.

EXPLAIN ANALYZE executes the query and reports actual timing. It’s more accurate but slower and, for write queries (INSERT, UPDATE, DELETE), will modify data. Use EXPLAIN ANALYZE when you need precise measurements and can afford the execution time.

Bitmap Index Scan vs Index Scan vs Table Scan

PostgreSQL uses different scan strategies depending on query characteristics and the number of rows expected. Understanding when each strategy applies—and why—helps optimize queries and interpret EXPLAIN output effectively.

Setup: The Grades Table

Consider a table with 5 million rows containing:

  • id: Auto-incrementing primary key with an index
  • name: Student name (no index)
  • g: Grade value between 0 and 100 (with an index)

The index on g is a covering index that also includes id as a non-key column, enabling index-only scans for certain queries.

Index Scan: Few Rows

EXPLAIN SELECT name FROM grades WHERE id = 1000;

PostgreSQL uses an index scan because it expects to return exactly one row. The execution:

  1. Traverse the B-tree index to find id = 1000
  2. Retrieve the row’s page location from the index
  3. Jump to the heap and fetch the page containing that row
  4. Extract the name column and return it

This is efficient because only one heap access is required.

Index Scan with Small Range

EXPLAIN SELECT name FROM grades WHERE id < 100;

PostgreSQL still uses an index scan for approximately 99 rows. The execution:

  1. Traverse the B-tree to find all rows where id < 100
  2. For each matching row (1, 2, 3, … 99):
    • Retrieve its page location from the index
    • Jump to the heap and fetch that page
    • Extract the name column

This involves random access: each row may reside on a different page, requiring up to 99 separate heap accesses. For small result sets, this overhead is acceptable—index lookup plus a few dozen random heap accesses is still faster than scanning 5 million rows.

Sequential Scan: Many Rows

EXPLAIN SELECT name FROM grades WHERE id > 100;

PostgreSQL uses a sequential scan despite an index existing on id. Why?

The query returns nearly 5 million rows (everything except the first 100). Using an index scan would require:

  1. Scan the index to find all matching rows (~5 million)
  2. For each row, jump to the heap (~5 million random accesses)

Reading the index structure plus making millions of random heap accesses is more expensive than simply reading the entire table sequentially.

PostgreSQL uses table statistics (gathered by ANALYZE) to estimate row counts. When it determines that a query will return most of the table, it abandons the index and scans the heap directly. Sequential I/O is faster than millions of random accesses.

This explains why having an index doesn’t guarantee it will be used. The query planner evaluates cost and chooses the optimal strategy. For queries returning large portions of a table, indexes add overhead without benefit.

Bitmap Index Scan: Moderate Row Counts

EXPLAIN SELECT name FROM grades WHERE g > 95;

A bitmap scan is a compromise between index scans and sequential scans. It’s used when:

  • The result set is too large for efficient random access (thousands of rows)
  • But too small to justify a full table scan

How Bitmap Scans Work

  1. Build a bitmap: PostgreSQL creates an in-memory bitmap where each bit represents one page in the table. If the table has 10,000 pages, the bitmap has 10,000 bits.
  2. Scan the index: As PostgreSQL scans the g index and finds rows where g > 95, it doesn’t immediately fetch those rows. Instead, it notes which pages contain matching rows by setting the corresponding bit in the bitmap. If a row on page 9 matches, bit 9 is set to 1. If three rows on page 42 match, bit 42 is set to 1 (once—multiple rows on the same page still result in just one bit).
  3. Fetch pages in order: After scanning the entire index, PostgreSQL has a bitmap indicating which pages contain matching rows. It then reads those pages from the heap in sequential order (not random order). Sequential reads are faster than random reads, especially on traditional hard drives.
  4. Recheck condition: Each fetched page contains multiple rows, but not all rows on that page necessarily match g > 95 (some may have g <= 95). PostgreSQL applies the condition again to filter out non-matching rows from the fetched pages.

Why bitmap scans are efficient:

  • Reduces random I/O: Instead of thousands of random heap accesses, PostgreSQL makes sequential reads of only the pages containing matching rows
  • Deduplicates page accesses: If 10 rows match on page 42, the bitmap ensures page 42 is fetched only once
  • Sequential reads are fast: Reading pages 3, 9, 15, 42, 100 in order is much faster than jumping randomly

Bitmap AND: Combining Multiple Indexes

EXPLAIN SELECT name FROM grades WHERE g > 95 AND id < 10000;

PostgreSQL can combine multiple bitmap scans using boolean logic. The execution:

  1. Scan the g index: Build a bitmap for rows where g > 95, setting bits for pages containing matching rows
  2. Scan the id index: Build a bitmap for rows where id < 10000, setting bits for pages containing matching rows
  3. AND the bitmaps: Perform a bitwise AND operation. Only pages that appear in both bitmaps (pages containing rows satisfying both conditions) remain set
  4. Fetch pages: Read only the pages indicated by the final bitmap
  5. Recheck conditions: Filter rows to ensure they satisfy both g > 95 AND id < 10000

This is remarkably efficient. Instead of fetching thousands of pages for g > 95 and thousands for id < 10000, PostgreSQL fetches only the pages containing rows that satisfy both conditions—perhaps just a few hundred pages. The bitmap AND operation eliminates pages that satisfy only one condition, dramatically reducing I/O.

Bitmap OR operations work similarly for queries with OR conditions, combining bitmaps so any page matching either condition is fetched.

Why Recheck Is Necessary

After fetching pages via bitmap scan, PostgreSQL rechecks conditions because the bitmap tracks pages, not individual rows. If page 42 contains 100 rows and only 3 match g > 95, the bitmap indicates page 42 should be fetched, but all 100 rows are retrieved. The recheck filters out the 97 non-matching rows.

Rechecking is inexpensive because pages are already in memory. It’s a simple in-memory filter operation, far cheaper than the I/O to fetch pages

Summary: When Each Scan Is Used

Scan TypeWhen UsedCharacteristics
Index ScanFew rows (1-100)Fast for small result sets; random heap access for each row
Sequential ScanMany rows (large % of table)Reads entire table; faster than millions of random accesses
Bitmap Index ScanModerate rows (1000s)Builds bitmap of pages, reads them sequentially; reduces random I/O
BitmapAnd/BitmapOrMultiple conditionsCombines bitmaps to fetch only pages matching all/any conditions

Practical Implications

Index presence doesn’t guarantee usage. The query planner evaluates whether using an index is cheaper than a sequential scan based on estimated row counts.

ANALYZE updates statistics used by the planner. If query plans seem suboptimal, outdated statistics may be the cause. Run ANALYZE table_name to refresh them.

Bitmap scans shine for moderate selectivity. Queries returning 1-10% of a table often benefit from bitmap scans, avoiding both the overhead of many random accesses and the waste of full table scans.

Covering indexes improve performance. If all requested columns exist in the index, heap access is unnecessary. Queries become index-only scans, the fastest possible execution.

Understanding these scan strategies helps interpret EXPLAIN output, diagnose slow queries, and design indexes that PostgreSQL will actually use. The query planner is sophisticated, but knowing how it chooses between strategies enables better schema design and query optimization.

Key vs Non-Key Column Database Indexing (TODO: not read yet)

When creating an index in PostgreSQL or other databases, you specify one or more fields to index, creating a B-tree structure for searching. The indexed columns are key columns—the database planner uses them for searches, and index entries contain row pointers back to the table to fetch complete rows.

Non-key indexes allow including additional columns in the index without making them searchable. If you create an index on the grade field and include the id field, queries that search by grade and select id don’t need to access the table—everything exists in the index. This eliminates heap fetches, improving performance.

Demonstration: Students Table

Consider a table with 50 million records containing multiple fields including id (primary key) and grade. Initially, no index exists on grade.

Query Without Index

EXPLAIN ANALYZE 
SELECT id, grade 
FROM students 
WHERE grade > 80 AND grade < 95 
ORDER BY grade DESC;

Execution time: 21-22 seconds

The database performs a parallel sequential scan, reading directly from data files. Multiple processes fetch pages from disk, but those pages contain grades of 10, 12, 13—values outside the desired range. The query removes approximately 14 million filtered rows—pages were fetched with unusable data that had to be discarded.

Even limiting results doesn’t always help with sequential scans:

EXPLAIN ANALYZE 
SELECT id, grade 
FROM students 
WHERE grade > 80 AND grade < 95 
ORDER BY grade DESC 
LIMIT 1000;

Execution time: ~3 seconds

While faster, the database still must search until finding 1,000 matching rows. Without an index, there’s no shortcut.

Creating an Index on Grade

CREATE INDEX g_idx ON students(grade);

Indexes are ordered by default, which databases prefer—predictability over randomness. With the index created, rerun the query:

EXPLAIN ANALYZE 
SELECT id, grade 
FROM students 
WHERE grade > 80 AND grade < 95 
ORDER BY grade DESC;

Execution time: ~16 seconds

Improvement from 21 to 16 seconds, but still slow. The query performs an index scan backward (because of ORDER BY grade DESC), efficiently finding candidate rows. All matching values sit together in the B-tree structure. However, the query requests id, which isn’t in the grade index. After identifying matching rows via the index, the database must jump to the heap to fetch id values from each row.

With a limit:

EXPLAIN ANALYZE 
SELECT id, grade 
FROM students 
WHERE grade > 80 AND grade < 95 
ORDER BY grade DESC 
LIMIT 1000;

Execution time: ~0.5 milliseconds

Much faster, but this speed is misleading—previous queries cached pages in the operating system cache. Running EXPLAIN (ANALYZE, BUFFERS) reveals cache hits:

Shared Hit: 20

All queried pages were cached. Changing the query slightly to avoid cache:

SELECT id, grade FROM students WHERE grade BETWEEN 10 AND 20 LIMIT 1000;

Now the query performs actual disk reads. With 10,000 results:

Execution time: ~25 milliseconds

Without the limit, fetching all matching rows:

Execution time: ~11 seconds
Buffers: 90,000 reads

The database went to disk 90,000 times. The index scan finds rows efficiently, but retrieving id requires accessing the heap for each row—expensive I/O.

Non-Key Column Indexes: Including ID

Drop the existing index and create a new one including id as a non-key column:

CREATE INDEX ON students(grade) INCLUDE (id);

Index creation takes longer because the index is larger—it now stores both grade and id values. Execute the query again:

EXPLAIN ANALYZE 
SELECT id, grade 
FROM students 
WHERE grade BETWEEN 10 AND 20;

Execution time: ~4 seconds

The query plan shows:

Index Only Scan
Heap Fetches: 0

Zero heap fetches. The index contains both grade and id, so the database retrieves everything from the index without accessing the heap. The query still reads from disk because the index lives on disk, but it avoids heap access entirely.

Running the query again:

Execution time: ~1 second
Buffers: 602 reads

With more cached pages, execution accelerates. The index is large and doesn’t fit entirely in memory, requiring disk I/O, but far fewer operations than heap fetches.

Why Disk Reads Still Occur

Even with an index-only scan, the database reads from disk because the index itself resides on disk. With a 16 GB system and only 1 GB shared memory allocated to PostgreSQL, the index can’t fit entirely in memory. Ideally, indexes fit in memory for maximum performance, but this isn’t always possible with large tables.

Visibility Map and Vacuum

Index-only scans depend on PostgreSQL’s visibility map being current. If the visibility map is outdated, the database may perform heap fetches even during index-only scans. Running VACUUM updates the visibility map:

VACUUM VERBOSE students;

This removes dead rows and updates internal structures. PostgreSQL runs autovacuum by default, but manually running vacuum ensures optimal performance, especially after bulk updates or inserts. The VERBOSE option reports what vacuum accomplished, showing if dead rows were removed or pages cleaned.

After vacuum, index-only scans perform better. Queries show Heap Fetches: 0 consistently.

Trade-offs

Non-key indexes increase index size. Including additional columns makes indexes larger, consuming more disk space and memory. If the index doesn’t fit in memory, queries must read it from disk, adding I/O overhead.

Performance gains depend on table size and query patterns. Larger tables benefit more from avoiding heap fetches. If queries frequently select the same columns used in searches, non-key indexes eliminate expensive heap access.

This is a tool to keep available. Non-key indexes aren’t always necessary, but understanding they exist provides options when optimizing specific query patterns. The larger the table and the more frequent the query, the greater the benefit.

Index Scan vs Index Only Scan

An index scan and an index-only scan are both methods for retrieving data using an index, but they differ in whether they need to access the main table to complete the query.

Index Scan

An index scan is a query execution plan where the database uses an index to locate the physical address of the data, but still needs to access the main table to retrieve the full row.

  1. Search the Index: The database uses the index (e.g., a B-tree) to quickly find the key value specified in the WHERE clause. This is much faster than a sequential scan because it avoids searching the entire table.
  2. Fetch from the Heap: Once the index returns the location of the matching row, the database must perform a separate disk read to go to the main table (also called the “heap”) and retrieve the requested columns. This is required because the columns you want to select are not stored in the index itself.

For example, on a table with an index on ID, the query SELECT name FROM grades WHERE ID = 7 performs an index scan. The database uses the index to find the row with ID = 7 but must then go to the main table to fetch the name column, which is not part of the index. This can be less efficient because it involves two I/O operations: one for the index and one for the main table.

Index-Only Scan

An index-only scan is a highly optimized query execution plan where the database can satisfy the query by reading only from the index, without needing to access the main table. This is possible if all the columns requested in the SELECT list are already present in the index.

There are two primary ways to achieve an index-only scan:

  1. Selecting an Indexed Column: If a query only requests the column that the index is built on, the database can return the result directly from the index. For example, the query SELECT ID FROM grades WHERE ID = 7 can be an index-only scan because the ID is the key column of the index, and its value is already stored there.
  2. Using Non-Key Columns: To make this more useful, you can add non-key columns (also called “included” columns) to an index. This allows the index to store additional data that is not used for searching but can be returned in a query. By including frequently selected columns in an index, you can turn a regular index scan into a faster index-only scan. For example, if you create an index on ID and include name as a non-key column with create index od_idx on grades(id) include (name), the query SELECT name FROM grades WHERE ID = 7 will now perform an index-only scan because both pieces of information are available within the index structure.

Combining Database Indexes for Better Performance

To gain the best query performance, you must understand how indexes work and choose the right indexing strategy for your specific use cases. The decision to use a single-column index, a composite index, or a combination of both depends on the nature of your queries.

Scenario 1: Two Separate Single-Column Indexes (on A and B)

Indexes: One index on column a and another on column b: create index on test(a); create index on test(b). Here’s the result:

  • Query EXPLAIN SELECT c FROM test WHERE a = 70: The database uses the index on a to quickly find matching rows. A bitmap index scan is used because we have to jump to the heap to pull c since c is not in the index. Note that we can easily change the decision of the DBMS: for example, if we query SELECT c FROM test WHERE a = 70 LIMIT 2, Postgres doesn’t need to take the overhead to build a bitmap because building a bitmap takes a little bit of time. Indeed with this query, a simple index scan is used:
  • Query EXPLAIN SELECT c FROM test WHERE b = 100: The same process as above, but the database uses the index on B instead.
  • Query WHERE A = 100 AND B = 100: The database performs a bitmap index scan on both the A index and the Bindex. It creates a bitmap for each, performs a logical AND operation on them (with BitmapAnd), and then uses the resulting bitmap to fetch the rows from the table in a single bulk operation. This is very fast.
  • Query WHERE A = 100 OR B = 100: The database performs a similar process, using a logical OR operation on the two bitmaps (with BitmapOr). This is still efficient, though it will likely involve more rows and thus more I/O.

Scenario 2: A Composite Index (on A and B)

composite index is a single index built on multiple columns, in a specific order. The order of the columns matters significantly. Here, the index is on (A, B).

  • Indexes: A single composite index on (A, B): create index on test(a,b).
  • Query WHERE A = 70: The database can use the composite index because A is the leftmost column. This is a very efficient operation. (Non ho capito)
  • Query WHERE B = 100: The database cannot use the composite index. Because B is not the leftmost column, the index is not sorted in a way that allows for an efficient search on B alone. The database must resort to a full table scan, which is very slow.
  • Query WHERE A = 70 AND B = 80: This is the ideal use case for a composite index. The database can use a single index lookup to find all rows matching both conditions. This is the fastest possible execution plan for this type of query.
  • Query WHERE A = 70 OR B = 80: The database cannot use the composite index for an OR query. It defaults to a full table scan, which is inefficient.

Scenario 3: A Composite Index and a Single-Column Index

This approach combines the best of both worlds to handle a variety of query patterns.

  • Indexes: A composite index on (A, B) and a separate single-column index on B: create index on test(b).
  • Query WHERE A = 70: The database uses the composite index on (A, B).
  • Query WHERE B = 100: The database uses the single-column index on B.
  • Query WHERE A = 70 AND B = 80: The database uses the composite index, which is the most efficient plan.
  • Query WHERE A = 70 OR B = 80: The database uses both indexes. It performs a bitmap scan on the (A, B) composite index (for the A condition) and a bitmap scan on the B index (for the B condition). It then performs a logical OR operation on the two bitmaps to find the final result set.

How Database Optimizers Decide to Use Indexes

When a table has multiple indexes, understanding which index the database optimizer will use is essential for determining whether additional indexes are needed. The optimizer does not use all available indexes simultaneously for every query; its choice depends on several factors related to query structure and data distribution.

Consider a table with two indexed columns: column F1 with index F1_IDX and column F2 with index F2_IDX. For a query like SELECT * FROM table WHERE F1 = 1 AND F2 = 4, the optimizer evaluates multiple strategies to determine the most efficient execution path.

Case 1: Using Both Indexes

In the first scenario, the optimizer decides to use both indexes. It queries the F1 index to find all row IDs (or tuples in PostgreSQL parlance) matching the value 1, then queries the F2 index to find all row IDs matching the value 4.

For an AND condition, it performs an intersection of these two sets, identifying only the row IDs present in both results:

For an OR condition, it performs a union instead. Once the effective result set of row IDs is collected, the optimizer accesses the table heap to retrieve the actual column values.

The optimizer chooses this approach when the result sets are neither too small nor too large. If the query returns very few rows, searching both indexes becomes inefficient—using a single index suffices. Conversely, if either index would return an excessive number of row IDs, the cost of traversing both indexes outweighs the benefit, making a full table scan more efficient. The optimizer relies on heuristics and table statistics to make this determination.

Case 2: Using One Index with Filtering

In the second scenario, the optimizer selects a single index and applies additional filtering at the table level. For example, it might use only the F1 index to find all rows where F1 equals 1, retrieve those rows from the table, and then filter the results based on the condition F2 = 4 without consulting the F2 index.

This strategy is most common when one index is significantly more selective than the other and the query uses an AND condition. If F1 returns very few rows while F2 would return many, the optimizer recognizes that using F1 alone is sufficient—any row not matching F1 is already excluded from the result set. Primary key indexes are almost always chosen in AND scenarios because of their high selectivity.

With OR conditions, the situation changes because the result set grows larger, potentially making the single-index approach less attractive. The key insight is that B-trees and other index structures have a cost to traverse. If an index will return most of the table’s rows, searching that index provides little value, and direct table access becomes more efficient.

Case 3: Full Table Scan

In the third scenario, the optimizer bypasses both indexes entirely and performs a full table scan. This occurs when the optimizer’s analysis indicates that the search conditions will return a substantial portion of the table—perhaps three-quarters or more. Since the query must access the table anyway to retrieve non-indexed columns, scanning the entire table directly becomes cheaper than navigating through indexes that offer minimal filtering benefit.

PostgreSQL and other modern databases can optimize table scans through parallel execution, deploying multiple workers to scan different portions of the table simultaneously, which further improves the efficiency of this approach.

The Critical Role of Table Statistics

Table statistics are fundamental to these optimization decisions. The database maintains statistical metadata about each table: approximate row counts, value distributions, and cardinality estimates. While not perfectly accurate, these statistics provide sufficient information for the optimizer to estimate how many rows each index will return for a given condition.

Statistics can be updated manually using commands like ANALYZE in PostgreSQL or GATHER_STATISTICS in Oracle. Keeping statistics current is not merely a performance optimization—it’s essential for correct query planning.

A common pitfall occurs with bulk data loading. Consider a table that starts empty with statistics reflecting zero rows. After inserting three rows, the statistics update asynchronously to show three rows. If you then perform a bulk insert of 300 million rows and immediately execute queries before updating statistics, the optimizer will still believe the table contains only three rows. It will choose a full table scan as the “optimal” strategy, resulting in scanning all 300 million rows when an index would have been far more efficient. After bulk operations, always update statistics before running queries. In PostgreSQL, run ANALYZE (and potentially VACUUM to clean up any dead tuples). In Oracle, use schema-level statistics gathering commands. This practice prevents the optimizer from making decisions based on stale metadata.

Query Hints: Overriding the Optimizer

For advanced users who understand their workload better than the optimizer’s heuristics, database hints allow manual control over index selection. By embedding special comments in the query, you can force the database to use a specific index regardless of what the optimizer would choose. This is useful when you have domain knowledge that the statistics cannot capture—for example, if you know your application’s data distribution has just changed in ways the statistics haven’t yet reflected.

However, query hints should be used sparingly and only when you genuinely understand the implications. The optimizer is generally correct, and overriding it can lead to worse performance if your assumptions are wrong.

Index Selectivity and Data Distribution

The effectiveness of an index depends heavily on data distribution. Consider a customer database with a state field indexed for United States locations. If 99.9% of customers are from California, an index on the state field provides minimal value for queries filtering on California—that condition effectively selects the entire table. However, that same index becomes highly efficient for the rare queries searching for customers from Texas or Florida, where the index can quickly locate the handful of matching rows among millions.

Indexes are most valuable for selective queries that return a small fraction of the table. When an index would return most rows, the optimizer recognizes that direct table access is superior. The decision to use multiple indexes becomes relevant when you need to combine moderately selective conditions—one index narrows the result set, and another index narrows it further.

Understanding your table’s data distribution—how sparse or concentrated values are across different columns—is crucial for predicting optimizer behavior and designing effective indexing strategies.

Create Index Concurrently - Avoid Blocking Production Database Writes

Creating an index on a large production table is a time-consuming operation that traditionally blocks write operations. While reads continue to function normally—queries like SELECT * FROM grades WHERE id < 10 execute without issue—write operations are blocked because the index creation process needs exclusive access to the field being indexed. This behavior is problematic for production systems that cannot tolerate downtime or transaction delays.

The Concurrent Index Creation Solution

PostgreSQL addresses this limitation with the CREATE INDEX CONCURRENTLY command. This approach allows both reads and writes to proceed while the index is being built, making it invaluable for production environments where blocking transactions is unacceptable.

The syntax adds a single keyword to the standard index creation command:

CREATE INDEX CONCURRENTLY G ON grades(field_name);

The concurrent creation process performs multiple table scans rather than a single scan. During these scans, it monitors ongoing write transactions and pauses index construction when necessary, allowing writes to complete before resuming. This ensures that write operations remain unblocked throughout the process.

For example, operations like INSERT INTO grades (field_name) VALUES (1) execute normally during concurrent index creation. Reads continue without interruption, and transactions can commit as usual. The index builder yields to active database operations rather than forcing them to wait.

Trade-offs of Concurrent Index Creation

The non-blocking nature of concurrent index creation comes with costs. The operation takes significantly longer to complete compared to standard index creation because of the multiple scans and periodic pausing. It also consumes more CPU and memory resources, potentially affecting other database processes by competing for these shared resources.

More critically, concurrent index creation can fail and leave the index in an invalid state. This occurs most commonly when creating a unique index on a table where duplicate values are inserted during the build process. Since the index doesn’t yet exist to enforce the uniqueness constraint, the database accepts the duplicate writes. When the index builder subsequently encounters these duplicates, it cannot complete successfully. The resulting invalid index must be dropped and recreated.

Default Behavior Considerations

Despite these trade-offs, PostgreSQL is moving toward making concurrent index creation the default behavior. The reasoning is sound: in production environments, maintaining fast, unblocked reads and writes is more critical than minimizing index creation time. Protecting live database operations takes priority over the speed of background maintenance tasks. If an index takes longer to build but doesn’t disrupt serving traffic, that’s an acceptable compromise for most workloads.

Bloom Filters: An Efficient Solution for Existence Checks

Bloom filters are a probabilistic data structure designed to solve a common performance problem: determining whether an element exists in a set without the overhead of querying a database or searching through large data structures.

The Problem: Expensive Existence Checks

Consider building a web service that checks whether a username is already taken. The straightforward implementation creates an HTTP endpoint that accepts a GET request with a username, executes a query like SELECT username FROM users WHERE username = 'paul' against the database (ideally using an index), and returns whether the record exists.

This approach works but suffers from poor performance. Username availability checks are extremely frequent—every user visiting the registration page tests multiple potential usernames seeking an available option. Each check requires a database query, creating significant load.

A natural optimization is caching usernames in an in-memory database like Redis. Requests first check Redis before falling back to the primary database when needed. However, this solution has drawbacks: it doubles the memory footprint by storing usernames in two locations, introduces synchronization complexity between the cache and database, and still requires substantial memory to store all usernames in Redis.

The Bloom Filter Solution

Bloom filters provide a memory-efficient alternative using a compact bit array.

Consider a 64-bit integer where each bit can be either 0 or 1. This array serves as the filter’s internal representation. When checking if username “paul” exists, the service hashes the string and takes the result modulo 64, producing a value between 0 and 63. This value identifies a specific bit position in the array. If bit 3 is produced and that bit is 0 in the filter, the system can guarantee with 100% certainty that “paul” does not exist in the database. No database query is needed.

Now consider checking if “jack” exists. Hashing “jack” and taking modulo 64 produces bit position 63. If bit 63 is set to 1, the result is different: “jack” might exist in the database. The uncertainty arises because hash collisions are inevitable—other strings may also hash to position 63. When the bit is set, the system must query the database to confirm whether “jack” truly exists.

This asymmetry is the core characteristic of bloom filters: they never produce false negatives but can produce false positives. If the filter says an element doesn’t exist, it definitively doesn’t exist. If the filter says an element might exist, a database query is required to confirm.

Building a Bloom Filter

Starting with a 64-bit array initialized to all zeros, the filter is built by setting bits as new usernames are created.

When creating username “jack”, the system makes a POST request to the server. Before writing to the database, it hashes “jack”, takes modulo 64 to get position 63, and sets bit 63 to 1. The username is then written to the database.

Creating username “paul” follows the same process: hash “paul”, get position 3, set bit 3 to 1, then write to the database.

When creating username “tim”, the hash produces position 63 again—a collision with “jack”. Since bit 63 is already set, nothing changes in the filter. The username is still written to the database. This demonstrates why queries for “tim” or any other string hashing to position 63 will return “might exist”, requiring database confirmation.

Creating username “ali” hashes to position 4, setting bit 4 to 1.

As more usernames are added, more bits become set. The filter gradually fills with 1s, each representing one or more usernames that hash to that position.

Practical Considerations and Limitations

Production bloom filter implementations typically use more sophisticated designs than this simplified example. Common enhancements include using multiple hash functions (often three) to map each element to multiple bit positions, and using larger bit arrays to reduce collision rates. These improvements decrease false positive rates at the cost of additional computation and memory.

The primary limitation occurs when the bit array becomes saturated—all bits set to 1. At this point, every existence check returns “might exist”, forcing a database query regardless of whether the username actually exists. The filter provides no benefit, though it causes no harm beyond the wasted memory.

Sizing the filter involves balancing trade-offs. Larger bit arrays reduce false positive rates but consume more memory. Smaller arrays conserve memory but produce more false positives, diminishing the benefit of using the filter. The optimal size depends on the expected number of elements and acceptable false positive rate.

Bloom filters excel in scenarios where avoiding expensive operations for non-existent elements provides significant value. Cassandra uses bloom filters extensively in its SSTable implementation to avoid unnecessary disk reads when checking for keys that don’t exist. Any system that needs to quickly determine “definitely not present” versus “possibly present” can benefit from this data structure’s memory efficiency and query speed.

Working with Billion-Row Table

When designing a system, a critical question emerges: how much will this table grow over the next few years? Will it reach hundreds of millions or even billions of rows? Understanding how to work with massive data volumes is essential for building scalable systems. There are several strategies for handling tables at this scale, each with distinct trade-offs.

Strategy 1: Parallel Processing Through Brute Force

The first approach is brute forcing the problem through parallelization. When searching for a row in a billion-row table, the table can be divided into multiple segments, with each segment processed concurrently across different threads or machines. This is the fundamental principle behind big data frameworks like Hadoop: use MapReduce to break the table into smaller chunks, distribute the workload across a cluster of machines, and process everything in parallel.

For certain workloads, distributing the problem across a 100-machine cluster can be effective. However, this approach has significant limitations, particularly for transactional databases where data changes constantly. The moment parallel processing begins, concurrent writes may alter the data being processed, creating consistency challenges. Brute force parallelization works best for static or append-only datasets rather than live transactional systems.

Strategy 2: Reducing the Search Space

Rather than processing the entire table, a more efficient approach is narrowing the search space to work with only a subset of the data. This is achieved through several complementary techniques.

Indexing

The most fundamental technique is indexing. An index creates a separate on-disk structure—typically a B-tree or LSM-tree—that organizes data to enable efficient lookups. Instead of scanning all billion rows, a query consults the index to identify exactly which rows match the search criteria. The index itself requires scanning, but it’s substantially smaller than the full table and structured for rapid traversal.

Think of an index like a color-coded filing system in an office. Contracts starting with ‘A’ are in one section, ‘B’ in another, and so on. To find a contract for a company named “Zebra”, you immediately navigate to the ‘Z’ section rather than searching through every document. Similarly, an index lets you work with millions of rows instead of billions.

Partitioning

Partitioning takes this further by physically dividing the table into smaller segments stored in different disk locations. In horizontal partitioning, the table is sliced by row ranges: rows 1-100 million might reside in one partition, rows 100,000,001-200 million in another, and so forth. Each partition maintains its own indexes, but the database typically handles this transparently—querying a partitioned table feels identical to querying a single monolithic table.

The key to partitioning is the partition key, which determines how rows are distributed across partitions. When executing a query, the database uses the partition key to identify which partition(s) to search. If the query conditions align well with the partition key, only one or a few partitions need to be examined. Combined with indexes within each partition, this dramatically reduces the working set from billions of rows to potentially thousands.

Sharding

Sharding extends partitioning across multiple physical database servers. The first 100,000 customers might reside in one database instance, the next 100,000 in another, and these instances operate independently. This distributes both storage and processing load, but introduces significant complexity.

Sharding creates several challenges. Cross-shard transactions become difficult or impossible since the databases don’t communicate. The client must become shard-aware, determining which database instance to query for each request. If searching for customer 500, the client must know that customer resides in shard 1. Additionally, if a shard server fails, all data on that shard becomes unavailable, requiring careful planning for redundancy and failover.

Despite this complexity, sharding can reduce query scope dramatically. Starting with billions of rows, sharding narrows the search to a specific database, partitioning within that shard further reduces the scope, and indexes within each partition narrow it to the final result set—potentially just hundreds or thousands of rows.

Strategy 3: Avoiding Billion-Row Tables Entirely

The most effective approach is often reconsidering whether a billion-row table is necessary at all. Database design choices directly determine table size, and alternative schemas can sometimes eliminate the scalability problem.

Consider a Twitter-like following feature where each follow relationship creates a row: user A follows user B, user C follows user A, and so on. With millions of users and hundreds of follows per user, this quickly becomes a massive table.

An alternative design embeds follower information directly in the user profile. Instead of a separate follows table, each user profile includes a follower count field and a JSON column (or list field) containing follower IDs. When someone follows you, a single row is updated rather than inserting a new row in a separate table. Fetching your followers requires reading just your profile row and extracting the embedded data.

This design shifts the challenge from read scalability to write throughput. Each follow now requires updating a user profile rather than inserting into a separate table. However, this can be addressed through asynchronous processing: follow events are queued and processed gradually, updating follower counts with acceptable eventual consistency. Most users don’t require perfectly real-time follower counts, making this trade-off viable.

The strategies should be considered in this priority order:

  1. Redesign to avoid billion-row tables. Examine whether alternative schemas can eliminate or significantly reduce table size through denormalization, embedding related data, or different modeling approaches.
  2. Apply indexing. If large tables are unavoidable, ensure appropriate indexes exist on columns used in query predicates to minimize the rows examined.
  3. Implement partitioning. When indexes alone are insufficient, partition the table on disk to break it into manageable segments that can be searched independently.
  4. Consider sharding. If a single database server cannot handle the load even with partitioning, distribute data across multiple database instances. Be aware that sharding introduces complexity in client logic, transaction handling, and operational management.
  5. Use parallel processing as a last resort. For workloads where the above strategies are impractical—particularly batch processing of static datasets—MapReduce-style parallel processing across a cluster can be effective.

Sharding should be avoided when possible because it complicates the system architecture substantially. Transactional consistency, schema migrations, rebalancing data as the system grows, and operational overhead all become significantly more complex with sharded databases. Exhausting single-database optimization techniques first is usually the prudent path.

How UUIDs in B+Tree Indexes affect performance

Understanding how random UUIDs affect database performance is critical when designing systems that use universally unique identifiers. The performance impact primarily affects inserts and writes, though reads can also be impacted depending on workload characteristics.

This analysis focuses specifically on UUID version 4, which generates completely random identifiers. Even when generated consecutively from the same machine, version 4 UUIDs are nowhere near each other numerically. This distinction matters because UUIDs, like all data, are ultimately stored as numbers—128-bit values that can be compared and ordered. This applies equally to strings, which are converted to bytes (ASCII or UTF-8) and treated as numeric values.

This discussion excludes UUID version 1 (which provides ordering but leaks MAC addresses), and UUID versions 7 and 8 (which are time-ordered). Version 4 remains the most popular and is typically the default implementation.

Why Randomness Hurts Index Performance

The fundamental property of indexes is that they must maintain entries in sorted order. Whether examining MySQL’s index-organized tables (where the primary index is the table itself with full rows as leaf pages) or standard B-tree indexes, the leaf pages must store entries sequentially. Without ordering, the index becomes useless. The database must know where to insert new entries and must balance the tree accordingly.

When inserting random, unordered values into an index, severe performance degradation occurs. To understand why, consider how indexes physically organize data.

Understanding Page Splits Through an Example

Consider inserting the following random UUID values (represented as simplified numbers): 10, 90, 80, 40, 5, 70, and 60. Assume each leaf page can hold only two elements (each consisting of a key and a value pointer—or in index-organized tables like MySQL and Oracle, the full row).

Starting with an empty index, inserting 10 creates the first entry. Inserting 90 places it after 10 in the same page since the page has capacity and 90 is greater than 10. The page now contains [10, 90] in order.

Now insert 80. The value 80 must be positioned between 10 and 90 to maintain sorted order, but the page is full. This triggers a page split—one of the most expensive operations in index maintenance. The database must create a new page, redistribute the existing values, and insert the new value in the correct position. One implementation might keep [10, 80] in the original page and move [90] to a new page.

Page splits require substantial overhead. The database must update page references (each page has a number corresponding to its disk offset), maintain doubly-linked pointers between pages (B+ trees link leaf pages for range scans), and potentially rebalance internal tree nodes. All of this involves disk I/O and memory operations.

Continuing the insertion sequence demonstrates the cascading impact. Inserting 40 requires another page split since it must be positioned between 10 and 80. Inserting 5 (which must precede 10) triggers yet another split.

Inserting 70 (between 40 and 80) and 60 (between 45 and 70) cause further reorganization.

After inserting just seven values, the index has undergone multiple page splits, scattered data across numerous pages, and touched different pages in random order. Now imagine inserting millions of random UUIDs—the performance degradation becomes severe.

The Buffer Pool Thrashing Problem

Beyond page splits, random UUIDs cause buffer pool thrashing. When updating an index, the database loads pages from disk into the shared buffer pool—a memory area shared across all database processes. This pool has limited capacity and operates as an LRU (least recently used) cache.

With random UUIDs, each insert might require a page that isn’t currently in the buffer pool. The database must fetch that page from disk, potentially evicting an older page to make room. Because UUIDs are random, the next insert might need the page that was just evicted, causing another fetch and another eviction. This continuous cycle of fetching and flushing—called I/O thrashing—devastates performance.

The Ordered Alternative

Contrast this with inserting ordered values: 10, 20, 30, 50, 70, 80. The first page receives [10, 20]. When inserting 30 and the page is full, a new page is created with [30] appended to the end. Subsequent values (50, 70, 80) continue appending to the tail of the index structure.

No page splits occur. No shuffling between pages. No random I/O. The database simply appends new values to the last page, creating new pages only when needed at the end of the index. This is the ideal workload for database indexes—sequential inserts are orders of magnitude faster than random inserts.

Real-World Impact: The Shopify Case Study

Shopify provides a compelling real-world example. They originally used UUID version 4 to label every purchase order, ensuring idempotency—if a customer accidentally submits the same request twice, Shopify’s system detects and rejects the duplicate.

The problem manifested because purchases happen in temporal clusters. When thousands of customers purchase simultaneously, all those transactions occur within the same minute, meaning they should be clustered together in the index timeline. However, because the UUIDs were random, the database was forced to scatter these temporally-related transactions across the entire index structure, causing excessive page splits and I/O thrashing.

Shopify switched from UUID version 4 to ULID (Universally Unique Lexicographically Sortable Identifier), which generates time-ordered identifiers. The impact was dramatic. Write performance improved substantially because new purchases now append to the index tail rather than scattering randomly.

Read performance also improved. When customers make purchases, they often query those same orders shortly afterward—within seconds, minutes, or at most an hour. With ULIDs, these temporally-related reads access the same index region (the tail), keeping relevant pages in the buffer pool. With random UUIDs, each read would require fetching a different random page from disk.

When Ordered IDs Don’t Help Reads

The read performance benefit only applies when read patterns correlate with insertion order. Consider a URL shortener service that generates shortened URLs using ordered identifiers. While writes benefit from sequential insertion, reads are completely random—users click shortened URLs in unpredictable patterns, and the first URL created might be the most frequently accessed. No correlation exists between insertion order and access pattern, so buffer pool efficiency gains from ordered identifiers don’t materialize for reads.

Shopify’s workload was fortunate: their access patterns naturally correlated with temporal ordering, benefiting both reads and writes simultaneously.

Special Consideration for MySQL

MySQL deserves particular attention because of its index-organized table architecture. In MySQL (InnoDB), the primary key is the table—leaf pages of the primary index contain complete rows, not just pointers. All secondary indexes reference rows using the primary key value.

If the primary key is a random UUID, every secondary index stores that random UUID as its pointer. The randomness “poisons” the entire database: not only does the primary index suffer from random insertion, but every secondary index also experiences the same degradation because they all reference random primary key values. This multiplicative effect makes random UUID primary keys particularly problematic in MySQL.

Scope of the Problem

This performance issue specifically affects indexed fields. In PostgreSQL, if a UUID field is not indexed, inserts simply append to the heap without ordering concerns, causing no performance degradation. The problem only manifests when the UUID field has an index—whether as a primary key, unique constraint, or secondary index.

Random UUIDs in indexes cause measurable performance degradation through page splits, I/O thrashing, and buffer pool inefficiency. For write-heavy workloads, consider time-ordered alternatives like ULID or UUID version 7. For read-heavy workloads, evaluate whether read patterns correlate with insertion order before expecting performance benefits from ordered identifiers.

Article - The Cost of Long running Transactions

The cost of a long-running update transaction that eventually failed in Postgres (or any other database for that matter.

In Postgres, any DML transaction touching a row creates a new version of that row. if the row is referenced in indexes, those need to be updated with the new tuple id as well. There are exceptions with optimization such as heap only tuples (HOT) where the index doesn’t need to be updated but that only happens if the page where the row lives have enough space (fill factor < 100%)

If a long transaction that has updated millions of rows rolls back, then the new row versions created by this transaction (millions in my case) are now invalid and should NOT be read by any new transaction. You have many ways to address this, do you clean all dead rows eagerly on transaction rollback? Or do you do it lazily as a post-process? Or do you lock the table and clean those up until the database fully restarts?

Postgres does the lazy approach, a command called vacuum which is called periodically. Postgres attempts to remove dead rows and free up space on the page.

What’s the harm of leaving those dead rows in? It’s not really correctness issues at all, in fact, transactions know not to read those dead rows by checking the state of the transaction that created them. This is however an expensive check, the check to see if the transaction that created this row is committed or rolled back. Also, the fact that those dead rows live in disk pages with alive rows makes an IO not efficient as the database has to filter out dead rows. For example, a page may have contained 1000 rows, but only 1 live row and 999 dead rows, the database will make that IO but only will get a single row of it. Repeat that and you end up making more IOs. More IOs = slower performance.

Other databases do the eager approach and won’t let you even start the database before rolling back is successfully complete, using undo logs. Which one is right and which one is wrong? Here is the fun part! Nothing is wrong or right, it’s all decisions that we engineers make. It’s all fundamentals. It’s up to you to understand and pick. Anything can work. You can make anything work if you know what you are dealing with.

Article - Microsoft SQL Server Clustered Index Design

This is one of the most interesting documentation pages I have read and I really thought I would share it with you. You can read the entire design guide with this link and I also included the pdf of the design.  I took a snippet of the two most important parts that I found interesting from this design guide, pay close attention to the images and how clustered index is persisted compared to non-clustered index in the b-tree architecture. Keep in mind that this is Microsoft way of solving things and it’s not necessarily the way to go. Try to challenge how is this done and come up with a different approach.

Clustered Index Architecture

In SQL Server, indexes are organized as B-Trees. Each page in an index B-tree is called an index node. The top node of the B-tree is called the root node. The bottom nodes in the index are called the leaf nodes. Any index levels between the root and the leaf nodes are collectively known as intermediate levels. In a clustered index, the leaf nodes contain the data pages of the underlying table. The root and intermediate level nodes contain index pages holding index rows. Each index row contains a key value and a pointer to either an intermediate level page in the B-tree, or a data row in the leaf level of the index. The pages in each level of the index are linked in a doubly-linked list.

Clustered indexes have one row in sys.partitions, with index_id = 1 for each partition used by the index. By default, a clustered index has a single partition. When a clustered index has multiple partitions, each partition has a B-tree structure that contains the data for that specific partition. For example, if a clustered index has four partitions, there are four B-tree structures; one in each partition.

Depending on the data types in the clustered index, each clustered index structure will have one or more allocation units in which to store and manage the data for a specific partition. At a minimum, each clustered index will have one IN_ROW_DATA allocation unit per partition. The clustered index will also have one LOB_DATA allocation unit per partition if it contains large object (LOB) columns. It will also have one ROW_OVERFLOW_DATA allocation unit per partition if it contains variable length columns that exceed the 8,060 byte row size limit.

The pages in the data chain and the rows in them are ordered on the value of the clustered index key. All inserts are made at the point where the key value in the inserted row fits in the ordering sequence among existing rows.

This illustration shows the structure of a clustered index in a single partition.

Nonclustered Index Architecture

Nonclustered indexes have the same B-tree structure as clustered indexes, except for the following significant differences:

  • The data rows of the underlying table are not sorted and stored in order based on their nonclustered keys.
  • The leaf level of a nonclustered index is made up of index pages instead of data pages.

The row locators in nonclustered index rows are either a pointer to a row or are a clustered index key for a row, as described in the following:

  • If the table is a heap, which means it does not have a clustered index, the row locator is a pointer to the row. The pointer is built from the file identifier (ID), page number, and number of the row on the page. The whole pointer is known as a Row ID (RID).
  • If the table has a clustered index, or the index is on an indexed view, the row locator is the clustered index key for the row.

Nonclustered indexes have one row in sys.partitions with index_id > 1 for each partition used by the index. By default, a nonclustered index has a single partition. When a nonclustered index has multiple partitions, each partition has a B-tree structure that contains the index rows for that specific partition. For example, if a nonclustered index has four partitions, there are four B-tree structures, with one in each partition.

Depending on the data types in the nonclustered index, each nonclustered index structure will have one or more allocation units in which to store and manage the data for a specific partition. At a minimum, each nonclustered index will have one IN_ROW_DATA allocation unit per partition that stores the index B-tree pages. The nonclustered index will also have one LOB_DATA allocation unit per partition if it contains large object (LOB) columns. Additionally, it will have one ROW_OVERFLOW_DATA allocation unit per partition if it contains variable length columns that exceed the 8,060 byte row size limit.

The following illustration shows the structure of a nonclustered index in a single partition.