Review: Hash Tables and Their Limitations

Last class covered hash tables as an important data structure providing O(1) average time complexity for lookups, matching keys to values. The discussion differentiated between static hashing schemes with fixed-size slots and dynamic schemes like extendable hashing, chained hashing, and linear hashing that can grow incrementally over time to accommodate more keys than originally envisioned. The main focus was on handling conflicts when two different keys hash to the same location.

Hash tables are primarily used in most systems for internal data structures like the page table in Project 1 or the page directory, tracking the state of the database system itself while running. We’ll encounter hash tables again when discussing how to do joins efficiently, but for the most part, these are primarily used for internal data structures.

Today’s Agenda

Today’s class focuses on:

B+ trees, which are primarily the default choice when you want to have an index in a relational database system. When you call CREATE INDEX, 99% of the time in most systems you’ll be getting something that looks like a B+ tree. We’ll first cover a high-level overview of what a B+ tree looks like and what makes it different from a regular B-tree (the “plus” in B+ tree).
Some basic design choices for building one.
The different ways to optimize and improve performance in different systems and examine what real systems are actually doing today.

B-Tree Family

What is a B+ tree? The B+ tree belongs to a category of data structures called B-trees, which creates some confusion in database literature and systems. There’s the class of data structures called B-trees, then there’s a specific data structure called a B-tree, and then some database systems are actually using B+ trees, but call themselves a B-tree.

If you examine the PostgreSQL code, they refer to their data structure as a B-tree, but as far as can be determined, it’s a B+ tree with some modern techniques like the B-link tree. When people say “B+ tree,” it typically encompasses a bunch of these different variants.

Historical Context and Terminology

There is no original paper on the B+ tree specifically. The one everyone cites is from 1979 by researchers at IBM discussing what they call “the ubiquitous B-tree.” In that paper, they describe different variants and note that the most common one useful for database systems is the B+ tree. They cite an IBM tech report that hasn’t been readily found describing the original B+ tree.

The original authors of the B+ tree work, Bayer and McCreight, never actually defined what the “B” means in B+ tree. Typically people say it stands for “balanced,” though other interpretations include “broad”, “bushy”, or even named after Bayer himself (B-a-y-e-r). The data structure was actually developed at Boeing (the airplane company), so it could have been “Boeing tree”. Nobody knows for certain, but typically when people say B-tree, they mean balanced.

Another variant is the B-link tree. There’s the classic B+ tree, but nobody implements exactly as it’s defined in the literature. People borrow bits and pieces, and in particular, they borrow ideas from the B-link tree paper that came from CMU in 1981, written by Phil Lehman.

If you look at the PostgreSQL source code in the directory where they discuss their B-tree (notice they say B-tree instead of B+ tree), they explain it’s an NB-tree because it’s a non-balanced B+ tree (we’ll get to that later). Right in the circuit comments, they say “this is a correct implementation” (correct is always important) “of Lehman and Yao’s paper from the B-link tree from 1981.”

We’ll focus primarily on the classic B+ tree structure. The BW-tree from Microsoft is a lock-free version of a B+ tree that was implemented at CMU (it was not easy), and there’s an open-source implementation available.

B+ Tree

The B+ tree is a self-balanced ordered tree that allows searches, sequential access, insertions, and deletions all in logarithmic time ( $O (lo g (n)$ ), because log n represents the height of the tree.

The difference between what we’ll describe here in a B+ tree versus a generic binary search tree is that nodes in our data structure can have more than two keys. We want this because we want to minimize the amount of random I/O while maximizing sequential I/O. The B+ tree is perfect for this because when we land in a node, we’re essentially fetching a page, and we want to pack as many keys as possible inside that page before having to move on to grab the next page from disk.

Thinking back to the 1970s when hardware was significantly worse, with minimal RAM and extremely slow disks, a B+ tree allowed converting lookups from random I/O into sequential access. Once you follow the tree down to the leaf nodes, you never go back up (not entirely true, but for our purposes now we’ll assume that’s the case). Then you can scan along the leaf nodes to find the data you’re looking for.

B+ Tree Properties

More formally, a B+ tree is an n-way search tree with the following properties:

Perfect balance: Every leaf node in the tree structure has the same depth, meaning the same number of levels down from the root to that leaf node. PostgreSQL violates this slightly, as do some other systems, but initially, we assume perfect balance.
Half-full requirement: Every node other than the root must be at least half full. If you can have M keys in a node, you need to have at least half that number of keys up to the maximum number. If you go below that threshold (less than half full), you have to do some merging. This requirement can be tweaked later on.
Root exception: The root is special and can be ignored for the half-full requirement for now.
Child pointers: Every inner node with at least K keys will have at least K+1 non-null children, meaning you could have some locations with pointers to leaf nodes or nodes below, but you don’t have to have the maximum number of possible pointers.

B+ Tree Example

Let’s examine a really simple B+ tree with three levels:

We define the root node at the top, inner nodes in the middle, and leaf nodes at the bottom. Within a node itself, we have an alternating pattern between a pointer to another node, then a key, then in the leaf nodes there’s the value we’re trying to store for a given key:

For now, we’re not defining what the value is, but you can think of it potentially as the Record ID pointing to the actual tuple (page number and offset), or in the case of MySQL or SQLite, it could actually be the tuple itself, though for now we can ignore that distinction.

Think of these numbers in the inner nodes and root nodes as essentially guideposts that tell you which path to go down:

At the root node, we only have one key (20), so

going left means any key value less than 20
going right means greater than or equal to 20

The same principle applies at the next level—for the key 10, less than 10 goes one direction, greater than or equal to 10 goes the other.

A critical feature borrowed from papers like the B-link tree is sibling pointers at every level.

While textbooks might show them only at leaf nodes, PostgreSQL and the original paper include them in inner nodes as well. These pointers prove essential: when searching for all keys greater than or equal to six, you traverse down one side of the tree to reach the bottom, then scan along the leaf nodes without returning upward. This avoids the expense of acquiring parent locks or latches during scans. If pages are stored sequentially or contiguously on disk, this becomes pure sequential I/O rather than costly random access.

Sibling pointers aid in splits and merges. When deleting a key (say, 10), instead of reorganizing the entire tree, you can follow the sibling pointer to a neighboring node and potentially steal one of its keys, maintaining balance by adjusting only the parent node’s guideposts rather than restructuring everything.

Importantly, pointers only go downward and sideways, never upward. This design prevents deadlocks that would occur if one thread traversed top-down while another went bottom-up. While sibling pointers can still create deadlock scenarios (covered in the next class), avoiding bidirectional pointers between levels eliminates one source of concern. The split and merge algorithms don’t require rotations like AVL trees, simplifying implementation.

Node Layout

Nodes are essentially arrays of key-value pairs. Keys derive from whatever attributes the index is based on. If you build an index on table columns A, B, and C, the index contains copies of values for every tuple in those columns. The index functions as a sorted replica of the table, organized to enable efficient logarithmic lookups. In the relational model, tables are potentially unsorted, so the index provides fast sorted access. The database system must keep the index synchronized with the table—when inserting or updating tuples in the table, indexes update automatically to maintain consistency.

Values differ depending on node type. In inner nodes, values are pointers to pages below. In leaf nodes, values are either pointers to tuples (page ID and offset, or record ID) or the actual tuple data itself. Arrays within nodes are typically kept sorted, though this isn’t strictly required.

Handling null keys presents an important design decision. For non-unique indexes that allow nulls, these values must go somewhere in the tree. Typically, nulls are placed either all at the beginning or all at the end. Some systems let you specify this placement when creating indexes, since query patterns may benefit from seeing nulls first or last depending on the application.

Another crucial aspect: there are only sibling pointers and downward pointers, never upward pointers. This prevents deadlocks when taking latches on nodes. Without upward pointers, you eliminate the risk of one thread going top-down while another goes bottom-up. Sibling pointers still require careful handling, but avoiding bidirectional pointers between levels removes one complexity. The split and merge algorithms don’t need AVL-style rotations.

B+ Tree Leaf Nodes

Within a page representing a node, you store the key-value array plus sibling pointers (just page IDs to the previous and next nodes at the same level):

Key-value pairs can be stored in two main ways:

Interleaved Storage: Keys and values alternate one after another. For inner nodes, values are pointers (record IDs). This is conceptually simple.
Separated Storage: This is the most common approach. Keys are sorted in one array, values sorted separately in another array. The offset in the key array corresponds to the same offset in the value array. This resembles column store organization, allowing simple arithmetic to jump around. Additional metadata tracks the number of available slots, the node’s level in the tree, and other information useful for traversal and recovery.

Leaf node values can be either record IDs (page number and offset pointing to tuple locations) or tuple data itself in the case of index-organized storage. Systems like SQLite and MySQL use this by default. In SQL Server and Oracle, you can specify CREATE TABLE with index organization, which creates a B+ tree where leaf nodes contain the actual tuples. This approach should only be used for the primary key index—otherwise you duplicate data unnecessarily.

An important distinction: only leaf nodes contain keys that actually exist in your table. When you delete a key from the leaf nodes, it may still exist in inner nodes as a guidepost. For example, if key 35 appears in an inner node but not in the leaf nodes, this indicates that 35 was once inserted and then deleted. The rebalancing algorithm didn’t remove it from the inner node, where it continues to serve as a traffic sign guiding traversal downward. You cannot have record IDs in inner nodes because those records may not exist—inner nodes contain only navigational keys.

B-trees versus B+ Trees

The original B-tree from 1972 stored all keys and values throughout the tree, similar to an AVL tree. This approach is more space-efficient because keys only appear once anywhere in the tree—you never have keys that don’t correspond to actual data, and you avoid multiple copies of the same key along the path from inner nodes to leaf nodes.

However, B-trees have a critical disadvantage: values (the record IDs pointing to actual tuples) can be anywhere in the tree. To scan all keys sequentially in order, you must traverse up and down, essentially performing a breadth-first search. When considering latching, this means latching the entire tree as you move up and down. In a B+ tree, since only leaf nodes contain actual values—the exact copy of what’s in the table—once you reach the leaf level, you don’t need to maintain latches on upper tree portions. You can scan along the leaves while other threads operate freely on upper levels, as long as they don’t affect your current operations.

The B+ tree advantage over B-trees: better concurrent access and improved sequential I/O over random I/O.

By keeping all actual data in leaf nodes and using inner nodes purely for navigation, B+ trees enable efficient parallel operations and optimize disk access patterns.

Basic Operations

Insert

The goal when inserting is to find the correct leaf node. Traverse downward following the guideposts to reach the leaf node where your key belongs. If the node has sufficient space, insert the key in sorted order and you’re done.

If there’s insufficient space (the node is full), split the leaf node into two nodes. Divide the keys in half: half go to one side, half to the other. Copy the middle key upward to the parent, creating a new guidepost and a new pointer to the newly created leaf node. This process is recursive—if promoting the middle key causes the parent to become full, you must split the parent as well. This can cascade all the way to the root.

Rather than showing static diagrams, consider an interactive visualization. Using a B+ tree of degree 2 (maximum two keys per node, maximum three pointers), observe the insertion process:

Inserting 2 creates the root node. Inserting 6 adds to the root (which can hold two keys). Inserting 4 attempts to place three keys in the root, which triggers a split. The middle key (4) becomes the new root separator. Keys less than 4 (which is 2) go to the left leaf node, while keys greater than or equal to 4 (which are 4 and 6) go to the right leaf node. Sibling pointers connect the leaf nodes in one direction (some systems use bidirectional pointers).
Inserting 5 requires going right from the root (5 ≥ 4), but the right leaf node already contains two keys. This triggers another split, promoting 5 to the parent as a new separator.

Delete

Deletion reverses the insert process. Start at the root, traverse down until you find the leaf node containing the entry to remove. If the entry doesn’t exist, do nothing—you cannot delete a nonexistent key. If found, remove it.

If the leaf node after deletion remains at least half full, you’re done. If the node drops below the threshold (fewer than ⌈M/2⌉ - 1 keys, where M is the maximum number of keys per node), first attempt redistribution: follow the sibling pointers to find another node at the same level and steal one of its keys, provided the sibling won’t become unbalanced. You may need to adjust the guidepost in the parent node, but since you already hold the latch for it, this isn’t expensive.

If redistribution isn’t possible (stealing a key would cause the sibling to drop below half full), perform a merge: combine the current node with a sibling, putting all keys together and updating the parent accordingly. This is also recursive—if merging removes a guidepost key from the parent and causes the parent to drop below half full, the merge cascades upward.

Using the visualization, deleting 6 from a node leaves it properly balanced. Deleting 4 causes a redistribution or merge depending on the specific state. The animation demonstrates how nodes steal keys from siblings when possible, updating parent guideposts as needed. In some configurations, when a node becomes underfull, it may steal from a sibling to rebalance, adjusting the boundary points in the parent rather than performing a full merge.

Regarding sibling pointers: leaf nodes definitely need pointers to siblings for scanning along the bottom level. Inner nodes at the same level with the same parent benefit from sibling pointers for merge operations. However, inner nodes with different parents may not need cross-pointers because reorganization at that level involves merging up to the common ancestor. At that point, you’re latching larger portions of the tree anyway.

When you don’t store parent pointers, how do you propagate data back to the parent during splits? As you traverse downward, you maintain a stack of visited nodes and track which ones you hold latches for. When you reach a node requiring a split, you recognize this before releasing the parent’s latch. The internal bookkeeping of the worker thread maintains pointers on the stack to navigate back up. This will be covered in detail in the next class on safe versus unsafe traversals—as you descend, you determine whether operations below could cause reorganization. If a node is “safe” (no split or merge will occur regardless of what happens below), you can release its latch when moving past it.

Range and Partial Key Searches

B+ trees support operations impossible with hash tables. Hash tables only handle equality predicates (key = value)—you cannot perform less-than, greater-than, or partial key lookups. With a hash table, you need the entire key: if you build an index on columns A, B, and C but only have values for A and B, you cannot hash that partial key and find anything meaningful because the hash is completely random.

B+ trees enable several search patterns:

Full Key Match: With an index on A, B, C, you can search A = 1 AND B = 2 AND C = 3. This works like a hash table—equality matching on all key components.
Prefix Search: You can search using only A = 1 AND B = 2 without C. The lookup examines the parts of the key you have. Check if 1 ≤ 1 in the first component, then follow downward and scan along the leaf nodes, evaluating your predicate against all keys until you encounter a key where the first component exceeds your value. Since keys are sorted first on the leading component, you know nothing beyond that point can match.
Suffix Search (Skip Scans): Some advanced systems support searching with B = 2 AND C = 3 without A. This is difficult to implement. Very few systems support this—PostgreSQL doesn’t, but Oracle and possibly SQL Server do. Oracle calls these “skip scans.” The implementation requires potentially multiple threads traversing different parts of the tree in parallel, then combining results. Since you lack the first part of the key, you must examine every branch, essentially performing a wildcard search.

Full Key Match Example:

Prefix Search Example:

Suffix Search Example:

The database optimizer evaluates whether using the index is worthwhile. Based on statistics about key distributions, it may determine that an index lookup is valuable. Alternatively, if the query pattern makes the index unhelpful, the optimizer may choose a sequential scan of the entire table, which could be faster than multiple index probes combined with result merging.

Handling Duplicate Keys

Database systems must handle duplicate keys in indexes, particularly for non-unique indexes. Two main approaches exist:

Approach 1: Append Record ID

The most common approach: maintain a hidden column or attribute in the key containing the record ID of the tuple it points to. This guarantees every key is unique. If you have a key value 6 and I have a key value 6, but we point to separate tuples with different record IDs, appending the record ID (page number and offset) to the key makes “your 6” and “my 6” unique. Since B+ trees support prefix searches (you don’t need all key components to perform lookups), this scheme works seamlessly. When searching for 6, you search on just the key value portion, and the tree returns all matching entries regardless of their record IDs.

For example, consider a key of 6 (the system internally converts Insert 6 to Insert <6, (Page,Slot)>). The actual stored key is 6, <page_1, slot_1>. If you insert another 6, it becomes 6, <page_2, slot_3>:

Superficially, it looks like it’s duplicate keys, but the they’re physically unique.

When deleting, the system internally converts delete 6 to delete 6, <specific_record_id>, ensuring precise deletion of the correct tuple.

Approach 2: Overflow Leaf Nodes

The alternative is using overflow leaf nodes. When inserting a duplicate key and the leaf node is full, instead of splitting, you create an overflow page and build a linked list extending downward.

This resembles chained hash tables, where instead of a hash function directing you to the start of a linked list, the tree structure leads you there.

This approach has a critical flaw: it violates the logarithmic guarantee. The tree depth is no longer bounded by log n. You must handle splits and merges in this overflow chain, which complicates the implementation significantly. Additionally, moving keys during reorganization becomes more complex.

This approach is generally inferior and rarely used in real systems. While you save space by not storing the record ID as part of every key, the implementation complexity and loss of logarithmic guarantees outweigh any benefits.

Clustered Indexes

Some database systems support clustered indexes, which allow the actual table tuples (even though the relational model considers tables unsorted) to be physically sorted on disk according to the sort order defined by an index. In systems supporting index-organized tables (like MySQL’s InnoDB and SQLite), where leaf nodes store actual tuple data, the table is automatically clustered. However, in systems where indexes are separate from heap files, you can still enforce sort order through clustered indexes.

The advantage: when scanning along leaf nodes to find tuples, you’re guaranteed to retrieve pages in sorted order as defined by the key order. Scanning across the bottom level retrieves pages sequentially, maximizing sequential I/O and minimizing random access.

Without a clustered index, leaf nodes may be stored sequentially on disk, but when looking up the actual data they point to, you encounter random I/O. In the worst case, with one free frame in your buffer pool, you might repeatedly fetch and evict the same page multiple times if you process tuples in index order rather than page order.

A simple optimization: don’t retrieve tuples as you scan leaf nodes. First, scan the leaf nodes to collect all record IDs, then sort them by page ID, and finally retrieve pages in sorted page order. You still need bookkeeping to ensure you process tuples in the order defined by the index if that matters for your query, but this approach achieves more sequential I/O and reduces redundant random access.

PostgreSQL’s CLUSTER command sorts the table according to an index but doesn’t maintain the sort order. As you modify the table (inserts, updates, deletes), the physical order gradually becomes unsorted. True clustered indexes require index-organized storage or more sophisticated maintenance mechanisms.

Design Choices

Node Size

In diagrams, assume one node corresponds to one page in database files and the buffer pool. However, some systems (like IBM DB2) allow configuring page size differently for different tables and indexes. Depending on your hardware, you may want different node sizes:

Slow spinning disks: Larger pages (e.g., 1 MB) maximize sequential I/O. The number of keys fitting in a 1 MB page depends on key size—8-byte integers allow storing many keys, probably more than actually possible.

Fast SSDs: Roughly 8-10 KB is considered optimal, balancing I/O efficiency with memory usage.

In-memory systems: 512 bytes is considered optimal, as this fits within a cache line for maximum efficiency.

Merge Threshold

Some systems violate the “at least half full” requirement. You can have more keys than the maximum (obviously impossible—you run out of space), but you can temporarily drop below the half-full threshold. This allows you to avoid premature merging in case an insertion soon brings you back above the threshold. PostgreSQL calls their implementation a “non-balanced B+ tree” because they permit this temporary violation.

Variable Length Keys

Several approaches handle variable-length keys:

Store Pointers to Keys: Instead of storing keys directly, store pointers (record IDs) to keys. Since record IDs are always 32 or 64 bits, everything becomes fixed-length. The B+ tree stores pointers rather than copying keys. This is a bad idea for disk-based systems. As you traverse nodes determining whether to go left or right, you don’t have the guideposts in your node—you must follow pointers to retrieve tuples from pages, then perform comparisons. While holding latches on your data structure, this causes significant non-sequential I/O overhead. In-memory databases in the 1980s used this approach (called T-trees, where diagrams show nodes looking like T’s, though the name may come from the inventor’s name). Modern systems avoid this because the lookup expense outweighs the space savings from not duplicating keys.
Variable-length Nodes: Node size varies within the index to accommodate different key sizes. You maintain the same potential number of keys per node, but physical node size changes. This is rare—as far as is known, only academic systems implement this.
Padding: Pad all keys to the maximum length, similar to column store approaches. This is also rare.
Slotted Page Approach: The most common approach uses an array of pointers (offsets within the page) to keys, similar to slotted pages for table storage. This efficiently handles variable-length keys without excessive overhead. Overflow pages handle keys too large for a single node, similar to overflow values in table pages.

Finding Keys Within Nodes

Once you land on a node and bring it into memory, you must find the specific key to determine direction or match. Several approaches exist:

Linear Scan: The simplest approach—scan the array from beginning to end regardless of sort order. It’s simple, works, but not optimal. You can use SIMD (Single Instruction Multiple Data) to vectorize comparisons: Modern CPUs provide vector registers supporting operations on multiple values simultaneously. Instead of checking each key individually, store multiple copies of your search key in a SIMD register (e.g., four copies of 8 in a 128-bit register for 32-bit integers), then perform a single instruction comparing your search key against four keys in the array simultaneously. The result is a bit mask indicating matches (1) or non-matches (0). This is more efficient than sequential checking, still linear but batched. Hardware accelerates this significantly. This technique is used in vectorized execution, which made Snowflake distinctive ten years ago.
Binary Search: Most common approach. Assuming sorted keys, jump to the middle, compare, then jump to the appropriate half repeatedly until finding the match. This is what most systems use.
Interpolation Search: If you know there are no gaps and keys are monotonically increasing (like an auto-increment primary key with no deletions), use simple arithmetic to calculate exactly where in the array your key should be based on minimum value, maximum value, and number of keys. This is the fastest approach—faster than binary search and SIMD—but requires no gaps, making it rare.

Optimizations

Compression

Keys in a B+ tree come from the same value domain (same attributes) and are sorted, making them excellent candidates for compression.

Prefix Compression: Keys close together in lexicographical ordering share overlapping prefixes. Instead of storing complete key copies, store the common prefix once (e.g., “Rob”) then store only the unique suffix for each key.

Deduplication: When non-unique keys repeat in the same node, store the duplicate key once followed by a posting list of all values corresponding to that key. The system knows to extract the record ID appended for uniqueness, so you only store one copy of the key portion. PostgreSQL added this in version 15, yielding significant node size reductions.

Suffix Truncation: Inner nodes don’t need exact key copies—keys in inner nodes might not even exist in leaf nodes. Store only the minimum prefix needed to discriminate between left and right branches. For example, to distinguish between “ABC…K” and “LMNO…V”, you only need the first three characters of each string. If you later insert a key between them, you might need to retrieve the original keys to determine the proper prefix, but in many environments this optimization is worthwhile.

Pointer Swizzling (TODO: get pictures from slides)

Pointer swizzling minimizes buffer pool page table lookups. When traversing the tree, what we call “pointers” are actually page IDs. Each time you follow a pointer, you must query the page table: “If this page exists, give me the pointer to it.”

The technique: If you pin a page in the buffer pool (marking it non-evictable), any page pointing to that pinned page can replace the page ID with the actual memory pointer. Now when traversing the tree, you don’t query the buffer pool—you have the exact memory location. The root node is always accessed, so swizzling its children eliminates page table lookups when descending from the root.

Critical safety requirements: Never store swizzled pointers on disk. When a page is flushed, undo the swizzling to ensure the page on disk contains page IDs, not stale memory pointers that would be invalid when the page loads back in. Never let a page with swizzled pointers be evicted—if the target page gets evicted and another page loads into that frame, the swizzled pointer would create a segmentation fault by interpreting wrong bits.

The B+ tree hierarchy naturally supports this: if you swizzle anything below a node, ensure that node isn’t unpinned before its children are unpinned. This keeps pointers valid. You bypass the buffer pool’s page table lookups, LRU updates, and eviction logic, going directly through memory pointers. You lose metadata about access patterns, but if a page is important enough to pin and swizzle, it should stay in memory anyway.

Bulk Loading (TODO: get pictures from slides)

For fast insertions, the most common trick is presorting. Sort all keys, lay them out as leaf nodes with sibling pointers, then build the tree structure from bottom to top. This differs from inserting keys one by one, which starts from the top, traverses down, and repeatedly splits. Presorting skips all splits and builds the scaffolding on top of pre-arranged leaves. This technique is very common and highly efficient.

Write-Optimized B+ Trees (B-Epsilon Trees)

A significant challenge with B+ trees is maintaining balance. While lookups are fast (logarithmic height, sequential I/O along leaves), updates are expensive because every insertion or deletion must maintain balance. A thread that inserts or deletes might draw the short straw and be responsible for reorganizing the entire data structure.

The ideal approach: delay updates to the data structure, accumulate them, then apply changes in batches. This amortizes the reorganization cost across many operations. Individual writes become faster because they don’t immediately trigger splits.

Write-optimized B+ trees (also called B-epsilon trees, sometimes branded as “fractal trees” by Tokutek) implement this idea. At every root node and inner node, add a modification log, similar to the mod log in column store pages. When a new update arrives, instead of propagating changes down to leaf nodes (violating the property that leaf nodes contain actual values), insert the entry into the mod log.

To insert 7, put it in the root’s mod log.

To delete 10, put that in the mod log rather than traversing down, deleting, and potentially merging.

When a query searches for 10, as you traverse down, check the mod log at each level. If you find a delete entry for 10 in the mod log, you’re done—no need to continue to the leaf nodes.

When buffers fill up, cascade changes downward in batches. This approach doesn’t apply modifications to the tree structure until reaching leaf nodes, doing everything incrementally in batches.

The tradeoff: Reads can become slow because as you scan, you must sequentially search mod logs to check if your key exists. These implementations often use Bloom filters in front of mod logs: if the Bloom filter indicates the key isn’t in the mod log, skip the search. Bloom filters are cheap to maintain and small in size.

This concept dates to 2003—old compared to the 1972 B+ tree, though perhaps not ancient. This idea resembles log-structured storage, where you append log entries and batch them for later application. Tokutek rebranded B-epsilon trees as “fractal trees” and created a storage engine for MySQL that was bought by Percona but deprecated recently. SplinterDB (from VMware, developed by a researcher at CMU Keller working on his MBA, now a professor at Cornell) is a key-value store implementing a highly optimized version. Relational AI uses B-epsilon trees for fast updates. This isn’t common, but expect to see more in the future. RocksDB doesn’t need this because it’s already a log-structured merge tree, inherently providing similar properties.

PostgreSQL Demonstration

To demonstrate differences between hash indexes and B+ trees, consider a dataset of 27 million email addresses. Two indexes exist: one B+ tree on the email column and one hash table. By default, PostgreSQL creates B+ trees unless you specify otherwise with the USING clause:

CREATE INDEX idx_emails_tree ON emails USING BTREE (email);
CREATE INDEX idx_emails_hash ON emails USING HASH (email);

Disable various PostgreSQL optimizations to clearly observe index choices:

SET jit=off; SET max_parallel_workers_per_gather=0;

Running EXPLAIN before a query shows the query plan without executing it.

Query 1: SELECT MIN(email) FROM emails:

PostgreSQL performs an index-only scan using the B+ tree. An index-only scan (sometimes called a covering scan or covering index) means all columns needed to answer the query exist in the index. Even though leaf nodes store record IDs, you don’t need to follow those record IDs to retrieve actual tuple data—everything required is in the index itself. Since the query asks for MIN(email) and the index is on email, scanning to the leftmost entry in the B+ tree provides the answer without accessing the table.

Query 2: SELECT email FROM emails WHERE email LIKE 'a%' LIMIT 1:

PostgreSQL chooses a sequential scan despite the B+ tree index. The system recognizes that searching for all emails starting with ‘a’ (an unbounded range—no specified endpoint) would require scanning a large portion of the tree. The optimizer isn’t smart enough to recognize the LIMIT 1, which should prompt probing down the index to find the first match and immediately returning. PostgreSQL cannot use the hash index because wildcard searches don’t work with hashing—you need the full key.

Query 3: SELECT * FROM emails WHERE email > 'andy@'

PostgreSQL still chooses sequential scan. With an unbounded range starting at ‘andy’ (which is relatively early alphabetically), the optimizer determines a full table scan is more efficient than using the index.

Query 4: SELECT * FROM emails WHERE email LIKE 'zzz%'

Now PostgreSQL uses the index scan. Since ‘zzz’ is far along the tree (near the rightmost end), the optimizer determines that based on data distribution, scanning from that point is more efficient than a full table scan. PostgreSQL never chooses the hash index for these queries because they involve comparisons or wildcards rather than exact equality.

Query 5: SELECT * FROM emails WHERE email IN ('email1', 'email2', 'email3')

PostgreSQL now uses the hash index with a bitmap index scan. Note: this isn’t a true bitmap index—PostgreSQL maintains a bitmap of matching tuple IDs (or page offsets). For each email in the IN clause, PostgreSQL performs a separate probe into the hash index. The results are combined using bitwise OR operations on the bitmaps, producing the exact tuple IDs needed. This approach determines which pages are required from the index first, then retrieves them, avoiding redundant I/O. This is similar to the earlier discussion about sorting page IDs before retrieval to maximize sequential I/O.

Clustered Tables in PostgreSQL

PostgreSQL’s CLUSTER command sorts the table according to an index but doesn’t maintain the sort order as you start modifying the table:

CLUSTER emails_clustered USING idx_emails_clustered_tree;

This command takes about a minute to run. After clustering, examining the first page shows tuples sorted by email in the initial offsets:

The CTID (record ID in PostgreSQL) reflects this ordering:

However, if you delete the first entry, then reinsert it, PostgreSQL doesn’t maintain the sort order—the reinserted tuple doesn’t return to the first page. Instead, it lands in some random page (e.g., page 299). PostgreSQL cannot maintain true sort order because it doesn’t have true clustered indexes—tables aren’t index-organized:

Summary

B+ trees are the most important data structure for database indexes and likely the best choice for your database system. Tries are also useful, and you can even have B+ trees of tries. Numerous optimization techniques can make B+ trees faster. The next class covers making B+ trees thread-safe—traversing safely while performing splits and merges with concurrent access.

Quartz 4

Explorer

08. B+Tree Indexes