Resources:


We started at the bottom, discussing disk management. Then, we moved up and covered the buffer manager, focusing on how we keep track of the pages we bring from disk into memory.

Now, it’s time to start doing something with those pages. We know how they are organized and how to store data in them. Now, we need to build the execution engine to interpret what those pages store, run queries, and handle data as expected in a database system. For the next couple of weeks, we will focus on what we call access methods. These are the internal components of the system that allow us to access data or specific pages we store and derive meaning from them.

First, we will discuss the data structures used in disk pages when bringing data from disk into memory:

  • Today’s class will cover Hash Tables, which are unordered data structures used throughout the system for various purposes.
  • In Monday’s class, we will discuss trees, particularly the B+ tree, which preserves the ordering of values or keys. This is crucial for ranking and sorting operations.

So, we are now in the middle part of the system and are beginning to build more complex structures on the basic primitives covered in previous lectures. First, we will discuss the background of hash tables, why they matter, and briefly cover hash functions. While this is not a cryptography or low-level algorithms class, we will select an appropriate hash function and learn how to evaluate its performance.

Next, we will explore two types of hash tables:

  • Static hashing, where the data structure has a fixed number of key-value pairs.
  • Dynamic hashing, where the hash table can grow over time. Most often, it grows up, not down, but we will also discuss shrinking methods. The goal is to incrementally grow the data structure to accommodate more data.

Data Structures

We need data structures for almost every part of the system. These are four main uses of data structures:

  • Internal metadata, such as a page directory, that is more or less a hash table being used to map page IDs to some location on disk or some location in memory.
  • Index-organized tables. We could use these data structures for the actual storage of the tables themselves. Remember we discussed index-organized tables where the actual tuples themselves would be in the leaf nodes of the B+ tree, so you could have your tables actually be represented directly in a data structure rather than unordered heap files.
  • Temporary data structures. We could use these data structures for query execution to generate ephemeral or temporary collections of data that allow us to execute queries more efficiently. This is basically how we’re going to implement hash joins very fast—how to implement joins very quickly using hash joins. We’ll build a hash table on the fly, populate it with data from the tables we’re scanning, do the join, and then throw the hash table away. Just because we’re building a hash table doesn’t mean it’s going to stick around for a long time.
  • Table indexes. Probably the one you’re most familiar with is using these data structures for table indexes. When you call CREATE INDEX, that’s essentially going to create one of these data structures and populate it with the keys and map them to the tuples, so you can do faster lookups like a glossary in a textbook.

There are two main design decisions to consider when you want to implement data structures for the DBMS:

  1. Data Organization: how we layout data structure in memory/pages and what information to store to support efficient access. Of course we want to avoid NP complete problems.
  2. Concurrency: we need to think how to allow multiple threads or workers or processes in our DBMS to access the data structure at the same time without causing problems. We’ll cover this in the next class, but basically there’s two types of problems:
    • Physical Layout Problems (or Physical Integrity Problems), where two threads are trying to write to the same thing at the same time and they collaborate each other,
    • Logical Problems, where two threads try to insert the same key at the same time. What should be happen? Should I have duplicate keys, or should one of them fail, or what?

Hash Tables

Today’s class, we’re focusing on hash tables. Again, this is because it’s a low-level building block that we can reuse throughout the rest of the system.

A hash table is just an associative array that maps keys to values.

The way a hash table works is that there’s a mapping from keys to values, and we use a hash function that allows us to compute some offset within an array. Essentially, this reduces an arbitrary key to an integer domain, which lets us jump to a specific location in the hash table to find what we’re looking for.

The hash function must be able to take any possible key—whether it’s a column type in a database system or internal metadata—and reduce it to an integer.

  • Space Complexity: Roughly O(N) because we need to store a slot for every possible key.
  • Time Complexity:
    • On average, O(1) lookups—we hash a key, jump to a location, and ideally find what we’re looking for immediately.
    • Worst Case: O(N) due to collisions—if multiple keys hash to the same location, we may have to scan through the hash table to find the correct entry. If the table is full, we may even need to wrap around to continue the search until above the position where we landed thanks to the hash function (basically it’s an entire loop of the table).

To mitigate collisions, we typically size the hash table to be roughly 2N (twice the expected number of keys). A common question is: How do we determine N? The database system attempts to predict how many keys will be present and sizes the table accordingly.

O(1) complexity sounds great in theory, but in real systems, the constants matter. Example: One hash function may take 10ms to compute, while another takes 1ms—the latter is obviously preferable at large scale (e.g., billions of keys). So, even though the algorithmic complexity is O(1) on average, implementation efficiency is critical to achieving optimal performance.

Static Hash Table

Let’s look at a simple hash table implementation and the issues that arise before moving to real-world database systems.

The simplest hash table is a static hash table, where we allocate a giant array using malloc, with one slot per key. To find an entry:

  • Take the modulus of the key by the number of elements in the array.
  • Jump to the corresponding offset.
  • This offset points to a location storing the key-value pair.

Since collisions can occur, we need to store the original keys to verify correctness. The value can be a pointer to a tuple, a record ID, or additional data.

Problems with This Approach

  1. Resizing Issues: If we have N+1 keys, the hash table must be resized and rehashed, which is costly.
  2. Handling Collisions: Two keys may hash to the same location, but the table does not account for this.
  3. Duplicate Keys: Some database use cases require duplicate keys with different values, but this model assumes unique keys.

Real-World Considerations

  • In a database system, we may not know all keys ahead of time (If managing a buffer pool, the number of slots is fixed, making static hashing feasible).
  • If building a hash table index, keys increase dynamically, requiring adaptive resizing.
  • We assume a perfect hash function (one that prevents collisions where the following property is always valid: if key1≠key2, then hash(key1)≠hash(key2)), but this is unrealistic. No real-world system can achieve this since we cannot predict the key domain in advance.

Instead, real database systems use more sophisticated hashing techniques that handle these challenges efficiently.

Hash Table

All right, so there are two decisions we have to make when building a hash table. When someone says they want a hash table, it’s sort of two parts.

  1. Design Decision #1: Hash Function.
    • There’s the hash function itself, which maps a large key space down to a finite, smaller domain, based on the number of slots in my array.
    • Another big trade-off is between how fast we want our hash function to be and how likely it is that two distinct keys will collide. What’s the fastest hash function I could build? A function that return. It’s the worst hash function in terms of collision because everything maps to one, but it’ll be fast. This is the trade-off: figuring out the balance. The perfect hash function would have zero collisions, but would be super slow because you’d need extra lookups. So, we want something in the middle: fast and with a low collision rate.
  2. The hashing scheme will be the mechanism we use to handle collisions after we’ve done our hashing. The trade-off here is classic: storage versus compute. I could allocate a massive two-terabyte hash table and avoid collisions if my key set is super small. But if I allocate a huge hash table, I’d use more memory. Alternatively, I could have a smaller table, but with more collisions, so I’d have to spend more compute to handle those collisions. It’s all about finding the right balance—not over-allocating but also not wasting instructions to deal with collisions.

Hash Functions

Today’s talk will cover:

  • hash functions, just to show you what the state of the art is. I’m not going to explain how they work, just that they exist. We’re database people, not in the business of writing hash functions, so we’ll rely on others to do that for us.
  • We’ll also discuss classic static hashing schemes, where you know the number of keys ahead of time.
  • Dynamic hashing schemes, where the hash table can actually grow and shrink based on the number of keys.

We’re not in the business of writing hash functions. Smarter people in the space have done that for us, so we rely on them. The basic idea of a hash function is that we have some input key, any arbitrary number of bytes of any type, and need to return an integer that represents that key. Typically, it’s 64 bits, but there are also 32-bit hash functions and 120-bit hash functions (though databases usually don’t use the latter).

In this context, we don’t care about protection or privacy mechanisms for hash functions. We’re not using cryptographic hash functions like SHA-256 because we’re running on the inside of the system. No one on the outside can see the data structure, so we don’t care about that. As a result, we can run faster. For example, SHA-256 is really slow compared to something like MurmurHash or XXHash. As we said before, we want something that’s fast and has a low collision rate.

Some systems, like PostgreSQL, roll their own hash function, but many modern systems use something off-the-shelf like XXHash, MurmurHash, or SpookyHash. The main takeaway here is that XXHash (version 3) from Facebook has the fastest performance and the lowest collision rate and it’s considered the State-of-the-art.

Some systems use CRC32 or CRC64 for hashing integers because there are CPU instructions for that on x86, which makes it fast. However, for random strings, XXHash is typically preferred. Remember Hash was written by a random person on the internet; Google took it and made CityHash, then created a newer version called FarmHash with even better collision rates.

There are many different hash functions out there, but XXHash (version 3) is a good default choice. There are repositories on GitHub where people benchmark hash functions and test collision rates and performance. XXHash3 is typically the best choice.

Static Hashing Schemes

Now, let’s talk about what a hash table looks like and how to handle collisions. For this lecture, I’ll focus on the two most common schemes:

  1. Linear probing, which is the simplest and fastest hashing technique
  2. Cuckoo hashing, a variant that uses multiple hash functions.

Other techniques, like Robin Hood hashing, Hopscotch hashing, and Swiss tables (from Google), exist but are not covered in this class. In the advanced class, we’ll cover those. Current research shows that Linear Probing and Swiss tables are the fastest. The fancier versions try to improve performance by avoiding long searches for keys, but the overhead of moving things around in memory can actually hurt performance. It’s better to stick with the simpler approaches, like linear probing.

Linear Probe Hashing

Linear Probe Hashing is really simple. It’s a giant array of slots and we hash into it. Here’s a the description of its functioning: if we want to insert an element, we hash into the table and if the slot is free, we insert the key. If the slot is not free, we look at the next slot and insert there, if possible, or we keep looking until we find a free slot, potentially wrapping around. If we loop back around and realize we’re at the slot where we started, we know the hash table is full and need to double its size and rehash everything.

The data implementation for this is from Google, with their flat_hash_map data structure. It’s well-documented, and they describe how it works and some optimizations.

This method is sometimes called open addressing in hashing because there’s no guarantee that a given key will always be at the same address or slot. Depending on what was inserted before, the key may be moved around. If you get a dictionary in Python, this is essentially what you’re getting as well.

Now, let’s see how it works. If we want to insert key A, we hash it and mod it by the number of slots. If it lands on an empty slot, we insert our key along with its value:

The reason we need the key is that during a lookup looking for A, we will hash to the same location, but we must also perform an equality check to see if the key at that slot matches the one we’re looking for.

If we want to hash key B, we do the same: hash it, mod by the number of slots, and land on an available slot. Now, if we want to insert key C, we hash it, and it lands in the same location where A is. Since that slot is occupied, we move to the next one and insert the key there:

The same process applies to D, E and F:

Say if F lands in an occupied space, it wraps around and starts at the top. Think of it as a giant circular buffer—pretty simple.

What are some potential problems with this? If we delete a key, we may lose the chain. For example, if you want to delete C:

  • you hash C and you land where A is (because as we saw before, hash(A) and hash(C) land at the same location)

  • then the equality checks fails of course (because A≠C)

  • then keep going until we find C:

  • finally we delete it:

The problem is en empty space is created. So if I try to do a lookup on D (remember that hash(D)lands at the position where C was), this is going to hash to this empty space and you you might think the key D isn’t there:

A potential solution could be to rehash everything, but it’s a bad idea because there could be potential billions of rehashing operations.

The correct solution is Tombstones—a marker that indicates the slot once held a key that has been deleted:

When doing a lookup (to get D for example), if a slot is marked as a tombstone, we know something was there, but was deleted. The lookup continues until we find the desired key:

With tombstone marker we can reuse the slot for new keys. For instance, when inserting G, we could place it in the slot where C was, as long as it’s marked as a tombstone:

We can reuse these tombstone-marked slots for new keys without breaking the hash table’s flow. Eventually, if we accumulate many tombstones, it might be worth performing garbage collection because they take up space. But for now, we can ignore that.

There’s also a challenge in how to represent tombstones, empty slots, and null keys, especially in a database system. One way is to use slotted pages, where we keep track of the state of each slot at the top of every page. For simplicity, let’s say each page holds two slots. In the page header, we could track which slots are empty, which are null, and which are marked with tombstones.

This requires some additional metadata to track the state of each slot. However, you don’t want to do this on a per-key basis, as that could mess up the alignment of the data and waste space.

Non-Unique Keys

There are two main approaches to handle this:

  1. Separate Linked List: Instead of storing the value directly in the hash table, you store a pointer (like a page ID) that points to another location where the list of values for the key is stored. For example, for the key “XYZ”, there would be a pointer to a linked list that contains all the possible values for that key and the same thing for “ABC” key:

    The advantage here is that as you insert new keys (including duplicates), you’re not modifying the main hash table. Instead, you’re appending values to the linked list.

  2. Storing Redundant Keys together: In this method, all instances of a key are stored in the hash table itself. This doesn’t interfere with the open addressing and linear probing scheme because you hash the key, land on a slot, and insert the key-value pair:

    The challenge here is during lookups when you need to retrieve all values for a key. You would need to continue scanning until you hit an empty slot, which marks the end of the search. Whereas in the first scenario, with pointers to a list, you would just follow the pointer and retrieve all the values for that key in one go.

For simplicity, many systems use the second approach, where redundant keys are stored together in the hash table. This avoids the need for a separate linked list for each key and reduces complexity.

Some additional things:

  • Handling updates vs. inserts: When differentiating between an update and an insert, hash tables typically don’t perform direct updates. Instead, they perform a delete followed by an insert. The tricky part is handling cases where you delete one of multiple values associated with a key. For example, if you want to delete a specific value for a key like “XYZ” with value 2, you need to ensure you’re only deleting that specific pair, not all of them.
  • Rehashing when full: As the hash table fills up, its performance can degrade. Different systems have varying thresholds for when rehashing should occur. Typically, once a certain load factor (e.g., 80% full) is reached, the table is resized. The table will usually double in size, and all the existing keys are rehashed into the new table. Although this resizing operation is expensive, it is better to do it earlier rather than waiting for the table to become completely full (100%), as performance can degrade significantly in that case. The general approach is to have a tunable threshold that dictates when the hash table should resize. This threshold is often configurable by the user or set by the implementation. The trade-off is that resizing (rehashing) can be expensive, but doing it when the table is moderately full (e.g., 80%) is generally more efficient than waiting until it’s nearly full, as you’ll avoid excessive collisions and lookup times.

Optimizations

Other optimizations and approaches for hash tables can be based on data type-specific implementations. The idea here is that the design of the hash table can be tuned depending on the types of data you’re storing.

  1. Handling Different Key Sizes (e.g., Strings):
    • Small Strings: If you have small strings (like 64 bits or 64 bytes), you can store the string inline in the hash table itself.
    • Large Strings: For larger strings, instead of storing the whole string in the hash table, you could store a pointer to the actual string elsewhere in memory. However, this introduces the challenge of performing lookups efficiently, as you need to follow the pointer to the string. To mitigate this, you might store the hash of the string along with the key in the hash table. This way, you avoid performing a full string comparison every time you need to look it up.
  2. Metadata Storage:
    • As we’ve discussed, metadata like whether a slot is a tombstone, null, or empty can be important in managing the hash table efficiently. One approach is to store this metadata separately in the page header. Instead of packing metadata into every slot in the hash table itself, this technique uses packed bits in a compact metadata structure that can quickly tell you the state of each slot.
    • Google’s HashMap does something similar by maintaining a separate hash table just for metadata. This table is smaller and compact, and it’s used to indicate the state of the corresponding entry in the main hash table (whether it’s null, empty, or a tombstone).
  3. Versioning for Memory Reuse:
    • ClickHouse, an OLAP system, uses a clever method for efficiently managing hash table memory. Allocating memory for a hash table is expensive, so instead of allocating fresh memory each time, ClickHouse reuses memory by marking slots as deleted instead of deallocating them. Rather than explicitly clearing out slots, ClickHouse maintains a version counter for the hash table. When you want to “clear” the table, you simply increment the version counter. Then, when you do a lookup in the table, you check if the version ID of the slot is older than the current version number. If the version IDs don’t match, it means the entry has been “deleted,” and it can be ignored.

ClickHouse is known for its highly optimized hash table implementations. They claim to have over 30 different implementations of hash tables tailored to specific use cases and data types. Their code is templated in C++, and they make heavy use of compiler tricks to eliminate unnecessary code. For example, if you know a value can never be null or if it’s a string of a certain size, the system will optimize out any unnecessary checks. From the systems I’ve seen, ClickHouse’s approach to hash tables is among the most sophisticated. They combine various optimizations to ensure their hash tables are both space- and time-efficient.

Chuckoo Hashing

Cuckoo hashing is an alternative hashing technique to linear probing. The idea behind cuckoo hashing is to use multiple hash functions instead of just one, with the aim of improving lookup efficiency.

In linear probing, you hash a key and check if the slot is available. If it’s not, you scan sequentially until you find an empty slot. With cuckoo hashing, instead of scanning for an empty slot, you hash the key to multiple locations using different hash functions. You then choose the first available slot and use it for your key.

Key Features of Cuckoo Hashing:

  • Guaranteed Lookup and Deletion Time: Cuckoo hashing guarantees that every lookup or deletion operation will be O(1), as you only need to check a constant number of positions for a key—no scanning or iteration over the table.
  • Insertions Can Be More Expensive: Insertion can become more expensive, as keys may need to be evicted from their slots to make room for new keys, leading to reorganization of keys.

The only system we know for sure that does Chuckoo Hashing is an OLAP accelerator from IBM called Blue.

Let’s see how Cuckoo Hashing Works:

  1. Multiple Hash Functions: We use two or more hash functions. Each key is hashed to multiple possible locations in the hash table.
  2. Inserting Keys:
    • When inserting a key, you check both locations generated by the hash functions. If one of those slots is empty, you insert the key there.
    • If both slots are occupied, the current key displaces (or kicks out) the key already in one of those slots.
  3. Eviction and Rehashing: When a key is evicted, you need to insert the displaced key into one of its potential locations. If this causes a chain reaction (i.e., evicting keys repeatedly), it can result in a loop. To handle this, you need to detect these loops and resize the table by doubling its size and rehashing all the keys.

Example Walkthrough:

Let’s look at an example:

  • Assume we have two hash functions and a hash table that starts empty.
  • Let’s say we want to insert key A:
    • Hash Function 1 may place it in Slot 1, while Hash Function 2 may place it in Slot 3.
    • Since Slot 1 is empty, A is placed there:
  • Now, let’s insert key B:
    • Hash Function 1 places B in Slot 3 (where A is already), and Hash Function 2 places it in Slot 1.
    • Since Slot 3 is occupied, B is insert in the free slot 1:
  • Now, let’s insert key C:
    • Hash Function 1 places C in Slot 3 (where A is already), and Hash Function 2 places it in Slot 1 (where B is already).
    • Whichever specific protocol we use to decide if C should go in slot 1 or 3 (in this case we arbitrarily choose to insert it in slot 1), C displaces B and goes into Slot 1:
  • But now B has been evicted, so we must insert B somewhere else. Remember that hash1(B) pointed to slot 1 and hash2(B) pointed to slot 3; since B has just been evicted from slot 1, we now put it in slot 3 (where A is):
  • But now A has been evicted! So now we need to repeat the same process also for A, that is checking other A’s hash functions:
  • If this causes another eviction, this process continues until we either find a free slot or detect a loop (where the same key is being pushed back to its original slot). If an infinite loop is detected, this process will be abortedand the hash table will be resized and all keys will be rehashed.

When you perform a lookup (get B for example), you hash the key to its two possible locations using the two hash functions:

You check each of the slots:

  • If the key is found in one of those slots, you return the value.
  • If the key is not present in either slot, it’s guaranteed that the key doesn’t exist in the table, eliminating the need for scanning (scanning that is necessary, instead, in linear probe hashing).

One of the trade-offs in cuckoo hashing is that while the table does random accesses (because of multiple hash functions), the lookup process within a slot is sequential (i.e., after landing on a slot, you don’t scan for the next available slot, you directly check the value). In contrast, linear probing involves sequential scans even when you’ve landed at the correct slot initially, potentially causing inefficiency as the table fills up.

So again, all of these protocols I’ve shown so far are static hashing schemes. If we run out of space or loop back around, then we need to double the size of the hash table and repopulate it—and that’s expensive. Now, we’ll talk about different techniques to incrementally resize the hash table without having to rebuild the entire thing. The most common one is chain hashing, and this is what most people think of when they hear hash table. However, we’ll also look at two more advanced techniques that are actually used in real systems.

Dynamic Hashing Schemes

So, chain hashing—the basic idea is that instead of having a giant array of all the slots where we insert keys, our array will just be pointers to linked lists, chains, or buckets. All the keys that map to a specific slot in our hash table will be found in that linked list.

If you allocate a hash map in Java, this is essentially what you get. The linked list can grow infinitely because, in the worst-case scenario—if all my keys hash to the same slot—I’m just appending to this giant list, which ends up being a sequential scan. Ideally, with a good hash function, you’ll get a good distribution of keys.

Think of it as partitioning our giant hash table into smaller hash tables. We can manage duplicate keys by appending them to the linked list. We can still use tombstones, but in this case, compaction is often faster.

Now, we have our bucket pointers, and this is where hash functions will hash into. These pointers lead to the different buckets.

  • If we want to insert A, we hash it mod N (the number of bucket pointers), land in that bucket, find the first free slot, and insert it. Same for B — it goes to the next slot:
  • If C hashes to the same bucket as A (collision), we scan through sequentially until we find a free slot:
  • D hashes to A’s bucket, scans, and if no slots are free, we expand by adding another page. E follows the same logic. Finally F:

The nice thing about this is that I can grow the key list within a bucket without affecting other parts of the table. You can even think of this as a two-level hash table, but for simplicity, we show it as a linked list.

When you create a new bucket, its size depends on the page size of the database:

  • Postgres: 8 KB
  • MySQL: 16 KB

If lots of keys hash to the same location, this linear scan can get expensive. A simple optimization is to store a Bloom filter in your bucket pointer list, that just tells you whether a key exists in the linked list. For example, if you want to look up key G:

  1. Check the Bloom filter — if it says NO, skip the scan.
  2. If YES, follow the pointer and scan the list.

Bloom filters

A Bloom filter is a probabilistic data structure that supports set membership queries.

A filter is different from an index. An index tells you where a given key is—whether it’s in a specific record ID or a page. A filter, however, can only say whether the key exists (yes or no). It cannot tell you where it is.

The Bloom filter was introduced by Bloom in the 1970s. It is a probabilistic data structure, meaning it can tell you with 100% certainty that a key does not exist. However, if it says a key does exist, it might be wrong—it can produce false positives.

There are only two operations on a basic Bloom filter:

  1. Insert
  2. Lookup
  3. You cannot delete from a standard Bloom filter.

A Bloom filter is essentially a bitmap, where bits are set based on the keys that get inserted. For example, let’s insert members of the Wu-Tang Clan into a Bloom filter:

  1. Insert RZA → Hash the key using multiple hash functions → Mod by the number of bits in the filter → Set those bits to 1.
  2. Insert GZA → Same process → Set the corresponding bits to 1.

Now, let’s perform a lookup:

  • If we look up RZA, we hash the key, check the corresponding bits, and if all are 1, we think it could exist (but it could be wrong because something else might have set those bits):
  • If we look up Raekwon, but one of the bits is 0, we know for certain that he was not inserted:
  • If we look up ODB and all bits are 1, the Bloom filter says he was inserted, but this might be a false positive (because other keys may have set those bits):

Bloom filters can be placed in front of a bucket chain and updated incrementally whenever a new key is inserted.

There are different variations of Bloom filters, such as:

  • Decaying Bloom filters
  • Multi-level Bloom filters
  • Adjustable size and different hash functions

The false positive rate depends on the size of the Bloom filter and the number of hash functions. There are online Bloom filter calculators that can determine the required parameters for a desired false positive rate.

A basic Bloom filter does not support deletions. However, variations exist that allow for this using multi-level techniques.

Example Use Case: Hash Joins

Instead of checking if a key exists on disk, it’s much cheaper to first check a Bloom filter. If the key is not in the Bloom filter, we can skip the expensive disk lookup.

Extendible Hashing

A more sophisticated scheme is called Extendable Hashing. This is similar to chained hashing, but it allows us to split buckets dynamically to prevent infinitely long bucket lists. Instead of rehashing everything, we only update a small part of the hash table incrementally.

The key idea is to increase the number of bits we use to locate a bucket in our bucket hash table. This allows us to split buckets only when necessary rather than resizing the entire table. In some cases, multiple slots in the bucket array may point to the same bucket list, but these can split dynamically as needed.

Interestingly, while this method might seem complex, it is actually used in some real-world systems:

  • GDBM (GNU Database Manager) – A key-value store similar to RocksDB or SQLite that uses external hash tables.
  • AsterixDB – A big data project from UC Irvine, which uses Extendable Hashing in its implementation.

Let’s see how it works:

  1. We start with a slot array that points to our bucket list.
  2. A global identifier determines how many bits we need to examine in a hash value to locate the correct bucket:
  3. Each bucket list also keeps track of its own local bit size, defining how many bits it considers.

For example, consider a case where:

  • The first two slots in the array point to the same bucket list.
  • The bottom two slots point to different locations.
  • If the significant bit is 0, two slots reuse the same bucket list.

Let’s say we want to do a lookup (A):

  • Hash the key.
  • Check the top N bits (as defined by the global identifier).
  • Follow the corresponding pointer to the correct bucket.
  • Do the linear search to find the thing we’re looking for.

Here’s a visual representation of it:

Now, say you want do an insertion (B):

  • Hash the key.
  • Check the top N bits (as defined by the global identifier).
  • Follow the corresponding pointer to the correct bucket.
  • Insert the key.

Here’s a visual representation of it:

Now, let’s insert C with the same steps of B, and now it lands in the same location where we inserted B, but this bucket is now full and we cannot put any more entries in.

The solution is:

  • Increase the global counter (e.g., from 2 bits to 3 bits) and this implies doubling the number of pointers in the bucket array.
  • Split the full bucket into two new buckets and the local counter goes up from 2 to 3, then I need to consider 3 digits in the hash

Importantly, this approach only resizes the affected bucket, without modifying unrelated parts of the table.

Performance Considerations:

  • Resizing the slot array is relatively cheap.
  • The method is clever because it avoids full rehashing, but it requires careful metadata management to track which bits to examine.
  • Essentially, this is chained hashing with an extra splitting mechanism, preventing infinitely growing linked lists.
  • Compared to linear probing, which locks the entire table and doubles its size, Extendable Hashing allows for incremental resizing.

Linear Hashing

Linear Hashing is another dynamic hashing technique, and it is used in PostgreSQL or something very similar. It was also implemented in Berkeley DB, which was developed by Sleepycat Software (later acquired by Oracle). The key person behind Berkeley DB was originally involved in PostgreSQL’s linear hashing implementation in the early 1990s.

The core idea of Linear Hashing is to incrementally split buckets whenever an overflow occurs, rather than performing a global resize. This ensures that the cost of resizing is amortized over multiple insertions and doesn’t burden a single unlucky transaction with an expensive rehashing operation. Basically the hash table maintains a pointer that tracks the next bucket to split and, when any bucket overflows, split the bucket at the pointer location.

Let’s give an example:

  1. Assume we start with 4 buckets (n = 4), and our first hash function is for exmplicity hash(key)=key % n.
  2. We do a lookup for 6, which hashes to bucket 2. No issue:
  3. We insert 17, which hashes to bucket 1, but the bucket is full, so we create an overflow bucket (like in chained hashing):
  4. Because we overflowed, we now split whatever the split pointer was pointing at (bucket 0 in this case), even though that didn’t overflow. We rehash all entries in bucket 0 using mod 8 (instead of mod 4).
    • 8 % 8 = 0 (stays in bucket 0).
    • 20 % 8 = 4 (moves to a new bucket).
  5. Now we have 5 buckets, and the split pointer moves to bucket 1 and we continue operating on the hash table.

Now let’s make a lookup for 20. When I first hash it, I would get 0, but then I know that the location in my bucket list is above where the split pointer is pointing. So, I have already split everything above it. After I mod it by 4, I now mod it by 8 to figure out where it really is (hash(20) = 20 % 8 = 4). That’s how I can find it down here at the bottom.

Say I want to do a lookup to get 9. In this case, it’s pointing exactly where the bucket of the split pointer is pointing, so I know I haven’t split it yet. I can just hash it once and scan along the linked list until I find what I’m looking for (hash(9) = 9 % 4 = 1).

At some point, the split pointer will reach the bottom, and I’ll have eight slots. Then, I just loop back around and start all over again.

This seems counterintuitive—I’m not splitting the thing that overflowed; I’m splitting whatever the split pointer points at. But the idea is that if a particular slot is super hot and keeps overflowing, eventually, it will get split. So, in the long run, everything gets split, and the system resizes correctly.

Questions & Answers: One thing I’m confused about: it seems like every time we overflow by one, the split pointer moves down by one, and we add one new page. So when would the split pointer ever wrap around? Yeah, so the question is when it would actually wrap around. You’d get to the point where it’s at five, six, seven, and then you’d reach seven and loop back around to zero. When it’s only from zero to three, once you get past seven, you know that’s 2N from where you started, so then you loop back around. But don’t we add a new page every time? Yeah, you add a new page, but I know I should wrap around when I reach eight. When I started, I had four, so 2 × 4 = 8. Once I get past eight, I loop back around. Then I keep doing that until I get to 16, and I loop back around again.

Linear Hashing - Resizing

Splitting buckets is based on the split pointer and eventually distributes all overflow buckets. When you reach the bottom, you just drop the first hash function and loop back around. This technique also allows for contraction or coalescing because you can identify when a bucket list is empty. In that case, you can throw it away, consolidate the structure, and move the split pointer back up. That allows you to shrink the size of the hash table.

For example, if I delete 20, I mod it by 4, but then I realize that’s below the split pointer, so I go down to the bottom and delete it:

Now, this page is empty:

If I wanted to, I could move the split pointer back up, drop the last entry, and remove the last hash table:

Obviously, you need to be careful to avoid oscillation—inserting 20, deleting 20, inserting 20 repeatedly, causing constant splitting and coalescing. That would be inefficient. However, this technique does allow contracting the data structure when necessary.

You wouldn’t want to insert 21, overflow, and split everything again unnecessarily. As far as I know, PostgreSQL doesn’t support shrinking the size of the hash table without rebuilding the whole thing.

Conclustion

Hash tables are extremely useful. Most systems implement linear probing hashing, but specialized versions can be optimized for specific data types and use cases. ClickHouse is a great example of a system that does this well.

For commercial database systems, it’s often hard to determine exactly what hash table implementation they use unless there’s a published paper or inside knowledge from engineers working on the system.

As an application developer using SQL, you generally don’t need to worry about this, but it’s useful to understand how these systems work under the hood.

The advantage of hash functions is their speed, providing O(1) lookups in the best-case scenario. However, they need to grow efficiently if the initial size estimate is incorrect. We’ll cover resizing strategies in a later discussion.

Some systems allow you to create hash tables explicitly. In PostgreSQL, if you use CREATE INDEX and specify USING HASH, you get a linear hash table implementation. However, this is not the default for most systems.

Why Aren’t Hash Tables the Default for Indexing? The reason is that hash tables don’t support range scans. They only allow equality lookups, meaning you must have the entire key.

For example, if my key is on column A and column B, I can create composite keys. However, if I’m missing A or B, I can’t perform a lookup.

By contrast, a B+ Tree (which we’ll discuss in the next class) supports prefix lookups and is considered the best data structure for databases. Tries are also useful, and you can even integrate tries into B+ Trees for additional functionality.

Because of this flexibility, most database systems use B+ Trees as the default index structure. We’ll discuss single-threaded implementations on Monday and then explore multi-threaded versions on Wednesday.