Searching an unordered array takes (linear time), while searching an ordered array takes (logarithmic time). We can achieve even better performance: lookup time using a special data structure called a hash table.

Hash Tables

A hash table (also called a dictionary in Python, or a hash, map, hash map, or associative array in other languages) is a data structure designed for fast reading.

A hash table consists of paired values:

  • The key: The item used to look up the data (e.g., "french fries").
  • The value: The data associated with the key (e.g., 0.75).

Example Menu Hash Table:

menu = { "french fries": 0.75, "hamburger": 2.5, "hot dog": 1.5, "soda": 0.6 }

Looking up a value by its key (e.g., menu.get("french fries")) has an efficiency of on average, meaning it generally takes just one step regardless of how many items are in the table.

Hashing with Hash Functions

The fast performance of a hash table is based on hashing, a process of converting a key (like a string) into a number (an index). The code or logic used for this conversion is called a hash function.

A simple hash function takes each letter’s numerical equivalent (A=1, B=2, C=3, etc.) and returns the product of those numbers.

WordStep 1 (Numbers)Step 2 (Product)Hash
BAD2, 1, 48
DAB4, 1, 28

A hash function must meet one critical criterion to be valid:

A hash function must convert the same string to the same number every single time it’s applied.

Functions that use random numbers or the current time are invalid because they produce inconsistent results for the same key. Our multiplication hash function is valid because the numerical equivalents of the letters are always the same, guaranteeing a consistent output.

A potential issue is that different keys can result in the same hash (e.g., BAD and DAB both hash to 8). This is called a collision and will be addressed later.

Armed with the concept of hash functions, we can proceed to understand how a hash table utilizes this mechanism for lookup.

Building a Thesaurus for Fun and Profit, but Mainly Profit

To build your innovative thesaurus app, Quickasaurus, which returns a single synonym for any given word, a hash table is the ideal data structure. Hash tables store data in a list of cells, similar to an array, where each cell is accessed by a numerical index.

The key to understanding a hash table’s speed is how it uses a hash function to determine the storage location (index) for a value based on its key.

In this scenario, we use the multiplication hash function () to convert the word (the key) into a numerical index.

  1. Adding "bad": "evil"
    • Key Hashing: The key "bad" is hashed: .
    • Storage: The value "evil" is placed directly into cell 8 of the hash table.
  2. Adding "cab": "taxi"
    • Key Hashing: The key "cab" is hashed: .
    • Storage: The value "taxi" is placed directly into cell 6.
  3. Adding "ace": "star"
    • Key Hashing: The key "ace" is hashed: .
    • Storage: The value "star" is placed directly into cell 15.

The hash table internally looks like a sparse array where the index is calculated directly from the key. This direct mapping from key to index is why lookup is so fast.

When a user looks up a word, the process is:

  1. The word is instantly fed into the hash function to get the index. (E.g., “bad” 8).
  2. The program goes directly to that index (cell 8) and retrieves the value (“evil”).

Since this process only requires a few constant-time mathematical and access steps, regardless of how many items are in the thesaurus, hash table lookup achieves an average efficiency of (constant time).

One-Directional Lookups

The tremendous speed of a hash table is strictly one-directional: it is used to find a value given its key.

  • Key to Value (Fast): If you know the key, you can achieve an average lookup because the hash function instantly calculates the value’s storage location (index).
  • Value to Key (Slow): If you try to find a key based on its value, you must search through every single key-value pair in the hash table, which reverts the lookup to an (linear time) operation. The value does not determine the key’s location, so there is no shortcut to finding the key.

Key Storage and Uniqueness:

  • Key Storage: Keys are often stored right next to their corresponding values inside the hash table’s cells, which aids in handling collisions.
  • Key Uniqueness: Each key can exist only once in a hash table. If you attempt to add a new key-value pair where the key already exists, the new value will overwrite the old one.
  • Value Duplication: Conversely, a value can appear multiple times in the hash table, associated with different unique keys (e.g., two different menu items can have the same price).

Dealing with Collisions

Hash tables are fast, but they face a fundamental problem: collisions. A collision occurs when two different keys hash to the same index (cell) in the underlying storage.

When a collision occurs, a common technique for resolution is separate chaining. Instead of storing a single value in the cell, the cell stores a reference to an array (or list), which then holds all the key-value pairs that hash to that index.

Let’s trace adding "dab": "pat" to the thesaurus, where "bad": "evil" is already stored at cell 8:

  1. Key Hashing: "dab" hashes to .
  2. Collision Detected: Cell 8 is occupied by "evil".
  3. Separate Chaining Applied: The content of cell 8 is replaced by an array containing subarrays, where each subarray stores the key-value pair.
    • Initial content of cell 8: ["bad", "evil"]
    • New content after adding "dab": "pat": [["bad", "evil"], ["dab", "pat"]].

When looking up a key that has resulted in a collision (e.g., thesaurus.get("dab")), the process involves:

  1. Hashing: Hash the key (e.g., "dab" 8).
  2. Cell Inspection: Go to cell 8 and note that it contains a list of subarrays.
  3. Linear Search: Perform a linear search within that small array to find the specific key ("dab") and retrieve its associated value ("pat").

If, in the worst-case scenario, every single key in the hash table hashes to the exact same cell (a catastrophic collision), that cell would contain a single, large array holding all key-value pairs. Looking up any key would then require a linear search through that entire array.

Therefore, the worst-case performance for a hash table lookup is .

To maintain the blazing-fast average-case performance, it is crucial that the hash table is designed to have few collisions. Understanding the mechanism of collisions and chaining helps appreciate how hash tables maintain their speed in typical usage.

Making an Efficient Hash Table

The efficiency of a hash table—and its ability to maintain its average-case lookup speed—depends on three critical factors:

  1. How much data we’re storing ().
  2. How many cells are available in the hash table ().
  3. Which hash function we’re using.

A good hash function is vital for efficiency because it must distribute its data across all available cells.

If a hash function is flawed (e.g., one that always produces a single-digit result, 1 through 9), it will cause all data to be clustered in a small range of cells, leaving many cells unused:

This concentration of data leads to a high number of collisions, degrading performance from toward . The more evenly the data is spread out, the fewer collisions there will be.

To minimize the worst-case performance, a hash table must perform a balancing act between:

  1. Avoiding Collisions: More cells mean fewer collisions.
  2. Avoiding Memory Hogging: Too many unused cells waste memory.

This balance is quantified using the load factor.

  • Load Factor: The ratio of the number of data elements stored to the number of available cells.
  • Ideal Load Factor: The rule of thumb developed by computer scientists is that the ideal load factor is (7 elements for every 10 cells).

When a hash table’s load factor exceeds the ideal ratio, the programming language’s internal implementation will automatically expand the hash table by allocating more cells and potentially changing the hash function to redistribute the existing data. This ensures the hash table consistently aims for peak performance.

Since most programming languages manage these internals (size, hash function, expansion) for us, we can generally assume our hash table is implemented for peak performance and leverage its superior lookup efficiency.

Hash Tables for Organization

Hash tables are excellent for organizing data because they intrinsically store information in key-value pairs. This makes them a natural fit for many common organizational structures

Any data where one piece of information is consistently associated with another can be efficiently represented by a hash table:

  • Dictionaries/Thesauri: Words paired with definitions or synonyms. This is why many languages, like Python, call hash tables dictionaries.
  • Tallies/Counts: Tracking systems where an identifier is paired with a numerical count:
    • Votes: Candidate name number of votes (e.g., {"Candidate A": 1402021}).
    • Inventory: Item name quantity in stock (e.g., {"Yellow Shirt": 1203}).

Hash tables can be used to dramatically simplify complex conditional logic (like extensive if/elif/else structures) when the logic is based on paired data.

A function checking an HTTP status code requires multiple comparisons:

def status_code_meaning(number):
    if number == 200:
        return "OK"
    elif number == 301:
        # ... and so on

The conditional logic is eliminated entirely. The hash table lookup replaces the potentially slower sequential checks.

status_codes = {200: "OK", 301: "Moved Permanently", 404: "Not Found"}
def status_code_meaning(number):
    return status_codes.get(number)

Hash tables are a standard way to represent objects and their various attributes where the attribute name becomes the key and the attribute value becomes the value.

  • Single Object:
    {"name": "Fido", "breed": "Pug", "age": 3, "gender": "Male"}
  • List of Objects: An entire list of complex items can be created by placing multiple hash tables inside an array:
    [
        {"name": "Fido", "breed": "Pug", ...},
        {"name": "Lady", "breed": "Poodle", ...},
        {"name": "Spot", "breed": "Dalmatian", ...}
    ]

Hash Tables for Speed

Hash tables are not just for naturally paired data; their lookup speed can be leveraged to dramatically optimize code, even for simple membership checking.

When you have an unordered array, like [61, 30, 91, 11, 54, 38, 72], searching for any number requires a linear search, steps. To optimize this, you can convert the array into a hash table where each number from the array is stored as a key, and an arbitrary value (like True) is assigned to it:

hash_table = {61: True, 30: True, 91: True, ...}

Now, searching for a number (e.g., 72) involves checking if it exists as a key:

hash_table.get(72)

This is a single-step hash table lookup. If the lookup returns any value (not None), the number is present in the original array. If it returns None, the number is absent.

By pre-processing the array into a hash table “index,” you transform an search into an search.

Example

Let’s make an example. Now the task is to determine if one array is a subset of another (i.e., if every element of the smaller array exists in the larger array).

1. The Nested Loop Approach (Inefficient)

The initial, non-optimized approach uses nested loops:

  1. Iterate through every element of the smaller array.
  2. For each element, iterate through the entire larger array to find a match.

If is the size of the larger array and is the size of the smaller array, this results in comparisons, giving a time complexity of .

2. The Hash Table Approach (Efficient)

The optimized algorithm uses the hash table “index” technique to replace the inner search loop.

Step 1: Build the Hash Table (The Index). We iterate through the entire larger array once and store every element as a key in the hash table.

hash_table = {}
for value in larger_array:
    hash_table[value] = True  # O(1) insertion

If the larger array has elements, this step takes time.

Step 2: Check for Subset. We then run a second, non-nested loop through the smaller array. For each element, we check for its existence in the hash table.

for value in smaller_array:
    if not hash_table.get(value): # O(1) lookup
        return False
return True

If the smaller array has elements, this step takes time.

Total Efficiency: The total time complexity is the sum of the time for the two sequential steps: . If we define as the total number of items in both arrays combined, the algorithm is simply (linear time).

This is a huge performance win, as we’ve reduced the complexity from quadratic () to linear (). The trick relies on converting multiple, slow searches into a single, fast lookup for each item in the smaller array.

This technique of using a hash table as an “index” comes up frequently in algorithms that require multiple searches within an array; that is, if your algorithm will need to keep searching for values inside an array, each search would itself take up to N steps.