In the modern data landscape, traditional relational databases, which have long been the standard for structured data, are facing limitations. This is due to the exponential growth of data, the diversity of new data formats, and the need for high-performance, scalable solutions. As a result, a new class of databases known as nonrelational or NoSQL databases has emerged. These databases have become popular because they can handle the demands of modern web applications, real-time analytics, and large-scale distributed systems.

Nonrelational databases depart from the rigid, relational model and introduce new data models and storage mechanisms. They are designed to prioritize scalability, fault tolerance, and low-latency access. This allows organizations to effectively manage massive data volumes and support dynamic, evolving data requirements.

Nonrelational Database Concepts

Schema Flexibility: Unlike relational databases that require a predefined, strict schema, nonrelational databases have a flexible schema. This allows them to store and accommodate data with varying and evolving structures.
Data Models: Nonrelational databases support a variety of data models, each optimized for specific use cases:
- Document Stores: Store data in flexible, semi-structured documents (e.g., JSON), ideal for complex data.
- Key-Value Stores: Store data as simple key-value pairs, making them highly efficient for quick reads and writes.
- Column-Family Stores: Organize data into column families, perfect for analytical and warehousing scenarios.
- Graph Databases: Focus on the relationships between data entities, making them suitable for interconnected data.
Scalability: Nonrelational databases are designed for horizontal scalability, distributing data across multiple servers. They can handle massive datasets and high concurrent traffic through techniques like sharding and replication.
High Availability and Fault Tolerance: Many nonrelational databases are built to handle hardware failures and network issues without compromising data integrity or accessibility.

BASE Principles

The BASE acronym, which stands for “Basically Available, Soft state, Eventually consistent”, is an alternative approach to data consistency that nonrelational databases often adopt. It contrasts with the strict ACID properties of relational databases and is suited for large-scale, distributed systems where availability and scalability are critical.

Basically Available: The system is designed to always be available and respond to client requests, even in the event of network partitions or hardware failures.
Soft State: The system’s state can change over time due to various factors, such as node failures, network delays, or concurrent updates. It acknowledges that the data on different nodes might be temporarily inconsistent. Applications need to be designed to handle such variability gracefully.
Eventually Consistent: While a temporary inconsistency may exist between data replicas, the system guarantees that all data will eventually converge to a consistent state once the network issues or updates are resolved.

The choice between an ACID and a BASE model depends on the specific needs of an application. An ACID model is better for financial systems that need strict consistency, while a BASE model is ideal for applications like social media platforms that prioritize availability and scalability.

Key-Value Databases

Key-value databases are a type of nonrelational database that use a simple data model to store and retrieve data: the key-value pair. They are highly scalable and flexible, relying on distributed hash tables for efficient operation.

Data Model

The basic unit of data in a key-value store is an item, which is a key-value pair. The key is a unique identifier, and the value is the data associated with it. Values can be of various types, including strings, numbers, or complex JSON objects.

No Fixed Schema or Indexes

Unlike relational databases, key-value stores are schemaless. This means different items can have varying structures, and you don’t need to pre-define the data’s attributes. They typically lack complex indexing, as they rely on the key for fast lookups. Indexing is usually limited to the keys themselves.

Key Types

In a distributed key-value store, keys are used to organize and retrieve data efficiently.

Primary Key: This is the unique identifier for each key-value pair. It provides a direct map to the value and ensures that each item is identifiable.
Partition Key: A subset of the primary key, the partition key determines where data is stored across multiple storage partitions or nodes. The partition key is used to partition the dataset into smaller, more manageable subsets that can be distributed across different physical or virtual storage nodes. Choosing a good partition key is essential for distributing data evenly and avoiding hot spots (overloaded partitions).
Sort Key: An optional attribute, the sort key (or range key) orders data within a partition. It allows for efficient range queries and retrieval of data in a specific order, for example, retrieving records by timestamp.

Data Access and Retrieval Operations

Key-value stores have a simple set of operations for manipulating data based on the key:

GetItem: Retrieves an item and its attributes using the primary key.
PutItem: Inserts a new item or replaces an existing item with the same key.
UpdateItem: Modifies the attributes of an existing item. If the item doesn’t exist, it creates a new one.
DeleteItem: Removes an item from the store using its primary key

Scaling Key-Values Stores

Key-value stores are designed for horizontal scalability, allowing them to handle massive volumes of data. They achieve this through a distributed architecture where data is partitioned across multiple nodes. This section describes how they are scaled using leaderless replication and consistent hashing.

Leaderless Replication

Leaderless replication is a distributed system technique that provides high availability and fault tolerance without a single leader node. Instead, all nodes are equal and can independently accept client requests for read and write operations.

How it works: Data is divided into partitions (shards), and each shard is replicated across multiple nodes using a peer-to-peer approach. When a client sends a request, it can go to any node, and that node handles the operation, eliminating a single point of failure.
Quorums: To ensure consistency, leaderless systems use quorums. A quorum defines the minimum number of successful responses needed for an operation to be considered complete. For example, a write operation might require an acknowledgment from a majority of nodes to be successful (N/2+1). A common quorum model is W + R > N, where W is the write quorum, R is the read quorum, and N is the total number of nodes.
Conflict Resolution: When conflicting updates occur on different nodes, tools like vector clocks and Merkle trees are used to resolve them. Merkle trees, for instance, can quickly identify where discrepancies exist in a dataset, simplifying the conflict resolution process.

Leaderless replication provides high availability and fault tolerance because the system can continue operating even if some nodes become unavailable. Clients can send requests to any available node, and the system handles the replication and coordination transparently. However, ensuring strong consistency across the replicas can be more challenging in a leaderless replication model compared to systems with a designated leader. Hence, Amazon DynamoDB is based on a leader-follower architecture, which we discussed in Replication Patterns.

Leaderless Replication, Gemini 2.5 Flash

I asked Gemini 2.5 Flash to clarify this technique. I think it’s clearer, so I’ll post here.

Leaderless replication relies on every node being able to accept both read and write requests. When a client wants to write data (like adding a new user to a database), it sends the request to any available node. This node then becomes responsible for replicating that data to a certain number of other nodes, known as the replication factor.

The Write Process

A client sends a write request to a node. Let’s call this node A.

Node A writes the data locally.

Node A then forwards the write request to other nodes (e.g., Node B and Node C).

The system waits for a certain number of acknowledgments before telling the client the write was successful. This number is often called the quorum. A quorum is a majority of nodes (more than half) that must confirm a write. For example, if your replication factor is 3, your write quorum might be 2.

The Read Process

Similarly, when a client wants to read data, it can send the request to any node.

A client sends a read request to a node.

The node might forward the request to other nodes to ensure it gets the most up-to-date version of the data.

It then waits for a read quorum of responses. The system can then compare the data from multiple nodes and return the most recent version.

Dealing with Conflicts

Since there’s no single leader to enforce a perfect order, it’s possible for conflicts to arise. For instance, two nodes might receive conflicting updates to the same piece of data at the same time. Leaderless systems handle this using various techniques:

Vector clocks: A way of tracking the causal history of data.

Last-write-wins: Simply accepting the last update received.

Conflict-free Replicated Data Types (CRDTs): Data structures that are designed to handle conflicts automatically.

Consistent Hashing

Consistent hashing is a technique used to distribute data evenly across a set of nodes while minimizing data movement when nodes are added or removed. In traditional hashing, data is hashed to determine the node responsible for storing or serving it. However, this approach becomes problematic when nodes are added or removed from the system. In such cases, the distribution of data needs to be recalculated, resulting in significant data movement and potential disruptions.

How it works: Nodes are conceptually arranged on a ring-like structure using a hash function. Data is also hashed to a position on this ring. A piece of data is then assigned to the node whose position is the closest clockwise to the data’s position.
Advantages:
- Scalability: When a node is added or removed, only a small fraction of the data needs to be remapped to a new location.
- Load Balancing: Data is distributed evenly around the ring, ensuring that each node handles a similar load.
- High Availability: The technique makes scaling easier and more efficient, minimizing disruptions.

Consistent Hashing, Gemini 2.5 Flash

I asked Gemini 2.5 Flash to clarify this technique. I think it’s clearer, so I’ll post here.

Imagine you have a distributed system with three servers (A, B, and C) and you want to store user data across them. A simple way to decide where to put a user’s data is to use a hash function: server = hash(user_id) % number_of_servers

If hash(user_id) % 3 is 0, the data goes to server A.

If hash(user_id) % 3 is 1, the data goes to server B.

If hash(user_id) % 3 is 2, the data goes to server C.

This works fine as long as the number of servers doesn’t change. But what happens if you add a new server, D? When a new server is added, the formula for locating all data (old and new) changes to hash(key) % 4. This means the system will now look for existing data in the wrong place, effectively making it inaccessible. To fix this, you must initiate a massive and expensive data redistribution, moving almost every piece of data to its new, correct location.

The Consistent Hashing Solution. Consistent hashing solves this by introducing a “hash ring” or a continuous circle of numbers. Both the servers and the data keys are placed on this ring.

Map the Servers: First, a hash function is used to map each server (using its IP address or a unique ID) to a point on the ring. Let’s say Server A is at 60, Server B is at 120, and Server C is at 300.

Map the Data: Next, when you want to store a piece of data, its key (like a user_id) is also hashed to a point on the same ring.

Assign the Data: The data is then assigned to the first server it encounters by moving clockwise from its position on the ring.

A key at position 100 would be assigned to Server B (at 120).

A key at position 130 would also go to Server C (at 300).

A key at position 320 would wrap around the ring and be assigned to Server A (at 60).

What Happens When a Server is Added or Removed? This is where the magic happens.

Adding a Server: If you add a new Server D at position 110, only the data that was between Server A’s position and the new Server D’s position needs to be moved. All other data stays exactly where it is.

Removing a Server: If Server B fails, all the data that was previously assigned to it is simply re-assigned to the next server in the clockwise direction, which is Server C. Again, only a small fraction of the data is affected.

By using this ring structure, consistent hashing ensures that changes to the number of servers result in a minimal number of keys being remapped, making it a highly scalable and resilient approach for distributed systems.

Look at the following image: when we insert a new server s4 to the left of s0 on the ring, only k0 needs to be moved from s0 to s4. This is because s4 is the first server k0 encounters by going clockwise from k0’s position on the ring. Keys k1, k2, and k3 are not affected.

Availability in Key-Value Stores (TODO)

In a distributed key-value store, high availability is achieved through several mechanisms that prioritize responsiveness and fault tolerance over strict, immediate consistency.

Mechanisms for High Availability:

Optimistic Replication: In this technique, a write operation is considered successful and acknowledged to the client before all replicas have confirmed it. Changes are propagated asynchronously to other nodes. This reduces latency and maintains availability even if some replicas are temporarily offline. The downside is that it can lead to temporary inconsistencies between replicas, which are resolved in the background.
Sloppy Quorum and Last Write Wins (LWW): These techniques ensure availability during network partitions.
- Sloppy quorum allows a relaxed quorum requirement for read and write operations, meaning only a subset of replicas needs to acknowledge an operation. This allows the system to continue operating when some replicas are unavailable.
- LWW is a conflict-resolution strategy. If conflicting updates for the same key arrive at different replicas, the one with the latest timestamp is given precedence. This sacrifices strong consistency for availability, ensuring the most recent change is eventually propagated.
Hinted Handoff: When a write operation is sent to a replica that can’t communicate with its intended destination, the write is temporarily stored as a “hint.” Once the destination replica becomes available, the hint is delivered, ensuring that no data is lost and all updates are eventually applied.

Advantages, Trade-offs, and Considerations

Key-value stores offer high performance and low-latency access because data retrieval is a direct key-based lookup. This makes them ideal for applications that need fast reads and writes, such as caching, session management, and real-time analytics.

However, they come with significant trade-offs:

Limited Querying: They are optimized for simple key-based lookups and lack the complex query capabilities of relational databases.
Complex Operations: Transactions and operations involving multiple keys are challenging to implement without a predefined schema or relationships.
Application-Level Complexity: Developers must manage complex data relationships and business logic at the application layer, which can complicate the design.

Dynamo: A Case Study

Dynamo, developed by Amazon for its shopping cart service, is a notable example of a key-value store. It’s a highly available and scalable system that influenced other open-source implementations like Riak and Voldemort.

Key features of Dynamo include:

Data Model: A simple key-value model where the value can be any binary data, and a key uniquely identifies each item.
Consistency Model: Uses tunable eventual consistency, meaning updates are not immediately reflected across all replicas.
High Availability: Prioritizes availability, ensuring data remains accessible even with failures or network partitions.
Horizontal Scalability: Scales by distributing data across many commodity servers, increasing capacity and performance at a lower cost.

Quartz 4

Explorer

03. Nonrelational Stores