This chapter is a guide to data storage solutions in system design, focusing on different storage types and relational databases. It starts by explaining the three main data storage formats—file, block, and object—and their characteristics. It then covers relational databases, exploring techniques to scale them for large-scale applications. The goal is to provide a solid foundation for understanding various data storage options and how to choose the right one for a scalable, reliable system.

Data Storage Format

The evolution of data storage hardware led to the need for software that could work across different hardware types. This resulted in the creation of storage drivers and three primary data storage formats: file-based, block-based, and object-based. Each format logically organizes data differently on the underlying hardware.

File-Based Storage

File-based storage is the oldest and most widely used format. It organizes data in a familiar hierarchy of files and folders, much like a filing cabinet. To find a file, you need its specific path string.

How it works: A filesystem provides the hierarchical structure. Data within the files is stored in blocks or pages on the disk and is organized and retrieved using metadata. When you access a file on your computer, you’re using this type of storage, as it appears as a local directory.
Benefits: It’s great for storing complex files and easier to navigate than other formats because of the logical hierarchy.
Limitations: It can only scale out by adding more systems, not by upgrading the capacity of a single one. This can be a limitation for large-scale applications.

Amazon Elastic File System (EFS) is a scalable, fully managed file storage service offered by AWS that provides shared file storage for Amazon Elastic Compute Cloud (EC2) instances, enabling multiple instances to access the same data concurrently, which makes it suitable for applications requiring shared access to files.

Block-Based Storage

Block-based storage organizes data into fixed-size blocks, each with a unique address. This format separates data from the user’s environment, allowing it to be distributed across different operating systems like Windows and Linux. When data is needed, the storage software reassembles the blocks and delivers them.

How it works: Unlike file-based storage, there is no single file path, which allows for quicker data retrieval. Each block is independent, giving users more flexibility to configure their data. It is often used in Storage Area Networks (SAN) environments and requires a functioning server to operate.
Benefits: It offers better performance, reliability, and scalability than file-based storage. It’s a great choice for large enterprises, big transactions, and massive databases.
Limitations: It can be expensive and has limited capabilities for handling metadata, which adds complexity because developers must manage metadata at the application or database level.

Amazon Elastic Block Store (EBS) offers scalable block storage volumes on the AWS cloud that can be attached to EC2 instances, providing durable, high-performance storage for applications that require low-latency access to data, such as databases and applications that need block-level storage

Object-Based Storage

Object-based storage is an architecture that stores unstructured data as self-contained units called objects. These objects are kept in a single repository and are distributed across different hardware. They are accessed using a unique identifier and rich metadata that describes the data (e.g., security policies, age, and file details).

How it works: The retrieval of data is done using the unique identifier and metadata, which distributes the workload and allows for efficient searches. It uses a simple HTTP API that works with most programming languages.
Benefits: It is highly scalable and cost-effective, making it ideal for static and unstructured data.
Limitations: Objects cannot be modified. Any change requires writing the entire object again. It’s also not well-suited for traditional databases due to its slow write process and the complexity of its API compared to file-based storage.

Amazon Simple Storage Service (S3) is an object storage service offered by AWS that provides highly scalable and durable storage for a wide range of data types, accessible via APIs. It is suitable for storing and retrieving large amounts of data, backups, and static website content as well as a data lake for analytics.

The choice of storage format depends on the type of data and the system’s performance, reliability, and scalability needs. File-based storage works well for structured data, while block and object-based storage are better for unstructured data. Block-based storage offers superior performance, and object-based storage excels in scalability and cost-effectiveness.

Relational Databases

A database is a structured collection of data. A database management system (DBMS) is the software that sits on top of it, acting as a bridge for users to access, store, and manipulate that data. Databases can be classified as either relational or nonrelational. In this section, we’ll focus on relational databases.

Relational databases organize data into a set of tables with defined relationships between them. They are the most common type of database and they are designed to store large amounts of structured data. They are based on the relational model introduced by Edgar F. Codd in the 1970s.

The logical components of a database schema design are:

Tables: The fundamental building blocks of a relational database. They store data in rows and columns. Each table represents a specific entity (e.g., a “Customer” or an “Order”).
Rows (Records or Tuples): Each row is a unique instance of an entity in a table. It contains a set of data values that correspond to the table’s columns.
Columns (Fields or Attributes): These represent specific properties of the data, with each column having a defined data type (e.g., integer, string, date).
Relationships: This is a key feature that allows tables to be connected. Relationships are established using keys. The most common types are one-to-one, one-to-many, and many-to-many.
Keys: These are used to uniquely identify records and create relationships between tables.
- A primary key is a unique identifier for each row within a table.
- A foreign key is a column in one table that references the primary key of another table, establishing a link between them.
Indexes: These are data structures that improve query performance by providing fast access to specific data. They’re built on one or more columns to speed up search and retrieval operations.
Constraints: These are rules that enforce the integrity of the data. They ensure that data is valid and consistent, preventing the entry of bad information. Examples include primary key, foreign key, unique, and check constraints.
Views: These are “virtual tables” based on a query. They don’t store data themselves but provide a simplified or customized way to present data from one or more underlying tables. They’re often used for security or to simplify complex queries.
Transactions: A transaction is a single, indivisible unit of work, ensuring a series of operations are either all completed successfully or all rolled back. The behavior of transactions is governed by the ACID properties (Atomicity, Consistency, Isolation, Durability), which we’ll cover in more detail later.

These components work together to form the structure of a relational database, providing a flexible and efficient way to store, organize, and retrieve structured data.

Here’s a sample implementation of a logical schema design of a database for storing customer orders:

Relational Database Concepts

elational databases are built on a few core concepts that ensure data is managed correctly. These include Structured Query Language (SQL), the ACID properties, and the Entity-Relationship (ER) model.

SQL

SQL is the standard language used to interact with relational databases. It allows you to create, modify, and query data using commands like SELECT, INSERT, UPDATE, and DELETE. SQL is divided into four main types of query languages:

Data Definition Language (DDL): Used to define and modify the structure of database objects. Commands include CREATE TABLE, ALTER TABLE, and DROP TABLE.
Data Manipulation Language (DML): Used to manage the data within the tables. This includes INSERT to add new data, UPDATE to change existing data, and DELETE to remove data.
Data Control Language (DCL): Used to control access and permissions for users. Commands like GRANT and REVOKE are part of DCL.
Transaction Control Language (TCL): Used to manage transactions. COMMIT makes changes permanent, while ROLLBACK undoes them.

ACID Properties

The ACID model is a set of properties that guarantee database transactions are processed reliably.

Atomicity: This means a transaction is an “all or nothing” operation. Either all the operations within the transaction are successfully completed and committed, or none of them are applied. This prevents partial updates and ensures the database remains in a consistent state.
Consistency: A transaction must move the database from one valid state to another. It ensures that the database follows all defined rules and constraints, such as ensuring a bank account balance never goes negative. The application is responsible for defining these rules.
Isolation: This ensures that concurrent transactions do not interfere with each other. The result of multiple transactions running at the same time should be the same as if they were executed one after another. Different isolation levels provide different trade-offs between concurrency and data integrity.
Durability: Once a transaction is committed, its changes are permanent and will survive any system failures, such as a crash or power outage. This is typically achieved by writing the data to non-volatile storage like a disk.

Entity-Relationship (ER) Model

The ER model is a conceptual tool used to design relational databases. It provides a graphical representation of the database structure.

Entities: Represented by rectangles, these are the main objects or concepts in the database (e.g., “Customer,” “Order”).
Attributes: Represented by ovals, these describe the properties of an entity (e.g., a “Customer” entity might have an “address” attribute).
Relationships: Represented by diamonds, these show how entities are connected (e.g., a “Customer” places an “Order”). Relationships can be one-to-one, one-to-many, or many-to-many.

The ER model helps designers visualize and communicate the database structure clearly before they begin implementation.

Schema Normalization

Schema normalization is the process of organizing data in a database to reduce data redundancy and improve data integrity. It works by breaking down a large table into smaller, more manageable ones, each with a specific purpose. The main goal is to ensure that each piece of data is stored only once.

For example, consider a Customers table with columns like customer_id, customer_name, customer_email, customer_address, and customer_phone. If a customer places multiple orders, their name and contact information would be repeated in every row.

To normalize this, you can split the table into two: CustomerInfo and CustomerContact. The CustomerInfo table would hold customer_id and customer_name, while CustomerContact would hold customer_id along with email, address, and phone number. This way, customer information is stored only once, saving space and preventing inconsistencies that can arise from duplicate data.

Keys

Keys are essential in relational databases for uniquely identifying rows and establishing relationships between tables. A key is a single column or a set of columns that uniquely identifies each record. There are three main types of keys:

Candidate Key: A column or a set of columns that can uniquely identify a record in a table. It’s a “candidate” to be chosen as the primary key. For instance, a customer’s email address could be a candidate key because each email is unique.
Primary Key: The candidate key that is chosen to be the main identifier for a table. It uniquely identifies each row and is used by other tables to link to it. For example, order_ID could be the primary key in an Orders table.
Foreign Key: A column or a set of columns in one table that links to the primary key of another table. It’s used to establish relationships and ensure data integrity. In an Orders table, customer_id would be a foreign key referencing the Customers table, linking each order to the correct customer.

Relational Database management System Architecture

A Relational Database Management System (RDBMS) architecture is a complex system of interconnected components that work together to manage and process data. The core components are the query processor, execution engine, storage engine, and several managers that handle tasks like transactions, caching, and security.

Query Processor

The query processor is responsible for understanding and preparing a user’s query for execution. It consists of two submodules:

Query Parser: This component takes the user’s query and translates it into an abstract syntax tree (AST). The parser validates the query’s syntax and semantics, ensuring it is a valid request before proceeding.
Query Optimizer: This module uses the AST to create an efficient execution plan, which represents a series of steps organized in a directed dependency graph. It considers various factors, like data size and location, to find the most cost-effective way to execute the query.

Core Execution Components

Execution Engine: This is the component that executes the plan from the query optimizer. It interacts with the storage engine to retrieve data, performs operations like joins and filtering, and returns the final results to the user. In a distributed database, it orchestrates the execution across different nodes.
Storage Engine: This is the layer that manages the physical storage and retrieval of data. It handles tasks like data compression, file allocation, and indexing, acting as the interface between the database and the physical disk.
Buffer Manager: This component manages data buffers in memory, acting as a cache for disk data. It optimizes disk I/O operations by storing frequently accessed data in memory, which significantly speeds up data retrieval.
Cache Manager: Similar to the buffer manager, this component focuses on caching frequently accessed data in memory to reduce the need for disk reads and improve query performance.

Transaction and Data Integrity Managers

Transaction Manager: This component ensures that a series of operations are treated as a single, indivisible unit. It guarantees that a transaction is either fully completed or completely rolled back, preventing partial updates and maintaining data integrity. It works closely with the concurrency control and recovery managers.
Concurrency Control Manager: This component manages simultaneous access to the database by multiple users or transactions. It uses mechanisms like locking to prevent conflicts and ensure that transactions are executed in an isolated and consistent manner.
Recovery Manager: This component ensures durability and reliability by managing transaction logging and checkpointing. It maintains an immutable log file that records every write operation. This log allows the system to be restored to a consistent state after a crash. Before a write is confirmed to the user, it is first written to this log file on disk to prevent data loss.

Auxiliary Components

Security Manager: This component enforces data security by handling user authentication, managing permissions, and controlling access. It protects the database from unauthorized access.
Catalog: Also known as the data dictionary or metadata repository, this component stores information about the database’s structure, including details about tables, columns, indexes, and constraints. The RDBMS uses the catalog to correctly interpret and manipulate data.

Optimizing Relational Databases

Optimizing relational databases is crucial for improving query performance. Here are key methods for doing so, with a focus on indexes.

Indexes

An index is a data structure that speeds up data retrieval in a database. Instead of a slow, full table scan, an index allows the database to quickly find the data it needs. They are especially useful for columns used in SELECT, GROUP BY, ORDER BY, and JOIN operations. Indexes are typically implemented as B-trees, which are self-balancing structures that keep data sorted for efficient searches and modifications.

There are two main types of indexes:

Primary Indexes: These are created on a table’s primary key. They are essential for quickly finding a specific row because each row’s primary key is unique.
Secondary Indexes: These are created on non-primary key columns. They are used to improve queries that filter or sort data based on specific attributes, such as a customer’s phone number.

B+ trees are a type of B-tree widely used for indexing in RDBMSs due to their efficiency in handling large datasets. A B+ tree organizes data in a sorted, tree-like structure composed of internal nodes and leaf nodes.

Internal nodes contain keys that act as guides for searches.
Leaf nodes hold the actual data records or pointers to them, and they are linked together, which allows for efficient sequential or range scans.

B+ trees offer several benefits:

Efficient Searching: The balanced nature of the tree ensures that finding a specific value or a range of values is very fast, with a logarithmic time complexity.
Fast Scans: Since the leaf nodes are linked, the system can quickly scan through a range of values, which is useful for queries that involve sorting or aggregations.
Dynamic Operations: B+ trees efficiently handle insertions and updates by maintaining their balance and sorted order without significant performance degradation.

While indexes are powerful, using too many can have drawbacks. They require additional storage space and can slow down write operations because the index must also be updated every time data is modified. For bulk data loading, it can be more efficient to temporarily disable indexes and then rebuild them afterward.

Here’s a B+ tree data structure representation:

SQL Tuning Techniques

Minimize Large Write Operations: Writing, modifying, or deleting large amounts of data can significantly impact performance. These operations can cause table blocking, which prevents other transactions from accessing the data. Minimizing these heavy write tasks helps maintain performance and reduces the size of log files.
Schedule During Off-Peak Hours: Resource-intensive queries, such as those with large joins or complex subqueries, should be scheduled during off-peak hours. This is because these queries place locks on tables, limiting access for other users. Running them when traffic is low reduces server strain and ensures data is available for other users.
Streamline Joins: Queries with many tables and joins can lead to inefficient execution plans. Join elimination is a technique that simplifies a query by breaking it into smaller, more efficient queries. This reduces unnecessary joins, subqueries, and extraneous tables, thereby improving performance.

Denormalization

Denormalization is a technique that improves read performance by duplicating data across multiple tables to avoid costly joins. This is especially useful for read-heavy workloads (where reads outnumber writes by 100:1 or more).

Advantages: It eliminates the need for complex, resource-intensive joins during read operations, which can be particularly slow. This is beneficial in distributed systems where joins across different servers are highly complex.
Disadvantages: It introduces data redundancy, which can lead to data inconsistencies if not carefully managed. Maintaining consistency requires complex mechanisms like constraints and triggers (e.g., a change to a customer’s email must be automatically reflected across all tables where it’s duplicated). This can make a denormalized database perform worse than a normalized one under a heavy write load.

Query Federation

Query federation is a technique for optimizing queries that involve large amounts of data or complex joins. It works by splitting a large query into smaller ones that can be executed independently on different database servers. This requires a federated schema, where the database is partitioned by function across multiple servers. By executing these smaller queries in parallel on different servers, the overall execution time is reduced, leading to faster results.

Scaling Relational Databases

As businesses grow and data needs increase, scaling databases becomes necessary to handle the increased workload. Scaling refers to increasing the capacity of a database to accommodate more data, users, and transactions. Two major approaches to scaling relational databases are partitioning and replication.

Partitioning

Partitioning is the process of dividing a large database table into smaller, more manageable parts called partitions. Each record belongs to one and only one partition, and each partition can operate as an independent database. Queries can be directed to a single partition for specific data or distributed across multiple partitions for broader processing.

Vertical Partitioning

This involves splitting a table by its columns. For example, a Customers table could be split into one table with general customer information (like name and ID) and another table with contact details (email, phone number, etc.).

Horizontal Partitioning

This involves splitting a table by its rows. A large Customers table, for instance, could be split into smaller tables based on a range of customer records, like last name or zip code. There are two main approaches to horizontal partitioning:

Hash partitioning is a strategy that distributes data evenly among partitions by using a hash function. A hash of a record’s key is generated, and this hash value determines which partition the record will be placed in. This approach relies on a deterministic hash function, which ensures that the same key always maps to the same partition.
- How it works: A hash function (like MD5 or SHA-256) is applied to the key, and the resulting hash is used to assign the record to a specific partition bucket. A single physical host can contain multiple partitions.
- Advantage: This method effectively prevents hot spots and skewed partitions, which are common issues when data access patterns are uneven. The hashing mechanism ensures a balanced distribution of both data and workload.
- Disadvantage: It’s inefficient for range queries (e.g., finding all users with a key between A and C). To perform a range query, the database must query all partitions and then combine the results, a process known as a scatter and gather pattern.
Range partitioning divides a table into partitions based on a continuous range of key values. The data within each partition is stored in a sorted order.
- How it works: A specific range of key values is assigned to each partition. For example, a customer table could be partitioned alphabetically by last name, with all last names from A-C in one partition, D-F in another, and so on.
- Advantage: It is highly efficient for range scan queries. Since data is sorted and grouped by range, the system can quickly retrieve all records within a given range by only querying a single partition.
- Disadvantage: This method is susceptible to hot spots and skewed partitions. If a particular range of keys experiences a disproportionately high number of queries or data, that single partition will become overloaded and act as a performance bottleneck. An example is if a large portion of a company’s customers have last names starting with ‘S’, the partition holding ‘S’ data would become a hot spot.

Sharding

Sharding is the process of distributing a large database across multiple servers. Each server, or shard, contains a subset of the data, and queries are distributed across all servers. This technique is especially useful for scaling databases that have become too large to be managed on a single server. A common example is sharding a customer database as the number of customers grows. Sharding can be done using a customer’s last name, initial, or geographic location.

There are two types of sharding:

Vertical Sharding: This involves splitting a table’s columns and moving the data to separate servers.
Horizontal Sharding: This involves splitting a table’s rows and distributing them across different servers. It has various approaches, including hash-based, range-based, and round-robin sharding.

Sharding provides several advantages, similar to federation. It reduces read and write traffic on any single server, decreases replication needs, and improves cache utilization. It also helps reduce the size of indexes, leading to faster query performance. In case one shard fails, the others remain operational. Sharding also removes the need for a central master for write serialization, allowing for parallel writes and increasing overall throughput.

However, sharding also comes with its disadvantages:

Application Complexity: The application logic needs to be updated to handle shard-specific operations, which can result in more complex SQL queries.
Unbalanced Data Distribution: Data can become unbalanced, especially if a shard has a subset of “power customers,” leading to an increased load on that particular shard.
Rebalancing Complexity: Rebalancing shards can be challenging, though using a consistent hash-based sharding function can minimize the data that needs to be moved.
Cross-Shard Joins: Joining data from multiple shards becomes more difficult.
Increased System Overhead: Overall, sharding introduces additional hardware requirements and increases the complexity of the system.

Replication

Replication is the process of copying data from one database server to another, with each copy known as a replica. This is a key feature of distributed databases and offers several advantages.

Benefits of Replication:

High Availability: Replication ensures a system can stay online even if a host machine fails. Since data is copied to multiple replicas, read and write operations can be redirected to other live machines, preventing service disruptions and providing fault tolerance.
Load Distribution: By having multiple replicas, a database can distribute read and write queries across different hosts. This prevents any single machine from being overloaded, improving overall system performance and scalability by efficiently using computing resources.
Reduced Latency: Replicas can be placed in different geographical locations, closer to end users. This proximity reduces the network latency users experience, leading to faster response times, which is crucial for real-time applications.
Disaster Recovery: Replication is a foundation for disaster recovery and data resilience. Multiple copies of data can withstand catastrophic events, hardware failures, or natural disasters. If data is lost on one machine, it can be recovered from a replicated copy.
Scalability: Replication facilitates horizontal scaling. As data or user load increases, additional host machines can be added, each hosting a replica. This allows the system to handle larger workloads and accommodate growing user demands.

There are two main types of replication for relational databases:

Single-Leader Replication. In this model, one database server acts as the leader and handles all write operations. Other servers, called followers, handle read operations. Changes from the leader are replicated to the followers either synchronously or asynchronously, ensuring all servers have the same data. This approach is effective for scaling read-heavy databases.
Multileader Replication. In this model, every database server can both read and write data. Changes made to one server are replicated to all others. While this provides a highly available setup by avoiding the lag of leader election after a failure, it comes with a significant trade-off: conflict resolution. Since writes can occur on multiple leaders simultaneously, conflicting updates to the same data may arise, which must be managed.

Replication can be implemented in various ways on follower nodes, including full replication, snapshot-based replication, transactional replication, or key-based incremental replication. It can also be done either synchronously or asynchronously, each with its own advantages and disadvantages

Here’s an illustration of the two types of replication for RDBMS:

Synchronous Replication

Synchronous replication is a mechanism for copying data from a primary (leader) database replica to one or more secondary (follower) replicas in real time. A write operation is considered complete only after both the leader and the synchronous follower confirm the write. This strict synchronization ensures that data is consistently up-to-date across all replicas.

Consistent Writes: A client’s write is only acknowledged as successful after it has been written and confirmed by both the leader and at least one follower. This guarantees that all replicas are in sync and have the most recent data.
Immediate Failover: Since all synchronous replicas are always up-to-date, if the leader fails, one can be immediately promoted to the new leader without any data loss. This ensures continuous availability and high resilience.
Data Durability: Data is durably stored across multiple machines. If one replica fails, the data is still safely persisted on the others.
Read Consistency: You can perform read operations on any replica with the assurance that you’re getting the most recent, consistent data. This is great for load balancing read-heavy applications.

Synchronous replication is crucial for applications that require strict data integrity and high availability, such as financial systems.

Asynchronous Replication

Asynchronous replication copies data from a leader to followers with a potential delay. The leader acknowledges a write operation as complete before the data has been confirmed by the followers. This approach prioritizes performance and scalability over immediate consistency.

Near Real-Time Updates: Changes from the leader are replicated to followers in near real time, but there is a lag. The followers may not reflect the latest data immediately. This results in temporary data inconsistency, though the replicas will eventually converge.
Data Lag and Staleness: The delay in replication can cause followers to have outdated, or stale, data. This is a trade-off that’s a key consideration for applications that need the most current information.
Risk of Data Loss: If a leader fails, an asynchronous follower might not have received or applied all the latest changes. Promoting this follower to a new leader could result in data loss. To mitigate this, the replication lag must be carefully monitored.
Scalability and Performance: Asynchronous replication is preferred in systems that prioritize high write throughput and scalability. It allows the system to handle a large volume of writes without waiting for follower acknowledgment, enabling horizontal scaling and better response times.

In essence, asynchronous replication provides flexibility and performance but at the cost of potential data staleness and a risk of data loss during failover.

Open Source Relational Database Systems

MySQL and PostgreSQL are two popular open-source relational database management systems (RDBMSs). Both are powerful, feature-rich, and support modern database requirements, but they have key differences that make them suitable for different use cases.

MySQL

MySQL is known for its speed, scalability, and ease of use. It’s a great choice for applications that handle a high volume of data and traffic, particularly those that are read-heavy

Architecture: It uses a thread-per-connection model, which has low overhead and handles concurrent connections efficiently.
Performance: Prioritizes speed by not including some complex SQL features. It excels at highly concurrent, read-only operations.
Replication: Primarily uses one-way, asynchronous replication, where one primary server sends changes to multiple secondary servers.
JSON Support: Stores JSON data in a converted internal format for fast read access but has limited indexing capabilities on JSON columns.

PostgreSQL

PostgreSQL is known for its robustness, reliability, and advanced features. It is often referred to as an object-relational database management system (ORDBMS) because it supports object-oriented concepts like inheritance. It’s well-suited for applications that require advanced data management and handle write-heavy workloads.

Architecture: Uses a process-per-connection model, which can be more resource-intensive and struggles with a large number of connections.
Performance: Supports complex query plans that can leverage multiple CPUs for faster processing. It is ideal for data warehousing and online transaction processing (OLTP), especially with concurrent writes.
Replication: Supports synchronous replication (called 2-safe replication), where a primary database is synchronized with a secondary one, minimizing data loss risk.
JSON Support: Offers two JSON data types: JSON (exact copy of text) and jsonb (a decomposed binary format that is faster to process and supports indexing).
Advanced Features: Supports advanced data types (e.g., arrays, JSONB), expression indexes (indexing the result of a function), and partial indexes (indexing only a part of a table).

Amazon RDS offers multiple options by providing various database engine versions (commonly referred to as “flavors”), instance classes, and storage types. These flavors are essentially different configurations of managed database engines, each optimized for specific use cases and workloads. RDS supports several popular relational database engines, including MySQL, PostgreSQL, Oracle, SQL Server, and MariaDB.

Quartz 4

Explorer

02. Storage Types and Relational Stores