This chapter introduces the fundamental concepts of system design, which is essential for building large-scale software systems. It emphasizes that a good design is crucial for ensuring a system works correctly, scales with more users, and remains maintainable. The process involves understanding business requirements, evaluating trade-offs, handling errors, and planning for future changes.

System Design Concepts

To understand how large-scale systems are built, it’s important to grasp core concepts. System design relies on abstraction, which simplifies complex details to help us focus on the bigger picture. Key concepts include communication, consistency, availability, reliability, scalability, system maintainability, and fault tolerance.

Communication

Large systems are made of smaller parts called servers that communicate with each other over a network. This communication can be synchronous or asynchronous.

Synchronous Communication: In this model, the sender waits for a response from the receiver before continuing. It’s like a phone call where both parties speak and respond in real-time. In system design, this means the application blocks until it gets a response. While this allows for quick issue resolution, it can cause latency or performance lag for the user if the response is slow. This type of communication is often used when a real-time response is necessary, such as between a user interface (frontend) and the server (backend).
Asynchronous Communication: Here, the sender sends a message and continues its work without waiting for an immediate response. It’s like sending an email where you wait for a reply later. In system design, the sender does not block. When the receiver’s response arrives, a callback function is executed. This approach is more flexible and can tolerate delays or failures. It’s often used for tasks that don’t require an immediate response, like checking the status of a long-running job.

Consistency

Consistency is a crucial requirement in software systems, ensuring data is accurate and follows a set of rules. Consistency can refer to a variety of concepts and contexts in system design.

Consistency in Distributed Systems

Distributed systems are software systems that are physically separated but connected over a network to achieve common goals by sharing computing resources.

In distributed systems, which are physically separated but networked, consistency means all data copies (called replica nodes) have the same view of the data at the same time.

Ensuring consistency in distributed systems is challenging because of network delays and potential failures. Several techniques are used to address this:

Data Replication: Multiple copies of data are kept on different replica nodes. Updates are made on all nodes at the same time using synchronous communication to ensure all nodes have the same data view.
Consensus Protocols: These protocols, such as voting or leader election, make sure all replica nodes agree on data updates before they are applied.
Conflict Resolution: When multiple nodes try to update the same data, conflict resolution algorithms (e.g., “last writer wins”) determine which update is applied.

Consistency in Data Storage and Retrieval

Consistency in data storage ensures that a read request always returns the most recent value that was written. This is vital for data integrity. For example, in a bank account, a withdrawal must be immediately reflected in the balance to prevent errors. Techniques to ensure this include:

Write-Ahead Logging: Writes are first recorded in a log. If the system fails, the log can be replayed to restore the data to a consistent state.
Locking: This mechanism ensures only one write operation can happen at a time, preventing conflicts and guaranteeing reads always get the latest value.
Data Versioning: Each write gets a version number. Reads always return the data with the highest version number, ensuring they get the most recent data even with concurrent writes.

Consistency Spectrum Model

The consistency spectrum model helps us understand the different levels of consistency a system can offer. It ranges from strong consistency to eventual consistency. The choice depends on the specific needs of the system, balancing data accuracy with system performance and complexity.

Strong Consistency: This is at one end of the spectrum. It guarantees that all replica nodes have the same data at all times, and updates are immediately reflected everywhere. While this ensures data is always accurate and up-to-date, it is difficult to achieve in practice and can increase latency due to the need for constant communication and synchronization.
Monotonic Read Consistency: A client will not see “stale” data. Once a client reads a value, all future reads by that same client will return the same value or a more recent one.
Monotonic Write Consistency: After a replica node acknowledges a write, all future reads from that node will return the updated value.
Causal Consistency: Preserves the order of causally related operations. If operation A must happen before B, all systems will see A before B. This is stronger than eventual consistency because it maintains the order of related events.
Eventual Consistency: At the other end of the spectrum, this model guarantees that all replica nodes will eventually have the same data, given enough time. This allows for more flexibility and tolerance for delays but can lead to temporary data inconsistencies.

The following image shows the difference in the results of performing a sequence of actions under either strong consistency or eventual consistency. As you can see on the left, showing strong consistency, when x is read from a replica node after updating it from 0 to 2, it will block the request until replication happens and then return 2 as the result. On the right, illustrating eventual consistency, you can see that when the replica node is queried, it will give a stale result of x as 0 before replication completes.

There is a trade-off between consistency and system complexity. Stricter models like strong consistency require more complex coordination, increasing latency and operational costs. Relaxing consistency to an eventual model simplifies the system, allowing for higher scalability and availability, but at the cost of temporary data inconsistencies.

Availability

Availability refers to a system’s ability to process requests and respond promptly, even when some parts fail or when the system is under heavy load.

It’s measured as a percentage of uptime. Higher availability is often represented using “nines” (e.g., “five nines” for 99.999% availability).

Measuring Availability

Availability is calculated as the percentage of total time the system is up divided by the total time it should have been running:

Availability % = \frac{(Total Time - Sum total of time system was down)}{Total Time} \times 100

Availability percentages are represented in nines based on this formula over a period of time. You can see a breakdown of what these numbers really work out to in the following table:

Achieving higher availability percentages becomes much more difficult and costly. Each additional “nine” requires more redundancy, better fault tolerance, and constant monitoring. For example, 99.9% (3 nines) allows for about 8.76 hours of downtime per year, while 99.999% (5 nines) allows for only 5.26 minutes.

Availability in Sequential vs. Parallel Systems

The overall availability of a system depends on how its components are arranged.

Sequential Systems: The total availability is the product of the availability of each component. If one component fails, the entire system fails. For example, two components with 99.9% availability in sequence result in a total availability of 99.8%.
Parallel Systems: The total availability is calculated differently, resulting in a much higher number. If one component fails, another can still serve the request. Two components with 99.9% availability in parallel result in a total availability of 99.9999%.

This shows that arranging components in parallel can significantly increase a system’s availability.

Ensuring Availability

Several techniques help increase system availability:

Redundancy: Having multiple copies of critical components ensures the system can continue to work even if one component fails. Possible solutions: redundant load balancers, failover systems, or replicated data stores.
Fault Tolerance: Designing a system to handle failures gracefully, able to continue to function when there are unexpected events, often with error-handling mechanisms and self-healing systems.
Load Balancing: Distributing incoming requests across multiple servers to prevent any single server from becoming overwhelmed and to handle heavy loads effectively. Possible solutions: multiple load balancers or distributed systems.

Availability Patterns

Two main patterns support high availability: failover and replication.

Failover Patterns

Failover is the process of switching to a backup system when the primary one fails.

Active-Active Failover: Multiple systems are all actively processing requests in parallel. If one fails, the others continue working. This setup is more complex but offers better resource utilization.
Active-Passive Failover: One system is active, and one or more are passive backups. If the active system fails, a passive one takes over. This is simpler to implement but can cause a delay during the switch, potentially reducing availability.

Failover patterns can require additional hardware and can add complexity to the system. There is also the potential for data loss if the active system fails before newly written data can be replicated to the passive system.

These failover patterns are employed in relational data stores, nonrelational data stores, caches, and load balancers.

Replication Patterns

Replication is maintaining multiple copies of data to improve availability and fault tolerance.

Multileader Replication: All systems can read and write data. This offers flexibility and better resource use but is complex, as it requires managing conflicts and ensuring consistency.
Single-Leader Replication: One system (the leader) handles all data writes and reads, while other systems (followers) only handle reads. The leader replicates the data to the followers. This is simpler but can lead to reduced availability if the leader fails, as writes cannot happen until a new leader is chosen. We will cover how relational and nonrelational data stores ensure availability using single-leader and multileader replication in Chapters 2 and 3.

Reliability

Reliability in system design is the ability of a system to perform its function consistently without failing over a period of time. It measures the system’s trustworthiness. Reliability is often expressed as a probability or a percentage. For example, a system with 99% reliability will only fail 1% of the time.

Measuring Reliability

Reliability is measured using two key metrics: Mean Time Between Failures (MTBF) and Mean Time to Repair (MTTR).

Mean Time Between Failures (MTBF): This is the average time a system operates before a failure occurs. A higher MTBF means the system is more reliable. For example, high-reliability servers might aim for an MTBF of over 50,000 hours, while consumer laptops might have an MTBF of 5,000-10,000 hours. A high MTBF allows for better planning of maintenance and can indicate design improvements. $MTBF = \frac{(Total Elapsed Time - Sum total of times system was down)}{Total number of failures}$ MTBF helps estimate the frequency of failures, allowing for better planning and budgeting of maintenance activities. If MTBF is low for a critical component, redundancy can be implemented to minimize downtime. In high-availability setups, multiple servers or components with lower MTBFs may be clustered to achieve overall reliability. As MTBF increases with design changes, you can quantitatively assess reliability improvements, ultimately leading to fewer failures and reduced operational costs.
Mean Time to Repair (MTTR): This is the average time it takes to fix a system after a failure. A lower MTTR means the system can be restored to service more quickly. $MTTR = \frac{(Total Maintenance Time)}{Total number of repairs}$

Together, MTBF and MTTR provide a complete picture of a system’s reliability. A system is considered more reliable if it has a high MTBF (it fails less often) and a low MTTR (it’s fixed quickly when it does fail).

Reliability and Availability

While often confused, reliability and availability are distinct but related concepts. A system can be:

Reliable but not available: It might not fail often, but it might not be accessible when needed.
Available but not reliable: It’s usually accessible, but it fails frequently.

To achieve both high reliability and high availability, systems must be designed with redundant components and robust failover mechanisms. Regular maintenance and testing are also important to ensure the system performs at its best.

Scalability

Scalability is a system’s ability to handle increasing workload by adding more resources. This workload can be either more user requests or more data storage. There are two main ways to scale a system: vertical scaling and horizontal scaling.

Vertical Scaling

Vertical scaling means increasing the capacity of a single server. This is done by adding more resources like CPU, RAM, or storage to the existing machine. This approach is good for predictable traffic because it makes the most of the existing server. However, there is a limit to how much a single server can be upgraded, and the cost of powerful, high-end hardware is usually very high.

Horizontal Scaling

Horizontal scaling involves adding more servers to a system to handle the increased load. This method is effective for unpredictable traffic, as you can easily add more servers as demand grows. This is generally more cost-effective than vertical scaling because you can use less expensive, commodity hardware. The main challenge with horizontal scaling is the complexity of managing and coordinating many servers to act as a single, unified system.

Initially, a system can start with vertical scaling. When it reaches the limits of a single server, it can then transition to horizontal scaling to handle further growth.

Maintainability

Maintainability is a system’s ability to be easily modified, adapted, or extended to meet changing needs. A maintainable system is flexible and easy to work with, which ensures smooth operations. It involves three key aspects: operability, lucidity, and modifiability.

Aspects of Maintainability

Operability: This is about a system’s ability to run smoothly under normal conditions and recover quickly from failures. Good operability reduces the time and effort needed to keep the system running, which in turn contributes to its overall stability, reliability, and availability.
Lucidity: A system is lucid if it is simple and clear to understand. This makes it easier for a team to add new features, fix bugs, and collaborate effectively. A lucid design also simplifies debugging and reduces the risk of new errors being introduced during changes.
Modifiability: This requires a system to be built in a modular way, meaning its parts can be changed or extended easily without affecting other parts. Modifiability is crucial for a system to evolve with new business needs and technology. A system that is not easily modifiable can become outdated and difficult to improve.

By focusing on these three aspects, organizations can reduce costs, improve productivity, and increase the long-term value of their software.

Fault Tolerance

Fault tolerance is a system’s ability to recover from hardware or software failures and continue to function. This means avoiding a single point of failure by having a plan to redirect requests to working parts of the system. Data safety is a critical part of fault tolerance, and it can be ensured through replication and checkpointing.

Replication

Replication is a fault tolerance mechanism where both the service and the data are copied across multiple servers. If a server or a data store fails, a functioning replica takes over, ensuring the system remains available.

Checkpointing

Checkpointing ensures data is safely stored and backed up even after its initial processing. It works by periodically saving the system’s state. If a failure occurs, the system can be restored to its most recent saved state, preventing data loss. Checkpointing can be done in two ways:

Synchronous Checkpointing: The system pauses all data-changing requests and only allows read requests while the checkpointing process is completed. This guarantees a consistent data state across all servers.
Asynchronous Checkpointing: The system continues to serve all requests (including data mutations) while the checkpointing happens in the background. This can lead to temporary data inconsistencies across servers.

Key Metrics for Checkpointing

Checkpointing’s effectiveness is measured using two key metrics:

Recovery Point Objective (RPO): This defines the maximum acceptable amount of data loss. It’s the time between the last successful checkpoint and the time of failure. A smaller RPO (more frequent checkpoints) means less data is lost, but it increases the system’s workload.
Recovery Time Objective (RTO): This defines the maximum acceptable downtime after a failure. Checkpointing helps reduce RTO because the system can be restored from the last checkpoint instead of starting from the beginning. A fast RTO is crucial for large-scale systems.

Fallacies of Distributed Computing

The Fallacies of Distributed Computing are a set of false assumptions that developers often make when designing and building distributed systems. These assumptions, if not addressed, can lead to serious system failures and performance problems.

The Eight Fallacies

The network is reliable: Networks are inherently unreliable and can fail due to various factors, like hardware issues or power outages. Systems must be designed with network fault tolerance to handle these failures.
Latency is zero: Data transfer is limited by the speed of light. There will always be some delay. Systems should be designed to minimize this latency by, for example, placing servers in data centers closer to clients.
Bandwidth is infinite: The amount of data that can flow through a network at one time is limited. High data volumes can cause congestion and delays. To avoid this, use lightweight data formats to reduce network traffic.
The network is secure: Networks are vulnerable to various security threats like malware and unencrypted communication. Systems must be built with a security-first mindset, including defense testing and threat modeling.
Topology doesn’t change: The arrangement of nodes in a distributed system is dynamic. Nodes can be added or fail. Systems should be built to be unaware of the underlying topology and be able to adapt to its changes.
There is one administrator: In large systems, there are multiple teams and administrators. The system should be built in a decoupled way, making it easier for different teams to manage and troubleshoot their specific parts.
Transport cost is zero: There are significant costs associated with network hardware, software, and the people to manage it. These transport costs must be accounted for in the budget.
The network is homogeneous: A network is made of diverse devices with different protocols and configurations. Systems must focus on interoperability, meaning they can communicate and work together despite these differences.

Neglecting these fallacies can lead to serious issues, including performance bottlenecks, data inconsistencies, and security vulnerabilities. It is crucial to acknowledge and plan for them during the system design process.

System Design Trade-offs

System design involves balancing various factors to meet user needs, including cost, scalability, reliability, and maintainability. Making the right trade-offs is crucial. For example, a highly reliable system may require more expensive components, but these components may also be more robust and scalable.

Time Versus Space

The time-space trade-off is a fundamental concept in computer science. It means that to improve an algorithm’s speed (reduce time), you might need to use more memory or storage (increase space). A common example is using pre-calculated values stored in a lookup table instead of re-calculating them every time. This saves time but uses more memory.

Latency Versus Throughput

Another trade-off involves two key performance metrics:

Latency
Throughput

Latency, processing time, and response time

Response time is the total time from sending a request to receiving a response, including:

Latency: The time a request waits to be handled.
Processing time: the time taken by the system to process the request once it is picked up.

Response Time = Latency + Processing time

Throughput and bandwidth

Throughput and bandwidth are metrics of network data capacity and are used to account for network scalability and load. Bandwidth refers to the maximum amount of data that could theoretically travel from one point in the network to another in a given time. Throughput refers to the actual amount of data transmitted and processed throughout the network. Thus, bandwidth describes the theoretical limit, and throughput provides the empirical metric. The throughput is always lower than the bandwidth unless the network is operating at its maximum efficiency.

The relationship between latency and throughput

Since latency measures how long the packets take to reach the destination in a network, while throughput measures how many packets are processed within a specified period of time, they have an inverse relationship. High latency causes data packets to pile up in the network, which slows down processing and ultimately reduces the system’s throughput. To measure latency under heavy load, system designers use percentiles (such as p50, p90, and p99). For example, the p90 latency indicates the slowest response time for the fastest 90% of requests. In other words, 90% of requests have responses that are equal to or faster than the p90 latency value. This is a better metric than using an average, which can be easily skewed by a few extremely slow requests (outliers).

Because of this inverse relationship, as you increase the system’s load to achieve higher throughput, the latency will also increase. Therefore, systems should be designed to achieve maximum throughput while staying within an acceptable latency range.

Performance Versus Scalability

Performance is about how fast a system responds to a single request. A system has a performance problem if it is slow for one user (e.g., p50 latency = 100 ms). Scalability, on the other hand, is the system’s ability to handle a growing workload and respond to increased demand. A system has a scalability problem if it is fast for a small number of users but becomes slow when the number of users increases significantly (e.g., fast for 100 requests but slow for 100,000 requests). A truly scalable system improves performance in proportion to the resources added.

Consistency Versus Availability

In a distributed system, where network failures are a real possibility (i.e., packets get dropped or delayed due to the fallacies of distributed computing leading to partitions), there’s a fundamental trade-off between strong consistency and high availability.

Strong Consistency: Guarantees that every read request returns the most recent write. All users see the same, up-to-date data at the same time.
High Availability: Ensures that the system always responds to a request, even if some parts of the system have failed.

The CAP Theorem

The CAP theorem (also known as Brewer’s theorem) states that it’s impossible for a distributed system to simultaneously guarantee all three of the following properties:

Consistency: All nodes have the same data.
Availability: Every request gets a response.
Partition Tolerance: The system continues to operate despite network failures (partitions) that prevent nodes from communicating.

Since networks are unreliable, partition tolerance is a necessity in any distributed system. Therefore, the practical implication of the CAP theorem is that, in the presence of a network partition, a system must choose between consistency and availability.

A common misunderstanding is that you must abandon one of the three properties all the time. In reality, this trade-off only applies during a network partition. When the network is healthy, a different trade-off comes into play, which is explained by the PACELC theorem.

The PACELC Theorem

The PACELC theorem provides a more detailed framework than the CAP theorem. It says that:

P (if there’s a Partition), you must choose between Availability and Consistency.
E (if there’s an Else, i.e., no partition), you must choose between Latency and Consistency.

This second part (ELC) highlights a key trade-off that is always present in distributed systems, even during normal operation. To achieve strong consistency when replicating data across multiple nodes, the system must use synchronous communication. This means a node must wait for confirmation from all other replica nodes before completing a write, which adds high latency. On the other hand, a system that prioritizes low latency might use asynchronous replication, which leads to eventual consistency, where a read might temporarily return stale data.

System Design Guidelines

1. Guideline of Isolation: Build It Modularly

The core idea is to break a complex system down into smaller, independent components or modules. These modules can work together to form a larger system but can also function and be managed on their own. This modular approach provides several benefits:

Maintainability: Modules can be updated or replaced individually without affecting the rest of the system.
Reusability: Modules can be used in different projects, reducing the need to build new components from scratch.
Scalability: Modules can be added, removed, or scaled independently as needed to handle growth.
Reliability: Since modules can be tested on their own, it reduces the risk of system-wide failures.

This guideline helps manage complexity and improve the overall quality of a large-scale system.

2. Guideline of Simplicity: Keep It Simple, Silly (KISS)

This guideline emphasizes that a system design should be as simple as possible, avoiding unnecessary complexity and over-engineering. To follow this principle, you should:

Identify core requirements: Focus on the essential features that the system absolutely needs.
Minimize components: Use only the necessary number of components, ensuring each has a clear purpose.
Avoid over-engineering: Do not add extra features or complexity that aren’t required.
Make it easy to use: The system should be intuitive for both users and developers.
Test and refine: Continuously test the system and simplify it if necessary.

Building a simple system makes it more efficient, easier to maintain, and less prone to failure.

3. Guideline of Performance: Metrics Don’t Lie

This guideline highlights the importance of measuring system performance using metrics and observability. You can’t just guess how a system is performing; you have to measure it.

Metrics are quantitative measures (e.g., resource utilization, response times, error rates) that provide a way to track a system’s performance and identify trends.
Observability is the ability to understand the internal state of a system from its external outputs. This helps in monitoring system health and diagnosing problems in real-time.

Together, metrics and observability give you the information needed to make informed decisions, detect performance bottlenecks, and resolve issues before they become serious problems.

4. Guideline of Trade-offs: There Is No Such Thing as a Free Lunch

This guideline, often abbreviated as TINSTAAFL, means that every design decision comes with a trade-off. Optimizing for one aspect, such as performance, often comes at the cost of another, like maintainability or cost. A system designer must carefully weigh these competing factors and make informed choices. There is no single perfect solution for every situation; a design must be tailored to the specific requirements and constraints of the project.

5. Guideline of Use Cases: It Always Depends

This final guideline emphasizes that there is no single “best” way to design a system. The right design always depends on a variety of factors, including user needs, technical constraints, budget, scalability, and regulations. Because of these complex and competing factors, a system designer’s goal is not to find a perfect solution, but rather a reasonable and effective one that meets the project’s specific needs.

Quartz 4

Explorer

01. System Design Trade-Offs And Guidelines