Key Takeaways
- CAP theorem: during a network partition, you choose between consistency and availability
- Eventual consistency is appropriate for many use cases but not for financial or inventory systems
- Raft is the most widely adopted consensus algorithm in modern systems; it powers etcd, CockroachDB, and TiKV
- Leader-follower replication is simple but has a single write bottleneck; multi-leader and leaderless trade complexity for write scale
- Hash-based and range-based partitioning distribute data across nodes — picking the wrong key causes hot spots
Single Machines Hit Physical Limits — Distribution Is the Only Answer
A single machine can only scale so far: you can add more RAM, faster CPUs, and bigger disks, but beyond a certain point a distributed system is the only solution. Distribution, however, introduces problems that don't exist on a single machine: partial failures, network delays, inconsistent clocks, and the fundamental challenge of making multiple nodes agree on the state of the world.
Distributed systems power everything at scale: Google Spanner, Amazon DynamoDB, Meta's Cassandra clusters, financial trading systems, and every major cloud platform. Understanding the underlying principles — not just the tools — is what separates a senior engineer from everyone else.
CAP Theorem: The Fundamental Tradeoff
The CAP theorem states that any distributed data store can provide at most two of three guarantees:
- Consistency (C) — Every read returns the most recent write or an error. All nodes see the same data at the same time.
- Availability (A) — Every request receives a response (not necessarily the most recent data).
- Partition Tolerance (P) — The system continues operating when network partitions (communication failures between nodes) occur.
In practice, P is not optional — network partitions happen. The real choice is: when a partition occurs, do you sacrifice consistency (allow stale reads) or availability (refuse to respond)? Different systems make different choices:
| System | CAP Choice | Trade |
|---|---|---|
| PostgreSQL (single node) | CA | Not partition tolerant; CAP doesn't strictly apply to a single node |
| Cassandra, DynamoDB | AP | Available during partitions, may return stale data |
| HBase, ZooKeeper | CP | Consistent but may refuse requests during partitions |
| CockroachDB, Spanner | CP (near) | Strong consistency with global distribution, higher latency |
Consistency Models: A Spectrum
Consistency is not binary. There are multiple levels, from strongest to weakest:
- Linearizability (strong consistency) — Operations appear to happen instantaneously at a single point in time. The strongest guarantee. Expensive — requires coordination across nodes.
- Sequential consistency — All operations appear in the same order to all nodes, but not necessarily in real-time order.
- Causal consistency — Causally related operations are seen in order by all nodes. Unrelated operations may be seen in different orders. Good balance of correctness and performance.
- Eventual consistency — Given no new updates, all nodes eventually converge. Reads may be stale. Used by DNS, shopping carts, social media counters.
- Read-your-writes — After you write a value, your subsequent reads will see that write (or a newer one). Not guaranteed for other users. Common in social media.
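Read-your-writes can be enforced on the client side by tracking a version token. A minimal sketch, assuming a single leader, numeric versions, and in-process objects standing in for network calls (all class names here are illustrative):

```python
class Replica:
    """A replica that stores one versioned value."""
    def __init__(self):
        self.version = 0
        self.value = None

    def write(self, value, version):
        if version > self.version:
            self.version, self.value = version, value

    def read(self):
        return self.version, self.value


class Client:
    """Remembers the highest version it has written, and skips any
    replica that has not yet caught up to that version."""
    def __init__(self, leader, replicas):
        self.leader = leader
        self.replicas = replicas  # read preference order
        self.last_written = 0

    def write(self, value):
        self.last_written += 1
        self.leader.write(value, self.last_written)

    def read(self):
        for r in self.replicas:
            version, value = r.read()
            if version >= self.last_written:
                return value  # fresh enough to include our own writes
        raise RuntimeError("no replica has caught up yet")


leader, follower = Replica(), Replica()
# The client prefers the follower, which is stale after a write,
# so the read falls through to the leader.
client = Client(leader, [follower, leader])
client.write("v1")
print(client.read())  # prints "v1"
```

Other clients, which hold no token for this write, may still read the stale follower: the guarantee is per-session, exactly as described above.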
Replication Strategies
Replication copies data across multiple nodes for fault tolerance and read scaling.
Single-Leader (Primary-Replica): One leader accepts all writes. Followers replicate and serve reads. Simple, strong consistency possible, but write bottleneck at leader. Used by PostgreSQL streaming replication, MySQL replication.
Multi-Leader: Multiple nodes accept writes, and changes propagate to all leaders. Enables geographic distribution of writes, but write conflicts occur when the same data is modified on two leaders concurrently, and conflict resolution is hard. Used by CouchDB and multi-region active-active deployments such as DynamoDB global tables.
Leaderless (Dynamo-style): Any node accepts reads and writes. The client or a coordinator writes to W nodes and reads from R nodes. If W + R > N (total replicas), every read quorum overlaps every write quorum, so at least one replica in the read set holds the latest write. Used by Cassandra, DynamoDB, Riak.
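The quorum-overlap argument can be sketched in a few lines. A toy simulation, assuming N=3, W=2, R=2 and plain integer versions in place of the vector clocks real systems use:

```python
import random

# Dynamo-style quorum I/O with W + R > N, so every read quorum
# is guaranteed to overlap every write quorum in at least one node.
N, W, R = 3, 2, 2
replicas = [{"version": 0, "value": None} for _ in range(N)]

def write(value, version):
    # Write to W randomly chosen replicas; the rest stay stale
    # until anti-entropy repair (not modeled here).
    for r in random.sample(replicas, W):
        if version > r["version"]:
            r["version"], r["value"] = version, value

def read():
    # Read R replicas and keep the freshest response.
    responses = random.sample(replicas, R)
    freshest = max(responses, key=lambda r: r["version"])
    return freshest["value"]

write("cart=[book]", version=1)
print(read())  # prints "cart=[book]" because the quorums overlap
```

With W=2 and R=2 out of N=3, any two-node read set must include at least one of the two nodes that accepted the write, so the `max` over versions always surfaces it, regardless of which replicas are sampled.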
Consensus Algorithms: Getting Nodes to Agree
Consensus is the problem of getting a group of nodes to agree on a value even when some nodes fail or messages are delayed. This is the hard core of distributed systems.
Raft is the dominant consensus algorithm in modern systems. It works in phases:
- Leader Election — Nodes start as followers. If no heartbeat from the leader arrives within a randomized timeout, a follower becomes a candidate and requests votes. If it gains a majority, it becomes leader.
- Log Replication — All writes go to the leader. Leader appends entry to its log and sends AppendEntries RPCs to followers. When a majority acknowledges, the entry is committed.
- Safety — Raft guarantees only one leader per term, and the leader always has all committed log entries.
Raft is used in: etcd (Kubernetes distributed state), CockroachDB, TiKV (TiDB), Consul, and many other systems.
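The election mechanics above can be sketched as a toy simulation. This assumes a fully connected cluster with no message loss, and skips log comparison, term conflicts, and split-vote retries, so it shows only the randomized-timeout and majority-vote core:

```python
import random

class Node:
    """Tracks the current term and grants at most one vote per term."""
    def __init__(self, node_id):
        self.node_id = node_id
        self.term = 0
        self.voted_for = None

    def request_vote(self, candidate_id, term):
        if term > self.term:
            # Newer term observed: step down and forget our old vote.
            self.term, self.voted_for = term, None
        if term == self.term and self.voted_for in (None, candidate_id):
            self.voted_for = candidate_id
            return True
        return False

def elect(nodes):
    # The node whose randomized election timeout (here 150-300 ms,
    # as in the Raft paper) fires first becomes the candidate.
    candidate = min(nodes, key=lambda n: random.uniform(150, 300))
    term = candidate.term + 1
    votes = sum(n.request_vote(candidate.node_id, term) for n in nodes)
    if votes > len(nodes) // 2:
        return candidate  # majority reached: leader for this term
    return None

cluster = [Node(i) for i in range(5)]
leader = elect(cluster)
print(leader is not None)  # True: the unopposed candidate wins 5/5
```

The one-vote-per-term rule in `request_vote` is what guarantees at most one leader per term: two candidates in the same term cannot both assemble a majority.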
Data Partitioning: Spreading Data Across Nodes
When data is too large for one node, partition it (also called sharding):
Range partitioning — Divide data by key ranges. Good for range queries. Risk: if data isn't evenly distributed (e.g., names starting with A-C are more common), some nodes become hot spots.
Hash partitioning — Hash the key to determine the node. Even distribution by default. Bad for range queries — consecutive keys go to different nodes. Used by Cassandra, DynamoDB.
Consistent hashing — Hash ring where nodes and keys are placed on the ring. Adding/removing nodes only affects adjacent keys — minimizes data movement. Used by Cassandra, Akamai CDN, Amazon Dynamo.
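The ring structure is compact enough to sketch directly. A minimal consistent-hash ring, assuming MD5 for positions and no virtual nodes (production systems like Cassandra add virtual nodes for smoother balance):

```python
import bisect
import hashlib

def ring_position(key):
    # Map any string to a point on the ring (a large integer).
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class HashRing:
    def __init__(self, nodes):
        # Place each node on the ring, sorted by position.
        self.ring = sorted((ring_position(n), n) for n in nodes)

    def node_for(self, key):
        # Walk clockwise to the first node at or after the key's
        # position, wrapping around at the end of the ring.
        positions = [p for p, _ in self.ring]
        i = bisect.bisect(positions, ring_position(key)) % len(self.ring)
        return self.ring[i][1]

ring = HashRing(["node-a", "node-b", "node-c"])
owner_before = ring.node_for("user:42")
ring = HashRing(["node-a", "node-b", "node-c", "node-d"])
# Only keys between node-d and its predecessor change owner;
# with hash(key) % N, adding a node would remap nearly every key.
```

This is the property that makes rebalancing cheap: adding `node-d` moves only the keys in one arc of the ring, while naive modulo hashing reshuffles almost everything.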
Modern Distributed Databases Worth Knowing
| Database | Model | CAP | Best For |
|---|---|---|---|
| CockroachDB | SQL (distributed) | CP | Global SQL with strong consistency, OLTP |
| Google Spanner | SQL (distributed) | CP | Global transactions via TrueTime (GPS and atomic clocks) |
| Cassandra | Wide-column | AP | High write throughput, time-series, IoT |
| DynamoDB | Key-value/document | AP | AWS-native serverless, predictable latency |
| TiDB | SQL (distributed) | CP | MySQL-compatible scale-out HTAP |
Learn Systems Design at Precision AI Academy
Our bootcamp covers database design, distributed systems fundamentals, and cloud architecture — the skills that define senior engineers. Five cities, October 2026.
Frequently Asked Questions
What is the CAP theorem and does it still matter?
CAP says distributed systems can guarantee at most two of: consistency, availability, partition tolerance. Since partitions are inevitable, the real choice is between consistency and availability during a partition. The PACELC model extends CAP by also considering latency vs. consistency during normal operation.
How does the Raft consensus algorithm work?
Raft elects a single leader via randomized timeouts. The leader accepts all writes, replicates them to followers, and commits once a quorum acknowledges. If the leader fails, a new election occurs. Used by etcd, CockroachDB, and many other production systems.
What is eventual consistency and when should you use it?
Eventual consistency allows stale reads temporarily, with all nodes converging over time. Good for social media counters, shopping carts, DNS. Not appropriate for financial account balances, inventory systems, or anywhere stale data causes real-world harm.