07. Replication Strategies — Spreading Data Across Machines¶

~13 min read. One database server dies at 2 AM. Replication decides whether users notice.

Built on the ELI5 in 00-eli5.md. The branch library — multiple physical locations holding copies of the same catalogue — maps directly onto every replication topology here.

1. Why Replicate at All¶

Three reasons drive replication decisions in every system design interview.

┌──────────────────────────────────────────────────────────┐
│  Availability   One node fails; others serve reads/writes│
│  Read scale     Route read queries to replica nodes      │
│  Geo latency    Serve users from the nearest replica     │
└──────────────────────────────────────────────────────────┘

A single branch library location burns down. Every book is lost unless copies exist elsewhere. Same logic applies to database replicas. Durability and availability are separate problems but replication solves both.

2. Leader-Follower (Primary-Replica)¶

The most common topology. One node accepts all writes. One or more follower nodes receive changes and apply them.

         Writes
           │
           ▼
    ┌─────────────┐
    │   Leader    │──── replication stream ────┬─────────────────┐
    └─────────────┘                            ▼                 ▼
                                       ┌─────────────┐  ┌─────────────┐
                                       │  Follower 1 │  │  Follower 2 │
                                       └─────────────┘  └─────────────┘
                                           Reads             Reads

Synchronous vs Asynchronous Replication¶

Synchronous: Leader waits for at least one follower to confirm the write before acknowledging the client. Zero data loss on leader failure. Higher write latency.

Asynchronous: Leader acknowledges as soon as it writes locally. Followers catch up eventually. Lower latency. Risk: if the leader fails before followers replicate, you lose committed data.

Most systems use semi-synchronous: one follower is sync, the rest are async. PostgreSQL streaming replication with synchronous_standby_names does exactly this.

Replication Lag¶

Followers are always slightly behind the leader. This is replication lag. Reading your own write from a follower immediately after writing to the leader can return stale data.

Client writes to leader at T=0.
Client reads from follower at T=1ms.
Follower is 50ms behind. Client sees old value.

Mitigation strategies: - Read-after-write consistency: route a user's own reads to the leader for a short window after they write. - Monotonic reads: always route a given session to the same replica so time never goes backwards. - Bounded staleness: reject reads from replicas lagging more than N seconds.

3. Multi-Leader Replication¶

Multiple nodes accept writes. Each leader replicates to all others.

    ┌─────────────┐        ┌─────────────┐
    │  Leader A   │◄──────►│  Leader B   │
    └─────────────┘        └─────────────┘
           │                      │
     Datacenter 1           Datacenter 2

Use case: multi-datacenter deployments where cross-datacenter write latency is unacceptable.

The problem is conflict resolution. Two leaders accept conflicting writes to the same row at the same time.

Leader A: UPDATE price SET value=10 WHERE id=1;
Leader B: UPDATE price SET value=20 WHERE id=1;
-- Both commit locally. Who wins?

Conflict Resolution Strategies¶

Strategy	Description	Risk
Last-write-wins (LWW)	Higher timestamp wins	Clock skew causes data loss
On-write hooks	Application merges at replication time	Complex logic
On-read merge	Return both values; app resolves	UX complexity
CRDTs	Data structures that merge automatically	Limited data types

The branch library scenario: two branch catalogues both mark a book as available simultaneously. LWW picks one; CRDT could merge counts instead.

4. Leaderless Replication¶

No designated leader. Client writes to multiple nodes directly. Popularised by Amazon Dynamo, used in Cassandra and Riak.

         Write(val=X)
              │
    ┌─────────┼─────────┐
    ▼         ▼         ▼
┌───────┐ ┌───────┐ ┌───────┐
│ Node1 │ │ Node2 │ │ Node3 │   W=2 quorum → success
└───────┘ └───────┘ └───────┘
    ✓         ✓        ✗

Quorum reads and writes: with N replicas, write to W nodes, read from R nodes. If W + R > N, at least one node in every read set overlaps a write set.

Common configuration: N=3, W=2, R=2. Tolerates one node failure.

Read repair: when a read returns stale data from some nodes, the coordinator writes the fresh version back to those nodes in the background.

Anti-entropy: background process that compares nodes using Merkle trees and syncs differences. Does not guarantee ordering.

5. Failover — What Happens When the Leader Dies¶

Automatic failover steps in a typical leader-follower setup:

1. Health check detects leader is unresponsive (timeout ~30s)
2. Remaining nodes hold an election (Raft or Paxos)
3. Most up-to-date follower becomes new leader
4. Other followers re-point replication stream to new leader
5. Application reconnects (DNS TTL or proxy flip)

Split-brain risk: old leader recovers while a new leader is already serving writes. Two nodes think they are leader. Both accept writes. Data diverges.

Prevention: - STONITH (shoot the other node in the head): forcibly fence the old leader before promoting a new one. - Leader lease expiry: old leader stops accepting writes when its lease expires.

In practice, teams pair failover with fencing and orchestration tools so the old leader cannot keep writing after promotion.

Where this lives in the wild¶

MySQL at GitHub (Storage Engineering) — Uses leader-follower with semi-sync replication. Orchestrator manages automatic failover. Read replicas serve git web traffic. Replication lag is monitored via Seconds_Behind_Master metric.

Cassandra at Netflix (Streaming Infrastructure) — Leaderless with N=3, W=1, R=1 for maximum write throughput. Eventual consistency is acceptable for viewing history. Repairs run nightly.

PostgreSQL at Shopify (Platform Engineering) — Each merchant shard has a primary and two streaming replicas. Logical replication feeds the analytics warehouse asynchronously.

CockroachDB at DoorDash (Backend) — Multi-leader across availability zones using Raft consensus per range. Engineers asked about replication lag trade-offs in their system design interviews.

Google Spanner (Infra) — Multi-leader globally consistent replication using TrueTime. Synchronous replication to a quorum with atomic clocks enabling external consistency without conflict resolution.

Pause and recall¶

What is replication lag and what are two strategies to mitigate its effects on read consistency?
Draw the leader-follower topology. Where do reads go? Where do writes go?
In leaderless replication with N=5, W=3, R=3, how many node failures can you tolerate on writes?
The branch library has two locations that both update the same catalogue record offline. What conflict resolution strategy would you use and why?

Interview Q&A¶

Q: What is the difference between synchronous and asynchronous replication, and when would you choose each?

Synchronous replication waits for follower acknowledgement before confirming the write. Zero data loss on leader failure but adds write latency. Asynchronous replication confirms immediately; followers catch up later. Choose synchronous for financial data and asynchronous for high-throughput write paths where small data loss is tolerable.

Common wrong answer to avoid: Saying synchronous replication has "no performance impact." Every synchronous standby adds at least one network round-trip to each write. At intercontinental distances this is 100+ ms per write. That is a major performance impact.

Q: How does a system achieve read-after-write consistency when using read replicas?

Route a user's reads back to the primary for a short window (typically 1 minute) after they perform a write. Alternatively, track the replication position of the last write and only route reads to a replica that has caught up past that position.

Common wrong answer to avoid: Saying "just wait a few seconds." Replication lag is not deterministic. Under load it can grow to minutes. A correct solution uses position-tracking, not time-based guessing.

Q: Explain split-brain and how you prevent it.

Split-brain occurs when a failed leader recovers while a new leader is already active, causing two nodes to accept writes simultaneously. Prevention requires fencing: the old leader must be forcibly shut down (STONITH) before the new leader is promoted, or leader leases that expire and prevent the old leader from accepting writes.

Common wrong answer to avoid: Saying "use ZooKeeper for leader election" without mentioning fencing. ZooKeeper elects a new leader but does not guarantee the old leader stops. Without STONITH you still get split-brain.

Q: What are CRDTs and when are they useful in replication?

Conflict-free Replicated Data Types are data structures with mathematically guaranteed merge semantics. Two replicas can accept diverging writes and always produce the same merged result. Useful for counters, sets, and last-write-wins registers where you control the data type. Not useful for arbitrary relational updates.

Common wrong answer to avoid: Claiming CRDTs solve all conflict resolution. They solve it only for data types that fit the CRDT model. A general SQL row update does not fit. You still need application-level conflict resolution for most relational data.

Apply now (5 min)¶

Exercise: Design the replication setup for the branch library system. There are three physical branches, each needing to accept book checkouts locally even during network partitions. Choose a replication topology, quorum values, and a conflict resolution strategy for the case where two branches check out the last copy of the same book simultaneously. Write your answer as a numbered design decision list.

Sketch from memory: Draw the three replication topologies (leader-follower, multi-leader, leaderless) side by side. Label write paths, read paths, and the main failure mode of each.

Bridge. You can now spread one dataset across many machines. Next: how do you split a dataset that is simply too large for any single machine? → 08-partitioning-and-sharding.md