10. Consensus algorithms — how distributed systems agree without trusting timing¶
⏱️ Estimated time: 18 min | Level: advanced
ELI5 callback: The town crier pins one notice on the bulletin board. The board rules decide delivery, and the town directory helps readers find it.
Why agreement is hard once machines can delay or disappear¶
Distributed systems fail by delay, duplication, and split views. Two nodes may each think the other one died. See. Without agreement, both may accept conflicting writes. Consensus exists to let a cluster choose one safe history. It requires quorums, durable logs, and careful leadership rules. That cost is high, but the alternative is worse. Diagram:
node A sees timeout node B sees timeout
each side suspects failure independently
write accepted here? write accepted there?
+--------- network partition ---------+
quorum rule decides which side may continue
the losing side must stop pretending
- A client asks to append a new value.
- One replica times out talking to another.
- Local guesses are no longer enough.
- The cluster needs a shared safety rule.
- Consensus supplies that rule through quorum-backed decisions.
- Failure detection is never perfect.
- Safety matters more than immediate progress.
- Durable logs matter because memory forgets on restart.
- Quorums prevent one noisy node from inventing history. Consensus begins where trust in timing ends.
Raft makes the core flow easier to reason about¶
Raft organizes nodes into leader, followers, and candidates. One elected town crier proposes the next state change. Each replicated notice carries term, index, and command data. Followers append entries and acknowledge them. An entry commits after a majority stores it. Raft is popular because the flow is teachable. Now watch. Diagram:
client -> leader -> followers
append entry -> replicate -> ack majority
commit index advances after quorum
followers apply only committed entries
old leader with stale term must step down
elections choose a new leader when needed
- The leader receives a command.
- It appends the entry locally first.
- It sends AppendEntries to followers.
- A majority acknowledges the entry.
- Only then does the leader mark it committed.
- Terms prevent stale leadership from lingering.
- Commit needs a majority, not good feelings.
- Followers stay mostly simple by design.
- Heartbeats are lightweight proofs of current leadership. Raft is simple enough to teach, not simple enough to wing.
Paxos solves the same safety problem with a different mental model¶
Paxos talks in proposers, acceptors, and learners. It focuses on choosing one value safely despite unreliable timing. The language feels heavier than Raft, but the safety goal is similar. See. Proposal numbers prevent older proposals from overwriting newer promises. Acceptors promise before they accept. Learners observe what value actually won. Diagram:
proposer -> prepare(n) -> acceptors
acceptors -> promise(n) -> proposer
proposer -> accept(value,n) -> acceptors
majority accepted -> learners discover chosen value
higher proposal numbers can supersede stalled attempts
safety holds even when retries happen repeatedly
- A proposer sends Prepare with a proposal number.
- Acceptors promise not to accept lower numbers later.
- The proposer picks the safe value to continue with.
- It sends Accept requests using that same number.
- A majority acceptance makes the value chosen.
- Paxos prizes safety under ugly timing assumptions.
- The protocol is smaller than its reputation suggests.
- Multi-Paxos adds a stable leader for efficiency.
- Implementation clarity matters as much as theory knowledge. Do not fear Paxos; fear an incomplete implementation of Paxos.
Quorums, terms, and logs are the non-negotiable safety machinery¶
The replicated log acts like a guarded bulletin board. But not every appended entry is committed immediately. Quorum checks, term comparisons, and commit indices are board rules. These rules stop stale leaders from overwriting safe history. Leases alone are not enough. Durability depends on what a majority has accepted. Simple, no? Diagram:
append locally != committed globally
majority stored entry -> eligible to commit
stale term -> reject leadership claim
conflicting suffix -> overwrite only under protocol rules
snapshot and compaction keep logs manageable
safety comes from intersections of majorities
- Entry 42 reaches one follower only.
- The leader crashes before quorum acknowledgment.
- Entry 42 is not safely committed yet.
- A new leader may replace that uncommitted tail.
- Committed entries survive because quorums intersect.
- Committed and merely replicated are different states.
- Majority intersection is the heart of safety.
- Compaction must preserve the committed prefix.
- Client acknowledgments should follow commit, not wishful append. The protocol sounds strict because data loss is expensive.
Consensus gives ordered truth, not total system perfection¶
Clients still need a town directory or redirect path to the leader. Consensus picks a safe history, not a perfect user experience. It does not remove latency from distant regions. It does not make non-idempotent clients magically safe. Snapshotting, compaction, and membership changes remain hard. So what to do? Use consensus only for data that truly needs one agreed order. Diagram:
client hits follower -> redirect to leader
leader changes -> clients refresh routing info
snapshots trim old log safely
membership changes follow controlled joint rules
wide area latency still shapes commit speed
consensus protects order, not every product requirement
- Put metadata, locks, or config changes behind consensus.
- Keep bulky analytics streams outside that hot path.
- Test leader failover under load.
- Rehearse snapshot restore and member replacement.
- Explain to clients what errors are retryable.
- Consensus is for the small set of truly critical decisions.
- Overusing it adds latency and operational weight.
- Client retry logic still matters a lot.
- Safe ordering is valuable precisely because it is expensive. Use the hammer where nails exist, not on every screw.
Where this lives in the wild¶
- etcd, where Raft protects cluster metadata and service coordination.
- Apache ZooKeeper, where Zab-style consensus coordinates shared state.
- Kafka KRaft mode, where controller metadata needs agreed ordering.
- CockroachDB, where consensus-backed replicas protect transactional ranges.
- Consul, where cluster membership and key metadata rely on agreement.
Pause and recall¶
- Why can two healthy-looking nodes still disagree about the world?
- What does a majority give that a timeout cannot give?
- Why is a replicated entry not always a committed entry?
- When should you avoid putting a workload behind consensus?
Interview Q&A¶
Q: What does a consensus algorithm guarantee at a high level? A: It guarantees that replicas can agree on one safe ordered history despite crashes and timing uncertainty. Common wrong answer to avoid: "It guarantees zero latency and perfect availability under every failure."
Q: Why is quorum central to consensus? A: Because intersecting majorities preserve safety when leaders change and messages arrive late. Common wrong answer to avoid: "Quorum is only about performance tuning, not correctness."
Q: Why do people say Raft is easier than Paxos? A: Because its roles and leader-driven flow are easier to visualize, even though implementation still demands rigor. Common wrong answer to avoid: "Raft is trivial, so production code can skip many edge cases."
Q: What does consensus not solve for you? A: It does not solve client idempotency, global low latency, or every application-level workflow concern. Common wrong answer to avoid: "Once you add consensus, the distributed system becomes simple everywhere."
Apply now (5 min)¶
Take a three-node metadata service. Write the quorum size and the client write acknowledgment rule. Sketch one leader failover where an uncommitted entry disappears. Then sketch one committed entry that must survive the failover. List which data in your system truly deserves consensus. Add one client redirect rule for follower responses. Now watch safety and latency become separate design questions.
Bridge. Consensus gives us agreement. But no system is perfect — next we tackle leader election. → 11