11. Leader election: choosing one boss without accidentally choosing two
~14 min read. One cluster, one loud voice, or the whole square argues.
Built on the ELI5 in 00-eli5.md. The town crier — choosing one crier among many candidates — becomes leader election.
1) Why one leader exists at all
See, some jobs should have exactly one decider at one moment. A shard coordinator, a scheduler, or a lease owner should not have twins.
If every replica writes without coordinating first, you get duplicate work and conflicting state. That is why a cluster chooses one town crier before announcing decisions.
- Pick one leader when someone must assign partitions, issue sequence numbers, or own a failover lease.
- Avoid a leader when the job can stay partitioned, commutative, or safely parallel without coordination.
- Remember the cost. Election adds latency, quorum traffic, and temporary pauses during failover.
                  clients
                     |
                     v
+---------+    replicate log    +---------+
| node A  | <-----------------> | node B  |
+---------+                     +---------+
     \                               /
      \_____ choose one leader _____/
                     |
                     v
                leader only
Worked example. Three nodes manage a payout batch pointer. Only one node should move pointer 417 to 418. Two leaders can pay the same merchants twice.
So the real question is not ego. It is coordination cost versus duplicate-work risk. If the operation is non-idempotent, one leader often saves pain.
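The payout example can be sketched in a few lines. This is a toy model, not a real payment system: `PayoutLedger` and `run_batch` are hypothetical names, and the "twin leaders" are simulated by letting two callers read the pointer before either one advances it.

```python
from dataclasses import dataclass, field


@dataclass
class PayoutLedger:
    """Shared state: the next batch to pay and a record of payouts made."""
    pointer: int = 417
    paid_batches: list[int] = field(default_factory=list)


def run_batch(ledger: PayoutLedger, observed_pointer: int) -> None:
    """Pay the batch a node believes is next, then advance the pointer.

    The payout is non-idempotent: running it twice for the same batch
    pays the same merchants twice.
    """
    ledger.paid_batches.append(observed_pointer)
    ledger.pointer = observed_pointer + 1


# Two nodes both believe they lead, and both read pointer 417 first.
twin = PayoutLedger()
seen_by_a = twin.pointer
seen_by_b = twin.pointer
run_batch(twin, seen_by_a)
run_batch(twin, seen_by_b)

# One elected leader reads and advances the pointer one step at a time.
single = PayoutLedger()
run_batch(single, single.pointer)
run_batch(single, single.pointer)

print(twin.paid_batches)    # [417, 417]: batch 417 paid twice
print(single.paid_batches)  # [417, 418]: each batch paid once
```

The bug is not in `run_batch` itself. It appears only because two deciders acted on the same observation, which is the case election is meant to rule out.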
2) The Bully algorithm: highest priority pushes others aside
Now take the Bully algorithm. Every node has a priority, usually its numeric identifier, and the highest live priority should lead. When a node suspects the leader died, it sends an election message to every higher-priority node.
If nobody stronger answers before the timeout, the caller declares itself leader. If a stronger node answers, the caller stands down and that node runs its own election, so the highest reachable node eventually wins.
nodes: 1 2 3 5 6
leader was 6
3 times out on 6
3 -> 5 : election?
3 -> 6 : election?
5 replies: alive
5 -> 6 : election?
6 silent
5 -> all : coordinator = 5
Worked example. Node 6 was leader and crashes. Node 3 notices first, but node 5 is alive. Node 5 bullies lower nodes and becomes the new leader.
- Good part: simple mental model, fast when higher nodes are reachable, easy to simulate.
- Weak part: chatty during failures (message count can grow quadratically with cluster size in the worst case), depends on timeouts, and treats any higher-priority node that stays silent as dead.
- Hidden trap: during a partition, each side may believe stronger nodes are dead and elect separately.
That last trap matters. The algorithm answers, “Who is strongest among nodes I can see?” It does not alone answer, “Is the cluster globally safe to follow?”
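A small simulation makes both the happy path and the trap concrete. It is a sketch under stated assumptions: `bully_election` is a hypothetical helper, the `reachable` set stands in for "answers before the timeout", and responders are followed one at a time instead of running concurrently as they would in a real cluster.

```python
def bully_election(
    nodes: list[int], reachable: set[int], starter: int
) -> tuple[int, list[str]]:
    """Simulate one Bully election; a higher ID means higher priority.

    Returns the elected coordinator and a log of the messages exchanged.
    """
    log: list[str] = []
    current = starter
    while True:
        responders = []
        for peer in sorted(n for n in nodes if n > current):
            log.append(f"{current} -> {peer} : election?")
            if peer in reachable:
                log.append(f"{peer} replies: alive")
                responders.append(peer)
            else:
                log.append(f"{peer} silent")
        if not responders:
            # Nobody stronger answered, so the caller announces itself.
            log.append(f"{current} -> all : coordinator = {current}")
            return current, log
        # Simplification: hand the election to the strongest responder.
        current = max(responders)


nodes = [1, 2, 3, 5, 6]

# Node 6 has crashed; node 3 notices first.
leader, log = bully_election(nodes, reachable={1, 2, 3, 5}, starter=3)
print(leader)  # 5

# Partition: each side sees only its own nodes and elects separately.
left, _ = bully_election(nodes, reachable={1, 2, 3}, starter=1)
right, _ = bully_election(nodes, reachable={5, 6}, starter=5)
print(left, right)  # 3 6
```

The second run shows the hidden trap: nothing in the algorithm stops nodes 3 and 6 from both announcing themselves, because each side only ever asks the nodes it can reach.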
3) ZooKeeper and etcd make election practical with leases
Real systems usually avoid handwritten election loops everywhere. They move election into a coordination store and let applications campaign there.
In ZooKeeper, candidates create ephemeral sequential znodes under an election path. The smallest sequence number wins. Everyone else watches only the node just ahead.
/election
  /candidate-0003   <- leader (smallest sequence number)
  /candidate-0004   <- watches candidate-0003
  /candidate-0005   <- watches candidate-0004
smallest node = leader
session ends => znode disappears
Why watch only the predecessor? Because watching every node creates a herd. One failure wakes everyone, and the coordination service gets stampeded.
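The predecessor rule is easy to express as a pure function. The sketch below does not talk to ZooKeeper at all: `election_role` is a hypothetical helper that takes the list of child names (as a client would get from listing the election path) and assumes each name ends in a zero-padded sequence number, like the candidate names in the listing above.

```python
from typing import Optional


def election_role(my_znode: str, children: list[str]) -> tuple[bool, Optional[str]]:
    """Return (is_leader, znode_to_watch) for one candidate.

    The smallest sequence number leads; everyone else watches only the
    candidate immediately ahead of it.
    """
    ordered = sorted(children, key=lambda name: int(name.rsplit("-", 1)[1]))
    position = ordered.index(my_znode)
    if position == 0:
        return True, None
    return False, ordered[position - 1]


children = ["candidate-0004", "candidate-0003", "candidate-0005"]
print(election_role("candidate-0003", children))  # (True, None)
print(election_role("candidate-0005", children))  # (False, 'candidate-0004')

# candidate-0004's session ends; only candidate-0005 was watching it.
children.remove("candidate-0004")
print(election_role("candidate-0005", children))  # (False, 'candidate-0003')
```

When a watched znode disappears, the watcher re-lists the children and calls the function again. It may find it is now the leader, or it may simply move its watch one step forward, as candidate-0005 does here.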
In etcd, leadership is usually tied to a lease. A candidate keeps renewing the lease while healthy. If renewals stop, the key expires and others can campaign.
- ZooKeeper ephemeral nodes tie leadership to a live session, not a hopeful process flag.
- etcd leases tie leadership to periodic keepalives, so stale holders naturally lose ownership.
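- Both patterns work better when clients read the current leader through the quorum rather than from a stale cache, and when every leadership change carries a number that only ever increases.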
So your application code does not keep shouting, “I am the town crier.” The store records who currently holds the speaking stick.
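Worked example. A cron coordinator renews a ten-second lease every three seconds. If the renewals at 3, 6, and 9 seconds all fail, the lease expires at 10 seconds, before the fourth attempt at 12 seconds. Only after that expiry may followers start a fresh campaign.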
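The lease arithmetic can be checked with a toy model. This is not the etcd client API: `Lease` and `missed_renewals_before_expiry` are hypothetical names, and time is passed in explicitly as seconds so the example stays deterministic.

```python
from dataclasses import dataclass


@dataclass
class Lease:
    holder: str
    ttl_seconds: float
    expires_at: float

    def is_expired(self, now: float) -> bool:
        return now >= self.expires_at

    def renew(self, now: float) -> bool:
        """Extend the lease if it is still live; an expired lease stays dead."""
        if self.is_expired(now):
            return False
        self.expires_at = now + self.ttl_seconds
        return True


def missed_renewals_before_expiry(ttl_seconds: float, renew_interval: float) -> int:
    """Count renewal attempts that fall strictly inside one TTL window."""
    count = 0
    t = renew_interval
    while t < ttl_seconds:
        count += 1
        t += renew_interval
    return count


# Ten-second lease, renewed every three seconds.
print(missed_renewals_before_expiry(10.0, 3.0))  # 3 (attempts at 3, 6, 9)

lease = Lease(holder="cron-a", ttl_seconds=10.0, expires_at=10.0)  # granted at t = 0
print(lease.renew(3.0))        # True: lease now runs until t = 13
# The holder stalls and sends nothing more.
print(lease.is_expired(12.0))  # False: others must keep waiting
print(lease.is_expired(13.0))  # True: others may campaign
print(lease.renew(14.0))       # False: the old holder cannot revive it
```

The gap between renew interval and lease duration is the tolerance for slow networks and pauses. A shorter lease fails over faster but makes false expiries more likely.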
4) Split-brain, stale leaders, and fencing tokens
Now the ugly part. A node can think it is leader after losing contact. Another side of the partition may elect a new leader. That is split-brain.
Quorum rules reduce this risk because only the majority side can confirm leadership. Still, old leaders may continue talking to storage or workers for a while.
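The majority rule itself is one comparison. In this minimal sketch, `has_quorum` is a hypothetical helper and the cluster sizes are made up for illustration.

```python
def has_quorum(reachable_votes: int, cluster_size: int) -> bool:
    """True when strictly more than half of the cluster is reachable."""
    return reachable_votes > cluster_size // 2


# A five-node cluster splits 3 / 2.
print(has_quorum(3, 5))  # True: the majority side may confirm a leader
print(has_quorum(2, 5))  # False: the minority side must not

# An even split of a four-node cluster leaves nobody with a majority.
print(has_quorum(2, 4))  # False
```

Because two disjoint groups cannot both hold more than half the votes, at most one side of a partition can confirm a leader. The rule says nothing about what an already-elected old leader does next, which is the gap the rest of this section covers.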
partition happens
old leader ----X---- quorum
|
| still sends writes
v
shared storage
^
| accepts only highest token
new leader ---- quorum ---- followers
This is where fencing tokens save the day. Each new leader gets a strictly increasing token. Downstream systems reject commands carrying an older token.
Worked example. Old leader writes with token 41. New leader gets token 42. Storage accepts 42 and forever rejects 41, even if 41 arrives late.
- Lease alone says who should lead now. Fencing token says whose side effects may still be accepted.
- Use both when the resource is external, slow, or unable to observe your quorum directly.
- Never trust local memory alone. Leadership must be validated by a shared quorum-backed source.
Simple, no? Choosing one town crier is only half the job. Stopping yesterday’s crier from issuing orders is the other half.
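On the storage side, the fencing check is a comparison against the highest token seen so far. The sketch below is an in-memory stand-in: `FencedStorage` is a hypothetical class, and a real resource would need to persist the token and make the check-and-write atomic.

```python
from typing import Optional


class FencedStorage:
    """Accepts a write only if its token is at least the highest seen so far."""

    def __init__(self) -> None:
        self.highest_token = 0
        self.value: Optional[str] = None

    def write(self, token: int, value: str) -> bool:
        if token < self.highest_token:
            return False  # stale leader: reject without touching state
        self.highest_token = token
        self.value = value
        return True


storage = FencedStorage()

# The new leader (token 42) reaches storage first.
print(storage.write(42, "pointer=418"))  # True

# The old leader's delayed write (token 41) arrives afterwards.
print(storage.write(41, "pointer=417"))  # False
print(storage.value)                     # pointer=418
```

One caveat worth keeping in mind: the token only fences once storage has seen the newer one. If token 41 arrives before token 42 has ever written, storage has no reason to reject it, so the lease and the quorum still have to do their part.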
Where this lives in the wild
- Kubernetes control-plane engineer at Google or a cloud vendor — relies on etcd quorum and lease-based leadership so only one controller-manager actively reconciles certain loops.
- Stripe reliability engineer — uses leader election for scheduled billing or recovery jobs where duplicate execution could create customer-facing mistakes.
- LinkedIn Kafka platform engineer — depends on one active controller to coordinate partition leadership and cluster metadata changes.
- Elastic search infrastructure engineer — needs one elected master-eligible node to publish cluster state changes safely.
- Swiggy platform engineer — may elect one worker to own time-based housekeeping jobs instead of every pod running the same cleanup.
Pause and recall
- Why does one leader reduce risk for non-idempotent or globally ordered work?
- What exact assumption makes the Bully algorithm fragile during network partitions?
- Why do ZooKeeper ephemeral nodes and etcd leases beat local in-memory leader flags?
- What problem does a fencing token solve that a plain lease alone cannot?
Interview Q&A
Q: Why do distributed systems often want one leader even when replicas exist?
A: Because some actions need one authoritative writer, scheduler, or sequencer. Replication gives availability, but leadership reduces conflicting decisions.
Common wrong answer to avoid: "Replication means every node should independently coordinate the same job."
Q: Why is the Bully algorithm not enough for production safety by itself?
A: Because it is timeout-driven and visibility-limited. During partitions, each side may elect based only on who appears reachable locally.
Common wrong answer to avoid: "Highest node ID automatically guarantees global correctness."
Q: Why do ZooKeeper and etcd elections usually feel safer than custom heartbeats?
A: Because leadership is attached to quorum-backed sessions or leases. When the session dies or the lease expires, ownership disappears centrally.
Common wrong answer to avoid: "A process can just set leader=true in memory and everyone will respect it."
Q: Why use fencing tokens after electing a new leader?
A: Because stale leaders may still send delayed commands. Monotonic tokens let downstream systems reject old authority even after network recovery.
Common wrong answer to avoid: "Once a new leader wins, old commands cannot possibly arrive later."
Apply now (5 min)
Take a job from your current system, like billing, partition assignment, or cron ownership. Decide whether it truly needs one leader. Then write the lease duration, renew interval, quorum rule, and fencing-token check you would enforce.
Sketch from memory:
- the three-node picture showing why one writer protects a shared pointer,
- the Bully algorithm message flow from failed node 6 to elected node 5,
- and the split-brain diagram where token 42 defeats stale token 41.
Bridge. A leader helps only if other services can quickly find it and its healthy peers. → 12-service-discovery.md