13. CAP theorem in practice: when branch copies disagree, who gets to answer
~16 min read. CAP is not a slogan; it is a failure-time decision about truth.
Built on the ELI5 in 00-eli5.md. The branch library — when branches disagree, who is right? — becomes our model for replicas, partitions, and consistency choices.
1. What CAP really says, and when it actually matters
See. CAP is about Consistency, Availability, and Partition tolerance. But the theorem bites only during a network partition. That detail is the most commonly skipped part. Healthy networks are not the decisive CAP moment. Broken communication is.
Now imagine two branch library locations cannot talk. A user returns a book at branch A. Another user asks branch B whether that copy is available. What should branch B say? Latest truth? Old truth? Or a clear refusal? That is the CAP question.
Consistency here means every read sees the latest successful write, or receives an error instead of stale success. Availability means every request to a non-failed node gets a non-error response, even if that response is stale. Partition tolerance means the system continues despite network splits. In real distributed systems, partition tolerance is not optional. Networks can and do fail.
So the useful reading is this. During a partition, you choose between consistency and availability for that operation. You do not choose "any two forever" like a sticker. Simple, no? That is why serious answers always mention the failure scenario.
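To make the partition-time choice concrete, here is a minimal sketch of the branch library question. The names `Branch`, `Policy`, and `Unavailable` are illustrative assumptions, not part of any real library, and the model assumes a branch's local view is current whenever it can reach its peers.

```python
from dataclasses import dataclass
from enum import Enum


class Policy(Enum):
    CONSISTENT = "C"  # refuse rather than risk a stale answer
    AVAILABLE = "A"   # always answer from local state


class Unavailable(Exception):
    """Raised when a branch refuses to answer during a partition."""


@dataclass
class Branch:
    name: str
    copies_on_shelf: int          # this branch's local view, possibly stale
    can_reach_peers: bool = True  # False models a network partition

    def copies_available(self, policy: Policy) -> int:
        # Simplifying assumption: when peers are reachable, the local view is current.
        if self.can_reach_peers or policy is Policy.AVAILABLE:
            return self.copies_on_shelf
        raise Unavailable(f"branch {self.name} cannot confirm the latest state")


# Branch B is cut off and has not heard about the return at branch A.
branch_b = Branch("B", copies_on_shelf=0, can_reach_peers=False)

stale_answer = branch_b.copies_available(Policy.AVAILABLE)  # answers, maybe stale

try:
    branch_b.copies_available(Policy.CONSISTENT)
    refused = False
except Unavailable:
    refused = True  # the clear refusal
```

The same branch, in the same partition, gives two different behaviours depending on the policy passed for that one request. That is the per-operation reading of CAP.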
2. Common myths make CAP sound simpler than real systems are
Myth one says, "This whole database is AP". Look. Real systems expose many operations with different damage profiles. A feed refresh and a payment confirmation should not behave identically. One operation may allow stale data. Another may reject during uncertainty. So CAP choices are often path-specific.
Myth two says, "CA is a valid distributed stance". Not really. If the system spans machines, partitions are possible. The moment they happen, you must react. Pretending the network never breaks is not architecture. It is wishful thinking.
Myth three says, "CP means correctness with no downside". Also wrong. CP systems usually pay with latency, reduced write availability, or coordination overhead. They are not free truth machines. They are deliberate tradeoff machines.
partition happens
├─ choose C → reject or delay some requests
└─ choose A → answer, but maybe with stale state
One more myth matters. Students hear availability and think uptime percentage only. But CAP availability means each non-failed node must respond. That response can still be stale or incomplete. So define the term carefully in interviews. Words matter here.
3. CP and AP are business choices hiding inside technical language
Let us make this concrete. Suppose we run a global ticketing service. Inventory for the last seat is sensitive. Browse pages for concert details are not equally sensitive. The same product therefore needs different consistency behavior.
For seat reservation, choose a CP-leaning path. If required replicas cannot coordinate, reject the reservation or ask the user to retry. Why? Selling one seat twice creates refunds and trust damage. Brief unavailability is cheaper than false success.
For event browsing, choose an AP-leaning path. If one region is isolated, show cached event details and perhaps slightly stale availability badges. Why? Reading older browse data is usually acceptable for a few seconds. The user can still continue the journey.
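The two paths can be sketched side by side. This is a toy model under stated assumptions: `TicketingRegion`, `QuorumUnavailable`, and the `possibly_stale` flag are hypothetical names, and "quorum" is reduced to a simple count of reachable replicas.

```python
class QuorumUnavailable(Exception):
    """Raised when the reservation path cannot reach enough replicas."""


class TicketingRegion:
    """Hypothetical model of one region of the ticketing service."""

    def __init__(self, cached_event, replicas_reachable, quorum):
        self.cached_event = cached_event
        self.replicas_reachable = replicas_reachable
        self.quorum = quorum
        self.reserved_seats = set()

    def browse_event(self):
        # AP-leaning path: always answer from the local cache,
        # and flag that the data may be stale when the region is isolated.
        stale = self.replicas_reachable < self.quorum
        return {**self.cached_event, "possibly_stale": stale}

    def reserve_seat(self, seat):
        # CP-leaning path: refuse instead of risking a double sale.
        if self.replicas_reachable < self.quorum:
            raise QuorumUnavailable("cannot confirm seat, please retry")
        if seat in self.reserved_seats:
            return "already reserved"
        self.reserved_seats.add(seat)
        return "reserved"


# An isolated region: only 1 of the 2 replicas needed for quorum is reachable.
isolated = TicketingRegion({"event": "concert", "seats_left": 1},
                           replicas_reachable=1, quorum=2)
page = isolated.browse_event()  # still answers
try:
    outcome = isolated.reserve_seat("A1")
except QuorumUnavailable:
    outcome = "rejected"        # brief unavailability instead of false success
```

Both methods live in the same service and see the same partition. Only the reservation path refuses.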
Worked example. Local replica latency is 12 milliseconds for a read or a write. Cross-region quorum adds 85 milliseconds. Browse reads can stay near 12 milliseconds on local replicas. Seat reservation writes may cost 12 + 85 = 97 milliseconds, or fail if quorum is unavailable. That is a real business trade. Not a theorem tattoo.
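The arithmetic in the worked example can be written as a small function. The function name and the use of `None` to represent a failed request are assumptions made for this sketch; the two latency figures come from the example itself.

```python
from typing import Optional

LOCAL_MS = 12         # local replica latency from the worked example
QUORUM_EXTRA_MS = 85  # added cost of cross-region quorum


def path_latency_ms(needs_quorum: bool, quorum_reachable: bool = True) -> Optional[int]:
    """Return the expected latency in ms, or None if the request must fail."""
    if not needs_quorum:
        return LOCAL_MS
    if not quorum_reachable:
        return None  # CP-leaning path: reject instead of answering locally
    return LOCAL_MS + QUORUM_EXTRA_MS


browse = path_latency_ms(needs_quorum=False)                            # 12
reserve = path_latency_ms(needs_quorum=True)                            # 97
reserve_in_partition = path_latency_ms(True, quorum_reachable=False)    # None
```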
The branch library lesson is simple. Some branches may answer from local memory. Some branches must call headquarters before promising the last copy. Your choice depends on user harm. The branch library is not always wrong when it refuses. Sometimes refusal is the honest answer.
4. PACELC adds the tradeoff people feel on healthy days
CAP speaks about partitions. But most days, the network is healthy enough. Still, design choices continue hurting or helping latency. PACELC captures that extra reality. If there is a Partition, trade Availability versus Consistency. Else, trade Latency versus Consistency.
Suppose a user in Mumbai writes profile data. Local commit time is 8 milliseconds. Global coordination adds 70 milliseconds. A strongly consistent path may therefore take about 78 milliseconds. An eventually consistent local write may finish near 8 milliseconds, then replicate outward later. That difference exists even when the network is healthy.
Now ask the right question. Is the product willing to pay 70 extra milliseconds every time? For bank balance updates, maybe yes. For social likes, maybe no. PACELC prevents shallow answers like, "Consistency tradeoffs matter only when the network is broken". Failure-time tradeoffs matter, but healthy-time latency tradeoffs matter daily.
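PACELC reads almost like an if/else statement, so a short sketch may help. The function names are hypothetical; the 8 and 70 millisecond figures are the ones from the Mumbai example above.

```python
LOCAL_COMMIT_MS = 8           # local commit time from the Mumbai example
GLOBAL_COORDINATION_MS = 70   # extra cost of global coordination


def pacelc_cost(partitioned: bool, prefers_consistency: bool) -> str:
    """Name what an operation gives up under PACELC."""
    if partitioned:
        # P: trade Availability versus Consistency.
        return "availability" if prefers_consistency else "consistency"
    # E: else, trade Latency versus Consistency.
    return "latency" if prefers_consistency else "consistency"


def healthy_write_latency_ms(prefers_consistency: bool) -> int:
    """Write latency on a healthy network for the profile-update example."""
    extra = GLOBAL_COORDINATION_MS if prefers_consistency else 0
    return LOCAL_COMMIT_MS + extra


strict_write = healthy_write_latency_ms(prefers_consistency=True)    # 78
relaxed_write = healthy_write_latency_ms(prefers_consistency=False)  # 8
```

Note that an operation preferring consistency pays in both branches: with availability during a partition, and with latency on every healthy day.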
This also explains why one company mixes models. A control plane may be stricter. An analytics stream may be looser. A user profile photo may be looser still. Systems are portfolios of choices, not one religious position. Simple, no?
5. Choosing a consistency model should start from user-visible harm
Now let us make a practical framework. First, ask what wrong answer harms the user most. Second, ask how long stale data can remain tolerable. Third, ask whether the operation is reversible. Fourth, ask whether retries are safer than silent inconsistency.
Use strong consistency when duplicate success is dangerous. Examples include money transfer, quota enforcement, and scarce inventory. Use eventual consistency when stale reads are annoying but survivable. Examples include likes, recommendations, and some profile mirrors. Use causal or session consistency when user actions must preserve personal order. Examples include chat threads or read-your-write profile edits.
Worked mini-table:

| operation | preferred model |
| --- | --- |
| wallet debit | strong / CP-leaning |
| cart badge count | eventual / AP-leaning |
| profile update view | read-your-write or session consistency |
| activity feed | eventual with bounded freshness |
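The framework and the mini-table can be condensed into a rough decision helper. This is a sketch, not a rule: `choose_model` and its two flags are hypothetical, and the flag values assigned to each operation are one reading of the table rather than facts about any real product.

```python
def choose_model(duplicate_success_dangerous: bool, needs_own_order: bool) -> str:
    """Pick a consistency model from user-visible harm (illustrative only)."""
    if duplicate_success_dangerous:
        return "strong"    # money, quotas, scarce inventory
    if needs_own_order:
        return "session"   # the user must see their own actions in order
    return "eventual"      # stale reads are annoying but survivable


# Assumed flag values for the operations in the mini-table.
OPERATIONS = {
    "wallet debit": dict(duplicate_success_dangerous=True, needs_own_order=False),
    "cart badge count": dict(duplicate_success_dangerous=False, needs_own_order=False),
    "profile update view": dict(duplicate_success_dangerous=False, needs_own_order=True),
    "activity feed": dict(duplicate_success_dangerous=False, needs_own_order=False),
}

models = {op: choose_model(**flags) for op, flags in OPERATIONS.items()}
```

The order of the checks is the design choice here: harm from duplicate success is tested first, because it is the costliest mistake to recover from.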
One more practical point. Consistency is not only a database property. Caches, queues, retries, and client reads also change the outcome. You can use a strong database and still show stale cached pages. You can use an AP store and still make one path strict with quorum. So answer at the end-to-end path level. That sounds much more senior.
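One common way to make a single path strict on a replicated store is to size read and write quorums so they always overlap. The sketch below checks the usual overlap condition R + W > N. The function name is hypothetical, and overlap alone does not give full linearizability; it only ensures that a read quorum contacts at least one replica that acknowledged the latest successful write.

```python
def quorums_overlap(n: int, w: int, r: int) -> bool:
    """True when every read quorum intersects every write quorum (R + W > N)."""
    if not (1 <= w <= n and 1 <= r <= n):
        raise ValueError("w and r must be between 1 and n")
    return r + w > n


# Same store with 3 replicas, two different paths.
strict_path = quorums_overlap(n=3, w=2, r=2)  # True: reads meet the latest write
loose_path = quorums_overlap(n=3, w=1, r=1)   # False: a read may miss it
```

Even with overlapping quorums, a cache in front of the store can still serve stale pages, which is why the answer belongs at the end-to-end path level.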
Where this lives in the wild
- A Google Spanner engineer pays the coordination cost for globally ordered financial or configuration updates.
- An Amazon retail backend engineer may keep browse reads available while keeping inventory reservation stricter.
- A Netflix personalization engineer tolerates temporary divergence in recommendation surfaces rather than blocking viewers.
- An Airbnb booking platform engineer prefers correctness over false success for scarce-room reservation writes.
- A Meta social graph engineer may allow some feed-path staleness while keeping identity or permission paths tighter.
Pause and recall
- Why is CAP mainly about partition-time behavior, not sunny-day throughput?
- Why is calling a whole distributed system simply CA misleading?
- In one product, why might browse reads and reservation writes choose differently?
- What extra question does PACELC add for the times when the network stays healthy?
Interview Q&A
Q: Why describe CAP tradeoffs per operation instead of labelling the whole database once? A: Different operations tolerate different failure outcomes. A social feed, a booking write, and a profile badge should not all sacrifice the same thing during a partition.
Common wrong answer to avoid: "Because interviewers like nuance" — the deeper point is that user harm differs by path, so the design must differ too.
Q: Why is CA not a serious answer for a real distributed deployment? A: Once multiple nodes depend on a network, partitions are possible. When they happen, the system must either reject some work or answer without full coordination.
Common wrong answer to avoid: "Because partitions are rare" — rarity does not remove the need for a failure-time policy.
Q: Why choose CP for seat inventory but AP for browse pages? A: Selling the same seat twice creates direct business damage, while a slightly stale browse page usually causes less harm. The two paths face different correctness stakes.
Common wrong answer to avoid: "Because inventory is more important" — importance is too vague. The real issue is the specific cost of being wrong.
Q: Why bring PACELC into a design discussion after explaining CAP? A: Because even without partitions, stronger consistency often costs extra latency. PACELC captures the everyday tradeoff teams keep paying in healthy conditions.
Common wrong answer to avoid: "Because CAP is outdated" — CAP is still correct. PACELC simply adds the else-case engineers feel every day.
Apply now (5 min)
Exercise: Pick one app with a browse path and a commit path. Write what each path should do during a partition. Then write the extra healthy-network latency you are willing to pay for the stricter path. State why in business language.
Sketch from memory: draw two replicas, a broken network link, one request that chooses A, and one request that chooses C. Then add the PACELC line below it.
Bridge. We can now discuss tradeoffs honestly. The final lesson says where confidence should stop, and where careful uncertainty should begin. → 14-honest-admission.md