14. Honest admission — the distributed truths that still hurt after the diagrams look neat

~15 min read. Mature design means saying where the guarantees really stop.

Built on the ELI5 in 00-eli5.md. The bulletin board — shared notices without direct phone calls — becomes the event backbone and its limits.


1) Exactly-once is usually at-least-once plus idempotency wearing makeup

See, a distributed system can lose acknowledgments, retry handlers, and redeliver messages. So the same event may be processed more than once even when everyone behaved reasonably.

That is why “exactly-once” often means something narrower than people first imagine. Usually it means at-least-once delivery plus careful deduplication or idempotent side effects.

  • The broker may deliver a message again if the consumer handled it but crashed before committing the offset.
  • The producer may send again if it never receives the acknowledgment and cannot prove the first write succeeded.
  • The application can survive duplicates only when handlers are idempotent or when dedupe keys are enforced durably.

receive order-created
   -> charge card
   -> crash before ack
broker redelivers
   -> charge card again?
fix: idempotency key / payment fence

Worked example. A consumer charges a card, then dies before the offset commit. The broker redelivers the event, exactly as designed. The bulletin board did its job. Your handler still needs idempotency to avoid a double charge.
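
The fix line in the flow above can be sketched in a few lines. This is a minimal sketch under stated assumptions, not a payment integration: `open_ledger`, `charge_once`, `handle_order_created`, and the `charges` table are hypothetical names, SQLite stands in for whatever durable store you use, and the ledger row stands in for the charge itself. With a real payment provider, the same key would also have to travel with the charge request so the provider can reject the repeat.

```python
import sqlite3


def open_ledger(path=":memory:"):
    """Open a store whose primary key enforces one charge per idempotency key."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS charges ("
        "idempotency_key TEXT PRIMARY KEY, "
        "amount_cents INTEGER NOT NULL)"
    )
    return conn


def charge_once(ledger, idempotency_key, amount_cents):
    """Record the charge unless this key was seen before.

    Returns True only for the first delivery of a given key.
    """
    with ledger:  # commits on success, rolls back on an exception
        cur = ledger.execute(
            "INSERT OR IGNORE INTO charges (idempotency_key, amount_cents) "
            "VALUES (?, ?)",
            (idempotency_key, amount_cents),
        )
    # rowcount is 1 when the row was inserted, 0 when the key already existed.
    return cur.rowcount == 1


def handle_order_created(ledger, event):
    """Handler for order-created; safe to call again on redelivery."""
    # Assumption: the order id is a stable key for "this charge".
    return charge_once(ledger, event["order_id"], event["amount_cents"])
```

Calling `handle_order_created` twice with the same event leaves one row in `charges`. The second call returns False instead of charging again. The dedupe lives in the data model, not in the broker.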

So the honest admission is simple. A delivery guarantee alone never proves a business guarantee. Your data model finishes the promise.

2) Distributed debugging remains painful even with logs, metrics, and traces

Now look at debugging. One request may cross services, queues, retries, caches, and regions. By the time the bug appears, the original cause may be minutes earlier.

Tracing helps, but traces can be sampled, broken, or missing across async hops. Logs help, but clocks differ and correlation IDs still go missing.

  • A trace can tell you where latency happened, but not always why a handler made the wrong domain decision.
  • A log can show a retry, but not whether the first attempt actually committed before timing out.
  • A metric can show backlog growth, but not which exact event sequence created the bad state.

user clicks pay
   -> api trace id T1
   -> publish event E9
   -> worker retries twice
   -> db writes succeed once
   -> callback lost
who exactly lied first?

Worked example. The payment succeeded in storage, the callback failed, the UI timed out, and a support ticket arrived three hours later. Good luck proving the precise sequence quickly.
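
One small aid is to group log records by correlation ID and ask which hops never reported. The sketch below assumes each record is a dict with hypothetical fields `trace_id`, `service`, and `seq` (a per-service counter). It orders records only within one service, because clocks differ across services. It does not reconstruct cause and effect. It only shows where the trail goes cold.

```python
from collections import defaultdict


def by_service(records, trace_id):
    """Group one trace's records per service, ordered by that service's own counter."""
    grouped = defaultdict(list)
    for record in records:
        # Records that lost their correlation ID simply never match.
        if record.get("trace_id") == trace_id:
            grouped[record["service"]].append(record)
    for service_records in grouped.values():
        # Per-service sequence numbers are trusted; cross-service clocks are not.
        service_records.sort(key=lambda r: r["seq"])
    return dict(grouped)


def missing_hops(records, trace_id, expected_services):
    """Return the expected services that logged nothing for this trace."""
    seen = by_service(records, trace_id)
    return [service for service in expected_services if service not in seen]
```

If the callback service logged its record without the trace ID, `missing_hops` reports it as silent. That tells you where to dig. It does not tell you who lied first.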

A bulletin board removes tight coupling. It does not remove detective work. Distributed debugging is still slow, human, and humbling.

3) Ordering across partitions and saga rollback stay imperfect

People love neat stories about event order. Reality is stricter. Order is easy inside one partition, one key, or one sequencer. Global order across partitions is expensive and fragile.

If events for the same business entity land on different partitions, arrival order and processing order may diverge. Consumer lag, retries, and rebalances make the picture uglier.

  • Per-key ordering is often achievable by partitioning all related events by the same key.
  • Cross-key ordering is often a business redesign problem, not a broker feature problem.
  • Saga compensation is not a true rollback either. It issues new actions that try to semantically undo earlier ones.

partition 1: reserve-seat   ---> arrives second
partition 2: cancel-seat    ---> arrives first

consumer view:
cancel before reserve

saga compensate:
refund issued, email maybe already sent
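
The per-key ordering point above can be sketched as a key-to-partition function. This is a minimal sketch, assuming a stable hash of the business key. `partition_for` and `route` are hypothetical names, and real brokers ship their own partitioners with their own hash functions.

```python
import hashlib


def partition_for(key, num_partitions):
    """Map a business key to a partition with a hash that is stable across processes."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions


def route(events, num_partitions):
    """Append each event to the partition chosen by its key, in publish order."""
    partitions = {p: [] for p in range(num_partitions)}
    for event in events:
        partitions[partition_for(event["key"], num_partitions)].append(event)
    return partitions
```

If reserve-seat and cancel-seat both carry the key of the same seat, they land on one partition in publish order, so a single consumer sees reserve before cancel. Events with different keys still have no order relative to each other. Changing `num_partitions` also remaps keys, which is one reason repartitioning is painful.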

Worked example. A hotel booking saga sends a confirmation email, then a payment compensation triggers a refund. The money is corrected, but the customer still received a misleading confirmation.
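
The gap between compensation and rollback shows up even in a toy saga runner. This sketch is an assumption-heavy illustration, not a saga framework: `run_saga` is a hypothetical name, each step is a `(name, action, compensate)` tuple, and `compensate` is `None` when the effect cannot be undone.

```python
def run_saga(steps):
    """Run steps in order; on failure, compensate completed steps in reverse.

    Returns (succeeded, log). The log keeps every effect that happened,
    including the ones that no compensation can take back.
    """
    completed = []
    log = []
    for name, action, compensate in steps:
        try:
            action()
        except Exception:
            log.append(f"failed:{name}")
            for prev_name, prev_compensate in reversed(completed):
                if prev_compensate is None:
                    # Nothing can un-send an email the customer already read.
                    log.append(f"cannot-undo:{prev_name}")
                else:
                    prev_compensate()
                    log.append(f"compensated:{prev_name}")
            return False, log
        log.append(f"did:{name}")
        completed.append((name, compensate))
    return True, log
```

Run it with charge (compensated by a refund), send-email (no compensation), and a reserve-room step that raises. The log ends with `cannot-undo:send-email` followed by `compensated:charge`. The refund is a new action in the history, not an erased one.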

So be honest. The bulletin board can preserve some orders well. It cannot cheaply preserve every order your product manager can imagine.

4) Consensus is slow, tradeoffs are real, and uncertainty should be named

Consensus gives safety, but safety has a bill. Quorum writes, durable logs, leader changes, and WAN distance all increase latency and operational complexity.

That does not make consensus bad. It means you should spend it where disagreement is deadly, not where eventual convergence is perfectly acceptable.

client write
   -> leader append
   -> follower replicate
   -> quorum ack
   -> leader commit
   -> client hears success

Worked example. A local cache update can tolerate eventual sync. A lease owner for shard movement probably needs quorum confirmation before anyone acts.
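
The bill for the write path above can be estimated with simple arithmetic. This is a deliberately simplified model, assuming the leader appends first and then waits for enough follower acknowledgments to form a majority. `quorum_commit_latency_ms` and its parameters are hypothetical, and real protocols overlap these steps and add batching, retries, and leader elections.

```python
def quorum_commit_latency_ms(leader_append_ms, follower_ack_ms):
    """Estimate commit latency for one write under a majority quorum.

    leader_append_ms: time for the leader to append to its own log.
    follower_ack_ms: round-trip time until each follower acknowledges.
    """
    cluster_size = len(follower_ack_ms) + 1
    majority = cluster_size // 2 + 1
    acks_needed = majority - 1  # the leader's own append counts as one vote
    if acks_needed == 0:
        return leader_append_ms
    # The write commits when the last required acknowledgment arrives,
    # which is the acks_needed-th fastest follower.
    slowest_required = sorted(follower_ack_ms)[acks_needed - 1]
    return leader_append_ms + slowest_required
```

With three nodes and followers at 2 ms and 80 ms, the nearby follower completes the quorum and the estimate is 3 ms. With five nodes and followers at 2, 80, 90, and 95 ms, the second acknowledgment must cross the WAN and the estimate jumps to 81 ms. Quorum latency is set by the slowest member you cannot do without.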

  • Use consensus for locks, metadata ownership, membership, and other decisions where two winners are dangerous.
  • Avoid consensus on every user-visible click when the domain can tolerate asynchronous reconciliation.
  • When unsure, say what guarantee you need, what it costs, and what risk remains unsolved.

This is the final honest admission. Distributed systems are full of tradeoffs, not fairy tales. Seniority is naming the limit before production names it for you.


Where this lives in the wild

  • Stripe payments engineer — treats duplicate delivery as normal and depends on idempotency keys because financial side effects cannot trust transport slogans alone.
  • Uber marketplace engineer — uses sagas for long workflows while accepting that compensation may leave temporary business-visible inconsistency.
  • Datadog observability engineer — helps teams trace cross-service failures, while still knowing some incidents remain hard to reconstruct end to end.
  • LinkedIn streaming platform engineer — explains partition ordering carefully and warns product teams against assuming magical global order.
  • Google or cloud-vendor control-plane engineer — spends consensus only on metadata and coordination paths because quorum on everything would be too costly.

Pause and recall

  • Why does at-least-once delivery force idempotency even when the broker sounds reliable?
  • What makes distributed debugging harder than debugging a single process with one stack trace?
  • Why is global event ordering across partitions usually a design compromise rather than a simple checkbox?
  • Why should consensus be used selectively instead of on every path?

Interview Q&A

Q: Why is “exactly-once” often described as marketing in practical systems? A: Because end-to-end business correctness usually still depends on idempotent handlers, dedupe keys, and careful state transitions outside the broker.

Common wrong answer to avoid: "If the broker claims exactly-once, duplicate side effects cannot happen anywhere."

Q: Why is distributed debugging still painful despite modern observability tools? A: Because the truth is spread across clocks, retries, async hops, partial traces, and delayed side effects. Tools help, but reconstruction still needs reasoning.

Common wrong answer to avoid: "A tracing dashboard automatically explains every cross-service bug."

Q: Why is saga compensation not the same as a database rollback? A: Because compensation is a new business action that tries to offset prior effects. It cannot erase every external observation or irreversible side effect.

Common wrong answer to avoid: "A saga simply gives distributed ACID transactions with prettier names."

Q: Why is consensus considered expensive? A: Because nodes must coordinate through leaders, durable logs, and quorum acknowledgments. That adds latency, complexity, and failure-handling overhead.

Common wrong answer to avoid: "Consensus is just another function call with better branding."


Apply now (5 min)

Write one uncomfortable truth for each distributed guarantee you use today: delivery, ordering, rollback, and coordination. Then rewrite one design note so it explicitly says what the system really guarantees and what the application layer still must handle.

Sketch from memory:

  • the duplicate-delivery flow where business idempotency prevents a second charge,
  • the partition-ordering example where cancel arrives before reserve,
  • and the consensus write path showing why quorum adds latency.

Bridge. We leave the square now and move into cloud infrastructure, where these tradeoffs become platform choices. → ../07_cloud_infrastructure_for_ai/00-eli5.md