11. Failure modes and resilience: survive the bad day on purpose
~16 min read. Reliability starts when you stop saying, "that dependency is usually fine."
Built on the ELI5 in 00-eli5.md. The kitchen — the backend doing the real work — must keep serving even when one prep station goes weird.
1) First picture: failure is not one thing
See.
One order ticket can fail in several different shapes. The user only sees, "request failed." You must see the pattern.
client ──→ gateway ──→ checkout ──→ inventory
                          │             │
                          │             ├── crash
                          │             ├── omission
                          │             ├── timing
                          │             └── Byzantine
                          └──→ payment
Look. Crash is easy to notice. Timing is often worse. A slow prep station can quietly stall the whole restaurant.
Quick examples.
- A pod exits after OOM. Crash.
- A network flap drops 2% of packets. Omission.
- A database answers in 4 seconds instead of 40 ms. Timing.
- One cache node serves corrupted values. Byzantine.
So what to do? Name the failure first. Then choose the control. Restart helps crash. Retries help omission. Timeouts help timing. Checksums and validation help Byzantine cases. Simple, no?
2) Cascading failure: one slow dependency becomes everyone’s outage
Now what is the real danger? Not only hard failure. Slow failure.
┌────────┐   ┌──────────┐   ┌───────────┐
│ client │──→│ checkout │──→│ inventory │
└────────┘   └──────────┘   └───────────┘
                  │
                  ├── threads fill
                  ├── queue grows
                  ├── callers time out
                  └── retries add more load
Worked example. Suppose checkout receives 200 requests per second. Each request calls inventory once. Checkout has 60 worker threads.
Normal case:
- inventory latency = 50 ms = 0.05 s
- needed concurrency = arrival rate × latency
- needed concurrency = 200 × 0.05 = 10 in-flight calls
Ten in-flight calls fit easily inside 60 threads. No problem.
Bad case:
- inventory latency jumps to 1.5 s
- needed concurrency = 200 × 1.5 = 300 in-flight calls
But checkout has only 60 threads. So effective throughput becomes:
- max throughput = threads ÷ latency
- max throughput = 60 ÷ 1.5 = 40 requests per second
Arrival is still 200 per second. Service capacity is now 40 per second. Queue growth is:
- queue growth = 200 - 40 = 160 requests per second
- after 10 seconds, queued requests = 160 × 10 = 1,600
Now add retries. If timed-out clients retry once, incoming demand can approach 400 requests per second: the 200 original requests plus up to 200 retries. Then required concurrency becomes:
- 400 × 1.5 = 600 in-flight calls
See the trap? The dependency did not crash. It just got slow. That is enough to kill the caller. Then the caller hurts its own callers. That is cascading failure.
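See. The cascade arithmetic fits in a few lines. This is a minimal sketch, assuming a thread-per-request caller where each request holds one thread for the full dependency latency. The function name `cascade_numbers` and its dictionary keys are made up for illustration.

```python
def cascade_numbers(arrival_rps: float, latency_s: float, threads: int) -> dict:
    """Estimate pool pressure for a caller that blocks one thread per request."""
    needed = arrival_rps * latency_s          # in-flight calls the load demands
    max_throughput = threads / latency_s      # requests/s the pool can complete
    served = min(arrival_rps, max_throughput)
    return {
        "needed_concurrency": needed,
        "max_throughput_rps": max_throughput,
        "queue_growth_rps": arrival_rps - served,  # 0 while the pool keeps up
    }


# The three situations from the worked example (60 worker threads).
normal = cascade_numbers(arrival_rps=200, latency_s=0.05, threads=60)
bad = cascade_numbers(arrival_rps=200, latency_s=1.5, threads=60)
with_retries = cascade_numbers(arrival_rps=400, latency_s=1.5, threads=60)
```

Plug in your own arrival rate, latency, and pool size. If `needed_concurrency` exceeds the pool size, the queue is already growing.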
The waiting line helps only if it is bounded and deliberate. An invisible queue inside thread pools is not a design. It is denial.
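What does a bounded, deliberate queue look like? One possible sketch, using Python's standard `queue.Queue` with a size limit. The class name `BoundedIntake` and the `shed` counter are hypothetical. A real service would also report shed requests to its metrics.

```python
import queue


class BoundedIntake:
    """Waiting line with a hard limit; extra requests are rejected, not hidden."""

    def __init__(self, max_waiting: int) -> None:
        # Assumes max_waiting >= 1 (queue.Queue treats 0 as "unbounded").
        self._q = queue.Queue(maxsize=max_waiting)
        self.shed = 0  # how many requests were turned away

    def offer(self, request) -> bool:
        """Accept the request if there is room, otherwise shed it immediately."""
        try:
            self._q.put_nowait(request)
            return True
        except queue.Full:
            self.shed += 1
            return False

    def take(self):
        """Hand the oldest waiting request to a worker."""
        return self._q.get_nowait()


intake = BoundedIntake(max_waiting=2)
accepted = [intake.offer(f"order-{i}") for i in range(3)]
```

The third order is rejected at the door. The caller learns that in microseconds, not after a 30-second hang.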
3) Resilience patterns: contain, shed, recover
Look. Resilience is not one magic box. It is a stack of small controls.
request path
│
├── timeout
├── retry with backoff + jitter
├── circuit breaker
├── bulkhead
└── graceful degradation
Timeouts
A timeout is a promise to stop waiting. Without it, threads and sockets stay occupied. Set it shorter than the caller’s full budget. Do not let one slow prep station consume the whole request budget.
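One way to keep each hop inside the caller's budget is to carry a deadline and derive every timeout from what is left. A minimal sketch; `Deadline`, `timeout_for_hop`, `cap_s`, and `reserve_s` are illustrative names, not an API from any library.

```python
import time


class Deadline:
    """Tracks how much of the caller's total budget is still unspent."""

    def __init__(self, budget_s: float, clock=time.monotonic) -> None:
        self._clock = clock  # injectable so the logic can be tested without waiting
        self._expires_at = clock() + budget_s

    def remaining(self) -> float:
        return max(0.0, self._expires_at - self._clock())

    def timeout_for_hop(self, cap_s: float, reserve_s: float = 0.0) -> float:
        """Timeout for one downstream call.

        Never more than the hop's own cap, and never more than the remaining
        budget minus a reserve kept back for the caller's own work.
        """
        return max(0.0, min(cap_s, self.remaining() - reserve_s))
```

Usage, roughly: create `Deadline(1.0)` when the request arrives, then pass `deadline.timeout_for_hop(cap_s=0.3, reserve_s=0.1)` as the timeout of each outgoing call. A result of `0.0` means skip the call and degrade.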
Retries with backoff
Retries are for transient faults. They are not a cure for overload, and they always need limits. Use exponential backoff and jitter.
Example:
- attempt 1 fails at 150 ms
- wait 100 ms
- attempt 2 fails
- wait 200 ms
- attempt 3 fails
- wait 400 ms
Total extra waiting is 100 + 200 + 400 = 700 ms.
So what to do? Retry only when the operation is idempotent. Retry only a few times. Retry only if the remaining budget allows it.
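The three rules can be sketched as one helper. Assumptions, stated loudly: `TransientError` is a hypothetical marker exception, the caller passes only idempotent operations, and the jitter style is "full jitter" (a random wait between zero and the doubling cap). The doubling cap follows the 100, 200, 400 ms schedule from the example.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for faults worth retrying (e.g. a dropped connection)."""


def retry_with_backoff(op, *, max_attempts=3, base_delay_s=0.1, budget_s=1.0,
                       sleep=time.sleep, rng=random.random, clock=time.monotonic):
    """Call op(); retry transient faults with exponential backoff and full jitter.

    The caller must only pass idempotent operations. Any other exception
    type propagates immediately without a retry.
    """
    start = clock()
    for attempt in range(1, max_attempts + 1):
        try:
            return op()
        except TransientError:
            if attempt == max_attempts:
                raise  # retry only a few times
            cap = base_delay_s * 2 ** (attempt - 1)  # 0.1, 0.2, 0.4, ...
            delay = rng() * cap                      # jitter spreads the herd
            if (clock() - start) + delay > budget_s:
                raise  # not enough budget left to wait and try again
            sleep(delay)
```

The `sleep`, `rng`, and `clock` parameters exist only so the behavior can be checked without real waiting.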
Circuit breakers
A breaker stops sending fresh traffic to a sick dependency. Closed means normal traffic. Open means fail fast. Half-open means test recovery with limited probes.
This matters because fast failure is often kinder than slow collapse.
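The three states can be sketched as a small state machine. This is a simplified, single-threaded illustration, not a production breaker: the class name, the consecutive-failure threshold, and the fixed reset timeout are all assumptions, and real breakers often use error-rate windows and locks.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker refuses a call instead of trying the dependency."""


class CircuitBreaker:
    def __init__(self, failure_threshold=5, reset_timeout_s=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout_s = reset_timeout_s
        self._clock = clock
        self.state = "closed"
        self._failures = 0      # consecutive failures while closed
        self._opened_at = 0.0

    def call(self, op):
        if self.state == "open":
            if self._clock() - self._opened_at < self.reset_timeout_s:
                raise CircuitOpenError("failing fast")  # dependency is not touched
            self.state = "half-open"  # cool-down over: let one probe through
        try:
            result = op()
        except Exception:
            self._record_failure()
            raise
        self._failures = 0
        self.state = "closed"  # success (including a probe) restores traffic
        return result

    def _record_failure(self):
        self._failures += 1
        # A failed probe reopens at once; otherwise wait for the threshold.
        if self.state == "half-open" or self._failures >= self.failure_threshold:
            self.state = "open"
            self._opened_at = self._clock()
```

Wrap each dependency in its own breaker. Catch `CircuitOpenError` at the call site and serve the degraded response.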
Bulkheads
Bulkheads split resources. One pool for checkout. Another for search. Another for recommendations. If recommendations melt, checkout still breathes. Simple, no?
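A bulkhead can be as small as a per-feature concurrency limit. A minimal sketch using `threading.BoundedSemaphore`; the `Bulkhead` class and the slot counts are illustrative assumptions. Separate thread pools or separate processes are other common ways to do the same thing.

```python
import threading


class BulkheadFullError(Exception):
    """Raised when a compartment has no free slot."""


class Bulkhead:
    """Caps how many calls one feature may run at the same time."""

    def __init__(self, name: str, max_concurrent: int) -> None:
        self.name = name
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def run(self, op):
        # Do not wait for a slot: a full compartment rejects immediately.
        if not self._slots.acquire(blocking=False):
            raise BulkheadFullError(self.name)
        try:
            return op()
        finally:
            self._slots.release()


# Separate compartments: slow recommendations cannot take checkout's slots.
checkout = Bulkhead("checkout", max_concurrent=40)
recommendations = Bulkhead("recommendations", max_concurrent=5)
```

The numbers 40 and 5 are placeholders. Size each compartment from its own arrival rate and latency.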
Graceful degradation
Not every feature deserves equal protection. Core path first. Optional path later. A payment page can hide recommendations. A feed can hide like counts. The restaurant should still serve the main meal.
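In code, "core path first" usually means the core call may fail the request and the optional call may not. A small sketch under that assumption; `render_payment_page` and its two loader arguments are hypothetical.

```python
def render_payment_page(load_cart, load_recommendations) -> dict:
    """Build the page; only the core path is allowed to fail the request."""
    page = {"cart": load_cart(), "degraded": False}  # core: errors propagate
    try:
        page["recommendations"] = load_recommendations()  # optional extra
    except Exception:
        page["recommendations"] = []  # hide the widget, keep the page
        page["degraded"] = True       # flag it so monitoring can notice
    return page
```

Catching a broad `Exception` is deliberate here, since any failure of the optional path gets the same answer. Keep the `degraded` flag visible, or the degradation becomes silent.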
4) Health checks and heartbeats: know who is alive, and who is actually useful
See. "Process exists" is not the same as "service is healthy."
liveness ─→ should I restart it?
readiness ─→ should I send traffic to it?
heartbeat ─→ is this worker still making progress?
Use liveness checks for deadlocks and crashes. Use readiness checks for dependency health, queue age, and warmup state. Use heartbeats for long-running workers.
Bad design:
- app returns HTTP 200 on /health
- database is unreachable
- queue age is 90 seconds
- orchestrator still sends traffic
Good design:
- liveness says process loop is alive
- readiness says database, cache, and critical downstreams are usable
- worker heartbeat says jobs are still progressing
Now what is the problem? Health checks can lie when they are too shallow. If the check ignores backlog, saturation, or stuck threads, it gives false comfort.
Good signals to include:
- dependency reachability,
- queue depth or age,
- thread-pool saturation,
- recent error rate,
- last successful heartbeat time.
Heartbeats matter especially for async systems. A job worker may not crash. It may just stop making progress. A missing heartbeat tells you sooner.
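A readiness check that looks deeper than "HTTP 200" might combine several of these signals. A sketch only: the `Signals` fields and every threshold below are assumptions to tune per service, and error rate is left out to keep it short.

```python
from dataclasses import dataclass


@dataclass
class Signals:
    database_reachable: bool
    queue_age_s: float       # age of the oldest waiting item
    pool_in_use: int
    pool_size: int
    heartbeat_age_s: float   # seconds since the worker last reported progress


def readiness(s: Signals, *, max_queue_age_s=30.0, max_pool_utilization=0.9,
              max_heartbeat_age_s=15.0):
    """Return (ready, reasons); an empty reasons list means send traffic."""
    reasons = []
    if not s.database_reachable:
        reasons.append("database unreachable")
    if s.queue_age_s > max_queue_age_s:
        reasons.append("queue too old")
    if s.pool_in_use / s.pool_size > max_pool_utilization:
        reasons.append("thread pool saturated")
    if s.heartbeat_age_s > max_heartbeat_age_s:
        reasons.append("worker heartbeat stale")
    return (not reasons, reasons)


# The "bad design" case: the process is up, but the instance is not useful.
bad_instance = Signals(database_reachable=False, queue_age_s=90.0,
                       pool_in_use=10, pool_size=60, heartbeat_age_s=1.0)
ready, reasons = readiness(bad_instance)
```

Returning the reasons, not just a boolean, makes the check debuggable when it starts pulling instances out of rotation.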
5) Design for failure, not against it
Look. You do not win by pretending failure is rare. You win by making failure boring.
A resilient design usually asks five questions.
1. Which dependencies are critical, and which are optional?
2. What is the timeout for each hop?
3. Which calls are safe to retry?
4. What gets isolated behind bulkheads?
5. What degraded response still helps the user?
A practical rule. For each dependency, decide one of three actions:
- wait a little,
- retry carefully,
- or fail fast and degrade.
Do not leave behavior accidental. If a dependency matters to correctness, protect it with idempotency keys, validation, and strong alarms. If it is optional, cut it loose quickly.
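One way to keep behavior from being accidental is to write the decisions down as data and lint them. A hypothetical sketch: `DependencyPolicy`, `Action`, the example dependency names, and the two lint rules are illustrative, derived from the retry and optional-dependency rules in this chapter.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    WAIT = "wait a little"
    RETRY = "retry carefully"
    FAIL_FAST = "fail fast and degrade"


@dataclass(frozen=True)
class DependencyPolicy:
    name: str
    critical: bool
    timeout_s: float
    action: Action
    idempotent: bool = False


def lint(policies) -> list:
    """Flag declared behavior that contradicts the rules of thumb."""
    problems = []
    for p in policies:
        if p.action is Action.RETRY and not p.idempotent:
            problems.append(f"{p.name}: retry declared on a non-idempotent call")
        if not p.critical and p.action is not Action.FAIL_FAST:
            problems.append(f"{p.name}: optional dependency should fail fast")
    return problems


# Example table for a checkout request (names and numbers are made up).
policies = [
    DependencyPolicy("inventory", critical=True, timeout_s=0.3,
                     action=Action.RETRY, idempotent=True),
    DependencyPolicy("payment", critical=True, timeout_s=1.0, action=Action.RETRY),
    DependencyPolicy("recommendations", critical=False, timeout_s=0.1,
                     action=Action.WAIT),
]
problems = lint(policies)
```

Here the lint flags `payment` (retry without idempotency) and `recommendations` (optional, but told to wait). Both are exactly the accidents the table is meant to surface.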
One more thing. Chaos is not only for giant companies. Even a small system should test:
- one instance crash,
- one dependency timeout,
- one queue backlog spike,
- one bad payload,
- one network partition.
The goal is simple. A single bad order ticket should not burn the kitchen.
Where this lives in the wild
- Stripe Checkout — a payments reliability engineer places hard timeouts around tax and promo services so card authorization is not blocked by optional logic.
- Netflix playback startup — a client platform engineer degrades artwork and personalization calls when catalog services lag, preserving the play button first.
- Slack messaging — an infrastructure SRE isolates file-upload workers from the core message-send path so one heavy workload does not stall plain chat.
- Uber rider ETA pipeline — a dispatch engineer uses heartbeats and circuit breakers so stale map or surge services do not freeze trip matching.
- Amazon checkout — a fulfillment platform engineer makes reservation retries idempotent so a timed-out order does not double-hold inventory.
Pause and recall
- Why is a timing failure often more dangerous than a clean crash?
- In the worked example, why did 1.5-second latency collapse a 60-thread service?
- When should you retry, and when should you fail fast?
- What is the difference between liveness, readiness, and a heartbeat?
Interview Q&A
Q: Why use a timeout plus fallback, not just wait longer for a slow dependency?
A: Because waiting longer consumes threads, sockets, and user patience. A short timeout preserves the caller and creates room for graceful degradation.
Common wrong answer to avoid: "Longer waits improve reliability" — they often only move the outage upstream and make it wider.
Q: Why use a circuit breaker, not simply add more retries?
A: Because retries multiply traffic on a dependency that is already unhealthy. A breaker cuts fresh load, contains blast radius, and gives the dependency space to recover.
Common wrong answer to avoid: "Retries always improve success rate" — under overload they often reduce total success.
Q: Why create bulkheads, not one giant worker pool for efficiency?
A: Because shared pools maximize contagion. Bulkheads trade a little peak efficiency for much better isolation when one path degrades.
Common wrong answer to avoid: "One pool is simpler, so it is safer" — simpler resource layout can still fail more dramatically.
Q: Why prefer graceful degradation, not fail the whole request when one optional service is sick?
A: Because users usually value the core action more than every enhancement. Preserving the main path keeps the product usable while the optional path recovers.
Common wrong answer to avoid: "Consistency means every feature must succeed together." That confuses core correctness with optional experience.
Apply now (5 min)
Pick one request in a product you know. List its critical dependencies and optional dependencies. Give each hop a timeout, a retry policy, and a fallback. Then mark where you would place one circuit breaker and one bulkhead.
Sketch from memory:
- the four failure types,
- the slow-dependency cascade math,
- and the closed → open → half-open breaker states.
Bridge. We handle failures. But is the system fast enough? A system that doesn't fail but takes 10 seconds to respond is still unusable. We need to budget latency. → 12-latency-and-throughput-budgets.md