11. Failure Modes and Resilience: Draw the bad day too
~12 min read. A clean diagram is not enough; your blueprint must survive when boxes start dying.
Built on the ELI5 in 00-eli5.md. The blueprint — the high-level architecture diagram — must show what happens on bad days, not only sunny ones.
1) Resilience starts on the diagram
See. A pretty HLD (high-level design) with only happy arrows is incomplete. Production failure does not ask permission. A server hangs. A zone disappears. A dependency becomes slow instead of dead. Your blueprint must show all three cases.
Failure domain means one unit that can fail together. One VM is a failure domain. One rack can be a failure domain. One availability zone can be a failure domain. One region can be a bigger one. A shared database cluster is also a failure domain, even if many services touch it.
Blast radius means how far one failure spreads. Small blast radius is good engineering. Large blast radius is architecture debt.
┌───────────┐    ┌──────────────┐
│ web tier  │──→ │ auth service │
└─────┬─────┘    └──────┬───────┘
      │                 │
      │                 ▼
      │          ┌──────────────┐
      │          │ user store   │
      │          └──────────────┘
      ▼
┌──────────────┐
│ recommender  │──→ optional cache
└──────────────┘
In this blueprint, auth is critical. Recommendations are optional. So the failure policy should differ. If the recommender dies, the page still loads. If auth dies, flows that need a logged-in user fail. That distinction must be explicit.
Now what is the common mistake? Putting too many services behind one hidden dependency. Then one slow database poisons five boxes. Or one shared thread pool chokes the whole process. Or one region carries all writes and becomes a single crater. So resilience begins with separation.
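The blast-radius idea can be checked mechanically. Below is a minimal sketch, assuming a hand-written dependency map of the blueprint above in which each edge is marked critical or optional. The `DEPENDS_ON` map and the `blast_radius` helper are hypothetical names for illustration, not part of any real tool.

```python
from collections import deque

# Hypothetical dependency map of the blueprint above:
# service -> {dependency: True if critical, False if optional}.
DEPENDS_ON = {
    "web tier": {"auth service": True, "recommender": False},
    "auth service": {"user store": True},
    "recommender": {"optional cache": False},
    "user store": {},
    "optional cache": {},
}


def blast_radius(failed, depends_on):
    """Return every service that breaks, directly or transitively, when `failed` dies.

    Failure spreads only across critical edges; optional edges are assumed
    to have a fallback, so they stop the spread.
    """
    # Invert the critical edges: dependency -> services that need it.
    callers = {}
    for service, deps in depends_on.items():
        for dep, critical in deps.items():
            if critical:
                callers.setdefault(dep, []).append(service)

    affected = set()
    queue = deque([failed])
    while queue:
        current = queue.popleft()
        for caller in callers.get(current, []):
            if caller not in affected:
                affected.add(caller)
                queue.append(caller)
    return affected


print(blast_radius("user store", DEPENDS_ON))   # auth service and web tier
print(blast_radius("recommender", DEPENDS_ON))  # set(): nothing critical depends on it
```

If marking one more edge as critical makes the affected set jump, that edge is where the hidden shared dependency lives.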
2) Detect fast, wait less
A dead component is easy to spot. A slow component is worse. Why? Because slow calls hold threads, sockets, and connection slots. Soon healthy requests also queue behind sick ones.
So every cross-service call needs a timeout. Not "someday." A number. A hard number. If your end-to-end budget is 800 ms, you cannot let one downstream call wait 10 seconds. Simple, no?
Example budget:
- page budget = 800 ms
- auth = 80 ms
- catalog = 120 ms
- pricing = 70 ms
- recommendations = optional, cap at 150 ms
- reviews = optional, cap at 120 ms
If recommendations exceed 150 ms, cut them off. Return fallback content. Protect the rest of the page.
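The cut-off rule can be sketched with the standard library's `asyncio.wait_for`. This is a minimal illustration, not a production client: `fetch_recommendations` is a hypothetical stand-in that only sleeps, and `FALLBACK_RECS` is assumed fallback content.

```python
import asyncio

RECS_TIMEOUT_S = 0.150  # the 150 ms cap from the example budget
FALLBACK_RECS = ["popular-1", "popular-2"]  # assumed fallback content


async def fetch_recommendations(simulated_latency_s):
    """Hypothetical stand-in for the real network call; it only sleeps."""
    await asyncio.sleep(simulated_latency_s)
    return ["personalized-1", "personalized-2"]


async def recommendations_or_fallback(simulated_latency_s):
    """Return real recommendations if they arrive within the cap, else the fallback."""
    try:
        return await asyncio.wait_for(
            fetch_recommendations(simulated_latency_s), timeout=RECS_TIMEOUT_S
        )
    except asyncio.TimeoutError:
        # wait_for cancels the slow call, so it stops holding resources.
        return FALLBACK_RECS


if __name__ == "__main__":
    print(asyncio.run(recommendations_or_fallback(0.01)))  # fast: personalized
    print(asyncio.run(recommendations_or_fallback(1.0)))   # slow: fallback
```

The caller never waits longer than the cap, whatever the dependency does.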
Health checks also matter. But use the right type.
- Liveness check: should this process be restarted?
- Readiness check: should traffic be sent here?
- Dependency-aware readiness: can this instance reach the dependencies it needs for real work?
A pod can be alive but not ready. Maybe the process started, but caches are cold. Maybe DB connections are exhausted. Maybe migrations are still running. If the load balancer keeps routing there, requests fail on an instance that should never have received them.
client ──→ load balancer ─┬─→ instance A   healthy
                          ├─→ instance B   healthy
                          └─→ instance C   not ready → remove
See the principle. Detect quickly. Stop routing quickly. Wait less. Recover faster.
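The liveness versus readiness split can be sketched as two separate predicates. This is an illustration under assumptions: `InstanceState` and its fields are hypothetical, chosen to mirror the cold cache, exhausted connections, and running migrations mentioned above.

```python
from dataclasses import dataclass


@dataclass
class InstanceState:
    """Hypothetical snapshot of one instance; field names are illustrative."""
    process_running: bool = True
    cache_warm: bool = True
    migrations_done: bool = True
    free_db_connections: int = 10


def is_live(state):
    """Liveness: if this is False, restart the process."""
    return state.process_running


def is_ready(state):
    """Readiness: if this is False, stop routing traffic but do not restart."""
    return (
        is_live(state)
        and state.cache_warm
        and state.migrations_done
        and state.free_db_connections > 0
    )


def routable(instances):
    """Names of instances the load balancer should keep in rotation."""
    return [name for name, state in instances.items() if is_ready(state)]


fleet = {
    "A": InstanceState(),
    "B": InstanceState(),
    "C": InstanceState(cache_warm=False),  # alive, but not ready
}
print(routable(fleet))  # ['A', 'B']
```

Instance C stays alive and keeps warming up. It simply receives no traffic until it reports ready.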
3) Retries, backoff, circuit breakers, bulkheads
Retries sound smart. Blind retries are how small outages become large outages.
Worked example now. Suppose service A receives 500 requests per second. For each request, it calls service B once. Normally B responds in 100 ms.
So concurrent in-flight calls to B are roughly: 500 requests/second × 0.1 second = 50 calls
Now B becomes sick. Latency jumps to 2 seconds. Same incoming rate stays 500 requests per second.
New in-flight calls become: 500 × 2 = 1,000 calls
If A has only 300 worker threads or async slots for that path, they saturate. Now A looks broken too.
Now add naive retries. Suppose every failed call is retried two more times immediately. If every call fails, one original call becomes 3 total calls. Worst-case offered load becomes: 500 × 3 = 1,500 calls per second
At 2 seconds latency, in-flight becomes: 1,500 × 2 = 3,000 calls
See the explosion. The retry logic just turned pain into fire.
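The arithmetic above is the same formula each time: arrival rate × attempts × latency. A small sketch, where the function name `in_flight_calls` is a hypothetical helper and the 300-slot limit comes from the example:

```python
def in_flight_calls(rate_per_s, latency_ms, attempts_per_request=1):
    """Average concurrent calls: arrival rate x attempts x time each call takes."""
    return rate_per_s * attempts_per_request * latency_ms / 1000


WORKER_SLOTS = 300  # capacity of service A for this path, from the example

healthy = in_flight_calls(500, 100)           # 50.0
sick = in_flight_calls(500, 2_000)            # 1000.0
sick_with_retries = in_flight_calls(500, 2_000, attempts_per_request=3)  # 3000.0

for label, value in [("healthy", healthy), ("sick", sick), ("sick + retries", sick_with_retries)]:
    status = "saturated" if value > WORKER_SLOTS else "ok"
    print(f"{label}: {value:.0f} in flight ({status})")
```

Latency and attempts multiply. That is why a slow dependency plus blind retries hurts far more than either alone.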
So what to do? First, retry only when the failure is likely transient. Second, back off. Third, add jitter so all clients do not retry together.
Example backoff:
- attempt 1: retry after 100 ms ± random jitter
- attempt 2: retry after 200 ms ± random jitter
- attempt 3: retry after 400 ms ± random jitter
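The schedule above doubles the delay each time and adds jitter. A minimal sketch, assuming the jitter is a uniform ±20% of the nominal delay. The 20% figure and the name `backoff_delays_ms` are assumptions for illustration; the text only says "random jitter".

```python
import random


def backoff_delays_ms(retries=3, base_ms=100, jitter_fraction=0.2, rng=None):
    """Return one delay per retry: base x 2^attempt, plus or minus random jitter.

    jitter_fraction=0.2 (an assumed value) spreads each delay by up to 20%.
    """
    rng = rng or random.Random()
    delays = []
    for attempt in range(retries):
        nominal = base_ms * (2 ** attempt)  # 100, 200, 400, ...
        jitter = rng.uniform(-jitter_fraction, jitter_fraction) * nominal
        delays.append(nominal + jitter)
    return delays


if __name__ == "__main__":
    # Two clients draw different delays, so they do not retry in lockstep.
    print([round(d) for d in backoff_delays_ms()])
    print([round(d) for d in backoff_delays_ms()])
```

Without jitter, every client that failed at the same instant retries at the same instant too, and the sick service gets hit by a synchronized wave.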
Now circuit breaker. It watches recent failures and latency. If the downstream is clearly unhealthy, the breaker opens. Open means: do not call the dependency for a short period. Fail fast. Maybe use fallback. Maybe return partial data.
After a cool-down, the breaker goes half-open and lets a small trickle of test calls through. If those succeed, it closes again. If not, it reopens. That saves the caller from drowning with the callee.
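The closed, open, and half-open states can be sketched as a small class. This is a simplified illustration, not a library implementation: it counts consecutive failures only (a real breaker would usually also watch latency and failure rate), it is not thread-safe, and the names and default thresholds are assumptions.

```python
import time


class CircuitBreaker:
    """Simplified breaker: opens after N consecutive failures, probes after a cool-down."""

    def __init__(self, failure_threshold=5, cooldown_s=10.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.cooldown_s = cooldown_s
        self.clock = clock          # injectable so the cool-down can be tested
        self.failures = 0
        self.opened_at = None       # None means the breaker is closed

    @property
    def state(self):
        if self.opened_at is None:
            return "closed"
        if self.clock() - self.opened_at >= self.cooldown_s:
            return "half-open"
        return "open"

    def call(self, fn, fallback):
        if self.state == "open":
            return fallback()       # fail fast: the dependency is not called
        try:
            result = fn()
        except Exception:
            self._record_failure()
            return fallback()
        # Success (including a half-open probe) closes the breaker.
        self.failures = 0
        self.opened_at = None
        return result

    def _record_failure(self):
        self.failures += 1
        # A failed half-open probe reopens at once; otherwise wait for the threshold.
        if self.opened_at is not None or self.failures >= self.failure_threshold:
            self.opened_at = self.clock()
```

Usage is one wrapper per dependency, for example `breaker.call(fetch_profile, lambda: None)`, where `fetch_profile` is whatever function makes the downstream call.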
Bulkheads are different. They isolate resources. One dependency gets its own thread pool, connection pool, queue, or worker set. Then search failure does not eat checkout capacity. File processing backlog does not freeze message delivery. The ship gets compartments. One flooded room does not sink everything.
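One way to sketch a bulkhead inside a process is a separate concurrency limit per dependency. The example below uses `threading.BoundedSemaphore` from the standard library. The `Bulkhead` class and the slot counts are hypothetical, and real systems often use dedicated thread or connection pools instead.

```python
import threading


class Bulkhead:
    """Caps how many calls to one dependency may run at the same time."""

    def __init__(self, name, max_concurrent):
        self.name = name
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def try_run(self, fn, rejected):
        """Run fn if a slot is free; otherwise reject at once instead of queueing."""
        if not self._slots.acquire(blocking=False):
            return rejected()
        try:
            return fn()
        finally:
            self._slots.release()


# Assumed sizing for illustration: search and checkout get separate compartments.
search_bulkhead = Bulkhead("search", max_concurrent=20)
checkout_bulkhead = Bulkhead("checkout", max_concurrent=50)
```

If search hangs and fills its 20 slots, further search calls are rejected immediately, while the 50 checkout slots stay untouched.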
4) Graceful degradation and containment
A resilient system does not insist on full luxury during failure. It protects the core journey first.
Imagine an e-commerce homepage doing five downstream calls: auth, catalog, pricing, reviews, recommendations.
Traffic = 1,200 requests per second. Homepage worker budget = 400 concurrent request slots.
Normal latencies:
- auth = 60 ms
- catalog = 90 ms
- pricing = 40 ms
- reviews = 110 ms
- recommendations = 180 ms
Now recommendations slows to 3,000 ms. No timeout, no breaker, no fallback.
Step 1: Each homepage request waits 3 seconds for recommendations.
Step 2: Concurrency needed just for that slow call becomes: 1,200 requests/second × 3 seconds = 3,600 waiting operations.
Step 3: Available slots are only 400. At 1,200 new requests per second, they fill in: 400 slots / 1,200 requests per second ≈ 0.33 second.
In one-third of a second, the homepage tier is saturated. Even users who only needed catalog and pricing now suffer.
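Now redesign the blueprint. Set recommendations timeout = 150 ms. Mark it optional. Add breaker after repeated failures. Serve cached popular items when it times out. One caution: 150 ms is below the 180 ms normal latency listed above, so with these exact numbers healthy calls would time out too. In a real budget, keep the cap above the dependency's healthy latency.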
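New math, assuming the downstream calls fan out in parallel so the capped recommendations call is the longest wait: 1,200 × 0.15 = 180 waiting operations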
180 is below the 400-slot budget. Core page survives. Luxury module degrades. Business continues.
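The before and after numbers can be reproduced in a few lines. This sketch assumes each request holds one slot for as long as its slowest downstream call; the helper names are hypothetical.

```python
def slots_held(rate_per_s, hold_ms):
    """Average slots occupied: arrival rate x how long each request holds a slot."""
    return rate_per_s * hold_ms / 1000


def seconds_to_fill(slots, rate_per_s):
    """Time for new arrivals to occupy every slot while nothing is finishing yet."""
    return slots / rate_per_s


SLOTS = 400
RATE = 1_200

uncapped = slots_held(RATE, 3_000)  # 3600.0, far above 400
capped = slots_held(RATE, 150)      # 180.0, below 400

print(f"no timeout: need {uncapped:.0f} slots, full in {seconds_to_fill(SLOTS, RATE):.2f} s")
print(f"150 ms cap: need {capped:.0f} slots, headroom {SLOTS - capped:.0f}")
```

The timeout does not make recommendations healthy. It only bounds how long a sick call may occupy a slot.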
That is graceful degradation. You are not pretending nothing failed. You are choosing what can disappear first.
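Choosing what disappears first can be written down as policy. A minimal sketch, assuming the five homepage calls fan out in parallel: critical calls must succeed, optional calls get a timeout and a fallback. `call_downstream` is a hypothetical stand-in that only sleeps, and the fallback values are assumptions.

```python
import asyncio

CRITICAL = ["auth", "catalog", "pricing"]
OPTIONAL = {  # name -> (timeout in seconds, fallback); values are illustrative
    "reviews": (0.120, []),
    "recommendations": (0.150, ["cached-popular-items"]),
}


async def call_downstream(name, latency_s):
    """Hypothetical stand-in for a real service call; it only sleeps."""
    await asyncio.sleep(latency_s)
    return [f"{name}-data"]


async def call_optional(name, latency_s):
    """Optional module: give up at the cap and serve the fallback instead."""
    timeout_s, fallback = OPTIONAL[name]
    try:
        return await asyncio.wait_for(call_downstream(name, latency_s), timeout=timeout_s)
    except asyncio.TimeoutError:
        return fallback


async def render_homepage(latency_s):
    """Fan out all five calls; a slow optional module degrades, the page still renders."""
    names = CRITICAL + list(OPTIONAL)
    results = await asyncio.gather(
        *(call_downstream(name, latency_s[name]) for name in CRITICAL),
        *(call_optional(name, latency_s[name]) for name in OPTIONAL),
    )
    return dict(zip(names, results))


if __name__ == "__main__":
    sick = {"auth": 0.06, "catalog": 0.09, "pricing": 0.04,
            "reviews": 0.05, "recommendations": 3.0}
    print(asyncio.run(render_homepage(sick)))  # recommendations comes back as the fallback
```

Critical calls have no fallback here on purpose. If auth or catalog fails, the exception propagates and the page fails honestly.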
Containment also applies at region level. If one region is burning, stop sending all users there. If one queue is exploding, pause producers or drop optional work. If one dependency is bad, hide its features instead of dragging the city down.
A good blueprint answers one hard question: "When this box fails, what still works?" If you cannot answer that, the HLD is not finished.
Where this lives in the wild
- Netflix streaming and homepage APIs — circuit breakers keep a slow personalization or metadata dependency from freezing the full user request path.
- Kubernetes-backed services — readiness checks remove unhealthy pods from load balancers before those pods amplify user-facing errors.
- Shopify checkout — graceful degradation keeps cart and payment paths alive even when reviews, recommendations, or other optional widgets misbehave.
- Slack messaging — bulkheads and queues separate core message delivery from file processing and search indexing backlogs.
- Stripe payment flows — tight timeouts, idempotent retries, and dependency isolation reduce duplicate work when banks or networks respond poorly.
Pause and recall
- What is the difference between a failure domain and a blast radius?
- Why is a slow dependency often more dangerous than a dead one?
- In the retry example, how did 500 requests per second turn into 1,500 calls per second?
- Which features should degrade first: critical flows or optional ones?
Interview Q&A
Q: Why use a circuit breaker and not just more retries?
A: More retries assume the dependency will recover quickly and can handle extra load. A circuit breaker accepts current reality, fails fast, and protects the caller from saturation.
Common wrong answer to avoid: "Circuit breakers are just retries with a timer." No, they change the control flow by stopping calls altogether for a period.

Q: Why put bulkheads in the HLD and not leave isolation to implementation details?
A: Resource isolation changes blast radius, capacity planning, and failure behavior across services. That is architecture, not mere code style.
Common wrong answer to avoid: "Bulkheads matter only inside one process." They also exist at queue, cluster, AZ, and dependency boundaries.

Q: Why set strict timeouts instead of waiting for the downstream to finish eventually?
A: Waiting forever converts one bad dependency into system-wide resource exhaustion. Timeouts cap damage and preserve headroom for healthy work.
Common wrong answer to avoid: "Timeouts reduce correctness." Bad timeout values can hurt, but no timeout is usually worse for both correctness and availability.

Q: Why degrade gracefully and not return an error when one optional dependency fails?
A: Users usually prefer a smaller working experience over a complete outage. Architecture should preserve the core journey first and luxury later.
Common wrong answer to avoid: "Because optional features do not matter." They do matter, but they matter less than the critical path during failure.
Apply now (5 min)
Exercise: Take one architecture you know. Mark each dependency as critical or optional. Then choose one protection for each edge: timeout, retry policy, breaker, bulkhead, or fallback.
Sketch from memory: Draw a blueprint with one web tier, two critical dependencies, and two optional ones. Then mark what happens when each dependency becomes slow, not dead.
Bridge. The system can now survive bad boxes and bad days. But sometimes nothing is broken; the flood of traffic itself is the problem. → 12-rate-limiting-and-backpressure.md