13. Circuit breaker and bulkhead: stopping one burning room from taking the whole building
~15 min read. Resilience is not bravery; it is disciplined refusal to fail together.
Built on the ELI5 in 00-eli5.md. The board rules — rules that protect the town when one building is on fire — become resilience controls.
1) Failure spreads through waiting, not only through crashes
See, most cascading failures start with slowness. One dependency becomes slow, callers keep waiting, threads pile up, and the queue turns into a parking lot.
That is why your first board rules are timeouts and retry budgets. A fast failure often protects the rest of the system better than patient optimism.
- Set each timeout shorter than the caller’s own remaining deadline. Otherwise the caller gives up first while the downstream call keeps running, and that work is wasted.
- Use retry budgets so retries remain a controlled fraction of original traffic, not an accidental traffic amplifier.
- Distinguish transient faults from overload. Retrying a saturated dependency often makes saturation worse.
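A retry budget can be sketched as a small credit counter. This is a minimal illustration, not a specific library's API: the class name `RetryBudget`, the 10 percent ratio, and the cap on banked retries are all assumptions chosen for the example. Each original request earns a little credit, and a retry is allowed only when a full retry's worth of credit is available.

```python
class RetryBudget:
    """Allow retries only as a bounded fraction of original requests.

    Hypothetical sketch: `percent` and `max_banked_retries` are example values.
    Credits are kept as integers (100 credits = one retry) to avoid float drift.
    """

    def __init__(self, percent: int = 10, max_banked_retries: int = 10) -> None:
        self.percent = percent                 # credit earned per original request
        self.cap = max_banked_retries * 100    # quiet periods cannot bank unlimited retries
        self.credits = 0

    def record_request(self) -> None:
        """Call once for every original (non-retry) request."""
        self.credits = min(self.cap, self.credits + self.percent)

    def try_spend(self) -> bool:
        """Consume one retry's worth of credit if available; otherwise refuse."""
        if self.credits >= 100:
            self.credits -= 100
            return True
        return False


budget = RetryBudget(percent=10)
for _ in range(100):
    budget.record_request()

# 100 original requests at 10 percent earn exactly 10 retries.
allowed = sum(budget.try_spend() for _ in range(50))
print(allowed)  # 10
```

When `try_spend()` returns False, the caller fails fast instead of retrying, so a saturated dependency sees at most about 10 percent extra traffic rather than a multiple of it.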
user request
-> api
-> slow payment call
-> thread waits
-> pool fills
-> queue grows
-> unrelated requests also slow
Worked example. Payment latency jumps from 80 milliseconds to 4 seconds. Without a 300 millisecond timeout, checkout workers keep waiting until the whole pool blocks.
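The arithmetic behind this example can be checked with Little's law: average busy workers equals arrival rate times the time each worker is held. The sketch below uses the latencies from the worked example; the traffic rate of 50 requests per second and the pool size of 40 workers are assumed numbers added only for illustration.

```python
def busy_workers(requests_per_second: int, held_ms: int) -> float:
    """Little's law: average workers occupied = arrival rate x time each is held."""
    return requests_per_second * held_ms / 1000


RATE = 50        # assumed checkout traffic, requests per second
POOL_SIZE = 40   # assumed worker pool size

scenarios = [
    ("healthy (80 ms)", 80),
    ("slow, no timeout (4 s)", 4000),
    ("slow, 300 ms timeout", 300),   # workers are released when the timeout fires
]

for label, held_ms in scenarios:
    needed = busy_workers(RATE, held_ms)
    status = "fits" if needed <= POOL_SIZE else "pool exhausted"
    print(f"{label}: {needed:.0f} workers needed, {status}")
```

Under these assumptions the healthy case needs 4 workers, the untimed slow case needs 200, and the timed-out case needs 15. The payment calls still fail, but the pool keeps room for everything else.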
So resilience begins before the breaker. It begins when you decide how long waiting is still useful.
2) Circuit breaker changes behavior based on recent failure signals
A circuit breaker usually has three states: closed, open, and half-open. Closed means calls pass normally while failures are measured.
When failures or timeouts cross a threshold, the breaker opens. Open means requests are rejected at once without calling the dependency, which protects callers from more waiting.
closed --too many failures--> open
   ^                            |
   |                            | cool-down expires
   |                            v
   +----- success in probes --- half-open
                                |
                                +-- failures --> open
After a cool-down, the breaker becomes half-open and allows a few probe calls. If probes succeed, the breaker closes. If probes fail, it opens again.
- Use rolling windows, not lifetime counters, or old incidents will dominate today’s decisions.
- Trip on meaningful signals like timeout rate, server error rate, or consecutive failures, tracked per endpoint.
- Return useful fallback behavior when possible, such as cached data, degraded UI, or a queued retry.
Worked example. Recommendation service fails 60 of the last 100 calls. Breaker opens for 30 seconds, serves cached suggestions, then probes with five half-open requests.
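The three states can be sketched as a small single-threaded state machine. This is an illustrative sketch under stated assumptions, not any particular library's breaker: the class name `CircuitBreaker`, its method names, and the rule that the window must be full before tripping are all choices made for the example. The defaults mirror the worked example (100-call window, 30-second cool-down, five probes); the 50 percent threshold is assumed.

```python
import time
from collections import deque
from typing import Callable, Deque


class CircuitBreaker:
    """Three-state breaker driven by a rolling window of recent outcomes."""

    def __init__(
        self,
        window: int = 100,
        failure_ratio: float = 0.5,
        cooldown_s: float = 30.0,
        probes: int = 5,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.results: Deque[bool] = deque(maxlen=window)  # True means failure
        self.window = window
        self.failure_ratio = failure_ratio
        self.cooldown_s = cooldown_s
        self.probes = probes
        self.clock = clock
        self.state = "closed"
        self.opened_at = 0.0
        self.probes_sent = 0
        self.probe_successes = 0

    def allow(self) -> bool:
        """Return True if the caller may attempt the call right now."""
        if self.state == "open":
            if self.clock() - self.opened_at < self.cooldown_s:
                return False              # fail fast; caller should use its fallback
            self.state = "half-open"      # cool-down expired, start probing
            self.probes_sent = 0
            self.probe_successes = 0
        if self.state == "half-open":
            if self.probes_sent >= self.probes:
                return False              # probe quota used; wait for outcomes
            self.probes_sent += 1
        return True

    def record(self, success: bool) -> None:
        """Report the outcome of a call that allow() permitted."""
        if self.state == "open":
            return                        # late result from before the trip; ignore
        if self.state == "half-open":
            if not success:
                self._open()              # any failed probe reopens the breaker
                return
            self.probe_successes += 1
            if self.probe_successes >= self.probes:
                self.state = "closed"
                self.results.clear()      # start a fresh window after recovery
            return
        self.results.append(not success)
        window_full = len(self.results) == self.window
        if window_full and sum(self.results) / self.window >= self.failure_ratio:
            self._open()

    def _open(self) -> None:
        self.state = "open"
        self.opened_at = self.clock()
```

A caller would wrap each dependency call as: check `allow()`, serve the cached fallback if it returns False, otherwise make the call and pass the outcome to `record()`. The injectable `clock` exists only so the cool-down can be tested without sleeping.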
Simple, no? The breaker is just a disciplined refusal to keep touching a hot stove.
3) Bulkheads isolate failure domains so one crowd cannot drown another
Now come to bulkheads. A ship survives because compartments stay separate. Systems survive because worker pools, queues, and limits stay separated too.
Without bulkheads, one noisy dependency can consume every thread, connection, or CPU slice. Then healthy features fail merely because they share the same lane.
shared pool        isolated pools
-----------        ----------------------------
A A A A B B        payments | search   | mail
A overloads all    [10]     | [20]     | [5]
B also starves     isolated | isolated | isolated
- Give payment calls their own connection pool if payment slowness should not freeze profile reads.
- Give premium tenants or batch jobs separate quotas when they should not starve interactive traffic.
- Use queue limits with rejection when one lane should shed load instead of poisoning every lane.
Worked example. Search indexing spikes during a catalog import. Because indexing has its own workers, customer checkout still has free workers and stays responsive.
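One way to sketch isolated lanes is a bounded semaphore per lane with a non-blocking acquire, so a full lane rejects work instead of queueing it. The names `Bulkheads` and `BulkheadFull` are hypothetical, and the lane sizes are taken from the picture above (payments 10, search 20, mail 5).

```python
import threading
from contextlib import contextmanager
from typing import Dict, Iterator


class BulkheadFull(Exception):
    """Raised when a lane has no free slot; the caller should shed or degrade."""


class Bulkheads:
    """One bounded semaphore per lane, so lanes cannot borrow each other's slots."""

    def __init__(self, limits: Dict[str, int]) -> None:
        self._slots = {
            lane: threading.BoundedSemaphore(size) for lane, size in limits.items()
        }

    @contextmanager
    def enter(self, lane: str) -> Iterator[None]:
        slot = self._slots[lane]
        # Non-blocking acquire: reject immediately rather than wait behind a slow lane.
        if not slot.acquire(blocking=False):
            raise BulkheadFull(lane)
        try:
            yield
        finally:
            slot.release()


pools = Bulkheads({"payments": 10, "search": 20, "mail": 5})

with pools.enter("payments"):
    pass  # call the payment dependency here
```

If every `mail` slot is busy, `pools.enter("mail")` raises `BulkheadFull` while `payments` and `search` remain available, which is the isolation the picture describes.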
These are board rules again. They are rules about who may consume scarce space during stress.
4) Put timeout, retry, breaker, and bulkhead together
The strongest pattern is layered, not lonely. Timeout limits waiting. Retry budget limits amplification. Circuit breaker stops repeated touching. Bulkhead limits blast radius.
Order matters too. A sensible flow is deadline first, then small bounded retries, then breaker evaluation, all inside an isolated worker or connection budget.
request deadline
-> acquire isolated worker
-> call dependency with 200 ms timeout
-> retry once if budget allows
-> breaker opens on bad window
-> fallback or fail fast
Worked example. Checkout has a 900 millisecond deadline. Inventory gets 150 milliseconds, payment gets 300, recommendations get 80 and may be skipped entirely.
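The budget in this example can be sketched as a deadline object that hands out per-call timeouts. This is a minimal sketch: the class `Deadline`, its `timeout_for` method, and the millisecond clock are assumptions for illustration. Required calls get their cap shrunk to whatever time is left, while optional calls are skipped when their full cap no longer fits.

```python
import time
from typing import Callable, Optional


def _monotonic_ms() -> float:
    return time.monotonic() * 1000


class Deadline:
    """Tracks how much of a request's total time budget is left, in milliseconds."""

    def __init__(self, total_ms: float, clock_ms: Callable[[], float] = _monotonic_ms) -> None:
        self._clock_ms = clock_ms
        self._expires_at = clock_ms() + total_ms

    def remaining_ms(self) -> float:
        return max(0.0, self._expires_at - self._clock_ms())

    def timeout_for(self, cap_ms: float, optional: bool = False) -> Optional[float]:
        """Per-call timeout: the call's own cap, never more than the time left.

        Returns None when an optional call cannot get its full cap,
        signalling the caller to skip it.
        """
        left = self.remaining_ms()
        if optional and left < cap_ms:
            return None
        return min(cap_ms, left)


deadline = Deadline(total_ms=900)
inventory_timeout = deadline.timeout_for(150)
payment_timeout = deadline.timeout_for(300)
recs_timeout = deadline.timeout_for(80, optional=True)  # None means skip recommendations
```

With this shape, a checkout that has already spent 850 of its 900 milliseconds gives a required call only 50 milliseconds and drops recommendations entirely, instead of letting any single call overrun the request.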
- Do not retry everywhere in the stack, or one user request multiplies into a storm.
- Do not share one giant pool, or one broken downstream drags healthy features underwater.
- Measure fallback success too. A graceful degrade is useful only when the user still gets value.
Preventing cascading failure is therefore not magic. It is careful budgeting of time, attempts, and shared capacity.
A mature system treats resilience as policy, not as heroics. That is what good board rules really mean.
Where this lives in the wild
- Netflix API platform engineer — uses circuit breakers and bounded fallback paths so one weak dependency does not freeze the whole request graph.
- Amazon checkout reliability engineer — isolates payment, inventory, and recommendation paths because not every dependency deserves equal blocking power.
- Zerodha trading SRE — sets strict timeouts and separate execution pools so reporting slowness does not affect order placement.
- DoorDash platform engineer — uses retry budgets and bulkheads so restaurant menu sync spikes do not drown consumer-facing ordering traffic.
- OpenAI or Anthropic inference platform engineer — separates model pools and request classes so one overloaded model lane does not starve every workload.
Pause and recall
- Why do many cascading failures begin with waiting and queue growth rather than clean crashes?
- What is the difference between open and half-open breaker states?
- Why is a retry budget safer than unlimited retry loops under overload?
- How does a bulkhead help one healthy feature survive another feature’s failure?
Interview Q&A
Q: Why set a timeout before adding a circuit breaker? A: Because timeout bounds waiting on each call. Without it, callers can still exhaust their own workers before the breaker has enough evidence to trip.
Common wrong answer to avoid: "The breaker alone is enough; timeouts are optional tuning."
Q: Why does a circuit breaker improve system behavior during repeated failures? A: Because it converts repeated slow failures into quick rejections or controlled fallbacks, preserving threads, sockets, and user-facing latency budgets.
Common wrong answer to avoid: "A breaker fixes the dependency itself."
Q: Why are bulkheads different from circuit breakers? A: Because bulkheads isolate capacity between lanes, while breakers change whether calls continue flowing to a troubled dependency.
Common wrong answer to avoid: "They are the same pattern with different names."
Q: Why can retries become dangerous without a budget? A: Because every failure creates extra traffic. Under overload, those extra attempts often deepen the overload instead of healing it.
Common wrong answer to avoid: "Retries only add harmless resilience because they come after an error."
Apply now (5 min)
Choose one critical request path, then assign a total deadline, per-call timeouts, a retry budget, a breaker threshold, and one isolation boundary. Write which dependency may degrade gracefully and which one must stay strict.
Sketch from memory:
- the waiting chain where one slow dependency fills the whole worker pool,
- the closed-open-half-open breaker state diagram,
- and the bulkhead picture showing separate pools for payment, search, and mail.
Bridge. Even with good protection rules, distributed systems still keep a few uncomfortable truths. → 14-honest-admission.md