07. Timeout management — spend time like money¶

~14 min read. Reliability often fails not because one step is bad, but because the system spends its total time budget poorly.

Built on the ELI5 in 00-eli5.md. The vitals monitor watches clocks too, and the stability kit depends on knowing exactly when the full treatment is already taking too long.

1) First picture: every workflow has one total clock¶

A user feels one wait, not many internal waits. So the whole workflow needs a total deadline. Inside that deadline, each step gets a smaller budget.

total request budget = 10 s

retrieve docs   2 s
model call      4 s
verify output   2 s
render answer   2 s

The simple version: If one step steals extra time, later steps suffocate. That is why timeout management is allocation, not just a number in one SDK call. The triage desk may decide a request is retryable, but the clock may say no.
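One way to make the total clock concrete is a small deadline object that every step consults. The sketch below is only an illustration: the `Deadline` class and its method names are hypothetical, and the clock is simulated so the numbers stay deterministic.

```python
import time


class Deadline:
    """Tracks one total clock for a whole workflow (hypothetical helper)."""

    def __init__(self, total_s, clock=time.monotonic):
        self._clock = clock
        self._expires_at = clock() + total_s

    def remaining(self):
        """Seconds left on the total clock, never negative."""
        return max(0.0, self._expires_at - self._clock())

    def step_timeout(self, step_budget_s):
        """A step gets its own budget or whatever is left overall, whichever is smaller."""
        return min(step_budget_s, self.remaining())


# Simulated clock: a one-element list we advance by hand.
now = [0.0]
deadline = Deadline(10.0, clock=lambda: now[0])

first = deadline.step_timeout(2.0)  # retrieve docs: the full 2 s is available
now[0] = 9.0                        # pretend earlier steps overran
last = deadline.step_timeout(2.0)   # render answer: only 1 s is left
```

The point of `step_timeout` is that a late step never receives more time than the user-facing deadline still allows, even if its nominal budget is larger.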

2) Per-step timeouts prevent hidden queue death¶

Now what is the problem with one big outer timeout? Inner steps can hang quietly. The outer timeout fires too late. Threads stay occupied. Queues grow. Users stack up.

bad design
outer timeout = 30 s
inner retrieval has no timeout
inner tool has no timeout
model call waits forever until outer timer kills all

This is how small slowness becomes system slowness. Each meaningful dependency needs its own timeout.

retrieval timeout,
model timeout,
tool timeout,
verification timeout,
streaming idle timeout.

For example, a support assistant has a 12-second outer SLA. Without step limits, retrieval stalls for 9 seconds, the model uses 5 seconds, and the response lands at 14 seconds, missing the SLA.

With step limits, retrieval is capped at 2 seconds. If it misses, fallback answer generation starts sooner. The user gets a degraded but timely answer. That is the stability kit saving the experience.
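A capped retrieval step with a fallback can be sketched with `asyncio.wait_for`. The function names and the fallback string are hypothetical, and the durations are scaled down from seconds so the example runs quickly.

```python
import asyncio


async def stalled_retrieval():
    # Stands in for a dependency that hangs longer than its budget.
    await asyncio.sleep(0.5)
    return "live docs"


async def retrieve_with_cap(timeout_s):
    """Return (context, degraded). The cap belongs to this step, not the outer timer."""
    try:
        docs = await asyncio.wait_for(stalled_retrieval(), timeout=timeout_s)
        return docs, False
    except asyncio.TimeoutError:
        # Budget missed: start the fallback path now instead of waiting longer.
        return "cached policy summary", True


context, degraded = asyncio.run(retrieve_with_cap(0.05))
```

With the tight cap, the caller gets the cached summary and a `degraded` flag it can surface to the user. With a generous cap, the same function returns the live result.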

3) Total budget beats local optimism¶

Teams often size timeouts independently. Retrieval gets 5 seconds. Model gets 10 seconds. Verification gets 5 seconds. Total becomes 20 seconds. But the product promise is 8 seconds.

The production problem: Each subsystem thinks it is reasonable. The user does not care. The practical response: Start from the whole-request budget. Then divide downward.

product promise: answer within 8 s
reserve for frontend/render = 1 s
reserve for degraded fallback = 1 s
remaining decision budget = 6 s

Now apportion the 6 seconds. For example, suppose you need:

retrieval,
model,
output validation.

Choose:

retrieval = 1.5 s,
model = 3.5 s,
validation = 1.0 s.

These caps sum to exactly 6.0 s, so the decision budget is fully spent. The 1.0 s reserved earlier for degraded fallback is the spare you keep for one retry or fallback handoff. See the discipline. The retry dose must live inside this budget, not outside it.
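The arithmetic is easy to check mechanically. The sketch below uses a hypothetical helper, `remaining_slack`, to compare the top-down split with the independently sized timeouts from the start of this section.

```python
def remaining_slack(promise_s, reserves, steps):
    """Return unallocated seconds; raise if the step caps break the promise."""
    decision_budget = promise_s - sum(reserves.values())
    slack = decision_budget - sum(steps.values())
    if slack < 0:
        raise ValueError(f"step caps exceed decision budget by {-slack:.1f} s")
    return slack


reserves = {"frontend_render": 1.0, "degraded_fallback": 1.0}
top_down = {"retrieval": 1.5, "model": 3.5, "validation": 1.0}
independent = {"retrieval": 5.0, "model": 10.0, "verification": 5.0}

slack = remaining_slack(8.0, reserves, top_down)  # caps fit the 6 s budget exactly

try:
    remaining_slack(8.0, reserves, independent)
    independent_fits = True
except ValueError:
    independent_fits = False  # 20 s of caps cannot fit an 8 s promise
```

A check like this could run in configuration tests, so a locally reasonable timeout change that breaks the end-to-end promise fails before it ships.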

4) Timeout value should reflect step role, not symmetry¶

All steps should not get equal time. Some steps are cheap. Some are expensive. Some are mandatory. Some are optional.

step role              timeout style
mandatory cheap step   short hard cap
expensive core step    medium cap
optional enrichment    very short cap or skip
streaming output       idle timeout + total timeout

The simple version: A citation reranker might be optional. A permission check is not optional. A health probe should be extremely short. For example, a research assistant flow:

query rewrite,
web retrieval,
synthesis,
citation polish.

If citation polish misses its 600 ms cap, ship the answer with simpler formatting. If a permission check misses its 600 ms cap, do not continue blindly. So timeout design follows consequence. The triage desk cares about function, not symmetry.
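The difference in consequence can be encoded directly in how a timeout is handled. In this sketch the `Step` type, `run_step`, and `MandatoryStepTimeout` are hypothetical names, and the durations are scaled down for a fast run.

```python
import asyncio
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    name: str
    timeout_s: float
    mandatory: bool


class MandatoryStepTimeout(Exception):
    """A step the workflow cannot skip missed its cap."""


async def run_step(step, work):
    """Run work() under the step's cap; skip optional steps, abort on mandatory ones."""
    try:
        return await asyncio.wait_for(work(), timeout=step.timeout_s)
    except asyncio.TimeoutError:
        if step.mandatory:
            raise MandatoryStepTimeout(step.name) from None
        return None  # optional enrichment: carry on without it


async def slow_work():
    await asyncio.sleep(0.2)  # scaled-down stand-in for a slow dependency
    return "done"


polish = Step("citation_polish", timeout_s=0.02, mandatory=False)
permission = Step("permission_check", timeout_s=0.02, mandatory=True)

polished = asyncio.run(run_step(polish, slow_work))  # None: ship simpler formatting
```

Running `run_step(permission, slow_work)` under the same conditions raises `MandatoryStepTimeout` instead, so the workflow stops rather than continuing blindly.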

5) Streaming needs three clocks, not one¶

Streaming systems have a special issue. Users want the first token quickly. They also care that the stream keeps moving. So streaming paths need:

time to first token timeout,
idle gap timeout,
total generation timeout.

stream clocks
┌────────────────────────────────────┐
│ TTFT timeout      = 2 s            │
│ idle gap timeout  = 3 s            │
│ total stream cap  = 20 s           │
└────────────────────────────────────┘

A stream can fail in three ways: it never starts, it stalls midway, or it never finishes.

Worked example. A chat product starts streaming in 1 second. Then nothing arrives for 6 seconds. Without an idle timeout, the UI spinner hangs, the user assumes the stream is still alive, and support tickets rise.

With an idle timeout of 3 seconds, the app can cut over to: "Generation stalled. Here is the partial answer and a retry option." That is graceful. The vitals monitor should watch token cadence, not only initial response.
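The stream clocks can be enforced in one wrapper around the token iterator. This is a minimal sketch under stated assumptions: `guarded_stream`, `StreamTimeout`, and the fake `stalling_stream` are hypothetical, and the clock values are scaled down from the seconds in the box above.

```python
import asyncio
import time


class StreamTimeout(Exception):
    def __init__(self, kind):
        super().__init__(kind)
        self.kind = kind  # "ttft", "idle", or "total"


async def guarded_stream(tokens, ttft_s, idle_s, total_s):
    """Re-yield tokens, raising StreamTimeout labelled with the clock that fired."""
    start = time.monotonic()
    it = tokens.__aiter__()
    first = True
    while True:
        left = total_s - (time.monotonic() - start)
        if left <= 0:
            raise StreamTimeout("total")
        gap = ttft_s if first else idle_s
        try:
            token = await asyncio.wait_for(it.__anext__(), timeout=min(gap, left))
        except StopAsyncIteration:
            return
        except asyncio.TimeoutError:
            kind = "total" if left <= gap else ("ttft" if first else "idle")
            raise StreamTimeout(kind) from None
        first = False
        yield token


async def stalling_stream():
    yield "Hello"
    await asyncio.sleep(0.01)
    yield " world"
    await asyncio.sleep(1.0)  # the stall: far longer than the idle gap allows
    yield "!"


async def collect(tokens, **clocks):
    """Gather tokens; on timeout, keep the partial answer and report which clock fired."""
    partial = []
    try:
        async for token in guarded_stream(tokens, **clocks):
            partial.append(token)
    except StreamTimeout as exc:
        return partial, exc.kind
    return partial, None


partial, fired = asyncio.run(
    collect(stalling_stream(), ttft_s=0.2, idle_s=0.05, total_s=2.0)
)
```

Here the stream starts on time, then the idle clock fires during the stall. The caller still holds the partial tokens, which is what makes the "partial answer plus retry" message possible.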

6) Timeout budgets should adapt by request class¶

Now a senior point. Not every request deserves the same time. Autocomplete in an IDE has a tight budget. Research synthesis may have a looser budget. Payment approval may prefer certainty over speed, but still needs a ceiling.

request class        total budget
inline completion    1.5 s
chat answer          8 s
deep research        25 s
approval workflow    12 s + human fallback

The simple version: One universal timeout is lazy design. For example, a coding assistant uses 800 ms for inline completion, but 10 seconds for an explicit "explain this file" request. That is not inconsistency. That is matching latency to user intent. The triage desk should know the request class before assigning budgets.
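In code, the request-class table can be a plain lookup that runs before any step budget is assigned. The names below are hypothetical. Rejecting unknown classes is one possible design choice, picked here so that no request silently falls back to a universal timeout.

```python
# Total budgets in seconds, mirroring the table above.
TOTAL_BUDGET_S = {
    "inline_completion": 1.5,
    "chat_answer": 8.0,
    "deep_research": 25.0,
    "approval_workflow": 12.0,
}

# Classes that hand off to a person when the budget runs out.
HUMAN_FALLBACK = {"approval_workflow"}


def budget_for(request_class):
    """Return (total_budget_s, has_human_fallback) for a known request class."""
    if request_class not in TOTAL_BUDGET_S:
        raise ValueError(f"unknown request class: {request_class!r}")
    return TOTAL_BUDGET_S[request_class], request_class in HUMAN_FALLBACK


chat_budget, chat_human = budget_for("chat_answer")
approval_budget, approval_human = budget_for("approval_workflow")
```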

7) Timeouts must emit cause-aware outcomes¶

A timeout should not merely say, "Timed out." It should say where, under what budget, and what fallback was attempted.

timeout event
step = retrieval
budget = 1500 ms
elapsed = 1510 ms
request_class = support-chat
fallback_used = cached_policy_answer

That is actionable. It also helps later incident review. Without this detail, timeouts blur together, and teams tune the wrong step.
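A structured record keeps these fields together so they can be logged and queried. The sketch below assumes a hypothetical `TimeoutEvent` dataclass serialized as one JSON line; the field names follow the event shown above.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass(frozen=True)
class TimeoutEvent:
    step: str
    budget_ms: int
    elapsed_ms: int
    request_class: str
    fallback_used: Optional[str]  # None when no fallback was attempted

    @property
    def overrun_ms(self):
        """How far past its budget the step ran."""
        return self.elapsed_ms - self.budget_ms


event = TimeoutEvent(
    step="retrieval",
    budget_ms=1500,
    elapsed_ms=1510,
    request_class="support-chat",
    fallback_used="cached_policy_answer",
)

# One JSON object per line is easy to grep and to aggregate by step.
log_line = json.dumps(asdict(event), sort_keys=True)
```

Grouping such lines by `step` and `request_class` during incident review shows which budget is actually being missed, rather than a single undifferentiated timeout count.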

Where this lives in the wild¶

GitHub Copilot — IDE performance engineer: uses extremely tight inline completion timeouts but allows longer budgets for explicit chat explanations because user intent differs sharply.
Perplexity — answer pipeline engineer: separates time to first token and total synthesis timeout so the product can stream quickly without allowing endless answer generation.
Intercom Fin — support runtime owner: caps account-tool lookups aggressively, then serves policy-only degraded answers when live customer data misses its budget.
Cursor — agent orchestration engineer: enforces per-step tool timeouts so repository search or test runs cannot consume the entire edit-request deadline.
Klarna assistant — workflow reliability lead: keeps approval-related flows within strict total budgets so users are not left waiting while the system silently retries payment-adjacent checks.

Pause and recall¶

Why is one outer timeout not enough for a multi-step AI workflow?
How should you derive per-step timeouts from the product promise?
Why do streaming systems need time-to-first-token and idle-gap timeouts separately?
Why should timeout budgets vary by request class?

Interview Q&A¶

Q: Why derive per-step timeouts from a total user-facing deadline instead of choosing each timeout independently?
A: Independent local choices often add up to an impossible end-to-end latency contract.
Common wrong answer to avoid: "Because independent timeouts are harder to implement." The real issue is broken budget composition.

Q: Why should optional enrichment steps have shorter or skippable timeouts than core safety checks?
A: Their value is lower than their risk of delaying the full workflow, so they should be sacrificed first under pressure.
Common wrong answer to avoid: "Because optional steps are less accurate." Accuracy is not the deciding dimension here.

Q: Why is an idle timeout necessary in streaming systems even after the first token arrives?
A: A stream can start and then stall, creating a false sense of progress unless token cadence is monitored.
Common wrong answer to avoid: "Because TTFT already proves the stream is healthy." It proves only that the stream started.

Q: Why can two requests to the same model need different timeout budgets?
A: User intent, risk, and expected interaction pattern differ, so acceptable waiting time also differs.
Common wrong answer to avoid: "Because one model secretly changes speed by feature." Model speed matters, but request contract matters more.

Apply now (5 min)¶

Exercise. Pick one AI workflow with at least four steps. Assign a total latency budget, then divide it across steps. Mark which steps are mandatory, optional, and degradable.

Sketch from memory. Draw the total-clock box with smaller per-step boxes inside. Add TTFT, idle, and total stream clocks if your flow streams. Mark where the stability kit activates when the budget is gone.

Bridge. Timeouts tell us when to stop waiting. But if we retry after a timeout, we must be sure we do not repeat the same side effect twice. Next comes idempotency and deduplication. → 08-idempotency-dedup.md