03. Retry & backoff: repeat carefully, not desperately

~14 min read. A retry can rescue one request or destroy the whole system, depending on how you send it.

Built on the ELI5 in 00-eli5.md. The retry dose — one careful repeat attempt — only works when the triage desk already suspects a transient problem.

A retry is not free. It consumes time, tokens, provider quota, and user patience.

If many requests fail together, naive retries can create a storm.

provider hiccup at 12:00:00
      │
      ├── app retries immediately for 10,000 requests
      │        │
      │        └── provider gets hit again while still sick
      │
      └── failure wave becomes bigger than the original issue

The simple version: The retry dose must have spacing. That spacing is backoff. The spacing must also differ across clients. That difference is jitter. Without both, your own recovery logic becomes the outage.

2) Exponential backoff solves crowding better than fixed waits

Now what is the problem with fixed waits? Suppose every caller waits exactly one second. Then every caller retries together one second later. Still crowded. Exponential backoff increases the wait after each failed attempt. Typical pattern:

after attempt 1 fails, wait 1 second,
after attempt 2 fails, wait 2 seconds,
after attempt 3 fails, wait 4 seconds,
after attempt 4 fails, wait 8 seconds.

Picture first.

attempt timeline

fixed wait:       0s ─ 1s ─ 2s ─ 3s
exponential wait: 0s ─ 1s ─ 3s ─ 7s

Exponential spacing gives the provider time to breathe. It also protects your own queues. For example, assume a model endpoint returns 503 for 6 seconds.

Policy A uses fixed 1-second waits. Attempts happen at 0, 1, 2, 3 seconds. All fail.

Policy B uses exponential waits of 1, 2, 4 seconds. Attempts happen at 0, 1, 3, 7 seconds. The fourth attempt lands after recovery.

Same retry count. Very different outcome. The retry dose needs timing, not emotion.
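The Policy A versus Policy B comparison can be checked with a few lines of Python. This is a minimal sketch, assuming each attempt fails or succeeds instantly and that the outage ends exactly at the 6-second mark; the helper names (`attempt_times`, `exponential_waits`, `first_success`) are hypothetical and exist only for this illustration.

```python
from itertools import accumulate


def attempt_times(waits):
    """Return the start time of each attempt, given the waits between attempts."""
    # The first attempt starts at t=0; each later one starts after the cumulative wait.
    return [0, *accumulate(waits)]


def exponential_waits(retries, base=1.0, factor=2.0):
    """Waits of base, base*factor, base*factor**2, ... one per retry."""
    return [base * factor**i for i in range(retries)]


def first_success(times, outage_seconds):
    """Return the first attempt time at or after recovery, or None if every attempt fails."""
    return next((t for t in times if t >= outage_seconds), None)


OUTAGE_SECONDS = 6  # the endpoint returns 503 for the first 6 seconds

fixed = attempt_times([1, 1, 1])                   # Policy A
exponential = attempt_times(exponential_waits(3))  # Policy B

print("fixed:", fixed, "first success:", first_success(fixed, OUTAGE_SECONDS))
print("exponential:", exponential, "first success:", first_success(exponential, OUTAGE_SECONDS))
```

Both policies make four attempts. Only the exponential schedule reaches past the 6-second outage.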

3) Jitter prevents synchronized retry waves

Exponential backoff alone is not enough. If all clients start together, they may still retry together. That is why we add jitter. Jitter means random variation.

without jitter
client A: retry at 1s, 3s, 7s
client B: retry at 1s, 3s, 7s
client C: retry at 1s, 3s, 7s

with jitter
client A: retry at 0.8s, 2.9s, 6.6s
client B: retry at 1.2s, 3.5s, 7.8s
client C: retry at 0.9s, 2.4s, 6.9s

The spread matters: that spread reduces spike density. Now the provider sees a slope, not a wall. For example, three thousand app servers share one provider. Without jitter, attempt two creates a second traffic cliff. With jitter, attempts arrive over a wider window, which increases the chance that partial recovery helps real users.

The practical response: Use exponential backoff with jitter by default for transient model and tool failures. The vitals monitor should still record how many retries succeeded, because too many saved requests may indicate hidden instability.
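One way to add the random variation is the variant often called "full jitter": draw each wait uniformly between zero and the current exponential ceiling. The sketch below assumes that variant, which spreads clients more widely than the small offsets in the picture above. The function name, the `base` and `cap` defaults, and the three seeded clients are illustrative assumptions, not a prescribed configuration.

```python
import random


def backoff_with_full_jitter(attempt, base=1.0, cap=30.0, rng=random):
    """Wait before retry number `attempt` (0-based).

    The ceiling doubles with each attempt up to `cap`; the actual wait is
    drawn uniformly from [0, ceiling] so that clients do not line up.
    """
    ceiling = min(cap, base * 2**attempt)
    return rng.uniform(0.0, ceiling)


# Three hypothetical clients, each with its own random stream.
clients = {name: random.Random(seed) for name, seed in [("A", 1), ("B", 2), ("C", 3)]}

for name, rng in clients.items():
    waits = [backoff_with_full_jitter(n, rng=rng) for n in range(3)]
    print(f"client {name}: waits {[round(w, 2) for w in waits]}")
```

Each client now draws a different schedule, so retries arrive spread across the window instead of at the same instants. The `cap` keeps late attempts from waiting unboundedly long.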

4) Retry budgets stop local recovery from becoming global waste

Teams often ask, "How many retries should we allow?" Wrong framing. Ask, "What retry budget can this workflow afford?" A retry budget caps repeated attempts by request, feature, or time window. Why? Because cost and latency are finite. Because users will leave. Because one sick dependency can eat your whole quota.

request budget example
┌──────────────────────────────┐
│ max total attempts = 3       │
│ max added latency = 4 s      │
│ max added token spend = 1.5x │
└──────────────────────────────┘

The simple version: The retry dose belongs inside a budget. For example, a support assistant has a total answer budget of 10 seconds. Retrieval takes 2 seconds. Rendering takes 1 second. Only 7 seconds remain for model work. If each model timeout is 4 seconds, you cannot afford even one full retry, because two 4-second attempts already need 8 seconds. Maybe you allow:

attempt 1 timeout 3 seconds,
one retry after 1 second,
attempt 2 timeout 3 seconds,
then fallback.

That fits inside 7 seconds in the worst case (3 + 1 + 3), provided the fallback itself is nearly instant. Even a second retry would break the promise. So the budget decides, not optimism.
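The budget arithmetic for the support assistant can be written down and checked. This is a rough sketch, assuming the worst case where every attempt runs until its timeout. The helper `worst_case_seconds` is hypothetical, and the 1, 2, 4 second waits in the three-retry plan are an assumption added for comparison.

```python
def worst_case_seconds(attempt_timeouts, waits_between):
    """Worst-case time spent if every attempt runs until its timeout."""
    if len(waits_between) != len(attempt_timeouts) - 1:
        raise ValueError("need exactly one wait between consecutive attempts")
    return sum(attempt_timeouts) + sum(waits_between)


TOTAL_BUDGET_S = 10.0
RETRIEVAL_S = 2.0
RENDERING_S = 1.0
model_budget_s = TOTAL_BUDGET_S - RETRIEVAL_S - RENDERING_S  # 7 seconds left for model work

plans = {
    "one retry, 3 s timeouts": worst_case_seconds([3.0, 3.0], [1.0]),
    "one retry, 4 s timeouts": worst_case_seconds([4.0, 4.0], [1.0]),
    # Waits of 1, 2, 4 s are an assumption for illustration.
    "three retries, 3 s timeouts": worst_case_seconds([3.0] * 4, [1.0, 2.0, 4.0]),
}

for name, seconds in plans.items():
    verdict = "fits" if seconds <= model_budget_s else "exceeds budget"
    print(f"{name}: {seconds:.0f} s, {verdict}")
```

Only the first plan stays within the 7 seconds available for model work. The check is cheap enough to run whenever a timeout or retry count changes.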

5) When not to retry

This is senior-level judgment. Do not retry persistent failures. Do not retry invalid credentials. Do not retry schema mismatches. Do not retry deterministic prompt bugs. Do not retry side effects unless the action is idempotent. Do not retry when the user deadline is already gone.

retry? decision

429 rate limit                → maybe yes
503 overloaded model          → maybe yes
network timeout               → maybe yes
invalid API key               → no
tool says "unknown field"     → no
charge_card already sent      → no, unless protected by idempotency

The production problem: Some teams wrap every exception in one retry helper. That is dangerous. The triage desk must label retryable classes clearly. The sealed ward must take over when retries stop helping.

For example, a refund tool call timed out after sending the request. You do not know whether the refund already happened. Blind retry may issue a second refund. This is not a retry problem anymore. It is an idempotency problem. We will study that later.

For now, remember this rule. Uncertain side effects deserve protection before retry.
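The decision table and the side-effect rule can be expressed as a small classifier. This is a sketch under stated assumptions: the failure labels, the `Failure` fields, and the `decide` function are hypothetical names for illustration, and "retry allowed" only means the failure is eligible, not that a retry must happen.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Decision(Enum):
    RETRY_ALLOWED = "retry_allowed"
    DO_NOT_RETRY = "do_not_retry"


@dataclass(frozen=True)
class Failure:
    kind: str                              # hypothetical label from the triage step
    has_side_effect: bool = False          # e.g. a charge or refund was requested
    idempotency_key: Optional[str] = None  # protection against duplicate effects
    deadline_remaining_s: float = 1.0      # time left before the user deadline


# Classes the table marks as "maybe yes".
TRANSIENT_KINDS = {"rate_limit_429", "overloaded_503", "network_timeout"}


def decide(failure: Failure) -> Decision:
    """Apply the rules in order: deadline, failure class, side-effect protection."""
    if failure.deadline_remaining_s <= 0:
        return Decision.DO_NOT_RETRY  # the user deadline is already gone
    if failure.kind not in TRANSIENT_KINDS:
        return Decision.DO_NOT_RETRY  # persistent or deterministic failure
    if failure.has_side_effect and failure.idempotency_key is None:
        return Decision.DO_NOT_RETRY  # uncertain side effect without protection
    return Decision.RETRY_ALLOWED


print(decide(Failure("overloaded_503")))
print(decide(Failure("invalid_api_key")))
print(decide(Failure("network_timeout", has_side_effect=True)))
print(decide(Failure("network_timeout", has_side_effect=True, idempotency_key="refund-123")))
```

An allowlist of transient classes fails safe: an unrecognized error is not retried until someone labels it.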

6) A practical retry policy for AI calls

A good policy combines classification, backoff, jitter, budget, and exit paths.

if transient and within budget:
    retry with exponential backoff + jitter
else if fallback available:
    use backup path
else:
    degrade honestly or escalate

The retry dose is not the whole treatment plan. It is one step before the backup ambulance, the stability kit, or the senior doctor. That is mature reliability.
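The outline above might look like the following in Python. This is a minimal sketch, not a production client: `call_with_retry`, its parameters, and the `is_transient` callback are hypothetical, the jitter is the uniform "full jitter" variant, and the budget here only bounds the waits between attempts. Per-attempt timeouts are assumed to be enforced inside `primary` itself.

```python
import random
import time


def call_with_retry(primary, is_transient, fallback=None, *, max_attempts=3,
                    base=0.5, cap=4.0, budget_s=7.0,
                    sleep=time.sleep, clock=time.monotonic, rng=random):
    """Retry transient failures with backoff and jitter, then fall back or escalate."""
    deadline = clock() + budget_s
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return primary()
        except Exception as exc:  # classification decides what happens next
            last_error = exc
            # Jittered exponential wait: uniform in [0, min(cap, base * 2**(attempt - 1))].
            wait = rng.uniform(0.0, min(cap, base * 2 ** (attempt - 1)))
            within_budget = attempt < max_attempts and clock() + wait < deadline
            if not (is_transient(exc) and within_budget):
                break  # not retryable, or the budget is spent
            sleep(wait)
    if fallback is not None:
        return fallback()  # backup path
    # No backup path: escalate honestly instead of looping.
    raise RuntimeError("retry budget exhausted and no fallback available") from last_error


# Usage with a simulated flaky call that succeeds on the third attempt.
attempts = {"count": 0}


def flaky_model_call():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise TimeoutError("simulated transient timeout")
    return "model answer"


answer = call_with_retry(
    flaky_model_call,
    is_transient=lambda exc: isinstance(exc, TimeoutError),
    fallback=lambda: "cached answer",
    sleep=lambda seconds: None,  # skip real sleeping in this demo
)
print(answer, "after", attempts["count"], "attempts")
```

Injecting `sleep`, `clock`, and `rng` keeps the policy testable without real delays. A circuit breaker would wrap this function from the outside rather than live inside it.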

Where this lives in the wild

GitHub Copilot — inference reliability engineer: uses exponential backoff with jitter after short-lived model 503 bursts, but caps retries tightly to protect IDE latency.
OpenAI API client teams — platform SDK maintainer: apply retry budgets around rate limits so one account does not convert temporary throttling into quota exhaustion.
Intercom Fin — runtime engineer: retries citation-generation failures once when parsing breaks, then falls back to a plain-text support answer instead of looping.
Perplexity — search backend engineer: backs off web-fetch retries with jitter because synchronized retries against a slow source site can look like abuse.
Stripe support assistant — workflow engineer: avoids retrying refund-side-effect calls unless an idempotency key proves the action is safe to repeat.

Pause and recall

Why is fixed-delay retry weaker than exponential backoff during shared outages?
What problem does jitter solve that backoff alone cannot solve?
Why should retries be governed by budgets instead of hope?
Name three situations where you should not retry automatically.

Interview Q&A

Q: Why use exponential backoff with jitter instead of a fixed retry interval?
A: Exponential spacing reduces pressure while the dependency recovers, and jitter prevents clients from re-synchronizing into fresh spikes.
Common wrong answer to avoid: "Because exponential backoff guarantees success." It only improves odds and system behavior.

Q: Why should retry policy be tied to end-to-end latency budget rather than only provider error type?
A: A retry that succeeds too late can still violate the product contract, so recovery must respect the workflow deadline.
Common wrong answer to avoid: "Because provider errors are less important than UX." Both matter; budgets combine them.

Q: Why is blind retry dangerous for side-effecting tools?
A: The first attempt may have succeeded before the acknowledgment failed, so repeating the action can duplicate money movement or state change.
Common wrong answer to avoid: "Because tool calls are slower than model calls." Slowness is not the core risk.

Q: Why can a high retry success rate still indicate a system problem?
A: Frequent rescued requests may mean the dependency is unstable and that users are paying hidden latency and cost taxes.
Common wrong answer to avoid: "Because retries should almost never work." Healthy transient recovery should work sometimes, just not become the norm.

Apply now (5 min)

Exercise. Design a retry policy for one model call in your product. Choose retryable errors, max attempts, backoff schedule, jitter rule, and a total latency budget. Then name the fallback after the budget is exhausted.

Sketch from memory. Draw the retry timeline for fixed waits versus exponential waits. Add one note for where jitter spreads the wave, and one note for where the sealed ward should replace the retry dose.

Bridge. Retries help only while the dependency is worth touching. Next we learn when to stop touching it entirely, which is the job of the sealed ward. → 04-circuit-breakers.md