08. Error Handling and Retries: fail loudly, retry selectively

~16 min read. Async services touch many unreliable systems, so error design is part of the product, not an afterthought.

Built on the ELI5 in 00-eli5.md. The order ticket — the request plan — needs clear failure labels so the front desk knows what to retry, surface, or stop.

First picture: not all failures deserve the same response

Look at the tree first. A timeout is not the same as bad input. A 429 is not the same as authentication failure. An upstream 500 is not the same as corrupted local state.

request fails
    │
    ├── client error      ──→ return clear 4xx
    ├── transient upstream ──→ maybe retry
    ├── permanent upstream ──→ fail fast
    └── local bug          ──→ log, alert, fix code

See. Senior systems do not say, "Catch everything and retry." That just multiplies pain. We need an exception hierarchy, retry rules, and safety valves.

Build an exception taxonomy first

Start with categories: ValidationError, AuthenticationError, RateLimitError, UpstreamTimeout, UpstreamUnavailable, BusinessRuleError, and InternalBug.

example map
┌──────────────────────┬──────────────────────────┐
│ error type           │ action                   │
├──────────────────────┼──────────────────────────┤
│ bad request body     │ 422 to client            │
│ bad API key          │ 401 to client            │
│ upstream timeout     │ retry maybe              │
│ upstream 429         │ backoff maybe            │
│ upstream 400         │ fail fast                │
│ coding bug           │ 500 + alert              │
└──────────────────────┴──────────────────────────┘

This matters because retries depend on classification. A malformed prompt payload will not improve on attempt two. A brief network timeout might. Simple, no?
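One way to make the classification explicit is to put it on the exception classes themselves. This is a minimal sketch, not the only layout. The base classes `ServiceError` and `UpstreamError`, the `UpstreamBadRequest` class, the `retryable` attribute, and the `should_retry` helper are hypothetical names introduced for illustration.

```python
class ServiceError(Exception):
    """Base class for every failure the service raises on purpose."""
    retryable = False  # default: fail fast


class ValidationError(ServiceError):
    """Caller sent a bad request body."""


class AuthenticationError(ServiceError):
    """Caller's credentials were rejected."""


class BusinessRuleError(ServiceError):
    """Request was well formed but violates a product rule."""


class InternalBug(ServiceError):
    """Our own code is wrong. Log, alert, fix."""


class UpstreamError(ServiceError):
    """Base class for failures reported by a dependency (hypothetical)."""


class UpstreamTimeout(UpstreamError):
    retryable = True


class RateLimitError(UpstreamError):
    retryable = True


class UpstreamUnavailable(UpstreamError):
    retryable = True


class UpstreamBadRequest(UpstreamError):
    """Upstream answered 400. Attempt two will not go better."""


def should_retry(exc: BaseException) -> bool:
    # Anything outside the taxonomy (KeyError, TypeError, ...) is treated
    # as a local bug and is never retried.
    return isinstance(exc, ServiceError) and exc.retryable
```

With this shape, retry code asks one question, `should_retry(exc)`, instead of listing exception types in many places.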

In FastAPI, you often raise HTTPException for client-visible failures. For internal workflows, custom exceptions keep the logic cleaner. Map them at the edge. Do not leak random stack traces through the front desk.
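"Map them at the edge" can be sketched without any framework. The function below turns an internal exception into a status code and a safe body. In FastAPI the same lookup would typically sit inside a handler registered with `@app.exception_handler(...)`. The names `ERROR_MAP` and `to_response` are hypothetical, and the choice of 504 for an upstream timeout is an assumption, not something the table above prescribes.

```python
class ValidationError(Exception):
    pass


class AuthenticationError(Exception):
    pass


class UpstreamTimeout(Exception):
    pass


# exception type -> (HTTP status, message that is safe to show a user)
ERROR_MAP = {
    ValidationError: (422, "Request body is invalid."),
    AuthenticationError: (401, "API key is missing or invalid."),
    UpstreamTimeout: (504, "Model provider timed out. Please retry."),  # assumed status
}


def to_response(exc: Exception) -> tuple[int, dict]:
    """Translate an internal exception into (status, body) for the client."""
    for exc_type, (status, message) in ERROR_MAP.items():
        if isinstance(exc, exc_type):
            return status, {"error": exc_type.__name__, "message": message}
    # Unknown exceptions are local bugs: generic 500, no internal details.
    return 500, {"error": "InternalError", "message": "Something went wrong on our side."}
```

The fallback branch is the part that keeps stack traces away from the front desk: whatever the exception text says, the client only sees the generic body.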

Worked example: retry with backoff only for transient errors

Picture the flow first.

call upstream
    │
    ├── success ──→ return
    ├── timeout ──→ wait a bit, retry
    ├── 429     ──→ back off, retry carefully
    └── 400     ──→ stop, caller sent bad request

Now a small example.

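import asyncio

# Transient upstream failures, defined here so the example runs on its own.
class UpstreamTimeout(Exception):
    pass

class RateLimitError(Exception):
    pass

async def call_with_retry(send_once, retries: int = 3):
    """Call send_once; `retries` is the total number of attempts allowed."""
    delay = 0.5  # seconds to wait before the second attempt
    for attempt in range(1, retries + 1):
        try:
            return await send_once()
        except (UpstreamTimeout, RateLimitError):
            # Out of attempts: let the caller see the last failure.
            if attempt == retries:
                raise
            await asyncio.sleep(delay)
            delay *= 2  # exponential backoff: 0.5s, 1.0s, 2.0s, ...
        # Any other exception (for example an upstream 400) propagates at once.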

Now trace numbers. Attempt 1 fails with timeout. Sleep 0.5 seconds. Attempt 2 gets 429. Sleep 1.0 second. Attempt 3 succeeds. Total backoff wait is 1.5 seconds, on top of whatever time the two failed attempts themselves consumed. That is acceptable if the endpoint deadline allows it.
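The same arithmetic, as a small sketch. The helper names `backoff_delays` and `worst_case_seconds` are hypothetical, and the ten-second per-attempt timeout is only an illustrative figure.

```python
def backoff_delays(base: float, factor: float, failures: int) -> list[float]:
    """Sleep durations after each of the first `failures` failed attempts."""
    return [base * factor ** i for i in range(failures)]


def worst_case_seconds(attempts: int, per_attempt_timeout: float,
                       base: float = 0.5, factor: float = 2.0) -> float:
    """Upper bound on wall time if every attempt times out.

    There is one sleep between consecutive attempts, so attempts - 1 sleeps.
    """
    sleeps = backoff_delays(base, factor, attempts - 1)
    return attempts * per_attempt_timeout + sum(sleeps)


delays = backoff_delays(0.5, 2.0, 2)   # two failures before the success
total_sleep = sum(delays)              # 0.5 + 1.0
worst = worst_case_seconds(3, 10.0)    # 3 * 10.0 + 1.5
```

The sleeps are the small part. When every attempt runs to its own timeout, the attempts dominate the total, which is why the per-attempt timeout and the overall deadline need to be chosen together.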

But do not retry forever. And do not retry non-idempotent side effects casually. If the first attempt may have already created a billing record, a second attempt could double-charge. That is why retry logic must know operation semantics.
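One common way to make a side-effecting call safe to replay is an idempotency key: the caller generates a key once per logical operation and sends the same key on every retry. The sketch below uses an in-memory stand-in. `BillingStore` and its `charge` method are hypothetical and do not model any real payment API.

```python
import uuid


class BillingStore:
    """In-memory stand-in for a billing backend (hypothetical)."""

    def __init__(self):
        self._by_key = {}

    def charge(self, idempotency_key: str, amount_cents: int) -> dict:
        # A replayed key returns the original record instead of charging again.
        if idempotency_key in self._by_key:
            return self._by_key[idempotency_key]
        record = {"id": str(uuid.uuid4()), "amount_cents": amount_cents}
        self._by_key[idempotency_key] = record
        return record

    def charge_count(self) -> int:
        return len(self._by_key)


store = BillingStore()
key = str(uuid.uuid4())            # generated once, reused on every retry
first = store.charge(key, 500)     # the response to this call is "lost"
second = store.charge(key, 500)    # the retry sends the same key
```

Here `second` is the same record as `first`, and only one charge exists. A retry that generated a fresh key each time would defeat the purpose.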

Circuit breakers stop endless pain loops

Now what is the problem? Suppose the upstream LLM provider is half down. Every request hits it. Every request times out after ten seconds. Your whole API becomes a waiting room.

Circuit breakers help. After enough failures, you stop calling the broken dependency for a while. You fail fast, or degrade gracefully.

closed ──→ failures rise ──→ open
   ▲                          │
   │                          └── reject calls fast
   └──── half-open test ──────┘

See. This protects your own service. A broken oven should not make every line cook stand there touching it again. The front desk may choose fallback behavior instead. For example, serve cached embeddings, use a smaller backup model, or return a structured unavailable message.
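The three states can be sketched in a few dozen lines. This is a minimal, synchronous, single-threaded illustration and not a production breaker. The class names, the `failure_threshold` and `reset_after` parameters, and the injectable `clock` are all assumptions made for the example. It counts every exception as a failure, whereas a real breaker would usually count only upstream faults.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without trying it."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_after=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after      # seconds to stay open
        self._clock = clock                 # injectable so tests need no sleeping
        self._failures = 0
        self._opened_at = None              # None means the breaker is closed

    @property
    def state(self) -> str:
        if self._opened_at is None:
            return "closed"
        if self._clock() - self._opened_at >= self.reset_after:
            return "half-open"              # cooldown over: allow a trial call
        return "open"

    def call(self, func):
        if self.state == "open":
            raise CircuitOpenError("dependency is cooling down")
        try:
            result = func()
        except Exception:
            self._failures += 1
            # A failed half-open trial, or too many failures while closed,
            # (re)opens the breaker and restarts the cooldown.
            if self._opened_at is not None or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()
            raise
        # Success closes the breaker and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

A caller that catches `CircuitOpenError` is the natural place for the fallback behavior described above: a cached result, a backup model, or a structured unavailable message.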

Logging and user messages should differ

Another mature rule. Operator detail and user detail are not the same. Users need a safe, actionable message. Operators need request ids, exception class, upstream name, latency, and retry count.

A clean error response might say, "Model provider timed out. Please retry." Your logs should say, provider=anthropic, operation=messages.create, attempt=2, request_id=abc123, duration_ms=8140.

The order ticket needs labels for both humans. The user at the counter. And the engineer in the kitchen. That is why structured logging matters here.
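A small sketch of the two labels, using only the standard library. The helper names `build_log_record` and `user_error_body` and the logger name are hypothetical. The field values are the ones from the example above.

```python
import json
import logging

logger = logging.getLogger("llm_gateway")  # hypothetical logger name


def build_log_record(*, provider, operation, attempt, request_id, duration_ms, exc):
    """Operator view: everything needed to debug the fault path."""
    return {
        "event": "upstream_failure",
        "provider": provider,
        "operation": operation,
        "attempt": attempt,
        "request_id": request_id,
        "duration_ms": duration_ms,
        "exception": type(exc).__name__,
    }


def user_error_body(request_id):
    """User view: safe, actionable, with an id support can look up."""
    return {
        "message": "Model provider timed out. Please retry.",
        "request_id": request_id,
    }


record = build_log_record(
    provider="anthropic",
    operation="messages.create",
    attempt=2,
    request_id="abc123",
    duration_ms=8140,
    exc=TimeoutError("read timed out"),
)
logger.error(json.dumps(record))   # one JSON object per log line
body = user_error_body("abc123")
```

The only field the two views share is the request id. That shared id is what lets an engineer go from a user's report to the exact log line.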

Practical retry rules for AI services.

Retry on transient network failures. Retry on 429 if the call is safe and budget allows. Retry on 5xx selectively. Do not retry validation errors, auth failures, or deterministic bad prompts. Do not bury repeated failures under giant generic except Exception blocks.

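Also set a total deadline. Three attempts without one can blow the latency budget: if each attempt may wait ten seconds before timing out, three attempts plus backoff can hold a request for over half a minute. A good retry policy respects the full request SLA. The cancel bell still wins. We will see that next.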
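A deadline-aware variant of the retry loop might look like the sketch below. It assumes a single `UpstreamTimeout` exception for transient failures, and the name `call_with_deadline` and its parameters are hypothetical. Each attempt is capped by the time remaining, and the loop gives up early if the next sleep would not fit in the budget.

```python
import asyncio
import time


class UpstreamTimeout(Exception):
    """Transient: the upstream did not answer in time."""


async def call_with_deadline(send_once, *, attempts=3, base_delay=0.5, deadline_s=5.0):
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    start = time.monotonic()
    delay = base_delay
    for attempt in range(1, attempts + 1):
        remaining = deadline_s - (time.monotonic() - start)
        if remaining <= 0:
            raise UpstreamTimeout("deadline exhausted before attempt")
        try:
            # Cap this attempt at whatever budget is left.
            return await asyncio.wait_for(send_once(), timeout=remaining)
        except (UpstreamTimeout, asyncio.TimeoutError) as exc:
            remaining = deadline_s - (time.monotonic() - start)
            # Stop if attempts are used up or the next sleep cannot fit.
            if attempt == attempts or delay >= remaining:
                raise UpstreamTimeout("gave up within deadline") from exc
            await asyncio.sleep(delay)
            delay *= 2
```

A caller would run it as `await call_with_deadline(send_once, deadline_s=5.0)`. The attempt count and the deadline are both limits, and whichever is hit first ends the loop.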

Where this lives in the wild

OpenAI API client gateway — platform engineer: retry policy distinguishes brief network drops from deterministic 400-series request mistakes.
Anthropic-backed enterprise chat — SRE: circuit breakers stop cascading timeout storms when a provider region degrades.
Perplexity retrieval backend — backend engineer: vector search timeouts may deserve limited retries, but malformed filters should fail immediately.
Stripe-style billing around AI usage — payments engineer: idempotency prevents double writes when transient failures trigger safe retries.
Customer support copilot — product engineer: user-facing error text stays clean while structured logs capture provider, attempt count, and trace ids.

Pause and recall

Why is exception taxonomy the first step before writing retry logic?
Which failures should usually not be retried?
What problem does a circuit breaker solve that plain retries do not?
In the analogy, why should the front desk react differently to a bad order and a broken oven?

Interview Q&A

Q: Why not retry every failed upstream call three times by default? A: Because many failures are deterministic or non-idempotent, so blind retries waste latency budget, amplify load, and may duplicate side effects. Common wrong answer to avoid: "More retries always improve reliability."

Q: Why is a circuit breaker valuable even when you already have retries? A: Retries help with brief blips, but a breaker protects your service during sustained dependency failure by failing fast instead of piling up doomed waits. Common wrong answer to avoid: "Circuit breakers are only for hardware systems, not APIs."

Q: Why should user-facing error messages differ from internal logs? A: Users need safe, simple guidance, while operators need detailed diagnostics and correlation fields to debug the real fault path. Common wrong answer to avoid: "Returning the full stack trace is more transparent."

Q: Why do retries require understanding idempotency? A: Because reissuing a side-effecting operation can corrupt state or double-charge unless the operation is explicitly safe to replay. Common wrong answer to avoid: "Idempotency only matters for GET requests."

Apply now (5 min)

Exercise. Pick one upstream dependency. List four possible failure types. For each, mark fail fast, retry, or open breaker.

Sketch from memory. Draw the failure tree with client error, transient upstream, and local bug. Add one note on where the front desk should surface each case.

Bridge. Retries help only inside a deadline. Next we study cancellation and timeouts, where the cancel bell tells every layer when to stop. → 09-cancellation-timeouts.md