04. Fallback chains

Routing decides what to try first. Fallback decides what to try when the first choice fails. Providers fail; models throttle; transient errors compound. The fallback chain is what turns provider failures into operator-controlled degradations instead of caller-visible outages.

A platform engineer at a Bengaluru fintech runs a fire drill: she pages the on-call, declares claude-sonnet-4-6 unavailable in the gateway's routing policy, and watches dashboards. The chatbot for the company's premium banking product, normally backed by the smartest reasoner, transparently degrades to claude-haiku-4-5 for the duration. p95 latency drops slightly; quality on the eval golden set dips three points; users see no error. After ten minutes she restores the routing. The drill takes fifteen minutes total. Six months ago, the same outage — a real one, when the provider had a regional incident — was a six-hour P1 with a war room.

The difference is the fallback chain. This chapter builds it: what counts as a fallback, the order to walk it, when to degrade gracefully, when to refuse cleanly, and how the caller is told.

What "fallback" means here

A fallback is a substitute action the gateway takes when the primary cannot serve the call within the budget. Five kinds, in increasing cost to quality:

| # | Kind | What it returns | When to use |
|---|------|-----------------|-------------|
| 1 | Same model, different region | Identical model behaviour, slightly different latency | Regional incident on the primary |
| 2 | Same model, different provider | Identical model behaviour if available cross-provider; otherwise nearly identical | Provider-wide incident |
| 3 | Smaller / faster model on same provider | Lower-quality but valid response | Primary model overloaded; budget allows |
| 4 | Cached response | The exact or semantically similar prior response | The call is cache-eligible and a recent hit exists |
| 5 | Refusal with explanation | A structured "we cannot serve this" with the reason | All previous options exhausted |

A fallback chain is an ordered list drawn from these kinds. The routing plane (chapter 03) produces it; the fallback executor walks it.

A worked chain

For an interactive smart-reasoner call with a 5000 ms latency budget:

primary:    anthropic:claude-sonnet-4-6:ap-south-1     (700ms p95)
fallback 1: anthropic:claude-sonnet-4-6:us-east-1      (1200ms p95)  -- region failover
fallback 2: openai:gpt-4o:eu-west-1                    (900ms p95)   -- provider failover
fallback 3: anthropic:claude-haiku-4-5:ap-south-1      (350ms p95)   -- model degrade
fallback 4: cache lookup (semantic, 90-day window)                    -- cached degraded answer
fallback 5: refuse with code=MODEL_UNAVAILABLE_TRY_LATER              -- terminal

The chain is evaluated in order. Each step has its own latency expectation, factored against the remaining budget. If a step's worst-case latency would breach the budget, the gateway skips to the next viable step or refuses.
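The evaluation loop can be sketched in a few lines of Python. This is a minimal illustration, not the gateway's implementation: `Step`, `walk_chain`, and the per-step `attempt` callables are hypothetical names, and the worst-case figures are assumed inputs rather than measurements.

```python
import time
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Step:
    name: str
    worst_case_ms: float                   # assumed worst-case latency for this step
    attempt: Callable[[], Optional[str]]   # returns a response, or None on failure


def walk_chain(steps, budget_ms, clock=time.monotonic):
    """Try steps in order; skip any whose worst case does not fit the remaining budget."""
    start = clock()
    for index, step in enumerate(steps):
        remaining_ms = budget_ms - (clock() - start) * 1000.0
        if step.worst_case_ms >= remaining_ms:
            continue  # does not fit; a later, cheaper step may still be viable
        response = step.attempt()
        if response is not None:
            return {"ok": True, "response": response, "fallback_step": index}
    # Nothing served the call: refuse with the structured code.
    return {"ok": False, "code": "MODEL_UNAVAILABLE_TRY_LATER"}


# Illustrative chain: the primary fails, the second step cannot fit, the third serves.
demo_chain = [
    Step("primary", 700, lambda: None),
    Step("too-slow", 9000, lambda: "never attempted"),
    Step("smaller-model", 350, lambda: "ok"),
]
demo_result = walk_chain(demo_chain, budget_ms=5000)
```

Here `demo_result` reports `fallback_step` 2, counting the primary as step 0. Passing the clock as a parameter is a design choice that makes the budget logic testable without real waiting.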

The chain composition is policy, not hard-coded logic. It is read from the routing policy file (chapter 03's structure extends here):

aliases:
  smart-reasoner:
    candidates:
      - id: "anthropic:claude-sonnet-4-6:ap-south-1"
        weight: 100
      - id: "anthropic:claude-sonnet-4-6:us-east-1"
        weight: 0
        role: "fallback"
      - id: "openai:gpt-4o:eu-west-1"
        weight: 0
        role: "fallback"
      - id: "anthropic:claude-haiku-4-5:ap-south-1"
        weight: 0
        role: "degrade"
    fallback_policy:
      allow_degrade: true
      allow_cache_serve: true
      refusal_code: MODEL_UNAVAILABLE_TRY_LATER

A new operator looking at one alias should be able to read its fallback policy in twenty seconds.
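As a rough sketch of how an executor might turn that policy into an ordered chain, the Python below mirrors the YAML as a plain dict. The function name `build_chain`, the `"cache"` and `"refuse:"` markers, and the reading of `weight > 0` as "primary" are assumptions made for illustration; the source policy does not spell out these semantics.

```python
def build_chain(alias: dict) -> list:
    """Order an alias's candidates into a chain, honouring its fallback_policy."""
    policy = alias["fallback_policy"]
    candidates = alias["candidates"]
    # Assumption: weight > 0 marks a primary; "role" marks fallback or degrade steps.
    chain = [c["id"] for c in candidates if c.get("weight", 0) > 0]
    chain += [c["id"] for c in candidates if c.get("role") == "fallback"]
    if policy.get("allow_degrade", False):
        chain += [c["id"] for c in candidates if c.get("role") == "degrade"]
    if policy.get("allow_cache_serve", False):
        chain.append("cache")                      # hypothetical marker for the cache step
    chain.append("refuse:" + policy["refusal_code"])  # terminal step
    return chain


smart_reasoner = {
    "candidates": [
        {"id": "anthropic:claude-sonnet-4-6:ap-south-1", "weight": 100},
        {"id": "anthropic:claude-sonnet-4-6:us-east-1", "weight": 0, "role": "fallback"},
        {"id": "openai:gpt-4o:eu-west-1", "weight": 0, "role": "fallback"},
        {"id": "anthropic:claude-haiku-4-5:ap-south-1", "weight": 0, "role": "degrade"},
    ],
    "fallback_policy": {
        "allow_degrade": True,
        "allow_cache_serve": True,
        "refusal_code": "MODEL_UNAVAILABLE_TRY_LATER",
    },
}
chain = build_chain(smart_reasoner)
```

With `allow_degrade` set to false, the same function drops the haiku candidate and the chain goes straight from provider failover to cache and refusal.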

When to walk the chain

The fallback executor walks the chain on:

| Trigger | Action |
|---------|--------|
| Transient error (HTTP 5xx, network failure, timeout with budget remaining) | Brief retry on same candidate (1–2 tries); if persistent, advance |
| Rate-limit error (HTTP 429) from provider | Advance immediately; do not retry the same candidate |
| Hard error on this candidate (4xx that is not retriable) | Advance |
| Step overruns its latency allowance mid-call | Cancel; advance only if the remaining budget covers the next candidate's worst case |
| No candidate currently meets routing filters | Refuse: the chain has not started |

Choosing wrongly between "advance" and "retry the same candidate" is the most common mistake. A 429 means the provider is telling you to wait; retrying immediately wastes budget and burns goodwill. Advance.
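The retry-or-advance rule fits in one small function. The sketch below is illustrative: `next_action` and `MAX_SAME_CANDIDATE_RETRIES` are hypothetical names, and the cap of two retries is one reading of the "1–2 tries" guidance above.

```python
from typing import Optional

MAX_SAME_CANDIDATE_RETRIES = 2  # assumed cap, taken from the "1-2 tries" guidance


def next_action(status: Optional[int], retries_so_far: int) -> str:
    """Return 'retry' or 'advance'. status=None stands for a network failure or timeout."""
    if status == 429:
        return "advance"  # throttled: never retry the same candidate
    transient = status is None or 500 <= status <= 599
    if transient and retries_so_far < MAX_SAME_CANDIDATE_RETRIES:
        return "retry"
    return "advance"      # hard 4xx, or a transient error that persisted
```

Keeping the 429 check first makes the rule hard to break by accident when more status codes are added later.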

Budget math

The latency budget bounds the chain. Track the time remaining:

budget = 5000 ms
elapsed at primary failure: 1100 ms
remaining: 3900 ms

next candidate worst-case (p99): 1500 ms
proceed? yes (1500 < 3900)
... it also fails after 1500 ms
remaining: 2400 ms

next candidate worst-case: 1000 ms (smaller model)
proceed? yes
... succeeds in 320 ms
return.

The gateway tracks elapsed time per call and compares to each step's worst-case before attempting. The point is to refuse cleanly with budget left over rather than slip past the budget hoping the next step is fast.
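The same arithmetic, written out as a short Python sketch. The helper names `remaining_after` and `fits` are hypothetical; the numbers are the ones from the walkthrough above.

```python
from typing import List


def remaining_after(budget_ms: int, step_durations_ms: List[int]) -> List[int]:
    """Remaining budget after each completed step, in order."""
    remaining = []
    left = budget_ms
    for spent in step_durations_ms:
        left -= spent
        remaining.append(left)
    return remaining


def fits(worst_case_ms: int, remaining_ms: int) -> bool:
    """A step is attempted only if its worst case is strictly inside the remaining budget."""
    return worst_case_ms < remaining_ms


# Primary fails at 1100 ms, the next candidate fails after 1500 ms, the third succeeds in 320 ms.
trace = remaining_after(5000, [1100, 1500, 320])  # [3900, 2400, 2080]
```

The check `fits(1500, 3900)` passes before the second attempt and `fits(1000, 2400)` passes before the third, matching the "proceed? yes" lines in the walkthrough.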

For non-interactive workloads (batch, background), the budget is large and the chain rarely runs out. For interactive workloads, the chain is short by necessity — two or three steps with tight worst-cases.

Degrade vs refuse

The most consequential choice in the chain design is whether to degrade or refuse when the chain has exhausted same-quality candidates.

Degrade — return a worse-but-valid response. The caller may or may not be told that degradation happened. The product decides whether to surface it to the user.

Refuse — return a structured error indicating the call cannot be served. The caller surfaces "service unavailable; please try again" or similar.

The decision hinges on two questions:

Is a lower-quality response useful to the user? For a summary, often yes: a less articulate summary still summarises. For a code-generation tool, often yes: a simpler model still writes valid code, just less sophisticated code. For a tool-calling agent that needs the smarter model to plan, often no: the smaller model may hallucinate tool calls that the agent then executes.
Will the user notice the degrade? If quality is materially worse, surface it. If the difference is imperceptible, do not bother.

Per-alias policy answers both:

smart-reasoner:
  fallback_policy:
    allow_degrade: true
    degrade_notification: "soft"   # set a flag on response; caller decides

tool-using-agent:
  fallback_policy:
    allow_degrade: false           # do not silently use a less capable planner
    refusal_code: REASONER_UNAVAILABLE

The gateway is the executor; whether to degrade is product policy expressed in the routing config.
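A minimal sketch of how an executor might apply that policy once every same-quality candidate has failed. The function name `when_same_quality_exhausted` and the returned dict shape are hypothetical, and treating a missing `allow_degrade` as false is an assumption chosen as the safer default.

```python
def when_same_quality_exhausted(fallback_policy: dict) -> dict:
    """Decide the next move once every same-quality candidate has failed."""
    # Assumption: a missing allow_degrade is treated as False, the safer default.
    if fallback_policy.get("allow_degrade", False):
        return {
            "action": "degrade",
            "notification": fallback_policy.get("degrade_notification", "none"),
        }
    return {"action": "refuse", "code": fallback_policy["refusal_code"]}


policies = {
    "smart-reasoner": {"allow_degrade": True, "degrade_notification": "soft"},
    "tool-using-agent": {"allow_degrade": False, "refusal_code": "REASONER_UNAVAILABLE"},
}
decisions = {alias: when_same_quality_exhausted(p) for alias, p in policies.items()}
```

The executor holds no opinion of its own here: changing the outcome for an alias means editing the policy, not the code.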

Telling the caller what happened

The unified response shape from chapter 02 carries provenance per call:

response:
  model_used:
    provider: "openai"               # not the primary
    model_version: "gpt-4o"
    region: "eu-west-1"
  cache_status: "miss"
  degraded: false                    # true only when a degrade step served the call
  fallback_step: 2                   # which step of the chain succeeded
  primary_failure_reason: "HTTP_529_OVERLOADED"

Callers can branch on degraded. Most product UIs ignore it; some surface a "responses may be less detailed than usual" banner. The audit log captures everything for incident reconstruction.
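On the caller side, branching on provenance might look like the sketch below. The helper names and the banner copy are hypothetical, and counting the primary as step 0 is an assumption about how `fallback_step` is numbered.

```python
from typing import Optional


def served_by_fallback(response: dict) -> bool:
    """True when any step other than the primary (assumed to be step 0) served the call."""
    return response.get("fallback_step", 0) > 0


def banner_for(response: dict) -> Optional[str]:
    """Return banner text for degraded responses; None means show nothing."""
    if response.get("degraded", False):
        return "Running in reduced mode."  # hypothetical UI copy
    return None


# Illustrative response: the degrade candidate from the worked chain served the call.
sample = {
    "model_used": {
        "provider": "anthropic",
        "model_version": "claude-haiku-4-5",
        "region": "ap-south-1",
    },
    "cache_status": "miss",
    "degraded": True,
    "fallback_step": 3,
    "primary_failure_reason": "HTTP_529_OVERLOADED",
}
```

Both helpers use `.get` with defaults so that a caller written against an older response shape treats missing provenance as "served by the primary, not degraded".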

Refusal shape

When the chain refuses, the response is a structured error consistent with the contract patterns in module 19 chapter 05:

{
  "ok": false,
  "error": {
    "code": "MODEL_UNAVAILABLE_TRY_LATER",
    "retriable": true,
    "retry_after_ms": 30000,
    "human_hint": "The AI service is temporarily unavailable. Please try again in a moment.",
    "model_action": "Surface the message to the user; do not retry within 30s.",
    "fields": {
      "chain_attempted": 5,
      "last_error_per_step": ["HTTP_529", "HTTP_529", "TIMEOUT", "CACHE_MISS", "EXHAUSTED"]
    }
  }
}

The error is structured so callers can branch. retry_after_ms tells the caller (or its retry layer) when to come back. The audit log captures the full chain attempt for postmortems.
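A caller's retry layer could read the refusal like this. The sketch assumes the body arrives as a JSON string in the shape shown above; `retry_delay_ms` is a hypothetical helper name.

```python
import json
from typing import Optional


def retry_delay_ms(raw_body: str) -> Optional[int]:
    """Milliseconds to wait before retrying, or None if no retry should happen."""
    body = json.loads(raw_body)
    if body.get("ok"):
        return None  # success: nothing to retry
    error = body["error"]
    if not error.get("retriable", False):
        return None  # non-retriable: surface the error instead
    return int(error.get("retry_after_ms", 0))


refusal = json.dumps({
    "ok": False,
    "error": {
        "code": "MODEL_UNAVAILABLE_TRY_LATER",
        "retriable": True,
        "retry_after_ms": 30000,
    },
})
delay = retry_delay_ms(refusal)  # 30000
```

Branching on `retriable` before reading `retry_after_ms` keeps the caller from scheduling retries for errors that will never succeed.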

Chain composition patterns by workload

A few common compositions, as a starting menu.

Interactive chat (budget ~3000 ms, smart model needed)

primary:    smart-reasoner in-region
fallback 1: smart-reasoner cross-region
fallback 2: cache (semantic, last 24h)
fallback 3: refusal

No degrade — quality drop visible.

Interactive summary (budget ~2000 ms, quality tolerant)

primary:    fast-summariser in-region
fallback 1: fast-summariser cross-region
fallback 2: smaller fast-summariser
fallback 3: cache (exact or semantic)
fallback 4: refusal

Degrade allowed.

Batch summarisation (budget 60s)

primary:    smart-reasoner in-region
fallback 1: smart-reasoner cross-region
fallback 2: smart-reasoner alternate provider
fallback 3: pause (queue, retry in 5 min)
fallback 4: skip item; flag for human

Pause is a valid fallback for batch.

Embeddings (budget 1000 ms)

primary:    embeddings-v3 in-region
fallback 1: embeddings-v3 cross-region
fallback 2: embeddings-v2 (older; vectors not interchangeable with v3)
fallback 3: refusal

No silent degrade across embedding versions — vector spaces are not interchangeable; surface the version used.

Agent tool-calling (budget 10000 ms)

primary:    smart-reasoner with tool-calling, in-region
fallback 1: smart-reasoner cross-region
fallback 2: refusal — do NOT degrade to a weaker planner

The patterns are not exhaustive; they are starting points.
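Two properties recur across these compositions: every chain ends in a terminal step, and aliases that forbid degrade contain no degrade step. A small lint check can enforce both. The sketch below is one possible shape; the step-kind labels and the name `lint_chain` are hypothetical, not part of the policy format.

```python
from typing import List

TERMINAL_KINDS = {"refusal", "skip"}  # assumed labels for steps that end the chain


def lint_chain(kinds: List[str], allow_degrade: bool) -> List[str]:
    """Return a list of problems found in a chain described by its step kinds."""
    problems = []
    if not kinds or kinds[-1] not in TERMINAL_KINDS:
        problems.append("chain does not end in a terminal step")
    if not allow_degrade and "degrade" in kinds:
        problems.append("degrade step present but allow_degrade is false")
    return problems


agent_chain = ["primary", "region", "refusal"]
summary_chain = ["primary", "region", "degrade", "cache", "refusal"]
```

Running a check like this when the policy file is loaded catches a misconfigured alias before an outage does.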

What the chain does not solve

Slow providers, not failed providers. If the primary is slow but eventually returns, the chain may not trigger; the call is just slow. Latency budget caps this — if the budget is exceeded, the call cancels.
Quality drift, not availability. If the primary still returns 200 OK but gives bad answers, the chain does not catch it. That is an eval and drift concern (module 04_ai_product_evals and this module's chapter 09).
Cost runaway. A chain that walks through expensive models can cost more than a single failed primary. Cost is bounded by the routing's cost ceiling; the chain respects the ceiling and refuses if all remaining candidates exceed it.
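The cost check in the last item can be sketched as a filter over the remaining candidates. The function name `within_ceiling`, the candidate names, and the cost figures are all illustrative assumptions, not real prices.

```python
def within_ceiling(candidates, ceiling):
    """Keep candidates whose estimated cost fits the per-call ceiling, in chain order."""
    return [name for name, estimated_cost in candidates if estimated_cost <= ceiling]


# (name, estimated cost per call) pairs; the figures are made up for illustration.
remaining_candidates = [
    ("smart-reasoner-alt-provider", 0.040),
    ("smaller-model", 0.004),
]
affordable = within_ceiling(remaining_candidates, ceiling=0.010)
```

An empty result means every remaining candidate exceeds the ceiling, which is the case where the chain refuses.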

How fallback interacts with other surfaces

Routing (chapter 03) produces the chain.
Quota (chapter 05) — a fallback candidate at quota gets skipped, not retried.
Cost (chapter 07) — each step's estimated cost is checked against the call's ceiling before attempting.
Cache (chapter 08) — cache is a chain step, not a separate path.
Audit (chapter 11) — the chain attempt is fully audited, including each step's outcome.

How to recognise broken fallback in the wild

Provider outage = product outage (no chain)
The chain exists but a single transient error blows the budget (no per-step budget check)
Degrade is silent and quality drops invisibly (no provenance in response)
The chain retries the same throttled candidate multiple times (no 429-advance logic)
Refusal is an HTTP 500, not a structured error (no error contract on the chain)

Interview Q&A

Q1. The primary provider returns HTTP 429. The chain has three more candidates. What does the executor do?

Advance immediately to the next candidate, not retry the primary. 429 means "you are throttled; wait." Retrying on the same candidate burns budget and provokes longer throttles. Mark the primary as quota-pressured in observability so the routing scorer (chapter 03) deprioritises it for the next several seconds.

Wrong-answer notes: "retry with backoff" is the SDK-level instinct; at the chain level, advance.

Q2. Should the gateway always degrade rather than refuse, since "any answer is better than no answer"?

No. Degrade is appropriate when a lower-quality response is useful to the user; refuse is appropriate when the lower-quality response would be misleading or unsafe. A tool-calling agent that needs the smart reasoner to plan should refuse rather than ask the small model to plan, because the small model may invent invalid tool calls the agent then executes. The decision is per alias, in the policy.

Wrong-answer notes: "always degrade" is a slogan that produces wrong actions in production.

Q3. How does the chain handle a 5-second latency budget when the primary takes 4.8 seconds and then fails?

The remaining budget is 200 ms. The next candidate's worst-case p99 is checked; if it exceeds 200 ms, the gateway does not attempt and refuses with MODEL_UNAVAILABLE_TRY_LATER. If a candidate fits (e.g., a cache lookup under 50 ms), the gateway tries it. Refusing cleanly with budget left over is better than slipping past the budget.

Wrong-answer notes: "try anyway" produces calls that breach SLO.

Q4. The product wants to know when its calls were served by a fallback. How does the gateway tell it?

The response carries provenance: model_used (the actual provider/model/region), degraded (boolean), fallback_step (which step of the chain succeeded), primary_failure_reason. Products that care can branch on these; products that do not can ignore them. The audit log captures everything regardless.

Wrong-answer notes: "the gateway is transparent so the product never knows" is wrong; transparency means the product can know, not that the gateway hides.

What to do differently after reading this

For each alias, define the chain explicitly in the routing policy.
Per chain step, record a worst-case latency; use it in budget math.
Decide degrade-vs-refuse per alias; record it in policy, not in code.
Surface provenance fields on every response. Products can choose to act on them; the audit always captures them.
Run a fire drill: declare the primary unavailable and verify the chain behaves as designed. Do this quarterly.

Bridge. Fallback decides where to send a call when something fails. Quotas decide whether the call gets sent at all. A shared provider has a finite throughput; the gateway distributes it across the company. The next chapter builds rate limits and quotas — per-tenant, per-feature, per-provider — and the math behind them. → 05-rate-limit-and-quota.md