
05. Fallback strategies — send the patient somewhere safer

~14 min read. Reliability improves sharply when the system has a second-best plan that is still genuinely useful.

Built on the ELI5 in 00-eli5.md. The backup ambulance — another route when the main route is unavailable — is how AI products keep serving users after the sealed ward closes.


1) First picture: fallback is a tree, not one switch

Teams often say, "We have a fallback." Usually they mean one weaker model. That is not enough. A real fallback plan has layers.

primary path fails
      ├── model fallback       → smaller or alternate provider
      ├── agent fallback       → simpler workflow with fewer steps
      ├── cached fallback      → previous safe answer or template
      └── human fallback       → queue for review

The simple version: The backup ambulance can go to several places. Which place is best depends on the request. Low-stakes autocomplete? A small local model may be enough. A refund approval? Maybe go to a human. The triage desk chooses the fallback class.
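One way to make "a tree, not one switch" concrete is to store the tree as data: each request kind maps to an ordered list of fallback branches. The sketch below is a minimal illustration, assuming hypothetical request kinds (`autocomplete`, `faq`, `refund_approval`); the four branch names come from the diagram above.

```python
from enum import Enum


class FallbackClass(Enum):
    MODEL = "model"    # smaller or alternate provider
    AGENT = "agent"    # simpler workflow with fewer steps
    CACHED = "cached"  # previous safe answer or template
    HUMAN = "human"    # queue for review


# Hypothetical request kinds; each lists the branches to try, in order.
FALLBACK_TREE: dict[str, list[FallbackClass]] = {
    "autocomplete": [FallbackClass.MODEL, FallbackClass.CACHED],
    "faq": [FallbackClass.CACHED, FallbackClass.MODEL],
    "refund_approval": [FallbackClass.HUMAN],
}


def fallback_plan(request_kind: str) -> list[FallbackClass]:
    """Return the ordered fallback branches for a request kind.

    Unknown kinds go straight to human review instead of guessing.
    """
    return FALLBACK_TREE.get(request_kind, [FallbackClass.HUMAN])


print([c.value for c in fallback_plan("autocomplete")])     # ['model', 'cached']
print([c.value for c in fallback_plan("refund_approval")])  # ['human']
```

Defaulting unknown kinds to a human is a deliberately conservative choice; a team could instead default to a cached template if review capacity is scarce.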

2) Model fallback: same task, weaker or different engine

This is the most common fallback. Primary model fails. Route to another model. Maybe same provider, smaller tier. Maybe different provider entirely.

primary model: premium-long-context
      │ failure
fallback 1: mid-tier general model
      │ failure
fallback 2: local small model or cached template
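A minimal sketch of the chain above, assuming each model client is a plain function that raises an exception on timeout or provider error. The model functions here are hypothetical stand-ins, not a real provider SDK.

```python
from typing import Callable

ModelFn = Callable[[str], str]


def generate_with_fallback(
    prompt: str,
    chain: list[tuple[str, ModelFn]],
    template: str,
) -> tuple[str, str]:
    """Try each model in order and return (path_name, answer).

    Any exception (timeout, rate limit, provider error) moves on to the
    next model. If every model fails, return the static template.
    """
    for name, model in chain:
        try:
            return name, model(prompt)
        except Exception:
            continue  # a real system would log which path failed and why
    return "template", template


# Hypothetical stand-ins for real model clients.
def premium_long_context(prompt: str) -> str:
    raise TimeoutError("primary timed out")


def mid_tier_general(prompt: str) -> str:
    return f"short answer to: {prompt}"


path, answer = generate_with_fallback(
    "How do I rotate an API key?",
    [("premium-long-context", premium_long_context),
     ("mid-tier-general", mid_tier_general)],
    template="We can't answer right now. Please see the docs.",
)
print(path)  # mid-tier-general
```

Returning the path name alongside the answer matters: the product needs to know which branch served the request in order to disclose the quality drop.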

Model fallback works only if you know the quality drop. Do not assume a smaller model can do the same job safely. For example, a documentation assistant normally uses a large reasoning model. When that path times out, the system falls back to a smaller summarization model. Quality changes:

  • answers may be shorter,
  • fewer nuanced caveats,
  • lower citation recall.

If the product says this honestly, and the use case is low risk, that fallback is good. If the use case is medical advice, that fallback may be unacceptable. The practical response: record capability boundaries for each fallback model. The backup ambulance needs destination rules, not only a vehicle.
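"Record capability boundaries" can be as simple as a registry checked before routing. This is a sketch under assumptions: the model name `small-summarizer` and the task labels are hypothetical, and the listed quality drops are the ones from the documentation-assistant example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CapabilityProfile:
    model: str
    safe_tasks: frozenset[str]         # task labels the fallback was evaluated on
    known_drops: tuple[str, ...] = ()  # quality losses to disclose to users


# Hypothetical registry entry for the documentation example above.
PROFILES = {
    "small-summarizer": CapabilityProfile(
        model="small-summarizer",
        safe_tasks=frozenset({"docs_qa", "summary"}),
        known_drops=("shorter answers", "fewer caveats", "lower citation recall"),
    ),
}


def can_serve(model: str, task: str) -> bool:
    """Allow a fallback only for tasks inside its recorded boundary."""
    profile = PROFILES.get(model)
    return profile is not None and task in profile.safe_tasks


def disclosure(model: str) -> str:
    """Build the honest 'reduced mode' notice shown alongside the answer."""
    return "Reduced mode: " + ", ".join(PROFILES[model].known_drops) + "."


print(can_serve("small-summarizer", "docs_qa"))         # True
print(can_serve("small-summarizer", "medical_advice"))  # False
print(disclosure("small-summarizer"))
```

The boundary is an allow-list on purpose: a task nobody evaluated is treated as unsafe by default.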

3) Agent fallback: reduce workflow complexity

Sometimes the model is fine. The workflow is too ambitious. A multi-step agent may fail because tool use is unstable, not because language generation is impossible. Then the fallback should simplify the plan.

full agent
planner → search → tool → verify → finalize
   └── fallback agent
        retrieve FAQ → summarize → show limits

The lesson: the fallback agent does less, and that can make it more reliable.

Worked example. A travel agent normally:

  1. interprets the user request,
  2. fetches bookings,
  3. checks fare rules,
  4. proposes options,
  5. executes a change.

If booking tools are unstable, a fallback agent may do only:

  1. fetch the last booking snapshot,
  2. explain the likely options,
  3. hand off the actual change to support.

The user still gets value. The stability kit is working with the backup ambulance here.
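The travel example can be sketched as two workflows behind one entry point. Everything here is hypothetical: the tool names, the `ToolUnavailable` exception, and the in-memory handoff queue stand in for whatever a real booking system provides.

```python
class ToolUnavailable(Exception):
    """Raised by a hypothetical booking tool when its backend is unstable."""


def full_agent(request: str, tools: dict) -> str:
    # Steps 2-5 of the normal flow (interpretation is assumed done upstream).
    bookings = tools["fetch_bookings"]()
    options = tools["check_fare_rules"](bookings)
    tools["execute_change"](options[0])
    return f"Changed booking to: {options[0]}"


def fallback_agent(request: str, tools: dict, handoff_queue: list) -> str:
    # Reduced flow: read a cached snapshot, explain, hand the change to support.
    snapshot = tools["last_booking_snapshot"]()
    handoff_queue.append({"booking": snapshot["id"], "request": request})
    return (f"Based on your last saved booking ({snapshot['id']}, fare class "
            f"{snapshot['fare_class']}), a change may be possible. "
            "A support agent will confirm and make the change.")


def handle(request: str, tools: dict, handoff_queue: list) -> str:
    try:
        return full_agent(request, tools)
    except ToolUnavailable:
        return fallback_agent(request, tools, handoff_queue)


def broken_fetch():
    raise ToolUnavailable("bookings API timing out")


tools = {
    "fetch_bookings": broken_fetch,
    "check_fare_rules": lambda bookings: ["later flight"],
    "execute_change": lambda option: None,
    "last_booking_snapshot": lambda: {"id": "BK-1", "fare_class": "flex"},
}
queue: list = []
reply = handle("move my flight to Friday", tools, queue)
print(reply)
```

Note that the fallback never calls `execute_change`: the action with real consequences is exactly the step it hands to a human.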

4) Cached and rule-based fallback can beat weak generation

Now, what is the problem with always falling back to another model? Sometimes another model is still too slow, too costly, or its output is still uncertain. A cached answer, template, or rules engine may be safer.

request type
   ├── repeated FAQ question ─────────→ cached answer
   ├── known outage status question ──→ incident template
   ├── account-balance request ───────→ direct tool + template
   └── open-ended research question ──→ model fallback
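The routing diagram translates into a short decision function where the model is the last resort rather than the default. A sketch, assuming hypothetical request-type labels and a tiny in-memory FAQ cache keyed by normalized question text.

```python
# Hypothetical cache of previously approved FAQ answers.
FAQ_CACHE = {
    "how do i reset my password?": "Use Settings > Security > Reset password.",
}


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants hit the cache."""
    return " ".join(text.lower().split())


def choose_path(request_type: str, text: str) -> str:
    """Pick a fallback path; generation is only used when nothing safer fits."""
    if request_type == "faq" and normalize(text) in FAQ_CACHE:
        return "cached_answer"
    if request_type == "outage_status":
        return "incident_template"
    if request_type == "account_balance":
        return "direct_tool_plus_template"
    return "model_fallback"


print(choose_path("faq", "How do I  reset my password?"))  # cached_answer
print(choose_path("research", "compare vector databases"))  # model_fallback
```

An FAQ that misses the cache falls through to model fallback, which keeps the cache an accelerator rather than a hard gate.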

The simple version: A strong fallback is not always more AI. Sometimes it is less AI.

Worked example. A cloud-status assistant is asked, "Is the payment API down right now?" The primary model provider is degraded. The best fallback is not another big model. The best fallback is:

  • query the status source,
  • render a trusted outage template,
  • offer a subscription link for updates.

That is faster and more accurate. The backup ambulance took the patient to the nearest stable ward, not the fanciest hospital.
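The three steps above need no generation at all. A minimal sketch using Python's standard `string.Template`; the status-source function, its field names, and the `status.example.com` URL are hypothetical placeholders for a real status API.

```python
from string import Template
from typing import Callable

# Hypothetical pre-approved wording; a real one would be reviewed by the incident team.
OUTAGE_TEMPLATE = Template(
    "Status for $component: $state (updated $updated). "
    "Subscribe for updates: $subscribe_url"
)


def answer_status_question(component: str,
                           status_source: Callable[[str], dict]) -> str:
    """Answer from trusted status data plus a fixed template; no model call."""
    status = status_source(component)
    return OUTAGE_TEMPLATE.substitute(
        component=component,
        state=status["state"],
        updated=status["updated"],
        subscribe_url=status["subscribe_url"],
    )


def fake_status_source(component: str) -> dict:
    # Stand-in for a real status page API.
    return {"state": "degraded", "updated": "14:05 UTC",
            "subscribe_url": "https://status.example.com/subscribe"}


reply = answer_status_question("payment API", fake_status_source)
print(reply)
```

`Template.substitute` raises `KeyError` if a field is missing, which is the behavior you want here: a broken status record should fail loudly rather than render a half-empty answer.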

5) Fallback selection should use risk and user intent

One fallback tree should not serve every user equally. Risk matters. Intent matters. Deadline matters. Statefulness matters.

fallback selector
┌──────────────────────────────────────┐
│ low risk + simple ask   → smaller AI │
│ low risk + repeated ask → cache      │
│ medium risk + tool down → simpler AI │
│ high risk + uncertainty → human      │
└──────────────────────────────────────┘

For example, consider two requests. Request A. "Summarize this meeting transcript." Request B. "Issue a refund to order ORD-4413."

If the primary agent fails, Request A can drop to a smaller model. Request B should probably escalate. The triage desk decides the backup ambulance path based on consequence, not just failure type.
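The selector box can be read as a function evaluated after the primary path has already failed. This sketch assumes that context, so every call is already "uncertain" and high-risk requests always go to a human; the `Risk` levels and path names are hypothetical labels.

```python
from enum import Enum


class Risk(Enum):
    LOW = 1
    MEDIUM = 2
    HIGH = 3


def select_fallback(risk: Risk, repeated: bool, tool_down: bool) -> str:
    """Choose a fallback path after the primary path has failed."""
    # Check the highest consequence first so a risky request never
    # slips into a cheaper automated path.
    if risk is Risk.HIGH:
        return "human"
    if risk is Risk.MEDIUM:
        # Without a known-safe simpler workflow, escalate.
        return "simpler_agent" if tool_down else "human"
    return "cache" if repeated else "smaller_model"


# Request A: summarize a meeting transcript (low risk, not repeated).
print(select_fallback(Risk.LOW, repeated=False, tool_down=False))  # smaller_model
# Request B: issue a refund (high risk).
print(select_fallback(Risk.HIGH, repeated=False, tool_down=True))  # human
```

Ordering the checks by consequence, not by failure type, is the point of the paragraph above: the refund escalates whatever went wrong.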

6) Fallback quality must be pre-measured

Now a senior rule. A fallback is part of the product, so it must be evaluated before the outage, not during it. Measure at least:

  • success rate,
  • latency,
  • user-visible quality drop,
  • safety risk,
  • cost.

fallback scorecard
┌───────────────────────────────┐
│ path: model-b-small           │
│ success: 94%                  │
│ p95 latency: 2.4 s            │
│ quality delta: -8%            │
│ safe for: summaries, drafts   │
│ not safe for: payment actions │
└───────────────────────────────┘

Without this scorecard, fallback is wishful thinking. With it, the sealed ward can trigger known-safe alternatives automatically.
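Turning the scorecard into an automatic trigger means comparing its pre-measured numbers against an explicit policy. A sketch: the `model-b-small` figures are the ones from the scorecard above, while the thresholds in `POLICY` are hypothetical, and cost is left out for brevity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scorecard:
    path: str
    success_rate: float    # measured before incidents, 0..1
    p95_latency_s: float
    quality_delta: float   # relative to the primary path, e.g. -0.08
    safe_for: frozenset[str]
    not_safe_for: frozenset[str]


def auto_eligible(card: Scorecard, task: str, *, min_success: float,
                  max_p95_s: float, max_quality_drop: float) -> bool:
    """True if this fallback may be switched on automatically for the task."""
    if task in card.not_safe_for or task not in card.safe_for:
        return False
    return (card.success_rate >= min_success
            and card.p95_latency_s <= max_p95_s
            and -card.quality_delta <= max_quality_drop)


model_b_small = Scorecard("model-b-small", 0.94, 2.4, -0.08,
                          frozenset({"summaries", "drafts"}),
                          frozenset({"payment actions"}))

# Hypothetical policy thresholds.
POLICY = dict(min_success=0.90, max_p95_s=3.0, max_quality_drop=0.10)

print(auto_eligible(model_b_small, "summaries", **POLICY))        # True
print(auto_eligible(model_b_small, "payment actions", **POLICY))  # False
```

Keeping the thresholds outside the scorecard separates what was measured from what the team is willing to accept, so a stricter policy can be applied to riskier surfaces without re-running evaluations.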


Where this lives in the wild

  • GitHub Copilot — IDE runtime engineer: falls back from a cloud code-generation model to a cached local completion model for short inline suggestions when the main endpoint is unhealthy.
  • Intercom Fin — support workflow designer: switches from a tool-heavy resolution agent to a simpler answer-only assistant when backend systems are degraded, while sending action requests to humans.
  • Perplexity — answer infrastructure lead: uses cached query results and existing citation sets for repeated trending questions when live search providers are unstable.
  • Klarna assistant — payments reliability engineer: avoids model-to-model fallback for money-moving actions and instead routes failed payment workflows into a support approval queue.
  • Cursor — agent product engineer: drops from multi-file autonomous edit mode to a suggestion-only mode when repository tool calls or verification loops keep failing.

Pause and recall

  • Why is a fallback tree better than one single fallback model?
  • When is agent fallback better than model fallback?
  • Why can a cached or rule-based response be safer than another generative attempt?
  • Why must fallback quality be measured before incidents happen?

Interview Q&A

Q: Why choose a rule-based or cached fallback instead of always routing to a smaller model?
A: For repeated or structured tasks, deterministic fallbacks can be faster, safer, and more accurate under stress. Common wrong answer to avoid: "Because smaller models are always lower quality." Some smaller models are fine; the point is task fit.

Q: Why is fallback selection a risk decision and not just a cost decision?
A: The acceptable quality drop depends on consequence. A summary can degrade; a financial action may require a human. Common wrong answer to avoid: "Because risky requests are rarer." Rarity does not determine severity.

Q: Why can a simpler fallback agent outperform the full agent during incidents?
A: Fewer steps mean fewer failure surfaces, so reduced ambition can raise end-to-end reliability materially. Common wrong answer to avoid: "Because simpler agents always answer faster." Speed helps, but reduced dependency count is the core reason.

Q: Why should fallback paths be pre-evaluated like primary paths?
A: Outages are the worst time to discover that the backup route is unsafe, too weak, or too expensive. Common wrong answer to avoid: "Because fallback testing helps marketing claims." The goal is operational trust, not promotion.


Apply now (5 min)

Exercise. Pick one AI product feature. Write a three-level fallback tree for it: model fallback, non-model fallback, and human fallback. For each path, state what quality is lost and what risk is still acceptable.

Sketch from memory. Draw the backup ambulance tree. Mark where the sealed ward triggers it, and where the stability kit or senior doctor becomes the better destination.


Bridge. Fallback keeps service alive, but sometimes the backup is intentionally smaller. Next we study how to use the stability kit and degrade honestly instead of pretending everything is normal. → 06-graceful-degradation.md