
07. Dual-sourcing and fallback chains — keeping a second supplier warm

~18 min read. Every model supplier has an outage eventually. The team with a second supplier already evaluated and ready takes the outage in stride; the team without one watches their product go dark. The matching habit on this page is keeping a second supplier warm without paying twice for production traffic. Hot-hot, hot-warm, hot-cold — three patterns, three cost profiles, three recovery times.

Builds on 06-switching-cost-anatomy.md. Migration costs are real even in good times. During an incident they become impossible — you cannot pay seven switching-cost taxes at 3 AM. The fix is to pay them in advance and keep the second supplier ready.


1) Hook — the six-hour outage one team barely noticed

A finance team runs an AI compliance-review pipeline on the OpenAI API. Eight million tickets a month. Strict-mode JSON outputs feed downstream risk-scoring logic. Median latency 2.4 seconds. Bill around $40,000 a month. It is a healthy production system.

On a Tuesday afternoon, OpenAI experiences a six-hour partial outage in the US-East region. Error rates climb from 0.1% to 18% within ten minutes. Latency p99 climbs from 8 seconds to 45 seconds. Several other teams in the same office watch their dashboards turn red, scramble to get vendor support on the line, and ultimately tell their users that "we are experiencing degraded performance from an upstream provider."

This team does not. Their on-call dashboard flashes a fallback notice. Traffic shifts from the OpenAI primary to a Sonnet 4.6 secondary on Anthropic over a ninety-second ramp. The Anthropic-bound traffic uses prompts that were rewritten for Anthropic six months ago, validated weekly with shadow eval, and kept warm with 5% of production traffic continuously. Schema-adherence rates and quality metrics stay above the team's defined floor. The secondary supplier serves 95% of traffic during the six-hour outage. The 5% that fails is the small slice of workloads that depended on OpenAI-specific features the secondary cannot replicate.

                BEFORE (single supplier):
                ─────────────────────────
                              outage
                Time          starts                  outage ends
                  │             │                          │
                  ▼             ▼                          ▼
                ─────────────████████████████████████─────────
                error rate:    0.1% → 18% → recovery in 6 hrs
                customer impact: 6 hrs of broken compliance reviews

                AFTER (hot-warm dual-sourcing):
                ───────────────────────────────
                              outage
                Time          starts        failover    outage ends
                  │             │              │              │
                  ▼             ▼              ▼              ▼
                ─────────────███▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒───────
                error rate:    0.1% → 18% → 2% (failover) → 0.1%
                customer impact: 90s of degradation, then normal

This chapter is how that team built that capability. The answer is not "build redundant infrastructure": that costs too much and is the wrong shape for AI workloads. The answer is hot-warm dual-sourcing: the primary serves, the secondary stays continuously evaluated, and both are ready to swap.


2) The metaphor — the second supplier already in the building

A restaurant that depends on one fish supplier has a problem the day that supplier's truck does not show up. The kitchen runs out of fish by lunch. The menu cuts in half. Customers leave.

A smart restaurant manager keeps a second supplier on retainer. Not at half the volume — half is too expensive, and the secondary's prices during normal times do not compete with the primary's negotiated rate. At maybe 5% of regular orders, enough that the supplier knows the restaurant, the kitchen knows their cuts, the staff has run their fish through the prep workflow recently. When the primary truck fails to arrive, the manager calls the second supplier, doubles the regular order, and lunch happens. The kitchen is not surprised because they have been cooking with the secondary's fish all along, just at small volume.

That is hot-warm dual-sourcing. Primary serves the bulk. Secondary stays continuously evaluated. The matching habit on this page is deciding which version of "warm" the workload deserves, and budgeting for it correctly.


3) The three patterns — hot-hot, hot-warm, hot-cold

┌──────────────────────────────────────────────────────────────────────┐
│                    DUAL-SOURCING PATTERNS                            │
├──────────────────────────────────────────────────────────────────────┤
│ HOT-HOT                                                              │
│ ────────                                                              │
│ Both suppliers serve production traffic in parallel.                 │
│ Outputs scored and the better-scoring one wins per request.          │
│ Failover time: instantaneous. Recovery: trivial.                     │
│ Cost: ~2x. Quality: highest. Most expensive pattern.                 │
├──────────────────────────────────────────────────────────────────────┤
│ HOT-WARM                                                             │
│ ─────────                                                            │
│ Primary serves 95%+ of traffic. Secondary serves a continuous        │
│ canary fraction (1-5%) for eval and freshness.                       │
│ Failover time: 30-120 seconds. Recovery: fast.                       │
│ Cost: primary + 5%. Quality: matches primary on normal days.        │
├──────────────────────────────────────────────────────────────────────┤
│ HOT-COLD                                                             │
│ ─────────                                                            │
│ Primary serves all traffic. Secondary is a documented config         │
│ change, no live traffic, no continuous eval.                         │
│ Failover time: hours to days. Recovery: depends on team's            │
│ readiness during the incident.                                       │
│ Cost: primary only. Quality: unknown on secondary until needed.      │
└──────────────────────────────────────────────────────────────────────┘

Hot-hot is the right pattern when quality on the hard cases is paramount and the workload tolerates 2x model cost. Most teams pick hot-warm. Hot-cold is acceptable for low-stakes workloads where a few hours of degraded service is survivable. The choice is a function of the business cost of an outage.


4) Hot-hot — both suppliers run, score-and-pick

In hot-hot, every request goes to both suppliers in parallel. The outputs are scored — by a fast judge model, by schema validation, by heuristics, or by all three — and the higher-scoring output wins.

                INCOMING REQUEST
              ┌────────┴────────┐
              ▼                 ▼
         PRIMARY            SECONDARY
         (Sonnet 4.6)       (GPT-4o)
              │                 │
              ▼                 ▼
            score             score
              │                 │
              └────────┬────────┘
                  PICK HIGHER
                  RETURN TO USER

Hot-hot pros — instant failover (if one supplier fails, the other's output is already in hand), no degraded period during incidents, quality often higher than either supplier alone on hard cases because the score-and-pick step catches each supplier's individual weak spots.

Hot-hot cons — model cost is ~2x. Latency is bounded by the slower of the two suppliers (unless you cap and take the faster one). The scoring infrastructure (judge model or heuristic) adds cost and complexity. The hot-hot pattern is hard to justify for routine workloads but earns its keep on high-stakes outputs — legal contract analysis, medical content review, financial advice — where quality matters more than cost.

Hot-hot is also occasionally used as a training pattern. The judge's preference signal becomes data for fine-tuning, RLHF, or routing-rule improvements. The 2x cost buys both reliability and quality data.
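
The score-and-pick step can be sketched in a few lines. This is a minimal illustration, not a production router: `Supplier` and `Scorer` are hypothetical callables standing in for the two vendor clients and for whatever judge, schema check, or heuristic the team uses, and the 30-second timeout is an assumed default.

```python
import asyncio
from typing import Awaitable, Callable, Optional

# Hypothetical supplier call: takes a prompt, returns the raw model output.
Supplier = Callable[[str], Awaitable[str]]
# Hypothetical scorer: higher is better (judge model, schema check, heuristic).
Scorer = Callable[[str], float]


async def hot_hot(prompt: str, primary: Supplier, secondary: Supplier,
                  score: Scorer, timeout_s: float = 30.0) -> str:
    """Send the prompt to both suppliers and return the higher-scoring output."""

    async def safe_call(supplier: Supplier) -> Optional[str]:
        try:
            return await asyncio.wait_for(supplier(prompt), timeout=timeout_s)
        except Exception:
            # A failed or timed-out supplier drops out of the comparison.
            return None

    # Both calls run concurrently, so latency is bounded by the slower one
    # (or by timeout_s, whichever comes first).
    results = await asyncio.gather(safe_call(primary), safe_call(secondary))
    candidates = [r for r in results if r is not None]
    if not candidates:
        raise RuntimeError("both suppliers failed")
    # max() keeps the first of equal scores, so ties go to the primary.
    return max(candidates, key=score)
```

Because a failed supplier simply drops out of the comparison, the same code path covers both normal operation and an outage on either side.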


5) Hot-warm — primary serves, secondary kept evaluated

In hot-warm, the primary supplier serves the bulk of traffic. A small continuous fraction (1-5%) goes to the secondary as canary traffic, which gets scored against the same eval rubric the team uses for the primary. The result — when the primary fails, the secondary is already proven on this week's workload distribution, with this week's prompts, with this week's edge cases.

NORMAL OPERATION                       FAILOVER
────────────────                       ────────

       95%                              5% canary becomes 100%
        │                                          │
        ▼                                          ▼
   ┌─────────┐                              ┌─────────┐
   │ PRIMARY │                              │ PRIMARY │  (down)
   │ Sonnet  │                              │ Sonnet  │
   └─────────┘                              └─────────┘
                                                 X
        5% canary                                X
        │                                        X
        ▼                                        ▼
   ┌──────────┐                            ┌──────────┐
   │SECONDARY │                            │SECONDARY │
   │ GPT-4o   │                            │ GPT-4o   │  (now 100%)
   └──────────┘                            └──────────┘

Hot-warm pros — cost is primary's bill plus roughly 5% of the secondary's bill. Failover is fast because the secondary is already warm — prompts validated, schema adherence verified, latency characterized. Quality on the secondary is known, not guessed.

Hot-warm cons — the canary fraction is real money (5% of a $40,000/month primary is $2,000/month, more if the secondary is more expensive per token). The scoring infrastructure to compare primary and canary quality is real engineering. Maintaining two prompts in sync (or two prompt variants for the two suppliers) is real maintenance.

Hot-warm is the right pattern for most production AI systems. The 5% cost is cheap insurance and the failover is fast enough to keep the product healthy through a 4-hour outage with minimal user impact.
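
One way to carve out the canary slice is to hash a request identifier into a bucket, so the split is stable and needs no shared state. The sketch below assumes each request carries a string id; the function name `route` and the 5% default are illustrative, not taken from any particular gateway.

```python
import hashlib


def route(request_id: str, canary_fraction: float = 0.05) -> str:
    """Return "secondary" for a stable canary_fraction of ids, else "primary"."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return "secondary" if bucket < canary_fraction else "primary"
```

Hash-based routing sends the same request id to the same supplier on every retry, which keeps canary scores comparable from week to week. During a failover the gateway can raise `canary_fraction` toward 1.0 without changing the routing code.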


6) Hot-cold — secondary is a paper plan only

In hot-cold, the secondary exists only as documentation. Prompts have been rewritten and shadow-tested at some point in the past. The config change is documented. But no live traffic flows to the secondary day to day.

When the primary fails, the team executes the documented plan — flip config, deploy, monitor, fix bugs as they emerge. Recovery times are measured in hours or days, not minutes, because the documentation has drifted from production reality.

Hot-cold pros — zero ongoing cost. Engineering work happens upfront and then waits.

Hot-cold cons — when the secondary is needed, it has not been validated against current prompts, current workload distribution, current schema expectations, or current downstream consumers. The first hour of an outage is spent rediscovering everything the team learned six months ago. Worse, vendors evolve — Anthropic releases a new model, deprecates an old one, changes a feature behavior — and the cold secondary may not even be a working configuration anymore by the time it is needed.

Hot-cold is acceptable for low-stakes internal workloads where a few hours of degraded service is survivable. It is not acceptable for any customer-facing production system.


Mid-content recall

  1. State the three dual-sourcing patterns. What is the cost and failover-time profile of each?
  2. Why is hot-warm cheaper than hot-hot, and what does the "warm" part actually buy you that hot-cold does not?
  3. When does the 2x cost of hot-hot earn its keep?

7) Health checks and failover triggers — when to flip

A dual-sourced system needs a clear rule for when traffic moves to the secondary. Four signal categories drive most production triggers.

Error rate. When the primary's error rate (5xx responses, timeouts, malformed outputs) exceeds a threshold (typically 5% over a one-minute window), flip. The threshold is workload-specific — high-volume workloads can tolerate 2% sustained errors briefly; low-volume workloads may need to flip at 10% because the absolute count is too small to be statistically reliable below that.

Latency. When p99 latency exceeds a threshold (typically 3x baseline), flip. A 10-second p99 ballooning to 60 seconds is not just slow; it is breaking downstream timeouts and queueing customer requests indefinitely. Sometimes latency degradation matters more than error rate: a "working but unusable" supplier is worse than a clearly failing one.

Evaluation score. This is the dual-source-specific signal that hot-cold cannot use. Primary and canary outputs are scored continuously against the same eval rubric. When the primary's quality drops below threshold (vendor model regression, prompt drift, workload shift), flip, even if error rates and latency look fine. A supplier producing fast, valid, but lower-quality outputs is a silent regression that error-rate and latency metrics will not catch.

Vendor incident webhook. Both OpenAI and Anthropic publish status pages and incident webhooks. Wire these into the failover decision — when the vendor declares an incident affecting your tier or region, preemptively flip without waiting for your own error rates to climb. This is the cheapest source of signal — the vendor has more information than you do about their own systems.

FAILOVER DECISION TREE
──────────────────────
vendor incident declared?           → flip preemptively
error rate > 5% for 60s?            → flip
p99 latency > 3x baseline for 60s?  → flip
eval score < threshold for 5 min?   → flip (canary-detected regression)
none of the above?                  → stay on primary
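
The tree above translates directly into a function that checks the signals in order. This is a sketch under assumptions: the `PrimaryHealth` fields are hypothetical names for whatever the gateway and eval platform actually report, and the windowing (60 seconds, 5 minutes) is assumed to happen upstream in the metrics pipeline.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PrimaryHealth:
    vendor_incident: bool    # vendor declared an incident for our tier or region
    error_rate: float        # failed fraction over the last 60 seconds
    p99_latency_s: float     # p99 latency over the last 60 seconds
    baseline_p99_s: float    # normal-day p99 latency
    eval_score: float        # mean rubric score over the last 5 minutes
    eval_floor: float        # team-defined quality floor


def failover_reason(h: PrimaryHealth,
                    max_error_rate: float = 0.05,
                    latency_multiple: float = 3.0) -> Optional[str]:
    """Walk the decision tree top to bottom; None means stay on the primary."""
    if h.vendor_incident:
        return "vendor incident declared"
    if h.error_rate > max_error_rate:
        return "error rate above threshold"
    if h.p99_latency_s > latency_multiple * h.baseline_p99_s:
        return "p99 latency above threshold"
    if h.eval_score < h.eval_floor:
        return "eval score below floor"
    return None
```

Returning the reason rather than a bare boolean gives the on-call engineer and the incident log the trigger that fired.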

Three additional design notes. One — the failover decision is one-way in most production designs; flipping back to the primary requires a manual all-clear because flapping between suppliers during an incident makes things worse, not better. Two — the failover should be gradual (ramp from 0% to 100% over 30-120 seconds) so that any secondary-side rate-limit problems surface before they cause a second outage. Three — the failover should keep a small canary fraction (perhaps 5%) on the primary even during failover, so you can detect when the primary is healthy enough to start the reverse ramp.
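
The ramp and the flip-back gate from these notes can be sketched as two small functions. The assumptions: the ramp is linear, it starts from the existing 5% canary rather than from zero, and it stops at 95% so a 5% sentinel stays on the primary. The 15-minute clean-metrics gate is one of the two flip-back options this chapter mentions. All names are illustrative.

```python
def secondary_share(elapsed_s: float, ramp_s: float = 90.0,
                    canary: float = 0.05, sentinel: float = 0.05) -> float:
    """Fraction of traffic sent to the secondary, elapsed_s after the trigger."""
    target = 1.0 - sentinel  # keep a sentinel slice on the primary
    progress = min(max(elapsed_s / ramp_s, 0.0), 1.0)
    return canary + (target - canary) * progress


def may_flip_back(manual_all_clear: bool, clean_primary_minutes: float,
                  gate_minutes: float = 15.0) -> bool:
    """Allow the reverse ramp only on a manual all-clear or a clean window."""
    return manual_all_clear or clean_primary_minutes >= gate_minutes
```

With the defaults, the secondary carries about half the traffic 45 seconds after the trigger, which leaves time to spot secondary-side rate limiting before the full load arrives.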


8) Latency-tier fallback — the smaller cousin of dual-sourcing

Not every fallback is across vendors. A common pattern is latency-tier fallback within the same vendor — try frontier, fall back to mid on rate-limit errors.

                INCOMING REQUEST
                ┌──────────────┐
                │  Try Opus    │
                │  4.7 first   │
                └──────┬───────┘
              ┌────────┴─────────┐
              ▼                  ▼
           success            429 / 503 /
                              timeout
              │                  │
              ▼                  ▼
          return            ┌─────────────┐
                            │ Fall back   │
                            │ to Sonnet   │
                            │ 4.6         │
                            └──────┬──────┘
                                return

This pattern is for rate-limit pressure, not vendor outages. When the frontier tier is throttled, falling back to the mid tier degrades quality slightly but preserves availability. The mid tier typically has much higher per-tier quotas and is unlikely to be throttled at the same time.

Latency-tier fallback is sometimes the only form of fallback a team maintains, because the engineering cost is much lower than cross-vendor dual-sourcing: same supplier, same SDK, same prompt conventions, and a smaller prompt-portability tax (within-vendor model ports usually cost 1-2 points rather than the 5-12 points of a cross-vendor port).

The limit — latency-tier fallback does not help during a vendor-wide outage. If Anthropic is down, Opus and Sonnet are both unavailable. Cross-vendor dual-sourcing is the broader insurance.
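
A minimal sketch of the try-frontier-then-mid flow follows. `Throttled` is a hypothetical exception standing in for whatever the vendor client raises on a 429, 503, or timeout, and the two model calls are injected as plain callables so the example stays independent of any SDK.

```python
from typing import Callable, Tuple


class Throttled(Exception):
    """Hypothetical stand-in for a 429, 503, or timeout raised by the client."""


ModelCall = Callable[[str], str]


def call_with_tier_fallback(prompt: str, frontier: ModelCall,
                            mid: ModelCall) -> Tuple[str, str]:
    """Try the frontier tier; on throttling, retry once on the mid tier."""
    try:
        return frontier(prompt), "frontier"
    except Throttled:
        # Only availability errors trigger the fallback. Other exceptions
        # (bad request, auth failure) propagate, since the mid tier would
        # most likely fail the same way.
        return mid(prompt), "mid"
```

Returning the tier alongside the output lets downstream logging track how often the degraded path is taken, which is an early sign that the frontier quota needs renegotiating.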


9) The warm-prompts problem — why "warm" is more than canary traffic

A secondary supplier needs more than continuous traffic to be truly warm. Three things must stay continuously fresh.

Prompts validated for the secondary's conventions. The seven switching-cost taxes from chapter 6 apply in full. The secondary needs its own prompt variant, written for its own conventions, tested against the same eval rubric. When the team updates the primary's prompt, the secondary's prompt must update too — either manually or through a prompt-management system that maintains per-supplier variants.

Schema adherence verified continuously. Schema-adherence rates drift over time as workload distributions shift, as the vendor updates the model, as new edge cases emerge. The canary traffic must be monitored for schema failures so the team knows the secondary will still produce valid output when it is needed at full traffic.

Capability parity tracked. Not every secondary can do what the primary can. If the primary uses computer use, native parallel tool calling, or extended thinking, the secondary may not support those features. The team must know which capabilities the secondary lacks and plan for graceful degradation — some workloads may need to shed functionality during a failover rather than fail outright.
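
The schema-adherence measurement can be as simple as parsing each canary output and checking required fields. The sketch below assumes outputs are JSON objects and that a flat map of field name to Python type is enough to describe the schema. The example field names are hypothetical, and real pipelines would more likely validate against a full JSON Schema.

```python
import json
from typing import Mapping, Sequence


def adheres(raw: str, required: Mapping[str, type]) -> bool:
    """True if raw parses as a JSON object with every required typed field."""
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError:
        return False
    if not isinstance(obj, dict):
        return False
    return all(k in obj and isinstance(obj[k], t) for k, t in required.items())


def adherence_rate(outputs: Sequence[str], required: Mapping[str, type]) -> float:
    """Fraction of canary outputs that satisfy the schema."""
    if not outputs:
        # An empty window should never be reported as healthy.
        raise ValueError("no canary outputs in this window")
    return sum(adheres(o, required) for o in outputs) / len(outputs)
```

Tracking this rate per week on the canary slice shows drift on the secondary before a failover depends on it.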

WARM-PROMPT MAINTENANCE CHECKLIST
─────────────────────────────────
□ secondary-specific prompt variant exists
□ prompt updates propagate to both variants
□ eval rubric runs weekly against secondary canary
□ schema adherence rate measured weekly
□ capability gaps documented and graceful-degradation
  plan in place per gap
□ secondary's vendor announcements monitored for
  deprecations or behavior changes
□ failover runbook reviewed quarterly
□ failover dry-run executed quarterly (5 minutes of
  intentional failover during low-traffic window)

The quarterly dry-run is the most important item on the list. The moment of an actual incident is not the time to discover that the failover script has a typo, the secondary's API key rotated last month, or the rate-limit policy on the secondary has been silently reduced. A dry-run reveals these problems while the cost is low.
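
The recurring items on the checklist lend themselves to an automated staleness check. A rough sketch, assuming the team records the date each item was last completed; the item names mirror the checklist, and treating "quarterly" as 92 days is an assumption.

```python
from datetime import date, timedelta
from typing import Dict, List

# Maximum allowed age per recurring checklist item.
MAX_AGE = {
    "eval rubric run on secondary canary": timedelta(days=7),
    "schema adherence rate measured": timedelta(days=7),
    "failover runbook reviewed": timedelta(days=92),
    "failover dry-run executed": timedelta(days=92),
}


def overdue_items(last_done: Dict[str, date], today: date) -> List[str]:
    """Return items that were never done or were done longer ago than allowed."""
    overdue = []
    for item, max_age in MAX_AGE.items():
        done = last_done.get(item)
        if done is None or today - done > max_age:
            overdue.append(item)
    return overdue
```

Run daily, a check like this turns a skipped dry-run into a ticket instead of a surprise during an incident.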


10) Capability fallback — graceful degradation patterns

Not every workload can simply move from supplier A to supplier B at full feature parity. Three common capability gaps in 2026 and the graceful-degradation patterns for each.

Computer use — Anthropic's computer use (and recent comparable features) lets the model directly drive a screen. When the primary is down and the secondary lacks computer use, the workload either pauses, queues for the primary's return, or falls back to a non-computer-use workflow (text-only API calls).

Native parallel tool calling — OpenAI, Anthropic, and Gemini all support parallel tool calls in 2026, but open-weight secondaries without inference-engine guidance may serialize tool calls. The workaround is to serialize at the application layer: the orchestrator issues multiple sequential model calls, each requesting one tool, instead of relying on the model to emit parallel calls.

Extended thinking / reasoning modes — Anthropic's extended thinking and OpenAI's o-series reasoning are not portable to a non-reasoning secondary. The graceful degradation is to disable the reasoning request on the secondary and accept some quality loss on the hard cases — better than blocking all cases.

CAPABILITY GAP        GRACEFUL DEGRADATION
─────────────         ────────────────────
computer use          fall back to text-only API; queue for primary
parallel tool calls   serialize at application layer
extended thinking     disable reasoning request; accept quality loss
strict mode           add retry-on-validation-failure layer
prompt caching        recompute prefixes; cost increases temporarily
long-context          chunk and summarize; multiple secondary calls
multimodal vision     OCR pre-step; route text to secondary

The matching habit — for each capability the primary uses, write the graceful-degradation plan before the incident. During an incident is the wrong time to design the workaround.
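
The gap table can live in code so that the failover path looks up the plan instead of improvising one. This sketch copies the table into a dictionary; `plan_failover` and the set-based capability description are illustrative assumptions, not a feature of any gateway.

```python
from typing import Dict, Set

# Degradation plans, copied from the capability-gap table.
DEGRADATION_PLANS = {
    "computer use": "fall back to text-only API; queue for primary",
    "parallel tool calls": "serialize at application layer",
    "extended thinking": "disable reasoning request; accept quality loss",
    "strict mode": "add retry-on-validation-failure layer",
    "prompt caching": "recompute prefixes; cost increases temporarily",
    "long-context": "chunk and summarize; multiple secondary calls",
    "multimodal vision": "OCR pre-step; route text to secondary",
}


def plan_failover(workload_needs: Set[str],
                  secondary_supports: Set[str]) -> Dict[str, str]:
    """Map each capability the secondary lacks to its documented plan."""
    gaps = workload_needs - secondary_supports
    undocumented = sorted(g for g in gaps if g not in DEGRADATION_PLANS)
    if undocumented:
        # Fail loudly at design time, not during the incident.
        raise KeyError(f"no degradation plan documented for: {undocumented}")
    return {gap: DEGRADATION_PLANS[gap] for gap in sorted(gaps)}
```

Running `plan_failover` in CI for every workload makes a missing plan a failed build rather than a discovery during an outage.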


11) Worked example — the team that survived a six-hour outage

Let's walk through how the team in the hook section designed their dual-sourcing system.

Step 1 — supplier selection. Primary OpenAI GPT-4o; secondary Anthropic Sonnet 4.6. The two suppliers run on independent infrastructure (different cloud providers, different network paths) so a correlated outage is unlikely.

Step 2 — prompt parity. The team maintains two prompt variants per workload, one OpenAI-style and one Anthropic-style. Both variants are versioned in their prompt-management system (Vellum in this team's case). When the OpenAI variant is updated, the Anthropic variant is updated and re-shadow-tested before merging.

Step 3 — canary traffic. 5% of production traffic continuously routes to the Sonnet secondary. The outputs are scored against the same eval rubric the primary uses. Weekly review surfaces any quality drift.

Step 4 — failover trigger. Three triggers configured in their gateway (LiteLLM in this case): error rate >5% over 60s, p99 latency >3x baseline over 60s, or manual flip via runbook command. The vendor incident webhook is wired in but used as an early-warning signal, not an automatic trigger, because false-positive incidents are expensive.

Step 5 — gradual ramp. Failover ramps from 0% to 100% of secondary traffic over 90 seconds. A small 5% canary stays on the primary throughout the failover so the team can detect recovery.

Step 6 — graceful degradation. Three capability gaps were identified between GPT-4o and Sonnet 4.6 at the time of the design. None were critical to this workload. The team documented the gaps and the degradation plan in the runbook.

Step 7 — quarterly dry-run. Every quarter, the team runs a 5-minute intentional failover during a low-traffic window. The dry-run catches config drift, API-key rotations, and small behavior changes before they matter in a real incident.

Cost math.

  • Primary: $40,000/month
  • Secondary canary (5%): ~$2,200/month
  • LiteLLM gateway + observability: ~$300/month
  • Total: $42,500/month, vs $40,000 baseline = ~6% insurance premium

Outage outcome.

  • 6 hours of OpenAI degradation
  • ~10 seconds from trigger to first secondary traffic
  • 90 seconds to full failover
  • 95% of traffic served during outage
  • 5% failed (the small slice depending on OpenAI-specific features)
  • 0 customer-facing errors after the 90-second ramp

The 6% insurance premium bought the team a 95% outage survival rate. The math justifies itself the first time it is needed.
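
The premium arithmetic is easy to re-run for other workloads. A small sketch using the figures from the cost math; the function name is illustrative.

```python
from typing import Tuple


def insurance_premium(primary: float, canary: float,
                      overhead: float) -> Tuple[float, float]:
    """Return (total monthly bill, premium as a fraction of the primary bill)."""
    total = primary + canary + overhead
    return total, (total - primary) / primary


total, premium = insurance_premium(primary=40_000, canary=2_200, overhead=300)
print(f"total: ${total:,.0f}/month, premium: {premium:.2%}")
# total: $42,500/month, premium: 6.25%
```

The premium is measured against the primary bill alone, so a secondary with a higher per-token price raises the premium even at the same 5% canary fraction.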


12) Failure modes — dual-sourcing that doesn't pay off

SIGNAL                                FIX
──────                                ───
hot-cold secondary that has not       → upgrade to hot-warm with at least
 been tested in 6 months                1% canary traffic; quarterly dry-
                                        runs are minimum viable

hot-hot deployed for routine          → match the pattern to the workload;
 workloads to "be safe"                 routine workloads can afford
                                        hot-warm

secondary prompt drifted from         → prompt updates must propagate to
 primary prompt                          all supplier variants in the same
                                        deployment

failover trigger too sensitive         → tune thresholds; false positives
 (flaps)                                are expensive and erode trust

failover trigger too insensitive       → tune thresholds; missed triggers
 (misses real outages)                   are why the secondary exists

no graceful-degradation plan per       → for each primary-only capability,
 capability gap                          document the degradation pattern
                                        before the incident

no quarterly dry-run                   → the moment of incident is the
                                        wrong moment to test the failover

vendor incident webhook ignored        → vendor knows their own systems;
                                        wire the signal in

failover ramp instant (0% to 100%      → ramp over 30-120 seconds to
 in one step)                            surface secondary rate-limit
                                        issues gradually

flapping back to primary too soon      → require manual all-clear for
                                        reverse flip; or wait 15 minutes
                                        of clean primary metrics

The pattern across these — dual-sourcing only works if the system has been kept honest. A "warm" secondary that has not been exercised is a cold secondary with a misleading label.


Where this lives in the wild

Dual-sourcing infrastructure and patterns show up throughout the production AI ecosystem.

  • LiteLLM — open-source proxy with built-in fallback chains; primary/secondary routing with health checks. The canonical open-source dual-sourcing layer.
  • OpenRouter — multi-supplier gateway with per-request fallback policies and supplier ranking. Often used as the primary fallback abstraction.
  • Vercel AI SDK — provider abstraction with per-call provider selection; supports per-call fallback patterns.
  • Helicone — proxy with retry and fallback configuration; logs failover events for analysis.
  • AWS Bedrock — multi-supplier (Anthropic, AI21, Cohere, Meta, Mistral) within one API; cross-supplier failover within Bedrock is trivial. Cross-cloud failover from Bedrock to direct Anthropic API is still cross-vendor.
  • Azure OpenAI Service — multi-region deployments; intra-Azure region failover is straightforward.
  • Vertex AI — multi-supplier gateway; failover within Vertex is one config change.
  • Anthropic API + Bedrock + Vertex — three ways to reach the same Anthropic model. Useful for vendor-redundancy at the cloud level.
  • OpenAI API + Azure OpenAI — two ways to reach the same OpenAI models. Different SLAs, different regions, different quotas.
  • Together AI, Fireworks AI — hosted open-weight secondaries popular for cross-vendor failover from closed-weight primaries.
  • Anyscale, Modal — self-hosted infrastructure that some teams use as a tertiary fallback for the strictest residency requirements.
  • Replicate — per-model endpoints used as low-traffic emergency fallback for niche workloads.
  • Langfuse, LangSmith, Braintrust — eval platforms that compute the continuous canary quality score against an eval rubric.
  • Vellum, PromptLayer, Pezzo — prompt management with per-supplier variants, version control, and shadow-deployment support.
  • PagerDuty, Opsgenie — incident-management platforms wired to failover decisions.
  • Datadog, Grafana, Honeycomb — observability platforms tracking error rate, latency, and eval score signals that drive failover.
  • StatusPage, Statuspal — vendor status integration for incident webhook signals.
  • Anthropic Status Page, OpenAI Status Page, Google Cloud Status — vendor-published incident webhooks for the failover decision.
  • GitHub Actions + Slack — quarterly-dry-run automation in many teams' CI/CD pipelines.
  • OpenAI Evals, Anthropic Evals, Braintrust Evals — eval-set infrastructure that runs against the canary traffic.
  • Inkeep, Glean — production AI products with documented dual-sourcing patterns in their architecture.
  • Cursor, Windsurf — code-AI products with multi-supplier fallback for autocomplete and agent modes.
  • Notion AI — multi-supplier backend with intra-vendor and cross-vendor fallback.

Pause and recall

  1. Name the three dual-sourcing patterns and the cost / failover-time profile of each.
  2. Why is hot-cold acceptable for low-stakes workloads but not for customer-facing production?
  3. State the four failover trigger signals. Which one is dual-source-specific and unavailable to single-supplier systems?
  4. What three things must stay continuously fresh for a secondary to be truly "warm"?
  5. Why is the failover ramp gradual rather than instant?
  6. State the graceful-degradation pattern for three primary-only capabilities (computer use, parallel tool calls, extended thinking).
  7. Roughly what percentage premium does hot-warm dual-sourcing add to the primary supplier's bill?

Interview Q&A

Q1. Your primary supplier has a 4-hour outage. Walk me through what happens in your system. A. Three phases. One — detection. Vendor incident webhook fires within minutes, error rate signals climb in our gateway, p99 latency balloons. The failover trigger fires in under a minute. Two — failover. Gateway ramps secondary traffic from 5% canary to 100% over 90 seconds. A 5% sentinel stays on the primary so we know when it recovers. Three — degraded operation. 95% of traffic served on the secondary; the 5% that fails is the slice requiring primary-only capabilities (computer use, native feature). Customer impact is limited to the 90-second ramp window plus the 5% slice. After the outage ends, we flip back manually over a 15-minute reverse ramp after confirming primary metrics are clean. Trap: "We fall over to the secondary instantly." Instant failover without ramp causes secondary-side rate-limit issues and a second outage.

Q2. Why hot-warm rather than hot-hot? A. Cost. Hot-hot doubles the model bill — for a $40K/month workload, that is $40K/month of insurance. Hot-warm runs the secondary at 5% canary, which is roughly $2K/month — a 5-6% insurance premium for similar failover capability on most workloads. Hot-hot earns its keep on high-stakes workloads where quality is paramount and the 2x cost buys a quality lift from score-and-pick, not just availability. For most production AI systems, hot-warm is the right balance. Trap: Defaulting to hot-hot for redundancy. The cost rarely justifies it on routine workloads.

Q3. What's wrong with hot-cold dual-sourcing for a production customer-facing system? A. Three things. One — the documented config has drifted. Vendors deprecate models, change feature behaviors, rotate API keys. The hot-cold plan that worked six months ago may not work today. Two — the prompt has drifted. The team updated the primary's prompt; the secondary's prompt was not updated because nothing forced it. The secondary may not produce comparable output. Three — the team's muscle memory has decayed. During an incident, the team is rediscovering everything instead of executing. Result is recovery times of hours or days, not minutes — which is unacceptable for customer-facing production. Trap: Treating documented-only dual-sourcing as equivalent to hot-warm.

Q4. How do you decide when to trigger failover automatically vs manually? A. Three signal categories drive automatic triggers — error rate above threshold over a 60-second window, p99 latency exceeding 3x baseline over a 60-second window, and continuous-eval score below floor over a 5-minute window. These three are fast and statistically robust. Vendor incident webhooks are an early-warning signal, often used manually because vendor incidents are sometimes scoped to specific tiers or regions and may not affect this team's traffic. The flip-back-to-primary direction is always manual or has a 15-minute clean-metrics gate to prevent flapping. Trap: Triggering automatic failover on a single failed request. Statistical robustness matters.

Q5. What is the "warm prompts" problem? A. A secondary supplier needs more than canary traffic to be truly warm — it needs prompts that have been rewritten for its conventions (the seven switching-cost taxes from chapter 6), validated against the same eval rubric the primary uses, and updated whenever the primary's prompts update. Schema adherence rates must be measured continuously because they drift over time. Capability gaps must be tracked and mapped to graceful-degradation plans. The "warm" in hot-warm is not just live traffic — it is a maintenance discipline. Skip the discipline and the secondary is hot-cold with a misleading name. Trap: Equating canary traffic with full warmness.

Q6. A capability the primary supports is not available on the secondary. How do you handle it? A. Graceful degradation, designed before the incident. For each primary-only capability, document the workaround pattern. Computer use without a secondary equivalent — queue for primary's return or fall back to text-only workflow. Parallel tool calls without secondary support — serialize at the application layer. Extended thinking without secondary support — disable reasoning request, accept quality loss on hard cases. The principle is to shed functionality gracefully rather than fail the request outright. The runbook lists every gap and every fallback. Trap: Designing graceful degradation during the incident.

Q7. How do you keep your failover capability honest over time? A. Quarterly dry-runs. Every three months, deliberately failover during a low-traffic window for 5-10 minutes, verify the secondary serves traffic correctly, verify monitoring and alerts fire as expected, verify the reverse ramp works, and capture any issues (API-key rotation, rate-limit policy change, prompt drift) in a follow-up ticket. The dry-run is the only way to catch the slow drift between documented capability and actual capability. Teams that skip dry-runs discover their failover is broken during an actual incident. Trap: Assuming a system that worked when built will work indefinitely without exercise.

Q8. Why is latency-tier fallback (Opus → Sonnet within Anthropic) not a substitute for cross-vendor dual-sourcing? A. Three reasons. One — both Opus and Sonnet share Anthropic's infrastructure. A vendor-wide outage takes both down. Latency-tier fallback is for rate-limit pressure, not vendor outages. Two — even if both stay up, a serious incident affecting Anthropic's API gateway or auth layer affects all Anthropic models simultaneously. Three — the matching habit for resilience is independent failure modes. Two suppliers on the same infrastructure are not independent. Latency-tier fallback is a useful pattern for cost and rate-limit management; it is not a substitute for cross-vendor redundancy. Trap: Treating intra-vendor fallback as full dual-sourcing.


Apply now (5 min)

Step 1 — classify your current sourcing. For each production workload, mark its dual-sourcing pattern — hot-hot, hot-warm, hot-cold, or none. If most of your workloads are "none" or "hot-cold," that is the immediate priority list.

Step 2 — pick one workload for hot-warm upgrade. Choose a workload with high business cost during outage. Identify a credible secondary supplier (a different vendor, on independent infrastructure, with acceptable quality on a shadow test). Calculate the 5% canary cost — this is the insurance premium.

Step 3 — write the failover runbook. For the chosen workload, write a one-page runbook covering the failover trigger conditions, the ramp procedure, the graceful-degradation plan for each capability gap, and the flip-back-to-primary criteria. The runbook becomes the contract that the dual-sourcing system delivers against.


Bridge. Dual-sourcing handles the catastrophic case — primary supplier completely down. But most days you are not in catastrophe. Most days you are bumping into the more mundane limit — rate caps, quota windows, the Friday-afternoon spike that pushes you past your tier. The next chapter is the anatomy of rate limits and quota negotiation — how to plan headroom so you do not need failover for what is really a sizing problem.

08-rate-limits-and-quota.md