10. Model upgrades in production — the playbook for changing the cook mid-shift

~17 min read. A model upgrade is not a config change. It is a controlled substitution of one cook for another, on a running kitchen, without the diners noticing. The playbook has five stages, and skipping any of them is how careful teams ship regressions on a Friday afternoon.

Builds on 09-vendor-risk-and-outages.md. EOLs force the upgrade. New model launches invite it. Cost reductions reward it. This chapter is what happens between the decision to upgrade and the rollback button collecting dust.

1) Hook — the upgrade that saved 22% and almost broke the kitchen

A customer-success automation runs reply drafting on a frontier cook. The team has been on Sonnet 3.7 for fourteen months. Anthropic ships Sonnet 4.6 with public benchmarks suggesting better instruction following, longer effective context, and a 22% lower output token price. The product team sees the price drop and asks for a one-week migration.

The engineering lead pushes back politely. "Let me show you the playbook, and then we will decide on a timeline." The team runs the regression eval on day one. Aggregate score with the existing system prompt — 71.4% on Sonnet 3.7, 68.8% on Sonnet 4.6. A 2.6-point regression. The price drop is real but the quality is worse with the old prompt.

The engineer spends days two through twelve re-tuning the system prompt for Sonnet 4.6. New cook is more literal about instructions, less prone to add unsolicited apologetic preamble, handles structured output slightly differently. By day twelve the regression eval lands at 77.6% on Sonnet 4.6 with the new prompt — a 6.2-point lift over the Sonnet 3.7 baseline, plus the 22% cost reduction.

Days thirteen through twenty-eight: shadow traffic. The new cook, running the new prompt, processes 100% of production tickets in parallel, but only the old cook's draft is shown to users. Every output diff is logged. The soak period catches three small issues: the new draft skips a closing line the old draft always included; the new draft is on average 14% shorter (something users turned out to like, and which the eval did not capture); and the new draft occasionally uses phrasing that conflicts with the brand voice guide on a thin slice of escalations.

Days twenty-nine through forty-five — canary rollout. 1% on day twenty-nine. 5% on day thirty-two. 25% on day thirty-five. 100% on day forty-two. Each step gated on eval scores, user satisfaction metric, and the legal team's review of the brand-voice slice.

Final outcome: a 6.2-point quality lift, a 22% cost reduction, zero user-visible regressions. The product team, who had asked for one week, now uses this upgrade as the example for every future model swap. "This is what done looks like."

That story is the playbook. Five stages. Each one earns its place.

2) The metaphor — replacing the cook while the kitchen runs

A restaurant cannot close for a week every time a cook changes shift. The kitchen must keep serving while the new cook learns the menu. The way serious kitchens do this — the new cook starts on the line beside the outgoing cook for one shift, plating identical dishes. The customer sees the outgoing cook's dish. The kitchen manager tastes both. When the new cook's plates are reliably as good as the old cook's, the new cook takes one station. Then two. Then the whole kitchen. The outgoing cook stays in the back, ready to step in if anything goes wrong.

That is shadow traffic, eval gating, canary rollout, and rollback in restaurant terms. The model upgrade playbook is exactly this pattern, written down as engineering steps.

3) The five-stage playbook

┌──────────────────────────────────────────────────────────────────┐
│ STAGE 1 — REGRESSION EVAL                                        │
│ Run the full eval suite. Set a quality bar. No bar pass, no go.  │
├──────────────────────────────────────────────────────────────────┤
│ STAGE 2 — PROMPT RE-TUNING (if needed)                           │
│ The new cook may need re-tuning. Iterate until eval clears.      │
├──────────────────────────────────────────────────────────────────┤
│ STAGE 3 — SHADOW TRAFFIC                                         │
│ Run the new cook in parallel. Users see old cook. Log diffs.     │
├──────────────────────────────────────────────────────────────────┤
│ STAGE 4 — CANARY ROLLOUT                                         │
│ 1% → 5% → 25% → 100%. Gate each step on metrics.                 │
├──────────────────────────────────────────────────────────────────┤
│ STAGE 5 — POST-CUTOVER SOAK + ROLLBACK READINESS                 │
│ Two weeks of watching. Rollback button hot the entire time.      │
└──────────────────────────────────────────────────────────────────┘

Each stage answers a different question. Stage one asks — is the new cook at least as good as the old cook on the cases we have measured? Stage two asks — if not, can prompt iteration close the gap? Stage three asks — is the new cook's output shape close enough to the old cook's that users will not notice the swap? Stage four asks — as we expose real users in increasing slices, does the live metric track our offline eval? Stage five asks — now that we are at 100%, are the slow tails we did not see in canary appearing?

Skipping any stage means asking a question after the answer would have mattered.

4) Stage one — regression eval

The eval suite is the gate. It must exist before the upgrade plan starts. If you do not have one, building one is the first job — the upgrade waits.

The structure of the gate — the eval suite from chapter 03's bake-off methodology, but used differently. In the bake-off, you compare candidate cooks head to head to pick one. In the regression eval, you ask whether the candidate cook clears the bar set by the incumbent on every relevant slice.

EVAL OUTCOME              DECISION
─────────────             ────────
New ≥ old on all slices   Proceed to stage 3 (skip stage 2)
New ≥ old aggregate,      Proceed with caution; track the regressed
slice regression          slices in production carefully
New < old aggregate       Go to stage 2 (prompt re-tune)
Re-tune fails to close    Stop. Do not upgrade this workload. File
the gap                   the finding back to the model selection
                          discussion.

The bar is per-slice, not just aggregate. If the new cook is 4 points better on average but 7 points worse on the slice that handles your high-value tickets, the aggregate hides a real regression. Aggregate-only upgrade gates are how teams ship invisible degradations.
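The decision table above can be sketched as a small function. This is a minimal illustration, not a reference implementation: the slice names, the dict-of-scores shape, and the unweighted mean used for the aggregate are all assumptions, and a tie on a slice is treated as clearing the bar.

```python
from typing import Dict


def regression_gate(old: Dict[str, float], new: Dict[str, float],
                    retuned: bool = False) -> str:
    """Map per-slice eval scores (in points) to a stage-one decision.

    `old` and `new` map slice name -> score. The aggregate is an unweighted
    mean, which is an assumption; weight by ticket volume if slices differ in size.
    """
    if old.keys() != new.keys():
        raise ValueError("old and new must be scored on the same slices")
    regressed = sorted(s for s in old if new[s] < old[s])
    old_agg = sum(old.values()) / len(old)
    new_agg = sum(new.values()) / len(new)
    if new_agg < old_agg:
        # Aggregate regression: re-tune first, stop only if re-tuning already failed.
        return "stop" if retuned else "retune"
    if regressed:
        # Aggregate holds but some slices dropped: name them so they get tracked.
        return "proceed_with_caution:" + ",".join(regressed)
    return "proceed_to_shadow"


if __name__ == "__main__":
    incumbent = {"billing": 70.0, "escalation": 80.0}
    candidate = {"billing": 78.0, "escalation": 73.0}
    print(regression_gate(incumbent, candidate))
```

The example mirrors the case in the text: the candidate wins on aggregate while losing seven points on one slice, and the gate names that slice instead of letting the average hide it.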

5) Stage two — prompt re-tuning

Most upgrades require some prompt work. The reasons vary.

New cooks change instruction-following style. Sonnet 4.6 became more literal about negative instructions than Sonnet 3.7. "Do not include apologies" now works directly where Sonnet 3.7 needed a positive rephrasing. The same prompt produces different outputs because the cook reads instructions differently.

New cooks change output shape defaults. GPT-5 ships with a default style that is less verbose than GPT-4o for the same task. If your prompt contained implicit length-management — "please keep it brief" — you can often remove the line without losing the brevity. If your prompt did not, you may now get outputs that are too brief for downstream consumers.

New cooks change tool-use formatting. Tool argument shapes shift between model generations even when the documented schema is the same. Plans cached from the old cook's tool calls may need regenerating once the new cook starts serving even a small slice of traffic.

New cooks expose reasoning-effort dials. Upgrading from a non-reasoning to a reasoning model — Sonnet 3.7 to Sonnet 4.6 with extended thinking, or GPT-4o to GPT-5 with reasoning_effort — is a different beast than a within-generation upgrade. The output shape changes. The cost shape changes. The latency shape changes. We get to this in chapter 12.

The pattern that works — iterate the system prompt against the eval suite until the new cook clears the bar. Document every change. Save the two prompts side by side. When the migration is done, write a short post on "what the new cook wanted that the old cook did not." That post saves the next team's time.

6) Stage three — shadow traffic

Shadow traffic is the most under-appreciated stage. It is the first stage that catches failures the eval suite cannot, and the only one that does so before any user sees the new cook's output.

The setup — production traffic flows to the old cook as normal. The same traffic also flows to the new cook in parallel. Users see the old cook's output. The system logs both outputs and computes diffs. Nothing about the user experience changes.

                    REQUEST
                       │
        ┌──────────────┼──────────────┐
        ▼              ▼              ▼
   OLD COOK       NEW COOK       DIFFER
   (serves        (parallel)     (logs side-by-side)
    user)
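The diagram above can be sketched in a few lines. This is an illustration under stated assumptions: `old_cook` and `new_cook` are hypothetical callables wrapping your model clients, the in-memory `diff_log` list stands in for a real log sink, and the similarity ratio from `difflib` is only a crude first-pass diff signal.

```python
import difflib
import time
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Callable, Dict, List

Cook = Callable[[str], str]  # hypothetical: takes a request, returns a draft


def timed(cook: Cook, request: str) -> Dict[str, object]:
    """Call a cook, capturing latency and any error instead of raising."""
    start = time.perf_counter()
    try:
        text, error = cook(request), None
    except Exception as exc:  # a shadow failure must never reach the user
        text, error = "", repr(exc)
    return {"text": text, "error": error, "latency_s": time.perf_counter() - start}


def serve_with_shadow(request: str, old_cook: Cook, new_cook: Cook,
                      diff_log: List[dict], pool: ThreadPoolExecutor) -> str:
    shadow: Future = pool.submit(timed, new_cook, request)  # runs in parallel
    served = old_cook(request)                              # the only output users see

    def record(done: Future) -> None:
        new = done.result()  # timed() swallows exceptions, so this does not raise
        diff_log.append({
            "request": request,
            "old": served,
            "new": new["text"],
            "new_error": new["error"],
            "new_latency_s": new["latency_s"],
            "similarity": difflib.SequenceMatcher(None, served, new["text"]).ratio(),
        })

    shadow.add_done_callback(record)  # log when the shadow finishes; user is not blocked
    return served


if __name__ == "__main__":
    log: List[dict] = []
    with ThreadPoolExecutor(max_workers=4) as pool:
        print(serve_with_shadow("Where is my refund?",
                                lambda r: "Sorry for the delay. Refund sent.",
                                lambda r: "Refund sent.", log, pool))
    print(len(log), "diff entries logged")
```

The design choice worth noting is that the user-facing return never waits on the shadow call, so a slow or failing new cook cannot change the user experience.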

The shadow window catches four things the eval misses.

Tone drift. The new cook is technically correct but reads differently. A bank's customer-service bot might shift from formally polite to conversationally polite. The eval scored both as correct. The customer trust signal might shift.

Length drift. The new cook averages 18% shorter on the same prompts. Users either love this (most do) or feel under-served (a small slice does). Length is invisible to scalar accuracy metrics.

Format edge cases. A small slice of inputs produces a format the new cook handles differently — a Markdown table renders, a JSON block uses single quotes, a list nests differently. These break downstream parsers that assume the old cook's habits.

Latency profile. The new cook is faster on average but has a fatter p99 tail. The eval did not measure latency. Production cares.
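Three of these four drifts can be summarised directly from the shadow log; tone still needs a human or a judge model. The sketch below assumes the log has already been split into lists of old and new outputs and per-call latencies. The function names are illustrative, and character count is a stand-in for whatever length unit matters to you.

```python
import json
import math
import statistics
from typing import Iterable, Sequence


def length_drift_pct(old_texts: Sequence[str], new_texts: Sequence[str]) -> float:
    """Mean character-length change of the new cook relative to the old, in percent."""
    old_mean = statistics.mean(len(t) for t in old_texts)
    new_mean = statistics.mean(len(t) for t in new_texts)
    return 100.0 * (new_mean - old_mean) / old_mean


def percentile(values: Iterable[float], pct: int) -> float:
    """Nearest-rank percentile; pct is an integer from 1 to 100."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def json_failure_rate(texts: Sequence[str]) -> float:
    """Share of outputs that a strict JSON parser rejects (e.g. single quotes)."""
    failures = 0
    for text in texts:
        try:
            json.loads(text)
        except json.JSONDecodeError:
            failures += 1
    return failures / len(texts)


if __name__ == "__main__":
    print(length_drift_pct(["a" * 10, "b" * 10], ["a" * 8, "b" * 8]))
    print(percentile([0.4, 0.5, 0.6, 3.2], 99))
    print(json_failure_rate(['{"ok": true}', "{'ok': true}"]))
```

Comparing `percentile(latencies, 50)` with `percentile(latencies, 99)` for each cook is what exposes the "faster on average, fatter tail" pattern described above.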

Soak duration. One to four weeks. Shorter for low-stakes workloads, longer for high-stakes ones. The clock starts when 100% of relevant traffic is shadowing, not when the first request is shadowing. The clock resets when you make a prompt change that closes a discovered drift.

7) Stage four — canary rollout

The canary is the controlled exposure of real users. The numbers vary, but the shape does not.

DAY    USER %     WATCH FOR
────   ──────     ─────────
1      1%         Eval delta on the canary slice; any error spike
3      5%         Same; user-feedback signal if available
7      25%        Same; first chance to see weekly cycles
14     100%       Same; rollback if anything went wrong

Each step holds for long enough to observe a full cycle of the underlying behaviour. For interactive products, twenty-four hours is the minimum at 1% — enough to see the difference between morning, afternoon, and night load shapes. For weekly products, the 25% step should hold for a full week so you see Monday's load and Friday's load on the new cook.
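One common way to make the percentage steps stable is to hash each user into a fixed bucket, so a user who saw the new cook at 1% still sees it at 5% and 25%. The sketch below is one possible approach, not the only one; the salt string and function names are hypothetical, and many teams delegate this to a feature-flag platform instead.

```python
import hashlib


def canary_bucket(user_id: str, salt: str = "reply-drafting-upgrade") -> float:
    """Stable bucket in [0, 100) with 0.01 resolution; same user, same bucket."""
    digest = hashlib.sha256(f"{salt}:{user_id}".encode("utf-8")).digest()
    return (int.from_bytes(digest[:8], "big") % 10_000) / 100.0


def serves_new_cook(user_id: str, rollout_pct: float) -> bool:
    """True if this user is inside the current canary slice."""
    return canary_bucket(user_id) < rollout_pct


if __name__ == "__main__":
    users = [f"user-{i}" for i in range(2000)]
    for pct in (1, 5, 25, 100):
        exposed = sum(serves_new_cook(u, pct) for u in users)
        print(f"{pct:>3}% target: {exposed} of {len(users)} users on the new cook")
```

Changing the salt reshuffles the buckets, which is useful when a later upgrade should not always expose the same early users.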

The gate at each step — not a binary "any regression, halt" but a threshold. "If the canary's eval slice regresses by more than 2 points or the user-satisfaction signal regresses by more than 0.5 points, halt and investigate." Pure stop-on-any-noise gates either never pass (noise floor) or get overridden (judgement is dulled).
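The threshold gate described above might look like this in code. The 2-point and 0.5-point limits come from the example in the text; the `CanaryReading` shape and the sign convention (canary minus control, negative means regression) are assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CanaryReading:
    eval_delta_points: float  # canary minus control; negative is a regression
    csat_delta_points: float  # user-satisfaction delta, same convention


def canary_decision(reading: CanaryReading,
                    max_eval_drop: float = 2.0,
                    max_csat_drop: float = 0.5) -> str:
    """Halt only when a regression exceeds its threshold, not on any noise."""
    if reading.eval_delta_points < -max_eval_drop:
        return "halt"
    if reading.csat_delta_points < -max_csat_drop:
        return "halt"
    return "advance"


if __name__ == "__main__":
    print(canary_decision(CanaryReading(-1.5, -0.2)))  # inside both thresholds
    print(canary_decision(CanaryReading(-2.4, 0.1)))   # eval drop beyond 2 points
```

In practice the thresholds should sit above the measured noise floor of each metric, otherwise the gate behaves like the stop-on-any-noise gate it is meant to replace.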

The output diff problem reappears at canary. Shadow compared outputs, but nobody acted on the new cook's drafts, so anything that depends on real users reading them stayed invisible. Patterns the shadow window happened to miss (a rare ticket type, a Monday-morning escalation pattern, a peak-load latency tail) can also surface as exposure grows from 1% to 100%. This is by design. Canary is the safety net for what the shadow missed.

8) Stage five — post-cutover soak and rollback

The cutover is not the end. The two weeks after 100% are when the slow tails appear. The rollback button stays armed for the whole window.

The post-cutover dashboard. Daily — eval score (continuous regression eval against production endpoint), user satisfaction, error rate, p99 latency, monthly cost projection. Weekly — manual sampling of 20-50 outputs for tone and format, support-ticket volume from end users.

Rollback target — under five minutes from decision to rollback complete. That number is achievable only if the rollback is a flag flip, not a deploy. The new and old cooks should be configured side by side, with the active cook chosen by a config switch. Re-deploying to roll back is too slow when something is on fire.
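A flag-flip rollback can be as small as the sketch below. Everything named here is hypothetical: the model IDs, the prompt paths, and the `active_cook` flag key are placeholders, and in a real system the flag would come from your config service or feature-flag platform rather than a JSON string.

```python
import json

# Both cooks stay configured side by side for the whole soak window.
# Model IDs and prompt paths are placeholders, not real identifiers.
COOKS = {
    "old": {"model": "old-model-pinned-id", "prompt_path": "prompts/reply_v7.txt"},
    "new": {"model": "new-model-pinned-id", "prompt_path": "prompts/reply_v8.txt"},
}


def resolve_cook(flag_json: str) -> dict:
    """Pick the active cook from a runtime flag; default to the old cook."""
    flags = json.loads(flag_json)
    name = flags.get("active_cook", "old")
    if name not in COOKS:
        raise ValueError(f"unknown cook {name!r}; expected one of {sorted(COOKS)}")
    return COOKS[name]


if __name__ == "__main__":
    print(resolve_cook('{"active_cook": "new"}')["model"])  # after cutover
    print(resolve_cook('{"active_cook": "old"}')["model"])  # after the rollback flip
```

Because the old prompt travels with the old model in the same entry, the flip restores the full old cook rather than just the old model name.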

ROLLBACK READINESS CHECK
─────────────────────────
☐ Old cook still callable (not removed from config)
☐ Old prompt still in repo at the version it last shipped
☐ Config flip is a one-line change reviewable in 30 seconds
☐ Rollback procedure has been tested in staging in the last week
☐ On-call runbook references the rollback step explicitly

When the soak window closes cleanly, the old cook can be removed from config. Until then, the cost of keeping two cooks warm is cheap insurance.

Mid-content recall

Why is aggregate eval score an insufficient gate for a model upgrade?
What kinds of drift does shadow traffic catch that the eval suite cannot?
Why does the canary's 1% step typically hold for at least twenty-four hours?

9) The cost-aware upgrade — when cheaper means re-tuning

Sometimes the new cook is genuinely cheaper, but the prompts must be re-tuned to maintain quality. This is the most common upgrade pattern in 2025-2026.

Why it happens. Each model generation is post-trained on a different mix of supervised and preference data. A prompt that exploited the old cook's specific biases — say, the old cook reliably followed a numbered list instruction — may produce different output on the new cook because the new cook is post-trained to handle lists slightly differently.

The trap to avoid — "the new cook scored 2 points lower on the eval, so the upgrade is not worth it." That conclusion is only valid if you ran stage two and concluded that prompt re-tuning could not close the gap. Without stage two, the conclusion is premature. Many upgrades that look like 2-point regressions turn into 4-point lifts after a week of prompt work.

The reverse trap — "the new cook is cheaper, just swap it in." That swap is the regression-shipping pattern. Cost savings that come with a quality regression are a tax on the user, not an engineering win.

10) The reasoning-effort dial — when the cook becomes a thinker

Upgrading from a non-reasoning cook to a reasoning cook is structurally different from a within-generation upgrade. Opus 4.7 with extended thinking. GPT-5 with reasoning_effort: high. Gemini 2.5 Pro with the thinking budget set. These cooks generate a chain of internal reasoning before they produce the user-facing output.

The output shape changes. The cook is more accurate on multi-step problems, more deliberate on tool calls, less prone to surface-level guesses. The latency profile changes — typical reasoning calls are 2-10x slower than non-reasoning calls for the same prompt.

The cost shape changes. Reasoning tokens are billed. We get to the specifics in chapter 12. For now — assume that a reasoning model on the same prompt can cost 3-10x what a non-reasoning model cost, even if the per-token price is the same. The reasoning tokens are invisible to the prompt but very visible to the invoice.
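A quick worked example shows how the multiplier arises at an unchanged per-token price. The sketch assumes reasoning tokens are billed at the output-token rate, which is how the major suppliers describe it at the time of writing but is worth checking against your own invoice. The token counts and prices below are made up for illustration.

```python
def call_cost_usd(input_tokens: int, output_tokens: int, reasoning_tokens: int,
                  input_price_per_mtok: float, output_price_per_mtok: float) -> float:
    """Per-call cost, assuming reasoning tokens are billed as output tokens."""
    billed_output = output_tokens + reasoning_tokens
    return (input_tokens * input_price_per_mtok
            + billed_output * output_price_per_mtok) / 1_000_000


if __name__ == "__main__":
    # Illustrative numbers only: 1,000 input tokens, 300 visible output tokens,
    # hypothetical prices of $3 and $15 per million tokens.
    plain = call_cost_usd(1000, 300, 0, 3.0, 15.0)
    thinking = call_cost_usd(1000, 300, 2400, 3.0, 15.0)
    print(f"non-reasoning: ${plain:.4f}  reasoning: ${thinking:.4f}  "
          f"ratio: {thinking / plain:.1f}x")
```

With 2,400 reasoning tokens behind a 300-token answer, the call costs roughly 5.8 times as much even though neither price changed.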

The implication for the upgrade playbook — stage three (shadow) must measure latency and per-call cost, not just output quality. A workload that has a strict latency budget may not survive the swap regardless of quality lift. A workload that has a strict cost budget may need the reasoning dial tuned down to acceptable levels.

The pattern that works — treat the dial as part of the prompt. The combination of (model, system prompt, reasoning effort) is the cook that gets evaluated. Two values of the dial are two different cooks.
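One way to enforce this is to make the triple the unit of identity in eval results and logs. The sketch below is an assumption about how that could be modelled; the `CookSpec` name, its fields, and the key format are illustrative.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class CookSpec:
    """The thing that gets evaluated: model, system prompt, and reasoning effort."""
    model: str
    system_prompt: str
    reasoning_effort: str = "none"

    def key(self) -> str:
        """Short stable identifier for tagging eval runs and request logs."""
        prompt_hash = hashlib.sha256(self.system_prompt.encode("utf-8")).hexdigest()[:12]
        return f"{self.model}|{self.reasoning_effort}|{prompt_hash}"


if __name__ == "__main__":
    low = CookSpec("new-model", "You draft support replies.", "low")
    high = CookSpec("new-model", "You draft support replies.", "high")
    print(low.key())
    print(high.key())
    print("same cook?", low == high)
```

Tagging every eval score with `key()` makes it hard to compare a low-effort eval run against high-effort production traffic by accident.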

11) Failure modes — where upgrades drift off the rails

LEAK                                     FIX
────────────────────────────────────     ──────────────────────────────────
Upgrading without an eval suite          Build the eval suite first; the
                                         upgrade waits

Aggregate-only gating                    Per-slice gates; never let the
                                         average mask a real regression

Skipping shadow because eval passed      Shadow always; eval cannot see
                                         tone, length, format drift

Canary too short                         At least one cycle per step (24h
                                         minimum for interactive products)

Rollback is a deploy                     Rollback must be a config flip;
                                         test it in staging weekly

Removing the old cook on cutover day     Keep both cooks warm for the
                                         two-week post-cutover soak

Calling a cost reduction an upgrade      Cost reductions without quality
when quality drops                       maintenance are regressions

Upgrading across reasoning-effort        Treat reasoning effort as part of
boundary without re-evaluating cost      the cook; re-evaluate cost and
                                         latency, not just quality

Eight common failures. The shared pattern — the playbook exists because the failure modes do. Every stage skipped is a failure mode invited.

Where this lives in the wild

The upgrade playbook is operationalized differently across the toolchain, but the stages are the same.

Anthropic API — versioned model slugs make pinning trivial; the recommended pattern is to pin during upgrade and unpin only after soak.
OpenAI API — dated snapshots play the same role; the gpt-5-2026-02-15 style pins are the rollback anchor.
Google Gemini API — -001, -002 stable versions enable the same pattern; the rolling model name is the upgrade target only after soak.
Mistral La Plateforme — versioned endpoints for Mistral Large 2 and open-weight variants; same pinning discipline applies.
AWS Bedrock — versioned model IDs (anthropic.claude-sonnet-4-6-v1:0-style) let you canary by model ID without touching application code.
Azure OpenAI Service — deployment-level model versions; you stand up a new deployment for the new cook and switch traffic via routing config.
Vertex AI — model garden with versioned endpoints for Gemini and partner models; canary at the routing layer.
Together AI, Fireworks AI, Anyscale, Replicate — open-weight hosts where the model version is the cook; portability cost between hosts is low because the weights are the same.
OpenRouter — model identifier per snapshot; useful for A/B between suppliers, not just within one supplier.
LiteLLM — the open-source router where the model swap typically lives as a config change.
Vercel AI SDK — provider-agnostic client; the upgrade target is a one-line change in the model factory.
Helicone — observability with model-version tagging on requests; enables shadow diff analysis at scale.
Langfuse — open-source observability with eval-on-trace; the regression-eval-on-production-traffic pattern lives here for many teams.
LangSmith — LangChain's observability platform with built-in eval runs against datasets; common home for stage one and stage two work.
Braintrust — eval-first platform that explicitly supports the per-model-version comparison pattern.
Vellum — prompt management with side-by-side model comparison and deployment gating.
PromptLayer — prompt versioning that pairs with model versioning for the upgrade traceability story.
Pezzo — open-source prompt management with versioning; same pairing pattern.
Statsig, LaunchDarkly — feature-flag platforms that many teams use to drive the canary percentage and the rollback flip.
GitHub Copilot Enterprise — internally runs a regression eval suite against every model upgrade before exposing to users.
Cursor — publishes model upgrades visibly; the user-facing model picker is the canary mechanism for power users.
Notion AI, Slack AI, Linear AI — application layer; model upgrades go through internal shadow and canary before user exposure.

Pause and recall

Name the five stages of the upgrade playbook and what each one answers.
Why does the eval suite need to gate per-slice and not just on aggregate?
What four drift types does shadow traffic catch that the eval suite misses?
Why must rollback be a config flip rather than a deploy?
When is stage two (prompt re-tuning) skipped, and when is it required?
How does the upgrade playbook change when crossing from a non-reasoning to a reasoning cook?
What is the minimum hold time at each canary step for an interactive product, and why?

Interview Q&A

Q1. Walk me through how you would upgrade a production system from Sonnet 3.7 to Sonnet 4.6. A. Five stages. Stage one — run the existing eval suite against Sonnet 4.6 with no prompt changes. Set per-slice gates, not just aggregate. Stage two — if any slice regresses, iterate the system prompt against Sonnet 4.6 until the eval clears. Stage three — shadow traffic for one to four weeks; log output diffs; catch tone, length, and format drift that the eval cannot see. Stage four — canary 1% → 5% → 25% → 100% with metric gates at each step. Stage five — two-week soak after 100% with rollback armed. Total elapsed time is typically four to eight weeks for a production system. Trap: "Just swap the model name in config." That swap ships regressions.

Q2. Your new model scores 2 points lower on the eval suite. Should you skip the upgrade? A. Not yet. Run stage two first. Prompt re-tuning closes most 2-3 point gaps because the new cook is usually post-trained differently and the old prompt was fitted to the old cook. Skip the upgrade only if stage two cannot close the gap with reasonable effort. The conclusion "the new model is worse" is only valid after the new model has been given the prompt it needs. Trap: Treating eval-on-old-prompt as the only signal.

Q3. Why is shadow traffic necessary if the eval suite is comprehensive? A. Eval suites measure scalar correctness. Shadow traffic catches the non-scalar drifts that affect users — tone, length, format edge cases, latency profiles. The eval can score two outputs as both correct while the user experience differs materially. Shadow is the only place that side-by-side comparison on real traffic happens. Trap: "Our eval is good enough." No eval is good enough to skip shadow. The eval measures what the engineer thought to measure.

Q4. How long should the canary at 1% run before going to 5%? A. At least one full operational cycle. For interactive products with daily load patterns, twenty-four hours. For weekly products (consumer apps, employee tools), longer at the higher percentages — the 25% step often holds for a full week so Monday's load shape and Friday's load shape are both observed on the new cook. Trap: Quoting a fixed time without reference to the cycle. "24 hours at each step" makes no sense for a workload with weekly seasonality.

Q5. What does it cost to skip the rollback-readiness check? A. The next time an upgrade goes wrong, the rollback takes hours instead of minutes. In customer-facing workloads, that delta is the difference between an incident report and a contractual SLA breach. Rollback readiness is a config flip, an old prompt in repo, an old model still callable, and a tested procedure. None of those are expensive to maintain. The cost is paid only when you skip them. Trap: "Rollback is rare so it does not need to be fast." Rare and catastrophic is exactly the combination that demands fast.

Q6. What changes when upgrading from a non-reasoning to a reasoning model? A. The output shape, latency, and cost all change. Reasoning tokens never appear in the prompt or the user-facing output, but they are billed, and reasoning calls typically run 2-10x slower. Stage three of the playbook must measure latency and per-call cost, not just quality. The reasoning-effort dial becomes part of the cook: two values of the dial are two different cooks for eval purposes. Trap: Treating reasoning models as a quality-only upgrade. The operational profile is different.

Q7. A cheaper model scored 5 points better on the eval after prompt re-tuning. Are you ready to ship? A. Not yet. Stage three is next. Shadow traffic catches the drifts the eval missed. Then canary. The eval and prompt work is necessary but not sufficient. The full playbook is what carries the upgrade to production without a regression. Trap: Stopping the playbook at the eval win.

Q8. How do you justify the upgrade playbook to a product manager asking for a one-week swap? A. The one-week swap saves a week if everything goes right. The playbook saves a quarter if anything goes wrong. The cost of the playbook is one engineer's attention for four to eight weeks. The cost of skipping it is a regression that ships, a rollback that takes hours, and a trust loss with the team that depends on the output. The arithmetic favours the playbook on any workload that matters. Trap: "We are too small to need the full playbook." The smaller the team, the less recovery capacity. The playbook matters more, not less.

Apply now (5 min)

Step 1 — locate your eval gate. Find the eval suite that would gate your next model upgrade. If you cannot find it, you have not yet done stage zero — the eval suite is the precondition for stage one.

Step 2 — write your rollback playbook. Open a doc. Write — "to roll back the most recent model upgrade, I would do these three things in order." If any of the three steps is "redeploy," that is the playbook gap. Rollback should be a config flip.

Step 3 — pick a recent upgrade and grade yourself. For the last model swap your system did, score it against the five stages. For each stage you skipped, write one sentence — "the failure mode this stage exists to prevent is...". The stages you cannot defend skipping are the stages to add to the next swap.

Bridge. Upgrades assume you are renting the kitchen from a supplier. The next chapter is the economics of owning the kitchen. When does self-hosting pay off, when does it cost more than you save, and what does the crossover curve actually look like.

→ 11-on-prem-vs-managed-economics.md