11. Model-layer bugs — when the suspect is the brain itself¶
~14 min read. The last suspect in the lineup. The most expensive to test. Sometimes the model itself is the bug.
Built on the ELI5 in 00-eli5.md. The suspects — prompt, tool, loop, memory, model — all four others have alibis now. The lineup has one face left. The model. We must interrogate it carefully.
The picture before the details¶
You have walked the lineup. Prompt is innocent. Tool is innocent. Loop ran clean. Memory was fresh. Yet the case file still shows the wrong answer. Same input as last week. Same code. Same config. So who did it?
The brain. The model itself.
Models are not constants. They shift under your feet. A version pointer changes. A safety update lands. A provider re-routes silently. Yesterday's claude-sonnet-4-5 is today's claude-sonnet-4-5-20251022. Yesterday's gpt-4-turbo aliased to 0613, today it aliases to 1106. Same name. Different brain.
See. The model layer is the hardest suspect to interrogate. Why? Because you cannot step through its code. You cannot grep its weights. You can only run inputs and read outputs. So the only test that works is A/B replay — run the same input through the old model and the new model. Diff the outputs. That diff is your confession.
suspects ordered by elimination cost
┌────────────────────────────────────────┐
│ prompt ──→ cheap to swap, diff text │
│ tool ──→ cheap to mock │
│ loop ──→ trace the control flow │
│ memory ──→ inspect storage │
│ model ──→ requires replay against │
│ previous version, costs │
│ tokens, sometimes blocked │
└────────────────────────────────────────┘
last in the lineup. for a reason.
So what to do? We name the eight ways the model goes rogue. For each, we list the trace signature, the elimination test, and the fix.
The eight model-layer bug patterns¶
1. Version regression¶
Same prompt. Same tools. Same temperature. New model snapshot. Different behaviour.
You upgraded from claude-3-5-sonnet-20240620 to claude-3-5-sonnet-20241022. Eval score on your refactoring set drops 6 points. Why? The new snapshot was trained on different data, with different RLHF preferences. The brain is genuinely different.
- Trace signature:
evidence tagmodel.versionchanged between good and bad runs. - Elimination test: Pin the old version. Replay the failing input. Output matches old behaviour? Confession.
- Fix: Pin exact snapshot. Add the failing case to your lock. Re-evaluate before next bump.
2. Capability cliff¶
A task that worked on a stronger model fails completely on a weaker one. Not degraded — failed.
You moved your coding agent from sonnet-4-5 to haiku-4-5 to save cost. Simple refactors still pass. Complex ones — multi-file, cross-cutting concerns — collapse. The smaller model cannot hold the structure in its head. Below a capability threshold, performance does not degrade smoothly. It cliffs.
- Trace signature: Failure rate spikes only on the hardest case files. Easy tasks unaffected.
- Elimination test: Replay the failing trace on the bigger model. Passes? Capability cliff.
- Fix: Route by task complexity. Small model for trivial calls. Big model for hard ones.
3. Temperature drift¶
Last week your config had temperature=0.0. Today it is 0.7. Someone merged a refactor. The default changed. Nobody noticed.
The agent now produces creative but inconsistent outputs. Tool arguments hallucinate. JSON schema breaks 1 in 20 calls.
- Trace signature:
evidence tagmodel.temperaturechanged. Or outputs for identical inputs no longer match. - Elimination test: Run the same input 5 times. Output varies? Temperature is non-zero.
- Fix: Pin temperature in code, not config. Assert it at startup. Log it on every span.
4. Refusal pattern shift¶
The model used to answer your medical-billing extraction prompts. After a safety-tuning update, it refuses 8% of them. "I cannot provide medical advice." But you are not asking for advice. You are asking it to parse a code from a PDF.
Safety classifiers are retrained constantly. The boundary moves.
- Trace signature: Output contains refusal phrases — "I cannot", "I'm not able to", "as an AI". Rate suddenly non-zero.
- Elimination test: Replay on the previous snapshot. Refusal absent? Safety shift.
- Fix: Reframe the prompt to clarify intent. Add system-prompt context. If chronic, switch model or escalate to vendor.
5. Output schema break¶
Your parser expects clean JSON. New model version wraps it in <thinking>...</thinking> tags first. Or it prepends "Sure! Here is the JSON:". Downstream parser explodes.
Reasoning models especially do this. Their default output now interleaves thought and answer.
- Trace signature: Parser exceptions spike. Raw output contains preamble or tags.
- Elimination test: Inspect the raw model output. Compare with previous snapshot's output for the same input.
- Fix: Use structured outputs API (JSON mode, tool calls). Strip preambles defensively. Update parser to ignore
<thinking>blocks.
6. Reasoning-token behaviour change¶
You switched from a regular model to a reasoning model — o1-mini, or claude-sonnet-4-5 with extended thinking enabled. Latency tripled. Cost doubled. Nobody told the on-call SRE. PagerDuty fires on p95.
The model is not slower. It is now thinking out loud, silently, for thousands of tokens.
- Trace signature:
evidence tagmodel.thinking_tokensnon-zero. Latency p95 jumped 3x without traffic spike. - Elimination test: Disable extended thinking. Replay. Latency returns? Reasoning-token cause.
- Fix: Budget thinking tokens (
thinking.budget_tokens=2000). Or route reasoning model only to hard tasks.
7. Provider routing¶
You sent the request to gpt-4. Internally, your gateway A/B-tested 5% of traffic on gpt-4o. Or your LiteLLM fallback routed to Mistral when OpenAI hit a 429. The user saw a different brain.
- Trace signature:
evidence tagmodel.providerormodel.deployment_iddiffers from intended. Output style subtly different. - Elimination test: Force the canonical provider. Replay. Bug vanishes?
- Fix: Log the actual model returned in the response (most APIs include this). Alert when the served model differs from the requested one.
8. Deprecation surprise¶
You called gpt-4-turbo. That alias used to resolve to gpt-4-turbo-2024-04-09. Quietly, the vendor re-pointed it to gpt-4-turbo-2024-11-20. Your eval scores drifted. Nobody changed code.
Aliases are convenient. Aliases are also landmines.
- Trace signature: No code change. Eval drift. Vendor changelog mentions alias update.
- Elimination test: Switch to the dated snapshot you tested. Drift gone?
- Fix: Never use floating aliases in production. Always pin the dated snapshot. See 18-versioning-agents for full discipline.
The model A/B debugging pattern¶
This is the only reliable interrogation technique for the model suspect. Run the failing input through two models — the suspected current one and a known-good prior. Diff the outputs structurally.
failing input from case file
│
┌──────────┴───────────┐
▼ ▼
┌──────────────┐ ┌──────────────┐
│ model: NEW │ │ model: OLD │
│ version: B │ │ version: A │
│ temp: 0.0 │ │ temp: 0.0 │
│ provider: X │ │ provider: X │
└──────┬───────┘ └──────┬───────┘
▼ ▼
output B output A
│ │
└──────── diff ────────┘
│
┌────────┴────────┐
│ same? not model │
│ diff? confession│
└─────────────────┘
Pin everything else. Temperature, system prompt, tool schemas, max tokens. Only the model identifier changes. If outputs differ, the model layer is your confession. If they match, the model is innocent — go back to the lineup.
Worked example — coding agent regresses after Sonnet → Haiku swap¶
The case file. A team runs a coding agent that does multi-file refactors. Eval suite has 80 tasks. Two weeks ago, eval score was 71/80. This week, 52/80. Same code. Same prompts. Same eval set.
The complaint slip says: "agent gives up on hard refactors with 'this is too complex, please break it down'." Trace shows a clean conversation. No tool errors. No loop bugs. Memory empty. The lineup is at the model now.
Step 1 — read evidence tags. Open one failing trace. The span has model.id=claude-haiku-4-5. Open a passing trace from two weeks ago. model.id=claude-sonnet-4-5. The team migrated to save 80% cost. Nobody re-ran eval.
Step 2 — A/B replay. Take 10 failing tasks. Run each through both models, temperature 0, same prompt.
task ID sonnet-4-5 haiku-4-5 diff
─────────────────────────────────────────────
t-01 pass fail refactor incomplete
t-02 pass pass identical
t-07 pass fail "too complex"
t-12 pass fail missing edge case
t-15 pass pass identical
t-23 pass fail hallucinated import
t-28 pass fail "please clarify"
t-31 pass pass identical
t-34 pass fail file boundary lost
t-40 pass fail partial diff
7 of 10 fail only on haiku. Identical inputs. Identical config. Confession — capability cliff.
Step 3 — analyse the cliff. Tasks that fail involve 3+ files or 200+ lines of context. Tasks that pass are single-file or under 50 lines. The brain runs out of structural capacity above a threshold.
Step 4 — fix. Route by complexity heuristic.
task arrives
│
estimate token count of context
│
┌─────────┴──────────┐
│ < 4K and ≤ 2 files │── yes ──→ haiku-4-5 (cheap)
└─────────┬──────────┘
no
│
▼
sonnet-4-5 (capable)
Cost goes up 20%, not the original 80%. Eval score returns to 71. Most calls still hit haiku.
Step 5 — write the lock. Add the 7 failing tasks to the regression eval. Tag them capability-cliff. Block any future model swap that drops below 70/80. The crime cannot return.
Model-layer regressions across vendor stacks¶
- OpenAI GPT-4 → GPT-4-turbo migration (2024) — teams reported 5–8 point eval regressions on hard reasoning when moving to turbo; fix was pinning
gpt-4-0613on critical paths and using turbo for cheap calls. - Anthropic Sonnet version updates —
claude-3-5-sonnet-20240620→20241022shifted coding benchmarks by 3–5 points in both directions; floating aliases caused silent shifts. - Google Gemini 1.5 → 2.0 schema diffs — default function-calling output format changed, breaking 1.5-shaped parsers and causing
JSONDecodeErrorstorms in auto-upgraded SDKs. - Mistral medium → small-latest swap — multi-tool agents started looping; the smaller model could not maintain tool-selection coherence past four turns. Classic capability cliff.
- LiteLLM proxy routing fallbacks — silent route to Together/Anyscale when OpenAI rate-limited; output style failed strict schema downstream. Fix: log
response.modeland alert on mismatch. - Anthropic model card eval deltas — each release ships task-specific deltas; the role is making suspect-5 elimination evidence public.
- OpenAI model deprecation policies — explicit sunset dates per snapshot; the role is forcing engineering hygiene around model lifecycle.
- Vendor regressions (e.g., GPT-3.5 → 4 behaviour shifts) — well-documented prompt-following changes between versions; the role is exposing how migration costs are paid in prompt rework, not just inference cost.
- Cohere Command-R version updates — multilingual ranking shifts between R and R+; the role is making non-English regressions a known class.
- Gemini version pinning via
models/gemini-1.5-pro-002style strings — explicit revision selection; the role is opting out of vendor-managed alias drift. - AWS Bedrock model versioning — explicit
modelIdper request, optional cross-region inference profiles; the role is making routing deterministic in enterprise compliance settings. - Azure OpenAI provisioned-throughput models — separate snapshot lifecycle from public ChatGPT; the role is decoupling enterprise pinning from consumer-product changes.
- Replicate model SHAs — every published model has an immutable SHA; the role is making model pinning indistinguishable from git commit pinning.
- Hugging Face revision pinning —
revision="abc123"on every model load; the role is exposing model versioning at the library level for open-source workflows. - Anthropic strict mode regressions — historical incidents where strict-mode JSON output broke for specific schema shapes; the role is showing that even constrained decoding has version-locked behaviour.
- Vertex AI Model Garden version selection — Google's UI explicitly surfaces version + check-pointed weights; the role is integrating model versioning into the deploy UI.
- OpenAI Evals snapshot diffing — eval suite reruns on two model snapshots; the role is the canonical A/B replay for suspect 5.
- Promptfoo's
--providers a,b,c— same prompts across multiple models; the role is making capability-cliff detection a one-command operation. - Braintrust regression suites — historical model performance preserved across releases; the role is keeping the cliff visible after the rollback.
- LangSmith model comparison view — side-by-side runs against two model versions; the role is suspect-5 elimination without leaving the trace UI.
- MMLU / BFCL / SWE-bench leaderboards — public deltas across snapshots; the role is providing a baseline outside the team's own eval set.
- Chatbot Arena ELO shifts — community-reported behaviour drift; the role is the lagging indicator of capability cliffs in the wild.
Recall — model-layer regressions and the A/B replay¶
- Why is the model the last suspect in the lineup, not the first?
- What is the only reliable interrogation technique for a suspected model regression?
- What is a "capability cliff" and how does it differ from a "version regression"?
- Why are floating aliases like
gpt-4-turbodangerous in production?
Interview Q&A¶
Q: How do you debug a model regression in production when you cannot reproduce it locally? A: You run the model A/B pattern. Take the failing input from the case file — captured trace with frozen seed, temperature, system prompt. Run it twice — once against the current model snapshot, once against the previous one. Pin everything else. Diff the outputs. If only the model differs and the output differs, the model layer is the confession. If outputs match, the model is innocent and the bug lives elsewhere — likely in production-only context like memory or session state.
Common wrong answer to avoid: "I would just roll back the model version" — rollback is the fix, not the diagnosis. Without confirming via A/B that the model is the cause, you might roll back and still have the bug, because the real cause was a prompt change that shipped the same week.
Q: Why is a capability cliff worse than a version regression? A: A version regression usually shifts scores by a few points across the eval set, uniformly. Annoying but visible — your aggregate metric moves. A capability cliff is worse because cheap tasks still pass perfectly. Your average eval might only drop 3 points. But on the 10% of hard tasks that matter most — multi-step reasoning, long-context refactors, complex tool chains — failure rate goes from 2% to 60%. Aggregate metrics hide it. You only notice when angry users file complaint slips about the hardest cases.
Common wrong answer to avoid: "Both are the same — just run evals and you will catch it" — only if your eval set has enough hard cases. Most eval sets over-represent easy tasks. A capability cliff can pass a 1000-task eval if only 30 of them are at the cliff edge.
Q: A user reports the agent now refuses a request that worked last month. The model version pin is unchanged. What do you check first?
A: Three things in order. One — verify the served model matches the requested model. Check the response payload's model field, not just your request. Provider routing or A/B test could have swapped it silently. Two — check provider changelog for safety-tuning updates. Even with a pinned dated snapshot, some vendors patch safety classifiers in place without changing the version string. Three — A/B against the same snapshot from a different provider mirror or local cache if you have one. If the refusal exists everywhere, it is a safety shift. If only one path refuses, it is routing.
Common wrong answer to avoid: "Just rewrite the prompt to be less risky-sounding" — that might paper over the bug but does not identify the root cause. The next safety update will break a different prompt. You need to confirm whether it is routing, in-place patch, or genuine refusal pattern shift first.
Q: Why pin dated snapshots instead of floating aliases like gpt-4-turbo or claude-3-5-sonnet-latest?
A: Floating aliases are vendor convenience, not engineering hygiene. The alias resolves to a different snapshot whenever the vendor updates it. Your eval scores drift. Your output style shifts. Your case file from yesterday is not reproducible tomorrow because the same alias now serves a different brain. Pinned dated snapshots — claude-3-5-sonnet-20241022, gpt-4-turbo-2024-04-09 — are immutable. You upgrade deliberately, with a re-evaluation, with a regression lock in place. The cost is one config line; the benefit is reproducibility and a clean A/B baseline forever.
Common wrong answer to avoid: "Floating aliases are fine because the vendor only points them at improved versions" — empirically false. Vendors point them at cheaper, faster, or differently-aligned versions. "Improved" is the vendor's metric, not yours. Your eval is the only metric that matters for your application.
Apply now (10 min)¶
Step 1 — model the exercise. Here is the A/B replay matrix I would build for a suspected capability cliff between claude-3-5-sonnet-20240620 and claude-3-5-sonnet-20241022:
| Variable | Value (pinned) |
|---|---|
| System prompt | identical (frozen from case file) |
| Tool schemas | identical (frozen from case file) |
| Temperature | 0 |
| Seed (if supported) | identical |
| Input | the exact failing input from the case file |
| Model version | the only variable |
| Run | Model | Output | Verdict |
|---|---|---|---|
| A | 20240620 | "Refund approved, $250 credit..." | matches expected behaviour |
| B | 20241022 | "I cannot determine the refund amount..." | regression confirmed — capability cliff |
Verdict: suspect 5 confesses, but only on this slice. Action: pin 20240620 on the affected path, file regression eval, plan migration to 20241022 with rework.
Step 2 — your turn. Pick one failing case file from your agent's recent traces. Note the exact model version. Replay the same input through the previous snapshot. Pin temperature 0 and identical system prompt. Diff the outputs. If they match, the model is innocent and the lineup must be re-walked. If they differ, classify the failure as one of the eight patterns and write a one-sentence confession.
Step 3 — reproduce from memory. Draw the A/B replay diagram. Label the two columns with what is pinned and what is the single variable (model version). Show the diff arrow and the two possible verdicts — "model innocent, retry lineup" or "confession found, write the lock".
What you should remember¶
This chapter explained why the model is the last suspect interrogated and the most expensive to clear. A model swap costs days of evals, regression checking, and rollout. Walking the cheaper suspects first means most bugs confess before the model interrogation ever begins. But when the model is genuinely guilty, the only reliable evidence is an A/B replay on the captured case file — same prompt, same tools, same seed, only the model version changes.
You also learned why floating aliases are vendor convenience masquerading as engineering hygiene. gpt-4-turbo, claude-3-5-sonnet-latest, and gemini-1.5-pro are not pinned. They resolve to whatever the vendor currently serves, and a Tuesday-morning vendor update can shift your eval scores without a single line of your code changing. Pin dated snapshots; upgrade deliberately.
Carry this diagnostic forward: when a regression appears with no code change, check the served model against the requested model. Vendor routing, in-place safety patches, and silent fallback providers all change the brain answering your prompt without changing the version string in your code.
Remember:
- The model is suspect 5. Walk suspects 1–4 first; model swaps cost days.
- A/B replay is the only honest interrogation. Pin everything else; vary only the model.
- Floating aliases drift. Pin dated snapshots and upgrade with intent.
- Capability cliffs hide in aggregate metrics. Build evals that over-sample the hardest 10%.
- Always log
response.model. Routing layers can lie about which brain answered.
Bridge. The single-agent lineup is exhausted. Prompt, tool, loop, memory, model — all five suspects have been interrogated. But modern agents rarely work alone. Two or three agents hand off context to each other. A new class of bug emerges — bugs that live between agents, not inside any one of them. → 12-multi-agent-handoff-bugs.md