13. Honest admission: what model selection still does not solve
~12 min read. Mature applied AI leads sound calm because they have read the contract twice and know which clauses are missing. Model selection is full of those missing clauses.
Built on the ELI5 in 00-eli5.md. We still have the cook, the kitchen, the ticket, the supplier, the matching habit, and the second supplier. This final lesson is an honest look at what those placeholders still cannot promise.
1) Capability drift between minor versions is real, and silent
A supplier ships claude-sonnet-4-6 in March and claude-sonnet-4-6-20260512 in May. Same name, same tier, same price. The release notes say "improved tool use and reduced refusal rate". They do not say "now produces 8% longer summaries on average" or "fixed a regression in date parsing that made your eval suite pass for the wrong reason."
The capability drift is real. Your prompt that anchored against a specific verbosity now drifts. Your downstream parser that counted on a certain JSON style breaks subtly. The supplier did not lie. They also did not tell you everything.
Senior teams treat every minor model version as a new model for the purposes of evaluation. Same name does not mean same behavior.
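One way to enforce "every minor version is a new model" is to key approval on the exact identifier rather than the family name. The sketch below is a minimal illustration, not a prescribed implementation: the `EVALUATED` registry, the `EvalRecord` type, and the 0.95 threshold are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalRecord:
    model_id: str     # exact identifier sent to the API
    pass_rate: float  # share of regression cases that passed, 0.0 to 1.0


# Hypothetical registry: one entry per identifier that has actually been evaluated.
EVALUATED = {
    "claude-sonnet-4-6": EvalRecord("claude-sonnet-4-6", 0.97),
}

MIN_PASS_RATE = 0.95  # assumed release threshold; tune to your own suite


def approved_for_production(model_id: str) -> bool:
    """Allow a model only if this exact identifier has a passing eval record."""
    record = EVALUATED.get(model_id)
    return record is not None and record.pass_rate >= MIN_PASS_RATE
```

With this registry, `approved_for_production("claude-sonnet-4-6-20260512")` stays `False` until someone runs the suite against that identifier and records the result, even though the family name matches.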
2) Vendor opacity on internals is a constant
When the supplier says "Sonnet 4.6", you do not know:
- Which exact checkpoint serves your request today
- What capacity tier you sit in vs your noisy neighbors
- Whether the model behind the API is the same one OpenAI/Anthropic/Google described in their paper
- Whether quantization, routing, or speculative decoding affects your output
You can detect drift through evals. You cannot inspect the cause. Mature teams accept this and design around the symptom rather than chasing the cause.
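Designing around the symptom can start with something as small as watching output length. The sketch below flags drift when the mean length of recent outputs moves by more than a chosen number of baseline standard deviations. It is a crude check under stated assumptions: the function name and the 0.5 threshold are hypothetical, and a production monitor would likely track several shape metrics.

```python
from statistics import mean, stdev


def length_drift(baseline: list[int], current: list[int], threshold: float = 0.5) -> bool:
    """Flag drift when mean output length shifts by more than
    `threshold` baseline standard deviations."""
    if len(baseline) < 2 or not current:
        raise ValueError("need at least two baseline samples and one current sample")
    spread = stdev(baseline)
    if spread == 0:
        # No variation in the baseline: any change in the mean counts as drift.
        return mean(current) != mean(baseline)
    shift = abs(mean(current) - mean(baseline)) / spread
    return shift > threshold


baseline_lengths = [100, 110, 90, 105, 95]   # tokens per summary before the update
current_lengths = [108, 119, 97, 113, 103]   # roughly 8% longer after the update
drifted = length_drift(baseline_lengths, current_lengths)
```

The check says that something changed, not why. That matches the position above: you alert on the symptom and leave the cause to the supplier.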
3) Judge-model bias contaminates every bake-off
You ran the bake-off carefully. Same eval set, same prompts, same temperature, statistical significance computed. Sonnet beat Haiku 58-42 with p < 0.05. Verdict: Sonnet.
But the judge was Sonnet 4.6 itself. Or it was Opus 4.7, which comes from the same supplier and likely shares much of its training lineage with Sonnet. Or it was GPT-5, which has its own preferences about formality and length.
LLM-as-judge is itself a model. It has its own biases — toward longer outputs, toward outputs that look like its own outputs, toward outputs with certain stylistic markers. Pairwise judging reduces some of this. Ensemble judging reduces more. Neither eliminates it.
The honest position — every bake-off result is conditional on the judge. Re-run with a different judge to see how stable the verdict is.
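Two small checks make "conditional on the judge" measurable. The sketch below uses an exact two-sided sign test for a win/loss split and a plain agreement rate between two judges on the same items. Both helper names are hypothetical, and the sign test assumes independent comparisons with ties dropped.

```python
from math import comb


def sign_test_p(wins: int, losses: int) -> float:
    """Two-sided exact binomial p-value for a win/loss split (ties dropped)."""
    n = wins + losses
    k = max(wins, losses)
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * tail)


def judge_agreement(verdicts_a: list[str], verdicts_b: list[str]) -> float:
    """Share of items on which two judges picked the same winner."""
    if len(verdicts_a) != len(verdicts_b) or not verdicts_a:
        raise ValueError("verdict lists must be non-empty and equal in length")
    same = sum(a == b for a, b in zip(verdicts_a, verdicts_b))
    return same / len(verdicts_a)


p_100 = sign_test_p(58, 42)    # 58-42 split over 100 comparisons
p_200 = sign_test_p(116, 84)   # same ratio over 200 comparisons
```

Under this test a 58-42 split over 100 comparisons does not clear p < 0.05, while the same ratio over 200 comparisons does, so the sample size belongs next to every verdict. Significance also says nothing about bias: a judge that always prefers longer outputs will produce a very significant, very wrong result. Low agreement between judges is the signal that the verdict is fragile.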
4) Prompt portability cost is never fully quantifiable upfront
You plan to migrate from Sonnet to Gemini Flash. You estimate the migration cost at two engineer-weeks based on prompt edits, schema differences, and tokenizer math. The actual cost is six engineer-weeks: the new model's behavior on your specific tool descriptions surprises you, three of your few-shot examples were anchoring to a Claude-specific pattern, and your eval set did not cover the edge case your customer hit on day three of the canary rollout.
Migration estimates are reliably wrong. The variance is high because what you do not know — the subtle anchoring your current prompt does on the supplier's specific habits — is invisible until you try.
The practical hedge is to budget 2-3x your estimate and never migrate two things at once.
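Both halves of that hedge are easy to make mechanical. The sketch below compares two deployment configs and reports whether exactly one field changed, then applies the 2-3x multiplier to an estimate. The config keys and helper names are hypothetical.

```python
def changed_fields(baseline: dict[str, str], candidate: dict[str, str]) -> list[str]:
    """Return the keys whose values differ between two deployment configs."""
    keys = sorted(set(baseline) | set(candidate))
    return [k for k in keys if baseline.get(k) != candidate.get(k)]


def is_single_variable_change(baseline: dict[str, str], candidate: dict[str, str]) -> bool:
    """True when exactly one field changed, so a regression has one suspect."""
    return len(changed_fields(baseline, candidate)) == 1


def budget_range(estimate_weeks: float) -> tuple[float, float]:
    """Apply the 2-3x multiplier to an engineer-week estimate."""
    return (2 * estimate_weeks, 3 * estimate_weeks)


current = {"model": "sonnet", "prompt": "v12", "tools": "v3"}
model_only = {"model": "gemini-flash", "prompt": "v12", "tools": "v3"}
model_and_prompt = {"model": "gemini-flash", "prompt": "v13", "tools": "v3"}
```

Here `model_only` passes the check and `model_and_prompt` fails it. For the two-week estimate in the example above, `budget_range(2)` gives a range of four to six weeks, and the actual cost landed at the top of it.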
5) Model EOL timelines are unilateral and short
A supplier announces a model deprecation. The majors usually give 6-12 months' notice, sometimes less. Open-weight models do not EOL, but the hosted infra serving them does: Together AI or Fireworks can sunset a specific open-weight deployment with 30-60 days' notice.
You do not negotiate this. The supplier announces; you migrate. The kitchen manager who built around a single cook with no fallback signs up for this risk every time.
This is why the second supplier is not optional at production scale. It is the answer to a deprecation you cannot see coming.
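A second supplier only helps if the call path can actually reach it. The sketch below tries suppliers in order and returns the first success. The `Provider` callable type and the two stub suppliers are hypothetical stand-ins for real SDK clients, which would also need narrower error handling and timeouts.

```python
from typing import Callable

Provider = Callable[[str], str]  # hypothetical: takes a prompt, returns a completion


class AllProvidersFailed(Exception):
    """Raised when every configured supplier raised an error."""


def complete_with_fallback(prompt: str, providers: list[tuple[str, Provider]]) -> tuple[str, str]:
    """Try suppliers in order; return (supplier_name, completion) from the first success."""
    errors: list[str] = []
    for name, call in providers:
        try:
            return name, call(prompt)
        except Exception as exc:  # a real client would catch narrower error types
            errors.append(f"{name}: {exc}")
    raise AllProvidersFailed("; ".join(errors))


# Stub suppliers for illustration only.
def retired_primary(prompt: str) -> str:
    raise RuntimeError("model retired")


def steady_secondary(prompt: str) -> str:
    return f"echo: {prompt}"


result = complete_with_fallback(
    "hi", [("primary", retired_primary), ("secondary", steady_secondary)]
)
```

The fallback path is only as good as the last time it was exercised. Routing a small share of real traffic through the second supplier keeps its prompts and evals from going stale.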
6) Capacity is a black box at consumer tiers
You are on Anthropic Tier 4 with 4000 RPM and 800K TPM on Sonnet. The published numbers are limits, not guarantees. The supplier does not commit to delivering those numbers under load — there are no SLAs on TPM at non-enterprise tiers.
When the supplier has a global capacity crunch (a new model launch, a viral consumer app on their infrastructure, a regional outage), your effective TPM might drop. You will see it as latency tails and occasional 503s, not as a notice.
Enterprise contracts include capacity SLAs. Consumer tiers do not. Plan accordingly.
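Because the capacity signal arrives as errors rather than as a notice, it helps to separate "you hit your limit" from "the supplier is struggling" in your own metrics. The sketch below is a minimal bucketing by HTTP status under an assumed convention: 429 means a rate limit and any 5xx means a supplier-side problem. Individual suppliers may use other codes for overload, so check their documentation.

```python
from collections import Counter


def classify_status(status: int) -> str:
    """Bucket an HTTP status code by the likely side of the problem."""
    if status == 429:
        return "rate_limited"   # you exceeded a published limit
    if 500 <= status <= 599:
        return "supplier_side"  # overload or outage upstream
    if 200 <= status <= 299:
        return "ok"
    return "other"


def supplier_side_rate(statuses: list[int]) -> float:
    """Share of requests in a window that failed with a 5xx status."""
    if not statuses:
        return 0.0
    counts = Counter(classify_status(s) for s in statuses)
    return counts["supplier_side"] / len(statuses)
```

A rising `supplier_side_rate` with a flat `rate_limited` count, at normal traffic, is the pattern described above: your load did not change, the supplier's headroom did.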
7) Emergent capabilities arrive without notice
You deployed your agent on Sonnet 4.6 carefully. It does its job within a narrow scope. A minor update lands. Suddenly the agent can do things you did not intend — it now understands a tool's description in a way that lets it skip a confirmation step you assumed it would always honor. Or it now writes longer code when asked, or it now invokes a tool you had hoped it would not.
Emergent capability is good for users but unpredictable for operators. The defense is not at the model level — it is at the schema, prompt, and policy level. Tight schemas, strict tool descriptions, approval gates for state-mutating actions. Module 16 (designing agents) covers this in depth.
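An approval gate for state-mutating actions works best when it lives in code the model cannot talk its way around. The sketch below is a minimal illustration: the `Tool` type, the `mutates_state` flag, and the tool names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tool:
    name: str
    mutates_state: bool  # True for anything that writes, sends, pays, or deletes


class ApprovalRequired(Exception):
    """Raised when a state-mutating tool call lacks explicit approval."""


def authorize_call(tool: Tool, approved: bool) -> None:
    """Enforce the gate in code, outside the model's control."""
    if tool.mutates_state and not approved:
        raise ApprovalRequired(f"{tool.name} needs human approval")


def is_allowed(tool: Tool, approved: bool) -> bool:
    """Boolean wrapper around authorize_call for logging and dashboards."""
    try:
        authorize_call(tool, approved)
    except ApprovalRequired:
        return False
    return True
```

If a minor update makes the model more willing to call `refund_order` without asking, the call still stops at `authorize_call`. The prompt can ask for confirmation; the gate is what guarantees it.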
8) Reasoning-effort knobs interact unpredictably with cost
Modern frontier models expose a reasoning-effort dial: Opus 4.7's extended thinking, GPT-5's reasoning_effort parameter, Gemini's thinking-mode flag. Higher reasoning effort usually means better answers on hard tasks and always means more hidden tokens. The hidden tokens are billed.
Predicting the right reasoning level per ticket is hard. Too low and quality regresses; too high and you burn budget on tickets that did not need it. There is no clean formula. You experiment, evaluate, and tune — and the tuning needs to be re-done when the model updates, because the relationship between reasoning level and quality shifts.
This is the most expensive surprise in 2026 model selection. Teams routinely discover their token bill doubled because someone left reasoning at "high" by default.
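The arithmetic behind a doubled bill is short. The sketch below assumes reasoning tokens are billed at the output rate and uses placeholder prices of $3 and $15 per million tokens. Those prices, the token counts, and the `EFFORT_BY_TASK` table are hypothetical.

```python
def request_cost(input_tokens: int, visible_output_tokens: int, reasoning_tokens: int,
                 input_price: float, output_price: float) -> float:
    """Cost of one request in dollars; prices are per million tokens.

    Assumes reasoning tokens are billed at the output rate.
    """
    billed_output = visible_output_tokens + reasoning_tokens
    return (input_tokens * input_price + billed_output * output_price) / 1_000_000


# Explicit effort per task class. An unknown class raises KeyError
# instead of silently falling back to "high".
EFFORT_BY_TASK = {"classify": "low", "summarize": "medium", "plan": "high"}


def effort_for(task_class: str) -> str:
    return EFFORT_BY_TASK[task_class]


without_reasoning = request_cost(2000, 500, 0, 3.0, 15.0)
with_reasoning = request_cost(2000, 500, 900, 3.0, 15.0)
```

In this example 900 hidden reasoning tokens on a 500-token answer are enough to double the cost of the request, and nothing in the visible output shows it. Having no default in `effort_for` is a deliberate choice: a new task class has to pick its effort level on purpose.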
Where this lives in the wild
- OpenAI model deprecation page — 6-12 month notices, mechanical migration paths.
- Anthropic API changelog — minor version bumps documented but behavior diffs are not.
- Gemini model lifecycle — Pro/Flash/Flash-Lite tiers with their own deprecation cadence.
- AWS Bedrock model deprecation notifications — propagated from upstream suppliers.
- Azure OpenAI model retirement schedule — published in advance.
- OpenRouter status page — shows real-time provider availability across all suppliers.
- AnthropicStatus.com / status.openai.com / status.cloud.google.com — outage transparency.
- Berkeley Function-Calling Leaderboard — surfaces capability drift between model versions.
- MMLU, HumanEval, GSM8K, MT-Bench, ARC-AGI — standard benchmarks; surface but do not explain drift.
- LMSYS Chatbot Arena — surfaces user-preference drift.
- MTEB — embedding benchmark; useful for embedding-model bake-offs which have the same drift problem.
- HELM (Stanford) — holistic eval framework, surfaces some drift dimensions.
- OpenLLMetry — observability standard for LLM behavior tracking.
- Langfuse / LangSmith / Braintrust — eval platforms that catch drift, do not prevent it.
- Helicone — cost observability surfaces the reasoning-effort surprise.
- LiteLLM / OpenRouter — abstractions that smooth some portability cost but not all.
- Anthropic enterprise contracts — explicit capacity SLAs at enterprise tier.
- OpenAI Enterprise — capacity guarantees and longer deprecation windows.
- Vertex AI enterprise support — region-specific SLAs.
- AWS Bedrock provisioned throughput — moves capacity from black box to contract.
- The OpenAI Forum, Anthropic Discord, r/LocalLLaMA — early-warning communities for drift detection.
- arXiv papers on emergent capability — academic literature surfaces patterns but not your specific model.
- The semi-private "model behavior" channels — where AI engineers share drift observations across companies.
Pause and recall
- Why does same-name-different-version need a fresh eval?
- What four things does the supplier never disclose about the model serving your request?
- Why is judge-model bias not solvable by picking a stronger judge?
- What is the realistic multiplier on a prompt-migration estimate?
- Why are capacity SLAs absent at consumer tiers?
- How should emergent capability change the prompt and schema design?
- What makes reasoning-effort knobs expensive surprises?
Interview Q&A
Q1. Why is "we did a bake-off and picked the winner" not enough? A. Because the bake-off is conditional — on the eval set, on the judge, on the model version, on the prompt phrasing. Two months later, any of those four can shift and the verdict changes silently. Mature teams re-run the bake-off after every minor model version and after every significant prompt change. Trap: "Once we ran a clean bake-off, we are done." A bake-off is a snapshot, not a contract.
Q2. How do you defend against silent capability drift in a minor model update? A. Three layers. (1) Regression eval suite that runs on every model version change. (2) Output-shape monitoring in production (length, JSON validity, tool-call rate) with alerts on distribution shifts. (3) Strict schemas and tight tool descriptions so emergent capabilities cannot widen the agent's blast radius without your knowledge. Trap: "A stronger model fixes drift." A stronger model has more capability, which often widens the drift surface rather than shrinking it.
Q3. Your supplier announces a model EOL with 90 days' notice. What is your playbook? A. (1) Identify all surfaces using the model. (2) Pick a successor: same supplier next-gen, or the second supplier's equivalent tier. (3) Run a shadow eval for 4-6 weeks. (4) Retune prompts as needed. (5) Canary at 1% → 5% → 25% → 100% with eval gates. (6) Keep the old model warm until canary at 100% is stable for two weeks. If the timeline is tight, parallel-test on multiple successors so you do not commit to one too early. Trap: "Just point the SDK at the new model." Same tier does not mean same behavior, and a 90-day window is not enough to discover that the hard way.
Q4. How do you budget for a model migration? A. Estimate in engineer-weeks, then multiply by 2-3x. The variance comes from invisible prompt anchoring you discover only when the new model behaves differently. Never migrate two things at once — not model + prompt, not model + tool description. Single-variable changes only. Trap: "We will save money by combining migrations." You will save nothing because you will have no idea what regressed.
Q5. The reasoning-effort knob just doubled your bill. What happened, and how do you prevent it? A. Someone left reasoning at "high" globally, possibly via a config copy from another service, or a model update changed the default. Prevention: explicit reasoning-effort per task class, monitored as a metric, gated by per-task evals that show the marginal quality is worth the marginal cost. Treat reasoning effort like memory allocation in a database — bounded, observed, alerted. Trap: "We just set it to high to be safe." Safe for quality, expensive in dollars.
Q6. Capacity at the consumer tier dropped silently. Diagnose. A. (1) Confirm the symptom: increased p99 latency and occasional 503s, not 429s. (2) Check the supplier's status page for capacity incidents. (3) Compare with traffic on the second supplier; if it is healthy, the cause is supplier-side. (4) Short-term: divert non-critical traffic to the second supplier. (5) Long-term: escalate to enterprise sales for a capacity SLA, or buy provisioned throughput where available. Trap: "Our load is normal so it cannot be capacity." Capacity at the consumer tier is shared and opaque. Your load did not cause it. Your status as a non-enterprise customer means you absorb it.
Q7. What should a candidate say at the end of this module to sound mature? A. "Model selection is a recurring decision, not a one-time setup. Capability drifts, suppliers EOL, capacity is opaque, and emergent capabilities arrive unannounced. We design the kitchen to survive all of that: second supplier warm, eval suites that catch drift, tight schemas that contain emergent behavior, and a budget that assumes migrations will cost 2-3x the estimate." That is the answer to a senior interviewer's "what would you do differently next time" question.
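The canary progression from Q3 can be sketched as a loop that advances only while each stage passes its eval gate. This is a minimal illustration: `run_canary` and the `eval_gate` callable are hypothetical, and a real rollout would also wait for traffic to accumulate at each stage before judging it.

```python
from typing import Callable

STAGES = [1, 5, 25, 100]  # percent of traffic, as in the Q3 playbook


def run_canary(eval_gate: Callable[[int], bool]) -> int:
    """Advance through the stages; return the last percentage that passed its gate.

    Returns 0 if the first stage fails, meaning all traffic stays on the old model.
    """
    reached = 0
    for percent in STAGES:
        if not eval_gate(percent):
            break
        reached = percent
    return reached
```

A gate that fails at 25% leaves the rollout at 5%, which is why the old model stays warm: the return value is where traffic should sit while you retune.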
Apply now (5 min)
Step 1 — list your blind spots. For each placeholder — the cook, the supplier, the matching habit — write one thing you do not know about your current production setup. (Examples: "we do not know how Sonnet 4.6 differs from 4.5 on our eval set", "we have no second supplier warm.")
Step 2 — rank by blast radius. If each blind spot became a problem tomorrow, which one would hurt most?
Step 3 — pick one. Schedule the work to close the top blind spot this quarter. The other items go on the vendor-risk register for next quarter.
The discipline is not closing all the gaps. It is knowing where the gaps are and which one you are closing this quarter.
Bridge. Picking the cook is half of running the kitchen. The other half is keeping the recipe under control. The next module is about the most editable, most fragile, most under-managed surface in any AI system — the prompt. → ../13_prompt_lifecycle_operations/00-eli5.md