03. The bake-off methodology — how to actually compare cooks¶

~15 min read. A bake-off is not a vibes contest. It is a controlled experiment with a held-out eval set, identical prompts, a judge, and a significance test. Twenty hand-picked examples and a feeling do not pick a model — they pick a delusion. The matching habit demands proof.

Builds on 02-cost-latency-quality-triangle.md. The triangle tells you what to measure. The bake-off is how the measurement actually happens.

1) Hook — two teams pick a model, only one team is right¶

Two teams. Same problem — pick a model for an internal RAG product. Both teams ran "bake-offs." Watch the difference.

Team A. A senior engineer pulls 20 representative queries from the backlog, runs each against GPT-4o and Sonnet 4.6, scrolls through the outputs, decides Sonnet "feels better," and ships it. Total time — two hours.

Team B. A senior engineer pulls 200 queries stratified across five query types from a held-out eval set, runs each against GPT-4o and Sonnet 4.6 with identical prompts and temperature 0, uses GPT-5 as a pairwise judge, runs a McNemar's test on the paired wins, reports Sonnet wins 54-46 with p=0.42 — not significant, and either runs more examples or declares the two models tied for this workload.

Three months later, both teams are in production. Team A's chosen model silently underperforms on the ~30% of queries that were not in their 20-example sample. They eat customer complaints, blame the model, and re-do the bake-off. Team B's choice is grounded — they know what they chose and why, and when the next model release lands they can re-run the identical evaluation and compare apples to apples.

The naive bake-off is not a small mistake. It is the default mistake. Most production AI teams ship models based on vibes from a tiny sample. The fix is mechanical — golden eval set, identical conditions, statistical significance.

2) The metaphor — the actual taste test in the kitchen¶

The kitchen manager wants to know whether the new cook is better than the old one. The naive way — taste one dish from each, declare a winner. But the new cook may have caught a lucky dish. The dish you happened to taste may not be representative. Your tongue may be biased — you saw the new cook's name on the plate.

The real bake-off has rules. Same recipe given to both cooks. Same ingredients. Same kitchen station. Same number of dishes — twenty each, not two. The plates come out unlabelled, so the manager cannot bias. A judge (another kitchen, ideally) tastes pairs and votes. A statistician counts the votes and tells you whether the gap is real or coincidence.

That whole apparatus exists for one reason — your nose lies. Your sample lies. The averages of small samples lie. The bake-off is the only mechanism that produces a defensible answer.

3) The anatomy of a proper bake-off¶

┌─────────────────────────────────────────────────────────────────────┐
│ THE BAKE-OFF SETUP                                                  │
├─────────────────────────────────────────────────────────────────────┤
│ 1. Golden eval set         — 100-500 representative examples        │
│ 2. Stratified sampling     — cover every query type, not just easy  │
│ 3. Identical prompts       — same system, same instructions, same   │
│                              tool definitions across cooks          │
│ 4. Identical parameters    — temperature, top-p, max-tokens, seeds  │
│                              where supported                        │
│ 5. Multiple judges         — humans or LLMs, with calibrated rubric │
│ 6. Pairwise vs absolute    — pick one mode and stick to it          │
│ 7. Significance test       — McNemar for paired binary; bootstrap   │
│                              for continuous; effect size always     │
│ 8. Cost-aware verdict      — quality alone does not pick the winner │
└─────────────────────────────────────────────────────────────────────┘

Eight pieces. Drop any one and the bake-off becomes evidence-shaped vibes.

The eval set is the spine. Without a held-out, version-controlled, representative set of examples, every other piece is meaningless. The prompts are the second spine — different prompts to different models is not a fair comparison, it is a comparison of prompt+model jointly. The judge is the third spine — without a calibrated judge, you cannot turn outputs into scores.

4) The golden eval set — the most important asset you don't have¶

Most teams do not have a golden eval set. They have a Slack channel of "interesting examples" and a notebook of "this one broke." Neither is a bake-off input.

A real eval set has four properties.

Held-out. No example in the set has ever been used to tune a prompt or train a finetune. Once an example informs a decision, it is contaminated — the team learns to game it and the score becomes optimistic.

Stratified. The set covers every important query type in proportion to production traffic. If 60% of production queries are short-form Q&A, 60% of the eval set is short-form Q&A. If 5% are long-context summarization, 5% are long-context. A bake-off on a non-stratified set picks a model good at the wrong distribution.

Labelled. Each example has a ground-truth answer or a calibrated scoring rubric. Without labels, you are doing pairwise judging — fine, but that is a different methodology with different sample-size requirements.

Versioned. The set lives in source control. When you add or change examples, you bump a version and re-run baselines. Otherwise next quarter's "the new model is 8% better" is uninterpretable — you don't know if the model improved or the eval set drifted.

EVAL SET SIZE         CONFIDENCE
─────────────         ──────────
20 examples           Almost none. Detects only huge effects (>30% gap).
50 examples           Detects 15-20% gaps with p<0.05.
100 examples          Detects 10-12% gaps reliably.
200 examples          Detects 6-8% gaps. Most production decisions here.
500 examples          Detects 3-4% gaps. Frontier-vs-frontier comparison.
1000+ examples        For research-grade publication or critical workloads.

The size depends on how big a quality gap you are willing to act on. If moving from Sonnet to Opus needs a 5-point lift to justify the cost, you need an eval set big enough to detect 5 points reliably — 300-500 examples.

5) Identical conditions — the part everyone skips¶

Three engineers run "the same prompt" against three models and compare. Two months later, the project lead notices each engineer used a slightly different system prompt — one had "Be concise," another had "Be detailed," a third had nothing. The bake-off was meaningless.

This is not hypothetical. It is the most common reason bake-off results do not replicate.

The fix is mechanical. The bake-off uses one prompt template with provider-specific adaptations only where required by the API. The template lives in source control. The bake-off run is reproducible from the template plus the eval set plus the parameter file.

WHAT MUST BE IDENTICAL              WHAT MAY DIFFER

system + user prompt content        provider-specific message format
tool definitions (same JSON)        provider-specific tool wrappers
temperature, top-p, max-tokens      tokenizer (cost) — measure not match
random seed (where supported)       API endpoint and retry behavior
eval set, example order             timestamp, request ID
judge model and rubric              ─

Anthropic, OpenAI, Google, and others use different message formats and different tool-definition shapes. You translate the content into each format faithfully — same instructions, same constraints, same output schema. You do not "optimize for each model" — that is a different experiment.

6) The judge — LLM-as-judge is itself a model¶

You cannot bake-off without scoring outputs. For small eval sets with clean ground-truth answers, exact-match or string-similarity works. For generation tasks — drafting, summarization, Q&A with open-ended answers — you need a judge.

Two judge modes, both legitimate.

Pairwise judging. Show the judge both outputs side by side. Ask which is better. Vote winner, loser, or tie. Statistical analysis is McNemar on the paired wins.

Absolute scoring. Show the judge one output at a time. Ask for a score on a calibrated rubric (1-5, or per-criterion). Average across examples. Statistical analysis is paired t-test or bootstrap on the score differences.

                       PAIRWISE                 ABSOLUTE
                       ────────                 ────────
sensitivity            high (small gaps         lower (rubric noise
                       detectable)              floors out small gaps)

cost per example       1 judge call             2 judge calls (one per output)

bias risk              position bias            rubric drift across batches

best for               head-to-head selection   tracking absolute quality
                                                over time

Pairwise judging has a known position-bias problem — the judge tends to prefer the first output it sees. Mitigate by running each pair both ways (A then B, and B then A) and counting both votes.

The judge model itself matters. A weaker judge gives noisier scores. The research consensus in 2026 — use the strongest available model as the judge, and ideally a different family from the models being compared. If you bake-off Sonnet vs GPT-4o, use Opus or GPT-5 as the judge — and even better, use both and check agreement.

7) Statistical significance — the test that catches the vibes¶

Sonnet won 53 of 100 examples. Did Sonnet actually beat Haiku, or did it flip a slightly biased coin? The math says — not significant. A McNemar's test on a 53-47 split with n=100 gives p ≈ 0.54. Random.

Most engineers underestimate how much sample you need.

PAIRED COMPARISON, n=100, McNemar's test (continuity-corrected)
─────────────────────────────────────────────────────────────
result       p-value (rough)    significant at p<0.05?
─────       ───────────────    ─────────────────────
50-50       1.00                no
53-47       0.54                no
55-45       0.32                no
58-42       0.10                no
60-40       0.046               yes, barely
65-35       0.0035              yes
70-30       3 × 10⁻⁵            yes, decisively

At n=100 you need roughly a 60-40 split for significance. That is a 20-point gap. Below that, your "winner" is noise.

At n=500 you need roughly a 55-45 split. At n=1000, roughly a 53-47. That is the trade — larger eval sets detect smaller effects.

A pair of practical heuristics:

If your bake-off split is less than 60-40 at n=100, run more examples or declare the models tied.
If two models tie within significance, pick the cheaper one. This is the cost-aware verdict — the bake-off's job is not to find the "best" model, it is to find the cheapest model whose quality is not meaningfully different from the best.

8) Worked example — the bake-off that almost picked the wrong model¶

A document-summarization workload. The team bakes off Sonnet 4.6 vs Haiku 4.5 on a 200-example eval set, using Opus 4.7 as a pairwise judge.

                  PAIRED RESULTS (n=200, pairwise judge)
                  ──────────────────────────────────────
Sonnet wins                 112
Haiku wins                   78
ties                         10

McNemar's test on 112 vs 78:  χ² = 5.86, p = 0.0155
Effect: Sonnet wins ~59% of non-tied pairs.

Statistically significant. Sonnet wins. Done?

Not done. Cost-aware verdict next.

                  Sonnet 4.6      Haiku 4.5
                  ──────────      ─────────
$/call            $0.0042         $0.00040     (10.5x cheaper)
$/month (1M)      $4,200          $400         ($3,800 saved)

Sonnet win rate   59%             41%
                  (i.e., 18 percentage points)

The team has to decide — is 18 percentage points of quality worth $3,800 a month? That depends on what "wins" mean for the workload. If a Haiku loss means a slightly less polished summary that is still factually correct, the answer is probably no. If a Haiku loss means a hallucinated fact or a missed key point, the answer is probably yes.

The team digs deeper. They re-judge with a failure-type rubric — was the loss "minor stylistic" or "material content"? Result:

LOSS TYPE                Sonnet won      Haiku won
─────────                ──────────      ─────────
material content miss    32 of 112       6 of 78
factual error            18 of 112       2 of 78
stylistic preference     62 of 112       70 of 78

Now the picture is different. Most of Sonnet's wins are stylistic. Material content and factual error are roughly balanced. For this workload, Haiku is good enough. The team ships Haiku and saves $3,800/month on a workload where the quality difference is mostly aesthetic.

That is the lesson — the bake-off does not pick a model. The bake-off plus the cost-aware verdict plus the failure-type analysis pick a model.

Mid-content recall¶

Why is a 20-example bake-off almost always insufficient to pick a model?
What does McNemar's test actually do, and on what kind of data?
Why must the judge be different from the models being compared, when possible?

9) Failure modes — bake-offs that lie¶

SIGNAL                                FIX
──────                                ───
"we tried 10 examples"              → 100 minimum, 200+ for production
                                      decisions

eval set hand-picked from           → stratify by query type; sample from
 memorable cases                      production traffic distribution

prompt re-optimized per model       → one template, faithful translation
                                      only where API forces it

judge is same model as one         → use a stronger and different-family
 being tested                        judge; ideally two judges

no significance test reported      → McNemar for paired binary, bootstrap
                                     for continuous

ties or near-ties declare winner   → declare tied, pick the cheaper

eval set never updated             → version it; refresh on every major
                                     release; track baseline scores

bake-off ignores cost              → cost-aware verdict — quality alone
                                     does not pick the winner

eval contaminated with prompt-     → strict held-out discipline; the eval
 tuning examples                     set never informs prompt design

A/B in production used as           → A/B is for confirming a chosen model
 selection                           in production; bake-off picks the
                                     candidate

Most bake-off failures fall into two buckets — too small to be trustworthy, or too biased to be fair. Both are mechanical fixes once you notice them. The hardest one to fix is eval-set contamination — once examples leak into the prompt-tuning loop, the eval set is dead and you need to build a new one.

10) Offline bake-off vs production A/B — when to trust each¶

Two distinct experiments, often confused.

Offline bake-off. Held-out eval set, controlled conditions. Picks a candidate model with statistical rigor. Cheap, fast, repeatable. Cannot detect anything not in the eval set — distribution shifts, downstream effects, user behavior.

Production A/B. Live traffic split between two models. Measures real user outcomes — clicks, conversions, complaint rates, escalation rates. Slow, expensive, harder to interpret, but the ground truth on revenue impact. Cannot be done with all candidate models — you only A/B the one or two that won the offline bake-off.

The right flow is offline bake-off picks the candidate, production A/B confirms it in the wild. Teams that skip the bake-off and go straight to A/B waste production traffic on weak candidates. Teams that skip the A/B and ship from the bake-off get bitten by distribution shift.

OFFLINE BAKE-OFF                    PRODUCTION A/B
────────────────                    ──────────────
days, not months                    weeks, not days
n = 100-500                         n = thousands to millions of users
quality metric (judge)              business metric (revenue, retention)
all candidates compared             1-2 candidates
held-out distribution               actual distribution
cheap                               expensive (50% traffic on a worse model)
catches no distribution shift       catches everything the bake-off missed

Where this lives in the wild¶

Bake-off infrastructure is its own product category in 2026.

Anthropic Console — built-in side-by-side comparison and basic judge-based scoring.
OpenAI Playground — pairwise comparison view for model selection.
OpenAI Evals — open-source framework for batched evaluation with custom graders.
Braintrust — production-grade eval platform with built-in judges, significance tests, and historical baselines.
Langfuse / LangSmith — trace-based evals that wrap production traffic for offline replay.
Helicone — observability with experiment tracking across models.
Vellum, PromptLayer, Pezzo — prompt-management with side-by-side eval views.
Weights & Biases Prompts — eval tracking with statistical significance built in.
MLflow — experiment tracking applied to model evals.
Inspect AI — Anthropic's open-source eval framework with built-in significance tests and scorers.
LM Eval Harness — open standard for benchmarking models on academic tasks.
Lighthouse — Anthropic's internal eval framework, surfaced via Console.
MMLU, GSM8K, HumanEval, MTEB, BIG-bench, AGI-Eval — public benchmarks that demonstrate proper bake-off methodology at scale.
Berkeley Function-Calling Leaderboard — public function-calling bake-offs across models.
Chatbot Arena (lmsys.org) — public pairwise judge with Elo ratings; the canonical large-scale bake-off methodology.
AlpacaEval — LLM-as-judge evaluation framework with calibrated judges.
MT-Bench — multi-turn dialog evaluation with GPT-4 / Opus judge.
HELM (Stanford) — comprehensive cross-model evaluation with significance reporting.
SWE-Bench — agent-and-coding benchmarks with strict held-out splits.
AppWorld — agent-environment benchmarks with reproducible eval conditions.
OpenAI BrowseComp — agent benchmark with held-out evaluation.
Hugging Face Open LLM Leaderboard — community standard for open-weight model comparison.
Artificial Analysis — independent third-party benchmarks across providers.
Stanford CRFM HELM — academic-grade evals with full statistical reporting.

Pause and recall¶

Name the eight elements of a proper bake-off setup.
Roughly what eval-set size detects a 5-point quality gap at p<0.05?
Why is using the same model as judge and candidate a fatal mistake?
State the cost-aware verdict rule when two models tie within significance.
What is the difference between pairwise judging and absolute scoring, and when do you use each?
Why is a production A/B not a substitute for an offline bake-off, and vice versa?
In the worked summarization example, why did the team ship Haiku even though Sonnet "won" the bake-off?

Interview Q&A¶

Q1. Walk me through how you'd compare two models for a new feature. A. I would build a held-out eval set of 200-500 examples stratified by query type from production traffic. Run both models with identical prompts and parameters. Use a stronger model as a pairwise judge, running each pair in both orders to control position bias. Run McNemar's test on the paired wins. Report significance. If significant, also break down the wins by failure type — material vs stylistic — to understand what kind of quality you are buying. Then apply the cost-aware verdict — if the gap is not material at the business level, pick the cheaper model. Trap: Skipping straight to "I'd A/B in production." That wastes production traffic on unproven candidates.

Q2. Your bake-off shows model A wins 54-46 on 100 examples. What do you ship? A. Nothing yet. 54-46 at n=100 is p ≈ 0.42 by McNemar — not significant. The right move is either to expand the eval set to 300-500 examples to see if the gap holds, or to declare the models tied and ship the cheaper one. Shipping based on 54-46 at n=100 is shipping a coin flip. Common wrong answer to avoid: "Ship the winner, it won." Without significance, "winner" is meaningless.

Q3. How do you build a golden eval set from scratch? A. Four steps. First, sample 500-1000 examples from production traffic, stratified by intent or query type. Second, label each example with the ground-truth answer or a scoring rubric, ideally by domain experts. Third, hold out 80% as the eval set and reserve 20% as a "calibration set" for prompt iteration — only the eval set scores the final decision. Fourth, version-control the set in git, document the labelling protocol, and re-run baselines on every major model release. Trap: Building the eval set from the same examples the team uses to debug prompts. Contamination kills the eval set.

Q4. What is wrong with LLM-as-judge? A. Two problems. One — the judge has its own biases. Position bias (prefers the first or last option), self-preference bias (prefers outputs that sound like its own training distribution), length bias (prefers longer outputs). Mitigations — randomize position, use a different-family judge, score length-controlled. Two — the judge can be gamed by stylistic features unrelated to quality. Mitigation — calibrate the judge against human labels on a subset and ensure agreement. Trap: Treating the judge as ground truth. It is a model. It can be wrong.

Q5. When would you do absolute scoring instead of pairwise judging? A. When you need to track absolute quality over time across model generations. Pairwise tells you A vs B today; absolute tells you A's score today vs A's score six months ago. For longitudinal eval pipelines, absolute scoring with a calibrated rubric is the right tool. The trade-off is lower sensitivity to small differences and higher rubric-drift risk across batches. Trap: Mixing modes in the same bake-off. Pick one and stick to it.

Q6. How do you handle non-determinism in model outputs during a bake-off? A. Three layers. Set temperature to 0 where supported. Set seeds where the API supports them (OpenAI seeds, Anthropic determinism flags). For remaining variability, run each example 3-5 times and average — variance becomes part of the measurement, not noise. Report mean and standard deviation per model, not just the point estimate. Trap: Single-run bake-offs on temperature > 0. Half your "difference" is sampling noise.

Q7. What is the difference between an offline bake-off and a production A/B? A. The bake-off picks a candidate from a held-out eval set under controlled conditions — cheap, fast, statistically rigorous on the eval distribution. The A/B confirms the candidate in production traffic — slow, expensive, ground truth on business metrics. The right flow is bake-off first to pick one or two candidates, A/B second to confirm. Skipping bake-off wastes production traffic. Skipping A/B misses distribution shift. Trap: Picking one or the other as "the right way." Both, in sequence.

Q8. How often should you re-run the bake-off? A. On three triggers. One — a new model release in any candidate family. Two — a material change to the prompt or system architecture. Three — quarterly as a baseline check, even with no other changes, to catch silent quality drift from model updates. The eval set itself should be refreshed (new examples added, old contaminated ones removed) at least twice a year to keep it representative. Trap: Running the bake-off once at launch and never again. Models update, traffic distributions shift, and your selection goes stale.

Apply now (5 min)¶

Step 1 — find your eval set. If you have one, count its size and check when it was last updated. If you do not, list five query types from your product and aim for 40 examples per type as a starting target.

Step 2 — write the conditions. For your most-used model, write down the exact system prompt, user prompt template, temperature, and tool definitions. This is the artifact a bake-off compares against.

Step 3 — pick a candidate to challenge it. Name one model one tier down (or one tier up) and sketch the bake-off plan — eval set, judge, significance test, decision rule. If your sketch fits on a sticky note, you have a bake-off plan.

Bridge. You can measure cooks now. The next chapter takes the four cooks, the triangle, and the bake-off, and turns them into a concrete routing matrix — which tier wins which kind of ticket, and how to build a router that saves 60-80% of your model bill.

→ 04-task-to-tier-mapping.md