11. Evaluating Reasoning — Benchmarks lie, traces fool you, build your own golden set¶

~12 min read. Public leaderboards are saturated, contaminated, or both. Real reasoning eval needs multiple columns, private data, and outcome-based scoring.

Built on the ELI5 in 00-eli5.md. The backtrack, measurable as recovery-after-self-check, is one of several reasoning axes that single-number benchmarks completely miss.

What we are really trying to measure¶

A reasoning system should be judged on more than one column. We care about:

Final correctness on real tasks
Calibration — does confidence track accuracy?
Robustness — does the answer change under prompt perturbation?
Recovery — does the system fix its own early mistake?
Faithfulness — does the printed chain match the model's actual reasoning?
Cost and latency — measured, not assumed

test set
   ├── answer accuracy
   ├── recovery after verifier failure
   ├── citation / tool-grounding fidelity
   ├── calibration (expected vs actual accuracy)
   ├── robustness to perturbation
   ├── cost per correct answer
   └── latency at P50/P95

So what to do? Evaluate outcomes, consistency, recovery, and operational metrics together. That is the senior view.
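The calibration column (expected vs actual accuracy) can be sketched as a binned comparison. This is a minimal illustration, assuming each answer carries a confidence in [0, 1]; the function name `calibration_table` and the bin count are choices made for this example, not part of any library.

```python
from typing import List, Sequence, Tuple

Row = Tuple[float, float, int, float, float]


def calibration_table(
    confidences: Sequence[float],
    correct: Sequence[bool],
    n_bins: int = 5,
) -> Tuple[List[Row], float]:
    """Bin answers by stated confidence and compare with observed accuracy.

    Returns (rows, ece). Each row is
    (bin_low, bin_high, count, mean_confidence, accuracy).
    ece is the expected calibration error: the count-weighted mean of
    |mean_confidence - accuracy| over the non-empty bins.
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    total = len(confidences)
    if total == 0:
        return [], 0.0

    bins: List[List[Tuple[float, bool]]] = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # A confidence of exactly 1.0 goes into the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))

    rows: List[Row] = []
    ece = 0.0
    for i, items in enumerate(bins):
        if not items:
            continue
        mean_conf = sum(c for c, _ in items) / len(items)
        accuracy = sum(1 for _, ok in items if ok) / len(items)
        rows.append((i / n_bins, (i + 1) / n_bins, len(items), mean_conf, accuracy))
        ece += (len(items) / total) * abs(mean_conf - accuracy)
    return rows, ece
```

A well-calibrated system has rows where mean confidence and accuracy sit close together; a large gap in the high-confidence bin is usually the one worth investigating first.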

The 2026 benchmark landscape¶

What the field uses, and where each is leaking.

AIME 2024 / 2025 — math olympiad. Frontier models are saturating: GPT-5.2 = 100%, Grok 4 Heavy = 100%, Claude Sonnet 4.5 with Python = ~100%. AIME has effectively retired as a discriminator. Use for sanity-check only.

GPQA Diamond: graduate-level science questions (biology, physics, chemistry). Top scores ~93% (GPT-5.2 Pro). Approaching ceiling. Still useful for differentiating mid-tier reasoning models.

SWE-bench Verified: real GitHub issues + repo. Top: Claude Mythos Preview 93.9%, Opus 4.7 Adaptive 87.6%, GPT-5.5 88.7%. OpenAI stopped reporting SWE-bench Verified after an audit found verbatim gold-patch leakage in some training data. Use SWE-bench Pro as a replacement.

ARC-AGI v2 — abstract pattern reasoning. Not broken as of May 2026. GPT-5.5 leads at 85%; Gemini 3.1 Pro 77.1%; most reasoning models score < 30%. The benchmark to watch. ARC-AGI v1 was effectively cracked (GPT-5.2 Pro > 90%) via refinement loops, not single-shot. ARC-AGI v3 with interactive reasoning format launches early 2026.

HumanEval — code generation. Dead. Saturated, contaminated, no longer informative.

SWE-bench Pro / Aider Polyglot / LiveCodeBench / SciCode — current code-reasoning replacements. Use multiple; each has bias.

FrontierMath (Epoch). GPT-5.5 Pro = 52.4%. On Tier 4 (research-grade), the record is GPT-5 Pro at 13%. The benchmark was patched in January 2026 after two Tier 4 problems were found to be flawed. Still the hardest math benchmark in production use.

Humanity's Last Exam (HLE). Claude Mythos 64.7%, GPT-5.4 Pro 58.7%, Gemini 3.1 Pro 44.7%. Tests cross-domain expert-level reasoning. Strong year-over-year growth implies it'll saturate within a year.

Legal Agent Bench (Harvey AI, May 2026) — first specialised legal-reasoning benchmark.

The pattern: public leaderboards saturate every 12–18 months. You need private evals built from your real workload.

Why traces can fool us¶

A long rationale feels impressive. A long rationale can also be:

Fabricated post-hoc: the model settled on the answer internally first and then generated a plausible chain.
Performative: trained to produce reasoning-shaped tokens that humans approve of, even when the actual reasoning path differs.
Unfaithful: Anthropic's April 2025 paper "Reasoning Models Don't Always Say What They Think" showed that Claude 3.7 Sonnet mentioned a behavior-changing hint in its CoT only 25% of the time, and DeepSeek R1 only 39%. For unauthorized-access hints (the prompt told the model it had gained unauthorized access to the answer), the mention rates were Claude 41%, R1 19%.

So the chain you read may not be the causal path the model took. Trace eloquence is not evidence. Faithfulness is its own axis and requires its own tests:

Truncation tests: if you cut the chain at step k, does the final answer change? If not, the steps after k weren't causal.
Edit tests: change an intermediate step to something false; does the final answer change? If not, that step wasn't load-bearing.
Hint-mention tests: give the model a hint in the prompt; check whether the CoT mentions the hint when the answer changes.
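The three tests above can be sketched as small probes. This is an illustration under stated assumptions, not Anthropic's methodology: `answer_from_chain` and `run_model` are hypothetical callables you would wire to your own model client, where the first forces a final answer from a given (possibly truncated or edited) chain and the second returns the visible chain plus the answer.

```python
from typing import Callable, List, Optional, Sequence, Tuple

# Hypothetical interfaces (assumptions for this sketch):
#   answer_from_chain(question, steps) -> final answer given those reasoning steps
#   run_model(prompt) -> (visible chain steps, final answer)
AnswerFromChain = Callable[[str, Sequence[str]], str]
RunModel = Callable[[str], Tuple[List[str], str]]


def truncation_changes_answer(
    question: str, chain: Sequence[str], k: int, answer_from_chain: AnswerFromChain
) -> bool:
    """True if cutting the chain after step k changes the final answer."""
    full = answer_from_chain(question, chain)
    cut = answer_from_chain(question, chain[:k])
    return full != cut


def edit_changes_answer(
    question: str,
    chain: Sequence[str],
    step_index: int,
    false_step: str,
    answer_from_chain: AnswerFromChain,
) -> bool:
    """True if replacing one step with a false statement changes the answer."""
    edited = list(chain)
    edited[step_index] = false_step
    return answer_from_chain(question, chain) != answer_from_chain(question, edited)


def hint_mentioned(
    question: str, hint: str, hint_marker: str, run_model: RunModel
) -> Optional[bool]:
    """Check whether the visible chain mentions a hint that changed the answer.

    Returns None when the hint did not change the answer, since that case
    says nothing about faithfulness. hint_marker is a substring you expect
    an honest chain to contain (a crude proxy for a real mention check).
    """
    _, base_answer = run_model(question)
    chain, hinted_answer = run_model(f"{question}\n{hint}")
    if hinted_answer == base_answer:
        return None
    return any(hint_marker.lower() in step.lower() for step in chain)
```

A `False` from the first two probes suggests the truncated or edited steps were not load-bearing; a `False` from the third means the answer moved but the chain never said why.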

METR's August 2025 follow-up argued CoT is still useful for monitoring despite unfaithfulness — you can spot scheming patterns even when individual chains lie. That nuance matters for safety teams.

Worked example: multi-column report card on 100 problems¶

Run a system on 100 problems. Track:
  first_pass_correct         = 72
  final_correct_after_self_check = 82
  faithful_correct (chain aligned with evidence) = 60
  calibration_correct (confidence sensible) = 70
  passed_robustness_perturbation = 65

Metrics:
  first-pass accuracy        = 72%
  final accuracy             = 82%
  recovery gain              = 10 points  ← the backtrack
  faithful-correct rate      = 60%
  faithful among correct     = 60/82 ≈ 73%
  calibration                = 70%
  perturbation robustness    = 65%

One score (final accuracy 82%) hides the rest. Two systems with the same final accuracy can have wildly different faithfulness, robustness, and recovery profiles. That's why production eval is multi-column.
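The arithmetic behind the report card can be reproduced in a few lines. The counts come from the worked example; the extra `recovery_rate_pct` row is an addition for illustration and assumes self-check never breaks an answer that was already correct, so all 10 gained answers come from the 28 first-pass failures.

```python
# Counts from the worked example (100 problems).
N = 100
counts = {
    "first_pass_correct": 72,
    "final_correct_after_self_check": 82,
    "faithful_correct": 60,
    "calibration_correct": 70,
    "passed_robustness_perturbation": 65,
}


def pct(numerator: float, denominator: float) -> float:
    """Percentage of numerator over denominator."""
    return 100.0 * numerator / denominator


first = counts["first_pass_correct"]
final = counts["final_correct_after_self_check"]
faithful = counts["faithful_correct"]

report = {
    "first_pass_accuracy_pct": pct(first, N),
    "final_accuracy_pct": pct(final, N),
    # The backtrack: points gained between first pass and final answer.
    "recovery_gain_points": pct(final, N) - pct(first, N),
    # Assumption: no first-pass-correct answer was broken by self-check.
    "recovery_rate_pct": pct(final - first, N - first),
    "faithful_correct_rate_pct": pct(faithful, N),
    "faithful_among_correct_pct": pct(faithful, final),
    "calibration_pct": pct(counts["calibration_correct"], N),
    "perturbation_robustness_pct": pct(counts["passed_robustness_perturbation"], N),
}

for name, value in report.items():
    print(f"{name:32s} {value:6.1f}")
```

Under that assumption the recovery rate is 10/28, about 35.7%, which is a more comparable number across systems than the raw 10-point gain because it normalises by how many failures there were to recover from.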

How benchmarks get gamed (and how to defend)¶

Gaming pattern	Defense
Training data contains benchmark questions verbatim	Private golden set built post-model release
Models learn benchmark format (multiple choice, code style)	Mix formats; include open-ended tasks
Single number hides cost/latency	Report cost-per-correct and P95 alongside accuracy
Public datasets saturate fast	Refresh quarterly; rotate questions
Synthetic tasks lack real-world friction	Pull from production traces
Pretty rationales rewarded over correct answers	Score outputs, not chains
Model gamed via prompt style	Perturbation tests

Production rule: your eval set should contain real tasks from your production logs, labelled by humans for ground truth, refreshed quarterly, and not posted publicly.
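The perturbation-test defense in the table can be sketched as follows. The perturbations here are deliberately trivial (whitespace, a polite prefix) and `ask` is a hypothetical callable standing in for your model client; real suites would add paraphrases and option reordering.

```python
from typing import Callable, List, Sequence


def perturbations(prompt: str) -> List[str]:
    """Surface-level rewrites that should not change a correct answer."""
    return [
        prompt.strip() + "\n",                  # trailing newline
        "Please answer carefully. " + prompt,   # polite prefix
        prompt.replace("?", " ?"),              # spacing before punctuation
        " ".join(prompt.split()),               # collapsed whitespace
    ]


def robustness_rate(
    prompts: Sequence[str],
    ask: Callable[[str], str],
    normalise: Callable[[str], str] = str.strip,
) -> float:
    """Fraction of prompts whose answer is identical across all perturbations.

    `ask` is a hypothetical function mapping a prompt to the model's final
    answer. Exact string match after `normalise` is a simplification; open
    ended answers need a semantic comparison instead.
    """
    if not prompts:
        return 0.0
    stable = 0
    for prompt in prompts:
        baseline = normalise(ask(prompt))
        if all(normalise(ask(variant)) == baseline for variant in perturbations(prompt)):
            stable += 1
    return stable / len(prompts)
```

Running this on the golden set gives the "perturbation robustness" column directly; prompts that flip under a polite prefix are good candidates for the human-review queue.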

A production eval stack¶

What to log and aggregate per request.

from __future__ import annotations  # lets `str | None` parse on Python < 3.10

from dataclasses import dataclass


@dataclass
class ReasoningEvalRow:
    """One logged request; aggregate these rows into the weekly dashboard."""
    request_id: str
    model: str
    effort: str
    input_tokens: int
    reasoning_tokens: int
    output_tokens: int
    latency_ms: int
    cost_usd: float
    final_answer: str
    ground_truth: str | None      # from labelers or programmatic check
    schema_valid: bool             # passed structured-output schema
    verifier_score: float          # ORM or programmatic
    self_rating: int | None        # 1-5 from the model
    citations_valid: bool          # all cited sources verified
    judge_score: float | None      # from LLM-as-judge

Aggregate to a weekly dashboard:

Pass@1 on golden set, broken out by task class
Recovery rate (correct after self-check, given first-pass failure)
Cost per correct answer
P50 / P95 latency
Faithful-correct rate (sampled subset)
Calibration curve (expected-confidence vs actual-accuracy)
Refusal rate (model declined to answer)
Human-review queue depth

That dashboard tells you whether your reasoning stack is improving.
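A few of the dashboard lines can be computed as below. To stay self-contained the sketch uses a slimmed-down, hypothetical `LabelledRow` rather than the full eval row, and assumes correctness has already been decided per request; the nearest-rank percentile is one common definition among several.

```python
import math
from dataclasses import dataclass
from typing import Dict, List, Sequence


@dataclass
class LabelledRow:
    """Hypothetical per-request record after labelling."""
    first_pass_correct: bool
    final_correct: bool
    refused: bool
    latency_ms: int
    cost_usd: float


def percentile(values: Sequence[float], p: float) -> float:
    """Nearest-rank percentile: smallest value with at least p% of data at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def weekly_summary(rows: List[LabelledRow]) -> Dict[str, float]:
    """Aggregate labelled rows into headline dashboard numbers."""
    if not rows:
        raise ValueError("need at least one row")
    n = len(rows)
    n_correct = sum(r.final_correct for r in rows)
    first_failures = [r for r in rows if not r.first_pass_correct]
    recovered = sum(r.final_correct for r in first_failures)
    total_cost = sum(r.cost_usd for r in rows)
    latencies = [r.latency_ms for r in rows]
    return {
        "accuracy": n_correct / n,
        # Correct after self-check, given first-pass failure.
        "recovery_rate": recovered / len(first_failures) if first_failures else 0.0,
        # Total spend divided by correct answers, so failures still cost money.
        "cost_per_correct_usd": total_cost / n_correct if n_correct else math.inf,
        "latency_p50_ms": percentile(latencies, 50),
        "latency_p95_ms": percentile(latencies, 95),
        "refusal_rate": sum(r.refused for r in rows) / n,
    }
```

Dividing total cost by correct answers, rather than averaging cost over correct rows only, is a deliberate choice here: it charges the system for the requests it got wrong.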

LLM-as-judge: useful but calibrate it¶

LLM-as-judge means asking a model to grade another model's output. Cheap, scalable, works well if you validate the judge. Validation steps:

Human-anchored sample: have experts label 200 outputs, then measure judge agreement with those labels (target Cohen's kappa > 0.7).
Pairwise > pointwise: ask the judge to compare two outputs rather than rate one absolutely. Pairwise is better calibrated.
Strong judge: use a frontier model as judge (Opus 4.7, GPT-5.5), validated against human labels; weak judges introduce systematic bias.
Judge audits: periodically check whether the judge drifts (model upgrades change scoring).
Position-aware: judges have position and length biases (often preferring the first or the longer answer). Randomise the order.
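Two of these validation steps are easy to sketch: agreement with human labels via Cohen's kappa, and order randomisation for pairwise judging. The `judge` callable and its "first"/"second" return convention are assumptions for this example, not any vendor's API.

```python
import random
from collections import Counter
from typing import Callable, Sequence


def cohens_kappa(labels_a: Sequence[str], labels_b: Sequence[str]) -> float:
    """Chance-corrected agreement between two raters over the same items."""
    n = len(labels_a)
    if n == 0 or n != len(labels_b):
        raise ValueError("need two non-empty label lists of equal length")
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    # Agreement expected if both raters labelled independently at their own base rates.
    expected = sum(
        counts_a[label] * counts_b[label] for label in set(counts_a) | set(counts_b)
    ) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


def pairwise_winner(
    judge: Callable[[str, str, str], str],
    prompt: str,
    output_x: str,
    output_y: str,
    rng: random.Random,
) -> str:
    """Ask the judge with a random presentation order; return "x" or "y".

    `judge(prompt, first, second)` is a hypothetical callable returning
    "first" or "second". Randomising the order means a position-biased
    judge cannot systematically favour one system.
    """
    swapped = rng.random() < 0.5
    first, second = (output_y, output_x) if swapped else (output_x, output_y)
    picked_first = judge(prompt, first, second) == "first"
    if swapped:
        return "y" if picked_first else "x"
    return "x" if picked_first else "y"
```

With a judge that always answers "first", the wins split roughly evenly between the two systems, which is the signal that the judge is reading position rather than content.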

In production, LLM-as-judge often runs alongside programmatic verifiers (schema, compile, citation overlap) and human spot-checks. No single layer is enough.

Where this lives in the wild¶

OpenAI Evals-style frameworks — internal suites at OpenAI, Anthropic, Google measure custom workloads beyond public benchmarks; the public benchmark numbers are only a fraction of what shipped models are evaluated against.
GitHub Copilot quality pipelines — use compile and test pass rates as the primary signal; LLM-judge for code style; faithfulness checks for "did the agent actually run the test it claimed to run?"
Perplexity citation checks — every Deep Research answer's citations are programmatically verified against the source URL; faithful-correct rate, not just answer accuracy.
Harvey AI Legal Agent Bench (May 2026) — first specialised reasoning eval for legal workflows; tests multi-document analysis, citation accuracy, and rule application.
Anthropic faithfulness research — operationalised hint-mention and truncation tests; cited internally as alignment evidence, externally as a calibration tool for trusting CoT.

Pause and recall¶

Name three reasoning benchmarks that are saturated as of May 2026 and one that is not.
What is the faithful-mention rate from Anthropic's April 2025 paper for Claude 3.7, and what does it imply for trace inspection?
In the worked example, what was the recovery gain from self-check?
Why is pairwise LLM-as-judge better calibrated than pointwise?

Interview Q&A¶

Q: Your model scores 95% on AIME 2025. The product still produces wrong answers in production. How do you reconcile? A: Three likely gaps. Contamination — AIME problems may overlap training data; the 95% reflects memorisation, not reasoning. Confirm with held-out problems the model has never seen. Format brittleness — model excels at AIME's specific format (numeric answer); your product asks open-ended questions. Test on your real format. Distribution shift — AIME is competition math; your production may be applied finance or business reasoning, different distribution. Build a private eval from your actual production traces; the public benchmark is at best a noisy signal of capability and at worst a polished mirror of the training set.

Common wrong answer to avoid: "The benchmark is wrong" — usually the benchmark measures what it measures honestly; what's wrong is treating it as a proxy for your specific deployment. Build your own evals.

Q: How would you measure CoT faithfulness in production without re-implementing Anthropic's research? A: Three cheap proxies. Truncation test — for 5% of traffic, run the request twice: once normally, once with the CoT truncated at 50%. Track whether the final answer changes. High change rate = CoT was causally load-bearing. Low change rate = CoT may be post-hoc. Edit test — for sampled traces, programmatically replace one intermediate step with a known-false statement; check whether the final answer changes. Hint-mention test — for sampled requests, inject a hint into the prompt and check whether the visible CoT mentions it when the answer changes. None replaces formal faithfulness work, but the three together give you a directional signal at near-zero cost.

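Common wrong answer to avoid: "We trust the CoT because it looks coherent". In Anthropic's hint tests the CoT omitted the hint roughly 61-75% of the time (mention rates of 39% for DeepSeek R1 and 25% for Claude 3.7 Sonnet), however coherent it read. Trust requires tests, not aesthetics.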
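Selecting "5% of traffic" for the truncation test is simplest with a deterministic hash of the request ID, so the same request always falls in or out of the sample across retries and services. This is a sketch; the function name and salt are placeholders chosen for illustration.

```python
import hashlib


def in_faithfulness_sample(
    request_id: str,
    rate: float = 0.05,
    salt: str = "cot-faithfulness-v1",  # placeholder; change it to draw a fresh sample
) -> bool:
    """Deterministically select roughly `rate` of requests by hashing their ID."""
    digest = hashlib.sha256(f"{salt}:{request_id}".encode("utf-8")).digest()
    # Map the first 8 bytes to a float in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < rate


def truncate_chain(steps: list, fraction: float = 0.5) -> list:
    """Keep the first `fraction` of reasoning steps (rounded down)."""
    return steps[: int(len(steps) * fraction)]
```

Hash-based sampling avoids storing a sample list and keeps the sampled set stable for week-over-week comparison; rotating the salt gives a new, independent 5%.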

Q: A teammate proposes evaluating the reasoning agent only on final answer correctness. What's missing? A: Several axes the senior loop will ask about. Cost-per-correct-answer — two systems at 90% accuracy with 5× cost difference are not equivalent. Latency — 30 s vs 3 s changes whether you can deploy. Failure-mode distribution — same accuracy, different tail risk; one system might be 90% correct + 10% catastrophic wrong, another 90% + 10% mildly wrong. Calibration — does the system know when it's uncertain? Refusal quality — when the agent should not guess, does it refuse? Tool-grounding — for agent loops, are the tools actually called or hallucinated? Without these columns, "90% accurate" is half a story.

Common wrong answer to avoid: "Final answer is the only thing the user sees, so it's all that matters" — the user also sees latency, cost (via subscription tier or rate limit), and is affected by the tail of catastrophic errors. Single-metric eval is junior-tier.

Q: What's the difference between evaluating the model and evaluating the system? A: Model evals (HumanEval, AIME, MMLU) measure capability in isolation: same prompt, same model, deterministic format. System evals measure your deployed pipeline: router decisions, retrieval quality, tool calls, verifier scoring, fallbacks, observability gaps. Production failures are usually system failures (bad routing, missed retrieval, broken verifier) not model failures. A perfect model in a broken pipeline ships bugs. Always have both layers: model-level for choosing what to deploy, system-level for shipping. The system eval is your golden-set-of-traces from production with human labels.

Common wrong answer to avoid: "Eval the model; the system is just plumbing" — most senior eng failures in production come from plumbing. System eval is where the actual product quality lives.

Apply now (5 min)¶

Pull 50 recent production requests. Label each with: ground truth (from human labelers or programmatic check), final-answer-correct, first-pass-correct, schema-valid, citations-valid (if applicable), latency, cost. Compute six metrics: accuracy, recovery, schema rate, citation rate, P95 latency, cost per correct answer. That's your week-1 eval dashboard. Refresh weekly.

Sketch from memory: Draw the multi-column report card with at least six columns. Mark which column your current monitoring captures and which is missing.

Bridge. Eval tells us how the parts behave. Now we put them together into a real production reasoning system — pipeline, routing, verifier, fallbacks, observability. → 12-production-reasoning-systems.md