14. Honest admission — five things evals still cannot do for you

~18 min read. Evals are the discipline that turned this module's first chapter from a confession into a habit. This last chapter is the discipline's confession in return: a careful list of what the program still misses, and why a green eval is sometimes the most expensive thing a team can stare at.

Built on the ELI5 in 00-eli5.md. The inspection with its rubric, spot check, kitchen log, and shift change does enormous work — but the inspector is also a person with eyes, and the kitchen has corners the eyes do not reach.


What thirteen chapters bought you, and what they still do not

Chapter 01 measured the 38-point gap between a curated demo and a representative live sample. Chapter 02 sorted offline against online, single-turn against trace-level. Chapters 03 and 04 built golden sets and synthetic expansions. Chapter 05 picked the right metric for the right behavior. Chapters 06, 07, 08 made an LLM judge legible and calibrated. Chapter 09 watched for drift, chapter 10 ran A/Bs, chapter 11 logged the traces, chapter 12 alerted on collapsed slices. Chapter 13 closed the loop: every caught failure becomes a new eval row, every PR runs the suite, every model change earns or fails its promotion.

The reader who finishes chapter 13 is now genuinely dangerous in a good way. They will catch the failures that the team in chapter 01 missed. They will also start to believe, slowly and without noticing, that the failures evals catch are the failures that exist. That belief is the next category error this module has to dismantle. Some failure shapes are still invisible to the strongest eval program in production today. This chapter names five of them, in the order a team usually meets them.

What this file solves

The previous chapter said make evals the inner loop. This one says do not confuse the inner loop with the world. It walks the refund chatbot from chapter 01 forward into the embarrassing second life of a mature eval program — Goodharting, unprovable coverage, judge ceilings, multi-turn blindness, and causal confusion — and shows what to triangulate when the dashboard alone is no longer enough. By the end you should be able to tell a leadership room "the eval is green; here is what that does not yet mean," and mean it specifically.

Why a green dashboard sometimes ages into a lie

The trouble starts on the day the eval program works. Pass rate climbs from 62% (chapter 01) to 81% on the launch slice, the alert system from chapter 12 fires once a quarter, and chapter 13's promotion gate has rejected three regressions this month. A new release ships, the rubric score rises again — and CSAT, the customer-satisfaction signal nobody promoted into a gate, has been quietly drifting downward for two months. Nobody on the team is lying. The rubric is doing exactly what it was written to measure. The customer is measuring something else.

That is the felt symptom every team eventually meets. The rubric is a written approximation of acceptable; production is the real thing. When a team optimizes against an approximation hard enough and long enough, the gap between approximation and reality starts to drive the score in directions the user never asked for. The honest sentence is not "our evals are wrong." The honest sentence is "our evals are an instrument with finite resolution, and we have started to grind below the resolution."

The naive repair, the visible break, the diagnosis

The first response a smart team reaches for is "add more dimensions to the rubric." That works for one cycle, then loses. The team adds warmth, concision, anticipation — and after two months, the model has learned to write warm, concise, anticipatory answers that still miss what the user wanted. The rubric has more axes; the model has more axes to game.

The second response is "replace the rubric with a stronger judge." That helps a little — GPT-4-class judges agree with humans more often than smaller ones — and then plateaus. The judge still mis-grades negation about one time in eight, still smooths over quantitative precision errors, still cannot tell when an answer is implicitly out of policy because the policy context never reached its window.

Not a rubric-completeness problem. Not a judge-strength problem. A measurement-vs-reality problem — a metric is a projection of value onto a low-dimensional rubric, and projections always lose information. So the natural question becomes: "what do we do when we know the measurement is necessarily incomplete?" The answer is not to fix the measurement; it is to triangulate it against signals that come from outside the eval program — customer satisfaction, human-blind audits, on-call narrative, refund-volume telemetry — and to keep a public list of the failure shapes the eval honestly cannot see.

When the score and the satisfaction walk in different directions

Same refund chatbot from chapter 01. Same model family, same prompt scaffold. The team has now been running the eval program for two quarters. Here is the release trace across four months.

REFUND BOT — release-by-release, judge vs CSAT
                          │  Judge       │ CSAT         │ Refund
   Release   Date         │  rubric /5   │ /5 (n≈3000)  │ overturn %
   ──────────────────────────────────────────────────────────────
   R1   2026-01-12        │   3.8        │   4.21       │   6.4
   R2   2026-02-09        │   4.0        │   4.18       │   6.1
   R3   2026-02-28        │   4.2        │   4.06       │   6.8
   R4   2026-03-21        │   4.4        │   3.92       │   7.5
   R5   2026-04-15        │   4.5        │   3.74       │   8.9
   R6   2026-05-08        │   4.6        │   3.61       │   9.7

   Pearson correlation, judge vs CSAT, across R1..R6: -0.95

The numbers are stylised; the shape is not. A six-release trace with judge rising monotonically from 3.8 to 4.6 and CSAT falling from 4.21 to 3.61 is the textbook Goodhart signature, and the refund-overturn percentage moving from 6.4 to 9.7 confirms it is not a labelling artefact — customers are actually getting refunds the bot refused. The rubric got better at rewarding what the rubric measures. The product got worse at what the customer pays for.
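The correlation line in the trace can be recomputed from the table's own two columns. The sketch below is a minimal, dependency-free version; the helper name `pearson` is an illustrative choice, not something the chapter defines. On the six rows as printed it comes out strongly negative.

```python
from math import sqrt

# Columns copied from the release table above (R1..R6).
judge = [3.8, 4.0, 4.2, 4.4, 4.5, 4.6]
csat = [4.21, 4.18, 4.06, 3.92, 3.74, 3.61]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


r = pearson(judge, csat)
print(f"judge vs CSAT, R1..R6: r = {r:.2f}")
```

With only six releases the coefficient is a coarse instrument; it is the sign and the trend that carry the message, not the second decimal.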

The qualitative read of R6 conversations explained the divergence in one sentence: the model had learned to quote policy verbatim, hedge with disclaimers, and route to a human less often, because the rubric rewarded policy-grounded, hedge-safe, and self-contained answers. Those three traits were proxies for acceptable in chapter 07's rubric. They turned out to be poor proxies for useful under R6's prompt updates.

       judge rubric score                       lived CSAT
   5.0 ┤                                  5.0 ┤
   4.5 ┤                       ●●         4.5 ┤●●
   4.0 ┤            ●●●●                  4.0 ┤   ●●●
   3.5 ┤●●●                                3.5 ┤        ●●●●
   3.0 ┤                                  3.0 ┤
       └────────────────────────              └────────────────────────
        R1   R2   R3   R4   R5   R6           R1   R2   R3   R4   R5   R6
                                                            Goodhart curve

Teacher voice. When the metric goes up and the customer goes down across three or more releases, the rubric has stopped tracking value. The fix is not a bigger judge; it is a humbler dashboard with two independent signals next to each other.

This is the inspection showing you exactly what the inspection is for — and exactly where it ends. The same eval that caught the 38-point gap in chapter 01 can become a 6-release lie when nobody refreshes the rubric against ground-truth user value. That is gap one of five.

The rule: every quality claim is a projection, and projections lose information

State the load-bearing truth plainly: a rubric score is the projection of real user value onto the dimensions a human team thought to write down. Anything the team did not write down is invisible to the rubric. Anything the team wrote down in the wrong direction is mis-rewarded. The program in chapter 13 makes the rubric repeatable, auditable, and fast — but does not make the projection lossless. No eval program ever will.

The discipline that survives this fact is not to abandon evals (the alternative is chapter 01 again). It is to refuse to let any single signal — the rubric, the judge, the CSAT, the complaint queue — be the deciding signal alone. Triangulate, name the blind spots, refresh, and keep at least one human in the loop on the slices that hurt most when they fail.

Teacher voice. Treat the eval like a thermometer. A thermometer is a real measurement. A patient is more than a temperature. A team that reads only the thermometer eventually misses the cough, the pallor, and the chest pain.


1) Goodhart's law in evals — when the score rises but value falls

The refund-bot trace above is the canonical instance. The pattern recurs whenever a single metric becomes the optimization target. The deeper diagnosis is structural: any rubric you can write in a week is simpler than the user value it approximates, and the optimizer (whether SFT, RLHF, or prompt tuning) finds the cheapest path to the rubric, not the cheapest path to the user.

What changes is how the path diverges. In the trace above, it was hedge inflation: every answer ended with two policy paragraphs that the rubric scored as grounded and the user scored as bureaucratic. In a hypothetical R7 it might be tone flattening, where a brand-voice dimension keeps every answer in such tight stylistic bounds that warmth disappears. The cure is to keep refreshing the rubric against fresh qualitative review, and to keep one independent signal (usually CSAT, sometimes refund-overturn rate, sometimes a small weekly human audit) pinned next to the rubric on the same dashboard. The rubric alone is a thermometer. The rubric next to CSAT is a clinic.

The deepest signature of Goodhart is not the falling CSAT. It is the rubric score no longer correlating with CSAT across releases. Chapter 01's load-bearing rule returns under harder pressure: a quality claim covers only the sample the measurement actually sampled — and a rubric is itself a sample of value, drawn from the team's imagination.

2) Coverage is unprovable — you cannot prove your eval saw every failure

Chapter 03 built a golden set. Chapter 04 expanded it synthetically. Chapter 13 promoted every caught failure into the suite. After eighteen months, the suite has 3,400 rows, sliced across forty-seven categories, and the team is genuinely proud. The honest engineer asks the impossible question: "how do we know we cover the actual distribution of production failures?" The honest answer is you don't, and you can't.

Coverage is unprovable for a deep reason: production traffic is non-stationary, adversarial, and long-tailed. Tomorrow's traffic mix is shaped by user behavior, vendor changes, model swaps, regulatory news, weather, and accidents. A query category that did not exist last month — "can I get a refund because my flight was diverted due to a wildfire?" — appears with non-zero probability tomorrow, and your suite contains zero rows for it. You can extend the suite afterward; you cannot enumerate it before.

Mini-FAQ. "Can we just generate more synthetic rows?" You can broaden. You cannot close. Synthetic rows are drawn from the team's model of the distribution — and the failure that hurts you is usually a shape the team did not yet model.

The honest reaction is to keep a "novelty hunt" running: a weekly review of the most recent 50–100 production conversations sampled outside the existing suite categories, scored cold by a human, with anything surprising added to the suite. It is the spot check redirected at the suite's edge instead of the suite's center. It does not prove coverage. It buys recency.
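One way the weekly novelty hunt could be drawn is sketched below. It assumes each logged conversation already carries a category label, which is itself a simplification; the field names, category names, and the `novelty_sample` helper are all hypothetical.

```python
import random


def novelty_sample(conversations, suite_categories, k=50, seed=0):
    """Pick up to k conversations whose category the suite does not cover."""
    outside = [c for c in conversations if c["category"] not in suite_categories]
    rng = random.Random(seed)  # fixed seed so the weekly draw is reproducible
    return rng.sample(outside, min(k, len(outside)))


# Hypothetical production log: every fourth conversation is in an unseen category.
conversations = [
    {"id": i, "category": "wildfire_diversion" if i % 4 == 0 else "late_delivery"}
    for i in range(400)
]
suite_categories = {"late_delivery", "damaged_item"}

batch = novelty_sample(conversations, suite_categories, k=50)
print(len(batch), {c["category"] for c in batch})
```

The sample only narrows what the human reads cold. Whether a row is surprising enough to join the suite remains the reviewer's call.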

3) The judge ceiling — even calibrated LLM judges mis-grade ~10–20% of cases

Chapter 06 introduced LLM-as-judge. Chapter 08 calibrated it against humans until weighted Cohen's kappa hit 0.74 on the rubric. That number sounds high. It also sits near a ceiling: across three model generations, agreement has not meaningfully improved on three specific failure modes.

The first is negation. "The user is not eligible for a refund unless the package was opened." Judges mis-grade about 12–18% of negation-heavy rubric items in published evaluations of GPT-4-class judges, depending on rubric framing. A judge will quietly call an answer policy-correct when the model dropped a not, because the surrounding text is fluent and policy-shaped.

The second is quantitative precision. "Refund within 30 days of delivery." When the answer says "within a month or so", a strict rubric should mark it wrong; LLM judges mark it acceptable about 1 in 5 times, because the answer "feels close enough." This is the failure that hurts in regulated jurisdictions, where a month or so is a compliance violation.

The third is implicit context. The conversation has eleven prior turns; the model answers turn twelve in a way that contradicts turn three. A trace-level judge with the whole context might catch it; a single-turn rubric judge almost never will, because the rubric was written for turn twelve in isolation.

The ceiling is honest about itself: human-vs-LLM agreement is a real wall, not a tuning problem. A team that promotes an LLM judge to a release gate without periodic human spot-checks on negation, numbers, and multi-turn implication is shipping a 10–20% blind spot as a hard line.
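A periodic spot check on the three failure modes can be as small as a per-slice disagreement rate. The sketch below assumes a human has already labelled a handful of rows alongside the judge; the rows and slice names are made up for illustration and are not measurements.

```python
from collections import defaultdict

# Hypothetical spot-check rows: (slice, human_label, judge_label).
spot_check = [
    ("negation", "fail", "pass"),
    ("negation", "pass", "pass"),
    ("numeric_precision", "fail", "pass"),
    ("numeric_precision", "fail", "fail"),
    ("implicit_context", "fail", "pass"),
    ("implicit_context", "pass", "pass"),
    ("implicit_context", "pass", "pass"),
    ("implicit_context", "pass", "pass"),
]


def disagreement_by_slice(rows):
    """Fraction of rows per slice where the judge label differs from the human label."""
    totals, misses = defaultdict(int), defaultdict(int)
    for slice_name, human, judge in rows:
        totals[slice_name] += 1
        misses[slice_name] += human != judge  # True counts as 1
    return {name: misses[name] / totals[name] for name in totals}


rates = disagreement_by_slice(spot_check)
for name, rate in rates.items():
    print(f"{name}: {rate:.0%} judge-vs-human disagreement")
```

Reporting the rate per slice rather than in aggregate is the point: a healthy overall agreement number can hide a slice where the judge is wrong far more often.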

4) Multi-turn and agentic evals are an open frontier

Chapter 02 distinguished single-turn from trace-level evals. The honest admission is that single-turn rubrics dominate practice because trace-level rubrics are hard. A multi-turn conversation has a state space too large to enumerate; an agent with tools has a graph of intermediate calls that the rubric author has not seen. Public benchmarks for agent traces — τ-bench, AgentBench, SWE-bench-Live — are early-generation, with high variance and unclear ceiling behavior.

What an honest program does here is downscope the question. Instead of "is this multi-turn conversation acceptable?" (mostly unanswerable in a single rubric), it asks "did the agent reach a terminal state we recognize as acceptable?" and "did any intermediate step violate a stated invariant?" The first is outcome-based and brittle; the second is trace-level and partial. Both leave a gap. The gap is where module 25 lives — when the eval cannot tell you what went wrong inside a multi-step agent run, you have to reach for trace debugging instead, and the eval program's job is to flag, not to diagnose.

Mini-FAQ. "So should we even bother with agent evals?" Yes, but downscope. Score the terminal outcome; score named invariants on the trace; accept that the in-between is currently inspected, not measured.
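The downscoped pair of questions (terminal outcome plus named invariants) might look like the sketch below. The tool names, the terminal states, and the single invariant are hypothetical stand-ins for a refund agent, not part of any real framework.

```python
from dataclasses import dataclass


@dataclass
class Step:
    tool: str
    args: dict


# Hypothetical terminal states the team recognizes as acceptable.
ACCEPTABLE_TERMINALS = {"refund_issued", "escalated_to_human"}


def no_refund_before_lookup(trace):
    """Invariant: an order lookup must precede any refund call."""
    seen_lookup = False
    for step in trace:
        if step.tool == "lookup_order":
            seen_lookup = True
        if step.tool == "issue_refund" and not seen_lookup:
            return False
    return True


def score_trace(trace, terminal_state, invariants):
    """Score the outcome and the named invariants; the in-between stays unscored."""
    violated = [inv.__name__ for inv in invariants if not inv(trace)]
    return {
        "terminal_ok": terminal_state in ACCEPTABLE_TERMINALS,
        "violated_invariants": violated,
    }


trace = [Step("issue_refund", {"order": "A1"}), Step("lookup_order", {"order": "A1"})]
result = score_trace(trace, "refund_issued", [no_refund_before_lookup])
print(result)
```

In this example the outcome check passes while the invariant check fails, which is the gap the section describes: the terminal state alone would have graded the run as acceptable.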

5) Causal attribution is hard — evals tell you correlation, not cause

CSAT drops 0.3 points in a week. The dashboard shows three things changed: a prompt edit on Tuesday, a model version bump on Wednesday, a retrieval index refresh on Thursday. The eval suite — even chapter 13's tightest version — tells you which of those changes correlated with which sliced score movement. It does not tell you which one caused the CSAT drop.

The honest tool here is not a better eval; it is a careful experimental loop — revert one change at a time, hold the suite steady, watch the signal. That loop is slow, and product teams under launch pressure skip it. The result is post-hoc story-telling: the team picks the most plausible cause and writes it up as a finding. Sometimes the story is right; sometimes the system has three independent regressions stacked, and rolling back one improves nothing because the dominant cause is a different one. The eval program will not save you from this. Only disciplined ablation will, and the eval suite's role is to make each ablation cheap to verify, not to identify which ablation to run.
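The revert-one-at-a-time loop can be sketched as follows. The `measure` function is a stand-in with invented, purely additive effects. In practice it would redeploy the configuration, rerun the pinned suite, and read the independent signal, and stacked regressions may interact in ways this toy does not model.

```python
# Hypothetical change set taken from the week described above.
CHANGES = ("prompt_edit", "model_bump", "index_refresh")


def measure(active_changes):
    """Stand-in signal: assumed per-change effects, for illustration only."""
    effects = {"prompt_edit": -0.02, "model_bump": -0.25, "index_refresh": -0.03}
    return round(4.0 + sum(effects[c] for c in active_changes), 2)


def ablate_one_at_a_time(changes, measure_fn):
    """Revert each change alone, keeping the others, and report the recovery."""
    with_all = measure_fn(set(changes))
    recovery = {}
    for reverted in changes:
        kept = set(changes) - {reverted}
        recovery[reverted] = round(measure_fn(kept) - with_all, 2)
    return recovery


recovery = ablate_one_at_a_time(CHANGES, measure)
print(recovery)
```

The loop does not choose which ablation to run. It only makes each one cheap to compare once the team has decided to run it.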

This is also where the industry-vs-textbook contradictions get loudest. Textbooks recommend BLEU and ROUGE because they are reproducible across papers. Production teams find BLEU near-useless on open generation: "the user's order shipped" and "your order has been shipped" share almost no n-grams, so either one scores poorly when graded against the other as a reference, yet they carry identical user value. Textbooks recommend large public benchmarks (MMLU, GSM8K, HumanEval) as quality proxies. Production teams find those benchmarks weakly correlated with in-domain performance after the first sigmoid bend, which is why every serious team rebuilds an in-domain eval the day the public benchmark stops moving the right things.
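The BLEU complaint can be made concrete with clipped n-gram precision, the building block BLEU is assembled from. The sketch below is deliberately not full BLEU (no brevity penalty, no geometric mean over n), and the whitespace tokenization is a simplifying assumption.

```python
from collections import Counter


def ngram_precision(candidate, reference, n):
    """Clipped n-gram precision of candidate against reference (not full BLEU)."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    cand_ngrams = Counter(tuple(cand[i:i + n]) for i in range(len(cand) - n + 1))
    ref_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
    overlap = sum(min(count, ref_ngrams[g]) for g, count in cand_ngrams.items())
    total = sum(cand_ngrams.values())
    return overlap / total if total else 0.0


reference = "the user's order shipped"
candidate = "your order has been shipped"

p1 = ngram_precision(candidate, reference, 1)
p2 = ngram_precision(candidate, reference, 2)
print(f"unigram precision: {p1:.2f}, bigram precision: {p2:.2f}")
```

Two of the five candidate words appear in the reference and no bigram does, so any score built on these precisions lands near the floor for a pair of sentences a customer would read as the same answer.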


Score-only gating vs eval+CSAT triangulation vs human-blind audits

Three release-gate strategies, three different risk profiles. They are not ranked; they fit different workloads.

Score-only release gates

A single eval score, possibly composite, must exceed a threshold to promote. Cheap, fast, and the natural state of chapter 13 if nobody pushes back.

Fits: low-stakes internal tools, prototype features, A/B candidates where the loser will be rolled back fast.

Pathology: Goodhart over months. The refund-bot trace above is exactly what happens when this gate runs unchallenged for two quarters.

Eval + independent business signal triangulation

The score gate stays, and a second independent signal — CSAT, refund-overturn rate, human spot-check pass rate on a small weekly sample — must also be acceptable. Either signal failing blocks promotion.

Fits: customer-facing production at moderate stakes — support bots, copilots, recommendation surfaces. This is the equilibrium most mature teams settle into.

Pathology: alert fatigue when the two signals oscillate independently for boring reasons. The team has to learn which divergences matter and which are noise.
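A triangulated gate reduces to a small conjunction. The sketch below assumes two signals on a 5-point scale and illustrative thresholds of 4.0 each; the names `GateThresholds` and `can_promote` are hypothetical, and real teams would set their own bars and likely add a third signal.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GateThresholds:
    # Hypothetical thresholds; each team sets its own.
    min_rubric: float = 4.0
    min_csat: float = 4.0


def can_promote(rubric_score, csat, thresholds=GateThresholds()):
    """Both signals must clear their bar; either one failing blocks promotion."""
    blockers = []
    if rubric_score < thresholds.min_rubric:
        blockers.append("rubric")
    if csat < thresholds.min_csat:
        blockers.append("csat")
    return (not blockers, blockers)


# R6 from the refund-bot trace: rubric 4.6, CSAT 3.61.
print(can_promote(4.6, 3.61))  # (False, ['csat'])
```

Returning the list of blockers, not just a boolean, gives the team something to read when the two signals disagree for boring reasons.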

Human-blind audits at the high-stakes slice

A rotating panel of human reviewers grades a small (n=30–50 per week) representative sample blind to model version and to each other. Their pass rate is treated as ground truth on the slices where the cost of being wrong is irrecoverable.

Fits: medical, legal, financial, life-safety, regulated. Anywhere the cost of a confident wrong answer is bigger than the cost of slower release cadence.

Pathology: expensive, slow, and politically hard. The team can rarely afford it on every slice; the discipline is to identify which slices earn it. For the refund bot, the EU jurisdiction slice from chapter 01's Apply Now would be the candidate.

A team running all three on the same release is paying a real cost. A team running only score-only on a high-stakes slice is paying a much larger cost that nobody has yet billed. The honest comparison is not which strategy is best; it is which strategy matches the worst tolerable failure of the slice it gates.

Operational signals — when the eval program itself is going stale

The eval program has its own health, and the honest team watches it the way the chapter 12 dashboard watches the product.

The healthy signature: rubric refresh dates within the last quarter, judge-vs-human agreement re-measured at every model swap, CSAT/rubric correlation still positive over the trailing four releases, novelty-hunt cadence weekly, suite size growing month over month. None of those numbers alone is the program's health; together they say the inspector is still inspecting.

The first thing to degrade is the eval-CSAT correlation. When the correlation across the trailing four releases starts trending toward zero, the rubric has stopped tracking value. By the time the correlation goes negative (the Goodhart curve above), the team has been shipping the divergence for a full release cycle, and the visible CSAT damage has not yet bottomed out. Catch this at near-zero, not at firmly-negative.
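A trailing-window check on that correlation could be sketched as below. The window length is a parameter, and the 0.2 warning threshold is an assumption chosen for illustration, not a recommended value.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return cov / sqrt(sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys))


def staleness_status(rubric, csat, window=4, warn_below=0.2):
    """Classify the trailing rubric-vs-CSAT correlation (thresholds are assumptions)."""
    r = pearson(rubric[-window:], csat[-window:])
    if r < 0:
        return r, "negative: rubric and CSAT are diverging"
    if r < warn_below:
        return r, "near zero: refresh the rubric now"
    return r, "healthy"


# Columns from the refund-bot trace, R1..R6.
judge = [3.8, 4.0, 4.2, 4.4, 4.5, 4.6]
csat = [4.21, 4.18, 4.06, 3.92, 3.74, 3.61]

r, status = staleness_status(judge, csat)
print(f"trailing r = {r:.2f} -> {status}")
```

With so few points per window the coefficient is noisy, so a check like this is better read as a prompt to go read transcripts than as an automatic block.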

The next signal is judge-vs-human agreement decay at model swaps. A judge calibrated on GPT-4-class behavior in 2025 may quietly disagree with humans on GPT-5-class outputs in 2026 because the failure shapes have shifted. The wrong response is "the judge is fine, agreement was 0.74 last year." The right response is to re-measure on a fresh 100 rows after every meaningful model change.
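Re-measuring agreement after a model swap means recomputing the same statistic on fresh rows. The sketch below implements linear-weighted Cohen's kappa by hand for a 1 to 5 rubric. The chapter does not say which weighting scheme chapter 08 used, so linear weights are an assumption, as are the ten sample rows and the 0.05 tolerance.

```python
from collections import Counter


def weighted_kappa(human, judge, categories):
    """Linear-weighted Cohen's kappa for two raters on an ordinal scale.

    Undefined (division by zero) if both raters use a single category.
    """
    index = {c: i for i, c in enumerate(categories)}
    n, k = len(human), len(categories)
    observed = Counter(zip(human, judge))
    human_marg, judge_marg = Counter(human), Counter(judge)
    disagree_obs = disagree_exp = 0.0
    for a in categories:
        for b in categories:
            weight = abs(index[a] - index[b]) / (k - 1)  # 0 on the diagonal
            disagree_obs += weight * observed[(a, b)] / n
            disagree_exp += weight * human_marg[a] * judge_marg[b] / (n * n)
    return 1.0 - disagree_obs / disagree_exp


# Hypothetical fresh sample scored by a human and by the judge.
human = [5, 4, 4, 3, 2, 5, 1, 3, 4, 2]
judge = [5, 4, 5, 3, 3, 5, 2, 3, 4, 2]

BASELINE_KAPPA = 0.74  # last calibration, from the chapter
TOLERANCE = 0.05       # assumed slack before recalibrating

kappa = weighted_kappa(human, judge, categories=[1, 2, 3, 4, 5])
needs_recalibration = kappa < BASELINE_KAPPA - TOLERANCE
print(f"kappa = {kappa:.2f}, recalibrate: {needs_recalibration}")
```

Ten rows are far too few for a stable estimate; the sample here only keeps the arithmetic readable, and the 100-row figure in the text is the more realistic size.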

The third is coverage staleness. The suite has not gained a category in two months; the novelty-hunt sample has not added a row in three weeks. Either the team has genuinely caught up with the long tail (rare) or the novelty hunt has been deprioritised (common). The misleading metric a beginner watches is suite size. The graph an expert opens first is the number of new failure shapes added to the suite per week, as a rolling 8-week trend.
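The expert's graph is a simple bucketed count. The sketch below assumes the suite records the date each new failure shape was added; the dates, the seven-day bucketing, and the three-week staleness rule are illustrative choices.

```python
from datetime import date


def weekly_new_shapes(added_dates, as_of, weeks=8):
    """New failure shapes per trailing 7-day bucket, oldest bucket first."""
    counts = [0] * weeks
    for d in added_dates:
        age_days = (as_of - d).days
        if 0 <= age_days < 7 * weeks:
            counts[weeks - 1 - age_days // 7] += 1
    return counts


# Hypothetical dates on which a new failure shape entered the suite.
added = [date(2026, 3, 16), date(2026, 3, 20), date(2026, 3, 30), date(2026, 4, 10)]

counts = weekly_new_shapes(added, as_of=date(2026, 5, 8))
stale = sum(counts[-3:]) == 0  # nothing new in the last three weeks
print(counts, "stale:", stale)
```

Suite size would still be growing on a cumulative chart here, which is why the per-week series is the more honest view.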

The fourth is judge calls dominating human reviews. When 100% of release reviews lean on judge scores and 0% involve a person reading raw conversations, the program has slid back toward chapter 01's blindness — just dressed in a dashboard.

Boundary — when "we have evals" is enough, and when it is dangerous

The eval program is enough when three things hold: the cost of a single confident wrong answer is bounded (refunds, apologies, retraining are all available); the user base tolerates occasional bad outputs without irreversible harm; and the team can roll back changes faster than damage accumulates. Internal tools, B2B copilots, search ranking, content recommendation — all sit comfortably in this band.

The eval program is dangerous as a primary signal when the cost of a confident wrong answer is irrecoverable, when the slice that fails most is the slice that hurts most, or when adversarial inputs are part of the live distribution. Mata v. Avianca taught the legal profession that a confident wrong citation is irrecoverable; Air Canada taught the airline industry that a confident wrong policy quote is binding in tribunal. In those workloads, "we have evals" is the start of the conversation, not the end of it.

The scale limit is also worth naming. At one million conversations per day, a 99% eval pass rate still produces ten thousand failures a day. The eval program catches the shape of those failures; it does not staff the response queue, write the post-mortems, or absorb the brand damage. Past a certain scale, evals are necessary infrastructure but insufficient operational risk control.


Wrong model: a green eval is a green light

The seductive sentence is "the eval is green, ship it." It is wrong in five specific ways the chapter has now named.

First, the rubric is a projection of value, not value itself. Goodhart is a feature of any sustained optimization against a single metric. Second, your suite cannot prove coverage of failures it has not yet seen. Third, your judge has a 10–20% ceiling on negation, precision, and implicit context. Fourth, your single-turn rubric is silent on what your multi-step agent did between turn one and turn five. Fifth, your eval shows you which signals moved together this week, not which change caused what.

Replace the wrong model with: a green eval is a license to look harder, not a license to stop looking. The dashboard says the failures you know how to catch did not happen this week. It does not say no failures happened. The kitchen log still has corners the inspector did not enter. The shift change still produced novel surface area the rubric was not written for.

Teacher voice. Treat green as quiet. Quiet is good. Quiet is not the same as safe. The most dangerous Friday of a launch is the one where the dashboard was green on Thursday and nobody read a transcript.

Six recurring failure shapes evals still miss

  • Rubric inflation. Every team that adds a rubric dimension to chase a complaint adds one more axis for the optimizer to find a cheap path on. After four dimensions, the rubric is harder to game on any one axis and easier to game across all of them.
  • Stale suite passing while novelty fails. The 3,400-row suite is at 92%; the novelty-hunt sample from last week is at 71%. The aggregate hides the leading edge.
  • Judge agreement decay across model swaps. The judge calibrated against last year's model quietly disagrees with humans on this year's outputs. Nobody re-measures until a customer complaint forces it.
  • Multi-turn invariant violation without single-turn flag. Every turn passes the single-turn rubric. The conversation as a whole contradicts itself between turn three and turn ten. No eval row captures this.
  • Correlation read as cause. Three changes shipped Tuesday; CSAT dropped Wednesday. The team rolls back the change with the loudest score delta and CSAT does not recover, because the dominant cause was a different change.
  • Adversarial drift outside the eval distribution. Users learn the bot's failure modes faster than the team learns to add them to the suite. Prompt injection, false-premise attacks, and policy-pretexting all live here.

Cross-topic references — where the same pressure recurs

  • Same shape, deeper layer. Module 27's guardrails take the "adversarial drift outside the eval distribution" signal and turn it into a runtime check. Evals catch what they have seen; guardrails catch what evals did not.
  • Same invariant, harder ground. Module 28's red-team work is the "coverage is unprovable" problem under intelligent attack. The novelty hunt is the friendly version; red-teaming is the adversarial version.
  • Same pressure, different consequence. Module 26's incident-response play assumes evals will sometimes fail to catch a failure; the post-mortem mechanics are what catch the eval's misses, the way the inspection catches the demo's misses one module earlier.
  • Recurring tradeoff. The rubric-vs-CSAT tension in this chapter is the same shape as faithfulness-vs-helpfulness in module 13's RAG chapter — every projection of value into a measurable dimension trades off against the unmeasured parts of value.

A fast self-test before you decide the eval program is "mature"

  • Can you state, this week, three failure shapes your eval suite has publicly admitted it does not yet cover?
  • Is judge-vs-human agreement re-measured within thirty days of every model swap, or only at the original calibration?
  • Is CSAT (or another independent business signal) pinned next to rubric score on the same dashboard product managers actually open?
  • For the highest-stakes slice, is there a human-blind audit running on at least 30 conversations per week?
  • When the dashboard is green, does at least one engineer still read raw transcripts before signing the release?

Five yeses mean the program is mature enough to be trusted and skeptical of itself. One or more nos mean the program has matured into the wrong kind of confidence.


Where eval-limit failures have been publicly admitted

Each entry below names a team that publicly acknowledged an eval blind spot or a failure their eval program did not catch. The pattern across them is the chapter: the eval was real, the failure was also real, the gap was the rubric's projection.

  • Air Canada chatbot (Moffatt v. Air Canada, 2024) — a confident wrong refund policy answer was binding in tribunal; no slice eval existed for the bereavement-fare policy the bot misquoted.
  • Mata v. Avianca (2023) — six fabricated case citations entered a federal filing; no citation-existence eval was run pre-submission, and the lawyer trusted ChatGPT's fluency as evidence of correctness.
  • Bing Chat early launch (2023) — Sydney persona, threatening tone, factual hallucinations on extended sessions; Microsoft publicly acknowledged single-turn evals had not covered multi-turn personality drift.
  • Apple Intelligence notification summaries (late 2024) — BBC and Washington Post headlines summarized into false statements; Apple paused the feature for news; the summarization eval had no faithfulness check against the cited source.
  • Galactica (Meta, 2022) — pulled in 72 hours after launch over confident scientific fabrication; the eval program optimized for academic-text fluency, which the model achieved while inventing citations.
  • CNET AI-written finance articles (2023) — 41 of 77 articles required corrections after publication; the publication's eval was editorial taste, which did not catch quantitative errors in compound-interest examples.
  • Bard launch demo error (Google, 2023) — the JWST first-exoplanet claim was wrong on stage; Google publicly acknowledged the demo was not gated on a factual eval.
  • OpenAI's published "evals are unsolved" stance (2024–2025 posts) — OpenAI Evals exists because the team explicitly says single-vendor benchmarks are not sufficient; the platform is built to let users encode their own gaps.
  • Anthropic constitutional-AI work — public Anthropic posts acknowledge that helpfulness/harmlessness rubrics have known tensions and that LLM judges still disagree with humans on certain failure shapes; the lab treats this as an open problem.
  • DeepMind Sparrow paper — published rule-violation rates that the human evaluators caught at higher rates than rule-based classifiers, an explicit admission of judge ceiling.
  • Microsoft Copilot for Security launch posts — public acknowledgment that incident-response eval is harder than code-completion eval, with explicit caveats about coverage in adversarial settings.
  • GitHub Copilot Chat blog (2023–2024) — acknowledgment that pass@k on held-out repos does not predict project-specific refactoring quality; team relies on a separate developer-preference signal.
  • Harvey — public material describes BigLaw partner review as the calibration anchor that the firm-internal eval cannot replace; an explicit human-in-the-loop on the high-stakes slice admission.
  • Casetext CoCounsel — added a citation-existence check after Mata v. Avianca, explicitly because the prior eval did not gate citation reality.
  • Cursor — public posts describe tool-call success rate as a leading indicator that has sometimes risen while user-reported task completion has not, prompting a separate user-success metric.
  • Perplexity — published faithfulness work that explicitly bounds what citation-accuracy evals can and cannot tell you about whether the answer is useful.
  • Glean — engineering posts on the trap of green offline nDCG with falling click-through — the in-house Goodhart story, taught to customers.
  • Notion AI Q&A retrospectives — workspace-context blind spots that golden sets did not cover until production traffic surfaced them.
  • Salesforce Einstein Trust Layer — explicit acknowledgment that adversarial-input evals lag the live attack surface and require continuous refresh.
  • AWS Bedrock observability docs — explicit pitch that aggregate eval is not a substitute for retrieval-failure inspection, the cloud-vendor admission of the chapter's rule.
  • Bloomberg GPT model card — finance-domain evals were built precisely because public benchmarks did not cover the failure modes regulators care about.
  • Vectara HHEM — exists as a commercial product because customer faithfulness rubrics kept missing real hallucinations the team could only catch by training a separate model.
  • Patronus AI, Galileo, Arize, LangSmith, LangFuse posts — each vendor publicly documents specific eval blind spots their tooling tries to address, an industry-wide admission that no single program covers the space.
  • Stripe Radar publications — fraud-model evals must be measured on production sampling, not synthetic distributions, because adversarial drift moves faster than the synthetic generator.

The pattern across this list is consistent: every team that runs evals seriously has, at some point, publicly named what the evals did not catch. That public naming is the discipline this chapter asks the reader to internalise.


Recall — can you reconstruct the chapter cold?

  1. State the Goodhart signature in eval programs in one sentence, using the refund-bot trace as evidence.
  2. Why is coverage of failure shapes unprovable in principle, not just difficult in practice?
  3. Name the three failure modes where LLM judges still mis-grade ~10–20% of cases after calibration.
  4. Why are multi-turn and agentic evals an open frontier rather than a solved problem?
  5. Why does an eval program show you correlation rather than cause?
  6. Name one industry-vs-textbook contradiction in eval practice and explain why production diverges.
  7. What is the first operational signal that the eval program itself is going stale?
  8. State the chapter's "green eval" wrong model and the correct replacement.

Interview Q&A

Q1. Your judge score has gone from 3.8 to 4.6 across six releases. CSAT has gone from 4.2 to 3.6. What is the first thing you do?

A. Stop optimizing against the judge. The Pearson correlation between judge score and CSAT across releases has gone strongly negative. That is the Goodhart signature, and continuing to ship will deepen the divergence. Pull 50 recent low-CSAT conversations, have a human re-score them against the current rubric, and inspect the cases where the rubric says acceptable and the user said unhappy. The rubric needs new dimensions or new anchors. CSAT is the truth; the rubric is a proxy that has stopped tracking truth. Common wrong answer to avoid: "Tune the rubric upward: the judge score is rising, so quality must be rising too."
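A minimal sketch of the correlation check this answer relies on. Only the endpoints (judge 3.8 to 4.6, CSAT 4.2 to 3.6) come from the chapter; the four intermediate values in each series are made-up placeholders, and `pearson` is a hypothetical helper name, not part of any eval tool.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length series with at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / sqrt(var_x * var_y)


# One value per release. First and last values are from the chapter;
# the middle four are ILLUSTRATIVE placeholders, not measured data.
judge_scores = [3.8, 4.0, 4.1, 4.3, 4.5, 4.6]
csat_scores = [4.2, 4.2, 4.1, 3.9, 3.7, 3.6]

r = pearson(judge_scores, csat_scores)

# A negative r across releases means the proxy and the outcome are moving
# in opposite directions, which is the Goodhart signature described above.
goodhart_suspected = r < 0
print(f"judge-vs-CSAT correlation: {r:.2f}, Goodhart suspected: {goodhart_suspected}")
```

Six releases is a very small sample, so the coefficient is best read as a direction signal, not as a precise estimate.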

Q2. A peer claims their suite of 5,000 golden rows now "covers production failures." What do you challenge?

A. Two things. First, coverage is unprovable — production is non-stationary, adversarial, and long-tailed, so tomorrow's failure shape may not be in any suite no matter how large. Second, suite size is a misleading metric on its own; the leading indicator is number of new failure shapes added to the suite per week. If that has been zero for two months, the team has either solved the long tail (rare) or stopped looking (common). Ask for the novelty-hunt cadence, not the row count. Common wrong answer to avoid: "Large enough suites approximate completeness."
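The leading indicator named here can be tracked with very little machinery. The sketch below assumes the team records the date each new failure shape entered the suite; the function names, the 8-week window (standing in for "two months"), and the sample dates are all illustrative assumptions.

```python
from datetime import date


def new_shapes_per_week(added_on, today, weeks=8):
    """Count failure shapes added in each of the last `weeks` weeks.

    Index 0 is the most recent 7-day window, index 1 the one before, etc.
    """
    counts = [0] * weeks
    for added in added_on:
        age_days = (today - added).days
        if 0 <= age_days < weeks * 7:
            counts[age_days // 7] += 1
    return counts


def looks_stale(counts):
    """True when no new failure shape was added anywhere in the window."""
    return sum(counts) == 0


# Hypothetical log of when new failure shapes were added to the suite.
today = date(2025, 6, 30)
stale_log = [date(2025, 3, 3), date(2025, 4, 14)]    # both older than 8 weeks
active_log = stale_log + [date(2025, 6, 25)]          # one shape added 5 days ago

stale_counts = new_shapes_per_week(stale_log, today)
active_counts = new_shapes_per_week(active_log, today)

print(looks_stale(stale_counts))   # True: nothing new in two months
print(looks_stale(active_counts))  # False: the novelty hunt is still finding shapes
```

A stale result does not say which explanation applies (long tail solved, or nobody looking); it only flags that the question needs asking.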

Q3. Why is human-vs-LLM-judge agreement still a ceiling rather than a tuning problem?

A. Because the three failure shapes where judges underperform — negation, quantitative precision, and implicit multi-turn context — are not artifacts of judge calibration. They are structural: negation flips meaning with one token, precision is binary against compliance bars, and implicit context lives outside the rubric's scope. A bigger judge moves the agreement by points, not by orders of magnitude. The honest mitigation is periodic human spot-checks on those three failure shapes, not a stronger judge. Common wrong answer to avoid: "GPT-5-class judges will close the gap."

Q4. Cumulative — the chapter 12 dashboard is green, the chapter 13 promotion gate passed, and CSAT is down 0.3 points. Where do you look first?

A. The eval-CSAT correlation across the last five releases. If it has trended toward zero or negative, the rubric has decoupled from value — chapter 13's gate is faithfully gating something that no longer matches the user, and you have a Goodhart problem dressed in green. If the correlation is still positive but CSAT moved, the cause is probably outside the release path entirely — a vendor change, an upstream prompt edit, a retrieval-index refresh — and the right next move is a one-change-at-a-time ablation, because the eval suite gives you correlation, not cause. Common wrong answer to avoid: "If the dashboard is green, CSAT is wrong."
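One way to set up the one-change-at-a-time ablation is to generate a variant per changed component, each differing from the known-good baseline in exactly one place. This is a sketch under assumptions: the config keys (`prompt`, `model`, `index`) and their values are hypothetical, and how each variant is then deployed and measured is left to the team's own tooling.

```python
def single_change_variants(baseline, candidate):
    """Build one config per changed key, applying only that change to baseline.

    Returns a dict mapping the changed key to the resulting variant config.
    """
    variants = {}
    for key, new_value in candidate.items():
        if baseline.get(key) != new_value:
            variant = dict(baseline)   # copy so the baseline stays untouched
            variant[key] = new_value
            variants[key] = variant
    return variants


# Hypothetical release where three things changed in the same week.
baseline = {"prompt": "refund-v1", "model": "model-a", "index": "2024-05"}
candidate = {"prompt": "refund-v2", "model": "model-b", "index": "2024-06"}

variants = single_change_variants(baseline, candidate)
for changed_key, config in variants.items():
    print(changed_key, config)
```

Running each variant against the same traffic slice and comparing CSAT per variant is what turns the suite's correlation into a candidate cause. Interactions between changes are not covered by this sketch.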

Q5. Why is BLEU still in textbooks and almost never in production eval gates for open generation?

A. BLEU measures n-gram overlap against a reference, which works for translation where references are constrained and works poorly for open generation where many fluent answers are equivalently correct. "Your order has been shipped" and "the user's order shipped" score wildly different BLEU and identical user value. Textbooks teach BLEU because it is reproducible across papers; production teams discard it because reproducibility against a fixed reference is not the same as quality against open user intent. The replacement is rubric-based scoring or task-success metrics. Common wrong answer to avoid: "BLEU is fine if the reference set is big enough."
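The overlap problem is easy to see with a stripped-down version of the idea. The sketch below computes clipped n-gram precision only; it is not full BLEU (no brevity penalty, no geometric mean over n-gram orders, naive whitespace tokenization), and treating the first sentence as the reference is an assumption made for illustration.

```python
from collections import Counter


def ngrams(tokens, n):
    """All contiguous n-grams in a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def clipped_precision(candidate, reference, n):
    """Fraction of candidate n-grams that also appear in the reference.

    Each match is clipped to the number of times the n-gram occurs in the
    reference, as in BLEU's modified precision.
    """
    cand = Counter(ngrams(candidate.lower().split(), n))
    ref = Counter(ngrams(reference.lower().split(), n))
    total = sum(cand.values())
    if total == 0:
        return 0.0
    matched = sum(min(count, ref[gram]) for gram, count in cand.items())
    return matched / total


reference = "Your order has been shipped"
paraphrase = "the user's order shipped"

unigram = clipped_precision(paraphrase, reference, 1)  # 0.5: "order", "shipped"
bigram = clipped_precision(paraphrase, reference, 2)   # 0.0: no shared bigram
exact = clipped_precision(reference, reference, 2)     # 1.0: identical text

print(unigram, bigram, exact)
```

The paraphrase carries the same information for the user, yet shares no bigram with the reference, which is the gap between overlap and value that the answer describes.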

Q6. You promote an LLM judge to a release gate. What should you do every time you swap the underlying model?

A. Re-measure judge-vs-human agreement on a fresh 100-row sample within thirty days of the swap. Model behavior changes; failure shapes shift; a judge calibrated on the old model's outputs may quietly disagree with humans on the new model's outputs. The wrong response is to trust the original kappa indefinitely. The right response is to make re-calibration part of every model-promotion checklist, the same way chapter 13 made eval-suite execution part of every PR. Common wrong answer to avoid: "A calibrated judge stays calibrated."
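Re-measuring agreement can be as small as recomputing Cohen's kappa on the fresh sample. The sketch below assumes pass/fail labels from one human and one judge per row; the ten rows are invented for brevity (the answer above calls for about 100), and `cohens_kappa` is a hand-rolled helper, not a library call.

```python
from collections import Counter


def cohens_kappa(human, judge):
    """Cohen's kappa for two raters labelling the same rows."""
    if len(human) != len(judge) or not human:
        raise ValueError("need two non-empty, equal-length label lists")
    n = len(human)
    observed = sum(h == j for h, j in zip(human, judge)) / n
    human_counts = Counter(human)
    judge_counts = Counter(judge)
    labels = set(human) | set(judge)
    # Agreement expected by chance, from each rater's label frequencies.
    expected = sum(human_counts[l] * judge_counts[l] for l in labels) / (n * n)
    if expected == 1:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


# ILLUSTRATIVE labels only: 8 of 10 rows agree.
human_labels = ["pass"] * 6 + ["fail"] * 4
judge_labels = ["pass"] * 5 + ["fail"] + ["pass"] + ["fail"] * 3

kappa = cohens_kappa(human_labels, judge_labels)
print(f"kappa after model swap: {kappa:.2f}")
```

Comparing this value with the kappa recorded at the previous calibration is what shows whether the judge drifted with the new model; the threshold for "too low" is a team decision the chapter does not fix.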

Q7. When is "score-only release gate" acceptable, and when is it dangerous?

A. Acceptable when the cost of a single confident wrong answer is bounded, the user base tolerates occasional errors, and rollback is fast — internal tools, prototype features, recommendation surfaces. Dangerous when the slice that fails is the slice that matters most (regulated, medical, legal, financial), when adversarial inputs are part of the live distribution, or when the harm is irrecoverable. Mata v. Avianca and Air Canada are the publicly-litigated versions of "score-only was the wrong gate for that slice." Common wrong answer to avoid: "Score-only is fine if the threshold is strict enough."

Q8. A leadership room asks "we have evals, are we safe?" — what is the honest sentence?

A. "Evals catch the failure shapes we know how to ask about. Five categories of failure are still hard to catch — Goodhart drift against the rubric, coverage gaps we cannot prove closed, judge ceilings on negation and precision, multi-turn and agent traces, and causal attribution when multiple changes ship together. We mitigate each one with a specific habit — CSAT pinned to the dashboard, weekly novelty hunts, human spot-checks on the three judge-weak shapes, downscoped trace invariants, and disciplined one-change-at-a-time ablation — but none of them closes the gap completely. The right answer is safer than without evals, not safe in absolute terms." Common wrong answer to avoid: "Yes, the dashboard is green."

Apply now (10 min)

Step 1 — model the exercise. Take the refund-chatbot trace from this chapter. Here is the honest blind-spot table the team should keep public.

| Blind spot | Concrete shape on the refund bot | Current mitigation | Residual risk |
| --- | --- | --- | --- |
| Goodhart drift | rubric rising, CSAT falling across R3–R6 | pin CSAT next to rubric on the launch dashboard; refresh rubric dimensions quarterly | rubric still lags lived value by 1–2 releases |
| Coverage unprovable | wildfire-diversion refund case not in suite | weekly novelty hunt on 50 fresh production conversations | tomorrow's shape may still be missing |
| Judge ceiling | "not eligible unless opened" mis-graded as policy-correct | monthly 30-row human re-grade on negation and quantity rows | ~10% of negation cases still slip |
| Multi-turn blind | turn 12 contradicts turn 3, single-turn rubric green | trace-level invariant check: "no policy contradiction across the conversation" | most multi-turn shapes still inspected, not measured |
| Causal confusion | CSAT drop after prompt+model+index changed same week | one-change-at-a-time ablation on the next regression | slower release cadence; some weeks no signal |

Notice the structure. Every row names the shape, the mitigation, and the residual — because every mitigation is partial. A team that keeps this table public is doing the discipline correctly.
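The multi-turn row's invariant, "no policy contradiction across the conversation", can be sketched as a check over claims extracted from each turn. Everything here is an assumption for illustration: the `PolicyClaim` shape, the `refund_eligible` key, and above all the upstream step that turns free-text replies into structured claims, which is the hard part and is not shown.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PolicyClaim:
    turn: int    # turn number within the conversation
    key: str     # which policy the bot made a statement about
    value: str   # what the bot said about it


def find_contradictions(claims):
    """Return (earlier, later) pairs where the same policy key changes value.

    Each later claim is compared against the first claim made for that key.
    """
    first_seen = {}
    contradictions = []
    for claim in sorted(claims, key=lambda c: c.turn):
        earlier = first_seen.setdefault(claim.key, claim)
        if earlier.value != claim.value:
            contradictions.append((earlier, claim))
    return contradictions


# Hypothetical trace: turn 12 contradicts turn 3, as in the table above.
trace = [
    PolicyClaim(turn=3, key="refund_eligible", value="yes"),
    PolicyClaim(turn=7, key="refund_window_days", value="30"),
    PolicyClaim(turn=12, key="refund_eligible", value="no"),
]

violations = find_contradictions(trace)
for earlier, later in violations:
    print(f"turn {later.turn} contradicts turn {earlier.turn} on {later.key}")
```

A single-turn rubric would grade turns 3 and 12 separately and could pass both; the invariant only fires when the whole trace is inspected. A legitimate change of answer after new information from the user would also fire, so flagged pairs still need a human look.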

Step 2 — your turn. Take your own AI product. Write your three biggest eval blind spots in the same three-column shape — concrete shape, current mitigation, residual risk. If you cannot fill in the concrete shape column with a specific example, the blind spot is not yet real to you, and the first work is to find a real one.

Step 3 — reproduce from memory. Without scrolling up, draw the judge-vs-CSAT Goodhart curve from the running example. Mark the release where the correlation went negative. Then connect it back to chapter 01's load-bearing rule about samples in one sentence. If you can do this cold, you carry the chapter and the module.

What you should remember

This chapter named five failure shapes a mature eval program still misses: Goodhart drift between rubric and lived value, unprovable coverage of failure shapes, the LLM-judge ceiling on negation and precision and implicit context, multi-turn and agent traces as an open frontier, and correlation-vs-cause confusion when multiple changes ship together. The refund-bot trace made the first one concrete: judge score rose from 3.8 to 4.6 across six releases while CSAT fell from 4.21 to 3.61 and refund-overturn rate rose from 6.4% to 9.7%. Same model family, same suite, optimised against a rubric the team trusted. The damage was real; the dashboard was green; the gap was the projection.

You learned the discipline that survives this fact: triangulate every release against at least two independent signals, refresh the rubric against fresh qualitative review, re-measure judge-vs-human agreement at every model swap, run a weekly novelty hunt outside the suite's existing categories, and keep at least one human-blind audit on the slice whose failures are irrecoverable. None of these closes the gap completely. Together they keep the gap visible, which is the only honest goal an eval program can hold. Chapter 13 said make evals the inner loop. This chapter said do not confuse the inner loop with the world. Both statements are load-bearing.

Carry this diagnostic forward: when somebody says "the eval is green, ship it", ask three questions: "what is the eval-CSAT correlation across the last five releases?", "when was the rubric last refreshed?", and "when did we last add a new failure shape to the suite?". If the answers are "we don't track it", "more than a quarter ago", and "last month", the eval program has matured into the wrong kind of confidence, and the next failure is already on the way. The inspection is necessary. The inspection is not sufficient. Reading the kitchen log still beats trusting the shift change alone, and the spot check at the suite's edge buys you the recency the suite's center cannot.

Remember:

  • A rubric score is a projection of value; every projection loses information. Goodhart drift is what the lost information looks like at scale.
  • Coverage of production failures is unprovable. Suite size is the misleading metric; new failure shapes added per week is the leading one.
  • LLM judges still mis-grade ~10–20% of negation, quantitative precision, and implicit multi-turn cases after calibration. Re-measure agreement at every model swap, not annually.
  • Multi-turn and agentic evals are early-generation. Score terminal outcomes and trace invariants; accept that the in-between is currently inspected, not measured.
  • The eval suite tells you what correlated this week. It does not tell you what caused. Ablate one change at a time when CSAT moves.
  • A green dashboard is a license to look harder, not a license to stop looking. Treat green as quiet, not as safe.

Bridge. This module made you good at catching the failures evals were designed to catch, and honest about the failures evals still miss. The next module picks up exactly where the missing failures land — in the trace. When the eval is green but CSAT is down, when the multi-turn rubric is silent but the agent did something wrong, when correlation refuses to become cause, you stop reading dashboards and start reading the kitchen log itself. That is debugging in production: trace-level forensics on the failures the inspector did not see.

../03_agent_observability_debugging/00-eli5.md