05. The metrics zoo: three families, one honest truth, many lying numbers

~18 min read. A metric is a proxy for a question. Every proxy lies on the questions it was not built for. The job of this chapter is to stop you from reading the wrong number and feeling reassured.

Builds on 04-synthetic-generation.md. The rubric decides what acceptable means; a metric is the machine that grades against it. The spot check has no meaning until you say which metric is doing the checking, and the inspection is only as honest as the metric layer underneath it.

What golden and synthetic gave us, and what still cannot ship

The previous four chapters built up to one thing: a labelled set you can trust. Chapter 02 sorted the eval types so you ask the right question at the right time. Chapter 03 gave you a curated golden set with owners and versions. Chapter 04 used synthetic generation to cheaply broaden coverage into the awkward and weird quarters of the user distribution. After all that, you can stand in front of a launch review and say, here are 1,000 representative prompts the team agrees are fair.

You still cannot tell whether the system passed.

A labelled set is a question. A metric is the answer machine. Choose the wrong metric and a system that genuinely helps users will score badly, or (worse, and more commonly) a system that quietly fails users will score well. The refund chatbot from chapter 01 returns in this chapter scored five different ways, and the five scores disagree by enough that the launch decision depends on which one you read. The chapter teaches you which disagreement is signal and which is noise.

What this file solves

This file separates the three families of metrics that all live in production eval stacks but answer fundamentally different questions: text-similarity (does it look like the gold answer?), behaviour (did it do the right thing?), and product-outcome (did the user get unstuck?). It shows the same 100 refund chats scored by all three, demonstrates that the numbers contradict each other, and then teaches you which family to lean on for which decision. By the end, you know why a +5pt ROUGE bump is often meaningless, why a 98% LLM-judge pass rate can hide a churn problem, and why deflection rate is the truth but arrives a month too late.

Why "is it a good answer?" is the wrong question to score

A metric is a function from (reference, candidate, context) to a number. The function only knows the things you feed it. The instant the question the user actually cares about depends on something you did not feed it, the metric becomes a confident liar. This is not a bug in the metric. It is a category error in how you read the score.

Three questions hide inside the phrase "is this a good answer?"

QUESTION                              METRIC FAMILY THAT ANSWERS IT
─────────────────────────────────     ──────────────────────────────
"Does it look like the gold text?"    text-similarity (ROUGE, BLEU, F1)
"Does it do the right thing?"         behaviour (task success, format, refusal)
"Did the user get unstuck?"           product-outcome (deflection, CSAT, retention)

A team that quotes a 0.71 ROUGE score and concludes "the bot is good" has answered question one and pretended it answered question three. The numbers are not interchangeable. The rubric must say which question the team is trying to answer this week, and the metric must be chosen to fit that question — not the one that was easy to instrument.

Teacher voice. No single metric is a quality metric. Every metric is a proxy for one specific question. Quality is a portfolio of proxies, not one number. The discipline of this chapter is learning which proxy lies in which direction.

The naive repair, the visible break, the diagnosis

The first instinct, after chapter 04, is to grab the obvious cheap metric — ROUGE or BLEU — run it across the 1,000 prompts, declare a number, and ship. ROUGE is free, deterministic, and famous. It even has the comfortable shape of an academic citation behind it.

It also lies. On open-ended generation, two answers can mean opposite things and have the same ROUGE. The reference says "Refund approved in 3 days." The candidate says "Refund not approved." The candidate has half the reference's content words. ROUGE gives it credit. The user gets denied. A regression test that uses ROUGE only sees a green tile.

The next instinct is to upgrade to an LLM-as-judge — described in chapter 06 — and use it as the universal metric. That is closer to right, but it has its own lie shape: a judge can be lenient on style, harsh on phrasing, and totally blind to whether the user actually got unstuck in the product. Two weeks after switching to a judge-only metric, the team finds CSAT falling while the eval keeps climbing. The rubric Goodharted itself.

Not a cheap-metric problem. Not a judge-quality problem. A proxy mismatch problem: the team is reading metrics that answer a different question than the one the launch decision actually depends on. So the natural question is: how do we combine multiple proxies so that each one's blind spots are covered by what the others can see?

When two answers mean opposite things and score the same

Take one prompt from the refund chatbot's eval set. The user asks "Did my refund go through? Order 4481." The team has labelled the gold answer:

Reference:  "Yes, your refund for order 4481 was approved on 12 May
             and should reach your card within 3 business days."

Two candidate replies come back from two prompt variants:

Candidate A: "Refund approved within 3 business days for order 4481."
Candidate B: "Your refund for order 4481 was denied."

A human reads these in two seconds and says "A is correct, B is the worst possible answer." Now score them by token overlap, the way the previous file's worked example showed:

Reference tokens   = [refund, approved, order, 4481, 12, may, 3, business, days]
Candidate A tokens = [refund, approved, within, 3, business, days, order, 4481]
Candidate B tokens = [refund, order, 4481, denied]

Overlap with reference:
  A  -> [refund, approved, order, 4481, 3, business, days] = 7
  B  -> [refund, order, 4481]                              = 3

F1 (harmonic mean of precision and recall):
  A  precision = 7/8 = 0.875,  recall = 7/9 = 0.778,  F1 = 0.824
  B  precision = 3/4 = 0.750,  recall = 3/9 = 0.333,  F1 = 0.462

B's F1 is 0.46: a little over half of A's, but nowhere near zero. A regression test running on a million such cases will pass plenty of denial-instead-of-approval answers because the overlap math doesn't know that the word denied inverts the meaning. A spot check powered by ROUGE alone can silently approve a model that flips outcomes on, say, 8% of refunds, because flipped-outcome answers still share the words refund, order, and the order number with the reference.
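The arithmetic above can be reproduced in a few lines. This is a minimal sketch, assuming the hand-picked token lists from the worked example (a real pipeline would tokenise and filter stopwords itself); `token_f1` is an illustrative helper, not a library function.

```python
from collections import Counter

def token_f1(reference: list[str], candidate: list[str]) -> tuple[float, float, float]:
    """Return (precision, recall, F1) based on multiset token overlap."""
    overlap = sum((Counter(reference) & Counter(candidate)).values())
    if overlap == 0:
        return 0.0, 0.0, 0.0
    precision = overlap / len(candidate)
    recall = overlap / len(reference)
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f1

# Token lists copied from the worked example above.
reference = ["refund", "approved", "order", "4481", "12", "may", "3", "business", "days"]
candidate_a = ["refund", "approved", "within", "3", "business", "days", "order", "4481"]
candidate_b = ["refund", "order", "4481", "denied"]

for name, cand in [("A", candidate_a), ("B", candidate_b)]:
    p, r, f1 = token_f1(reference, cand)
    print(f"{name}: precision={p:.3f} recall={r:.3f} F1={f1:.3f}")
```

Nothing in `token_f1` looks at what a token means, which is the whole problem: "denied" is just one more non-matching token.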

The rule: a metric only measures what its math sees

State it plainly: a metric scores the dimension its math can see and is silent on every other dimension. ROUGE sees token overlap. BERTScore sees embedding similarity. An LLM-judge sees whatever the rubric told it to look for. Deflection sees whether the customer came back. None of them sees all four. A claim like "the bot is at 0.74" with no metric named is therefore not a claim about quality. It is a number with no footnotes.

This is the rule the rest of the chapter enforces. Every metric has a blind spot — a class of failure its math cannot detect by construction. Combining metrics is the discipline of stacking blind spots so none of them line up.

Teacher voice. Pick any metric. Ask: what is the worst answer that still scores well here? If you can describe that worst answer in one sentence, you understand the metric's blind spot. If you can't, you are about to ship on a number you don't actually understand.

1) Family one: text-similarity, the cheap dumb yardstick

The oldest family. Born from machine translation, where the question genuinely was "does the candidate look like the reference?", because in translation the source sentence limits how far a valid paraphrase can drift and the reference is a real ground truth produced by a human translator.

METRIC      WHAT IT MEASURES                            COST PER CALL
──────      ────────────────                            ─────────────
BLEU        n-gram precision against reference(s)       ~$0.0001
ROUGE-L     longest common subsequence overlap          ~$0.0001
F1 (token)  harmonic mean of token precision/recall     ~$0.0001
Exact-match string equality                             ~$0.00001
BERTScore   embedding-similarity, not surface tokens    ~$0.001

All five answer the same shape of question: how textually close is candidate to reference? They differ in how lenient they are about paraphrase. Exact-match is brutal: one character off, zero. BLEU and ROUGE forgive rewording only where the reworded answer still shares tokens with the reference. BERTScore compares contextual token embeddings by cosine similarity, so "refund approved" and "reimbursement granted" score high because their embeddings are close, even though their tokens are not.
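The difference in leniency is easy to see at the strict end of the family. The sketch below is a simplified illustration, not the official ROUGE implementation: it assumes lowercase whitespace tokenisation and a plain F1 over the longest common subsequence, and the helper names are made up for this example. BERTScore is left out because it needs an embedding model.

```python
def exact_match(reference: str, candidate: str) -> bool:
    """Strict string equality after trimming surrounding whitespace."""
    return reference.strip() == candidate.strip()

def lcs_length(a: list[str], b: list[str]) -> int:
    """Longest common subsequence length, computed row by row."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def rouge_l_f1(reference: str, candidate: str) -> float:
    """Simplified ROUGE-L: F1 over the LCS of whitespace tokens."""
    ref, cand = reference.lower().split(), candidate.lower().split()
    lcs = lcs_length(ref, cand)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(cand), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)

print(exact_match("refund approved", "Refund approved"))       # one character off
print(rouge_l_f1("refund approved", "reimbursement granted"))  # no shared tokens
print(rouge_l_f1("refund approved", "Refund approved"))        # case-insensitive match
```

A correct paraphrase with no shared tokens scores zero here, which is exactly the case an embedding-based metric is meant to rescue.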

Where this family fits is narrow but real: translation, summarisation with a tight reference, code generation against a canonical solution, structured-field extraction where the field has one right value. Where it lies is everywhere else.

The analogy for this family: the inspection that asks a chef "does your dish look like the photo in the menu?" will pass plenty of inedible dishes that happen to be the right colour. Token overlap is the photo. It does not taste the food.

When text-similarity is honest

Machine translation against a reference corpus. BLEU was built for this and it works because paraphrase is bounded by the source sentence.
Code generation graded against a canonical solution that you literally want byte-identical (or close).
Structured extraction: "what is the order ID in this message?" — exact-match is the right metric.

When text-similarity lies

Open-ended generation: chat, agents, summarisation of multi-source content. There is no single reference; any number of phrasings are equally good.
Faithfulness questions: ROUGE cannot see hallucination. A fluent paraphrase of a hallucinated fact scores beautifully.
Polarity flips: as shown above, "approved" and "denied" differ by one token. Swap that single word in the 9-token reference and token F1 is still 8/9, about 0.89.

For the refund chatbot, text-similarity is a triage tool. It catches catastrophic format breaks (the model emits XML when the rubric expects prose) and complete topic shifts (the model answers about pizza). Anything subtler than that, it cannot see.

Mini-FAQ. "Why does the field still use BLEU then?" Because it is cheap, deterministic, and useful for regression detection — catching that today's model is wildly different from yesterday's — even when it is useless for quality measurement. Use it as a tripwire, not as a quality bar.

2) Family two: behaviour, the rubric-grader

Behaviour metrics ask the second question: did the system do the right thing? They do not care if the words match a reference. They care if the action the system took or recommended matches what the rubric calls acceptable.

Concrete behaviour metrics for an LLM app:

Task success rate. The user wanted X. Did the system produce X? Graded by humans, by a judge, or by an automated check (did the refund API actually get called with the right arguments?).
Instruction-following. The prompt said "reply in under 3 sentences in Hindi". Did the reply obey both constraints?
Format compliance. The schema said the response is JSON with fields decision and next_step. Did it parse?
Refusal rate. On adversarial or out-of-policy prompts, did the system refuse with the policy-mandated language?
Hallucination rate / faithfulness. Are all factual claims in the response grounded in the retrieved context (chapter 13's RAG pressure)?
Tool-call success. Did the agent (module 16) call the right tool with valid arguments?

Most behaviour metrics in 2025 are computed by having an LLM judge apply the rubric, which is why chapter 06 is the next file: behaviour metrics are where the modern eval stack actually lives. Some are cheaper: format compliance can be checked with a JSON parser, refusal rate with a regex, tool-call validity with a JSON-schema validator. Those should be done deterministically before any judge runs, because there is no reason to spend judge tokens on something a parser can decide.
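The "parser before judge" ordering can be sketched as follows. The field names `decision` and `next_step` come from the format-compliance example above; the refusal wording, the `grade` function, and the `judge` callable are hypothetical stand-ins for whatever your policy and judge client actually look like.

```python
import json
import re

REQUIRED_FIELDS = {"decision", "next_step"}  # from the schema example above
# Hypothetical policy wording; substitute your own mandated refusal language.
REFUSAL_PATTERN = re.compile(r"\bI can(?:'|no)t help with that\b", re.IGNORECASE)

def format_compliant(raw: str) -> bool:
    """True if raw parses as a JSON object containing every required field."""
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError:
        return False
    return isinstance(payload, dict) and REQUIRED_FIELDS.issubset(payload)

def is_policy_refusal(text: str) -> bool:
    """True if the reply contains the mandated refusal phrase."""
    return REFUSAL_PATTERN.search(text) is not None

def grade(raw: str, judge=None) -> str:
    """Run the cheap deterministic gate first; call the judge only if it passes."""
    if not format_compliant(raw):
        return "fail:format"
    if judge is None:
        return "pass:deterministic-only"
    return "pass" if judge(raw) else "fail:judge"

print(grade("oops, not JSON"))
print(grade('{"decision": "approve", "next_step": "notify"}'))
```

A response that fails the parser never reaches the judge, so the judge budget is spent only on answers that could plausibly pass.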

The cost shape is mid-tier. A judge call against a 500-token response runs roughly $0.001–0.02 depending on judge model and rubric complexity. Across a 1,000-prompt eval that is $1–$20 — affordable per run, expensive enough to matter when you re-run on every PR.
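To see why "affordable per run" turns into a real budget line, multiply it out. The per-call range and the 1,000-prompt set are the figures quoted above; the number of runs per month is a hypothetical assumption for illustration.

```python
def eval_run_cost(n_prompts: int, cost_per_call: float) -> float:
    """Cost of one full eval run with one judge call per prompt."""
    return n_prompts * cost_per_call

N_PROMPTS = 1_000
LOW, HIGH = 0.001, 0.02   # per-call judge cost range quoted above
RUNS_PER_MONTH = 200      # hypothetical: about ten PR-triggered runs per working day

low = eval_run_cost(N_PROMPTS, LOW)
high = eval_run_cost(N_PROMPTS, HIGH)
print(f"per run:   ${low:.2f} to ${high:.2f}")
print(f"per month: ${low * RUNS_PER_MONTH:,.0f} to ${high * RUNS_PER_MONTH:,.0f}")
```

Under that assumed run count, the same $1 to $20 eval costs roughly $200 to $4,000 a month, which is the argument for keeping judge runs off the per-PR path.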

For the refund chatbot, the behaviour layer is where most of the meaningful signal lives. It catches the invented refund exceptions from chapter 01 (faithfulness check fails), the missing account details (handoff-completeness rubric fails), and the rude escalations (brand-voice rubric fails). The 38-point gap from chapter 01 is fundamentally a behaviour-metric gap, not a text-similarity gap.

Mini-FAQ. "If behaviour metrics catch everything that matters, why do I need the other two families?" Because behaviour metrics are themselves a proxy. They measure what the rubric told them to measure. If the rubric is wrong, the behaviour score is confidently wrong in the same direction. Text-similarity catches catastrophic shifts the rubric author didn't think to write down. Product-outcome catches rubric-Goodharting.

3) Family three: product-outcome, the truth that arrives late

Product-outcome metrics ignore the response entirely and look at what the user did. They are the closest thing to ground truth a production system has, and they are also the slowest to read and the hardest to attribute.

METRIC                  WHAT IT ASKS THE USER         WHEN IT ARRIVES
──────                  ─────────────────────         ─────────────────
Deflection rate         "did you not need a human?"   minutes to hours
Click-through           "did you act on the answer?"  immediately
Conversion              "did you complete the goal?"  hours to days
Time-to-resolution      "how long to be done?"        minutes to days
CSAT / thumbs-up        "did you like it?"            on-session
Retention               "did you come back?"          weeks
Escalation rate         "did you need agent help?"    on-session

These are not eval metrics in the strict sense. They are business metrics. Their relationship to the LLM is causal but noisy — a falling CSAT this week could be the model, could be a pricing change, could be a holiday, could be a Reddit thread. They cannot be run pre-launch on a sample set because they require live users to exist.

What they can do is anchor the upstream metrics. Every behaviour metric and every text-similarity metric must eventually correlate with a product-outcome metric, or it is measuring a fiction. This correlation check is what saves a team from the eval-up-CSAT-down trap from chapter 01.

For the refund chatbot, the load-bearing product metric is deflection rate: what fraction of refund chats resolved without a human agent stepping in? The business pays for the bot because deflection saves agent-hours. If deflection is flat after a model swap, the eval numbers do not matter; the project is not earning its cost. If deflection rises but CSAT falls, the bot is deflecting unsatisfied users, which can cost the business more than the agent-hours it saves.
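Deflection is simple to compute once the logs record whether a human stepped in. A minimal sketch, assuming a hypothetical log record with an `escalated_to_human` flag and an optional CSAT rating; the `flat` tolerance is likewise an illustrative choice, not a recommended value.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Chat:
    # Hypothetical log fields; map them to whatever your chat logs record.
    escalated_to_human: bool
    csat: Optional[int] = None  # 1-5 rating, None when the user left no rating

def deflection_rate(chats: list[Chat]) -> float:
    """Fraction of chats that closed without a human agent stepping in."""
    return sum(not c.escalated_to_human for c in chats) / len(chats)

def read_outcome(deflection_delta: float, csat_delta: float, flat: float = 0.01) -> str:
    """Apply the two decision rules above to before/after deltas of a model swap."""
    if abs(deflection_delta) <= flat:
        return "deflection flat: the swap is not earning its cost"
    if deflection_delta > 0 and csat_delta < 0:
        return "deflecting unsatisfied users"
    if deflection_delta > 0:
        return "deflection up without a CSAT drop"
    return "deflection down"

chats = [Chat(False, 5), Chat(True, 2), Chat(False), Chat(False, 4)]
print(deflection_rate(chats))
print(read_outcome(deflection_delta=0.05, csat_delta=-0.3))
```

Reading the two deltas together is the point: a deflection gain on its own cannot tell you whether users left helped or left frustrated.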

Teacher voice. Product-outcome metrics are slow, noisy, and final. Treat them like an end-of-quarter audit — you cannot fly the plane with them, but you cannot land it without them.

4) The refund chatbot, scored five ways on the same 100 chats

Same 100 conversations from chapter 01. Same model, same prompt, same week. Now scored by five metrics. Watch the numbers disagree.

THE 100-CHAT SAMPLE — SAME CHATS, FIVE METRICS

  Metric                       Score      What it actually measured
  ──────                       ─────      ─────────────────────────
  ROUGE-L (vs gold answer)     0.71       token overlap with reference
  BERTScore F1                 0.88       embedding similarity to reference
  LLM-judge (gpt-4o, rubric)   0.84       rubric pass rate by judge
  Human SME pass rate          0.62       rubric pass rate by senior agent
  Deflection rate              0.47       chats that closed without human

  Cost of this run:
    ROUGE-L         100 × $0.0001 = $0.01
    BERTScore       100 × $0.001  = $0.10
    LLM-judge       100 × $0.008  = $0.80
    Human SME       100 × $2.50   = $250.00
    Deflection      log-only      = ~free, but n=100 is statistically thin

Five honest answers to five different questions. The launch decision changes depending on which row you read.
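The "n=100 is statistically thin" note in the table can be quantified. This sketch uses the Wilson score interval, which is one common choice for a proportion and is an assumption here; the source does not say which interval the team uses.

```python
import math

def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a proportion (z=1.96 gives about 95% coverage)."""
    p = successes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - half, centre + half

# 47 of the 100 sampled chats closed without a human.
low, high = wilson_interval(47, 100)
print(f"deflection 0.47 at n=100: 95% interval [{low:.3f}, {high:.3f}]")
```

At this sample size the interval spans roughly 0.38 to 0.57, so a move of a few points in deflection on 100 chats is not yet evidence of anything.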

A team that reads the ROUGE row (0.71) and the BERTScore row (0.88) declares the bot good and ships. The team that reads the human SME row (0.62) — the same number from chapter 01 — blocks the launch. The team that quietly waits a month and reads the deflection row (0.47) realises the bot is barely doing the job it was paid to do, regardless of which eval said what.

Look at the disagreements, because each one teaches something specific.

ROUGE 0.71 vs human SME 0.62. ROUGE is 9 points more generous than the human. Why? Because ROUGE rewards the invented refund exception answers — they contain the words refund, approved, the order ID, the policy clause name — even though the policy clause does not exist. The human SME marks them as confidently-wrong; ROUGE cannot see the wrongness dimension at all.

LLM-judge 0.84 vs human SME 0.62. The judge is 22 points more generous than the human. This is not a small calibration gap; this is the judge being systematically lenient on something. In practice, two things usually drive this much gap: (1) the rubric was written by the same team that built the bot, so it has the same blind spots, and (2) the judge is reading the bot's fluent confidence as competence. Chapter 06 is about closing exactly this gap.

Human SME 0.62 vs deflection 0.47. The human says 62% are acceptable answers. Only 47% of conversations actually closed without human help. The 15-point gap is the acceptable-but-the-user-still-escalates band — answers that pass the rubric but somehow do not finish the user's job. Inspecting that band almost always reveals a rubric blind spot: the answer was technically policy-correct but missed a follow-up question the user had not asked yet, or used a tone that made the user distrust it.

DISAGREEMENT MAP — same 100 chats

  ROUGE  ─── 0.71 ────┐
                      │  ← +0.09: text similarity rewards "looks right but wrong fact"
  BERTScore           │
        ─── 0.88 ──┐  │
                   │  │
  LLM-judge        │  │
        ─── 0.84 ──┤  │  ← +0.22: rubric-blindness, judge leniency
                   │  │
  Human SME        │  │
        ─── 0.62 ──┴──┘
                   │
                   │  ← -0.15: rubric-acceptable but user still bounces
  Deflection       │
        ─── 0.47 ──┘

The disagreement shape is the diagnosis. Text-similarity and behaviour metrics are upstream — they grade the response. Product-outcome is downstream — it grades what the response caused. When upstream is high and downstream is low, the rubric is missing something the user feels. When upstream is low and downstream is high, the rubric is too strict — the user got helped anyway.

Not a metric-choice problem. A metric-stacking problem. Any one of these numbers, read alone, would mislead a launch decision. Read together, they form a triangulation: if all three families say good, the answer is robustly good. If they disagree by this much, you have a specific known thing to fix before shipping.
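The gaps discussed above, and the upstream-versus-downstream reading of them, fit in a short script. The scores are the five from the table; the `tolerance` value and the function names are hypothetical, and comparing a pass rate directly with a deflection rate follows the simplification the chapter itself makes.

```python
scores = {
    "rouge_l": 0.71,
    "bertscore_f1": 0.88,
    "llm_judge": 0.84,
    "human_sme": 0.62,
    "deflection": 0.47,
}

def gap(a: str, b: str) -> float:
    """Difference between two metric scores, rounded to two decimals."""
    return round(scores[a] - scores[b], 2)

def diagnose(upstream: float, downstream: float, tolerance: float = 0.05) -> str:
    """Read the disagreement shape between a response-level and an outcome-level score."""
    if upstream - downstream > tolerance:
        return "rubric is missing something the user feels"
    if downstream - upstream > tolerance:
        return "rubric is too strict: the user got helped anyway"
    return "upstream and downstream agree"

print("ROUGE vs SME:     ", gap("rouge_l", "human_sme"))
print("judge vs SME:     ", gap("llm_judge", "human_sme"))
print("SME vs deflection:", gap("human_sme", "deflection"))
print(diagnose(scores["llm_judge"], scores["deflection"]))
```

Keeping the gaps as first-class numbers, rather than the raw scores, makes it harder for a review to quote one row and ignore the rest.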

5) The cost–signal Pareto: why no team uses only one metric

                  signal strength →
   weak                                              strong
   │                                                    │
   │ ROUGE/BLEU       BERTScore       LLM-judge         │
   │ ~$0.0001         ~$0.001         ~$0.001-0.02      │
   │ deterministic    semantic        rubric-driven     │
   │ blind to facts   blind to facts  blind to outcome  │
   │                                                    │
   │                                  Human SME         │
   │                                  ~$1-5/sample      │
   │                                  slow, gold        │
   │                                                    │
   │                                  Product outcome   │
   │                                  ~free per chat    │
   │                                  noisy, late, true │
   │                                                    │
   └────────────────────────────────────────────────────┘
                  cost / latency / scarcity →

Every team runs a portfolio because each cell occupies a different point on the Pareto frontier. Cheap-deterministic metrics run on every PR. Mid-cost LLM-judges run on every nightly eval. Expensive human SME runs on a small calibration sample weekly. Free-but-late product metrics run continuously in production.

A reasonable production stack looks like this:

Every PR: ROUGE/exact-match on tiny regression set + JSON parse + tool-call schema check. Fast tripwires.
Every nightly: LLM-judge on 1,000-prompt eval set with the rubric. The launch-gate number.
Every week: Human SME labels 50–100 chats sampled from the past week's live traffic. Calibrates the judge.
Continuously: Deflection, CSAT, escalation tracked in the product dashboard. Final-truth lagging signal.

This is the discipline that the inspection actually requires. Not one metric. A stack, where each layer compensates for the layer above's blind spot.
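One way to keep the stack honest is to write it down as data rather than tribal knowledge. The sketch below encodes the four cadences listed above; the `Layer` type, the trigger names, and the check identifiers are all hypothetical labels, not the API of any eval framework.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Layer:
    trigger: str             # when the layer runs
    checks: tuple[str, ...]  # which metrics it runs
    role: str                # what the layer is for

# The four cadences from the list above, expressed as data.
STACK = (
    Layer("pr", ("rouge_or_exact_match", "json_parse", "tool_call_schema"), "fast tripwires"),
    Layer("nightly", ("llm_judge_rubric",), "launch-gate number"),
    Layer("weekly", ("human_sme_labels",), "calibrates the judge"),
    Layer("continuous", ("deflection", "csat", "escalation"), "lagging final truth"),
)

def checks_for(trigger: str) -> tuple[str, ...]:
    """Look up which checks should run for a given trigger."""
    for layer in STACK:
        if layer.trigger == trigger:
            return layer.checks
    raise KeyError(f"no layer defined for trigger {trigger!r}")

for layer in STACK:
    print(f"{layer.trigger:<11} {', '.join(layer.checks)}  ({layer.role})")
```

If a layer quietly drops out of this table, that is a reviewable diff rather than a habit that faded.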

6) Alternative comparison: which family fits which workload

Workload	Best primary metric	Why	Worst choice
Machine translation	BLEU + chrF	reference exists, paraphrase is bounded	LLM-judge alone (slow, redundant)
Code generation	unit tests + exact-match	ground truth is executable	BLEU (tokens lie about semantics)
Structured extraction	field-level exact-match	fields have single right values	ROUGE (rewards almost-right)
Single-turn factual QA	LLM-judge + faithfulness	grounding is the question	BLEU (blind to hallucination)
Open-ended chat	LLM-judge + human SME + CSAT	no reference, style + outcome	ROUGE (catastrophically misleading)
Agent task completion	task-success + tool-call validity	actions, not text	BERTScore (similar text doesn't equal correct actions)
Summarisation	judge faithfulness + ROUGE tripwire	both surface and fact	ROUGE alone (rewards verbatim copy)
Refund-style policy chat	judge against policy rubric + deflection	policy + outcome	BLEU alone (fluent denial passes)

The pattern: text-similarity metrics fit workloads where the reference is canonical. Behaviour metrics fit workloads where the action matters more than the words. Product-outcome metrics fit workloads where the user has a goal you can detect post-hoc. Choose by workload shape, not by what is easy to compute.

7) Operational signals: what tells you the metric layer is healthy or rotting

A healthy metric stack has three signatures. The launch dashboard shows at least one metric from each family side-by-side, not a single aggregate. The judge-vs-SME calibration plot is refreshed weekly and the correlation stays above 0.85. The eval score and the deflection score move in the same direction over a quarter — not perfectly correlated, but never inversely correlated for more than a sprint.
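The weekly calibration check reduces to a correlation between judge scores and SME scores on the same samples. The sketch below assumes Pearson correlation, which the source does not specify, and uses made-up per-sample scores purely for illustration.

```python
import math

def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient between two equal-length series."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / math.sqrt(var_x * var_y)

# Hypothetical rubric scores (0 to 1) for the same six chats.
judge_scores = [0.90, 0.80, 0.95, 0.40, 0.70, 0.85]
sme_scores = [0.80, 0.70, 0.90, 0.20, 0.30, 0.80]

r = pearson(judge_scores, sme_scores)
THRESHOLD = 0.85  # the health bar quoted above
print(f"judge-vs-SME correlation: {r:.2f} ({'healthy' if r >= THRESHOLD else 'recalibrate'})")
```

With binary pass/fail labels the same function still works, but an agreement statistic such as Cohen's kappa may be the more natural choice; which one your team reports is a decision to make explicitly.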

The first signal of rot is metric collapse: the launch review starts quoting a single number again, usually the LLM-judge pass rate, and the slice table disappears. This is the comfort signal — one number is easier to celebrate. It is also the signal that the team has forgotten the chapter-01 lesson at the metrics layer.

The second signal is the judge-SME gap widening. If the judge says 0.84 and the human says 0.62 in May, and the judge says 0.91 and the human says 0.61 in July, the judge has been Goodharted by something, usually a prompt tweak that learned to flatter the rubric. The bot did not improve. Only the judge's score did.
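The widening can be checked mechanically. The May and July figures are the ones from the paragraph above; the `max_increase` threshold is a hypothetical alerting choice.

```python
# (month, judge pass rate, human SME pass rate), figures from the paragraph above.
readings = [("May", 0.84, 0.62), ("July", 0.91, 0.61)]

def judge_sme_gaps(rows):
    """Judge-minus-SME gap per reading, rounded to two decimals."""
    return [(month, round(judge - sme, 2)) for month, judge, sme in rows]

def gap_is_widening(rows, max_increase: float = 0.05) -> bool:
    """True if the gap grew by more than max_increase from first to last reading."""
    gaps = [g for _, g in judge_sme_gaps(rows)]
    return gaps[-1] - gaps[0] > max_increase

for month, g in judge_sme_gaps(readings):
    print(f"{month}: judge is {g:.2f} more generous than the SME")
print("widening:", gap_is_widening(readings))
```

The gap goes from 0.22 to 0.30 while the human number barely moves, which is the signature of a score improving without the bot improving.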

The deepest signal is eval up, deflection flat. The launch metric keeps rising. The product metric does not move. This means the team is improving against an upstream proxy that does not transmit to the downstream business. The fix is rubric work, not model work.

The metric a beginner watches first is aggregate pass rate. The metric an experienced team watches first is the judge-SME calibration delta. The graph an expert opens before any other is the upstream-downstream divergence plot — eval pass rate on one axis, deflection on the other, week by week, looking for the moment they decouple.
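A divergence plot can be backed by a simple detector for the moment the two series decouple. Everything in this sketch is an assumption made for illustration: the weekly series are invented, the thresholds are arbitrary, and measuring gains against week 0 is only one of several reasonable baselines.

```python
from typing import Optional

def decoupling_week(
    eval_scores: list[float],
    deflection: list[float],
    min_eval_gain: float = 0.03,
    max_deflection_gain: float = 0.01,
) -> Optional[int]:
    """Index of the first week where eval has risen since week 0 but deflection has not.

    Returns None if the two series never decouple under these thresholds.
    """
    for week in range(1, len(eval_scores)):
        eval_gain = eval_scores[week] - eval_scores[0]
        deflection_gain = deflection[week] - deflection[0]
        if eval_gain >= min_eval_gain and deflection_gain <= max_deflection_gain:
            return week
    return None

# Hypothetical weekly series: eval pass rate climbs, deflection stays flat.
eval_pass = [0.80, 0.81, 0.84, 0.87, 0.90]
deflection = [0.47, 0.48, 0.47, 0.47, 0.46]

print("first decoupled week:", decoupling_week(eval_pass, deflection))
```

When this returns a week number, the next step is rubric work on the chats from that week, not another round of model tuning.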

8) Boundary of applicability: where each metric stops working

ROUGE/BLEU work: translation, summarisation against tight references, code-completion regression tripwires, format-shift detection.

ROUGE/BLEU break: open-ended chat, any task where the right answer can be phrased ten ways, any task where polarity (approved vs denied) sits in one word.

BERTScore works: paraphrase-tolerant similarity, multilingual comparison, mid-quality regression checks where you need more than tokens but cannot afford a judge.

BERTScore breaks: faithfulness (embeddings of hallucinated text are close to embeddings of true text on the same topic), polarity flips (similar to ROUGE but slightly better), tasks where the gold answer is itself low-quality.

LLM-judge works: rubric-driven behaviour grading, hallucination detection with retrieved context, instruction-following, refusal correctness, any place a human could grade in 30 seconds.

LLM-judge breaks: high-stakes domains without SME calibration (legal, medical, regulated finance), tasks where the judge model has the same training-data bias as the model under test, very long contexts where the judge's own attention drifts.

Human SME works: always, in the sense of correctness. The boundary is cost and throughput, not accuracy.

Human SME breaks: scale. You cannot grade a million chats a day with humans. The mechanism is to use humans to calibrate the cheaper layers, not to grade the whole eval.

Product-outcome works: the final question of did this actually pay off? over weeks of live data.

Product-outcome breaks: pre-launch. Anything that requires real users to exist. Also breaks on attribution — if pricing changes the same week the prompt changes, the deflection delta cannot be cleanly attributed.

9) Common wrong mental model: "higher BLEU means better answer"

The seductive belief is that metric scores are commensurable — a higher number means a better system, regardless of what the metric measures. This is wrong in three specific ways.

First, higher is not always better. A model that had simply memorised the reference answers would score a perfect BLEU on them while showing nothing about generalisation. A judge that inflates the pass rate by being lenient produces a higher score but a worse signal. On open-ended tasks, a very high BLEU can signal less creativity, because the metric rewards copying the reference.

Second, scores from different metrics are not comparable. A 0.71 ROUGE and a 0.84 LLM-judge do not mean the judge is "13 points more confident" than ROUGE. They are measuring different dimensions. Treating them as a single quality axis is a category error — like averaging temperature and rainfall and calling the result weather.

Third, a metric improvement does not imply a quality improvement unless you've shown the metric correlates with what users feel. The eval-up-CSAT-down pattern is the canonical proof. The metric got "better"; the user's experience got worse. The replacement mental model: every metric is a hypothesis that scoring well on it correlates with user happiness, and that hypothesis must be re-tested whenever the system or user base changes.

Replace "higher BLEU is better" with "each metric is a proxy with a known blind spot; combine proxies so the blind spots don't line up."

10) Six recurring metric pathologies

Single-metric tunnel. The team optimises one number for a quarter. The number rises. The product gets worse. The metric is fine; the optimisation pressure was misallocated.
Reference rot. The golden answers in the eval set were written six months ago. The policy changed. The reference is now wrong. ROUGE-against-reference scores the model on outdated truth.
Judge-model collusion. The judge and the system-under-test are the same model family. The judge marks its own dialect as fluent and a competitor's dialect as awkward. Score is inflated for in-family models.
Embedding-similarity hallucination blindness. A confident fabrication on a real topic embeds close to the truth. BERTScore says 0.92. The fact is invented.
Format-compliance theatre. The team adds a strict JSON-schema metric. Compliance hits 99.9%. The team celebrates. Nobody noticed the JSON is valid and the content is garbage.
Deflection-without-CSAT. Deflection rises (the bot refuses to escalate). The user gives up. CSAT falls. The product metric without its companion was the wrong product metric.

Each pathology is a metric reading itself instead of the dimension behind it. The fix is usually to add a metric from a different family, not to tune the existing one harder.

11) Cross-topic references: same shape, different layer

Same blind-spot pressure as RAG retrieval. Chapter 08_rag_system_design/01-confident-wrong-answer.md is the same pressure as text-similarity blindness, one layer earlier: a retrieved chunk that looks like the answer is not the same as a chunk that contains the answer. Surface similarity is a confident liar in both layers.
Same proxy-mismatch as model selection. Chapter 12_model_selection/03-benchmark-vs-task.md's argument that MMLU is not your task is structurally identical to this chapter's argument that BLEU is not quality. The metric-vs-question gap recurs whenever a measurement is borrowed from a different workload.
Optimization pressure echoed. Goodhart's law shows up in module 03_agent_observability_debugging again: when a measure becomes a target, it stops being a good measure. The defence is the same: read multiple metrics from multiple families, refresh the rubric quarterly, calibrate to a human anchor.
Cost/latency tradeoff recurs. The Pareto in this chapter is the same shape as the inference-cost/quality Pareto in chapter 15_inference_optimization. Both have a cheap-fast-weak corner, an expensive-slow-strong corner, and the engineering question is always what mix do I run at what frequency?

12) Fast design test before you trust a metric

Can you describe the worst answer that still passes this metric in one sentence?
If the model regressed catastrophically on this metric, would the failure be on a class of inputs you care about?
Does this metric agree with a human SME above 0.85 correlation on your task?
Do you have a companion metric from a different family that catches this metric's blind spot?
Has the metric ever caught a bug the team didn't already know about? If not, it is a confirmation tool, not a measurement tool.

Five yeses and the metric is earning its place in the inspection. One no, and you have just identified the blind spot you are about to ship into.

Where the metric families live in the wild

The market reveals the families by who specialises in which.

Text-similarity and academic-leaning benchmarks

BLEU, ROUGE, METEOR, chrF: canonical reference-based metrics from machine translation and summarisation. sacreBLEU is the tool that standardises how BLEU and chrF are computed, so cross-paper comparison is possible.
BERTScore, BLEURT, COMET: embedding-based and learned similarity metrics that survive paraphrase; widely used as a BLEU replacement in translation and summarisation papers.
MTEB, BEIR: embedding-benchmark and retrieval-benchmark suites; the field's shared scoreboards for similarity-and-retrieval quality.
HumanEval, MBPP, SWE-bench: code-generation benchmarks where unit tests act as ground-truth behaviour metrics, not text similarity.
MS MARCO MRR@10, nDCG@10: search/ranking metrics on top of retrieval; reused inside many RAG eval stacks.
Berkeley Function-Calling Leaderboard (BFCL): public leaderboard for tool-call schema adherence; a behaviour metric pretending to be a text metric.

Behaviour-and-judge-driven eval products

OpenAI evals — open-source framework that pairs golden sets with judge graders; the default starting point.
Anthropic eval cookbook — recipes for rubric design and judge calibration, paired with Claude's tool-use APIs.
RAGAS — open-source RAG-eval library that bundles faithfulness, answer relevance, and context precision as ready-made judge metrics.
TruLens — instrumented LLM-app evals with feedback functions; faithfulness, groundedness, relevance as composable behaviour metrics.
DeepEval — pytest-style assertion framework for LLM outputs; treats behaviour metrics as test cases.
Promptfoo — CLI for prompt comparisons with assertion-based checks; lightweight regression layer.
Galileo Genie / context-adherence — productised hallucination and groundedness metrics for enterprise.
Patronus AI Lynx — open-source hallucination-detection model; specifically a judge that targets faithfulness blind spots.
Vectara HHEM — Hughes Hallucination Evaluation Model; ranks LLMs by hallucination rate, exists because faithfulness was the missing metric.
Arize Phoenix — open-source tracing-plus-eval; pairs the kitchen log with judge metrics in one UI.
LangSmith — LangChain's eval product; offline and online eval workflows with dataset versioning.
Braintrust — eval platform focused on dataset versioning and experiment tracking; treats every metric as a versioned function.
Helicone, LangFuse — observability platforms with judge integrations; product-outcome and behaviour metrics in one pane.

Product-outcome and domain-specific

Intercom Fin — deflection rate is the product metric; the whole sales pitch is the number.
Glean (enterprise search) — nDCG offline plus click-through online; the Goodhart-detector pair.
Harvey (legal) — SME calibration by BigLaw associates; the human-grade layer is the load-bearing metric, not a judge.
Cursor — tool-call success rate as the launch gate; pure behaviour metric on a public benchmark.
Perplexity — citation-accuracy gates every model swap; behaviour + faithfulness in one.

The pattern: nobody ships on one metric. The serious teams pair a cheap deterministic tripwire, a mid-cost rubric judge, a small human calibration sample, and a slow product-outcome anchor. The companies that sell eval tooling all sell into the same gap — that one metric is not enough, and the market keeps proving it.
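The four-tier pattern can be written down as data, which makes gaps visible. A sketch only: the tier names and cadences follow this chapter's summary, the `MetricTier` structure is hypothetical, and filing the SME sample under the behaviour family is my reading rather than something the chapter states.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class MetricTier:
    name: str
    family: str   # "similarity", "behaviour", or "outcome"
    cadence: str  # how often the tier runs
    role: str     # which decision it informs

STACK = [
    MetricTier("ROUGE-L vs gold", "similarity", "every PR", "regression tripwire"),
    MetricTier("LLM-judge + rubric", "behaviour", "nightly", "launch gate candidate"),
    MetricTier("Human SME sample", "behaviour", "weekly", "judge calibration anchor"),
    MetricTier("Deflection rate", "outcome", "continuous", "product-outcome anchor"),
]

def missing_families(stack, required=("similarity", "behaviour", "outcome")):
    """List the metric families that no tier in the stack covers."""
    covered = {tier.family for tier in stack}
    return [family for family in required if family not in covered]

print(missing_families(STACK))      # []
print(missing_families(STACK[:1]))  # ['behaviour', 'outcome']
```

A stack that returns a non-empty list here is the "one metric and a hope" situation: every number on the dashboard shares the same blind spot.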

Recall — can you reconstruct the chapter cold?¶

Name the three families of LLM-app metrics and the question each one answers.
Why do a ROUGE of 0.71 and a human SME score of 0.62 mean the model is worse than ROUGE suggests, not better?
Give one concrete blind spot of each: ROUGE, BERTScore, LLM-judge, deflection rate.
What is the typical cost-per-sample for ROUGE vs BERTScore vs LLM-judge vs human SME?
State the chapter's load-bearing rule about what a metric can see.
Why is eval-up, CSAT-down a Goodhart signal and not a measurement problem?
When is BLEU honest and when does it lie? Give one workload for each.
Why does a healthy production stack run all four metric tiers at different frequencies?

Interview Q&A¶

Q1. Your team's eval ROUGE went from 0.68 to 0.73 after a prompt change. PM wants to ship. What do you check?

A. ROUGE alone is regression-detection, not a quality bar. I'd run the LLM-judge on the same 1,000-prompt set and inspect the disagreement between ROUGE delta and judge delta. If the judge moved with ROUGE, ship-pending-SME-calibration. If the judge didn't move, the prompt change is teaching the model to use more reference vocabulary without actually getting better — a classic BLEU-game pattern. Either way, the ship gate is the judge plus a slice table, not the ROUGE point. Common wrong answer to avoid: "ROUGE went up, ship it."
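The check in this answer is a comparison of two deltas, and it is short enough to write down. A sketch, assuming both metrics were run on the same prompt set before and after the change; `judge_noise` is a hypothetical run-to-run tolerance that you would need to measure for your own judge.

```python
def read_rouge_bump(rouge_delta, judge_delta, judge_noise=0.01):
    """Classify a ROUGE change by whether the judge moved with it."""
    if rouge_delta <= 0:
        return "no ROUGE gain to explain"
    if judge_delta > judge_noise:
        return "judge moved with ROUGE: ship pending SME calibration"
    return "judge flat or down: likely reference-vocabulary gain, do not ship on ROUGE"

# ROUGE went 0.68 -> 0.73; the judge delta of 0.00 is invented for illustration.
print(read_rouge_bump(rouge_delta=0.73 - 0.68, judge_delta=0.00))
```

The function deliberately never returns "ship": even in the good branch, the gate is the judge plus the slice table, and the ROUGE point only tells you where to look.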

Q2. Your LLM-judge says 0.84. Your human SME on the same 100 says 0.62. What do you do?

A. The 22-point gap is the calibration debt, not the model. Inspect the 30 examples where the judge said pass and the SME said fail. Almost always you find the rubric has a hole the judge cannot see — usually fluent but unfounded claims, or a tone dimension the rubric didn't name. Fix the rubric, re-run the judge, re-measure the gap. Targets like judge-SME correlation above 0.85 should be on the launch gate, not the raw judge score. Common wrong answer to avoid: "Trust the judge, humans are inconsistent."
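Pulling the disagreement pile is a few lines once the verdicts are paired. A minimal sketch, assuming one pass/fail verdict per sample from each grader; the ten verdicts below are made up.

```python
def calibration_report(judge_pass, sme_pass):
    """Compare paired pass/fail verdicts (lists of bools, one entry per sample)."""
    if len(judge_pass) != len(sme_pass) or not judge_pass:
        raise ValueError("verdicts must be non-empty and paired sample-for-sample")
    n = len(judge_pass)
    pairs = list(enumerate(zip(judge_pass, sme_pass)))
    # Judge passed, SME failed: the pile to read by hand for rubric holes.
    lenient = [i for i, (j, s) in pairs if j and not s]
    # SME passed, judge failed: disagreement in the other direction.
    strict = [i for i, (j, s) in pairs if s and not j]
    return {
        "judge_rate": sum(judge_pass) / n,
        "sme_rate": sum(sme_pass) / n,
        "gap": (len(lenient) - len(strict)) / n,
        "judge_pass_sme_fail": lenient,
        "judge_fail_sme_pass": strict,
    }

# Hypothetical verdicts for ten samples.
judge = [True, True, True, True, True, True, True, True, False, False]
sme = [True, True, True, True, True, False, False, False, True, False]
report = calibration_report(judge, sme)
print(report["gap"], report["judge_pass_sme_fail"], report["judge_fail_sme_pass"])
```

The gap in pass rates equals the difference between the two pile sizes, so the headline gap understates total disagreement. On 100 samples, a 22-point gap with 30 judge-pass, SME-fail cases means 8 more cases disagree in the other direction, and those are worth reading too.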

Q3. Why is BERTScore not enough for a faithfulness eval on a RAG system?

A. Because embedding similarity rewards topical closeness, and a confident fabrication on the right topic embeds very close to the true answer on that topic. BERTScore will give a hallucinated answer about a real policy clause a similar score to the correct answer about the same clause. Faithfulness requires checking each claim in the response against the retrieved context — that is a behaviour metric (judge with grounding rubric or a specialised faithfulness model like Vectara HHEM), not a similarity metric. Common wrong answer to avoid: "BERTScore is the modern ROUGE, it handles paraphrase."
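The shape of a claim-level faithfulness check looks like this. A sketch under stated assumptions: claims are already extracted upstream, and the verifier is injected as a callable. `toy_verifier` is a deliberately naive substring stand-in so the example runs offline; in practice that slot holds a judge with a grounding rubric or a faithfulness model.

```python
from typing import Callable, Iterable

def faithfulness_score(
    claims: Iterable[str],
    context: str,
    is_supported: Callable[[str, str], bool],
) -> float:
    """Fraction of claims the verifier marks as supported by the retrieved context."""
    claims = list(claims)
    if not claims:
        return 1.0  # no claims made, so nothing unsupported (a convention, not a rule)
    supported = sum(1 for claim in claims if is_supported(claim, context))
    return supported / len(claims)

def toy_verifier(claim: str, context: str) -> bool:
    """Illustration only: case-insensitive substring match."""
    return claim.lower() in context.lower()

context = "Refunds are available within 30 days of purchase. Shipping fees are not refunded."
claims = [
    "refunds are available within 30 days of purchase",
    "shipping fees are refunded in full",
]
print(faithfulness_score(claims, context, toy_verifier))  # 0.5
```

The second claim is on-topic and fluent, which is exactly why an embedding score would rate it close to the truth. The per-claim check fails it because nothing in the context supports it.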

Q4. PM asks "what's our quality number?" You have ROUGE, judge, SME-spot, deflection. What do you say?

A. "We have four numbers because no one of them tells the truth alone." I'd present a one-row dashboard: ROUGE for regression tripwire, judge pass rate as the launch gate, SME calibration delta as the judge-trust signal, deflection as the product-outcome anchor. The PM should learn to read all four, because optimising any one in isolation breaks the others. "What's our number?" is the wrong question; "are all four moving the right way?" is the right one. Common wrong answer to avoid: "0.84, that's our eval pass rate."
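"Are all four moving the right way?" can be checked between two dashboard snapshots. A sketch, assuming the SME calibration delta is tracked as the judge-SME gap (so lower is better). The metric keys and the "after" snapshot are hypothetical.

```python
# +1 means higher is better, -1 means lower is better.
DIRECTION = {
    "rouge_l": +1,
    "judge_pass_rate": +1,
    "judge_sme_gap": -1,  # a smaller gap means a more trustworthy judge
    "deflection_rate": +1,
}

def regressions(before, after, direction=DIRECTION, tolerance=0.0):
    """Return the metrics that moved the wrong way by more than `tolerance`."""
    wrong = []
    for name, sign in direction.items():
        change = (after[name] - before[name]) * sign
        if change < -tolerance:
            wrong.append(name)
    return wrong

before = {"rouge_l": 0.68, "judge_pass_rate": 0.84, "judge_sme_gap": 0.22, "deflection_rate": 0.47}
after = {"rouge_l": 0.73, "judge_pass_rate": 0.86, "judge_sme_gap": 0.27, "deflection_rate": 0.47}
print(regressions(before, after))  # ['judge_sme_gap']
```

In this invented snapshot ROUGE and the judge both improved, yet the judge drifted further from the SME. Read alone, either of the first two numbers would have said ship.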

Q5. You're launching an MT system. Which metric stack do you build?

A. BLEU + chrF on a reference corpus, because the workload is the original BLEU workload and the reference is canonical. COMET as a learned-metric companion to catch fluency-without-faithfulness. Human bilingual reviewer for a 200-segment calibration sample. No LLM-judge needed unless the domain is specialised (legal, medical), because BLEU's blind spots are well-understood here and a judge would mostly add cost. This is the one workload where text-similarity is honest. Common wrong answer to avoid: "LLM-judge for everything, BLEU is dead."

Q6. The eval has been at 0.85 for three months. Deflection has dropped 4 points. What do you investigate first?

A. The rubric. A stable judge score and a falling product outcome is the canonical decoupling signal. Pull 30 recent low-deflection conversations and grade them by the current rubric — the cases where the rubric says pass and the user escalated anyway are the rubric-blind-spot pile. New rubric dimensions need to be added (probably around answer-completeness or tone), the judge needs to be re-prompted, and the calibration sample needs to be re-run before the launch gate moves. The model is fine; the measurement decayed. Common wrong answer to avoid: "Retrain or change the model."

Q7. Cumulative — your golden set scores 0.91 on the judge. Your synthetic set scores 0.62 on the judge. Live sample scores 0.74. Which number do you trust?

A. All three are honest about their sample; none is honest about production alone. The golden set 0.91 is the easy half — the team picked clean cases. The synthetic 0.62 is the adversarial half — synthetic generation deliberately broadened into awkward and weird quarters (chapter 04). The live 0.74 is the real distribution. The launch gate is the live number, with the synthetic number used to predict where live failures will cluster and the golden number used as a regression tripwire. Reporting only the golden number is the chapter-01 vibes failure dressed in a number. Common wrong answer to avoid: "0.91 is the real quality, the others are biased."

Q8. When would you skip the LLM-judge and use only exact-match?

A. Structured-extraction tasks where the answer is one field with one right value — order ID, amount in cents, country code, ICD-10 code. The judge would add cost, latency, and a calibration burden for a check a == operator does perfectly. Use the judge only when there is a judgement to make. "Is this string equal to that string?" is not a judgement. Common wrong answer to avoid: "Always use a judge, it's the modern way."
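For the single-field case, the whole metric fits in one function. A minimal sketch; the whitespace-stripping normaliser and the sample values are assumptions, and you would swap in whatever canonicalisation your field needs (case folding, zero padding, and so on).

```python
def exact_match_rate(predictions, gold, normalise=str.strip):
    """Share of fields whose predicted value equals the gold value after normalisation."""
    if len(predictions) != len(gold):
        raise ValueError("predictions and gold must be paired")
    if not gold:
        raise ValueError("empty eval set")
    hits = sum(normalise(p) == normalise(g) for p, g in zip(predictions, gold))
    return hits / len(gold)

# Hypothetical extractions: order ID, amount in cents, country code.
gold = ["ORD-10442", "1999", "DE"]
pred = ["ORD-10442 ", "1999", "FR"]
print(f"exact match: {exact_match_rate(pred, gold):.2f}")
```

The only design decision is the normaliser. Keep it strict: every rule you add to forgive a near-miss is a small judgement call, and enough of them turn the check back into an uncalibrated judge.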

Apply now (10 min)¶

Step 1 — model the exercise. Here is the refund chatbot's metrics dashboard I would present at the launch review:

Metric	Value	Source	Cost	Decision role
ROUGE-L vs gold	0.71	deterministic	$0.01/100	regression tripwire only
BERTScore F1	0.88	local embed	$0.10/100	semantic-shift tripwire
LLM-judge (gpt-4o)	0.84	judge + rubric	$0.80/100	launch gate candidate
Human SME pass	0.62	senior agent	$250/100	judge calibration anchor
Deflection rate	0.47	product log	~free	product-outcome truth

Notice how the launch decision depends entirely on which row the room reads. The 0.84 row would ship the bot. The 0.62 row would block it. The 0.47 row would say the model is fine, the bot's job design is wrong. All three are simultaneously true about different questions.
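The cost column explains why the rows run at different frequencies. A quick arithmetic sketch using only the per-100 figures from the table above (deflection is left out because its cost is listed as roughly free):

```python
# Dollars per 100 samples, copied from the dashboard table.
COST_PER_100 = {
    "ROUGE-L": 0.01,
    "BERTScore": 0.10,
    "LLM-judge": 0.80,
    "Human SME": 250.00,
}

def run_cost(metric, n_samples):
    """Cost in dollars of scoring n_samples with the given metric."""
    return COST_PER_100[metric] * n_samples / 100

for metric in COST_PER_100:
    per_sample = run_cost(metric, 1)
    per_thousand = run_cost(metric, 1000)
    print(f"{metric:<10} ${per_sample:.4f}/sample   ${per_thousand:,.2f} per 1,000")
```

At these figures the judge costs about $8 to run over a 1,000-prompt set, while the same set graded by an SME costs $2,500, roughly 300 times more. That ratio is the reason the human row is a small calibration sample and not the nightly gate.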

Step 2 — your turn. Take one AI feature in your own product. Write down (a) one metric from each family you can compute today, (b) the cost per 100 samples, (c) the worst answer that would still pass that metric, and (d) the product-outcome metric that would notice if your launch metric Goodharted itself. If you can't fill all four rows, you don't have a metric stack yet — you have one metric and a hope.

Step 3 — reproduce from memory. Without scrolling up, draw the cost-vs-signal Pareto diagram from section 5 and the five-metrics-on-100-chats table from section 4. Mark where each metric sits and which blind spot it has. If you can do this cold, you carry the chapter into the next launch review.

What you should remember¶

This chapter explained why no single metric is a quality metric. Three families exist (text-similarity, behaviour, product-outcome), and each one answers a different question. The refund chatbot scored 0.71 by ROUGE, 0.88 by BERTScore, 0.84 by judge, 0.62 by human SME, and 0.47 by deflection on the same 100 chats. The disagreements are not noise; each one diagnoses a specific blind spot in the metric above or below it.

You learned the load-bearing rule: a metric scores the dimension its math can see and is silent on every other dimension. The discipline is not to pick the best metric — that question has no answer. The discipline is to stack metrics from different families so their blind spots don't line up, run the cheap ones on every PR, the mid-cost ones nightly, the human-anchored ones weekly, and the product-outcome ones continuously. The rubric decides what acceptable means; the metric stack decides who got close enough. The spot check without metric-family awareness is the inspection that grades only one dimension and prints a confident number on the rest.

Carry this diagnostic forward: when somebody quotes a single metric number, ask three questions — "which family is this metric in?", "what is the worst answer that still passes it?", and "which metric from a different family is moving in the same direction?" If those three answers don't reconcile, you have just identified the next launch's surprise failure mode.

Remember:

Three families, three questions: looks like the gold? (similarity), did the right thing? (behaviour), user got unstuck? (outcome). No metric answers all three.
Every metric has a blind spot you can describe in one sentence. If you can't, you don't understand the metric.
ROUGE is a regression tripwire, not a quality bar. BLEU is honest only when the reference is canonical.
The judge-SME gap is your calibration debt. A widening gap means the judge is being Goodharted, not that the model improved.
Eval up or steady while deflection stays flat or falls is the rubric-decay signal. Fix the rubric, not the model.
A production metric stack runs cheap-on-PR, judge-nightly, human-weekly, outcome-continuously. Not one number — four, at different frequencies.

Bridge. The rubric judge appeared eight times in this chapter as the load-bearing mid-cost metric. We treated it like a black box that returns a number. That black box has its own failure modes (leniency, position bias, self-preference, length bias, prompt-injection vulnerability), and the next chapter opens it up. We solved which metrics to combine, but that raises the next question: how to trust the metric doing most of the work.

→ 06-llm-as-judge.md