02. Eval taxonomy: four axes, one decision per cell
~16 min read. The taxonomy is not vocabulary. It is a routing table. Pick the wrong axis and you measure a real thing very accurately, then ship the wrong system.
Builds on 01-shipping-on-vibes.md. The inspection is the habit; this chapter teaches which kind of inspection to reach for, because the wrong clipboard answers the wrong question with full confidence.
What chapter 01 proved and what now goes wrong one level down
Chapter 01 settled one fight. Demos answer "is the best case good?" and production needs "is the worst case acceptable?" — the refund chatbot scored 100% on five hand-picked prompts and 62% on 100 live ones, and the gap was a sampling failure, not a model failure. That argument is over. The inspection is now part of the team's vocabulary, and nobody on this team would ship a launch on five curated prompts again.
But "we now run evals" is the start of a new mistake, not the end of the old one. The team that just learned to sample honestly is about to make the same category error one level down. Somebody will say "our faithfulness score is 0.94, ship it," while the live thumbs-up rate quietly falls. Somebody else will say "the model beats GPT-4 on MMLU, upgrade," while the refund bot starts inventing exceptions on a slice the public benchmark never covered. Somebody will argue that a single pointwise rubric replaces user studies. Each of these is the chapter-01 disease wearing a lab coat: a measurement that is rigorous about itself and silent about the decision in the room.
This chapter teaches the routing table that prevents that. Evals split along four axes — offline vs online, single-turn vs trace-level, rule-based vs LLM-judge vs human, and capability vs behaviour vs product-outcome. Each axis answers a different question. Each cell of the resulting grid is a different decision. Confusing two cells reproduces chapter 01's category error in metric form. The threaded example is the same refund chatbot, walked through several cells so you can feel why one axis cannot stand in for another.
What this file solves
A team with a working sampling habit still has to choose which kind of eval to run before the next launch, the next prompt change, and the next on-call shift. This file gives you the four axes the choice is made along, walks the refund chatbot through six concrete checks across those axes, and shows the specific failure shape produced when an axis is collapsed — the offline-only blindness, the trace-level miss that single-turn checks cannot see, the LLM-judge bias rule-based assertions catch, and the capability-score win that the product-outcome dashboard refuses to celebrate. By the end you can label any proposed eval against the four axes and predict the decision it can and cannot make.
Why one number cannot decide three different questions
A team that has internalised chapter 01 still tends to ask "what is our eval score?" as if a single number existed. It does not. The refund chatbot is being judged at three different moments by three different audiences. Engineering wants to know whether a new prompt regressed yesterday's behaviour. Compliance wants to know whether the bot ever promises a refund the policy disallows. Product wants to know whether users actually got their problem solved this week.
Those three questions cannot share a number, because they do not share a sample, a unit, or a time-base. The engineering question runs on a frozen set of historical conversations, measured before the prompt ships. The compliance question runs on a curated adversarial set, measured continuously and treated as a hard gate. The product question runs on live traffic, with the user as the labeller, days after the answer was given. Forcing them into one dashboard cell produces the worst kind of false agreement — a green light that means three different things to three different people, none of which is "safe to ship."
The four axes exist because the choice of sample, scorer, unit, and outcome each independently determines what the resulting number is evidence of. The inspection is honest only when the eval's axes match the decision's axes.
Teacher voice. A taxonomy is not a vocabulary quiz. It is a routing table. When somebody proposes an eval, you should be able to label its four coordinates in one breath; if you cannot, you do not yet know what it would tell you on the day it disagrees with another number.
The naive collapse, the visible break, the diagnosis
The naive repair after chapter 01 is "let's pick one strong eval and run it everywhere." It feels disciplined. It is the same shape of mistake as the curated demo, one floor up. A faithfulness judge averaged over 200 conversations is genuinely useful for catching grounding regressions in a RAG step, and genuinely silent on whether the human agent inheriting the chat could continue the conversation. Both facts are true. A team that adopts faithfulness as the eval will stop noticing handoff failures around week three, because the green number tells them nothing is wrong.
The second naive repair is "let's compute every eval on every change." That is cost as theatre. Running an MMLU benchmark before a prompt tweak does not help; MMLU does not test your task. Running a full human review on a typo fix does not help; humans cost too much per chat for that cadence. Both moves spend budget without buying a decision.
Not a measurement quantity problem. Not a measurement quality problem. A measurement matching problem: each decision in the team's week needs an eval with a specific shape, and the shape comes from four independent choices. So the natural question becomes: "what are the four axes the eval choice is being made along, and which cell of the resulting grid does each decision actually live in?" The answer is the taxonomy that follows, and the test of whether you have understood it is whether you can label any eval somebody proposes in less than ten seconds.
When the refund bot is checked four different ways and three numbers disagree
Take the same refund chatbot from chapter 01, on the same week's traffic. Run four different evals against it. Watch the numbers diverge — not because any one is wrong, but because each is answering a different question.
SAME CHATBOT, SAME WEEK
Eval A — faithfulness judge on RAG step, 200 single-turn replies
offline | single-turn | LLM-judge | behaviour
score: 0.94
Eval B — policy-violation assertion suite, 500 adversarial prompts
offline | single-turn | rule-based | behaviour
score: 41/500 violations = 8.2% violation rate
Eval C — end-to-end resolution review, 80 full conversations
offline | trace-level | human | product-outcome
score: 62% (matches chapter 01's live sample)
Eval D — live "did this solve your issue?", 4200 production chats
online | trace-level | human | product-outcome
score: 54%
Eval A is genuinely useful and genuinely misleading at the same time. It is honest about the RAG step's grounding. It is silent about the roughly eight percent of cases where the bot cites a real clause from policy but applies it to the wrong order. That is not a faithfulness problem; it is an application problem. A team reading only A ships, then discovers in production what B would have blocked and what D will eventually report.
Eval B is brutal and narrow. It catches the eight percent of invented-exception failures cleanly, because the adversarial set explicitly contains "ask for a refund the policy disallows" prompts and the assertion checks one observable thing: did the bot's reply mention a refund eligibility that contradicts the policy doc? It cannot tell you whether tone is right. It cannot tell you whether the agent handoff works. It is a hard gate on one failure shape, not an eval of the whole product.
Eval C is the chapter-01 inspection in its mature form: humans walking through full traces, scoring each against a rubric that covers policy, handoff, and tone together. It is expensive — eighty conversations, multi-turn, real grading. It is also the only eval that catches the "polished tone but no account details for the human agent" failure, because that failure only exists at trace level. Single-turn checks miss it by construction.
Eval D is the user's vote. It is the truth, eventually, but with footnotes. It is biased toward users with the energy to click a thumb. It arrives days after the conversation. It cannot tell you why a chat was unresolved — only that it was. It is the lagging indicator that validates whether A, B, and C were measuring the right things.
The numbers disagree because the cells disagree. Reading the four together is a richer story than any one alone. Reading only one is the chapter-01 mistake one level down.
Mini-FAQ. "Why does Eval C match the chapter-01 number?" Because chapter 01's "100 live chats reviewed against a rubric" is exactly the offline / trace-level / human / product-outcome cell of this grid. The previous chapter showed one eval. This chapter shows the other five cells that change a decision around it.
The rule: each axis chooses a different kind of evidence
State it plainly. The four axes of an eval are not styles or preferences; each one independently chooses what kind of evidence the score is. Pick offline and you choose pre-launch evidence with no user harm. Pick trace-level and you choose evidence about the whole conversation instead of one reply. Pick rule-based and you choose evidence a machine can defend in a postmortem. Pick capability and you choose evidence about the model in isolation, not the product around it. The cell — the combination of all four — is the actual evidence type. A single-axis label is half a sentence.
This rule is the chapter's load-bearing truth. Every later section — the four axes, the worked classification, the table of failures, the operational signals, the boundary, the wrong mental model — is a different consequence of it.
Teacher voice. When somebody says "our eval score is X", the right next question is not "what is the threshold?" It is "in which cell?" Until that is answered, you cannot tell whether X is a green light, a warning, or noise.
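One way to make "in which cell?" a habit is to refuse to store a score without its four coordinates. The sketch below is a minimal illustration, not a prescribed schema: the class and enum names (`EvalCell`, `Timing`, `Unit`, `Scorer`, `Layer`) are assumptions introduced here, and the only data used is the label of Eval D from the refund example.

```python
from dataclasses import dataclass
from enum import Enum


class Timing(Enum):
    OFFLINE = "offline"
    ONLINE = "online"


class Unit(Enum):
    SINGLE_TURN = "single-turn"
    TRACE = "trace-level"


class Scorer(Enum):
    RULE = "rule-based"
    LLM_JUDGE = "LLM-judge"
    HUMAN = "human"


class Layer(Enum):
    CAPABILITY = "capability"
    BEHAVIOUR = "behaviour"
    PRODUCT_OUTCOME = "product-outcome"


@dataclass(frozen=True)
class EvalCell:
    """The four coordinates of one eval (hypothetical names, for illustration)."""
    timing: Timing
    unit: Unit
    scorer: Scorer
    layer: Layer

    def label(self) -> str:
        # Render the cell the same way the refund example writes it.
        return " | ".join(
            (self.timing.value, self.unit.value, self.scorer.value, self.layer.value)
        )


# Eval D from the refund example: live "did this solve your issue?" votes.
eval_d = EvalCell(Timing.ONLINE, Unit.TRACE, Scorer.HUMAN, Layer.PRODUCT_OUTCOME)
print(eval_d.label())  # online | trace-level | human | product-outcome
```

Because the dataclass is frozen and hashable, two evals with the same coordinates compare equal, which makes it easy to spot when a new eval lands in a cell that is already covered.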
1) Axis one: offline vs online, when the user is the test subject
The first axis splits by who pays the cost of a failure caught by this eval. Offline evals run before a change reaches users; the team pays in compute, dataset cost, and review time, and a failure means a launch is delayed. Online evals run on real traffic; the user pays first, and a failure means somebody got a wrong answer.
Offline buys prevention. It can be slow, expensive per item, and adversarial — you can run worst-case prompts the team invents on purpose, because no real user is being exposed. The cost is staleness: the offline distribution drifts away from live traffic, and an offline win can be an online loss when the test set no longer represents the user.
Online buys truth. It catches the failures the team did not imagine, because the live distribution does the imagining for you. The cost is exposure: a failure detected online is also a failure shipped. So online evals tend to be lighter (latency, thumbs-up rate, escalation rate, hard rule violations) and offline evals tend to be heavier (rubric review, judge ensembles, slice tables).
For the refund chatbot, Evals A, B, and C are offline. Eval D is online. A team that runs only offline ships clean against its own imagination and blind to the surprise the user always brings. A team that runs only online catches surprises but pays for them in customer churn. The inspection at maturity is both, with offline gating the deploy and online validating it survived contact with users.
2) Axis two: single-turn vs trace-level, when the failure spans more than one reply
The second axis is the one teams forget first. A reply is a unit. A conversation is a different unit, and many failures live only at the conversation level: a polite first reply followed by a contradictory second one, a tone shift mid-handoff, a missing account detail that only matters when the human agent picks up at turn five.
Single-turn evals score one prompt and one reply. They are easy to dataset, easy to score, easy to A/B, and structurally blind to anything that happens across turns. Eval A and B in the refund example are single-turn. The faithfulness judge looks at one reply; the policy assertion checks one reply. Both can pass on a conversation that overall fails.
Trace-level evals score the whole interaction: every turn, every tool call, every retrieval, every handoff signal. Eval C in the refund example is trace-level: a human reviewer reads all eighty conversations end to end, including the agent-handoff moment. This is where the chapter-01 "missing account details" failure was caught, and it could not have been caught by a single-turn eval no matter how strict, because the failure was "the agent did not get enough context across turns five and six," and turns five and six did not exist as a unit in any single-turn dataset.
Agent systems push this axis hard. A multi-tool agent has many internal turns the user never sees; trace-level evals are the only way to score whether the tool selection sequence made sense. This connects forward to module 25 on debugging agents — the same trace artifact that the eval scores is the trace the on-call engineer reads at 2 a.m.
| | SINGLE-TURN | TRACE-LEVEL |
|---|---|---|
| Unit | prompt → reply | prompt₁ → reply₁ → tool₁ → prompt₂ → reply₂ → handoff → ... |
| Scoring | score one reply against rubric | score whole conversation (policy, handoff, tone, tool use, resolution) |
| Catches | hallucination, policy violation, tone violation in that one reply | cross-turn contradiction, missing handoff context, tool misuse, unresolved escalation |
A callback worth making here: the inspection was introduced in chapter 01 as a sampling habit; this axis is the reminder that the sampled unit can be a turn or a trace, and the team must choose deliberately.
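The structural blindness of single-turn checks can be shown in a few lines. This is a toy sketch under stated assumptions: the `Turn` structure, the role names, the `ORD-1234` order-ID format, and both check functions are hypothetical stand-ins, not the refund bot's real schema or rubric.

```python
import re
from dataclasses import dataclass

ORDER_ID = re.compile(r"\bORD-\d{4}\b")  # hypothetical order-ID format


@dataclass
class Turn:
    role: str  # "user", "bot", or "handoff" (assumed role names)
    text: str


def reply_is_polite(turn: Turn) -> bool:
    """Single-turn stand-in: inspects one bot reply in isolation."""
    return "unfortunately you" not in turn.text.lower()


def handoff_has_context(trace: list[Turn]) -> bool:
    """Trace-level check: every order ID the user gave must reach the handoff note."""
    user_ids = {m for t in trace if t.role == "user" for m in ORDER_ID.findall(t.text)}
    handoffs = [t for t in trace if t.role == "handoff"]
    return all(user_ids <= set(ORDER_ID.findall(h.text)) for h in handoffs)


trace = [
    Turn("user", "I want a refund for ORD-1234."),
    Turn("bot", "I'm sorry to hear that. Let me connect you with an agent."),
    Turn("handoff", "Customer requests refund."),  # the order ID was dropped
]

every_reply_passes = all(reply_is_polite(t) for t in trace if t.role == "bot")
trace_passes = handoff_has_context(trace)
print(every_reply_passes, trace_passes)  # True False
```

Every individual reply passes, yet the conversation fails, because the property being checked ("the ID given at turn one survives to the handoff") is defined over more than one turn.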
3) Axis three: rule-based vs LLM-judge vs human (three scorers, three failure modes)
The third axis is who decides. Three scorers exist, and each is wrong in a different direction.
A rule-based scorer is a programmatic assertion: a regex, a JSON schema check, a numeric comparison, a "this string must appear" or "this string must not appear" test. It is cheap, deterministic, and defendable in a postmortem. It is also brittle — it cannot judge tone, salience, or whether an answer is "good enough." For the refund bot, rule-based works on policy-violation detection (does the reply mention an ineligible refund type?) and dies on tone-quality detection (is the reply rude in a way the customer will feel?).
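A minimal sketch of what such an assertion might look like for the refund bot. The ineligible categories, the regular expressions, and the sample replies are all invented for illustration; a real suite would derive its rules from the actual policy document.

```python
import re

# Hypothetical policy rule: these categories are never refundable.
INELIGIBLE = ("gift card", "final sale")
PROMISE = re.compile(r"\b(refund|reimburse)\w*", re.IGNORECASE)
NEGATION = re.compile(r"\b(cannot|can't|not eligible|unable to)\b", re.IGNORECASE)


def violates_policy(reply: str) -> bool:
    """True when a reply appears to promise a refund on an ineligible category."""
    mentions_ineligible = any(item in reply.lower() for item in INELIGIBLE)
    promises = PROMISE.search(reply) is not None
    negated = NEGATION.search(reply) is not None
    return mentions_ineligible and promises and not negated


def violation_rate(replies: list[str]) -> float:
    """Share of replies that trip the assertion."""
    return sum(violates_policy(r) for r in replies) / len(replies)


replies = [
    "Sure, I will refund your gift card purchase today.",    # violation
    "Gift cards are not eligible for a refund.",             # correct refusal
    "Your order qualifies for a refund within 30 days.",     # eligible item
    "We can reimburse the final sale item as an exception.",  # violation
]
print(violation_rate(replies))  # 0.5
```

The same properties the paragraph names are visible here: the check is deterministic and easy to explain in a postmortem, and it has no way to notice that a technically compliant reply is rude.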
An LLM-judge is another language model scoring the first model's output. It is cheap relative to humans, scales to thousands of items, and handles nuance rule-based scorers cannot. Its failure modes are documented and predictable: position bias (it favours the first option in a pair), length bias (it favours the longer answer), self-preference (it favours outputs from a similar model), and a tendency to be lenient on confident-sounding wrongness. Chapter 06 will return to judges in depth and chapter 08 to their calibration; for now, the taxonomy point is that LLM-judges are the middle scorer — better than rules for nuance, worse than humans for ground truth.
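Position bias in particular can be probed cheaply: ask the judge the same pairwise question twice with the answers swapped and see whether the verdict follows the answer or the slot. The sketch below assumes a pairwise judge exposed as a plain callable returning `"first"` or `"second"`; that signature and both stub judges are hypothetical, standing in for a real model call.

```python
from typing import Callable

# Assumed judge signature: (prompt, first_answer, second_answer) -> "first" | "second"
PairwiseJudge = Callable[[str, str, str], str]


def order_consistent(judge: PairwiseJudge, prompt: str, a: str, b: str) -> bool:
    """Query the judge in both orders; a position-neutral judge
    picks the same underlying answer each time."""
    winner_ab = a if judge(prompt, a, b) == "first" else b
    winner_ba = b if judge(prompt, b, a) == "first" else a
    return winner_ab == winner_ba


def always_first(prompt: str, first: str, second: str) -> str:
    """Stub judge with total position bias."""
    return "first"


def prefers_longer(prompt: str, first: str, second: str) -> str:
    """Stub judge with length bias but no position bias."""
    return "first" if len(first) > len(second) else "second"


question = "Is this order refundable?"
short, long_ = "No.", "No, because the item was marked final sale at checkout."
print(order_consistent(always_first, question, short, long_))    # False
print(order_consistent(prefers_longer, question, short, long_))  # True
```

Note what the second stub shows: passing the swap test rules out position bias only. A judge can be perfectly order-consistent and still favour the longer answer, which needs its own probe.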
A human scorer is a person, usually a trained labeller, reading the output against a rubric. Humans are the ground truth for ambiguous quality, the source of training data for judges, and the only scorers who reliably catch harms a rule cannot enumerate and a model cannot recognise. They are also slow, expensive, and inconsistent across labellers unless the rubric is anchored (which is chapter 07's problem).
The taxonomy point — who scores is independent of what is sampled and whether it is offline or online. The refund bot's policy-violation assertion is rule-based. The faithfulness check is LLM-judge. The end-to-end resolution review is human. All three are offline, all three score different cells, none of them substitutes for another.
Mini-FAQ. "Can we just replace humans with an LLM-judge if we calibrate it?" For some cells, yes — calibrated judges on well-anchored rubrics can reach 80-90% agreement with humans on narrow tasks. For nuance, policy edge cases, and brand-voice rubrics, humans remain the ground truth, and the judge's job is to triage to humans, not to replace them. Replacing the human entirely is the route to the eval/CSAT divergence from chapter 01.
4) Axis four: capability vs behaviour vs product-outcome (three layers, three audiences)
The last axis is the one that catches benchmark-driven decisions in the act. It asks what the measurement is about: the model in isolation, the model inside your product, or the user's outcome.
Capability evals measure the model in isolation: MMLU, HumanEval, GSM8K, instruction-following benchmarks. They are useful for broad positioning and vendor selection. They tell you almost nothing about your task. A model can win on MMLU and lose on refund-policy reasoning because the slices are unrelated. Capability is the wrong evidence for product decisions and the right evidence for "is this model worth piloting at all?" decisions.
Behaviour evals measure how the model behaves inside your product: does it follow your prompts, your tools, your policy, your tone, your handoff protocol? The policy-violation assertion suite is a behaviour eval. So is the faithfulness check on your specific RAG pipeline. Behaviour evals are the engineering team's daily instrument — they catch prompt regressions, schema drift, and policy slips before users see them.
Product-outcome evals measure whether the user actually got what they came for: did the refund get processed, did the issue get resolved, did the user come back tomorrow? These are the lagging indicators — CSAT, resolution rate, deflection rate, retention. They are what the business actually cares about. They are also slow, noisy, and confounded by everything that is not the AI.
Eval A and B are behaviour. Eval C and D are product-outcome. None of the four is capability — which is exactly why a +12% MMLU vendor claim does not move them. The rubric from chapter 00 belongs at behaviour level for engineering and at product-outcome level for the user.
| Layer | Audience | Refund bot example |
|---|---|---|
| capability | vendor selection | MMLU, HumanEval |
| behaviour | engineering | faithfulness 0.94, violations 8.2% |
| product-outcome | product/business | resolution 62% offline, 54% online |
A team that confuses these layers ships on the wrong evidence in a predictable direction. Engineering will over-weight behaviour, product will over-weight outcome, leadership will over-weight capability ("the new model beats GPT-4!"), and the slice that matters most slips through the gaps between them. The shift change from the ELI5 — the deploy — is the moment all three layers must agree, not just one.
5) The cell-by-cell decision: six checks, six different roles
Here is the refund chatbot's full eval slate, mapped to all four axes. The exercise is not to memorise the table; it is to feel why each row exists.
| Check | Offline/Online | Turn/Trace | Scorer | Layer | Decision it makes |
|---|---|---|---|---|---|
| ROUGE on 500 saved summaries | offline | turn | rule | behaviour | regression on wording overlap |
| Faithfulness LLM-judge | offline | turn | LLM | behaviour | RAG grounding gate |
| Policy-violation assertions | offline | turn | rule | behaviour | hard launch gate |
| End-to-end human review | offline | trace | human | product | launch readiness |
| Thumbs-up rate | online | trace | human | product | live quality signal |
| Latency p95 | online | turn | rule | behaviour | experience SLA |
Six checks, six different decisions. No two are interchangeable, even though ROUGE and the policy assertions share the same four coordinates: they sit in one cell but gate different failure shapes, so the slate covers five distinct cells. Drop the policy-assertion row and adversarial misses ship. Drop the trace-level review and handoff failures ship. Drop latency and a correct-but-slow bot ships and infuriates users anyway. Drop thumbs-up and the team has no online ground truth to calibrate the offline numbers against. This is the routing table. The decision being made selects the row, and the row dictates the eval shape.
This is also where the spot check and the kitchen log from the ELI5 land: the spot check is the offline trace-level human review (Eval C), and the kitchen log is what makes online trace-level evals (Eval D and its diagnostic siblings in chapter 11) possible — without logging, online traces cannot be reconstructed and re-scored.
6) Why this instead of a single unified score
The plausible alternative is to roll everything into one composite — weight faithfulness, behaviour, and outcome together, produce one quality KPI, and ship on it. It looks elegant. It dies under workload.
Composite scores hide the cell. A 0.81 composite cannot tell you whether the policy-violation rate doubled and the latency improved, or whether faithfulness fell and resolution rose. The actions for those two cases are opposite. A single-number eval is a single point of failure for judgment — the same Goodhart trap chapter 01 warned about, structurally guaranteed by averaging across unlike axes.
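The arithmetic behind that claim is easy to reproduce. In the sketch below every number is hypothetical, picked only so that two opposite situations land on the same composite; the equal weights and metric names are assumptions as well.

```python
# All weights and scores are hypothetical, chosen to make the arithmetic visible.
WEIGHTS = {"faithfulness": 0.25, "policy_pass": 0.25, "latency_ok": 0.25, "resolution": 0.25}


def composite(scores: dict[str, float]) -> float:
    """Weighted average of per-cell scores, rounded to two decimals."""
    return round(sum(WEIGHTS[name] * scores[name] for name in WEIGHTS), 2)


baseline = {"faithfulness": 0.94, "policy_pass": 0.92, "latency_ok": 0.80, "resolution": 0.58}

# Case 1: policy violations double (8% -> 16%) while latency improves.
case_1 = {**baseline, "policy_pass": 0.84, "latency_ok": 0.88}

# Case 2: faithfulness falls while resolution rises.
case_2 = {**baseline, "faithfulness": 0.84, "resolution": 0.68}

for name, scores in [("baseline", baseline), ("case 1", case_1), ("case 2", case_2)]:
    print(name, composite(scores))  # each line prints 0.81
```

All three rows print 0.81. Case 1 calls for blocking the launch on the policy gate; case 2 calls for investigating the RAG step. The composite cannot distinguish either from the baseline.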
The right alternative to a composite is a slice-axis-layer table with each cell visible. On launch review day, the team reads all six rows and can answer the three audiences' questions independently: engineering on behaviour, compliance on assertion violations, product on resolution. The table is more cognitive load than a single number and orders of magnitude more decision-useful.
Cost in plain terms: a composite is one number and zero diagnoses. A six-row table is six numbers and six diagnoses. The cost difference between them is reading six numbers instead of one. The difference in payoff is every actionable insight in the chapter. This is the kind of asymmetry where the cheaper-looking option is the more expensive one in practice.
7) Operational signals: what tells the team the taxonomy is being honoured
A healthy team's launch review opens with the cell map, not a single chart. Each of the four axes has at least one row in the review deck. When a number moves, the conversation immediately routes to the right owner — behaviour regressions to engineering, outcome regressions to product, capability movements to model selection.
The first signal of degradation is axis collapse: the team starts quoting one cell as if it were the whole grid. "Faithfulness is 0.94, we're fine" is the classic. The second signal is the slow drift of one cell winning the review's attention budget — usually whichever cell has the prettiest dashboard. The third is the appearance of a composite score on the leadership deck, which always feels like progress and is always the beginning of axis collapse.
The metric a beginner watches first is the offline single-turn LLM-judge score, because it is the cheapest to compute and the easiest to dashboard. The metric an experienced team watches first is the delta between offline and online product-outcome on the same week's traffic — that delta, when it widens, is the early signal that the offline distribution has drifted from live and the rest of the offline numbers are starting to lie. The graph an expert opens before any other is the four-axis heatmap: rows are evals, columns are slices, cells are pass rates. One screen, all four axes visible, no axis allowed to hide.
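The offline/online delta is a one-line computation, which is part of why it makes a good first alarm. The sketch below uses the two product-outcome scores from the refund example (62% offline, 54% online); the five-point alert threshold is an assumption for illustration, not a recommended value.

```python
def offline_online_gap(offline_rate: float, online_rate: float) -> float:
    """Gap in percentage points between offline and online product-outcome."""
    return round((offline_rate - online_rate) * 100, 1)


def offline_numbers_suspect(gap_points: float, max_gap_points: float = 5.0) -> bool:
    """Hypothetical alert rule: a wide gap suggests the offline set has drifted."""
    return abs(gap_points) > max_gap_points


gap = offline_online_gap(0.62, 0.54)  # Eval C vs Eval D from the refund example
print(gap, offline_numbers_suspect(gap))  # 8.0 True
```

A single week's gap is only a snapshot. The signal described above is the gap widening over successive weeks, so in practice this value would be tracked as a time series rather than checked once.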
Teacher voice. A dashboard with one cell per metric is a dashboard the team can defend. A dashboard with a composite KPI is a dashboard nobody can debug.
8) Where the taxonomy holds, where it breaks
The four-axis frame holds for any product where the AI is acting in a structured workflow — chatbots, agents, RAG systems, copilots, search, classification. In those settings, the cells map cleanly onto the team's daily decisions, and the routing table works.
The frame degrades for genuinely open-ended creative tasks where "is this output good?" has no rubric a labeller could anchor. Image generation aesthetic quality is one example; long-form creative writing is another. There, the scorer axis collapses toward humans-only, the layer axis collapses toward product-outcome, and the value of the four-axis frame drops because two of the four axes no longer offer a real choice and most of the grid is left unpopulated.
The pathology — when the taxonomy itself becomes the problem — appears when teams add eval cells faster than they add decisions. A grid with thirty cells and four decisions to make is a bureaucracy, not an instrument. The rule for sizing: each cell in the dashboard must correspond to a specific decision somebody on the team owns. If a cell does not change anyone's behaviour, it is cost without information, and the next dashboard refactor should drop it.
At extreme scale — millions of conversations a day — the trace-level human cell becomes the bottleneck. Teams solve this by sampling traces stratified by failure-likelihood and routing only the suspect ones to humans, using LLM-judges and rules as the triage layer. Chapter 06 returns to this stratified pipeline.
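A rough sketch of that triage step, assuming each trace already carries a rule-based flag and an LLM-judge score. The `ScoredTrace` fields, the 0.7 threshold, and the 5% baseline rate are hypothetical. The small random slice of "clean" traces is an addition of this sketch rather than something the text prescribes; it is there so that the triage layer itself keeps being audited by humans.

```python
import random
from dataclasses import dataclass


@dataclass
class ScoredTrace:
    trace_id: str
    rule_violation: bool  # output of the rule-based layer
    judge_score: float    # output of the LLM-judge layer, 0.0 to 1.0


def select_for_human_review(traces, judge_threshold=0.7, baseline_rate=0.05, seed=0):
    """Route every suspect trace to humans, plus a small random slice of the rest."""
    def is_suspect(t):
        return t.rule_violation or t.judge_score < judge_threshold

    suspect = [t for t in traces if is_suspect(t)]
    clean = [t for t in traces if not is_suspect(t)]
    rng = random.Random(seed)  # seeded so the audit sample is reproducible
    k = min(len(clean), max(1, round(len(clean) * baseline_rate))) if clean else 0
    return suspect + rng.sample(clean, k)


# Toy data: 4 rule violations and 10 low judge scores among 100 traces.
traces = [
    ScoredTrace(f"t{i}", rule_violation=(i % 25 == 0), judge_score=0.5 if i % 10 == 1 else 0.9)
    for i in range(100)
]
queue = select_for_human_review(traces)
print(len(queue))  # 18: 14 suspect traces plus 4 sampled from the 86 clean ones
```

The human cell shrinks from 100 traces to 18 here, and the saving grows with volume, but the quality of the result is bounded by how well the rules and the judge identify suspects in the first place.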
9) Wrong model: "more evals is more rigour"
The seductive mental model after this chapter is "the more evals we run, the more rigorous we are." This is half right and dangerously half wrong. Adding evals that all live in the same cell adds cost without adding evidence. A team that runs five different offline single-turn rule-based behaviour checks has one cell of evidence measured five ways. Adding a sixth check in the same cell does not protect against a trace-level failure.
The correct mental model is "rigour is axis coverage, not eval count." One eval in each of six distinct cells beats six evals in one cell, every time. The diagnostic for axis coverage is the four-axis-checklist: when a launch decision is being made, can the team point to at least one passing eval that covers each axis the decision depends on? If yes, the launch is rigorously evidenced. If no, adding another eval in an already-covered cell does not fix it.
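The checklist can be mechanised once each eval carries its four coordinates. The sketch below is one possible shape: the axis names, the coordinate vocabulary (borrowed from the section 5 table), and the `launch_needs` requirement are assumptions made for illustration.

```python
AXES = ("timing", "unit", "scorer", "layer")

Coords = tuple[str, str, str, str]  # one value per axis, in AXES order


def covered_values(evals: dict[str, Coords]) -> dict[str, set[str]]:
    """For each axis, the set of values at least one eval covers."""
    return {axis: {coords[i] for coords in evals.values()} for i, axis in enumerate(AXES)}


def missing_coverage(evals: dict[str, Coords], required: dict[str, set[str]]) -> dict[str, list[str]]:
    """Axis values the decision depends on that no eval covers."""
    covered = covered_values(evals)
    gaps = {axis: sorted(needed - covered[axis]) for axis, needed in required.items()}
    return {axis: values for axis, values in gaps.items() if values}


# Five checks that all live in the same cell.
same_cell: Coords = ("offline", "turn", "rule", "behaviour")
five_checks = {f"check_{i}": same_cell for i in range(1, 6)}

# Hypothetical launch decision that depends on both timings and both units.
launch_needs = {"timing": {"offline", "online"}, "unit": {"turn", "trace"}}

print(len(set(five_checks.values())))               # 1 distinct cell
print(missing_coverage(five_checks, launch_needs))  # {'timing': ['online'], 'unit': ['trace']}
```

Adding a sixth check with the same coordinates leaves both printed results unchanged, which is the point of the paragraph above: eval count went up and coverage did not.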
This is the chapter-01 disease one level deeper. Vibes-on-five-prompts is one-cell evidence. A single faithfulness judge is also one-cell evidence — just a more sophisticated cell. The cure is the same in both cases: broaden the sample, broaden the cell, until the evidence shape matches the decision shape.
10) Six recurring failure shapes a single-axis eval program produces
- The faithfulness mirage. Offline single-turn LLM-judge looks great. Trace-level reviews show contradictions across turns that single-turn checks cannot see.
- The benchmark upgrade trap. Capability score on a public benchmark improves; product-outcome resolution on your task regresses because the benchmark slice did not resemble your task.
- The composite KPI collapse. A weighted score rolls behaviour and product-outcome together; one halves while the other doubles; the composite barely moves; the team ships the half that halved.
- The offline-only blindness. Adversarial offline suite passes; live thumbs-up falls. The offline distribution drifted, and no online cell existed to catch it.
- The online-only fire-fighting. No offline gate exists; every regression is discovered in production by users; the team's job becomes incident response, not engineering.
- The trace-blind agent. Single-turn evals say every reply is fine; the multi-step agent fails at handoff or tool selection because the failure only exists at trace level.
Each of these is a specific failure to populate one of the four axes. Each disappears when the cell map is the launch review's opening artifact.
11) Cross-topic reinforcement: where these axes recur
- Same failure shape, different module. The "composite hides the slice" failure mode is structurally identical to module 13's "fluent-but-ungrounded" RAG failure: one signal is high, another is low, the team reads only the high one. Both are axis-collapse pathologies.
- Pressure carried forward. Chapter 03 (golden datasets) and chapter 04 (synthetic generation) are mechanisms for populating specific cells of this grid — golden sets feed offline trace-level human evals; synthetic generation feeds offline single-turn behaviour evals. The cells decide which dataset shape is needed.
- Invariant restated. Chapter 09 (drift detection) is this chapter's online cells under time pressure: when does the offline-online delta widen enough to invalidate the offline numbers? That diagnostic is impossible to ask until the cells are named.
- Cross-module echo. Module 25 (debugging agents) consumes the trace-level cell's artifact directly — the trace the eval scored is the trace the on-call reads. Same artifact, different downstream use, same axis discipline.
Where this lives in the wild
The four axes show up under different names across products, but the cells always exist.
- Intercom Fin — offline trace-level human review of sampled tickets gates each model swap; deflection-rate is the online product-outcome row that validates the offline number survived contact with users.
- GitHub Copilot Chat — capability-axis pass@k on held-out repos lives next to behaviour-axis tool-call success rate; the two are reported separately because they answer different decisions.
- Cursor — offline code-task evals (behaviour) plus online acceptance rate (product-outcome) plus latency p95 (behaviour, online); three cells, three owners.
- Anthropic model cards — explicitly publish capability evals (MMLU, GPQA) and refuse to claim they predict product-outcome on customer tasks; the customer must populate the behaviour and outcome cells themselves.
- OpenAI Evals platform — the product is a way to define and run evals across cells; the platform is axis-agnostic by design because customers' cell needs differ.
- Perplexity — citation-accuracy is a behaviour-axis rule-based gate at the trace level; CSAT is the online product-outcome cell; the two are read together at release review.
- Harvey — BigLaw associate review is the offline trace-level human cell; capability benchmarks are explicitly demoted because legal partners distrust them as task-irrelevant.
- Khanmigo — human rubric review for pedagogical nuance (offline trace human) sits alongside automated factual checks (offline turn rule); the two cells protect different failure modes.
- Duolingo Max — pass rate sliced by CEFR level is an offline trace-level human eval; the slicing is mandatory because aggregation across levels hides collapsed cells.
- LangSmith / LangFuse / Braintrust — eval platforms that let you define cells across axes; the existence of these products tells you cell-design is the bottleneck most teams hit.
- Arize Phoenix — traces are the artifact, evals are layered on top; offline and online trace-level cells share infrastructure deliberately.
- Galileo / Patronus AI / Vectara HHEM — judge-axis products that scale the LLM-judge cell; their value is that they replace a human cell at lower cost on bounded behaviour rubrics.
- promptfoo — local rule-based behaviour-eval runner; populates the offline turn rule-based cell extremely cheaply and stops there by design.
- Glean — nDCG (offline behaviour) plus CTR (online product-outcome) plus drift dashboards (online behaviour) populate three cells; the team learned that nDCG-only is the axis-collapse trap.
- Stripe Radar — production sampling of real transactions is the online behaviour cell; synthetic adversarial cases are the offline behaviour cell; the two cover different fraud shapes.
- Notion AI — offline golden-set behaviour evals at release; online usage signals at product-outcome layer; both review weekly.
- Salesforce Einstein Copilot — adversarial offline assertions (prompt injection, false-premise) are a rule-based behaviour cell; CRM trust layer audits are a separate compliance gate.
- AWS Bedrock Knowledge Bases — observability product specifically markets retrieval-failure analysis at trace level because customers asked for the diagnostic, not the aggregate.
- Air Canada (2024) — the chatbot's invented refund policy passed every single-turn faithfulness check the team had; the missing cell was offline behaviour-axis policy-violation assertion. The cell-map gap is what shipped.
- Casetext CoCounsel — citation accuracy became a launch-blocking rule-based behaviour eval after Mata v. Avianca; the demo was always polished, but the cell that mattered did not exist until forced into existence by a public failure.
- Microsoft Copilot for M365 — Graph-aware behaviour eval injects org context to reduce vague-query failures; the dashboard separates capability, behaviour, and outcome explicitly.
- Vercel AI SDK eval helpers — schema-conformance rules and judge-based scoring live as separate eval types in the SDK because they populate different cells of the same grid.
- Berkeley Function-Calling Leaderboard (BFCL) — capability-axis benchmark for tool use; usable for vendor selection, explicitly not for product-outcome.
- Pydantic AI eval module — structured behaviour evals over typed tool outputs; rule-based, single-turn, offline by default.
- Slack AI — channel-summary eval set deliberately includes long-tail channel types because aggregation across channel types was hiding collapsed cells.
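Several entries above (Duolingo Max, Slack AI) turn on the same mechanic: an aggregate pass rate can look healthy while one slice has collapsed. The following is a minimal sketch of that mechanic, assuming a hypothetical list of `(slice, passed)` records. The slice names and counts are invented for illustration and are not taken from any of those products.

```python
from collections import defaultdict

def pass_rate_by_slice(records):
    """Return {slice_name: pass_rate} for (slice_name, passed) records."""
    totals = defaultdict(int)
    passes = defaultdict(int)
    for slice_name, passed in records:
        totals[slice_name] += 1
        passes[slice_name] += int(passed)
    return {name: passes[name] / totals[name] for name in totals}

# Hypothetical data: 90 items in an easy slice, 10 in a hard slice.
records = (
    [("A1", True)] * 88 + [("A1", False)] * 2
    + [("C1", True)] * 3 + [("C1", False)] * 7
)

overall = sum(passed for _, passed in records) / len(records)
by_slice = pass_rate_by_slice(records)

print(f"overall: {overall:.2f}")
for name, rate in sorted(by_slice.items()):
    print(f"{name}: {rate:.2f}")
```

With this invented data the aggregate reads 0.91 while the `C1` slice sits at 0.30, which is the collapsed cell the aggregate hides.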
Recall — can you label evals on the four axes cold?
- Name the four axes and state, in one sentence each, what each axis decides about the resulting evidence.
- In the refund-bot table, why does Eval A score 0.94 while Eval C scores 62% in the same week?
- Why is a single-turn eval structurally blind to handoff failures?
- State the chapter's load-bearing rule about what one eval cell is and is not evidence of.
- Give the cell coordinates of an MMLU score. Why is that cell almost never the right one for a product launch decision?
- What is the failure shape called axis collapse, and what is its earliest visible symptom?
- Why is a composite KPI considered an anti-pattern by this chapter when it feels like progress?
- Give two real conditions under which the four-axis frame degrades.
Interview Q&A
Q1. You inherit a project where the only eval is a 0.94 faithfulness LLM-judge. The team says "we're rigorous." What's your read?
A. One-cell evidence. The eval lives in offline / single-turn / LLM-judge / behaviour. It is silent on trace-level failures, on hard policy assertions, and on product-outcome. The team has rigour in one cell and zero coverage of the rest. The action is not to argue with the 0.94; it is to fill at least three of the missing cells: an offline trace-level human review for product-outcome, a rule-based behaviour assertion suite for policy, and an online product-outcome signal for ground truth. Common wrong answer to avoid: "0.94 is high, the team is fine."
Q2. Your model vendor claims +12% on MMLU. The team is about to upgrade. Cell map says what?
A. MMLU lives in offline / single-turn / rule-based / capability. The product-outcome cell is empty for this upgrade decision. The right move is to run the new model on your behaviour cell (refund-policy assertions on your prompts) and your product-outcome cell (offline trace-level human review on your week's traffic) before the swap. Capability lifts do not predict behaviour or outcome on a specific task. Common wrong answer to avoid: "+12% capability beats current, ship the upgrade."
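One way to make that answer mechanical is to list which evidence types exist for the candidate model before the swap. This is a sketch only: `REQUIRED_TARGETS` and `swap_evidence_gaps` are hypothetical names, and the rule that a swap needs behaviour and product-outcome evidence is this chapter's argument expressed as an assumption, not a standard.

```python
# Assumption: a model swap must be defended on these two target-axis values.
REQUIRED_TARGETS = {"behaviour", "product"}

def swap_evidence_gaps(results):
    """Given (eval_name, target_axis) pairs for the candidate model,
    return the required target-axis values that have no eval at all."""
    covered = {target for _, target in results}
    return sorted(REQUIRED_TARGETS - covered)

# The vendor claim is the only evidence so far.
candidate_results = [("MMLU", "capability")]
gaps = swap_evidence_gaps(candidate_results)
print(gaps)  # ['behaviour', 'product']
```

A non-empty result means the upgrade decision is being made on an empty cell, however large the capability lift.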
Q3. Why split rule-based vs LLM-judge vs human as three scorers instead of two?
A. Each has a distinct failure mode that the other two do not. Rules miss nuance and cannot judge tone. Judges have position, length, and self-preference biases. Humans are the ground truth but slow, expensive, and inconsistent without rubric anchoring. Collapsing judges into "automated" hides the bias profile; collapsing rules into "automated" hides the brittleness. Three named scorers force the choice to be explicit. Common wrong answer to avoid: "Judges are just cheaper humans."
Q4. The launch review shows faithfulness 0.94, policy-violation 8.2%, resolution 62%. PM says "average is 67%, ship." Cumulative diagnosis — what's the failure?
A. Composite-KPI collapse. The three numbers live in different cells and cannot be averaged: they measure different things, and the violation rate is lower-is-better while the other two are higher-is-better. The 8.2% violation rate is a compliance-grade hard gate; on a refund product it likely means regulatory exposure. The 62% resolution is a product-outcome regression that connects directly to chapter 01's slice analysis. The 0.94 faithfulness, in this context, is the misleading-high cell that makes the average look acceptable. The action is to read the three cells separately, escalate the violation rate, and refuse the average as a decision input. Common wrong answer to avoid: "67% average passes the bar."
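Reading the cells separately can be sketched as one gate per cell, each with its own direction and threshold. The three scores are the ones quoted in the question; the thresholds in `GATES` are hypothetical placeholders, since the chapter does not specify them.

```python
# Hypothetical gates: (direction, threshold). "min" means higher is better,
# "max" means lower is better. These values are assumptions for illustration.
GATES = {
    "faithfulness": ("min", 0.90),
    "policy_violation_rate": ("max", 0.0),
    "resolution_rate": ("min", 0.80),
}

def read_cells(scores):
    """Check each cell against its own gate; return the names that fail."""
    failures = []
    for name, (direction, threshold) in GATES.items():
        value = scores[name]
        ok = value >= threshold if direction == "min" else value <= threshold
        if not ok:
            failures.append(name)
    return failures

scores = {
    "faithfulness": 0.94,
    "policy_violation_rate": 0.082,
    "resolution_rate": 0.62,
}
failures = read_cells(scores)
print("ship" if not failures else f"blocked by: {failures}")
```

No averaging step exists in this sketch by design: a passing faithfulness score has no way to offset a failing compliance gate.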
Q5. A teammate proposes adding three more faithfulness checks to "increase rigour." What do you say?
A. Three more evals at the same coordinates are still one cell of evidence, just measured more often. The marginal information is near zero; the marginal cost is real. The leveraged move is to add an eval in a missing cell: a trace-level human review if no offline product-outcome eval exists, or a rule-based assertion suite if no policy gate exists. Rigour is axis coverage, not eval count. Common wrong answer to avoid: "More evals always means more rigour."
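The difference between eval count and coverage can be shown by counting distinct coordinates. This is a small sketch; the eval names are hypothetical.

```python
def distinct_cells(evals):
    """Count distinct cell coordinates among (name, cell) pairs."""
    return len({cell for _, cell in evals})

judge_cell = ("offline", "turn", "llm-judge", "behaviour")
current = [("faithfulness", judge_cell)]

# Option 1: three more faithfulness checks at the same coordinates.
more_of_the_same = current + [
    (f"faithfulness_v{i}", judge_cell) for i in (2, 3, 4)
]
# Option 2: one eval in a cell that is currently empty.
one_new_cell = current + [
    ("trace review", ("offline", "trace", "human", "product")),
]

print(len(more_of_the_same), distinct_cells(more_of_the_same))  # 4 1
print(len(one_new_cell), distinct_cells(one_new_cell))          # 2 2
```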
Q6. Cumulative — chapter 01 used "the inspection" as a sampling habit. How does this chapter change that?
A. Chapter 01 established that a single sample drawn from live traffic against a rubric beats a curated demo. This chapter adds that the rubric and the sample shape are not unique — there are four axes along which both can vary, and each cell is a different evidence type. The chapter-01 inspection was implicitly the offline / trace-level / human / product-outcome cell. The cell was correct for the launch decision in chapter 01. The mistake this chapter prevents is treating that one cell as universally sufficient, the way chapter 01 dismantled the curated demo. Common wrong answer to avoid: "Chapter 01's inspection was already the complete eval; this chapter just adds vocabulary."
Q7. Latency p95 is degraded but faithfulness is up. Which cells, and how do you reason?
A. Latency is online / single-turn / rule-based / behaviour. Faithfulness is offline / single-turn / LLM-judge / behaviour. The two cells agree on two of the four axes (both are single-turn, both are behaviour) and differ on online/offline and on scorer. The team has improved one behaviour signal and regressed another. The product-outcome cell decides: if online resolution rate held, users are absorbing the latency hit without churning; if it dropped, the faithfulness gain was bought at user cost. The four-axis frame routes the question to the right next dashboard automatically. Common wrong answer to avoid: "Faithfulness up means quality up, ignore latency."
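Comparing two cells axis by axis is easy to do in code, which helps when the coordinates are argued over in a review. In this sketch the field names `timing`, `unit`, `scorer`, and `target` are labels chosen here for the four axes, not terms defined by the chapter.

```python
from typing import NamedTuple

class Cell(NamedTuple):
    timing: str  # "offline" | "online"
    unit: str    # "turn" | "trace"
    scorer: str  # "rule" | "llm-judge" | "human"
    target: str  # "capability" | "behaviour" | "product"

def compare(a: Cell, b: Cell):
    """Split the four axes into those where two cells agree and those where they differ."""
    shared = [f for f in Cell._fields if getattr(a, f) == getattr(b, f)]
    differing = [f for f in Cell._fields if getattr(a, f) != getattr(b, f)]
    return shared, differing

latency = Cell("online", "turn", "rule", "behaviour")
faithfulness = Cell("offline", "turn", "llm-judge", "behaviour")

shared, differing = compare(latency, faithfulness)
print("shared:", shared)
print("differing:", differing)
```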
Q8. Define an eval cell in one sentence a stranger could grade against.
A. "An eval cell is the specific combination of offline-vs-online, single-turn-vs-trace, scorer type, and capability-vs-behaviour-vs-outcome that determines what evidence a score is, and therefore which decision the score can defend." That sentence forces all four axes to be present. "An eval is a measurement of quality" is not a definition; it is a label, and label-only evals are how the chapter-01 disease comes back. Common wrong answer to avoid: "An eval cell is any quality measurement."
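The definition above translates directly into a type that cannot be constructed with an axis missing. This is a sketch under assumptions: the class names and the `parse_cell` helper are hypothetical, and the string values follow the shorthand used in this chapter's tables.

```python
from dataclasses import dataclass
from enum import Enum

class Timing(Enum):
    OFFLINE = "offline"
    ONLINE = "online"

class Unit(Enum):
    TURN = "turn"
    TRACE = "trace"

class Scorer(Enum):
    RULE = "rule"
    LLM_JUDGE = "llm-judge"
    HUMAN = "human"

class Target(Enum):
    CAPABILITY = "capability"
    BEHAVIOUR = "behaviour"
    PRODUCT = "product"

@dataclass(frozen=True)
class EvalCell:
    timing: Timing
    unit: Unit
    scorer: Scorer
    target: Target

def parse_cell(label: str) -> EvalCell:
    """Parse 'offline / turn / rule / behaviour'; reject labels missing an axis."""
    parts = [p.strip().lower() for p in label.split("/")]
    if len(parts) != 4:
        raise ValueError(f"expected four axes, got {len(parts)}: {label!r}")
    timing, unit, scorer, target = parts
    # Each Enum constructor raises ValueError on an unknown value.
    return EvalCell(Timing(timing), Unit(unit), Scorer(scorer), Target(target))

cell = parse_cell("offline / turn / LLM-judge / behaviour")
print(cell.scorer.name)  # LLM_JUDGE
```

A label such as `"offline / behaviour"` raises `ValueError`, which is the code-level version of refusing a number that cannot name all four axes.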
Apply now (10 min)
Step 1 — model the exercise. Take the refund chatbot's six checks from section 5. Here is the cell map I would put on the launch review's first slide:
| Decision being made | Cell coordinates | Which check |
|---|---|---|
| Launch gate (regression) | offline / turn / rule / behaviour | ROUGE on saved set |
| RAG grounding gate | offline / turn / LLM-judge / behaviour | Faithfulness |
| Hard compliance gate | offline / turn / rule / behaviour | Policy-violation assertions |
| Launch readiness | offline / trace / human / product | End-to-end review |
| Live quality signal | online / trace / human / product | Thumbs-up |
| Experience SLA | online / turn / rule / behaviour | Latency p95 |
Notice each row owns exactly one decision. Notice no row substitutes for another. Notice three different audiences read three different rows first.
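The same table can be kept as data so that the "one decision per row" property is checked rather than assumed. A minimal sketch, with the rows copied from the table above and the tuple layout chosen here for illustration:

```python
from collections import Counter

# (decision, cell coordinates, check), copied from the cell map above.
CELL_MAP = [
    ("Launch gate (regression)", ("offline", "turn", "rule", "behaviour"), "ROUGE on saved set"),
    ("RAG grounding gate", ("offline", "turn", "llm-judge", "behaviour"), "Faithfulness"),
    ("Hard compliance gate", ("offline", "turn", "rule", "behaviour"), "Policy-violation assertions"),
    ("Launch readiness", ("offline", "trace", "human", "product"), "End-to-end review"),
    ("Live quality signal", ("online", "trace", "human", "product"), "Thumbs-up"),
    ("Experience SLA", ("online", "turn", "rule", "behaviour"), "Latency p95"),
]

# Every decision should appear on exactly one row.
decision_counts = Counter(decision for decision, _, _ in CELL_MAP)
duplicate_decisions = [d for d, n in decision_counts.items() if n > 1]

# How many checks sit at each set of coordinates.
cell_counts = Counter(cell for _, cell, _ in CELL_MAP)

print("duplicate decisions:", duplicate_decisions)
for cell, n in cell_counts.items():
    print(" / ".join(cell), "->", n, "check(s)")
```

Two rows share coordinates (ROUGE and the policy-violation assertions are both offline / turn / rule / behaviour), so the six checks cover five distinct coordinate sets. They remain separate rows because they defend different decisions.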
Step 2 — your turn. Take one AI feature in your own product. Write the six (or fewer) decisions your team actually makes about it across a quarter — launch, prompt change, model swap, on-call response, weekly review, exec readout. For each decision, write the cell coordinates of the eval that should defend that decision. Mark any decision currently being made on the wrong cell, or on no cell.
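If a worksheet helps, the marking step can be sketched as a small audit. Everything in `worksheet` is hypothetical example data; replace it with your own decisions and coordinates.

```python
# Hypothetical worksheet: decision -> (cell the decision needs, cell it is made on today).
# None means no eval currently backs the decision.
worksheet = {
    "model swap": (
        ("offline", "trace", "human", "product"),
        ("offline", "turn", "rule", "capability"),
    ),
    "prompt change": (
        ("offline", "turn", "rule", "behaviour"),
        ("offline", "turn", "rule", "behaviour"),
    ),
    "exec readout": (
        ("online", "trace", "human", "product"),
        None,
    ),
}

def audit(sheet):
    """Label each decision 'ok', 'wrong cell', or 'no cell'."""
    findings = {}
    for decision, (needed, actual) in sheet.items():
        if actual is None:
            findings[decision] = "no cell"
        elif actual != needed:
            findings[decision] = "wrong cell"
        else:
            findings[decision] = "ok"
    return findings

findings = audit(worksheet)
for decision, status in findings.items():
    print(f"{decision}: {status}")
```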
Step 3 — reproduce from memory. Without scrolling up, draw the four-axis frame (offline/online; turn/trace; rule/LLM-judge/human; capability/behaviour/outcome) and place the refund bot's six checks into the correct cells. Then connect the picture back to chapter 01's load-bearing rule about samples in one sentence — the cell is the sample shape, and the sample shape determines the evidence.
What you should remember
This chapter explained why a working eval habit still ships the wrong system when the eval lives in the wrong cell. The same refund chatbot scored 0.94 faithfulness, 8.2% policy violations, 62% offline resolution, and 54% online resolution in the same week, and each of those numbers was honest about exactly one cell of a four-axis grid. The mistake this chapter prevents is reading one cell as if it covered the others — the chapter-01 disease one level down, dressed in metric language instead of demo language.
You learned to label any eval against four axes — offline vs online, single-turn vs trace-level, rule-based vs LLM-judge vs human, capability vs behaviour vs product-outcome — and to read the cell as the actual evidence type. You learned that the inspection at maturity is a small table of cells, one per decision the team owns, not a single number. You learned why composite KPIs feel like rigour and produce axis collapse, why a single-turn eval cannot catch a handoff failure, and why a capability score does not predict a product-outcome result on your specific task.
Carry this diagnostic forward: when somebody quotes an eval number, ask one question — "in which cell?" If the answer cannot name all four axes, the number is half a sentence and you cannot tell whether it is a green light, a warning, or noise. Use the cell map on launch reviews, on prompt-change debates, on vendor evaluations, and on weekly readouts. The same axes apply every time.
Remember:
- An eval's evidence type is the cell, not the number. Four axes, each independently chosen, define the cell.
- Rigour is axis coverage, not eval count. One eval per cell across six cells beats six evals in one cell.
- Single-turn evals are structurally blind to trace-level failures. Choose the unit deliberately.
- Capability scores do not predict behaviour or product-outcome on your task. Vendor benchmarks belong in vendor selection, not launch gates.
- Composite KPIs hide cell collapse. A six-cell table costs more cognitive load and is far more decision-useful.
- When somebody quotes one number, the diagnostic question is "in which cell?" — not "what is the threshold?"
Bridge. Naming the cells fixes the routing problem, but most cells are empty until somebody fills them with data. The offline trace-level human cell — the one that caught chapter 01's 62% — needs a trusted set of conversations with known-good outcomes, versioned, owned, and refreshed; otherwise the cell exists in name only and the launch review opens with an empty row.