01. Shipping on vibes — when a flawless demo hides a 38-point quality drop

~14 min read. A demo is a still photograph. Production is a video. Confusing one for the other is how teams launch a bot that wows a boardroom on Friday and angers customers on Monday.

Builds on the ELI5 in 00-eli5.md. The inspection — the systematic measurement that replaces stage-managed applause — is what closes the gap between a curated demo set and the messy distribution of real traffic.


What you already know and what is about to break

You have already shipped one AI feature. It demoed beautifully. Five carefully chosen prompts, five clean answers, applause in the room. The team felt good, leadership signed off, the launch went live on Monday morning. By Wednesday, the support inbox started filling with screenshots of replies that looked fine but were quietly wrong. By Friday, somebody was asking the question every team gets asked: "how did this pass review?"

Modules 01–23 taught you to build the system — prompts, tools, retrieval, reasoning, agents. This module teaches you the discipline that decides whether what you built is actually working. The first chapter is the most uncomfortable one. Before you reach for metrics, judges, dashboards, or A/B frameworks, you need to feel the gap between a polished demo and live traffic in concrete numbers. Otherwise every other tool in this module is a solution looking for a problem.

What this file solves

A demo of five prompts proves that the system can answer well. It proves nothing about how often it does answer well, or which slices fail first. This file walks one refund chatbot from a 100% demo score to a 62% live score, decomposes the 38-point drop into three named failure types, and shows why the right reaction is to build the inspection, not to argue about whether the demo was rigged. By the end you can explain to a skeptical PM why "the demo went great" is the most dangerous launch signal an AI team can mistake for evidence.

Why a passing demo is almost no evidence at all

A demo is a sample. Like every sample, it has a size, a sampler, and a selection rule. In a typical product review the size is five to ten conversations, the sampler is the team that built the feature, and the selection rule is "pick the prompts that make our work look good." Three biases stack: small-n, in-group sampling, and outcome filtering. What comes out is not an estimate in any statistical sense. It is a story with examples. Stories with examples are useful for alignment and excitement. They are not useful for predicting failure rates.
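To put a number on the small-n bias alone, here is a minimal sketch. It assumes, generously, that the demo prompts were a random draw from live traffic, which they were not. The function name `lower_bound_all_pass` is illustrative. It computes the exact one-sided lower confidence bound on the true pass rate when every one of n trials passes.

```python
def lower_bound_all_pass(n: int, confidence: float = 0.95) -> float:
    """Exact one-sided lower bound on the true pass rate after n passes out of n.

    If the true pass rate were p, the chance of n straight passes is p**n.
    The bound is the smallest p for which that chance is still >= 1 - confidence.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    alpha = 1.0 - confidence
    return alpha ** (1.0 / n)


for n in (5, 10, 100):
    bound = lower_bound_all_pass(n)
    print(f"{n:>3} of {n} passed -> true pass rate could be as low as {bound:.0%}")
```

Under these assumptions a perfect 5/5 is still consistent with a true pass rate of roughly 55%, and that is before the sampler and selection biases are counted.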

Now picture the population the system actually serves. Customer questions arrive in a long tail: half of them are easy ("when does my order ship?"), a quarter are awkward ("the address auto-filled wrong but I already paid"), and a quarter are weird ("I bought it for my mother who passed away last week, what happens to the warranty?"). The demo set captured zero of the awkward quarter and zero of the weird quarter. The demo's 100% score is mathematically honest about the demo. It tells you almost nothing about the awkward and weird quarters, which is where production pain lives.

Teacher voice. A demo answers the question "is the best case good?" Production needs the question "is the worst case acceptable?" A team that confuses the two has not failed at engineering. They have failed at picking which question to answer.

The naive repair, the visible break, the diagnosis

The first response a smart team reaches for is "we'll be more careful about the prompts we pick for next demo." That feels rigorous. It is not. The new five prompts will pass too, because the team will keep picking until they do. Curated demos are an unfalsifiable signal — there is no version of "five carefully chosen prompts" that produces a failing score that survives the curation step.

The second response is "we'll watch user complaints." That feels honest. It is also wrong, but more subtly. Complaints are one end of the failure distribution: loud, late, and biased toward users with the energy to type. Silent failures — the customer who reads a wrong answer, shrugs, never escalates, and quietly churns — never reach the inbox. Worse, the complaints that do arrive are often the system's least dangerous failures (obvious wrongness that the user spotted) rather than the most dangerous ones (confident wrongness that the user trusted).

Not a curation problem. Not a complaint-handling problem. A sampling and measurement problem. So the natural question becomes: "what would it take to estimate the failure rate honestly, on the distribution we actually serve, before users feel it?" The answer to that question is the rest of this module, and the first step is to look at what an honest sample shows on the same system the demo loved.

When the same chatbot is asked 100 questions instead of five

Here is the refund chatbot's demo trace, then its live trace, on the same model and prompt. Same code, same weights, same temperature. Different sample.

DEMO SET — five prompts hand-picked by the team
  prompts:        5
  acceptable:     5
  unacceptable:   0
  pass rate:      5/5 = 100%

LIVE SAMPLE — 100 chats drawn uniformly from a representative week
  chats reviewed: 100
  policy-correct: 62
  failures broken down:
    - missing key account details (agent handoff broken)   18
    - invented refund exceptions  (confident hallucination) 12
    - rude or curt escalations    (brand-voice violation)    8
  pass rate:      62/100 = 62%

GAP:              100% − 62% = 38 percentage points

The 38-point drop is the chapter's whole story. It is also the chapter's whole anti-story, because no individual conversation in either sample is fake or mishandled. The model really did answer the demo prompts well. It really did answer 62 live prompts well. The gap is not in the answers; it is in which prompts got asked. A team that ships on the demo number ships the 38-point delusion as a free bonus.

Look at the failure breakdown. Each row teaches a different lesson. The 18 missing-account-details cases are an integration failure — the bot does not pull enough context, so the human agent inheriting the chat has to ask the customer to repeat themselves. The 12 invented-exceptions cases are a grounding failure — the bot speaks confidently about a refund clause that does not exist in the policy. The 8 rude-escalations cases are a style and policy failure — the bot's escape hatch triggers an unfriendly handoff. Three failure shapes, one root cause: the demo set never asked anything that would have stressed any of them.
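The arithmetic in the trace can be reproduced in a few lines. The sketch below uses only the counts shown in the trace; the Wilson score interval is an addition for illustration (the chapter itself does not prescribe an interval method), and `wilson_interval` is a hypothetical helper name.

```python
import math


def wilson_interval(passes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    p = passes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half


# Counts copied from the live trace.
failures = {
    "missing key account details": 18,
    "invented refund exceptions": 12,
    "rude or curt escalations": 8,
}
chats_reviewed = 100
total_failures = sum(failures.values())
policy_correct = chats_reviewed - total_failures

demo_rate = 5 / 5
live_rate = policy_correct / chats_reviewed
gap_points = round((demo_rate - live_rate) * 100)
low, high = wilson_interval(policy_correct, chats_reviewed)

print(f"live pass rate {live_rate:.0%}, interval {low:.0%} to {high:.0%}, gap {gap_points} points")
for name, count in sorted(failures.items(), key=lambda kv: -kv[1]):
    print(f"  {name:<30} {count:>3}  ({count / total_failures:.0%} of failures)")
```

Even a 100-chat sample carries an interval of several points on either side of 62%, which is one more reason to quote the sample size next to the score.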

Mini-FAQ. "Couldn't we have just looked at one of those failure types in the demo?" Possibly, if the team had known what to look for. The point of evals is that you stop relying on the team's imagination of what could go wrong. You sample from the live distribution, and the live distribution chooses the failures for you.

The rule: anything you did not sample, you cannot trust

State the load-bearing truth plainly: a quality claim covers only the population the measurement actually sampled, and nothing beyond it. The demo's 100% claim covers five team-curated prompts. The live sample's 62% claim covers a representative week. Neither claim covers next month's traffic if the user mix shifts, and neither claim covers a brand-new product line.

This is the rule that all the rest of the module enforces. Golden datasets exist to make the sample explicit. Synthetic generation exists to cheaply broaden the sample. Drift detection exists to notice when the live distribution has moved off the sample. A/B testing exists to compare two systems on the same sample. Logging and tracing exist so a failed sample can be reconstructed and rerun. Every later chapter is a different mechanism for keeping the sample honest. They all rest on this rule.

Teacher voice. Treat every quality claim like a financial statement. The number is meaningful only when you can see the footnotes — sample size, sample source, sample date, who labelled the outcomes. A demo has no footnotes. A live eval has every footnote.

How the inspection actually replaces vibes

The mechanism is not a new model or a more clever prompt. It is a four-step measurement habit, run before launch and again continuously after. First, decide what acceptable means in writing — a rubric the team agrees on before reading any outputs. Second, sample the live (or near-live) distribution. Third, score each sampled output against the rubric, by humans or by a calibrated judge. Fourth, slice the results by user type, policy category, and intent so a single aggregate cannot hide a collapsed slice.

  vibes-based launch                    eval-based launch
  ────────────────────                   ────────────────────
  team picks 5 prompts                   team writes rubric first
            │                                      │
            ▼                                      ▼
  team picks 5 outputs                   sample 100 live prompts
            │                                      │
            ▼                                      ▼
  team likes outputs                     score each by rubric
            │                                      │
            ▼                                      ▼
  team ships                             slice by user/policy/intent
            │                                      │
            ▼                                      ▼
  user feedback after launch             team sees 62% before launch
            │                                      │
            ▼                                      ▼
  surprise on Monday                     team fixes the 38pp first

The right-hand column is not slower. It is the same calendar week, with the painful number arriving on a Wednesday review instead of a Friday outage. Teams that have run the inspection twice never want to run a launch without it, because the shape of the failures gives them an actionable plan instead of an apology.
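The four steps can be sketched in code. This is a minimal illustration under stated assumptions, not a prescribed implementation: the rubric criteria, the record layout, and every function name here are hypothetical, and the per-criterion labels are assumed to have been produced already by a human or a calibrated judge.

```python
import random
from collections import defaultdict

# Step 1: the rubric, written down before any output is read.
# These three criteria are illustrative labels, not a required schema.
RUBRIC = ("policy_correct", "handoff_complete", "tone_ok")


def sample_chats(logged_chats: list[dict], k: int, seed: int = 0) -> list[dict]:
    """Step 2: draw k chats uniformly at random from logged traffic."""
    rng = random.Random(seed)
    return rng.sample(logged_chats, min(k, len(logged_chats)))


def passes_rubric(chat: dict) -> bool:
    """Step 3: a chat passes only if every rubric criterion is met."""
    return all(chat["labels"][criterion] for criterion in RUBRIC)


def slice_pass_rates(chats: list[dict], key: str) -> dict[str, tuple[int, int]]:
    """Step 4: (passes, total) per slice, so an aggregate cannot hide a collapse."""
    table: dict[str, list[int]] = defaultdict(lambda: [0, 0])
    for chat in chats:
        row = table[chat[key]]
        row[0] += int(passes_rubric(chat))
        row[1] += 1
    return {name: (passed, total) for name, (passed, total) in table.items()}


# Tiny made-up log, only to show the shapes involved.
ok = {"policy_correct": True, "handoff_complete": True, "tone_ok": True}
log = [
    {"intent": "shipping", "labels": dict(ok)},
    {"intent": "shipping", "labels": dict(ok)},
    {"intent": "refund", "labels": {**ok, "policy_correct": False}},
    {"intent": "refund", "labels": {**ok, "handoff_complete": False}},
]
sampled = sample_chats(log, k=4)
for intent, (passed, total) in sorted(slice_pass_rates(sampled, "intent").items()):
    print(f"{intent:<10} {passed}/{total}")
```

The toy aggregate is 2 of 4, yet the slice view shows that every failure sits in one intent, which is the information a fix needs.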

Vibes vs evals — when each one actually fits

Vibes are not always wrong. They are correct evidence for some questions and incorrect evidence for others. The honest framing is workload-dependent.

| Question being asked | Vibes are the right tool when... | Evals are required when... |
| --- | --- | --- |
| Can the system answer at all? | early prototype, ship-or-kill decision in one afternoon | the system is already in front of users |
| Is the best case impressive? | sales demo, investor pitch, internal excitement | almost never the right question for production |
| Is the worst case acceptable? | never | always — this is the eval's job |
| Did a prompt change make things better? | side-by-side on 3 examples for intuition | A/B on a held-out set with statistical power |
| Is quality drifting over time? | impossible to answer | required — drift detection lives here |

Vibes belong on questions about possibility. Evals belong on questions about distribution. The category error of shipping on vibes is using a tool meant for possibility on a question about distribution. The opposite category error also exists — running an eval on the question "can we build something?" — but it is rare because evals feel heavy. Vibes feel light, so they are over-used.

Operational signals — what tells you the eval discipline is healthy or rotting

Healthy behaviour for a team that has internalised the inspection has three signatures. The launch review opens with a slice table, not a transcript. Aggregate score, slice scores, and a 24-hour rolling pass rate are visible on the same dashboard product managers use for revenue. When a new prompt or model is proposed, the first question in the room is "what is the eval delta?", not "how does it feel?"

The first signal that the discipline is degrading is also the easiest to spot: the eval set has not been updated in three months. A frozen eval set becomes a known-good test that the team learns to pass, which is not the same as learning to ship. The next signal is more subtle — slice tables disappearing from review decks, replaced by a single aggregate. Aggregates feel cleaner and are easier to celebrate, but a 78% aggregate can hide a 30% pass rate on a high-stakes slice. The deepest signal, the one that takes a quarter to notice, is the eval/satisfaction correlation drifting toward zero — your eval keeps going up while your CSAT keeps going down, which means the rubric has stopped measuring what the user actually values.
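One way to watch the eval/satisfaction relationship is to compute a plain Pearson correlation over a recent window of weekly numbers. The sketch below is an assumption about how a team might do it; the weekly figures are invented for illustration and `pearson` is a hand-rolled helper so the block has no dependencies.

```python
import math


def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient of two equal-length series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / (spread_x * spread_y)


# Hypothetical weekly eval pass rates and CSAT scores.
early_eval, early_csat = [0.70, 0.72, 0.75, 0.78], [4.0, 4.1, 4.2, 4.4]
late_eval, late_csat = [0.80, 0.82, 0.84, 0.86], [4.3, 4.2, 4.0, 3.9]

print(f"early window r = {pearson(early_eval, early_csat):+.2f}")
print(f"late window  r = {pearson(late_eval, late_csat):+.2f}")
```

In the invented data the early window moves together and the late window moves in opposite directions. A sign flip like that, or a slide toward zero, is the cue to re-read the rubric rather than celebrate the score.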

The metric a beginner watches first is demo pass rate. The metric an experienced team watches first is live pass rate on a sample drawn after the last prompt change. The graph an expert opens before any other is the slice-by-intent pass-rate distribution — a one-screen plot that shows which intent collapsed since last week.

Where vibes-only stays safe, and where it stops

The boundary of "vibes are fine" is narrower than teams imagine. They are genuinely fine when three conditions hold at once: the user base is small enough that a single failure is recoverable in person, the cost of a wrong answer is bounded (no money, no medical decision, no legal advice), and the team can roll back faster than failures accumulate. Internal tools for a four-person team often satisfy all three. A consumer-facing chatbot does not.

The pathology is the good-enough trap. Vibes work on the easy half of the distribution, the team ships, the easy half keeps working, and the long tail — the awkward and weird quarters from earlier — silently produces the worst failures. Six months later the team has accumulated an unmeasured failure debt that no longer maps cleanly to a single fix, and the proposal "let's add evals" arrives at the worst possible moment.

At scale the trap gets sharper. A system that handles a million conversations a day produces ten thousand failures even at 99% pass rate. Without the inspection, those ten thousand failures arrive as noise — uncorrelated screenshots in support tickets, hard to triage, hard to prioritise. With the inspection they arrive as a sorted table of failure classes with counts and example IDs, which is what an engineering team can actually fix.
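Both halves of that paragraph fit in a short sketch: the failure-volume arithmetic, and the sorted table of failure classes. The chat IDs, class labels, and function names below are hypothetical; only the million-a-day and 99% figures come from the text.

```python
from collections import defaultdict


def expected_daily_failures(conversations_per_day: int, pass_rate: float) -> int:
    """Expected number of failing conversations per day at a given pass rate."""
    return round(conversations_per_day * (1.0 - pass_rate))


def failure_table(
    failed: list[tuple[str, str]], examples_per_class: int = 2
) -> list[tuple[str, int, list[str]]]:
    """Group (chat_id, failure_class) pairs into rows sorted by count, largest first."""
    ids_by_class: dict[str, list[str]] = defaultdict(list)
    for chat_id, failure_class in failed:
        ids_by_class[failure_class].append(chat_id)
    rows = [(cls, len(ids), ids[:examples_per_class]) for cls, ids in ids_by_class.items()]
    return sorted(rows, key=lambda row: (-row[1], row[0]))


print(expected_daily_failures(1_000_000, 0.99))

# Made-up IDs and labels, only to show the shape of the output.
failed_chats = [
    ("c-101", "invented exception"),
    ("c-102", "missing account details"),
    ("c-103", "missing account details"),
    ("c-104", "rude escalation"),
    ("c-105", "missing account details"),
]
for failure_class, count, example_ids in failure_table(failed_chats):
    print(f"{failure_class:<26} {count:>3}  e.g. {', '.join(example_ids)}")
```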

The wrong mental model — "real users would complain"

The seductive belief is that production traffic is the eval, because real users will tell you when something is wrong. This belief is wrong for three reasons that stack.

First, complaints are sparse. Even good products receive complaint signals on a single-digit percentage of failures. The other 90%+ are silent — users who shrug, restart, or churn without ever filing feedback. A complaint-driven view of quality systematically under-estimates failure rate by an order of magnitude.

Second, complaints are biased. The users who complain are not a random sample. They are louder, more technical, more invested, or angrier than average. Their failure mix is different from the population's failure mix. Optimising for them shifts the system toward their concerns and away from the silent majority.

Third, complaints arrive after the damage. By the time a user complains, the wrong answer has already reached them. For most products this means a refund or an apology. For some — medical, legal, financial — it means harm that cannot be undone with an apology.

Replace the wrong model with the right one: complaints are an audit signal, not a measurement signal. Measurement comes from a representative sample scored against a rubric before the failure ships. Complaints validate that the measurement is finding the right things. Use both. Trust neither alone.
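The under-counting is easy to see with back-of-envelope arithmetic. This sketch assumes a reporting rate, meaning the fraction of failures that ever turn into a complaint. The 5% used here is an illustrative value inside the "single-digit percentage" range mentioned above, not a measured figure, and `estimated_failures` is a hypothetical helper.

```python
def estimated_failures(complaints: int, reporting_rate: float) -> int:
    """Rough failure count implied by a complaint count and an assumed reporting rate."""
    if not 0 < reporting_rate <= 1:
        raise ValueError("reporting_rate must be in (0, 1]")
    return round(complaints / reporting_rate)


complaints_this_week = 4   # hypothetical inbox count
assumed_reporting_rate = 0.05  # assumption: 1 in 20 failures is reported

implied = estimated_failures(complaints_this_week, assumed_reporting_rate)
print(f"{complaints_this_week} complaints imply roughly {implied} failures")
```

The estimate is only as good as the assumed reporting rate, which is itself unknown without a scored sample. That circularity is the reason complaints can audit a measurement but cannot replace one.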

Six recurring failure shapes that vibes-only launches keep producing

  • Curated-demo regression. The launch reviewers see the demo's prompts pass. The same prompts appear in future regression checks. The system never gets tested on anything it has not already passed.
  • Aggregate-hides-slice. A 78% aggregate looks fine. The high-value slice (enterprise customers, paid tier, regulated jurisdictions) is at 41%.
  • Silent-failure debt. Failures users do not complain about pile up. By the time anyone notices, the fix touches three components and two teams.
  • Drift-after-launch invisibility. The launch was at 78%. Three months later it is at 64%. Nobody knows because nobody measured a second time.
  • Prompt-tuning amnesia. A prompt change was made on Tuesday. On Friday nobody can answer "did Tuesday's change help?" because no baseline was captured before the change.
  • Vendor-claim acceptance. A new model vendor claims +12% on MMLU. The team upgrades. Two weeks later, the team's own task is degraded — MMLU is not the team's eval, and the vendor never claimed it was.

Each of these is a specific failure of the sample and score habit. Each one disappears when the inspection is in place, not because the model improved, but because the team noticed.
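The aggregate-hides-slice shape is plain weighted-average arithmetic. In the sketch below the 78% and 41% figures match the bullet above; the slice sizes and the pass count for the remaining traffic are invented so that the numbers reconcile.

```python
# (passes, total) per slice. Sizes are hypothetical; only 41% and 78% come from the text.
slices = {
    "enterprise": (41, 100),
    "everyone else": (739, 900),
}


def aggregate_pass_rate(table: dict[str, tuple[int, int]]) -> float:
    """Pooled pass rate across all slices, weighted by slice size."""
    total_passes = sum(passes for passes, _ in table.values())
    total_chats = sum(total for _, total in table.values())
    return total_passes / total_chats


print(f"aggregate: {aggregate_pass_rate(slices):.0%}")
for name, (passes, total) in slices.items():
    print(f"  {name:<14} {passes / total:.0%} of {total}")
```

Because the struggling slice is only a tenth of the traffic in this example, the other nine tenths pull the headline number up and the collapse stays invisible unless the table is read row by row.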

Cross-topic references — where this pressure shows up again

  • Same failure shape, deeper module. The "confident wrong answer" pattern in 08_rag_system_design/01-confident-wrong-answer.md is the same pressure as shipping on vibes, applied to retrieval rather than launch. Both arise when the team has no instrumented way to detect that fluent output and correct output have diverged.
  • Optimization pressure echoed. Module 03_agent_observability_debugging returns to this rule under a harder constraint: when failures span multiple agents and tools, the demo lie compounds. Same anti-pattern, larger blast radius.
  • Invariant carried forward. Every chapter in this module restates: a quality claim is only as strong as the sample that generated it. Drift detection, judge calibration, A/B testing, alerting — all of them are mechanisms for keeping that sample honest under different operational pressures.

A fast self-test before you sign a launch

  • Can you state the acceptable threshold in a sentence a stranger could grade against?
  • Did the eval set this week contain prompts the team did not personally write?
  • Can you point at slice-level pass rates for at least three user segments?
  • Would a 5-point drop in next week's eval be visible to anyone without checking?
  • If a vendor claimed +12% on a benchmark, do you have your own eval to verify it on your task?

Five yeses means the inspection exists. One or more nos means the next launch is partly on vibes, whether the room realises it or not.

Where the vibes-vs-evals gap shows up in shipped products

The market reveals the discipline by who has it.

  • Intercom Fin — published deflection-rate evals are the product. The team grades policy accuracy on sampled real tickets, not on staged ones, because the contract with customers is "we cut your support cost", which is measurable.
  • GitHub Copilot Chat — pass@k on a held-out repo set is the launch gate; demos are forbidden as the only signal because the team learned early that a chat assistant can look magical on canned prompts and fail on messy real repos.
  • Harvey — validates legal drafting across realistic matter types and uses BigLaw associate review as a calibration anchor; a "great showcase document" is dismissed as evidence because partners explicitly distrust it.
  • Duolingo Max — pass rate is sliced by learner CEFR level so a polished A1 lesson cannot hide a weak C1 tutor turn.
  • Anthropic Claude releases — the model card lists task-specific eval deltas instead of "feels smarter", because shipping decisions cannot be made on vibes at the scale Claude is deployed.
  • OpenAI evals platform — exists as a product because every team running serious LLM applications has rebuilt this layer themselves; the platform commoditises the discipline.
  • Cursor — the team publicly tracks tool-call success rate on a held-out repo benchmark and rejects releases that regress; the demo loop is explicitly downstream of the eval loop.
  • Perplexity — citation-accuracy eval gates every model swap; "the answer felt good" does not pass review.
  • Glean — enterprise search runs nDCG and CTR together; the team learned that a green offline metric and a falling click-through together is a Goodhart signal, not a victory.
  • Notion AI Q&A — internal eval over workspace golden sets before each release; production traffic catches the drift, but the gate is the eval.
  • Salesforce Einstein Copilot — CRM trust layer treats prompt injection and false-premise attacks as failure modes; the eval set explicitly contains adversarial inputs.
  • AWS Bedrock Knowledge Bases — observability product specifically markets retrieval failure analysis because customers asked for the diagnostic, not just the aggregate.
  • Bloomberg GPT, JP Morgan DocLLM — finance domain evals exist because regulators ask, and regulators do not accept demos.
  • Casetext CoCounsel — citation accuracy was a launch blocker post the Mata v. Avianca incident; the demo of legal drafting was always polished; the eval of legal drafting now decides ship.
  • Air Canada (2024) — counter-example. The chatbot promised a refund the policy did not allow, and a tribunal found the airline liable. The visible failure was a wrong answer; the invisible failure was no eval on the policy-violation slice.
  • Microsoft Copilot for M365 — Graph-aware eval reduces vague-query failures by injecting org context; the dashboard is reviewed weekly.
  • Stripe Radar — fraud models gate on production sampling of actual transactions, not on synthetic test cases alone; the team knows synthetic over-represents easy cases.
  • Slack AI — channel-summary eval includes long-tail channel types because the team noticed that early demos all came from sales channels.
  • Zendesk AI agents — needs-human signal trained from past failed automations; the eval set is "places we failed before."
  • Vectara HHEM — exists as a product because customer deployments kept missing real failures with faithfulness-only scoring; the eval gap created the company.
  • Galileo, Patronus AI, Arize Phoenix, LangSmith, LangFuse — five companies whose entire pitch is "your eval is not enough". The market size tells you how often launches still go on vibes.

The pattern is consistent. The teams that ship reliable AI are the teams whose launch reviews open with a slice table.

Recall — can you reconstruct the chapter cold?

  1. Why is a 100% demo score on five prompts not evidence of 95%+ production quality?
  2. Name the three biases that stack inside a hand-curated demo set.
  3. In the refund chatbot trace, name the three failure shapes and one specific symptom each one produced.
  4. State the chapter's load-bearing rule about what a quality claim covers.
  5. Name the four steps of the inspection as a measurement habit.
  6. Give two real conditions under which vibes are honestly enough evidence to ship.
  7. Why are user complaints an unreliable single signal of production quality?
  8. What is the first operational signal that an eval discipline is rotting?

Interview Q&A

Q1. Your CEO asks "the demo went great, can we ship?". You have no live eval. What do you say?

A. The honest answer is "the demo is necessary but not sufficient evidence." I would ask for a 24–48 hour pre-launch window to sample 50–100 real-looking conversations against a rubric we agree on now. If the slice table comes back clean, we ship. If it does not, we have a sorted list of fixes before users find them. The cost is a day or two; the upside is that we never explain a Friday outage to a Monday all-hands. Common wrong answer to avoid: "Yes, the demo is the eval."

Q2. A teammate proposes shipping based on watching user complaints. Why is that not enough?

A. Three reasons that stack. Complaints capture maybe 5–10% of failures, biased toward loud users; the silent majority churns invisibly. The complaint mix does not match the failure mix, so optimising for complaints shifts effort away from the most common bug. And complaints arrive after the wrong answer has already reached the user. Complaints are an audit signal, not a measurement signal — useful for validating an eval, not for replacing one. Common wrong answer to avoid: "Real users will tell us when something is broken."

Q3. The launch eval shows 78% aggregate pass rate. The PM wants to ship. What is the next question?

A. "What is the slice table?" An aggregate at 78% can hide a 41% pass rate on the highest-stakes slice — enterprise customers, paid tier, regulated jurisdiction. The shipping decision depends on whether the slice that matters most is above the bar, not whether the average is. If the slice table is not in the room, the eval is not finished. Common wrong answer to avoid: "78% is well above our 70% bar, let's ship."

Q4. You inherit a project where the eval set has not changed in six months. What's the risk?

A. The system has been tuned to pass that exact set. The score keeps rising; production quality does not necessarily move with it. The eval set has become a known-good test, and the team has lost its instrument for catching new failure shapes. The immediate action is to refresh the set with recent live samples: pull a new 100 conversations, score them, and compare to the frozen set's score. The fresh score is your real quality; the gap between the two shows how far the frozen set has drifted from live traffic. Common wrong answer to avoid: "If the score keeps going up, the system is improving."

Q5. A vendor claims their new model is +12% on a public benchmark. Your team is about to upgrade. What is your check?

A. Run the vendor's new model on your eval, on your sample, with your rubric, and compare to your current model. Public benchmarks are averages over tasks that may not resemble yours. The +12% on MMLU is real for MMLU; it is unknown for refund-policy reasoning. The cost is a few hours of eval re-runs against the upgrade candidate. Common wrong answer to avoid: "+12% beats our current model, upgrade."

Q6. Why is "this prompt change feels better" a dangerous statement in a code review?

A. Because it is unverifiable and unrepeatable. Two reviewers may disagree; the reviewer who feels strongest wins. Six months later nobody can reconstruct whether the prompt change helped. The replacement is a delta on a held-out eval set, with statistical power if the change is borderline. Vibes belong upstream of the eval (generating hypotheses), not downstream of it (validating them). Common wrong answer to avoid: "Senior engineers can tell when a prompt is better."
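"With statistical power" can be made concrete with a two-proportion z-test. The sketch below is one common choice, not the only one: it treats the two runs as independent samples, and if both prompts are scored on the same held-out items a paired test would be more appropriate. The 70-of-100 result for the new prompt is invented; `two_proportion_p_value` is a hypothetical helper.

```python
import math


def two_proportion_p_value(pass_a: int, n_a: int, pass_b: int, n_b: int) -> float:
    """Two-sided p-value for the difference between two independent pass rates."""
    rate_a, rate_b = pass_a / n_a, pass_b / n_b
    pooled = (pass_a + pass_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if std_err == 0:
        return 1.0  # both runs all-pass or all-fail: no evidence of a difference
    z = (rate_b - rate_a) / std_err
    # erfc(|z| / sqrt(2)) equals 2 * (1 - Phi(|z|)) for the standard normal.
    return math.erfc(abs(z) / math.sqrt(2))


# Same 8-point improvement, measured on 100 prompts and on 1000 prompts.
p_small = two_proportion_p_value(62, 100, 70, 100)
p_large = two_proportion_p_value(620, 1000, 700, 1000)
print(f"n=100 per arm:  p = {p_small:.3f}")
print(f"n=1000 per arm: p = {p_large:.5f}")
```

On 100 prompts per arm an 8-point lift is well within noise; on 1000 it is not. A borderline change therefore needs a larger held-out set before anyone claims it helped.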

Q7. Cumulative — your dashboard shows faithfulness 0.95, eval pass rate 82%, CSAT down 8 points. What do you investigate first?

A. The rubric. A rising eval and a falling CSAT is the canonical Goodhart signal: the team has been tuning to a rubric that no longer measures what users value. Pull 30 recently-low-CSAT conversations, score them by the current rubric, and inspect the cases where the rubric says acceptable but the user said unhappy. The rubric needs new dimensions or new anchors. Faithfulness 0.95 and pass rate 82% are facts; CSAT is the truth. Common wrong answer to avoid: "Faithfulness is high, so users must be wrong."

Q8. Define acceptable for a refund chatbot in one sentence.

A. "The reply is policy-correct, includes the account details a human agent would need to continue the conversation, and uses the brand-voice tone in the style guide." That sentence has three observable criteria — policy match, handoff completeness, tone match — which a labeller can grade and a judge can check. "Helpful and friendly" is not a definition; it is a wish. The eval's quality is bounded by how concrete the acceptable sentence is. Common wrong answer to avoid: "Acceptable means the customer is happy."

Apply now (10 min)

Step 1 — model the exercise. Take the refund-chatbot trace from this chapter. Here is the slice table I would build at launch review:

| Slice | n | Pass rate | First failure mode | Decision |
| --- | --- | --- | --- | --- |
| All chats | 100 | 62% | invented exceptions | block launch |
| Enterprise customers | 22 | 50% | missing account details | block until handoff fixed |
| Free tier | 48 | 71% | rude escalations | ship with style guard |
| Regulated jurisdictions (EU) | 14 | 43% | invented exceptions | block — compliance risk |
| Multi-turn conversations (≥3 turns) | 16 | 38% | invented exceptions | block — root cause = no retrieval |

Notice how the aggregate (62%) hides the EU slice (43%) and the multi-turn slice (38%). The slice table changed the launch decision from "close to bar, ship with caveats" to "two specific things to fix first."
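A launch gate over the slice table can be a few lines of code. In this sketch the pass counts are back-computed from the rounded percentages in the table, so they are an assumption, and the 70% bar is a hypothetical threshold rather than a recommendation.

```python
# (slice name, n, passes). Pass counts are inferred from the rounded rates in the table.
SLICES = [
    ("All chats", 100, 62),
    ("Enterprise customers", 22, 11),
    ("Free tier", 48, 34),
    ("Regulated jurisdictions (EU)", 14, 6),
    ("Multi-turn conversations (>=3 turns)", 16, 6),
]
BAR = 0.70  # hypothetical launch bar


def below_bar(slices: list[tuple[str, int, int]], bar: float) -> list[tuple[str, float]]:
    """Return (name, pass_rate) for every slice under the bar, worst first."""
    flagged = [(name, passes / n) for name, n, passes in slices if passes / n < bar]
    return sorted(flagged, key=lambda item: item[1])


flagged = below_bar(SLICES, BAR)
for name, rate in flagged:
    print(f"BLOCK  {name:<38} {rate:.0%}")
print("ship" if not flagged else f"{len(flagged)} slice(s) under the {BAR:.0%} bar")
```

Sorting worst-first puts the multi-turn and EU rows at the top, which matches the two fixes the table calls out. With slices this small (14 and 16 chats) the rates are noisy, so a real gate would also want a minimum n per slice.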

Step 2 — your turn. Take one AI feature in your own product (or one you admire). Write three slices it should be measured on — by user type, by query type, by policy category. For each slice, predict whether the aggregate pass rate would be higher or lower, and write the failure mode that would dominate.

Step 3 — reproduce from memory. Without scrolling up, draw the demo-pipeline / inspection-pipeline diagram from the vibes vs evals comparison. Mark the four steps of the inspection. Then connect it to the chapter's load-bearing rule about samples in one sentence. If you can do this cold, you carry the chapter.

What you should remember

This chapter explained why a flawless demo is the most dangerous launch signal an AI team can mistake for evidence. The refund chatbot scored 100% on five hand-picked prompts and 62% on 100 live ones — same model, same prompt, same week. The gap is not a model failure or a prompt failure; it is a measurement failure that any system without the inspection will eventually meet. The 38-point delta decomposed into three named failure shapes — missing handoff details, invented exceptions, rude escalations — and each was invisible to the demo because the demo never asked anything that would have stressed them.

You learned the load-bearing rule: a quality claim covers only the population the measurement sampled and nothing beyond it. Every later chapter in this module — taxonomy, golden sets, synthetic generation, metrics, judges, calibration, drift, A/B, logging, alerting, EDD — is a different mechanism for keeping that sample honest under a different operational pressure. You also learned to distrust the good-enough trap: vibes work on the easy half, the team ships, the long tail accumulates failure debt, and by the time someone notices, the fix touches three components and two teams.

Carry this diagnostic forward: when somebody says "the demo went great", ask one question — "what is the slice table on 100 live-shaped prompts?" If the answer is "we don't have one yet", you have just identified the most leveraged half-week of work in the project. Vibes belong on questions about possibility. Evals belong on questions about distribution. The category error of using one for the other is the most common reason AI launches fail in public.

Remember:

  • A 100% demo score on five prompts is honest about the demo and silent about production. Demos answer "is the best case good?"; production needs "is the worst case acceptable?"
  • A quality claim covers only the sample that generated it. Treat every number like a financial statement — meaningful only with footnotes.
  • Aggregate pass rates hide collapsed slices. Always read the slice table before the headline.
  • Complaints are an audit signal, not a measurement signal. They under-count by an order of magnitude and arrive after the damage.
  • The eval/satisfaction correlation drifting toward zero is the quarter-scale signal that the rubric has stopped measuring what users value. Refresh the rubric before chasing the score.

Bridge. Once vibes are exposed as the wrong evidence for distribution questions, the next problem is taxonomy — what kind of eval, when? Offline against a golden set, online against live traffic, single-turn vs trace-level, rule-based vs judge-based: each is the right tool for a different decision, and confusing them produces the same category error this chapter just dismantled, one level down.

02-eval-taxonomy.md