13. Eval-driven development: when the test is written before the prompt

~18 min read. Most teams ship the change and then write the eval. EDD inverts the cadence — the change does not exist until a failing eval already names what it must fix. The prompt diff stops being the unit of work; the eval delta becomes it.

Builds on 12-alerting-dashboards.md. Dashboards and alerts catch what is already broken in production. The inspection closes the loop only when every caught failure becomes a permanent test case — a regression bullet the system must always pass before the next ship. EDD is that closing.

What the last twelve chapters earned, and the loop they still leave open

Chapter 01 made you feel the 38-point gap between a curated demo and a live sample. Chapter 03 gave you golden sets with owners and versions. Chapter 07 taught rubrics with anchors. Chapter 08 calibrated the judge so its scores survive a second labeller. Chapter 09 caught drift after launch. Chapter 11 traced what happened inside a multi-step run. Chapter 12 wired the dashboard and the alert so a regression at 3 a.m. wakes someone up.

Read those twelve chapters carefully and you have a working inspection: representative samples, defensible rubrics, judges that agree, traces you can replay, dashboards a PM trusts, alerts that fire on the right slice. What you do not yet have is a rule for how the eval set grows. A frozen eval set is a known-good test the team learns to pass. A growing one — one where every production miss earns a permanent bullet within hours of being noticed — is the difference between a team that ships AI and a team that nurses one. This chapter teaches that growth rule and the inner loop it forces on every prompt, retrieval, tool, and model change.

What this file solves

Most teams ship a prompt change and then, if a complaint arrives, write an eval to remember the lesson. EDD inverts that order: a fix is not allowed to land until a failing case is captured and the eval set is rerun with the change applied. The unit of progress stops being "the diff looks better" and becomes "the eval delta is positive on the right slice without regressing the others." This file shows the full inner loop on the refund chatbot — bug report, captured case, prompt-and-retrieval change, before/after eval numbers per slice, ship decision — and names the team-size and risk-profile boundaries where the discipline pays off versus where it just adds tax.

Why prompt iteration without an eval feedback loop stays slow forever

Take any team running an LLM product without EDD. Watch the prompt-iteration cadence. A PM forwards a complaint. An engineer reads the trace, edits the system prompt, eyeballs three test prompts, and ships. Three days later a different complaint arrives — the same root cause, a slightly different surface. Nobody is sure whether the original fix helped, hurt, or was orthogonal. The team is doing real work and accumulating nothing.

The reason is structural, not motivational. Every "did this change help?" question requires a comparison against something. Without a stored set of cases the team has rehearsed against, that comparison happens in someone's head against five remembered prompts, and the answer is a feeling, not a number. Feelings don't compose: yesterday's feeling about Tuesday's prompt cannot be added to today's feeling about Thursday's prompt. So every iteration starts roughly from scratch. The team is on a treadmill, not a staircase.

The naive repair is "we'll add evals later, once the system stabilises." That is the trap. The system stabilises only when iteration produces durable knowledge, and iteration produces durable knowledge only when every change can be scored against the same target. Not a motivation problem. Not a skill problem. A feedback-loop latency problem. So the natural question becomes: what would it take to make every prompt change carry its own before-and-after evidence, cheaply enough that nobody routes around it? That question is the entire reason EDD exists.

When a "tiny prompt fix" silently regresses three other slices

Here is the smallest example before the rule. The refund chatbot has been flagged for one bug. A customer reported: "the bot promised a refund on an order that was 45 days old, our policy is 30 days." An engineer reads the trace, adds one sentence to the system prompt — "Never approve a refund after 30 days from the order date" — eyeballs three prompts that previously failed similar cases, sees them now pass, and ships.

A week later, three new complaints arrive. The bot is now refusing refunds on day-29 orders for users who had legitimate damage claims. The bot is also handing off to humans more aggressively, because the new sentence is being applied as a hard rule the model cannot reason around. Aggregate pass rate dropped from 68% to 61% — same model, same week, one sentence of prompt change.

PROMPT FIX, NO EVAL LOOP
  cases reviewed before ship:        3 (the obvious failing case + 2 similar)
  cases reviewed after ship:         3 weeks of complaints
  slice regression detected at:      Friday standup, week 2
  attribution to the one sentence:   inferred, not measured

PROMPT FIX, EDD LOOP
  cases reviewed before ship:        47 (full fast suite) + 5 new captured slices
  pass-rate delta per slice:         visible in the PR
  ship decision:                     made on numbers, not vibes
  attribution to the one sentence:   measured against the prior commit

The failure shape is not "the engineer is sloppy." The engineer applied judgment correctly given the cases they saw. The shape is that the cases they saw were not the right sample to predict the change's behaviour on the live distribution. EDD is the discipline that fixes the sample.

The rule: a fix is not shipped until a failing case is captured and the eval delta is positive on the right slice

State the load-bearing rule plainly: no prompt, retrieval, tool, or model change is merged unless (a) a failing eval case exists that names what the change must fix, and (b) running the full fast suite against the candidate produces a non-negative delta on every protected slice and a positive delta on the captured case. The order matters. The eval case exists before the change. The delta is measured per slice, not as an aggregate. The protected slices are the ones where a regression would be expensive: enterprise, regulated jurisdictions, money-mutating intents.

That rule has three consequences the team must accept. Every bug fix grows the eval set by at least one case. The fast suite must stay fast enough that a developer runs it on every commit without complaining. Some legitimate-feeling fixes will fail the gate and have to be rethought, because the change helped one slice and hurt another. The third consequence is the one teams underestimate — and it is also the source of EDD's biggest velocity gain, because the bad fix that fails the gate is the bad fix that did not ship.
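The rule can be written down as a small predicate. The sketch below is one possible encoding, not a reference implementation: `SliceResult`, `merge_allowed`, and the `noise` tolerance are hypothetical names introduced here for illustration, and the example numbers are made up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SliceResult:
    name: str
    pre_pass_rate: float   # pass rate on the prior commit, 0.0 to 1.0
    post_pass_rate: float  # pass rate with the candidate change applied
    protected: bool

    @property
    def delta(self) -> float:
        return self.post_pass_rate - self.pre_pass_rate


def merge_allowed(slices, captured_pre_pass, captured_post_pass, noise=0.0):
    """Return (allowed, reasons). Any reason at all blocks the merge."""
    reasons = []
    # (a) a failing case must exist before the change
    if captured_pre_pass:
        reasons.append("captured case did not fail on the baseline")
    # (b) the captured case must flip to pass ...
    if not captured_post_pass:
        reasons.append("captured case still fails with the change applied")
    # ... and no protected slice may regress (aggregate is never consulted)
    for s in slices:
        if s.protected and s.delta < -noise:
            reasons.append(f"protected slice '{s.name}' regressed by {-s.delta:.0%}")
    return (not reasons, reasons)


allowed, why = merge_allowed(
    [SliceResult("enterprise", 0.80, 0.73, protected=True),
     SliceResult("damage no warranty", 0.75, 0.75, protected=False)],
    captured_pre_pass=False,
    captured_post_pass=True,
)
print(allowed, why)
```

Note that the function never looks at an aggregate number; a change that lifts the overall pass rate still fails if one protected slice goes down.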

Teacher voice. EDD is not "evals plus engineering discipline." It is the rearrangement of the inner loop so that engineering decisions and eval results refer to each other within the same pull request. If a PR description says "feels better" instead of citing an eval delta, the loop has broken and the team is back on vibes — they just have a dashboard now to feel sophisticated while doing it.

The inner loop: what every change actually executes

Below is the canonical EDD inner loop, in ASCII because this is the chapter's core mental model. The placeholder names from the ELI5 anchor it: the inspection is the eval suite, the rubric scores each case, the spot check is the small fast subset, the kitchen log is the trace that survives a failure, the shift change is the merge gate.

   ┌───────────────────────────────────────────────────────────────┐
   │ EDD INNER LOOP — one bug to one merge                         │
   ├───────────────────────────────────────────────────────────────┤
   │                                                               │
   │   bug arrives (complaint, alert, trace, manual review)        │
   │            │                                                  │
   │            ▼                                                  │
   │   capture as eval case  ──→  added to suite, tagged by slice  │
   │            │                                                  │
   │            ▼                                                  │
   │   run baseline ── current system fails the new case           │
   │            │                                                  │
   │            ▼                                                  │
   │   propose change (prompt / retrieval / tool / model)          │
   │            │                                                  │
   │            ▼                                                  │
   │   run fast suite (the spot check, 5-15 min)                   │
   │            │                                                  │
   │     ┌──────┴──────┐                                           │
   │     │             │                                           │
   │  delta good     delta bad on protected slice                  │
   │     │             │                                           │
   │     ▼             ▼                                           │
   │  run full suite  rethink change, do NOT merge                 │
   │     │                                                         │
   │     ▼                                                         │
   │  full delta good? ── merge; the new case is now permanent     │
   │                                                               │
   └───────────────────────────────────────────────────────────────┘

Notice what the loop has and what it does not have. It does not have "engineer reviews the prompt diff and decides if it reads better." Human review of the diff exists, but it is downstream of the eval delta, not the gate. The gate is numeric. The inspection is what decides ship. The kitchen log of the failing case is what justifies the case's existence in the suite. The shift change is gated on the numbers in the PR body.
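The diagram reads naturally as a short sequence of checks. The following is a minimal sketch under assumptions: the `Outcome` names and the `edd_inner_loop` function are hypothetical, and the branch for a baseline that unexpectedly passes is an addition the diagram itself does not draw.

```python
from enum import Enum


class Outcome(Enum):
    CASE_DOES_NOT_REPRODUCE = "baseline passes the new case; recapture it first"
    RETHINK = "fast suite gate failed; rethink the change, do not merge"
    FULL_SUITE_FAILED = "full suite gate failed; do not merge"
    MERGE = "merge; the captured case is now permanent"


def edd_inner_loop(baseline_passes_case, fast_gate_ok, full_gate_ok):
    """Walk the diagram top to bottom.

    Each argument is a zero-argument callable, so the slower, more
    expensive steps only run when the earlier ones allow it.
    """
    if baseline_passes_case():
        return Outcome.CASE_DOES_NOT_REPRODUCE
    if not fast_gate_ok():
        return Outcome.RETHINK
    if not full_gate_ok():
        return Outcome.FULL_SUITE_FAILED
    return Outcome.MERGE


# Example: the case fails on baseline, the fast suite passes, the full suite passes.
print(edd_inner_loop(lambda: False, lambda: True, lambda: True).value)
```

Passing callables rather than results keeps the ordering honest: the full suite is never invoked for a change the fast suite has already rejected.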

Mini-FAQ. "Doesn't human review still matter?" Yes — for catching new failure modes the eval set cannot yet see, for security and policy reasoning the rubric cannot encode, and for style judgments the judge underrates. Human review is the complement to the gate, not the replacement. The eval gate is the floor; human review is the ceiling.

The refund chatbot, one EDD cycle, end to end

Here is one complete cycle on the refund chatbot you've been threading since chapter 01. The bug report is the one from the tiny example: "bot promised refund on a 45-day-old order, policy is 30 days."

Step 1 — capture the case.

id: refund-policy-age-2026-05-04-001
slice: [policy-window, enterprise]
input: |
  Customer message: "Hi, I bought this on March 21 and it broke today.
  Can I get a refund?"
  Context: today is May 4. Order date 2026-03-21. Order age 44 days.
  Customer tier: enterprise.
expected_behavior:
  - decline refund based on 30-day policy
  - explain the policy window plainly
  - offer warranty path or repair option
  - keep account-handoff fields populated
forbidden_behavior:
  - approve refund
  - invent an exception clause
  - hand off rudely
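For a harness that loads such cases, the fields above map onto a small record type. This is a sketch of how a team might hold the case in Python; the `EvalCase` class and its `problems` check are hypothetical and not taken from any named tool, and the `input` string is abbreviated.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalCase:
    """One captured case, mirroring the fields of the record above."""
    id: str
    slices: tuple
    input: str
    expected_behavior: tuple
    forbidden_behavior: tuple

    def problems(self):
        """Return reasons this case is not ready to enter the suite."""
        found = []
        if not self.id:
            found.append("missing id")
        if not self.slices:
            found.append("no slice tag, so per-slice deltas cannot see this case")
        if not self.expected_behavior:
            found.append("no expected behavior to score against")
        overlap = set(self.expected_behavior) & set(self.forbidden_behavior)
        if overlap:
            found.append(f"listed as both expected and forbidden: {sorted(overlap)}")
        return found


case = EvalCase(
    id="refund-policy-age-2026-05-04-001",
    slices=("policy-window", "enterprise"),
    input="Order date 2026-03-21, today is May 4, customer tier enterprise.",
    expected_behavior=("decline refund based on 30-day policy",),
    forbidden_behavior=("approve refund",),
)
print(case.problems())  # prints []
```

The slice check matters most: a case without slice tags would still count toward the aggregate but would be invisible to the per-slice gate.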

Step 2 — baseline. Current system, on this case: "Yes, I can process that refund for you, one moment." Result: fail (approved refund outside policy window, invented permission).

Step 3 — propose change. The engineer's hypothesis is that the system prompt does not mention the 30-day window explicitly, and that retrieval is pulling the warranty policy document instead of the refund one when the customer's message mentions damage. Two changes go in the same PR:

 # system prompt
+ Refund policy window: refunds may be issued only when the order is at most
+ 30 calendar days old at the time of the request. For older orders, decline
+ the refund and offer the warranty or repair path.
+
+ When the customer mentions damage, do not treat that as a refund override.
+ Damage may trigger a warranty claim, not a refund outside the policy window.

 # retrieval
- query_template: "{customer_message}"
+ query_template: "{customer_message} refund policy window age days"
+ rerank_top_k: 5
+ filter_doc_type: ["refund-policy", "policy-exceptions"]
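To see why the retrieval half of the change matters, here is a toy sketch of the three settings acting together. Everything in it is an assumption for illustration: the documents are invented, and the keyword-overlap scorer merely stands in for whatever retriever and reranker the real system uses, which the text does not specify.

```python
def build_query(template, customer_message):
    """Fill the query template with the customer's message."""
    return template.format(customer_message=customer_message)


def retrieve(query, docs, allowed_doc_types, top_k):
    """Filter by document type, then rank by shared words and keep top_k."""
    terms = set(query.lower().split())
    candidates = [d for d in docs if d["doc_type"] in allowed_doc_types]
    ranked = sorted(
        candidates,
        key=lambda d: len(terms & set(d["text"].lower().split())),
        reverse=True,
    )
    return ranked[:top_k]


# Invented documents, for illustration only.
DOCS = [
    {"doc_type": "warranty-policy", "text": "damage broke repair warranty claim"},
    {"doc_type": "refund-policy", "text": "refund policy window is 30 days from order date"},
    {"doc_type": "policy-exceptions", "text": "exceptions to the refund window"},
]

query = build_query(
    "{customer_message} refund policy window age days",
    "it broke today can I get a refund",
)
hits = retrieve(query, DOCS, {"refund-policy", "policy-exceptions"}, top_k=5)
print([d["doc_type"] for d in hits])
```

With the type filter in place, a message that mentions damage can no longer pull the warranty document ahead of the refund policy, which is the failure the engineer hypothesised.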

Step 4 — fast suite. The fast suite is 47 cases covering refund-eligible, refund-ineligible, damage-with-warranty, damage-without-warranty, multi-turn, and rude-tone-detection. It runs in 4 minutes 12 seconds on a parallel runner. Before and after, by slice:

Slice	n	Pre pass	Post pass	Delta	Protected?
refund-eligible (in window)	12	92%	92%	0	yes
refund-ineligible (out of window)	10	40%	90%	+50	yes
damage with warranty	6	67%	83%	+16	yes
damage no warranty	4	75%	75%	0	no
multi-turn	8	50%	50%	0	yes
tone / handoff	7	71%	71%	0	yes
all	47	66%	79%	+13	—

The captured case (refund-policy-age-2026-05-04-001) goes from fail to pass. No protected slice regresses. The aggregate moves +13 points. The PR is allowed past the fast-suite gate.
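The table can be recomputed from per-slice pass counts. In the sketch below the counts are an assumption, inferred from the rounded percentages and the slice sizes (for example, 92% of 12 cases is taken to be 11 passes); `slice_table` is a hypothetical helper. Deltas are taken on the rounded percentages, which is how the table appears to be computed.

```python
# (slice, n, passes before, passes after, protected)
ROWS = [
    ("refund-eligible (in window)", 12, 11, 11, True),
    ("refund-ineligible (out of window)", 10, 4, 9, True),
    ("damage with warranty", 6, 4, 5, True),
    ("damage no warranty", 4, 3, 3, False),
    ("multi-turn", 8, 4, 4, True),
    ("tone / handoff", 7, 5, 5, True),
]


def pct(passed, n):
    """Pass rate as a whole-number percentage."""
    return round(100 * passed / n)


def slice_table(rows):
    """Return per-slice rows plus an 'all' row: (name, n, pre %, post %, delta, protected)."""
    table = []
    for name, n, pre, post, protected in rows:
        table.append((name, n, pct(pre, n), pct(post, n), pct(post, n) - pct(pre, n), protected))
    total_n = sum(r[1] for r in rows)
    total_pre = sum(r[2] for r in rows)
    total_post = sum(r[3] for r in rows)
    table.append(("all", total_n, pct(total_pre, total_n), pct(total_post, total_n),
                  pct(total_post, total_n) - pct(total_pre, total_n), None))
    return table


for row in slice_table(ROWS):
    print(row)

# The gate itself: no protected slice may have a negative delta.
regressed = [r[0] for r in slice_table(ROWS) if r[5] and r[4] < 0]
print(regressed)  # prints []
```

The aggregate row comes from summed counts (31 of 47 before, 37 of 47 after), not from averaging the slice percentages, so large and small slices are weighted correctly.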

Step 5 — full suite. The full suite is 412 cases; it runs in 38 minutes on the nightly runner. The engineer triggers it manually because the change touches retrieval as well as the prompt. The full suite confirms +9 aggregate and no protected-slice regression beyond noise. The PR is merged. The captured case is now a permanent member of the suite, tagged so a future "refactor the system prompt" PR cannot quietly remove it.
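One way to make "cannot quietly remove it" concrete is a check that compares the suite on the base branch with the suite in the PR. The sketch below assumes each suite is available as a mapping from case ID to a set of tags and that permanence is marked with a tag named `permanent`; both the representation and the tag name are hypothetical.

```python
def removed_permanent_cases(base_suite, candidate_suite):
    """Return IDs that are permanent on the base branch but are missing,
    or no longer tagged permanent, in the candidate."""
    missing = []
    for case_id, tags in base_suite.items():
        if "permanent" not in tags:
            continue
        if "permanent" not in candidate_suite.get(case_id, set()):
            missing.append(case_id)
    return sorted(missing)


base = {
    "refund-policy-age-2026-05-04-001": {"permanent", "policy-window", "enterprise"},
    "scratch-experiment-01": {"policy-window"},
}
candidate = {
    "scratch-experiment-01": {"policy-window"},
}
print(removed_permanent_cases(base, candidate))
# prints ['refund-policy-age-2026-05-04-001']
```

A CI step could fail the PR whenever this list is non-empty, so removing a captured case becomes an explicit, reviewed decision rather than a side effect of a refactor.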

Teacher voice. This is the entire EDD value proposition in one PR. The change is two diffs that together cost an afternoon. The evidence justifying the change is a slice table in the PR body. Six months from now, when somebody asks "why is the prompt so specific about the 30-day window?", the answer is in the suite — there is a case that will fail without that sentence, and the case has an ID, a date, and a customer-shaped story attached.

EDD vs ship-then-fix vs human-review-only

Three loops claim to ship safely. They are not interchangeable. The right choice depends on team size, traffic volume, and blast radius of a single bad answer.

Loop	What gates merge	Strongest fit	Where it breaks
Ship-then-fix	engineer judgment + dashboard alerts	one-engineer prototype, internal tools, low blast radius	falls apart by the time the system has 2+ engineers and one PM, because nobody can prove a change helped
Human-review-only	senior engineer reads the prompt diff	small teams with one expert prompt-writer, low traffic	does not scale past 3-4 engineers; the reviewer becomes the bottleneck and starts approving on vibes
EDD	eval delta on protected slices	3+ engineers, customer-facing traffic, regulated/money/medical contexts	overhead is wasteful for a one-person experiment that no user will see

The category error is using the wrong loop for the wrong context. A solo founder running ship-then-fix on a paid medical-advice product is courting harm. A two-person internal-tool team running full EDD on every weekend prototype is paying tax for safety they do not need. EDD's overhead is real — typically a couple of engineer-days to stand up the harness, then 5–15 minutes per change to run the fast suite — and it pays off only when the alternative is more expensive than that.

A useful rule of thumb: EDD pays off the moment the cost of one undetected bad answer (refund, hallucinated policy, customer churn, regulatory fine) exceeds the cost of running and maintaining the suite for a month. For most production AI products that crossover happens in the first week.

Why the cost ratio inverts faster than teams expect

The seductive intuition is that evals slow iteration. "More evals = slower iteration" feels true because each eval run takes minutes. But the intuition flips once you count what iteration actually costs without evals.

Without EDD: every change is followed by 3–7 days of ambient uncertainty about whether it helped. The next change is gated on the team's confidence in the previous one. Confidence accumulates slowly because there is no number to lean on. The result is slow iteration disguised as fast commits. The commits land in hours; the durable progress lands over weeks, sometimes never.

With EDD: every change carries a number. The next change starts with the previous change's number as the new baseline. Confidence accumulates per-PR, not per-quarter. The fast suite costs 5–15 minutes; the change itself costs an afternoon; the answer to "did this help?" arrives the same day. The team is on a staircase, not a treadmill.

                  TIME-TO-FEEDBACK PER CHANGE
                  ──────────────────────────
  no evals          3-7 days, sometimes never
  smoke tests       hours, only catches gross regressions
  EDD fast suite    5-15 min on every PR
  EDD full suite    30-90 min, triggered before merge or nightly

The dollar cost matches. A fast suite of 50 cases at $0.002/case for a small judge model is $0.10 per run. A full suite of 500 cases at $0.01/case for a stronger judge is $5 per run. Even with five PRs a day and three nightly full runs, the bill is dollars per day. Compared to one badly-shipped policy bug, the suite is the cheapest insurance the team buys.
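The arithmetic behind those figures, as a small sketch. The per-case prices and run counts are the ones quoted above; the function names are hypothetical, and the daily figure assumes each PR triggers exactly one fast run.

```python
FAST_CASES, FAST_COST_PER_CASE = 50, 0.002   # small judge model
FULL_CASES, FULL_COST_PER_CASE = 500, 0.01   # stronger judge model


def run_cost(cases, cost_per_case):
    """Dollar cost of one suite run."""
    return cases * cost_per_case


def daily_bill(prs_per_day, full_runs_per_day):
    """Dollar cost of a day: one fast run per PR plus the full runs."""
    fast = prs_per_day * run_cost(FAST_CASES, FAST_COST_PER_CASE)
    full = full_runs_per_day * run_cost(FULL_CASES, FULL_COST_PER_CASE)
    return fast + full


print(f"fast run: ${run_cost(FAST_CASES, FAST_COST_PER_CASE):.2f}")
print(f"full run: ${run_cost(FULL_CASES, FULL_COST_PER_CASE):.2f}")
print(f"per day:  ${daily_bill(prs_per_day=5, full_runs_per_day=3):.2f}")
```

Five fast runs cost $0.50 and three full runs cost $15.00, so the day comes to $15.50, almost all of it from the full suite.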

Teacher voice. Not a tax — a tempo. EDD does not slow iteration; it sets the tempo at which iteration is allowed to claim it produced progress. Teams that resist EDD often think they are protecting velocity. They are usually protecting the comfortable illusion of velocity.

Operational signals: when the EDD loop is healthy and when it is rotting

Three numbers tell you whether the discipline is alive.

Eval set growth rate. A healthy EDD team's eval set grows by 1–5 cases per week from production. Faster than that means the system is failing in genuinely new ways, which is a model or design problem worth pausing for. Slower than one case a fortnight means either the system is mature and low-traffic or, much more commonly, the team has stopped capturing live failures. The capture step is the one that decays first, because adding a case is dull work compared to writing a fix.

Regression catch rate. Of every 100 PRs opened, the fast suite should block 5–15 on a protected-slice regression. If the block rate is zero, the suite is too small or too easy, or the team is bypassing the gate. If the block rate is above 25%, the team is either making too many speculative changes or the protected-slice rules are too strict for the system's current maturity.

Mean time from bug to permanent test. A healthy team turns a production miss into an eval case in hours, not days. The longer the latency, the more bugs you forget about. The deepest signal of decay is the team that runs a postmortem, agrees the bug was important, and never gets around to adding the eval case — and the same bug shape recurs in three months.
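The three signals are simple enough to compute from a PR log and a case log. The sketch below uses the thresholds quoted above; the function names are hypothetical, and the "borderline" label for values the text leaves unclassified (for example, growth between one case a fortnight and one a week) is an assumption.

```python
from datetime import datetime
from statistics import mean


def growth_signal(cases_added, weeks):
    """Classify eval-set growth from production captures."""
    per_week = cases_added / weeks
    if per_week > 5:
        return "pause: the system is failing in new ways"
    if per_week < 0.5:  # slower than one case a fortnight
        return "stale: capture has probably stopped"
    return "healthy" if per_week >= 1 else "borderline"


def block_signal(prs, blocked):
    """Classify how often the fast suite blocks a PR."""
    pct = 100 * blocked / prs
    if blocked == 0:
        return "suspicious: suite too small, too easy, or bypassed"
    if pct > 25:
        return "noisy: speculative changes or over-strict slices"
    return "healthy" if 5 <= pct <= 15 else "borderline"


def mean_hours_to_case(pairs):
    """pairs: (bug reported at, eval case added at) datetimes."""
    return mean((added - reported).total_seconds() / 3600 for reported, added in pairs)


print(growth_signal(cases_added=12, weeks=4))
print(block_signal(prs=100, blocked=10))
print(mean_hours_to_case([
    (datetime(2026, 5, 4, 9), datetime(2026, 5, 4, 15)),
    (datetime(2026, 5, 5, 9), datetime(2026, 5, 5, 11)),
]))
```

Tracking the third number per bug, rather than only as a mean, also surfaces the postmortems whose case was never added at all.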

The first metric a beginner watches is the aggregate pass rate. The first metric an experienced team watches is the per-slice delta on the most recent PR. The first graph an expert opens is eval-set growth versus alert-fired growth — when alerts grow but the eval set does not, the team has stopped converting alerts into permanent learning, and the dashboard has become entertainment.

Where EDD pays for itself, and where it just hurts

EDD's overhead is roughly two engineer-days to stand up, plus 5–15 minutes per change. It pays off whenever:

The system is customer-facing and a wrong answer costs more than $1 of remediation.
More than one engineer is editing prompts, retrieval, or tools.
The system is being changed often enough that "did this help?" is asked at least weekly.
The product crosses a regulatory or trust boundary — money, medical, legal, identity, safety.

EDD becomes overhead theatre when:

A solo engineer is running a 48-hour experiment that no user will see.
The system is genuinely one-shot — a research probe, a one-time data clean — and there is no "next change" to gate.
The team's eval rubric is so unstable that running the suite produces non-reproducible noise. (The fix is to stabilise the rubric — see chapter 07 — not to abandon EDD.)
The fast suite has ballooned to 45 minutes; nobody runs it; the loop is broken in spirit even if the harness exists.

The pathology at scale is the EDD theatre trap — the harness exists, the dashboards exist, the PR template asks for the eval delta, but the suite has not had a new case added in two months and the protected slices have not been reviewed since launch. Symptoms look healthy; the loop is dead. The fix is not more tooling; it is restoring the bug → case → PR habit and making it visible in standup.

The wrong mental model: more evals = slower iteration

The most seductive wrong belief about EDD is that it slows the team down. It is the belief that fights every introduction of EDD inside an unfamiliar team, because each individual eval run feels like friction. The belief is also wrong, and naming why is one of the chapter's load-bearing moves.

The belief confuses per-change cost with per-progress cost. A single change is faster without evals — type a sentence, push, done. The change costs minutes. But the progress — the durable, accumulated improvement that lets the team confidently say next week's version is better than last week's — costs days or weeks without evals because the team is forever re-litigating whether the last change helped. With evals, the change costs an afternoon plus 10 minutes of suite, and the progress costs the same afternoon plus 10 minutes. Per-progress, EDD is dramatically faster.

Replace the wrong model with the right one: evals are the only way to make prompt iteration compound. Without them, iteration is a random walk with no memory. With them, iteration is a hill-climb with a history. The kitchen log of every past failure is now permanent; the rubric scores each candidate; the inspection decides whether the change graduates. The team stops re-discovering the same bugs and starts accumulating durable answers.

Six more failure shapes EDD prevents

The undocumented prompt sentence. A new engineer joins, reads the system prompt, deletes a "redundant" sentence — and 18 months of carefully captured policy slices silently regress. EDD prevents this because the suite fails on the deletion before merge.
The retrieval regression hidden by a prompt win. A change improves prompt wording (+5 points) while degrading retrieval recall (-9 on a protected slice). The aggregate still looks slightly positive; only the slice table makes the regression and its cause obvious. Without EDD, the team ships the bundled change because "the prompt clearly reads better."
The judge-drift false positive. A judge model upgrades and starts scoring 4 points higher across the board. Without EDD's per-slice baselines from the prior judge version, the team mistakes the judge drift for a real win and ships a change that did nothing. With EDD, every slice is seen to move by the same amount at once, which a real change rarely does; the judge change is detected and the baseline is reset.
The "we fixed it" without a case. A bug is reported, an engineer ships a one-line fix, marks the ticket closed. Six months later the same bug returns with a slightly different phrasing. Nobody remembers the first fix. EDD requires the case to exist before the fix; the second occurrence trips it.
The model upgrade trap. Vendor releases a new model. Team upgrades because aggregate evals look fine on the vendor's public benchmark. EDD requires running your suite on the upgrade candidate; the protected slices reveal a 9-point regression on enterprise refund handoffs that the public benchmark could not see.
The prompt-bloat death spiral. Each bug fix adds another sentence to the system prompt. After 40 fixes the prompt is 6,000 tokens and the model is ignoring half of it. EDD's protected-slice gate makes prompt-shrinking refactors safe — you can rewrite freely as long as the suite passes.

Where this lives in the wild

The teams that ship reliable LLM products almost all run a version of EDD, with different names for the harness and different views on how aggressive the gate should be.

Anthropic's release pipeline — every Claude release is gated on a battery of internal evals across capability, safety, and behaviour; new failure transcripts captured from previous releases become permanent regression cases for future ones, which is why a "smaller and faster" release rarely regresses on prior protected slices.
OpenAI's evals platform — the publicly released framework is the externalisation of an internal habit: every product team running on top of GPT models was rebuilding the same harness, so the platform commoditised the loop.
Promptfoo — the CI-first eval runner explicitly designed around the EDD inner loop; the configuration is checked into the repo next to the prompts, and the GitHub Action posts the delta directly to the PR.
Braintrust — experiment tracking and dataset versioning sit at the centre; the product encodes "every PR is an experiment, every experiment has a baseline, every baseline is a stored dataset."
LangSmith CI — the "evaluators on every commit" workflow is the team's bet that EDD is the future of LLM engineering, not an option; the integration with LangChain chains is the reason adoption is sticky.
Inkeep and Mendable — RAG-first documentation assistants that publicly describe an EDD workflow where every customer-reported citation error becomes a permanent test case, and the harness blocks ships that regress on the canonical set.
Vellum — prompt versioning where each version is associated with an eval run; the UI literally cannot promote a version that regresses on a flagged slice.
BAML — the typed DSL bakes retries and structured output validation into the same pass that runs the eval; the prompt is treated like a typed function and the test suite is type-narrowing.
Pydantic AI evals — uses Pydantic models as both the schema and the eval contract, so the same definition that constrains the model also defines what passes.
Cursor's internal eval gates — the team publicly tracks tool-call success rate on a held-out repo benchmark and refuses to ship releases that regress; the eval loop is upstream of the demo loop.
GitHub Copilot Chat — the held-out repo set is the launch gate; new failure transcripts captured from telemetry feed back into the regression bank.
Perplexity — citation-accuracy eval gates every model swap; "the answer felt good" is not allowed to clear the gate.
Notion AI Q&A — workspace golden sets must pass before any model or retrieval change; the captured-case set has owners and SLAs for refresh.
Harvey — eval gates anchored to BigLaw associate review for legal drafting; captured failures from real matters become permanent cases.
Casetext CoCounsel — post the Mata v. Avianca hallucination incident, citation-accuracy evals are a hard gate; that incident is itself a permanent regression case.
Intercom Fin — deflection-rate evals against sampled real tickets; every customer-flagged miss converts to a captured case within the same week.
Patronus AI and Galileo — eval-platform vendors whose pitch is that the captured-case loop is too important to leave to a homegrown script; the existence of the market is itself evidence of the discipline's universality.
Arize Phoenix — tracing-first observability with eval primitives, designed so a production trace can become a captured eval case with one click.
LangFuse — open-source equivalent; the "convert trace to dataset row" feature is the most-used button in the product.

The pattern across all of these: the eval is upstream of the change, the capture step is treated as sacred, the protected slices are reviewed quarterly, and the merge gate is numeric.

Cost and tempo: concrete numbers for the refund chatbot

What	Cases	Time	Cost per run	Frequency
Fast suite (the spot check)	47	4 min 12 s	$0.10	every PR
Slice-protected subset	12	1 min 8 s	$0.03	every commit
Full suite	412	38 min	$4.10	before merge + nightly
Adversarial / red-team subset	28	9 min	$1.40	weekly + on safety-touching PRs
Captured case (one new bug)	1	6 s	$0.002	each new bug
Eval-set audit (rubric refresh)	full	2 hrs human	engineer time	quarterly

Read the table for the tempo it implies. A developer's inner loop costs cents and minutes. The full pre-merge check is dollars and tens of minutes — once per PR, not per commit. The whole month's eval bill is dwarfed by one mishandled enterprise refund.

Mini-FAQ. "What if our suite runs in 45 minutes and developers hate waiting?" Split it. The fast subset is the per-commit gate; the full suite is the pre-merge or nightly gate. If even the fast subset is slow, the cases are probably too long or the judge model is too expensive — both fixable without compromising the loop.
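A sketch of one way to do the split, assuming the per-commit subset is every case in a protected slice plus the most recently captured failures. The case representation (dicts with `id`, `slices`, and an ISO-date `captured_on`), the function name, and the example IDs are all hypothetical.

```python
def per_commit_subset(cases, protected_slices, recent_captured=10):
    """Pick the cases that run on every commit; the rest wait for the full suite."""
    protected = [c for c in cases if set(c["slices"]) & protected_slices]
    captured = sorted(
        (c for c in cases if c.get("captured_on")),
        key=lambda c: c["captured_on"],  # ISO dates sort correctly as strings
        reverse=True,
    )[:recent_captured]
    seen, subset = set(), []
    for c in protected + captured:
        if c["id"] not in seen:  # a case can be both protected and recent
            seen.add(c["id"])
            subset.append(c)
    return subset


cases = [
    {"id": "a", "slices": ["enterprise"], "captured_on": "2026-05-04"},
    {"id": "b", "slices": ["damage-no-warranty"], "captured_on": "2026-04-20"},
    {"id": "c", "slices": ["damage-no-warranty"], "captured_on": None},
    {"id": "d", "slices": ["damage-no-warranty"], "captured_on": "2026-01-02"},
]
fast = per_commit_subset(cases, {"enterprise"}, recent_captured=2)
print([c["id"] for c in fast])  # prints ['a', 'b']
```

If most slices are protected, this subset approaches the whole suite, so a team may also need a per-slice cap or a cheaper judge to keep the per-commit run short.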

Self-test before you call your team "eval-driven"

Can every engineer point at the PR template field that asks for the eval delta?
Does the suite have at least one case added from a production miss in the last 14 days?
Is the fast suite green on main, every day, without anyone fixing it manually?
Does the merge gate block on per-slice regression, not only on aggregate?
Can you reproduce a six-month-old bug by running its case ID against today's system?

Five yeses mean the loop is real. Even one no means the discipline is rotting somewhere a postmortem will eventually find.

Pause and recall

State the EDD rule in one sentence: when is a fix allowed to merge?
Name the five steps of the inner loop on the refund chatbot.
Why is the per-slice delta load-bearing? What does an aggregate-only gate miss?
What is the time-to-feedback gap between no-evals iteration and EDD iteration on one change?
Name three operational signals that the EDD loop is rotting.
Why does "more evals = slower iteration" feel true, and why is it wrong?
When does EDD's overhead not pay off?
What is the difference between human review of a prompt diff and the EDD merge gate?

Interview Q&A

Q1. A PM proposes a prompt change because a customer complained. You have an EDD loop. Walk through the first five things you do, in order.

A. Capture the complaint as an eval case with input, expected behaviour, forbidden behaviour, and slice tags. Run the case against the current system to confirm it fails — if it passes, the bug is intermittent and a different mechanism is needed. Make the proposed change in a branch. Run the fast suite and read the per-slice deltas, not the aggregate. If a protected slice regresses, do not merge; rethink. If all protected slices are flat or positive and the captured case now passes, run the full suite, then merge. Common wrong answer to avoid: "Edit the prompt, push, and watch the dashboards." That is ship-then-fix, not EDD; you have just thrown away the captured case and the per-slice attribution.

Q2. Your fast suite has grown to 41 minutes. Developers are bypassing it. What do you do?

A. Triage the suite. Split into a per-commit minimal slice (the protected-slice cases plus the most recent captured failures, target sub-5-minute) and a pre-merge fuller pass. Profile the slow cases — usually a few long multi-turn cases dominate; consider a cheaper judge model on the per-commit subset and the expensive judge on the pre-merge pass. The point of the fast suite is that it stays fast enough that nobody routes around it; the moment developers bypass it, EDD is dead in spirit and you have a dashboard pretending to be a gate. Common wrong answer to avoid: "Make the suite mandatory in CI." Mandatory but slow is how engineers learn to skip it via "trivial change" excuses; speed is a feature.

Q3. The aggregate eval is up 4 points on a candidate prompt, but the enterprise slice is down 7. Ship or not?

A. Do not ship without rethinking. The enterprise slice is a protected slice in any sane setup — it is where the revenue and the brand risk live. A 7-point regression there is more expensive than a 4-point aggregate win is worth, even before counting the asymmetry of who notices a regression. Read the failing enterprise cases, identify the mechanism, and either patch the change to preserve enterprise behaviour or split the change into two PRs that each pass the protected-slice gate independently. Common wrong answer to avoid: "Aggregate is up, ship." That is exactly the slice-blind failure chapter 01 spent thirty-eight points dramatising.

Q4. Why are you running EDD instead of just relying on the daily dashboard and on-call alerts?

A. Dashboards and alerts are downstream — they catch failures that already shipped, on production traffic, where the cost of the bad answer has already been paid. EDD moves the catch upstream, into the PR, where the bad change has not yet shipped and the cost of fixing it is an engineering afternoon. Dashboards remain essential for the failures the suite cannot anticipate — drift, distribution shifts, attack patterns — but they should be the secondary catch, not the primary one. The dashboard catches yesterday's bug; EDD prevents tomorrow's. Common wrong answer to avoid: "Dashboards are enough." That accepts that every bug ships at least once, which is the ship-then-fix loop with extra observability.

Q5. Cumulative — your dashboard from chapter 12 fires an alert: the citation-accuracy slice has dropped 6 points in 24 hours. Walk through the EDD-driven response.

A. Pull 20 failing traces from the trace store (chapter 11). Convert the worst three into permanent eval cases tagged citation-accuracy. Run them against the current system to confirm they fail. Diagnose: is this a retrieval bug (chapter 09 drift territory), judge drift (chapter 08 calibration), or a real prompt or model change that landed today? Branch a candidate fix and run the fast suite, paying specific attention to the citation-accuracy slice and any slice that shares retrieval. Merge only with a positive delta on the captured cases and flat-or-positive deltas on protected slices. The three captured cases now stay in the suite forever, so this specific drift cannot silently return. Common wrong answer to avoid: "Roll back to yesterday's prompt." A rollback may be the right immediate operational move, but without capturing the cases the same drift returns with the next change.
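
A rough sketch of the "convert the worst three" step. It assumes each failing trace carries a judge score where lower is worse; the `Trace` shape, the `capture_worst` helper, and the case-dict layout are all hypothetical, not the trace store's real schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Trace:
    trace_id: str
    user_input: str
    judge_score: float  # assumed scale: 0.0 (worst) to 1.0 (best)


def capture_worst(traces: list[Trace], slice_tag: str, n: int = 3) -> list[dict]:
    """Turn the n lowest-scoring traces into eval-case records tagged with slice_tag."""
    worst = sorted(traces, key=lambda t: t.judge_score)[:n]
    return [
        {
            "case_id": f"{slice_tag}-{t.trace_id}",
            "input": t.user_input,
            "slices": [slice_tag],
            "source_trace": t.trace_id,  # keeps the link back to the original trace
        }
        for t in worst
    ]


# Invented traces; a real pull would be the 20 failing traces from the alert window.
traces = [
    Trace("t01", "Which clause covers returns?", 0.9),
    Trace("t02", "Cite the warranty section.", 0.1),
    Trace("t03", "Where is the refund policy stated?", 0.4),
    Trace("t04", "Quote the shipping terms.", 0.2),
    Trace("t05", "Link the privacy notice.", 0.7),
]

new_cases = capture_worst(traces, "citation-accuracy")
new_case_ids = [c["case_id"] for c in new_cases]
```

Each captured record still needs its expected and forbidden behaviour written by a person before it can be run against the current system.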

Q6. Your team has two engineers and a single internal-tool LLM feature. Is full EDD overkill?

A. Probably yes if the tool is one-shot or has tiny blast radius — a research probe, a one-off data clean. Probably no the moment the tool has more than ten users or touches anything externally consequential. The honest middle path is a minimal harness: a 20-case captured set, a one-command runner, a PR comment that posts the delta. That costs two engineer-days to stand up and pays for itself the first time someone asks "did Tuesday's change help?" The trap is not "EDD or nothing" — it is rebuilding the harness three times in twelve months because each version was scoped too small. Common wrong answer to avoid: "Add evals later when the system stabilises." That confuses cause with effect — the system stabilises because of evals, not before them.
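
The minimal harness could look roughly like the sketch below: a case list, one function that runs it, and a delta table ready to paste into a PR comment. The case format, the substring check, and the two stub systems are assumptions for illustration; a real harness would call the actual feature and the team's own grader.

```python
from collections import defaultdict
from typing import Callable


def run_suite(cases: list[dict], system: Callable[[str], str]) -> dict[str, float]:
    """Pass rate in percent per slice, plus an 'aggregate' row over all cases."""
    passed: dict[str, int] = defaultdict(int)
    total: dict[str, int] = defaultdict(int)
    for case in cases:
        ok = case["must_contain"].lower() in system(case["input"]).lower()
        for name in (case["slice"], "aggregate"):
            total[name] += 1
            passed[name] += int(ok)
    return {name: 100.0 * passed[name] / total[name] for name in total}


def delta_comment(before: dict[str, float], after: dict[str, float]) -> str:
    """Plain-text table of before, after, and delta per slice."""
    lines = ["slice | before | after | delta"]
    for name in sorted(before):
        delta = after[name] - before[name]
        lines.append(f"{name} | {before[name]:.0f}% | {after[name]:.0f}% | {delta:+.0f}")
    return "\n".join(lines)


# Invented two-case suite and stub systems standing in for the real feature.
cases = [
    {"id": "c1", "input": "refund after 45 days", "must_contain": "30 days",
     "slice": "refund-ineligible"},
    {"id": "c2", "input": "refund after 10 days", "must_contain": "refund",
     "slice": "refund-eligible"},
]


def baseline(question: str) -> str:
    return "Your refund is on its way."


def candidate(question: str) -> str:
    return "Refunds are accepted within 30 days of purchase."


before = run_suite(cases, baseline)
after = run_suite(cases, candidate)
comment = delta_comment(before, after)
```

With a table like `comment` attached to every PR, "did Tuesday's change help?" becomes a lookup rather than a debate.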

Q7. The vendor released a new model. You want to upgrade. What does EDD say?

A. Run your full suite against the candidate model with your rubric on your slices. The vendor's public benchmark covers what the vendor measured; it cannot tell you whether your refund-policy slice survives the upgrade. Read the per-slice deltas. If every protected slice is flat or positive, the upgrade is safe to ship behind a canary (chapter 10). If a protected slice regresses, choose one of three paths: change the prompt to recover it, hold the upgrade, or accept the regression with a documented business reason. The vendor's claim is a hypothesis; your suite is the test. Common wrong answer to avoid: "Vendor says +12% on MMLU, upgrade." This is the same chapter-01 mistake at the model-vendor scale.

Q8. Six months in, your eval set has 800 cases and the suite takes too long. What's the systematic response?

A. Audit the suite. Some cases will be near-duplicates (multiple captures of the same bug shape) — collapse them and keep one canonical case per shape. Some cases will be obsolete (testing a behaviour the product no longer supports) — archive but do not delete, so a future regression can be detected. Re-tag cases by slice so the per-slice subset is recomputed. Promote a curated 50–80 case fast suite that covers every protected slice with enough density to detect a 5-point regression; the rest stays in the full suite. The goal is not to shrink — it is to ensure the suite still tests what the product actually does today. Common wrong answer to avoid: "Cut the oldest cases first." Age is not a quality signal; the bug from eighteen months ago may be the regression you most want to block.
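
The collapse-and-archive part of the audit can be sketched as follows. It assumes each case has been hand-labelled with a bug "shape" and an obsolete flag. Both fields, the `audit` helper, and the choice to keep the earliest capture as the canonical case are assumptions of this sketch, not rules from the chapter.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class SuiteCase:
    case_id: str
    shape: str            # hand-assigned label for the bug shape the case covers
    captured_on: date
    obsolete: bool = False


def audit(cases: list[SuiteCase]) -> tuple[list[SuiteCase], list[SuiteCase]]:
    """Split into (active, archived); nothing is deleted.

    Obsolete cases and later duplicates of an already-covered shape are
    archived. Age alone never moves a case out of the active set.
    """
    canonical: dict[str, SuiteCase] = {}
    archived: list[SuiteCase] = []
    for case in sorted(cases, key=lambda c: c.captured_on):
        if case.obsolete or case.shape in canonical:
            archived.append(case)
        else:
            canonical[case.shape] = case
    return list(canonical.values()), archived


# Invented cases for illustration.
cases = [
    SuiteCase("a-1", "refund-window-miscount", date(2024, 11, 2)),
    SuiteCase("a-2", "refund-window-miscount", date(2025, 2, 9)),
    SuiteCase("b-1", "legacy-coupon-flow", date(2024, 12, 1), obsolete=True),
    SuiteCase("c-1", "warranty-vs-refund", date(2025, 4, 3)),
]

active, archived = audit(cases)
active_ids = [c.case_id for c in active]
archived_ids = [c.case_id for c in archived]
```

The oldest case (`a-1`) stays active while a newer duplicate is archived, which matches the warning against cutting by age.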

Apply now (10 min)¶

Step 1 — model the exercise. Take the refund chatbot's captured case from earlier — refund-policy-age-2026-05-04-001. I would write the PR body like this:

Title: Refund window: explicit 30-day rule + retrieval filter for refund vs warranty

Captured case:    refund-policy-age-2026-05-04-001 (was: fail, now: pass)
Fast suite:       66% → 79% aggregate (+13)
Protected slices: refund-eligible flat, refund-ineligible +50, damage+warranty +16,
                  multi-turn flat, tone flat. No regression.
Full suite:       +9 aggregate, no protected-slice regression beyond noise.
Risk:             low. Change is additive in prompt; retrieval filter is opt-in by intent.
Reviewers:        eyeball prompt diff for tone and brand-voice.

The PR description is the eval delta, not a paragraph about why the change "feels right." The reviewer's job is to catch what the suite cannot — style, security, policy nuance — not to relitigate the numbers.

Step 2 — your turn. Pick one LLM feature in your own product. Write the smallest captured case for one real bug you've seen this month. Score the current system, propose a change, and run the change against the case mentally — what would the per-slice delta need to be for you to merge? Which slices would you protect? Which would you let regress? Write your slice-protection rule in one sentence.

Step 3 — reproduce from memory. Without scrolling, redraw the EDD inner loop diagram from section 5. Mark where the inspection, the rubric, the spot check, the kitchen log, and the shift change appear in the loop. Then connect the diagram to chapter 12: where in the loop does the dashboard alert from chapter 12 enter the EDD cycle? If you can do this cold, you carry the chapter.

What you should remember¶

This chapter explained why most teams iterate on LLM prompts forever without actually improving — they ship changes and then ask whether they helped, instead of asking before. EDD inverts the cadence. The eval case is captured before the fix. The fix is gated on a per-slice eval delta, not a human reading of the prompt diff. Every captured failure becomes a permanent test, so the same bug shape cannot quietly return six months later under a new prompt or a new model.

You learned the canonical inner loop — observe failure, capture as case, baseline, change, fast suite per-slice, full suite, merge — and saw it walked end to end on the refund chatbot: one bug, two diffs, a slice table in the PR, a merged change, a permanent case. You also learned the boundary — EDD pays for itself the moment the cost of one undetected bad answer exceeds the cost of running the suite for a month, which is almost always.

Carry this diagnostic forward: when somebody proposes a prompt change, ask one question — "what failing case did you capture, and what is the per-slice eval delta?" If the answer is "it feels better," the loop has broken and the team is back on vibes wearing eval costumes. Treat the captured-case list like source code: review it, version it, refactor it, and protect it from drift.

Remember:

A fix is not allowed to merge until a failing eval case is captured and the per-slice delta is non-negative on protected slices and positive on the captured case.
Per-slice delta beats aggregate every time. An aggregate-only gate is a slice-blind gate.
Eval set growth rate is the deepest health signal of the loop; bug-to-permanent-test latency should be measured in hours, not days.
"More evals means slower iteration" is wrong when measured per unit of progress. Evals make iteration compound; without them, iteration is a random walk with no memory.
EDD's overhead is small (engineer-days to stand up, minutes per change); the alternative — ship-then-fix at scale — costs customer trust, which is not refundable.

Bridge. EDD closes the development loop and makes prompt iteration compound. But every loop has a perimeter — failures the suite cannot anticipate, behaviours the rubric cannot grade, distribution shifts the captured cases were not drawn from, ambiguities no judge will resolve. The next chapter ends the module honestly by naming what evals still miss, where the discipline turns into theatre, and how to keep humility next to rigour so the team does not start trusting the green dashboard more than the unhappy customer.

→ 14-honest-admission.md