08. Prompt eval suites: the taste test that decides what ships

~15 min read. A prompt change without an eval is a prayer with a SHA. The eval suite is the daily taste test — the place where bad recipes die before the customer sees them.

Builds on 07-prompt-observability.md. Observability tells you what happened. The eval suite tells you what would happen — before the change reaches a real customer.

1) Hook: the eval that did not exist

A team at a B2B AI company ships a small prompt update on a Wednesday. The change loosens a constraint about response length. The team has no eval suite. They have vibes. They have one engineer who tests with three sample queries before deploying. The change ships.

Three days later, a finance customer complains. The bot is now writing 600-word responses to questions that used to get 60-word answers. Their executives are getting "essay-length spam" instead of the concise summaries they had been paying for. The customer threatens to cancel.

The team rolls back. The senior engineer writes the incident postmortem. Root cause is identified — "no eval suite for response length distribution". Action item is created — "build an eval suite for the support bot". The action item sits in the backlog for two months because everyone is busy.

Two months later, the same loosening regresses through a different path: a PM edits a phrase to "be more helpful," the eval suite still does not exist, and the bot starts writing 400-word essays again. Same incident, same cause, no defense.

This chapter is the eval suite that should have existed the first time. It is the cheapest insurance you can buy and the one most teams skip.

2) The metaphor: the daily taste test

Every morning at the bakery, the head baker tastes yesterday's bread. Same loaves, same proportions, same scoring. She tastes the salt level, the crumb texture, the crust color. If something is off, she catches it before the morning rush. The taste test is not glamorous. It is not creative. It is daily, mechanical, and the reason customers keep coming.

The taste test in the recipe book model is the eval suite. A fixed set of representative inputs. A fixed set of grading dimensions. A reproducible run. A pass/fail decision against the prior known-good version.

The eval is what makes a prompt change reviewable in 5 minutes rather than guessable forever.

3) The anatomy: what an eval suite contains

┌──────────────────────────────────────────────────────────┐
│ EVAL SUITE ANATOMY                                       │
├──────────────────────────────────────────────────────────┤
│ 1. EVAL SET     50-500 inputs, stratified                │
│ 2. RUBRIC       what counts as good — per dimension      │
│ 3. JUDGE        how each output is scored                │
│ 4. AGGREGATOR   how per-input scores roll up             │
│ 5. GATE         pass/fail threshold vs prior SHA         │
│ 6. RUNNER       CI job that executes the whole thing     │
└──────────────────────────────────────────────────────────┘

The eval set is the inputs. The rubric defines what good looks like. The judge produces the score. The aggregator combines scores. The gate decides ship/no-ship. The runner is the CI job that ties it all together.

Most teams know about the eval set. Fewer think carefully about the rubric. Almost nobody thinks about the judge until it bites them.
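The six parts can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: `EvalCase`, `run_suite`, `gate`, and the `max_drop` tolerance are hypothetical names and values chosen for the sketch, and `generate` and `judge` stand in for whatever calls your model and your scorer.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable


@dataclass
class EvalCase:
    """One entry of the EVAL SET (field names are illustrative)."""
    case_id: str
    stratum: str      # e.g. "frequent", "edge", "regression", "domain"
    user_input: str


# JUDGE: maps (case, model output) to one score per RUBRIC dimension.
Judge = Callable[[EvalCase, str], dict[str, float]]


def run_suite(cases: list[EvalCase],
              generate: Callable[[str], str],
              judge: Judge) -> dict[str, float]:
    """RUNNER + AGGREGATOR: score every case, then average per dimension."""
    per_dimension: dict[str, list[float]] = {}
    for case in cases:
        output = generate(case.user_input)
        for dimension, score in judge(case, output).items():
            per_dimension.setdefault(dimension, []).append(score)
    return {dim: mean(scores) for dim, scores in per_dimension.items()}


def gate(prior: dict[str, float], new: dict[str, float],
         max_drop: float = 0.02) -> list[str]:
    """GATE: dimensions where the new SHA drops more than max_drop (assumed)."""
    return [dim for dim in prior if prior[dim] - new.get(dim, 0.0) > max_drop]
```

An empty list from `gate` means the change may ship; a non-empty list names the dimensions to look at.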

4) The eval set: stratified, fresh, and 50-500 examples

EVAL SET STRATIFICATION
───────────────────────
Frequent queries        : 40% of set    → what users see most
Edge cases              : 20% of set    → known difficult cases
Regression cases        : 25% of set    → bugs we already fixed
Domain coverage         : 15% of set    → rare but important domains

Why this stratification:

Frequent queries are the bulk of business. If the eval misses the common case, the eval is useless.
Edge cases are where bad prompts fail. Hard JSON parsing, ambiguous user intent, long contexts.
Regression cases are bug reports turned into eval examples. Every incident in module 11 produces a new regression eval. The suite grows.
Domain coverage ensures rare-but-important customers (finance, healthcare, legal) are represented.

Size: 50 examples is the floor for any production suite. 200-500 is typical for a system with multiple prompt versions. Going above 1,000 usually stops paying off, because marginal coverage drops while CI time becomes painful.
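As a rough sketch of how the stratification quotas could be applied when assembling a set, the snippet below draws a fixed share from each stratum. The function name `build_eval_set`, the stratum keys, and the shape of `pool` (a dict from stratum name to candidate examples) are assumptions made for illustration.

```python
import random

# Shares taken from the stratification table above.
STRATUM_SHARE = {
    "frequent": 0.40,
    "edge": 0.20,
    "regression": 0.25,
    "domain": 0.15,
}


def build_eval_set(pool: dict[str, list[str]], total: int, seed: int = 0):
    """Draw a stratified sample of roughly `total` examples.

    Quotas are rounded per stratum, so for some totals the result can be
    off by an example or two. A fixed seed keeps the draw reproducible.
    """
    rng = random.Random(seed)
    selected = []
    for stratum, share in STRATUM_SHARE.items():
        quota = round(total * share)
        candidates = pool.get(stratum, [])
        if len(candidates) < quota:
            raise ValueError(
                f"stratum {stratum!r} has {len(candidates)} candidates, needs {quota}"
            )
        selected.extend((stratum, example) for example in rng.sample(candidates, quota))
    return selected
```

Failing loudly when a stratum is short is deliberate: a set that silently under-samples regression cases looks complete while missing the examples that matter most.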

Freshness is essential. On a regular cadence (section 9 uses a weekly loop), sample new examples from production traces. Label them. Add the surprising ones to the eval set. An eval suite that does not grow rots, because production drifts past it.

5) The rubric: what counts as good

A rubric defines the dimensions you grade on. Not all prompts grade on the same dimensions.

For a customer-support greeter, a useful rubric might be:

RUBRIC (customer-support greeter)
─────────────────────────────────
1. Greeting present and uses name correctly      [yes/no]
2. Tone matches brand voice (1 = formal, 5 = warm)
3. Response addresses the user's question        [yes/partial/no]
4. Length appropriate for question complexity    [too short/right/too long]
5. No banned phrases                             [yes/no]

For an extraction prompt, the rubric is different:

RUBRIC (order-id extractor)
───────────────────────────
1. Valid JSON                                    [yes/no]
2. order_id field present                        [yes/no]
3. order_id matches pattern ^[0-9]{6,10}$       [yes/no]
4. amount_cents within plausible range           [yes/no]
5. No hallucinated fields                        [yes/no]

The rubric is task-specific. Generic "is this output good" rubrics produce generic "yeah I guess" scores. Specific rubrics produce diagnostic feedback. When the eval fails, you know which dimension regressed.
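The order-id extractor rubric maps almost line for line onto programmatic checks. The sketch below is one possible encoding, under assumptions the text does not state: the allowed field set, the plausible range for `amount_cents`, and the choice to treat only a JSON object as valid are all illustrative.

```python
import json
import re

ORDER_ID_RE = re.compile(r"^[0-9]{6,10}$")      # pattern from the rubric
ALLOWED_FIELDS = {"order_id", "amount_cents"}   # assumed schema
AMOUNT_RANGE = range(1, 10_000_000)             # assumed: 1 cent up to just under $100k

DIMENSIONS = [
    "valid_json",
    "order_id_present",
    "order_id_pattern",
    "amount_plausible",
    "no_extra_fields",
]


def grade_extraction(raw_output: str) -> dict[str, bool]:
    """Return one pass/fail flag per rubric dimension."""
    result = dict.fromkeys(DIMENSIONS, False)
    try:
        data = json.loads(raw_output)
    except json.JSONDecodeError:
        return result
    if not isinstance(data, dict):   # assumption: only a JSON object counts
        return result
    result["valid_json"] = True

    order_id = data.get("order_id")
    result["order_id_present"] = order_id is not None
    # fullmatch rejects a trailing newline that "$" alone would tolerate.
    result["order_id_pattern"] = (
        isinstance(order_id, str) and ORDER_ID_RE.fullmatch(order_id) is not None
    )

    amount = data.get("amount_cents")
    result["amount_plausible"] = (
        isinstance(amount, int) and not isinstance(amount, bool) and amount in AMOUNT_RANGE
    )
    result["no_extra_fields"] = set(data) <= ALLOWED_FIELDS
    return result
```

Because each dimension is reported separately, a failing run shows whether the model broke the JSON, the id format, or invented a field.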

6) The judge: and the judge's bias

The judge produces the score. Three judge types:

JUDGE TYPES
───────────
1. PROGRAMMATIC   regex, parser, schema validator (cheap, deterministic)
2. LLM-as-JUDGE   another model rates the output (medium cost, has bias)
3. HUMAN          a person rates the output (high cost, gold standard)

Programmatic judges are the cheapest and most reliable, but they only work for objective dimensions — schema validity, banned-word presence, length bounds. Use them whenever possible.
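For the greeter rubric, the objective dimensions can be checked the same way. The following is a small sketch, not a production grader: `BANNED_PHRASES`, `MAX_WORDS`, and the rule that the name must appear in the first sentence are hypothetical choices made for the example.

```python
import re

BANNED_PHRASES = ["as an ai", "i cannot help"]   # hypothetical list
MAX_WORDS = 120                                  # assumed length bound


def grade_greeting(output: str, customer_name: str) -> dict[str, bool]:
    """Programmatic checks for the objective greeter dimensions."""
    lowered = output.lower()
    # Treat everything up to the first sentence break as the greeting.
    first_sentence = re.split(r"[.!?\n]", output, maxsplit=1)[0]
    return {
        "greeting_uses_name": customer_name.lower() in first_sentence.lower(),
        "no_banned_phrases": not any(phrase in lowered for phrase in BANNED_PHRASES),
        "within_length": len(output.split()) <= MAX_WORDS,
    }
```

Tone and "addresses the question" are left out on purpose, since those are the dimensions a substring check cannot decide.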

LLM-as-judge handles subjective dimensions such as tone, helpfulness, and completeness. But LLM judges have biases. They tend to prefer outputs that look like their own. They often prefer longer outputs. They can favor outputs with hedging language. These biases are measurable, which is why the judge itself needs checking.

Three mitigations for judge bias:

Ensemble judges. Use 3 different judge models (Claude, GPT, Gemini) and aggregate.
Pairwise comparison. Instead of "rate this 1-5", ask "which is better, A or B" and randomize order. Pairwise is more robust to absolute-score drift.
Calibration set. Hand-label 50-100 examples once. Re-run the LLM judge against the calibration set every quarter. If judge agreement with human labels drops below 80%, the judge has drifted and needs retuning or replacement.
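The first two mitigations can be combined in a few lines. In this sketch each judge is a plain callable that returns `"first"` or `"second"`; that signature is an assumption for illustration, not any vendor's API. Instead of picking one random order, the sketch asks in both orders and treats a flipped verdict as a tie.

```python
from collections import Counter
from typing import Callable

# Hypothetical judge signature: (question, shown_first, shown_second) -> "first" | "second"
PairwiseJudge = Callable[[str, str, str], str]


def pairwise_verdict(question: str, answer_a: str, answer_b: str,
                     judge: PairwiseJudge) -> str:
    """Ask in both orders; accept a winner only if the judge is consistent."""
    a_wins_when_first = judge(question, answer_a, answer_b) == "first"
    a_wins_when_second = judge(question, answer_b, answer_a) == "second"
    if a_wins_when_first and a_wins_when_second:
        return "A"
    if not a_wins_when_first and not a_wins_when_second:
        return "B"
    return "tie"   # verdict followed the position, not the content


def ensemble_verdict(question: str, answer_a: str, answer_b: str,
                     judges: list[PairwiseJudge]) -> str:
    """Majority vote across judges; no strict majority counts as a tie."""
    votes = Counter(pairwise_verdict(question, answer_a, answer_b, j) for j in judges)
    winner, count = votes.most_common(1)[0]
    return winner if count > len(judges) / 2 else "tie"
```

Running both orders doubles the judge calls per comparison, which is the price of separating position bias from a real preference.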

Human judges are the gold standard but expensive. Reserve them for the calibration set and for high-stakes domains (medical, legal, financial).
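The calibration check itself is simple arithmetic. A minimal sketch, assuming labels are stored as two parallel lists and that plain percent agreement is the metric (the text gives the 80% threshold but not the metric, so that part is an assumption):

```python
AGREEMENT_FLOOR = 0.80   # threshold from the text


def judge_agreement(human_labels: list[str], judge_labels: list[str]) -> float:
    """Fraction of calibration examples where judge and human agree."""
    if not human_labels or len(human_labels) != len(judge_labels):
        raise ValueError("label lists must be non-empty and the same length")
    matches = sum(h == j for h, j in zip(human_labels, judge_labels))
    return matches / len(human_labels)


def judge_has_drifted(human_labels: list[str], judge_labels: list[str]) -> bool:
    """True when agreement falls below the floor and the judge needs attention."""
    return judge_agreement(human_labels, judge_labels) < AGREEMENT_FLOOR
```

On a 50-example calibration set, 41 matches gives 0.82 and passes, while 39 matches gives 0.78 and flags the judge.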

7) Worked example: a regression suite in CI

A team runs a customer-support prompt through this suite on every PR:

EVAL RUN: PR #482, prompt support.agent.greeter
SHA prior:  a8c3f9...
SHA new:    b1d7e4...
Eval set:   220 examples (stratified)
Runtime:    11 minutes (parallel inference, judge cached)

DIMENSION                       PRIOR    NEW    DELTA   PASS?
greeting_correct                95.0%   96.3%   +1.3    ✓
tone_score (1-5)                 4.21    4.18   -0.03   ✓ (within noise)
addresses_question (yes-ratio)  91.4%   92.1%   +0.7    ✓
length_appropriate (yes-ratio)  88.6%   78.2%  -10.4    ✗ (regression)
no_banned_phrases               100%    100%    0       ✓

OVERALL                         FAIL (4/5 dimensions pass) → BLOCKED on length_appropriate

The eval gate blocks the PR. The author looks at the length dimension. The prompt change loosened a length constraint. The author tightens the phrasing and re-runs. New SHA c4f7a2.... Eval passes 5/5. PR merges.

Without the eval, the length regression would have shipped, and the customer complaint would have arrived days later. With the eval, the regression is caught in 11 minutes.
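The gate decision in this run can be reproduced with a short comparison. The scores below are copied from the table; the tolerances are assumptions, since the text only says the tone change is "within noise" without giving a number.

```python
# Scores from the eval run above (percentages, except tone on a 1-5 scale).
PRIOR = {
    "greeting_correct": 95.0,
    "tone_score": 4.21,
    "addresses_question": 91.4,
    "length_appropriate": 88.6,
    "no_banned_phrases": 100.0,
}
NEW = {
    "greeting_correct": 96.3,
    "tone_score": 4.18,
    "addresses_question": 92.1,
    "length_appropriate": 78.2,
    "no_banned_phrases": 100.0,
}

# Assumed noise tolerances; tune these from repeated runs of the same SHA.
TOLERANCE = {"tone_score": 0.10}   # points on the 1-5 scale
DEFAULT_TOLERANCE = 2.0            # percentage points


def regressed_dimensions(prior: dict[str, float], new: dict[str, float]) -> list[str]:
    """Dimensions whose score dropped by more than the allowed tolerance."""
    failed = []
    for dimension, old_score in prior.items():
        drop = old_score - new[dimension]
        if drop > TOLERANCE.get(dimension, DEFAULT_TOLERANCE):
            failed.append(dimension)
    return failed


def ci_exit_code(prior: dict[str, float], new: dict[str, float]) -> int:
    """Non-zero exit blocks the PR in most CI systems."""
    return 1 if regressed_dimensions(prior, new) else 0
```

With these numbers, `regressed_dimensions(PRIOR, NEW)` returns only `length_appropriate`, matching the blocked PR.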

Mid-content recall

Why does an eval set need stratification rather than just sampling random production traffic?
What three mitigations reduce LLM-as-judge bias?
When is a programmatic judge preferred over an LLM judge?

8) Regression-only eval vs full eval

Not every change needs a full eval. A small wording fix that is intended to have no behavioral impact only needs a regression eval — pass = no degradation against the prior SHA. A larger change that adds new few-shot examples or rewrites instructions needs a full eval — measuring both regression and target metric improvement.

CHANGE TYPE                  EVAL TYPE                COMPUTE
───────────────────────────  ───────────────────────  ──────────
small wording fix            regression-only          fast (~5 min)
constraint tightening        regression + drift       medium (~10 min)
new few-shot examples        full eval                slow (~20 min)
section rewrite              full eval + manual spot  slowest (~30 min + human)
new tool description         full eval + agent trace  slow (~25 min)
multi-tenant base change     full eval per tenant     slowest (×N tenants)

The runtime is a real constraint. CI that takes 30 minutes per PR slows iteration. Most teams gate "small fix" PRs with a fast regression eval and reserve the full eval for tagged "behavioral change" PRs.

The author declares the change type. The reviewer verifies the declaration. If a "small fix" turns out to regress on the full eval, the author gets feedback and the team's discipline tightens.
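One way to make the declaration machine-readable is a PR label that selects the eval plan. The label names and the `plan_for_pr` helper below are hypothetical; the mapping mirrors the table above, and falling back to the full eval when the declaration is missing or ambiguous is an assumed policy.

```python
# Hypothetical PR labels mapped to the eval stages from the table above.
EVAL_PLAN = {
    "small-wording-fix": ["regression"],
    "constraint-tightening": ["regression", "drift"],
    "new-few-shot-examples": ["full"],
    "section-rewrite": ["full", "manual-spot-check"],
    "new-tool-description": ["full", "agent-trace"],
    "multi-tenant-base-change": ["full-per-tenant"],
}
FALLBACK_PLAN = ["full"]   # assumed policy: unclear declarations get the full eval


def plan_for_pr(labels: list[str]) -> list[str]:
    """Pick the eval stages for a PR from its declared change-type label."""
    declared = [label for label in labels if label in EVAL_PLAN]
    if len(declared) != 1:
        # Zero or several change types declared: do not guess, run everything.
        return list(FALLBACK_PLAN)
    return list(EVAL_PLAN[declared[0]])
```

Defaulting to the slower path keeps a forgotten label from turning into a skipped gate.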

9) Coverage gaps and active learning

The eval suite catches what it knows about. It does not catch what it does not know about. Coverage gaps are the most common cause of "the eval passed but the customer complained."

How coverage gaps form:

Production introduces a new query pattern (e.g., voice-input users, who phrase requests more loosely than typists).
A new feature ships, and downstream prompts receive contexts that the eval set never contained.
A new tenant onboards with domain-specific vocabulary the eval set never covered.

How to close gaps — active learning:

ACTIVE LEARNING LOOP
────────────────────
1. Sample production traces weekly (100-500 examples)
2. Filter for "interesting" — high latency, low confidence, user feedback
3. Hand-label a subset (20-50)
4. Add the surprising ones to the eval set
5. Re-run all prompts against the updated eval; mark any new regressions

This loop costs ~2-4 hours per week. It is the single highest-ROI activity in keeping an eval suite fresh.

The "interesting" filter is the trick. Random sampling produces mostly typical examples that the suite already covers. Targeting outliers — by latency, by judge disagreement, by user feedback — surfaces the cases the suite is blind to.
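A sketch of such a filter is below. The `Trace` fields, the 95th-percentile latency cutoff, and the judge-spread threshold are all assumptions for illustration; real traces will carry whatever your observability layer records.

```python
from dataclasses import dataclass


@dataclass
class Trace:
    """Illustrative trace record; field names are hypothetical."""
    trace_id: str
    latency_ms: float
    judge_scores: list[float]   # one score per judge in the ensemble
    thumbs_down: bool


def is_interesting(trace: Trace, latency_cutoff_ms: float,
                   max_judge_spread: float = 1.0) -> bool:
    """Flag outliers by user feedback, latency, or judge disagreement."""
    spread = max(trace.judge_scores) - min(trace.judge_scores) if trace.judge_scores else 0.0
    return (
        trace.thumbs_down
        or trace.latency_ms > latency_cutoff_ms
        or spread > max_judge_spread
    )


def pick_for_labeling(traces: list[Trace], budget: int = 50) -> list[Trace]:
    """Return up to `budget` flagged traces for hand-labeling."""
    if not traces:
        return []
    latencies = sorted(t.latency_ms for t in traces)
    p95 = latencies[int(0.95 * (len(latencies) - 1))]   # simple percentile estimate
    flagged = [t for t in traces if is_interesting(t, p95)]
    return flagged[:budget]
```

The budget matches step 3 of the loop: the filter narrows the weekly sample to a set a person can label in one sitting.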

10) Failure modes

Signal	Likely cause	Fix
Eval passes but production regresses	Coverage gap — production has patterns the eval misses	Active learning loop; sample from production
Eval flaky (passes sometimes, fails sometimes)	Non-deterministic judge or non-deterministic prompt (temperature unintentionally above 0)	Pin temperature to 0 for eval runs, which reduces but may not fully remove run-to-run variance; cache judge calls
Eval too slow, CI bottleneck	Eval set too large or inference not parallel	Parallelize inference; reduce non-load-bearing examples
Eval set ages out	No active learning loop	Schedule weekly sampling and labeling
Judge disagrees with humans	Judge model drift or wrong rubric framing	Recalibrate against the calibration set quarterly
Pairwise eval order-dependent	Position bias in judge	Randomize order; run both A-vs-B and B-vs-A
Eval fails for "stylistic" reasons but humans say output is fine	Judge over-weighting a dimension humans do not care about	Rebalance rubric weights; consult product PM
Eval suite has no regression cases from incidents	Postmortem process does not feed eval	Add regression case to eval as a required action item in every postmortem

The last row is the most common organizational failure. Postmortems produce action items. Eval-as-regression-case is the action item that prevents recurrence.

11) Continuous eval: beyond pre-deploy

Most teams treat eval as a pre-deploy gate. Mature teams also run eval on production traces continuously.

CONTINUOUS EVAL
───────────────
Daily:    sample 500 production traces, run rubric, alert on score drop
Weekly:   compare this week's distribution to last week's
Monthly:  recompute calibration set agreement with judge
Quarterly: full re-evaluation against the entire eval set + audit gaps

Continuous eval catches drift that the pre-deploy gate misses — model upgrades that shift behavior, user-input distribution shifts, prompts that "work in eval but degrade slowly in production."

Continuous eval is cheaper than it sounds. Judge calls can be cached, so a repeated (input, output) pair reuses its score instead of triggering a new call, and 500 traces is a small batch. As a rough figure, daily eval on 500 traces costs <$10/day at typical pricing.
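The caching and the daily alert can be sketched as follows. `CachedJudge`, `should_alert`, and the `max_drop` threshold are hypothetical; the in-memory dict stands in for whatever persistent store a real deployment would use.

```python
import hashlib
import json
from statistics import mean
from typing import Callable


class CachedJudge:
    """Wrap a judge so each distinct (input, output) pair is scored once."""

    def __init__(self, judge_fn: Callable[[str, str], float]):
        self._judge_fn = judge_fn
        self._cache: dict[str, float] = {}

    def score(self, user_input: str, output: str) -> float:
        # Hash the pair so the cache key stays small and stable.
        key = hashlib.sha256(
            json.dumps([user_input, output]).encode("utf-8")
        ).hexdigest()
        if key not in self._cache:
            self._cache[key] = self._judge_fn(user_input, output)
        return self._cache[key]


def should_alert(baseline_mean: float, todays_scores: list[float],
                 max_drop: float = 0.2) -> bool:
    """Alert when today's mean falls more than max_drop below the baseline."""
    return baseline_mean - mean(todays_scores) > max_drop
```

The cache only pays off when the same pair recurs, for example when a re-run or a backfill scores traces that were already judged.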

Where this lives in the wild

Promptfoo — open-source eval framework, declarative test config, judge-agnostic.
DeepEval — Python-first eval framework, RAG and agent metrics built in.
OpenAI Evals — original eval framework, used inside OpenAI for model gating.
Inspect AI (UK AISI) — research-grade eval framework for capability evals.
Patronus AI — eval platform with adversarial test generation.
Galileo — RAG and LLM eval with continuous monitoring.
Phoenix (Arize) — open-source observability + eval, especially RAG.
TruLens — open-source eval and observability.
RAGAS — RAG-specific evals (faithfulness, answer relevance, context relevance).
Braintrust — eval-first platform, pairwise judging, CI integration.
Langfuse evals — eval plus observability in one tool.
LangSmith evals — eval plus tracing within LangChain ecosystem.
Vellum — UI-driven evals for non-engineers.
PromptLayer evals — registry plus eval suite.
Pezzo evals — open-source registry plus eval.
Helicone evaluations — observability-first with eval bolted on.
GitHub Actions — common CI runner for eval suites.
GitLab CI — same.
CircleCI — same, often used at scale.
Argilla — labeling tool for human eval annotation.
Label Studio — open-source labeling for eval set construction.
Prodigy — Explosion AI's labeling tool.
Scale AI / Surge AI / Labelbox — managed human labeling for high-stakes evals.
OpenAI's Model Spec — defines what counts as a good answer at a meta level.

Pause and recall

What six elements make up an eval suite?
Why is the rubric task-specific rather than generic?
What three mitigations reduce LLM-as-judge bias?
When is a regression-only eval sufficient vs a full eval?
How does the active learning loop keep an eval set fresh?
What is continuous eval and what does it catch that pre-deploy eval misses?
Why should every postmortem include an action item that adds a new regression case to the eval set?

Interview Q&A

Q1. Walk me through building an eval suite from scratch for a customer-support agent. A. Five steps. (1) Sample 200-500 production traces, stratified by intent and tenant. (2) Define a task-specific rubric — usually 4-6 dimensions including greeting, tone, correctness, length, banned-phrase check. (3) Pick judge type per dimension — programmatic where possible, LLM-as-judge for subjective ones, with an ensemble of 3 LLM judges. (4) Build a CI runner with pass/fail gates against the prior SHA. (5) Establish a weekly active learning loop to keep the set fresh. Trap: "Use a generic eval framework's defaults." Defaults are generic; your rubric must be specific to your product.

Q2. How do you handle LLM-as-judge bias? A. Three mitigations. (1) Ensemble of 3 judges across different model families (Anthropic, OpenAI, Google) — average or vote. (2) Pairwise comparison instead of absolute scoring, with randomized order. (3) A human-labeled calibration set of 50-100 examples that the judge is benchmarked against quarterly. If judge agreement with humans drops below 80%, recalibrate or replace the judge. Trap: "Use a stronger judge." A stronger judge has its own biases, and you have no insight into them.

Q3. Your eval passes but a customer complains. What do you do? A. (1) Reproduce the complaint with the customer's exact input. (2) Add the input to the eval set. (3) Run the suite — if it still passes, the rubric is missing a dimension; if it now fails, the eval just needed coverage. (4) In either case, the new example becomes a permanent regression case. (5) Audit the coverage gap — what other queries are missing that look like this one? Trap: "Add the complaint as a special case." Special cases proliferate; the right fix is rubric or coverage expansion.

Q4. Eval runs are taking 30 minutes in CI and slowing development. What do you fix? A. Three levers and a fallback. (1) Parallelize inference, since the eval is embarrassingly parallel across examples. (2) Cache judge calls, so the same (input, output) pair reuses its earlier score. (3) Distinguish change types: a small wording fix gets a fast regression eval, a behavioral change gets the full eval. (4) If it is still too slow, prune examples whose marginal coverage is low. Trap: "Reduce the eval set size globally." That weakens the gate. Cut judiciously.

Q5. How do you keep an eval set from rotting? A. Active learning loop. Weekly, sample 100-500 production traces, filter by interestingness (latency outliers, low judge confidence, user feedback), hand-label a subset, add surprising examples to the set. Quarterly, audit for coverage gaps by tenant, intent, and rare-domain. Every postmortem produces at least one new regression case in the eval set. Trap: "We will update the eval set next quarter when we have time." Eval sets rot in weeks, not quarters.

Q6. When is a programmatic judge preferred over an LLM judge? A. Whenever the dimension is objective: JSON validity, schema match, banned-word presence, length bounds, regex match, downstream parser success. Programmatic judges are deterministic, nearly free to run, and free of judge-model bias. Save LLM judges for subjective dimensions (tone, helpfulness, completeness) where no programmatic check is possible. Trap: "LLM judges can do everything." They can attempt everything, but they cost more and add bias. Use them where they earn their cost.

Q7. What is continuous eval and what does it catch? A. Daily sampling of production traces, running the eval rubric, alerting on score drops. Catches drift that the pre-deploy eval misses — silent model upgrades, user-input distribution shifts, prompts that pass eval but degrade slowly in production. Cheap to run because judge calls cache. Trap: "We have pre-deploy eval, so continuous is overkill." Pre-deploy eval freezes a snapshot. Production drifts.

Apply now (5 min)

Step 1 — sample 10 production traces. Pick a prompt in production. Pull 10 recent traces. Read them.

Step 2 — draft a rubric. From those 10 traces, draft 4-6 dimensions you would grade on. Be specific — "tone is warm" is vague, "uses first name in greeting" is testable.

Step 3 — judge type per dimension. For each dimension, decide whether a programmatic check or an LLM judge is right. If LLM, write the judge prompt.

A 30-minute version of an eval suite built this way catches roughly 80% of regressions as a rule of thumb. The full version is incremental on top.

Bridge. A single eval suite gates the prompt you ship to all customers. But customers are not all the same. The next chapter is how the recipe book handles per-customer variants without forking into N copies. → 09-multi-tenant-prompts.md