07. Evaluation Design: End-to-End Evals, Not Just Component Checks

~11 min read. An eval suite is your system's immune system — without it, every change is a blind gamble.

Built on the ELI5 in 00-eli5.md. The inspection — the evaluation suite checking the whole house — must be designed before you think you are done. A system that passes component tests but has no end-to-end evals is uninspected.

Why component evals are not enough

See. You tested the retriever. Precision@3 = 0.82. Good. You tested the prompt. 90% of sample outputs match expected format. Good. You wire them together and test end-to-end. 61% resolution rate. Bad.

The gap is real. Component evals measure parts. System evals measure outcomes. The user does not care if the retriever is good. The user cares if their question gets answered correctly.

Component eval                    System eval
──────────────────                ──────────────────────────────────
Retriever: precision@3?           Did the user get the right answer?
Prompt: format valid?             Was the answer grounded in context?
Latency: < 800 ms?                Did the user solve their problem?

Both matter. But only system evals catch integration failures. The inspection must include both levels.

The four levels of evaluation

Think of evaluation as four floors of the house.

Floor 4: Business outcome          (hardest, slowest, most honest)
         User task success rate, NPS, ticket resolution rate

Floor 3: End-to-end system quality (medium difficulty, automated)
         Answer correctness, grounding, format compliance

Floor 2: Component quality         (easy, fast, automated)
         Retrieval recall, prompt format adherence, latency

Floor 1: Unit tests                (instant, code-level)
         Input validation, schema checks, null handling

Build all four floors. Start from Floor 1 and add floors incrementally. Do not skip to Floor 4 without having Floors 1–3 in place.
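
To make Floor 1 concrete, here is a minimal sketch of a code-level schema check. The payload shape (`answer`, `sources`) and the function name `validate_answer` are assumptions for illustration; swap in whatever your system actually returns.

```python
def validate_answer(payload):
    """Return a list of schema problems; an empty list means the payload passes."""
    if not isinstance(payload, dict):
        return ["payload is not a dict"]
    problems = []
    answer = payload.get("answer")
    if not isinstance(answer, str) or not answer.strip():
        problems.append("answer must be a non-empty string")
    sources = payload.get("sources")
    if not isinstance(sources, list) or not all(isinstance(s, str) for s in sources):
        problems.append("sources must be a list of strings")
    return problems


# Floor 1 checks: instant, code-level, no model call involved.
assert validate_answer({"answer": "30 days", "sources": ["returns.md"]}) == []
assert validate_answer({"answer": None, "sources": []}) == ["answer must be a non-empty string"]
assert validate_answer(None) == ["payload is not a dict"]
```

Checks like these run in milliseconds, so they can gate every commit before the slower floors run.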

Designing the Floor 3 eval suite

Floor 3 is where most teams underinvest. Here is how to build it.

Step 1: Collect test cases. Aim for at least 100–200 cases. Sources: real user queries (anonymised), synthetic queries from the domain, edge cases (empty context, off-topic, ambiguous), and adversarial queries.

Step 2: Write expected properties, not expected strings.

Wrong: expected output = "The return period is 30 days."
Right: expected properties = {contains_days: True, days_value: 30, cites_source: True}

Why? LLM outputs vary from run to run, and a correct answer can be worded many ways. Exact string matching fails on rephrasings. Property-based assertions are robust to wording variation.
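
The properties above can be checked with a few lines of code. This is a sketch under stated assumptions: the extraction rules are simple regexes, and the citation format `[source: ...]` is hypothetical. Your system's citation style and property set will differ.

```python
import re


def extract_properties(answer: str) -> dict:
    """Derive checkable properties from free-text output (illustrative rules only)."""
    match = re.search(r"(\d+)\s*days?", answer, flags=re.IGNORECASE)
    return {
        "contains_days": match is not None,
        "days_value": int(match.group(1)) if match else None,
        # Assumption: citations look like [source: some-document].
        "cites_source": re.search(r"\[source:[^\]]+\]", answer) is not None,
    }


def check_properties(answer: str, expected: dict) -> list:
    """Return the names of the expected properties that the answer fails."""
    actual = extract_properties(answer)
    return [name for name, value in expected.items() if actual.get(name) != value]


expected = {"contains_days": True, "days_value": 30, "cites_source": True}

# Two different phrasings satisfy the same properties.
failures_a = check_properties("The return period is 30 days. [source: returns-policy]", expected)
failures_b = check_properties("Returns are accepted within 30 days [source: returns-policy].", expected)
# A wrong value is still caught.
failures_c = check_properties("You have 14 days. [source: returns-policy]", expected)

print(failures_a, failures_b, failures_c)  # [] [] ['days_value']
```

The design choice: the test names what must be true of the answer, not how the answer is worded.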

Step 3: Choose a judge. For automated evaluation at scale, use an LLM-as-judge approach.

Judge prompt:
"Given the question, the context, and the model answer,
 rate the answer on three dimensions (0 or 1 each):
 - factually_correct: does the answer correctly answer the question, according to the context?
 - grounded: does the answer avoid claims not in the context?
 - format_valid: is the answer in the requested JSON format?
 Return: {factually_correct: int, grounded: int, format_valid: int}"
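
A judge reply is itself model output, so validate it before you count it. The sketch below builds the prompt and parses the reply; it assumes the judge returns a JSON object with the three keys, and it deliberately leaves out the model call, since that depends on your provider. The names `build_judge_prompt` and `parse_judge_verdict` are hypothetical.

```python
import json

DIMENSIONS = ("factually_correct", "grounded", "format_valid")

JUDGE_TEMPLATE = (
    "Given the question, the context, and the model answer, "
    "rate the answer on three dimensions (0 or 1 each) and return a JSON object "
    "with the keys factually_correct, grounded, format_valid.\n\n"
    "Question: {question}\nContext: {context}\nAnswer: {answer}"
)


def build_judge_prompt(question: str, context: str, answer: str) -> str:
    """Fill the judge template for one test case."""
    return JUDGE_TEMPLATE.format(question=question, context=context, answer=answer)


def parse_judge_verdict(raw: str) -> dict:
    """Parse the judge reply; raise ValueError unless it holds three 0/1 scores."""
    try:
        verdict = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"judge reply is not JSON: {exc}") from exc
    if not isinstance(verdict, dict):
        raise ValueError("judge reply must be a JSON object")
    scores = {}
    for dim in DIMENSIONS:
        value = verdict.get(dim)
        # Reject booleans, floats, strings, and anything other than integer 0 or 1.
        if isinstance(value, bool) or not isinstance(value, int) or value not in (0, 1):
            raise ValueError(f"{dim} must be 0 or 1, got {value!r}")
        scores[dim] = value
    return scores


verdict = parse_judge_verdict('{"factually_correct": 1, "grounded": 1, "format_valid": 0}')
print(verdict)  # {'factually_correct': 1, 'grounded': 1, 'format_valid': 0}
```

A reply that fails to parse should be logged and retried or counted as a judge error, not silently scored as 0.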

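Step 4: Measure inter-rater reliability. Manually label 20 cases, then compare your labels to the LLM judge's labels on the same cases. Agreement here means the share of cases where both give the same label. If agreement is below 70%, the judge is unreliable: fix the judge prompt first. If agreement is 70% or higher, proceed with automated eval at scale.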
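
Here is one way to compute that agreement. Cohen's kappa is included as an optional extra that the step above does not require: it corrects raw agreement for the matches you would expect by chance, which matters when most labels are 1. The 20 labels below are made up for illustration.

```python
def percent_agreement(human, judge):
    """Share of cases where the human label and the judge label match."""
    if len(human) != len(judge) or not human:
        raise ValueError("label lists must be non-empty and the same length")
    return sum(h == j for h, j in zip(human, judge)) / len(human)


def cohens_kappa(human, judge):
    """Agreement corrected for chance: (p_o - p_e) / (1 - p_e)."""
    n = len(human)
    p_o = percent_agreement(human, judge)
    labels = set(human) | set(judge)
    p_e = sum((human.count(l) / n) * (judge.count(l) / n) for l in labels)
    if p_e == 1.0:
        return 1.0  # both raters used one identical label for every case
    return (p_o - p_e) / (1 - p_e)


# Illustrative labels for 20 cases (1 = pass, 0 = fail); not real data.
human = [1] * 16 + [0] * 4
judge = [1] * 14 + [0] * 2 + [1] * 1 + [0] * 3

agreement = percent_agreement(human, judge)
kappa = cohens_kappa(human, judge)
print(round(agreement, 2), round(kappa, 2))  # 0.85 0.57
```

In this made-up sample, raw agreement clears the 70% bar comfortably, while kappa is noticeably lower because both raters say 1 most of the time. With only 20 cases, treat either number as a sanity check, not a precise measurement.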

Worked example: eval metric calculation

You run 150 test cases through the full pipeline and the LLM judge.

factually_correct:  132 / 150 = 0.88   ✓ above target (0.85)
grounded:           140 / 150 = 0.93   ✓ above target (0.90)
format_valid:       144 / 150 = 0.96   ✓ above target (0.95)

Latency p95:        724 ms              ✓ under SLA (800 ms)
Cost per call:      $0.00049            ✓ under budget ($0.002)

This is a passing eval run. Lock these as your baseline before any code change.
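
The arithmetic above is simple enough to automate. This sketch aggregates per-case judge verdicts into pass rates and checks them against the targets. Only the totals (132, 140, 144 out of 150) come from the worked example; the per-case layout is synthetic, built just to reproduce those totals.

```python
TARGETS = {"factually_correct": 0.85, "grounded": 0.90, "format_valid": 0.95}


def pass_rates(verdicts):
    """Fraction of cases scoring 1 on each dimension."""
    n = len(verdicts)
    return {dim: sum(v[dim] for v in verdicts) / n for dim in TARGETS}


def check_targets(rates, targets=TARGETS):
    """Map each dimension to True if its rate meets or exceeds the target."""
    return {dim: rates[dim] >= targets[dim] for dim in targets}


# Synthetic verdicts arranged to give 132, 140 and 144 passes out of 150.
verdicts = [
    {
        "factually_correct": int(i < 132),
        "grounded": int(i < 140),
        "format_valid": int(i < 144),
    }
    for i in range(150)
]

rates = pass_rates(verdicts)
passed = check_targets(rates)
print({dim: round(rate, 2) for dim, rate in rates.items()})
print(all(passed.values()))  # True
```

Store the resulting rates alongside the commit or prompt version they came from, so the baseline is something you can diff against later.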

Now you change the prompt (file 06). Re-run evals.

factually_correct:  138 / 150 = 0.92   ↑ improved
grounded:           135 / 150 = 0.90   ↓ regressed slightly (still meets target)
format_valid:       148 / 150 = 0.99   ↑ improved
Latency p95:        798 ms             ↑ latency increased (still under 800 ms)

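Ship or hold? The regression in grounding is real but small (0.93 to 0.90), and it still meets the 0.90 target with no margin to spare. The improvement in factual correctness (0.88 to 0.92) is larger. Latency p95 now sits only 2 ms under the 800 ms SLA. Decision: ship, but monitor grounding and latency in production.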
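
The ship-or-hold reasoning can be written down as a rule, so the same decision is made the same way every time. This is a sketch of one possible rule, not the only sensible one: hold if any quality metric misses its target, otherwise ship, and flag regressed metrics for monitoring. Latency and cost are left out to keep it short; they would need a "lower is better" comparison.

```python
TARGETS = {"factually_correct": 0.85, "grounded": 0.90, "format_valid": 0.95}


def compare_runs(baseline, candidate, targets):
    """Per-metric delta, target check, and regression flag (higher is better)."""
    report = {}
    for metric, target in targets.items():
        delta = candidate[metric] - baseline[metric]
        report[metric] = {
            "delta": round(delta, 4),
            "meets_target": candidate[metric] >= target,
            "regressed": delta < 0,
        }
    return report


def decide(report):
    """Hold on any missed target; otherwise ship, monitoring any regression."""
    if not all(r["meets_target"] for r in report.values()):
        return "hold"
    if any(r["regressed"] for r in report.values()):
        return "ship and monitor"
    return "ship"


# Rates from the two eval runs in the worked example.
baseline = {"factually_correct": 132 / 150, "grounded": 140 / 150, "format_valid": 144 / 150}
candidate = {"factually_correct": 138 / 150, "grounded": 135 / 150, "format_valid": 148 / 150}

report = compare_runs(baseline, candidate, TARGETS)
regressed = [metric for metric, r in report.items() if r["regressed"]]
print(decide(report), regressed)  # ship and monitor ['grounded']
```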

Simple, no? Numbers make deployment decisions. Opinions do not.

Common eval design mistakes

Mistake 1: Evaluating on the tuning set. If you tune your prompt against 20 queries and also evaluate on those 20 queries, you are testing memorisation, not generalisation. Always keep a held-out set that you never look at during tuning.

Mistake 2: Exact-match evaluation for open-ended output. "The return period is 30 days" ≠ "Returns are accepted within 30 days." Both are correct. Exact match penalises rephrasings. Use property-based assertions.

Mistake 3: No human calibration of the LLM judge. An LLM judge can be confidently wrong. Always calibrate with 20 manual labels before trusting it at scale.

Mistake 4: Evaluating happy-path only. Build edge case tests: empty retrieval results, off-topic queries, conflicting chunks. Real users send all of these.
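
For Mistake 1, one way to keep the held-out set honest is to assign each case to a split by hashing its ID, so the assignment never changes between runs and nobody picks cases by hand. This is a sketch; the case IDs and the 50/50 split are assumptions for illustration.

```python
import hashlib


def assign_split(case_id: str, holdout_fraction: float = 0.5) -> str:
    """Deterministically assign a case to 'tuning' or 'held_out' by hashing its ID."""
    digest = hashlib.sha256(case_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return "held_out" if bucket < holdout_fraction else "tuning"


# Hypothetical case IDs; in practice these come from your test case store.
case_ids = [f"case-{i:03d}" for i in range(150)]
splits = {cid: assign_split(cid) for cid in case_ids}

tuning = [cid for cid, split in splits.items() if split == "tuning"]
held_out = [cid for cid, split in splits.items() if split == "held_out"]
print(len(tuning), len(held_out))
```

Because the split depends only on the ID, adding new cases later does not move existing cases from one set to the other.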

Where this lives in the wild

Hamel Husain's writing on LLM evals: assertion-style checks on outputs, with the LLM judge calibrated against human labels.
OpenAI Evals: open-source framework and registry of evals for testing LLMs and the systems built on them.
Anthropic Constitutional AI: a model critiques and revises outputs against a written set of principles, reducing reliance on human feedback labels.
Yelp review summarisation: evals check factual consistency (no claims beyond source reviews), not just fluency.
LinkedIn AI messaging: end-to-end eval measures professional tone score, not just grammar; property-based.

Pause and recall

What is the difference between a component eval and a system eval?
Name the four floors of evaluation and their difficulty level.
Why is exact-string matching a bad approach for LLM output evaluation?
What is inter-rater reliability and why must you measure it?

Interview Q&A

Q: "How do you evaluate an LLM application before shipping?"

A: I run a four-level eval: unit tests for schemas, component evals for retrieval and prompt format, end-to-end system evals with property-based assertions, and a business outcome baseline if available. For the system eval I use an LLM-as-judge calibrated against 20 human-labelled cases.

Common wrong answer to avoid: "I test on a few examples manually and check if they look right." Manual spot-checking is not an eval suite. It cannot detect regressions.

Q: "What is 'LLM as judge' and what are its limitations?"

A: LLM-as-judge uses a second language model to score the first model's outputs on quality dimensions. Its main limitation is that the judge can be confidently wrong and can share biases with the model under evaluation (especially if both use the same base model). Always calibrate with human labels before trusting at scale.

Common wrong answer to avoid: "LLM-as-judge is objective because it is automated." Automation does not equal objectivity. The judge prompt contains human judgements.

Q: "How do you prevent your eval suite from becoming a test you overfit to?"

A: I keep a strict separation between the tuning set and the eval set. I tune on one set and evaluate on a different held-out set. I never look at the held-out set during development. I periodically refresh the held-out set with new real-world queries so the test cases stay representative as real usage drifts.

Common wrong answer to avoid: "I use all available data for both tuning and eval." This guarantees overfitting to your own test.

Q: "A stakeholder asks: is our AI product good enough to ship? How do you answer?"

A: I show the eval dashboard: factual correctness, grounding, format compliance, latency p95, and cost per call — all compared to the targets we agreed on in the blueprint. If all metrics meet or exceed targets, we are ready. If any metric fails, I identify which component is responsible and fix it.

Common wrong answer to avoid: "It looks good in the demo." A demo is not evidence. Numbers from the eval suite are evidence.

Apply now (5 min)

Write five property-based assertions for your capstone system's expected output. Do not use exact-string matching.
Design your eval test case set: how many cases? Where will you source them?
Write one LLM judge prompt for your specific output format.

Sketch from memory: Draw the four-floor evaluation stack. Label each floor and write one metric that belongs to each floor.

Bridge. The inspection is now designed and running. But inspection at release time is not enough — you need continuous monitoring. → 08-monitoring-observability.md