03. The eval backstop¶

The audit produces the map. Before changing anything against that map, you need a safety net that catches behaviour regressions you cannot otherwise see. The eval backstop is the minimum coverage that lets you ship changes responsibly. Without it, modernisation is gambling.

A platform lead at a Chennai logistics company inherits a route-explanation feature. The audit identifies the top three failure shapes from customer complaints. The lead picks fifty real production examples, has a domain expert label the expected behaviour for each, and commits the set to an evals/route_explanation/ directory with a small runner. The runner takes a model+prompt pair, runs each input, and uses an LLM-as-judge with a tight rubric to score each output against the expected behaviour. The first run scores 0.72. The lead now has the number that any change will be measured against. Two weeks later, when she carves the prompt out of code into a registry (chapter 05), the eval re-runs automatically and scores 0.73. The migration was safe; the eval said so. Without the eval, she would have shipped the change on faith and learned about regressions through customer complaints two weeks later.

The eval backstop is the single highest-leverage thing you build during modernisation. It moves the system from frozen to eval-backed — the state in which change becomes possible.

What the backstop is for¶

The backstop is not the eventual full eval suite. It is the minimum coverage that catches regressions on the known failure modes and the core happy path. It exists to let you ship the next change with confidence.

Concretely:

A small set (30–100 examples) of representative inputs
For each input, an expected behaviour — sometimes an exact expected output, sometimes a rubric describing what a good output looks like
A scoring mechanism that produces a number you can track across runs
A runner that executes the system against the set and reports scores

The backstop does not need to be exhaustive. It needs to be real — drawn from actual production behaviour, labelled by someone who knows what good looks like, and run on every change.
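The four pieces listed above can be pinned down with a small record type for one example. This is a minimal sketch, not a prescribed schema: the `EvalCase` fields, the three `source` values, and the `validate` helper are all illustrative names chosen here.

```python
from dataclasses import dataclass, field
from typing import Optional

# Hypothetical set of allowed sources, mirroring the three sources of examples.
SOURCES = {"failure_mode", "happy_path", "edge_case"}


@dataclass
class EvalCase:
    """One labelled example in the backstop set (illustrative schema)."""
    case_id: str
    source: str                          # one of SOURCES
    input: dict                          # the production input, as captured
    expected: Optional[dict] = None      # exact-match label, if any
    rubric: list[str] = field(default_factory=list)  # rubric label, if any
    failure_mode: Optional[str] = None   # audit failure shape this case covers


def validate(case: EvalCase) -> list[str]:
    """Return a list of problems; an empty list means the case is usable."""
    problems = []
    if case.source not in SOURCES:
        problems.append(f"unknown source: {case.source}")
    if case.expected is None and not case.rubric:
        problems.append("case has no label (neither expected nor rubric)")
    if case.source == "failure_mode" and not case.failure_mode:
        problems.append("failure-mode case must name its failure mode")
    return problems
```

Running `validate` over the whole set before each eval run is a cheap way to catch unlabelled cases early.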

What goes into the set¶

Three sources, in priority order.

1. The top failure modes from the audit¶

Chapter 02 identified 5–10 failure shapes from customer complaints. Each gets 3–10 examples in the eval set. These are the regressions you most want to detect — if you cannot prevent the same complaint from happening again, you have not made progress.

Examples for a hypothetical recommendation system:

Recommends premium products to free-tier users (5 cases)
Returns no recommendations for users with sparse history (5 cases)
Recommends unavailable items (3 cases)
Slow response on bulk requests (3 cases) — latency case, not quality

2. The core happy path¶

The cases the system handles well. 10–30 examples representing the typical input distribution. These ensure that fixing the failure modes does not break the happy path.

3. Edge cases the team is aware of¶

The cases the team has muttered about but not formalised. Maybe a customer who once asked something the system handled clumsily. Maybe a regulatory edge case nobody is sure about. 5–15 examples documenting these.

The total: 50–100 examples once all three sources are covered (a first-week version can be smaller). This is enough to catch obvious regressions and small enough to label by hand in a week.

How to label expected behaviour¶

Three labelling styles, picked per case.

Exact output match¶

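For deterministic outputs (a structured response, an enum classification), the label is the exact expected output. Comparison is byte-equal, or shape-equal when a field is checked against a constraint rather than a literal value (as with the confidence bound in the example below).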

- input: { user_id: "u_42", query: "refund my last order" }
  expected:
    intent: "refund_request"
    confidence: ">0.85"
    action: "open_refund_ticket"

Useful for classifications, intent recognition, structured extraction. Cleanest possible eval.
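One way to score a label like the one above is a field-by-field comparison that treats strings such as ">0.85" as numeric bounds. This is a sketch under assumptions: the bound syntax (`>`, `>=`, `<`, `<=` followed by a number) and the function names are inferred here from the example, not a defined format.

```python
import re

# Assumed bound syntax: an operator followed by a number, e.g. ">0.85".
_BOUND = re.compile(r"^(>=|<=|>|<)\s*(-?\d+(?:\.\d+)?)$")


def field_matches(expected, actual) -> bool:
    """Numeric bound check if expected looks like '>0.85', else plain equality."""
    if isinstance(expected, str):
        m = _BOUND.match(expected)
        if m:
            # A bound can only be satisfied by a real number (bools excluded).
            if isinstance(actual, bool) or not isinstance(actual, (int, float)):
                return False
            op, bound = m.group(1), float(m.group(2))
            return {
                ">": actual > bound,
                ">=": actual >= bound,
                "<": actual < bound,
                "<=": actual <= bound,
            }[op]
    return expected == actual


def exact_match(expected: dict, actual: dict) -> bool:
    """Pass only if every labelled field is present in the output and matches."""
    return all(
        key in actual and field_matches(value, actual[key])
        for key, value in expected.items()
    )
```

Fields present in the output but absent from the label are ignored, so the label only has to name what matters for the case.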

Rubric-graded behaviour¶

For open-ended outputs (a summary, an explanation), the label is a rubric — a set of criteria that a good output must meet. Comparison is by LLM-as-judge or human grading.

- input:
    customer_query: "Why was my package delayed?"
    delivery_status: "Held at customs for documentation"
  rubric:
    - "Acknowledges the customer's frustration without being saccharine"
    - "Mentions customs documentation as the cause"
    - "Says nothing about other possible causes (weather, theft, etc.)"
    - "Suggests one concrete next action (provide documentation, contact agent)"
    - "Is under 150 words"
  must_not:
    - "Promises a specific delivery date"
    - "Blames the customer"

The rubric is the most flexible labelling style and the most expensive. It is also the one that handles open-ended outputs realistically.
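A rubric label can be scored by asking a judge one yes/no question per criterion. The sketch below assumes a `Judge` callable that would wrap an LLM call in practice; the scoring rule (fraction of criteria met, with any `must_not` hit forcing a fail) is one reasonable choice, not the only one. `keyword_judge` is a toy stand-in so the example runs without a model.

```python
from typing import Callable

# Hypothetical judge interface: (criterion, output) -> does the criterion hold?
Judge = Callable[[str, str], bool]


def grade_rubric(output: str, rubric: list[str], must_not: list[str], judge: Judge) -> dict:
    """Score is the fraction of rubric criteria met; a must_not hit fails the case."""
    met = [c for c in rubric if judge(c, output)]
    violations = [c for c in must_not if judge(c, output)]
    score = len(met) / len(rubric) if rubric else 0.0
    return {
        "score": score,
        "violations": violations,
        "passed": score == 1.0 and not violations,
    }


def keyword_judge(criterion: str, output: str) -> bool:
    """Toy stand-in for an LLM judge: checks a few criteria mechanically."""
    text = output.lower()
    if criterion == "Mentions customs documentation as the cause":
        return "customs" in text and "documentation" in text
    if criterion == "Is under 150 words":
        return len(output.split()) < 150
    if criterion == "Promises a specific delivery date":
        return "will arrive on" in text
    return False  # unknown criteria are treated as not met


result = grade_rubric(
    "Your package is held at customs pending documentation.",
    rubric=["Mentions customs documentation as the cause", "Is under 150 words"],
    must_not=["Promises a specific delivery date"],
    judge=keyword_judge,
)
```

Keeping the judge behind a plain callable makes it easy to swap in a real model, a second judge, or a human grader without touching the scoring logic.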

Comparison to expected reference¶

For cases where you have a known-good output (e.g., a curated example or a previous-system output the team agrees is good), the label is the reference. Comparison is by similarity (semantic or rubric-based).

- input: { ... }
  reference_output: "Your package is delayed because customs is reviewing the required documentation. Please send a copy of the proof-of-purchase to support@... and we will expedite the release."
  similarity_threshold: 0.80   # semantic similarity to reference

Useful when "a good answer" is hard to describe but you have an example of one.
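The threshold check against a reference can be sketched as a cosine similarity between two vectors. The bag-of-words "embedding" here is a deliberately crude stand-in so the example runs with the standard library only; a real setup would presumably pass a sentence-embedding function as `embed`. All function names are illustrative.

```python
import math
from collections import Counter


def bag_of_words(text: str) -> Counter:
    """Crude stand-in for an embedding: lowercase token counts."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors (0.0 if either is empty)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def passes_reference(output: str, reference: str, threshold: float = 0.80, embed=bag_of_words) -> bool:
    """Pass if the output is at least `threshold` similar to the reference."""
    return cosine(embed(output), embed(reference)) >= threshold
```

Word-overlap similarity penalises valid paraphrases, which is why the threshold only means something relative to the embedding function actually used.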

Scoring¶

A score per example, aggregated to a score per run. Three aggregation styles:

Aggregation	What it tells you
Pass/fail percentage	"82% of examples passed" — simple, defensible
Mean rubric score (0–1)	"Average score is 0.74" — sensitive to small changes
Per-failure-mode pass rate	"Of the 25 failure-mode cases, 18 pass" — sharpest for incident-driven evals

Most platforms start with pass/fail and graduate to mean rubric score as the eval matures. For modernisation, the per-failure-mode pass rate is often the most actionable — it tells you which complaint patterns the current system addresses.
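The three aggregations in the table can be computed from the same per-example results. A minimal sketch, assuming each result is a dict with `passed`, `score`, and `failure_mode` keys (field names chosen here for illustration) and that the result list is non-empty:

```python
from collections import defaultdict
from statistics import mean


def aggregate(results: list[dict]) -> dict:
    """Compute pass rate, mean score, and per-failure-mode pass counts.

    Each result has 'passed' (bool), 'score' (0-1), and 'failure_mode'
    (str, or None for happy-path and edge cases).
    """
    by_mode = defaultdict(lambda: [0, 0])  # mode -> [passed, total]
    for r in results:
        mode = r.get("failure_mode")
        if mode is not None:
            by_mode[mode][1] += 1
            by_mode[mode][0] += int(r["passed"])
    return {
        "pass_rate": sum(r["passed"] for r in results) / len(results),
        "mean_score": mean(r["score"] for r in results),
        "per_failure_mode": {m: f"{p}/{t}" for m, (p, t) in by_mode.items()},
    }


summary = aggregate([
    {"passed": True, "score": 1.0, "failure_mode": "unavailable_items"},
    {"passed": False, "score": 0.5, "failure_mode": "unavailable_items"},
    {"passed": True, "score": 1.0, "failure_mode": None},
    {"passed": True, "score": 0.5, "failure_mode": "sparse_history"},
])
```

Reporting all three side by side costs nothing and lets the team pick the headline number later.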

What "passes the eval" means as a gate¶

The pre-merge or pre-deploy rule:

No regression on existing failure-mode coverage. The number of failure-mode cases passing must not decrease.
No regression on happy-path coverage. Same rule, applied to happy path.
Threshold for promoting a change. Optional — a change must improve some score by a documented amount before it is worth shipping.

Two patterns work:

Block on regression. A merge that drops any score below the previous run is refused by CI. Useful when you can run the eval in CI quickly.
Track and review. Score is logged on every change; a regression triggers a review but does not block. Useful when the eval is too slow for CI.

Most modernisation programmes start with track-and-review and graduate to block-on-regression as the eval matures.
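Both patterns can share one comparison and differ only in whether a regression blocks. The sketch below assumes each run is summarised as a mapping from bucket name (for example "failure_mode" and "happy_path") to the number of passing cases; `check_gate` and its parameters are hypothetical names.

```python
def check_gate(baseline: dict, current: dict, block_on_regression: bool = False) -> tuple[bool, list[str]]:
    """Compare passing-case counts per bucket against the previous run.

    Returns (allow_merge, regressions). In track-and-review mode the merge is
    always allowed and the regressions list is what triggers a review.
    """
    regressions = [
        f"{bucket}: {baseline[bucket]} -> {current.get(bucket, 0)}"
        for bucket in baseline
        if current.get(bucket, 0) < baseline[bucket]
    ]
    allow = not (regressions and block_on_regression)
    return allow, regressions


baseline = {"failure_mode": 18, "happy_path": 20}
current = {"failure_mode": 17, "happy_path": 20}

tracked = check_gate(baseline, current)                            # track and review
blocked = check_gate(baseline, current, block_on_regression=True)  # block on regression
```

In CI, the blocking variant would typically translate `allow == False` into a non-zero exit code; graduating from one pattern to the other is then a one-flag change.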

How to build the first version in a week¶

The first eval set is fast to build if you keep scope tight. A reasonable week:

Monday. From the audit, list the top 5 failure modes. For each, pull 3–5 real examples from production logs. Total: 15–25 examples.
Tuesday. Add 15–20 happy-path examples from production sample.
Wednesday. Get a domain expert (PM, senior engineer, customer-success rep) to label expected behaviour with you. Rubrics for open-ended cases; exact matches where possible.
Thursday. Build the runner. It can be a 100-line Python script: load examples, call the model, score against the expected behaviour, print a report.
Friday. Run the eval. Capture the baseline score. Commit the set, the labels, and the runner to source control.

By end of week, you have an eval. It is small; it is real; it is enough to catch the obvious regressions on the changes you are about to make.
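Thursday's runner can be very small. The sketch below follows the four steps named above (load, call, score, report) and assumes the examples live in a JSON file containing a list of objects with `id` and `input` keys; `call_model` and `score` are placeholders for whatever the system and labelling style require.

```python
import json
from pathlib import Path
from typing import Callable


def load_cases(path: str) -> list[dict]:
    """Load examples from a JSON file holding a list of case objects."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


def run_eval(cases: list[dict], call_model: Callable[[dict], str],
             score: Callable[[dict, str], float]) -> dict:
    """Run every case through the system and collect per-case scores."""
    rows = []
    for case in cases:
        output = call_model(case["input"])
        rows.append({"id": case["id"], "score": score(case, output)})
    mean_score = sum(r["score"] for r in rows) / len(rows) if rows else 0.0
    return {"n": len(rows), "mean_score": round(mean_score, 4), "rows": rows}


def print_report(report: dict) -> None:
    """Print one line per case, then the mean that becomes the baseline."""
    for row in report["rows"]:
        print(f"{row['id']:<12} {row['score']:.2f}")
    print(f"{'MEAN':<12} {report['mean_score']:.2f}  (n={report['n']})")
```

A typical invocation would be `print_report(run_eval(load_cases("cases.json"), call_model, score))`, where the file name is an assumption. Committing the printed baseline alongside the set gives the first comparison point for free.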

When to expand the set¶

Expand when:

A new failure mode appears in production complaints. Add examples; re-label.
A change is about to touch a part of the system not covered. Add examples for that part.
The team starts skipping the eval because it does not catch the cases they care about. The set has aged; refresh it.

Keep growing the set carefully. A set that grows from 100 to 1000 cases without curation becomes hard to reason about. Module 04_ai_product_evals covers the golden-set lifecycle in detail.

Common mistakes¶

Building the eval after the change. The point of the backstop is to catch regressions on the change. An eval built after the fact has no baseline to compare against; at best it verifies that you got lucky. Build it first.

Using only happy-path examples. Happy-path coverage proves you can do the easy thing. Failure-mode coverage proves you fixed what was broken. Both are required.

Labelling expected outputs by yourself, fast. The labels are the spec; if you wrote them quickly under deadline, they encode your guesses about what is right, not the team's agreement. Slow down. Get a second labeller.

Treating the eval as a "test pass" rather than a measurement. Tests are pass/fail. Evals are distributions of scores. A 0.74 → 0.72 score drop can be meaningful even if both runs technically "pass," so check it against the judge's run-to-run variance rather than waving it through.

Running it manually. If you have to remember to run the eval, you will eventually forget. Wire it into CI or a daily scheduled run.

What the backstop does not solve¶

Long-tail rare failures. A 100-case eval will miss failures that occur once per 10,000 calls. For those, you need production telemetry and feedback loops (04_ai_product_evals covers this).
Subjective quality at scale. Rubric-graded LLM-as-judge scores have variance. Tight rubrics and judge calibration reduce it; they do not eliminate it.
Drift you did not anticipate. The eval covers the failure modes you knew about. It catches regression on those. It does not catch a new failure shape introduced by the change.

These limits do not invalidate the backstop. They define what the next steps (production telemetry, judge calibration, expanded set) need to do.
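Judge variance can at least be measured, which is a first step towards calibration. The sketch below reruns the same eval several times and summarises the spread; the rule of flagging a drop larger than `k` standard deviations is an assumption made here for illustration, not a calibrated statistical test, and `run_once` stands in for a full eval run.

```python
from statistics import mean, stdev
from typing import Callable


def score_spread(run_once: Callable[[], float], repeats: int = 5) -> dict:
    """Run the same eval several times and summarise the spread of scores."""
    scores = [run_once() for _ in range(repeats)]
    return {
        "mean": mean(scores),
        "stdev": stdev(scores) if len(scores) > 1 else 0.0,
        "min": min(scores),
        "max": max(scores),
    }


def exceeds_noise(baseline: dict, candidate_score: float, k: float = 2.0) -> bool:
    """True if the candidate falls more than k standard deviations below the baseline mean."""
    return candidate_score < baseline["mean"] - k * baseline["stdev"]


# Illustrative scores standing in for five repeated runs of an unchanged system.
fake_runs = iter([0.72, 0.74, 0.73, 0.71, 0.75])
baseline = score_spread(lambda: next(fake_runs), repeats=5)
```

With these made-up numbers, a candidate score of 0.72 sits inside the noise band while 0.65 falls clearly outside it.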

Interview Q&A¶

Q1. Why is the eval backstop the highest-leverage thing you build during modernisation? Because it is the precondition for every other change. Without it, you cannot ship changes responsibly — you have no signal whether they help or hurt. With it, the system moves from frozen to eval-backed; every subsequent improvement is verifiable. The next chapter (sequencing fixes) and onward all assume the backstop exists. Building it first is the unlock. Wrong-answer notes: "because tests are important" misses the specific role of the eval as the safety net that distinguishes frozen from eval-backed.

Q2. The team objects: "we can't evaluate AI outputs, they're non-deterministic." How do you respond? Non-determinism is exactly why evals exist. A deterministic system could be verified by a unit test; a non-deterministic one needs a distribution of inputs and an aggregated judgement. Rubric-graded evals handle non-determinism by scoring against criteria, not exact outputs. The runner can use multiple judges for robustness, or set thresholds that allow for variation. The objection collapses on inspection — non-determinism is the problem evals are designed for. Wrong-answer notes: agreeing with the objection produces the frozen state.

Q3. Walk through how you would build the first eval set in a week. Monday: list the top 5 failure modes from the audit; pull 3–5 real examples per mode. Tuesday: add 15–20 happy-path examples. Wednesday: domain expert labels expected behaviour (rubrics for open-ended, exact-match for structured). Thursday: build the runner: load examples, call the model, score, report. Friday: run, capture baseline, commit. Total: 30–45 cases (15–25 failure-mode plus 15–20 happy-path), takes one focused week, gives you the backstop. Wrong-answer notes: designing an exhaustive eval framework before producing the first one is the procrastination path; tight scope is the discipline.

Q4. The eval scores hold for a prompt change you ship; two weeks later a customer reports a regression on a case not in the eval set. What does this tell you, and what do you do? That the eval coverage was incomplete for the affected case — a known limit, not a failure of the backstop. The response is to add the affected case to the eval set with appropriate labelling, verify the regression with the expanded eval, and roll back or patch the prompt change. The eval is now more comprehensive for the next change. Long term, the production telemetry layer (out of scope here, in 04_ai_product_evals) should be feeding new cases into the set automatically. Wrong-answer notes: "the eval is broken" misses that the eval was always known to be partial; the response is to expand it, not abandon it.

What to do differently after reading this¶

Build the eval before any change. The change without the eval is gambling.
Source examples from real production logs, not from imagination.
Label with a domain expert, not alone.
Wire the eval into CI or a daily scheduled run. Manual runs decay.
Expand the set on every customer complaint and every change that touches new ground.

Bridge. With the eval backstop in place, you can change things. The next chapter is the question that follows: of all the things you could change, which do you take on first? Stop-the-bleeding versus do-it-right, sequenced to maximise leverage and minimise risk. → 04-stop-bleeding-vs-do-right.md