17. Regression eval as a lock — turning a fixed bug into a permanent guardrail¶

~14 min read. A confession without a lock is a bug that comes back. Eval is not quality theatre; it is the mechanism that prevents your fix from un-fixing itself in the next release.

Built on the ELI5 in 00-eli5.md. The lock — a regression eval that fails the build if the bug returns — is what turns one detective win into permanent safety. Every confession must end with a lock; otherwise the same crime ships again next Tuesday. Joining offline eval scores with live traces is how we make those locks real.
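A minimal sketch of what a lock can look like in code. Everything here is an assumption made for illustration: the `LockedCase` schema, the `run_case` callable that sends one input to the system under test, the case id, and the substring check (a real lock might use a rubric or a judge model instead).

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class LockedCase:
    """One regression case pinned to the bug it guards against (hypothetical schema)."""
    case_id: str
    prompt_input: str
    must_contain: str  # simplest possible pass condition, chosen for the sketch


def failed_locks(cases: list[LockedCase], run_case: Callable[[str], str]) -> list[str]:
    """Return the ids of locked cases whose output no longer passes."""
    return [c.case_id for c in cases if c.must_contain not in run_case(c.prompt_input)]


def gate(cases: list[LockedCase], run_case: Callable[[str], str]) -> int:
    """Exit code for CI: 0 when every lock holds, 1 when any fixed bug has returned."""
    failures = failed_locks(cases, run_case)
    for case_id in failures:
        print(f"LOCK BROKEN: {case_id}")
    return 1 if failures else 0


# Hypothetical case and two stand-in systems: one with the fix, one regressed.
CASES = [LockedCase("refund-enterprise-001", "Refund for an annual enterprise plan?", "pro-rated")]


def fixed_bot(text: str) -> str:
    return "Enterprise refunds are pro-rated."


def regressed_bot(text: str) -> str:
    return "Refunds are not available."
```

In a CI job, passing the result of `gate(...)` to `sys.exit` would fail the build whenever a locked case breaks.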

The usual split is unhealthy¶

ML teams run offline evals, ops teams watch production dashboards, and support teams read complaints — three streams that usually live apart. Then a rollout happens, evals say the model improved, production complaints rise, and everyone argues about whose number is "the real one." That split, not any single number, is the actual problem the lock has to solve.

offline world                         live world
┌─────────────────────┐              ┌────────────────────────┐
│ benchmark score     │              │ traces, complaints     │
│ prompt candidate    │              │ latency, cost, errors  │
└──────────┬──────────┘              └──────────┬─────────────┘
           │                                    │
           └──────────── need shared keys ──────┘

If the two worlds cannot join, learning stays partial. Every offline win that cannot be matched to a trace lives in a different universe from the live failure it was supposed to prevent.

What should connect evals to traces¶

At minimum, version everything. Model name. Prompt version. Retriever version. Guardrail version. Dataset version. Experiment bucket. These evidence tags must appear both in eval records and production traces.

Better still, capture shared task labels. Intent type. Feature name. Language. Customer tier. Expected output format. These make comparison fair. A support-summary task should not be mixed with SQL-agent tasks.

And when possible, link sampled production traces back into eval datasets. Complaint-linked traces are especially valuable here. The complaint slip can seed tomorrow's eval set. See the loop? Production pain improves offline testing.
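The tags and labels above can travel as one small record attached to both sides. This is a sketch under assumptions: the class names, field names, and example values are illustrative, not a schema from any particular tool.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class EvidenceTags:
    # Version tags
    model: str
    prompt_version: str
    retriever_version: str
    guardrail_version: str
    dataset_version: str
    experiment_bucket: str
    # Shared task labels
    intent_type: str
    feature: str
    language: str
    customer_tier: str
    output_format: str

    def join_key(self) -> tuple[str, str, str]:
        """The subset of tags used to line up eval records with traces."""
        return (self.model, self.prompt_version, self.feature)


@dataclass(frozen=True)
class EvalRecord:
    eval_id: str
    tags: EvidenceTags
    score: float
    source_trace_id: Optional[str] = None  # set when the case was seeded from a production trace


@dataclass(frozen=True)
class TraceSummary:
    trace_id: str
    tags: EvidenceTags
    outcome: str


# Example values are made up for the sketch.
tags = EvidenceTags(
    model="model-a", prompt_version="refund-v4", retriever_version="r1",
    guardrail_version="g1", dataset_version="d1", experiment_bucket="control",
    intent_type="refund", feature="refund-bot", language="en",
    customer_tier="enterprise", output_format="json",
)
trace = TraceSummary(trace_id="t-1", tags=tags, outcome="complaint")
eval_rec = EvalRecord(eval_id="e-1", tags=tags, score=0.0, source_trace_id=trace.trace_id)
```

Because both records carry the same `EvidenceTags`, the join key is identical on each side, and `source_trace_id` keeps the link from the eval case back to the complaint that seeded it.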

Worked example: rollout looked good offline, bad online¶

Suppose prompt refund-v4 beats refund-v3 on an internal benchmark. Accuracy rises from 84 percent to 89 percent. Leadership approves rollout. After deployment, complaint rate rises for enterprise accounts. What happened?

Open the case board. Production traces show these tags:

prompt_version=refund-v4
plan_tier=enterprise
context_tokens much higher than before
tool_parse_failure_rate also higher

Now inspect the eval data. The offline set had mostly short consumer refund cases. Almost no enterprise contracts. Almost no long tool outputs. So the offline win was real but narrow. It missed the dominant production pain path. The lesson: an offline score is only as trustworthy as its dataset's coverage of the production slices that matter.

Tracing made this visible. Without versioned witness notes, support complaints and eval charts would never meet. Now the fix is obvious. Expand the eval set using enterprise complaint-linked traces.
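The coverage gap in this example can be checked mechanically once both sides carry the same tier tag. A rough sketch, assuming records are plain dicts with a `plan_tier` key. The counts and the `min_ratio` threshold are invented for illustration and are not figures from the rollout.

```python
from collections import Counter


def slice_share(records: list[dict], key: str) -> dict[str, float]:
    """Fraction of records falling into each value of `key`."""
    counts = Counter(r[key] for r in records)
    total = sum(counts.values())
    return {value: n / total for value, n in counts.items()}


def coverage_gaps(eval_records, prod_traces, key, min_ratio=0.5):
    """Slices whose share of the eval set is below min_ratio times their production share."""
    eval_share = slice_share(eval_records, key)
    prod_share = slice_share(prod_traces, key)
    return sorted(
        s for s, share in prod_share.items()
        if eval_share.get(s, 0.0) < min_ratio * share
    )


# Illustrative data: the eval set is 5% enterprise, production is 40% enterprise.
eval_set = [{"plan_tier": "consumer"}] * 19 + [{"plan_tier": "enterprise"}]
production = [{"plan_tier": "consumer"}] * 6 + [{"plan_tier": "enterprise"}] * 4

gaps = coverage_gaps(eval_set, production, "plan_tier")
```

Here `gaps` flags the enterprise slice, which is the signal to pull enterprise complaint-linked traces into the eval set before trusting the next offline win.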

Production traces create better eval data¶

What is the positive story? Observability is not only for firefighting. It also helps build the test suite. Sample traces with high complaint rates. Sample traces with empty retrieval. Sample traces with high cost and low satisfaction. Sample traces from new features after rollout. These become eval candidates.

That means the case file is not just a debugging artifact. It is a dataset seed. The case board is not just an ops screen — it is a map of where evaluation coverage is thin.
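The sampling rules above can be written as a small filter over trace summaries. This is a sketch, not a prescribed pipeline: the dict keys, the cost and satisfaction thresholds, and the per-trace `complaint` flag (a simplification of "high complaint rate") are all assumptions.

```python
def eval_candidates(traces, cost_threshold=0.05, low_satisfaction=2):
    """Pick traces that match any sampling rule and record why each was picked."""
    picked = []
    for t in traces:
        reasons = []
        if t.get("complaint"):
            reasons.append("complaint")
        if t.get("retrieved_docs", 1) == 0:
            reasons.append("empty_retrieval")
        if t.get("cost_usd", 0.0) > cost_threshold and t.get("satisfaction", 5) <= low_satisfaction:
            reasons.append("high_cost_low_satisfaction")
        if t.get("new_feature"):
            reasons.append("new_feature")
        if reasons:
            picked.append({
                "source_trace_id": t["trace_id"],  # keeps the link back to the case file
                "input": t["input"],
                "reasons": reasons,
            })
    return picked


# Made-up trace summaries for illustration.
traces = [
    {"trace_id": "t1", "input": "refund an annual plan", "complaint": True, "retrieved_docs": 3},
    {"trace_id": "t2", "input": "contract clause 7", "retrieved_docs": 0},
    {"trace_id": "t3", "input": "reset password", "retrieved_docs": 2, "cost_usd": 0.01, "satisfaction": 5},
]
candidates = eval_candidates(traces)
```

Keeping `reasons` on each candidate makes it possible to see later which kind of production pain a given eval case was meant to cover.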

Evals should also emit trace-like metadata¶

Offline runs need observability too. Each eval example should record prompt version, model, judge version, dataset slice, and output artifacts. Otherwise you cannot compare it cleanly with production. Think of each eval run as a synthetic case file. It also needs evidence tags.

Then comparisons become easy. Production says json_parse_fail rose on agent-v6. Eval runs filtered to agent-v6 show failure spikes on long tool outputs. Now investigation is fast. The shared metadata made it possible.
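A sketch of that filter, assuming each eval run is a dict carrying `prompt_version`, a `tool_output_len` bucket, and a `passed` flag. The field names and the sample runs are hypothetical.

```python
from collections import defaultdict


def failure_rate_by_slice(eval_runs, prompt_version, slice_key):
    """Failure rate per slice, restricted to one prompt version."""
    totals = defaultdict(int)
    fails = defaultdict(int)
    for run in eval_runs:
        if run["prompt_version"] != prompt_version:
            continue
        s = run[slice_key]
        totals[s] += 1
        fails[s] += 0 if run["passed"] else 1
    return {s: fails[s] / totals[s] for s in totals}


# Made-up eval runs: agent-v6 fails only on long tool outputs.
runs = [
    {"prompt_version": "agent-v6", "tool_output_len": "short", "passed": True},
    {"prompt_version": "agent-v6", "tool_output_len": "short", "passed": True},
    {"prompt_version": "agent-v6", "tool_output_len": "long", "passed": False},
    {"prompt_version": "agent-v6", "tool_output_len": "long", "passed": True},
    {"prompt_version": "agent-v5", "tool_output_len": "long", "passed": True},
]
rates = failure_rate_by_slice(runs, "agent-v6", "tool_output_len")
```

The same function works on production trace summaries if they carry the same keys, which is the point of sharing metadata.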

A practical join pattern¶

Use a shared registry or warehouse table. One row per eval record. One row per production trace summary. Shared keys for model, prompt, feature, and date window. Optional links to raw trace URLs. That is enough for many teams.

production trace summary
┌──────────────────────────────────────────┐
│ trace_id │ feature │ prompt_v │ outcome  │
└──────────────────────────────────────────┘
                ▲
                │ join on shared keys
                ▼
eval record
┌──────────────────────────────────────────┐
│ eval_id  │ feature │ prompt_v │ score    │
└──────────────────────────────────────────┘

You do not need mystical AI here. You need disciplined metadata. That is the whole game — shared keys turn offline and online into one continuous loop.
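The join in the diagram can be tried with nothing more than an in-memory SQLite database. This is a sketch: the table and column names follow the diagram, only `feature` and `prompt_v` are used as join keys (model and date window are left out for brevity), and the rows are made-up illustration data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE eval_record   (eval_id  TEXT, feature TEXT, prompt_v TEXT, score   REAL);
    CREATE TABLE trace_summary (trace_id TEXT, feature TEXT, prompt_v TEXT, outcome TEXT);
    """
)
conn.executemany("INSERT INTO eval_record VALUES (?, ?, ?, ?)", [
    ("e1", "refund", "refund-v3", 0.84),
    ("e2", "refund", "refund-v4", 0.89),
])
conn.executemany("INSERT INTO trace_summary VALUES (?, ?, ?, ?)", [
    ("t1", "refund", "refund-v3", "good"),
    ("t2", "refund", "refund-v3", "good"),
    ("t3", "refund", "refund-v3", "good"),
    ("t4", "refund", "refund-v3", "bad"),
    ("t5", "refund", "refund-v4", "bad"),
    ("t6", "refund", "refund-v4", "bad"),
    ("t7", "refund", "refund-v4", "good"),
    ("t8", "refund", "refund-v4", "good"),
])

# Aggregate each side first, then join, so rows are not multiplied by the join.
JOIN_SQL = """
SELECT e.feature, e.prompt_v, e.avg_score, t.bad_rate
FROM (SELECT feature, prompt_v, AVG(score) AS avg_score
      FROM eval_record GROUP BY feature, prompt_v) AS e
JOIN (SELECT feature, prompt_v, AVG(outcome = 'bad') AS bad_rate
      FROM trace_summary GROUP BY feature, prompt_v) AS t
  ON e.feature = t.feature AND e.prompt_v = t.prompt_v
ORDER BY e.prompt_v
"""
rows = conn.execute(JOIN_SQL).fetchall()
```

Each row of `rows` now puts the offline score and the live bad-outcome rate for the same prompt version side by side, which is the comparison the two separate dashboards could not make.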

Regression-eval gating across shipped LLM teams¶

OpenAI evals team — compares benchmark improvements with production trace outcomes after prompt or model changes; the role is closing the loop from offline win to live verification.
Notion AI — seeds eval sets from thumbs-down traces on workspace summary tasks; the role is making the complaint slip the source of next quarter's regression tests.
Intercom Fin — joins complaint-linked production traces with offline rubric scores for support-answer quality; the role is matching offline and online quality in one dashboard.
Cursor coding-agent research — compares offline patch-success evals against live traces from failed repair sessions; the role is exposing offline/online divergence per failure type.
Glean search quality — uses retrieval-failure traces to expand eval coverage for enterprise permissions-heavy queries; the role is making the case file drive the next eval set.
Promptfoo CI assertions — locked prompt regression tests in GitHub Actions; the role is making the lock a CI gate rather than a manual rerun.
Braintrust regression suites — historical performance preserved across releases; the role is keeping every fixed bug as a permanent test.
LangSmith regression datasets — eval sets tied to trace IDs from production; the role is making "every fixed bug becomes a permanent test" the default workflow.
Anthropic's eval-pinning patterns — locked baselines per release; the role is the canonical anti-regression discipline at frontier-model scale.
OpenAI Evals with snapshots — snapshotted scores per model version; the role is making the lock auditable across snapshots.
Vellum CI tests — prompt + eval as a single deployable unit; the role is treating prompts as code requiring test coverage.
Pydantic AI evals with locked baselines — typed agents with score regressions blocking merges; the role is making the lock type-checkable.
BAML test-locked outputs — typed schemas with frozen expected outputs; the role is shifting regression detection to compile time.
GitHub Actions LLM evals — eval jobs gating PRs; the role is making the regression check a code-review artifact, not a separate ritual.
Helicone-stored regression seeds — production traces tagged for eval seeding; the role is making the case file automatically dataset-eligible.
Comet Opik regression dashboards — eval-over-time panels with regression deltas; the role is exposing the lock as a visible trend, not a hidden CI check.
Arize Phoenix eval inspector: per-example trace + eval co-located; the role is making regression debugging single-screen.
Inkeep / Mendable EDD workflows — every customer complaint becomes a regression test; the role is encoding the complaint slip → lock pipeline as product policy.
Langfuse evaluations: open-source eval pipelines tied to traces; the role is enabling on-prem regression discipline without vendor lock-in.
Anthropic console eval workbench — replay regression sets against new prompts; the role is first-party regression UX without third-party tooling.
MLflow LLM Evaluate — versioned eval runs alongside ML experiments; the role is fitting LLM regression discipline into existing ML-experiment tracking.
Pytest plugins for LLM (e.g., pytest-eval) — eval-as-unit-test patterns; the role is making the lock indistinguishable from a unit test.

Recall — shared metadata, the lock, and the join¶

Why is it risky to keep offline evals and production observability separate?
Which metadata fields must usually exist in both eval and production systems?
In the worked example, why did the rollout look good offline but fail online?
How can production traces improve future eval datasets?

Interview Q&A¶

Q: Why should prompt and model version tags exist in both eval records and live traces? A: Shared version keys let teams compare offline claims against production behavior without hand-wavy guesswork. Common wrong answer to avoid: "Because version tags make dashboards look more organized."

Q: Why can a prompt win offline and still hurt users online? A: Offline datasets may underrepresent the real production slices, tool outputs, and context lengths that the prompt encounters after rollout. Common wrong answer to avoid: "Because offline evals are always fake and useless."

Q: Why are complaint-linked traces especially valuable for evaluation design? A: They capture real user pain paths, which often reveal exactly the slices missing from benchmark datasets. Common wrong answer to avoid: "Because complaints are automatically labeled ground truth."

Q: Why should offline eval runs themselves carry observability metadata? A: Without comparable metadata, teams cannot line up offline outcomes with production traces to diagnose regressions. Common wrong answer to avoid: "Observability only matters in live systems, not offline experiments."

Apply now (10 min)¶

Step 1 — model the exercise. Here is the shared-metadata table I would build between an eval record and a production trace, on the refund-bot project:

Field	Eval record	Production trace	Why both?
`model_version`	yes	yes	A/B replay needs identical model
`prompt_version`	yes	yes	regression bisection by prompt
`retriever_version`	yes	yes	retrieval geometry must match
`intent_type`	yes	yes	per-intent regression slices
`customer_tier`	(synthetic flag)	yes	high-stakes slice gating
`dataset_id`	yes	(link only)	which lock the trace belongs to
`judge_version`	yes	(n/a)	judge calibration tracking

Without these shared keys, "offline win, online loss" is unsolvable: the team cannot tell which prompt version, intent, or customer tier saw the offline lift.
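One way to enforce the table is to check the keys that are marked "yes" on both sides at write time, for eval records and trace summaries alike. A small sketch, assuming records are plain dicts. The function name and the choice to treat empty strings as missing are illustrative.

```python
# Fields from the table above that both sides are expected to carry.
SHARED_KEYS = (
    "model_version",
    "prompt_version",
    "retriever_version",
    "intent_type",
    "customer_tier",
)


def missing_shared_keys(record: dict) -> list[str]:
    """Return the shared keys that are absent or empty on this record."""
    return [k for k in SHARED_KEYS if record.get(k) in (None, "")]


# Made-up records: the trace is complete, the eval record lacks a tier flag.
trace_row = {
    "model_version": "m1", "prompt_version": "refund-v4", "retriever_version": "r1",
    "intent_type": "refund", "customer_tier": "enterprise",
}
eval_row = {
    "model_version": "m1", "prompt_version": "refund-v4", "retriever_version": "r1",
    "intent_type": "refund", "customer_tier": "",
}
```

Rejecting or flagging any record for which `missing_shared_keys` is non-empty keeps the join from silently dropping rows later.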

Step 2 — your turn. Write five metadata fields that your eval table and production trace table should share. Then write one complaint slice that is missing from your current eval coverage — the next eval test you would add this week.

Step 3 — reproduce from memory. Draw the join between eval records and production trace summaries. Label the shared evidence tags. Add one sentence on how the complaint slip can seed better eval data.

What you should remember¶

This chapter explained why offline evals and production observability cannot live in separate worlds. The lock — the regression test that prevents a fixed bug from returning — only works if the same metadata exists on both sides. Without shared keys, an offline win is just a number; with shared keys, it is a hypothesis that can be tested against live traces. The discipline is mechanical, not glamorous: every eval record and every production trace carries the same model version, prompt version, retriever version, intent, customer tier, and dataset id.

You also learned that the case file is a dataset seed. Production traces tagged with outcome=bad are the cheapest source of next quarter's eval cases. The case board doubles as a map of where eval coverage is thin — every complaint slice that has no eval test is a known blind spot.

Carry this diagnostic forward: when an offline win does not materialise online, look at the join. If you cannot join the eval record to the live trace on shared keys, the win was never measured in the same universe as the failure.

Remember:

Offline and online must share keys. Without them, learning stays partial.
Every fixed bug becomes a permanent lock — a regression test pinned to the dataset and to the trace it came from.
The case file is a dataset seed. Tagged failure traces feed next quarter's eval.
Evals need observability too. An eval run without metadata is as opaque as a production request without tracing.
The simplest join — warehouse table with shared keys — is enough for most teams. Mystical tooling is not required.

Bridge. The bug is locked. Good. But the incident itself needs to be written up — what broke, when, why, and what the lock now prevents. Agent postmortems are not the SRE template you know. Non-determinism, model rollouts, and emergent behavior change the structure entirely. → 18-postmortem-for-agents.md