11. Eval-set incident response

The set is built, refreshed, governed. The last operational concern is what happens when the set itself is wrong — false positives blocking ships, missed cases producing regressions in production, mis-labels causing wrong decisions. The set has its own incident class.


A platform engineer at a Chennai SaaS company is on call when the regression gate blocks a routine prompt change. The eval reports a drop of 5 cases in a specific stratum. The engineer investigates. The "regression" is in five cases that all share a specific phrasing the team has decided is better with the new prompt — the old labels reflected an outdated rubric. The block is a false positive caused by stale labels. The fix is to relabel those cases with the new rubric and re-run; the gate then passes. Total time: 90 minutes of investigation, 30 minutes of re-labelling, 10 minutes of re-running. Without the discipline to investigate (and the willingness to question the eval), the team would have either shipped past the gate (wrong) or rolled back the change (also wrong).

This chapter is the incident discipline. False positives, missed cases, mis-labels, judge drift — each has a response shape.


The four incident classes

| Class | What happens | What it tells you |
| --- | --- | --- |
| False positive | The eval flags a regression that is not real | Labels stale; rubric out of date; judge bias; case ambiguous |
| Missed case | Production shows a failure mode not in the set | Coverage gap; refresh trigger |
| Mis-label | A case's label is wrong | Either always was, or has aged with policy changes |
| Judge drift | The judge's behaviour shifts producing systematic score changes | Judge prompt drift; judge model change; calibration loss |

Each is investigated; each has a fix; each has a longer-term improvement.


False positive — investigation and response

When the regression gate blocks a change but the team suspects the eval is wrong:

Investigate the affected cases. Manually inspect the cases that regressed. Does the new output look better than, worse than, or the same as the old one? If better, the label is the issue (the old label encoded behaviour the team no longer wants). If the same, the judge is noisy. If worse, the gate is correct and the change should not ship.

Resolve. If labels are stale, update them via the labelling discipline (chapter 03); re-run. If the judge is noisy, run multi-judge (chapter 07) for the affected cases. If the gate is correct, the change is the problem.

Improve. Update the rubric to prevent the next false positive. Calibrate the judge if drift is involved. Record the false-positive class in the set's metadata so future investigations are faster.

The cost of investigating false positives is real. The cost of ignoring them — shipping past a gate that turns out to be right, or rolling back a good change — is higher.
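The investigate-and-resolve decision above can be sketched as a small triage helper. This is a minimal sketch, assuming a reviewer has already recorded a side-by-side verdict for each regressed case. The names `Verdict`, `ACTIONS`, and `triage_gate_block`, and the rule that a single worse case keeps the gate closed, are illustrative assumptions rather than part of any tooling described in this chapter.

```python
from collections import Counter
from enum import Enum


class Verdict(Enum):
    """A reviewer's side-by-side judgement of the new output against the old."""
    BETTER = "better"
    SAME = "same"
    WORSE = "worse"


# Hypothetical mapping from a verdict to the response described in the text.
ACTIONS = {
    Verdict.BETTER: "relabel: label is stale, update it and re-run",
    Verdict.SAME: "multi-judge: judge is noisy on this case",
    Verdict.WORSE: "block: gate is correct, fix the change",
}


def triage_gate_block(verdicts: dict[str, Verdict]) -> dict:
    """Summarise reviewer verdicts for the cases a gate flagged.

    Assumption: one WORSE case is enough to keep the gate closed.
    """
    counts = Counter(verdicts.values())
    return {
        "gate_stays_closed": counts[Verdict.WORSE] > 0,
        "counts": {v.value: counts[v] for v in Verdict},
        "per_case": {case_id: ACTIONS[v] for case_id, v in verdicts.items()},
    }
```

In the opening scenario, all five cases would carry `Verdict.BETTER`, so the helper would report that the gate need not stay closed once the labels are updated and the set is re-run.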


Missed case — investigation and response

When production shows a failure mode the set does not cover:

Identify the case. From production logs or customer complaints, capture the input and the bad output. Verify it is reproducible.

Triage the gap. Is this a new failure mode entirely, or a variation on a covered stratum? Is it high or low frequency in production?

Add to the set. Source 3–10 cases representing the failure mode; label per chapter 03; tag with the stratum. The case enters the set immediately (out-of-cycle refresh trigger from chapter 05).

Verify the fix. If the system change that addresses the production failure has been shipped, re-run the new cases; verify they now pass. If the system has not been changed yet, the new cases serve as the regression-prevention for the upcoming fix.

The missed case is the most common incident on operating sets. The discipline turns each one into improved coverage.
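The "add to the set" step might look like the following. This is a sketch under stated assumptions: the `EvalCase` fields, the `add_missed_cases` helper, the `"out-of-cycle"` tag, and the incident ID format are all hypothetical. Only the 3 to 10 case range and the stratum tagging come from the text.

```python
from dataclasses import dataclass, field


@dataclass
class EvalCase:
    """Hypothetical eval-set record; the field names are assumptions."""
    case_id: str
    input_text: str
    expected: str
    stratum: str
    source: str = "curated"
    tags: list[str] = field(default_factory=list)


def add_missed_cases(eval_set, new_cases, incident_id, min_cases=3, max_cases=10):
    """Append production-sourced cases to the set as an out-of-cycle refresh."""
    if not min_cases <= len(new_cases) <= max_cases:
        raise ValueError(
            f"expected {min_cases}-{max_cases} cases, got {len(new_cases)}"
        )
    existing_ids = {case.case_id for case in eval_set}
    new_ids = [case.case_id for case in new_cases]
    if len(set(new_ids)) != len(new_ids) or existing_ids & set(new_ids):
        raise ValueError("duplicate case_id")
    # Validation is done; now mark provenance and append.
    for case in new_cases:
        case.source = "production"
        case.tags.extend(["out-of-cycle", incident_id])
        eval_set.append(case)
    return eval_set
```

Tagging each case with the incident ID keeps the link between the production failure and the cases that guard against its return.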


Mis-label — investigation and response

When a label is discovered to be wrong:

Determine the cause. Was the label always wrong (labelling error) or has it aged (policy changed)? Both happen; the response is similar but the lesson differs.

Update. Re-label per the labelling discipline. Document the change in the set's changelog. Re-baseline the affected stratum.

Investigate broader impact. Are there other cases with similar mis-labels? Sometimes one mis-label reveals a class of mis-labels (a rubric was applied incorrectly across a batch). A spot-check of related cases is part of the response.

Improve. If the cause was rubric ambiguity, sharpen the rubric. If the label aged, schedule a label review for adjacent cases. If it was a labelling error, run a calibration session.
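The broader-impact spot-check can be sketched as a selection over set metadata. This assumes each case records a `batch_id` and a `rubric_version`. Both fields, and the helper name `spot_check_candidates`, are hypothetical, since the chapter does not specify the set's metadata schema.

```python
import random


def spot_check_candidates(cases, mislabelled_id, sample_size=5, seed=0):
    """Pick cases likely to share the mis-labelled case's problem.

    A case is "related" if it was labelled in the same batch or under
    the same rubric version as the mis-labelled one (an assumption).
    """
    by_id = {case["case_id"]: case for case in cases}
    bad = by_id[mislabelled_id]
    related = [
        case
        for case in cases
        if case["case_id"] != mislabelled_id
        and (
            case["batch_id"] == bad["batch_id"]
            or case["rubric_version"] == bad["rubric_version"]
        )
    ]
    if len(related) <= sample_size:
        return related
    # A fixed seed keeps the sample reproducible for the incident record.
    return random.Random(seed).sample(related, sample_size)
```

If the sampled cases turn up further mis-labels, that is the signal to review the whole batch rather than continue sampling.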


Judge drift — investigation and response

When the judge's behaviour shifts:

Detect. Periodic calibration of the judge against human labels reveals drift. The signal: per-case judge scores on a fixed eval-of-the-judge set move over time.

Identify the cause. Did the judge model change (provider updated it)? Did the judge prompt change? Did the rubric change in a way that the judge interpreted differently than humans?

Recalibrate. Adjust the judge prompt, switch judge models, or accept the new baseline if the new behaviour is more aligned with humans. Update the eval-of-the-judge set with the recalibration.

Communicate. Judge drift can produce systematic score changes across the platform. Inform stakeholders that the score baseline has shifted and re-baseline.

Judge drift is the subtlest incident class because the symptoms look like system regression (scores moving) when the cause is in the measurement.
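A drift check against the eval-of-the-judge set can be sketched as follows. This is a minimal illustration, assuming pass/fail labels stored as dictionaries keyed by case ID. The function names and the 0.05 tolerance are assumptions to be tuned per platform, not values from this chapter.

```python
def agreement_rate(judge_labels, human_labels):
    """Fraction of cases where the judge agrees with the human label."""
    if not human_labels:
        raise ValueError("eval-of-the-judge set is empty")
    if judge_labels.keys() != human_labels.keys():
        raise ValueError("judge and human labels cover different cases")
    matches = sum(judge_labels[c] == human_labels[c] for c in human_labels)
    return matches / len(human_labels)


def flipped_cases(previous, current):
    """Case IDs whose judge verdict changed between two runs."""
    return sorted(
        c for c in previous if c in current and previous[c] != current[c]
    )


def drift_report(human, baseline_judge, current_judge, tolerance=0.05):
    """Compare the judge's current agreement with humans against a baseline."""
    baseline = agreement_rate(baseline_judge, human)
    current = agreement_rate(current_judge, human)
    return {
        "baseline_agreement": baseline,
        "current_agreement": current,
        "drifted": abs(current - baseline) > tolerance,
        "flipped": flipped_cases(baseline_judge, current_judge),
    }
```

The `flipped` list is the per-case signal the text describes: it shows which verdicts moved, which is usually more informative for diagnosing the cause than the aggregate rate alone.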


The investigation workflow

Common across classes:

1. Detect — alarm fires, complaint arrives, calibration reveals
2. Triage — which incident class, how broad
3. Investigate — manual inspection of affected cases
4. Resolve — fix the cases, the labels, the rubric, the judge
5. Verify — re-run; expected outcome
6. Improve — what discipline prevents the next instance
7. Document — changelog entry, set metadata, lessons file

Each step is bounded; the discipline is the workflow, not heroic individual investigation.
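One way to make the workflow a process rather than a habit is to record each incident as a structure that enforces the step order. The sketch below is illustrative only: `EvalIncident`, its fields, and the rule that steps cannot be skipped are assumptions, and the step names simply mirror the seven steps listed above.

```python
from dataclasses import dataclass, field

# Step names mirror the workflow above.
STEPS = (
    "detect", "triage", "investigate", "resolve",
    "verify", "improve", "document",
)


@dataclass
class EvalIncident:
    """Hypothetical incident record that enforces the step order."""
    incident_id: str
    incident_class: str
    notes: dict[str, str] = field(default_factory=dict)

    def next_step(self):
        """Return the first step without a note, or None when all are done."""
        for step in STEPS:
            if step not in self.notes:
                return step
        return None

    def complete(self, step, note):
        """Record a note for the next step; refuse out-of-order steps."""
        expected = self.next_step()
        if step != expected:
            raise ValueError(f"expected step {expected!r}, got {step!r}")
        self.notes[step] = note

    @property
    def closed(self):
        return self.next_step() is None
```

Because the record only closes after the document step, an incident that was fixed but never written up stays visibly open.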


When investigations reveal a deeper issue

Sometimes an eval incident points to something larger:

  • A false-positive pattern suggests the rubric is systematically misaligned with team intent — broader labelling review.
  • A missed-case pattern suggests the refresh cadence is too slow — adjust the cadence.
  • A mis-label pattern suggests a labelling process gap — improve calibration.
  • A judge-drift pattern suggests the platform should ensemble judges or do more frequent calibration.

The eval incident is the entry point; the systemic improvement is the larger work.
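Spotting these patterns depends on counting incidents by class over some window. A minimal sketch, assuming incidents are logged with one of four class strings. The threshold of three and the helper name `systemic_follow_ups` are illustrative assumptions.

```python
from collections import Counter

# Follow-ups paraphrase the list above; the class keys are hypothetical.
SYSTEMIC_FOLLOW_UP = {
    "false_positive": "broader labelling review",
    "missed_case": "adjust the refresh cadence",
    "mis_label": "improve labelling calibration",
    "judge_drift": "ensemble judges or calibrate more often",
}


def systemic_follow_ups(incident_classes, threshold=3):
    """Return the follow-up for every class seen at least `threshold` times."""
    counts = Counter(incident_classes)
    return {
        cls: SYSTEMIC_FOLLOW_UP[cls]
        for cls, n in counts.items()
        if n >= threshold and cls in SYSTEMIC_FOLLOW_UP
    }
```

For example, a quarter's log containing three missed cases and one false positive would surface only the refresh-cadence follow-up.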


False positive vs real regression — the discipline

The hardest skill is distinguishing false positives from real regressions. The discipline:

  • Look at the cases. Always. Never accept "score dropped" without manual inspection of the affected cases.
  • Compare outputs side by side. Old prompt vs new prompt; for each affected case, which output is better. Often the answer is clear; sometimes it requires a domain expert.
  • Default to investigating. A drop is a signal worth 30 minutes of investigation. A handful of drops is worth a full session.
  • Be willing to update the labels. If the new behaviour is better, the labels were wrong; update them; do not block the change.
  • Be willing to block. If the new behaviour is worse, the change is wrong; do not ship past the gate.

The team that does this regularly builds a calibrated sense of when the eval is right and when it deserves questioning. The team that does not alternates between trusting the eval blindly (which blocks good changes) and ignoring it (which ships bad changes).


Common mistakes

Ignoring eval alarms. "The eval is noisy" becomes the excuse to ship regressions.

Trusting the eval blindly. Stale labels block good changes; the team rolls back unnecessarily.

Never updating labels. Labels age; the eval slowly diverges from current intent.

Treating each incident as one-off. Patterns are missed; the same incident class recurs.

No documentation of incidents. Lessons are lost; future investigations re-do the work.


Interview Q&A

Q1. The regression gate blocks a change for 5 case regressions in one stratum. Walk through your response. Open the cases. Compare the old prompt's output to the new prompt's output for each. If the new outputs look better (a subjective rubric judgement, best made by a domain expert), the labels are stale; update them via the labelling discipline; re-run; the gate passes. If the new outputs look worse, the change is correctly being blocked; investigate why and fix it. If the outputs look the same, the judge is noisy on those cases; run multi-judge; if the cases are still flagged, the labels need sharpening. The discipline is to look at the cases, not just the score. Wrong-answer notes: "ship past the gate" or "roll back without investigation" both miss the case-level work.

Q2. Production shows a failure mode not in the set. What do you do? Capture 3–10 cases representing the mode from production samples. Label per chapter 03. Tag with the relevant stratum (or create a new stratum if the failure mode is genuinely new). Add to the set immediately as an out-of-cycle refresh trigger. Re-run; the cases serve as regression-prevention going forward. If the system change addressing the failure is in flight, the new cases verify the fix. The set's coverage grows from the production signal. Wrong-answer notes: "wait for the quarterly refresh" delays the regression-prevention.

Q3. How do you detect judge drift? A small "eval of the judge" set — cases with stable human labels — run periodically. If the judge's pass-rate on this set drifts over time, the judge's behaviour is changing. Causes: judge model updated by provider, judge prompt changed inadvertently, rubric refined in a way that interacts with the judge's interpretation. The response is to recalibrate: adjust the judge prompt, switch model, accept the new baseline if it aligns better with humans. Judge drift produces systematic score changes that look like system regression; the eval-of-the-judge is the distinguishing tool. Wrong-answer notes: without an eval-of-the-judge, drift is invisible until it produces strange scores.

Q4. What is the difference between a false positive on the eval and a mis-label? A false positive is when the eval flags a regression that is not real — the system's output is actually fine, but the eval scores it as failure. Cause is usually a stale label or a noisy judge. A mis-label is when the label itself is wrong — the case's expected behaviour is not what the team would currently call correct. Both produce similar symptoms (the eval is saying something the team disagrees with). The distinction matters because the fix differs: false-positive-from-stale-label is a label update; mis-label is a label correction. The discipline is to investigate the case and determine which. Wrong-answer notes: treating them as the same misses the lesson about why the label was wrong.


What to do differently after reading this

  • Build the investigation workflow as a regular process; document each incident.
  • Look at the cases on every regression gate block; never trust the score alone.
  • Update labels when investigation shows they are stale; do not preserve wrong labels for "comparability."
  • Add a missed case to the set immediately; do not wait for the cadence.
  • Detect judge drift with a small eval-of-the-judge set; recalibrate periodically.

Bridge. Eleven chapters of discipline. The last two synthesise. The next chapter is the architect's checklist — twenty items that distinguish a defensible eval-set operation from one that is not. → 12-architect-checklist.md