06. Judge and rubric calibration¶

The pipeline feeds the artefacts. The judge — the LLM-as-judge that grades eval cases — needs ongoing calibration against user perception. Feedback is the ground truth; the judge's job is to approximate it. Calibration keeps the approximation accurate.

A platform engineer at a Pune SaaS company notices that the eval score has stayed steady at 0.86 while user feedback has been declining. The judge says everything is fine; users say otherwise. She investigates. The judge's rubric, written eight months ago, emphasises factual accuracy and tone, weighted equally. The feedback patterns show that users care much more about specificity and completeness: answers that are correct but generic are judged "good" by the rubric and "bad" by users. The judge is misaligned with user perception. The fix is to refine the rubric to weight specificity and completeness more heavily, calibrate against recent user feedback, and re-baseline. After the fix, the judge's scores move with user feedback.

This chapter covers the calibration discipline. Judges drift, rubrics age, and users change; calibration is the ongoing alignment work.

What calibration is¶

Calibration is the ongoing alignment between the judge's scores and the users' perception of quality, maintained through periodic comparison against feedback signals.

The judge is a proxy for user perception. The proxy is approximate. Calibration measures the gap and refines the rubric and judge prompt to close it.

What goes out of calibration¶

Three causes.

Rubric drift. The rubric was written at a point in time; user expectations have evolved; the rubric still captures the original criteria. The judge applies a rubric that does not reflect what users now care about.

Judge prompt drift. The judge's prompt has been edited over time; small changes accumulate; the judge interprets cases differently than originally intended.

Model drift. The provider has updated the judge LLM (for example, a model alias now resolves to a newer version than the one originally validated); behaviour shifts.

Each is detected by the same signal: judge scores diverge from user feedback on cases that have both.
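One way to watch for this signal is a disagreement rate over the cases that carry both a judge verdict and a user reaction. The following is a minimal sketch, assuming both sides have been reduced to a binary good/bad verdict; `ScoredCase` and its field names are illustrative and not taken from any particular eval framework.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ScoredCase:
    case_id: str
    judge_good: bool             # judge verdict, thresholded to good/bad (assumption)
    user_good: Optional[bool]    # thumbs-up/down; None when the user left no feedback


def divergence_rate(cases: list[ScoredCase]) -> Optional[float]:
    """Share of cases with user feedback where judge and user disagree."""
    paired = [c for c in cases if c.user_good is not None]
    if not paired:
        return None  # no overlap between judged and rated cases
    disagreements = sum(c.judge_good != c.user_good for c in paired)
    return disagreements / len(paired)
```

A rising rate does not say which of the three causes is at work; it only says the judge and the users have moved apart and the disagreement cases are worth reading.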

The calibration set¶

A small set of cases (50-200) that have:

A production prompt-response pair
An explicit user feedback signal (thumbs/rating)
A human-labelled "true" verdict for what counts as good

The calibration set is the gold standard for judging the judge. Run the judge against it; compare to the human labels; compute agreement.

The set is curated separately from the regression set, with the same disciplines from module 01: stratified, refreshed, versioned.
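The three fields and the curation disciplines can be made concrete as a small record type. This is a sketch under assumptions: the class names, the string encodings for feedback and verdicts, and the `stratum` key are all hypothetical; only the 50-200 size range comes from the text above.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass(frozen=True)
class CalibrationCase:
    case_id: str
    prompt: str           # production prompt
    response: str         # production response
    user_feedback: str    # explicit signal, e.g. "thumbs_up" / "thumbs_down"
    human_verdict: str    # human-labelled verdict, e.g. "good" / "bad"
    stratum: str          # hypothetical stratification key, e.g. an intent name


@dataclass
class CalibrationSet:
    version: str          # bump whenever cases are added, removed, or relabelled
    cases: list[CalibrationCase] = field(default_factory=list)

    def strata_counts(self) -> Counter:
        """Case count per stratum, to check the set is not lopsided."""
        return Counter(c.stratum for c in self.cases)

    def size_ok(self, low: int = 50, high: int = 200) -> bool:
        """True when the set falls in the 50-200 range suggested above."""
        return low <= len(self.cases) <= high
```

Keeping the version on the set itself means every agreement number can be reported alongside the exact set it was computed on.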

The calibration metric¶

The metric is inter-rater agreement between judge and human labels, measured with a chance-corrected statistic such as Cohen's kappa or Krippendorff's alpha.
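For two raters and categorical labels, Cohen's kappa is (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and p_e is the agreement expected by chance from each rater's label frequencies. A minimal pure-Python sketch follows; the function name is illustrative, and returning 1.0 when both raters use a single identical label throughout is a convention chosen here, since the statistic is undefined in that case.

```python
from collections import Counter


def cohens_kappa(judge: list[str], human: list[str]) -> float:
    """Cohen's kappa for two equal-length lists of categorical labels."""
    if len(judge) != len(human) or not judge:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(judge)
    # Observed agreement: share of cases where both raters gave the same label.
    observed = sum(j == h for j, h in zip(judge, human)) / n
    # Chance agreement: product of each rater's marginal frequency, per label.
    judge_freq = Counter(judge)
    human_freq = Counter(human)
    expected = sum(judge_freq[label] * human_freq[label] for label in judge_freq) / (n * n)
    if expected == 1.0:
        return 1.0  # degenerate case; convention chosen for this sketch
    return (observed - expected) / (1 - expected)
```

With 10 cases, 8 agreements, and both raters labelling 6 of 10 as "good", p_o is 0.80 and p_e is 0.52, so kappa is about 0.58: noticeably lower than the raw 80% agreement suggests.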

Agreement	Interpretation	Action
>0.80	Judge calibrated	Continue current operation
0.60-0.80	Judge reasonable	Refine rubric on disagreement cases
0.40-0.60	Judge weak	Significant refinement; consider model change
<0.40	Judge unreliable	Major overhaul; do not rely on eval scores

The metric is computed monthly (or quarterly for stable platforms). A drop below the 0.80 threshold triggers refinement.
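The bands in the table can be encoded as a small lookup so the monthly run reports an action and not just a number. This is a sketch: the function name is hypothetical, and because the table's bands share their edges, assigning exactly 0.80 to "reasonable" and exactly 0.60 and 0.40 to the higher band is an assumption made here.

```python
def calibration_action(agreement: float) -> tuple[str, str]:
    """Map an agreement score to (interpretation, action) per the table above."""
    if agreement > 0.80:
        return ("calibrated", "Continue current operation")
    if agreement >= 0.60:  # boundary handling is an assumption of this sketch
        return ("reasonable", "Refine rubric on disagreement cases")
    if agreement >= 0.40:
        return ("weak", "Significant refinement; consider model change")
    return ("unreliable", "Major overhaul; do not rely on eval scores")
```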

When calibration shows misalignment:

Pull disagreement cases. Cases where the judge and the human (or user feedback) disagreed.

Read them. Look at the input, the response, the rubric, the judge's reasoning, the human verdict. Why does the judge say "good" when the human says "bad" (or vice versa)?

Identify the pattern. Often a single rubric criterion is too weak or too strict; cases that exercise that criterion are where disagreement clusters.

Refine the rubric. Sharpen the criterion. Add a new criterion if needed. Remove a criterion that produces noise.

Re-validate. Run the refined judge against the calibration set; check that agreement improved without regressing on aligned cases.

Ship. Update the judge prompt and bump its version. The regression set's scores will shift, so re-baseline them following the discipline in chapter 06 of module 01.

The refinement is small per cycle; over months, the judge tracks user perception faithfully.
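The re-validation step in the cycle above can be sketched as a comparison of the old and refined judge against the same human labels. The names here are hypothetical, verdicts are assumed to be plain strings keyed by case id, and raw agreement is used for brevity where a real run would more likely report kappa. Treating any regression as a blocker is also an assumption; a team may tolerate a small number.

```python
from dataclasses import dataclass


@dataclass
class RevalidationReport:
    fixed: list[str]        # case ids that disagreed before and agree now
    regressed: list[str]    # case ids that agreed before and disagree now
    old_agreement: float
    new_agreement: float

    @property
    def safe_to_ship(self) -> bool:
        # Strict rule for this sketch: agreement improved and nothing regressed.
        return self.new_agreement > self.old_agreement and not self.regressed


def revalidate(
    human: dict[str, str],
    old_judge: dict[str, str],
    new_judge: dict[str, str],
) -> RevalidationReport:
    """Compare two judge versions against human labels on the full calibration set."""
    if not human:
        raise ValueError("calibration set is empty")
    fixed, regressed = [], []
    old_hits = new_hits = 0
    for case_id, truth in human.items():
        was_right = old_judge[case_id] == truth
        is_right = new_judge[case_id] == truth
        old_hits += was_right
        new_hits += is_right
        if is_right and not was_right:
            fixed.append(case_id)
        elif was_right and not is_right:
            regressed.append(case_id)
    n = len(human)
    return RevalidationReport(fixed, regressed, old_hits / n, new_hits / n)
```

The `regressed` list is the part worth reading by hand: it names the previously aligned cases that the refined rubric broke.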

Using user feedback as ground truth¶

User feedback is one form of ground truth. It is biased (chapter 07 of this module), but with bias awareness, it is useful.

Conservative usage:

Aggregate feedback across many users; single-user reactions are noisy.
Filter to cases with strong signal (definite thumbs-down, specific negative comments); ambiguous feedback is less useful.
Weight by user representativeness; a small biased segment should not dominate calibration.

The judge's job is to predict the median user's reaction; calibration aligns it to that target.
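The three conservative-usage rules above can be combined into one aggregation pass. This is a sketch under assumptions: the `Feedback` fields, the `strong` flag, the per-segment weights, and the minimum of three votes are all hypothetical choices, and a weighted share of positive votes is used here as a simple stand-in for the typical user's reaction.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Feedback:
    case_id: str
    segment: str      # user segment, used for representativeness weighting
    thumbs_up: bool
    strong: bool      # definite reaction or specific comment (hypothetical flag)


def aggregate_feedback(
    events: list[Feedback],
    segment_weight: dict[str, float],
    min_votes: int = 3,
) -> dict[str, float]:
    """Return {case_id: weighted share of positive votes} for well-covered cases."""
    # Per case: [positive weight, total weight, number of strong votes].
    totals = defaultdict(lambda: [0.0, 0.0, 0])
    for e in events:
        if not e.strong:
            continue  # drop ambiguous feedback
        w = segment_weight.get(e.segment, 1.0)
        t = totals[e.case_id]
        if e.thumbs_up:
            t[0] += w
        t[1] += w
        t[2] += 1
    return {
        case_id: pos / tot
        for case_id, (pos, tot, votes) in totals.items()
        if votes >= min_votes and tot > 0  # skip thinly covered cases
    }
```

Down-weighting an over-represented segment (for example, giving a vocal power-user group a weight of 0.5) keeps it from dominating the target the judge is calibrated against.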

Multi-criteria calibration¶

A rubric typically has multiple criteria. Calibrate per criterion:

Cases where the judge's verdict on "factual accuracy" disagreed with users.
Cases where its verdict on "tone" disagreed.
Cases where its verdict on "specificity" disagreed.

Per-criterion calibration is more actionable than aggregate; the team knows which criterion needs refinement.
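One reading of per-criterion calibration is to compare each criterion's pass/fail verdict with the user's overall reaction and report agreement per criterion. The sketch below assumes that reading; the dictionary layout and key names are hypothetical, and raw agreement is used for brevity.

```python
from collections import defaultdict


def per_criterion_agreement(cases: list[dict]) -> dict[str, float]:
    """Share of cases where each criterion's verdict matched the user's verdict.

    Each case is assumed to look like:
        {"user_good": bool, "judge": {"criterion name": bool, ...}}
    """
    hits = defaultdict(int)
    seen = defaultdict(int)
    for case in cases:
        for criterion, judge_pass in case["judge"].items():
            seen[criterion] += 1
            hits[criterion] += judge_pass == case["user_good"]
    return {criterion: hits[criterion] / seen[criterion] for criterion in seen}
```

A criterion that passes almost every case while users are split will show low agreement here, which is the pattern in the opening example: factual accuracy passes, users are still unhappy, and specificity turns out to track their reaction better.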

When the judge cannot be calibrated¶

Sometimes the gap between judge and user is structural, not refinable:

The model the judge uses is too weak for the domain.
The criterion the user cares about is inherently subjective and the judge cannot reliably evaluate.
The user population is heterogeneous; no single judge fits all segments.

The responses:

Try a more capable judge model.
Acknowledge the criterion's limit; reduce its weight in the score; supplement with human review for high-stakes cases.
Segment the judge — different judges for different segments — if heterogeneity warrants.

Calibration is not always achievable to 0.80; the discipline is to know the achievable range and operate within it.
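Two of the responses above, reducing a criterion's weight and adding human review for high-stakes cases, can be sketched as follows. The function names, the weights, and the 0.60 floor are hypothetical; the right values depend on the achievable agreement range for the platform.

```python
def weighted_score(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted mean of per-criterion scores; weights need not sum to 1."""
    total = sum(weights[name] for name in scores)
    if total <= 0:
        raise ValueError("weights must sum to a positive number")
    return sum(scores[name] * weights[name] for name in scores) / total


def needs_human_review(
    high_stakes: bool,
    criterion_agreement: dict[str, float],
    floor: float = 0.60,  # hypothetical floor for "poorly calibrated"
) -> bool:
    """Route high-stakes cases to a person when any criterion is poorly calibrated."""
    return high_stakes and any(a < floor for a in criterion_agreement.values())
```

Lowering a weak criterion's weight limits how much an unreliable verdict can move the aggregate, while the review flag keeps a person in the loop where a wrong score would be costly.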

Common mistakes¶

Judge calibrated once, never re-calibrated. Drift accumulates.

Calibration without disagreement-case review. The metric shows misalignment but the team does not investigate.

Refinement without re-validation. New rubric improves on disagreements but regresses on previously-aligned cases.

Single-criterion calibration on multi-criterion rubrics. The aggregate hides per-criterion gaps.

Trusting the judge with low agreement. Eval scores drive decisions when the judge itself is unreliable.

Interview Q&A¶

Q1. The eval is 0.86; users are unhappy. What is the likely cause? The judge is calibrated against the team's rubric, not against user perception. The rubric may have aged; user expectations may have shifted; the judge is faithfully producing scores against the old rubric. The calibration cycle compares the judge against user feedback and refines the rubric to close the gap. The 0.86 is honest about the rubric; the user dissatisfaction is honest about reality; the two diverged silently. Wrong-answer notes: "users are wrong" or "the model regressed" both miss the rubric-judge-user alignment issue.

Q2. Walk through a monthly calibration cycle. Pull the calibration set (50-200 cases with feedback and human labels). Run the judge against it. Compute agreement between judge and human labels. If above 0.80, continue. If lower, pull the disagreement cases; read them; identify the pattern (often a single rubric criterion is the source). Refine the rubric. Re-run the judge on the calibration set; verify agreement improved without regressing aligned cases. Ship the refined judge; bump its version; re-baseline the regression set's scores. Wrong-answer notes: "check the score" without disagreement-case review misses the substance.

Q3. The judge cannot be calibrated above 0.65 on a criterion. What do you do? The criterion may be structurally hard to evaluate with the current judge — too subjective, too domain-specific, too dependent on context the judge lacks. Options: try a more capable judge model (sometimes solves); reduce the criterion's weight in the aggregate score; supplement with human review for cases where this criterion is critical; segment the judge if heterogeneity warrants. The discipline is to acknowledge the limit and operate appropriately, not to pretend the score is reliable. Wrong-answer notes: "raise the threshold" silently lowers the eval's value.

Q4. How do you ensure calibration refinement does not regress on aligned cases? Always re-run the full calibration set, not just the disagreement subset, after refinement. The metric is the agreement across all cases; an improvement on disagreement cases that produces a regression on aligned cases is a wash. The refinement must improve agreement overall, not just on the specific cases that triggered the refinement. Track per-criterion and per-stratum agreement to detect regression hidden in averages. Wrong-answer notes: "test on the disagreement cases only" produces over-fitted refinement.

What to do differently after reading this¶

Establish a calibration set and a monthly calibration cycle.
Compute per-criterion agreement, not just aggregate.
Read disagreement cases for refinement; do not just adjust thresholds.
Re-validate refinements on the full calibration set.
When the judge cannot be calibrated, acknowledge and adjust how the score is used.

Bridge. Calibration assumes the feedback is itself a reliable signal. The next discipline is awareness of the biases (selection, response, sycophancy) that distort the feedback if left uncorrected. → 07-bias-in-feedback.md