08. Judge calibration — the rubric is anchored, but the judge still drifts

~18 min read. A locked rubric does not lock the judge's inner meter. Calibration is the discipline of measuring that meter, probing the biases that bend it, and freezing the prompt that survives.

Builds on the ELI5 in 00-eli5.md. The rubric specifies the criteria; the inspection runs the measurement; this chapter asks whether the inspector reading the rubric can be trusted to produce the same number twice.


What chapter 07 anchored, and what the judge still gets wrong

In chapter 07 the team rewrote a fluffy "helpful and friendly" rubric into a four-dimension rubric with anchors — policy correctness 0–3, handoff completeness 0–3, brand-voice match 0–3, hallucination 0/1. Anchor strings name the visible behaviour at each level: "score 2 if the bot cites the right clause but omits one required disclaimer." The rubric is now a contract a labeller can grade against. So far, so good.

Chapter 01 taught the rule that a quality claim covers only the sample it measured. Chapter 06 introduced the judge — a large model scoring outputs at 1/100th the cost of a human reviewer. The natural assumption is that a tight rubric plus a frontier judge equals a reliable number. That assumption is what this chapter dismantles.

The rubric is anchored, but the judge's interpretation of the anchors drifts with model family, prompt phrasing, and rubric subtlety. Without calibration, the scores are theatre. Two judges from the same vendor will disagree on the same conversation. The same judge, asked twice with two harmless prompt rewordings, will return average scores a few tenths of a point apart on the 0–3 scale. A judge from the GPT family will quietly prefer GPT-generated answers when comparing GPT vs Claude on the same rubric. None of these failures shows up in the aggregate. All of them poison the eval loop the team is about to bet ship decisions on.

What this file solves

The refund chatbot has a fixed rubric and a Claude-based judge scoring 100 sampled conversations per week. The aggregate looks stable at 78%. This file walks the team through building a 50-example human-anchored calibration set, measuring Cohen's kappa at a baseline of 0.62, running three bias probes that drop kappa to 0.54 under positional swap, refining the judge prompt to regain kappa = 0.71, and locking that prompt under version control. By the end you can describe the difference between "the judge scored 78%" and "the judge scored 78% with kappa = 0.71 against humans on a January 2026 calibration set, with positional bias under 3 points and self-preference probe passed." Only the second statement is evidence.

Why a frontier judge is not calibration

The 78% aggregate is not the problem. The problem is that nobody in the room can tell you whether the judge would produce 78% again next week on the same conversations.

Pick one refund conversation. Score it five times with the same Claude-judge prompt at temperature 0. You get scores of 2, 2, 3, 2, 3 on the same anchored 0–3 dimension. The model is mostly deterministic; the prompt is fixed; the rubric is fixed. The one-point swing across five runs is the inner meter wobbling around the anchor boundary between "cites clause but omits disclaimer" (score 2) and "cites clause with all disclaimers" (score 3). The model interprets the boundary slightly differently each time.

That alone would be tolerable noise. The deeper failure is that the wobble is not random. It correlates with features the rubric never mentioned:

  • Positional bias. In pairwise comparisons, whichever answer is shown first wins 56–62% of the time, not the 50% you'd expect under a fair coin.
  • Verbosity bias. Longer answers get higher Likert scores even when the rubric scores only correctness. A 400-token answer rates ~0.4 points higher on a 0–3 scale than the same content compressed to 120 tokens.
  • Self-preference. A GPT-4 judge prefers GPT-4 outputs over Claude outputs at ~58% even on rubric-tied responses. A Claude judge mirrors the effect for Claude outputs.
  • Leniency drift. Run the same judge prompt against the same eval set every Monday for ten weeks. Average scores creep up ~0.15 Likert points with no model change. The judge is not getting kinder; small variations in input distribution shift the anchor boundaries.
  • Prompt sensitivity. Reword the judge system prompt from "score 0 to 3 against the rubric" to "rate from 0 to 3 using the rubric". Average score shifts 0.3 points. The model treats score and rate as different verbs.

None of these biases is detectable from the aggregate. All of them are detectable with the right probe. The rubric is the contract; calibration is the audit that the contract is being applied without these systematic distortions.

The naive repair, the visible break, the diagnosis

The first repair smart teams reach for is "upgrade to a more capable judge." Move from GPT-4o to Claude Opus, or to Gemini 2.5 Pro. The reasoning is that frontier reasoning models follow rubrics more faithfully. The visible break: the upgraded judge has higher agreement-with-itself but its biases remain. Opus still favours its first option ~58% of the time. Sonnet still rates verbose answers higher. Frontier models reduce noise; they do not remove directional bias because the bias is baked into pretraining preferences, not into capability gaps.

The second repair is "average across three judges." This helps with random noise but fails on systematic bias. If all three judges favour first position, the ensemble still favours first position. Averaging preserves a shared bias rather than cancelling it. Not a model-strength problem, not an averaging problem. A measurement-validity problem. The judge has not been compared against a trusted source, so there is no way to know whether the score it produces means what the team thinks it means.

So the natural question is: what trusted source would let us measure how far the judge has drifted from human consensus, and which probes would expose the specific biases the model is leaking?

When two judges read the same conversation and disagree by a full point

Here is one refund conversation from the live sample, scored on policy correctness 0–3 by two humans and three judges.

CONVERSATION 4481 — customer wants refund on a late delivery

Human reviewer A:          2  (cites clause, omits 14-day disclaimer)
Human reviewer B:          2  (same reasoning)
Claude Sonnet (judge):     3  (rates as full credit)
GPT-4o (judge):            2  (matches humans)
Gemini 2.5 Pro (judge):    1  (flags it as policy-incomplete)

Spread: 1 to 3. Human consensus: 2.

One conversation, three judges, a two-point spread on a four-point scale. The aggregate over 100 conversations will hide this. The team will see Claude judge: average 2.4 and Gemini judge: average 2.1 and assume a small calibration offset. The real story is that each judge is interpreting the "omits one required disclaimer" anchor differently — Sonnet ignores the disclaimer requirement, Gemini treats it as fatal, GPT-4o splits the difference. Without a human anchor, you cannot tell which one is right. They cannot all be right.

This is the inspectable artifact at the start of the chapter. Save the 5x100 score matrix; every diagnostic below operates on it.
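As a minimal sketch of how one row of that matrix can be inspected, the snippet below takes the conversation 4481 scores from the table above and measures each judge's signed offset from the human consensus. The dictionary keys and the `judge_offsets` helper are illustrative names, not tooling the chapter describes, and treating the median of the two human scores as the consensus is an assumption.

```python
from statistics import median

# Scores for conversation 4481, copied from the table above.
scores = {
    "human_a": 2,
    "human_b": 2,
    "claude_sonnet": 3,
    "gpt_4o": 2,
    "gemini_2_5_pro": 1,
}
HUMANS = ("human_a", "human_b")  # rater names are illustrative


def judge_offsets(row, humans=HUMANS):
    """Signed distance of each judge from the human consensus (median)."""
    consensus = median(row[h] for h in humans)
    return {name: s - consensus for name, s in row.items() if name not in humans}


offsets = judge_offsets(scores)
spread = max(scores.values()) - min(scores.values())
print(offsets, spread)
```

Run over all 100 conversations, the same offsets show whether a judge is consistently high, consistently low, or only wrong on borderline anchors.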

The rule: a judge score without an agreement number is a number without a unit

State it plainly: a judge produces decisions a team can trust only when its agreement with humans is measured, its biases are probed, and its prompt is version-locked. Three operations, in that order. Skip any one and the score is decorative.

Agreement is measured with Cohen's kappa for categorical or Likert decisions, MAE for continuous, AUC for binary. Bias is probed with three controlled experiments — swap positions, vary verbosity, swap families. Locking is done by storing the judge model name, the judge prompt text, the rubric version, and the calibration timestamp in the same place the model checkpoint lives. A judge prompt is a model artifact. Treat it like one.

Teacher voice. Accuracy and calibration are different properties. A judge can be highly accurate on average (mean score close to human mean) and badly miscalibrated (kappa 0.45 because it agrees on easy cases and disagrees on every borderline one). Aggregate accuracy hides per-case agreement. Kappa exposes it.


1) Building the calibration set — 50 examples, two humans, one number

The unit of trust is a small set of conversations that humans have labelled carefully. For the refund chatbot, the team built a 50-example calibration set from the live week's sample. The construction rules are load-bearing:

  1. Stratified by intent. 15 simple-refund, 15 awkward-refund (auto-filled wrong address, partial refund), 10 policy-edge (warranty after death, regulated jurisdictions), 10 escalation-required.
  2. Two independent human reviewers. Each scores every example on every rubric dimension without seeing the other's scores.
  3. Adjudication. Disagreements above 1 point on a 0–3 scale are resolved in a 20-minute review meeting. The adjudicated score is the gold label.
  4. Inter-rater agreement first. Cohen's kappa between the two humans is computed before any judge is touched. If humans disagree at kappa < 0.6, the rubric is the problem, not the judge.

For the refund chatbot, human–human kappa came in at 0.78 on policy correctness, 0.82 on handoff completeness, 0.71 on brand-voice match, and 0.91 on hallucination (binary). The brand-voice number was the weakest, exactly the dimension chapter 07 had warned was hardest to anchor. Two humans, two different sensibilities for "sounds friendly enough." For the inspection to be honest, this number sets the ceiling. A judge cannot meaningfully exceed human–human agreement on the same rubric.
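The inter-rater gate in rule 4 can be sketched in a few lines. This is an unweighted Cohen's kappa written from the textbook definition; the chapter does not say whether the team used a weighted variant for the ordinal 0–3 dimensions, so treat that choice as an assumption. The reviewer labels are toy data, and `rubric_needs_work` is a hypothetical helper name.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Unweighted Cohen's kappa for two equal-length label sequences."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty sequences of equal length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: product of the two raters' marginal label frequencies.
    expected = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


def rubric_needs_work(human_a, human_b, floor=0.6):
    """True when the humans themselves disagree too much to anchor a judge."""
    return cohen_kappa(human_a, human_b) < floor


# Toy labels for ten conversations (hypothetical, not the team's data).
reviewer_a = [0, 1, 2, 3, 2, 2, 1, 3, 0, 2]
reviewer_b = [0, 1, 2, 3, 2, 1, 1, 3, 0, 3]
human_human_kappa = cohen_kappa(reviewer_a, reviewer_b)
print(round(human_human_kappa, 2), rubric_needs_work(reviewer_a, reviewer_b))
```

The same function is reused later for judge-versus-gold agreement; only the second argument changes.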

50 is the floor. 100 is comfortable. Going to 500 helps precision but rarely changes the decision. The cost is real: 50 examples at ~5 minutes per dimension per example, across four dimensions, is roughly 17 hours of labelling for each of the two reviewers. Budget it once a quarter, plus once per rubric change.

2) Measuring baseline agreement — the kappa = 0.62 number

Run the candidate judge — Claude Sonnet 3.5, system prompt v1 — on the 50-example set. Score every conversation on every rubric dimension. Compute Cohen's kappa between the judge's labels and the gold labels.

JUDGE-VS-HUMAN AGREEMENT, baseline (Claude Sonnet 3.5, prompt v1)

Dimension              Kappa    MAE     Interpretation
─────────────────────────────────────────────────────────
Policy correctness     0.62     0.41    substantial (low edge), borderline trust
Handoff completeness   0.68     0.36    substantial, usable
Brand-voice match      0.51     0.58    moderate, do not trust slice
Hallucination (0/1)    0.84     n/a     almost perfect, AUC = 0.92

Overall (macro):       0.66 kappa, 0.45 MAE
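The macro row is a plain unweighted mean over dimensions. A short sketch, using the per-dimension numbers from the table above and skipping MAE for the binary dimension where it is not reported (the dictionary layout and the `macro` helper are illustrative):

```python
baseline = {
    "policy_correctness":   {"kappa": 0.62, "mae": 0.41},
    "handoff_completeness": {"kappa": 0.68, "mae": 0.36},
    "brand_voice_match":    {"kappa": 0.51, "mae": 0.58},
    "hallucination":        {"kappa": 0.84, "mae": None},  # binary: no MAE reported
}


def macro(table, metric):
    """Unweighted mean of one metric over the dimensions that report it."""
    values = [row[metric] for row in table.values() if row[metric] is not None]
    return sum(values) / len(values)


macro_kappa = macro(baseline, "kappa")
macro_mae = macro(baseline, "mae")
print(round(macro_kappa, 2), round(macro_mae, 2))
```

Because the mean is unweighted, a strong binary dimension can mask a weak Likert one, which is why the per-dimension rows matter more than the macro line.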

The interpretation matters. Kappa scales (Landis & Koch, the most widely cited):

  • < 0.20 — slight; the judge is barely above chance.
  • 0.21–0.40 — fair; usable for direction, not magnitude.
  • 0.41–0.60 — moderate; usable for triage, not for ship decisions on tight slices.
  • 0.61–0.80 — substantial; the default acceptance band for production evals.
  • > 0.80 — almost perfect; required for high-stakes domains (medical, legal, financial).

Refund-chatbot policy correctness at 0.62 sits at the bottom edge of "substantial." Brand-voice at 0.51 sits in "moderate" — the team cannot trust brand-voice judgements for any decision finer than a 10-point shift. Hallucination at 0.84 is strong because it is binary and the rubric is sharp ("invented a fact not in the policy"). The closer the rubric anchors to physical evidence, the higher kappa rises.
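Applying the bands mechanically keeps the labels honest. The sketch below encodes the scale exactly as listed above, with the assumption that each upper bound is inclusive; `landis_koch_band` is an illustrative helper name.

```python
# (inclusive upper bound, label), following the list above.
BANDS = [
    (0.20, "slight"),
    (0.40, "fair"),
    (0.60, "moderate"),
    (0.80, "substantial"),
    (1.00, "almost perfect"),
]


def landis_koch_band(kappa):
    """Map a kappa value to its Landis & Koch label."""
    for upper, label in BANDS:
        if kappa <= upper:
            return label
    raise ValueError("kappa cannot exceed 1.0")


baseline_kappa = {
    "policy_correctness": 0.62,
    "handoff_completeness": 0.68,
    "brand_voice_match": 0.51,
    "hallucination": 0.84,
}
bands = {dim: landis_koch_band(k) for dim, k in baseline_kappa.items()}
print(bands)
```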

Mini-FAQ. "Why kappa instead of raw agreement?" Raw agreement includes agreement that would happen by chance. Two judges who both guess uniformly at random will agree 25% of the time on a 0–3 scale. Raw agreement of 80% on a binary rubric where 90% of answers are class 1 is meaningless, because a constant-output judge would score 90%. Kappa subtracts the chance baseline.
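The two numbers in that answer follow directly from the standard definition of Cohen's kappa, where p_o is observed agreement and p_e is the agreement expected from the raters' label frequencies alone. A short worked version:

```latex
\kappa = \frac{p_o - p_e}{1 - p_e}

% Two raters guessing uniformly over the four levels 0..3:
p_e = \sum_{k=0}^{3} \frac{1}{4}\cdot\frac{1}{4} = 0.25

% Constant-output judge on a binary rubric where 90% of gold labels are class 1:
p_o = 0.90, \qquad
p_e = (1.0)(0.90) + (0.0)(0.10) = 0.90, \qquad
\kappa = \frac{0.90 - 0.90}{1 - 0.90} = 0
```

The constant judge's 90% raw agreement is therefore worth exactly zero once chance is removed.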

3) Probing the three biases — when kappa drops to 0.54

A baseline kappa of 0.62 is a starting point, not a verdict. The bias probes ask: under what conditions does this number drop? Three controlled experiments, run on the same 50-example set with deliberate manipulations.

Probe 1 — positional swap (pairwise mode)

The judge sometimes runs in pairwise mode: shown two candidate replies, asked which is better. The team takes 30 pairwise cases from the calibration set, runs each one twice — first with answer A in position 1 and B in position 2, then swapped. A fair judge should pick the same winner both times.

POSITIONAL-BIAS PROBE — 30 pairwise cases, judge run twice with swap

Same winner under swap:        21 of 30  (70%)
Reversed winner under swap:     9 of 30  (30%)

Of the 9 reversals:
  position-1 wins both times:   7
  position-2 wins both times:   2

First-position excess:          ~7 points above the 50% baseline
Kappa on the consistent subset: 0.54 (dropped from 0.62 baseline)

The judge has a measurable preference for whatever it sees first. Seven points of first-position excess is on the high end of published numbers for frontier judges — Chatbot Arena and MT-Bench studies consistently report 3–10 points. The kitchen log — every per-case score with position labels — is what makes this probe actionable. Without per-case labels, the team would see the aggregate winner counts and miss the directional skew entirely.
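A sketch of how the per-case log turns into the probe summary. The record format is an assumption: each case stores the winning candidate ("A" or "B") under the original order, where A is shown first, and under the swapped order, where B is shown first. The ten records are toy data, not the team's 30 cases, and the win-rate definition used here is one reasonable choice among several.

```python
def positional_probe(results):
    """Summarise a swap probe.

    results: list of (winner_original, winner_swapped), each "A" or "B".
    Original order shows A in slot 1; swapped order shows B in slot 1.
    """
    n = len(results)
    same_winner = sum(w1 == w2 for w1, w2 in results)
    first_slot_both = sum((w1, w2) == ("A", "B") for w1, w2 in results)
    second_slot_both = sum((w1, w2) == ("B", "A") for w1, w2 in results)
    # Across both runs, how often did whichever answer sat in slot 1 win?
    first_slot_wins = (
        sum(w1 == "A" for w1, _ in results) + sum(w2 == "B" for _, w2 in results)
    )
    return {
        "same_winner": same_winner,
        "first_slot_both": first_slot_both,
        "second_slot_both": second_slot_both,
        "first_slot_win_rate": first_slot_wins / (2 * n),
    }


# Toy log: 6 stable verdicts, 3 slot-1 reversals, 1 slot-2 reversal.
toy = [("A", "A")] * 3 + [("B", "B")] * 3 + [("A", "B")] * 3 + [("B", "A")]
summary = positional_probe(toy)
print(summary)
```

A fair judge would land near a 0.5 slot-1 win rate; the gap above 0.5 is the directional skew that the aggregate winner counts hide.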

Probe 2 — verbosity manipulation

Take 20 cases where the rubric-correct answer is the shorter one. Generate a verbose variant of each — same content, 2.5x the tokens, padded with restatements of the user's question and polite filler. Score both with the judge.

VERBOSITY-BIAS PROBE — 20 cases, content held equal

Verbose variant scored higher than concise:  14 of 20  (70%)
Equal score:                                  4 of 20  (20%)
Concise scored higher:                        2 of 20  (10%)

Mean Likert delta (verbose - concise):       +0.4 on 0-3 scale
Expected delta under no bias:                 0

A 0.4-point lift purely for being verbose is the kind of bias that quietly rewards a wordier prompt template across an entire eval campaign. The team has been favouring the wordier version of its system prompt for two months. The eval score went up. The CSAT score did not.
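The probe summary reduces to three counts and one mean. A minimal sketch on five toy pairs (hypothetical scores, chosen only to illustrate the calculation; `verbosity_probe` is an illustrative name):

```python
def verbosity_probe(pairs):
    """pairs: list of (concise_score, verbose_score) for content-matched replies."""
    deltas = [verbose - concise for concise, verbose in pairs]
    return {
        "verbose_higher": sum(d > 0 for d in deltas),
        "equal": sum(d == 0 for d in deltas),
        "concise_higher": sum(d < 0 for d in deltas),
        "mean_delta": sum(deltas) / len(deltas),
    }


# Toy data: (concise, verbose) judge scores on the 0-3 scale.
toy_pairs = [(2, 3), (2, 2), (1, 2), (3, 2), (2, 3)]
report = verbosity_probe(toy_pairs)
print(report)
```

Since the content is held equal, any mean delta that stays away from zero across re-runs is attributable to length rather than correctness.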

Probe 3 — self-preference (cross-family)

Take 20 cases where Claude and GPT-4 produce roughly matched-quality answers — adjudicated by humans as ties. Score each pair with the Claude judge and again with a GPT-4 judge.

SELF-PREFERENCE PROBE — 20 matched-quality pairs

Claude judge preferred Claude output:     13 of 20  (65%)
GPT-4 judge preferred GPT-4 output:       12 of 20  (60%)
Human consensus (adjudicated):             ties

Expected near-even split under no bias:   ~10 of 20
Self-preference excess:                   2-3 cases of 20 per judge

Both judges lean toward their own family. The effect is smaller than positional bias but more dangerous in vendor-comparison work — if the team is using a Claude judge to decide whether to switch to GPT-4, the judge is structurally biased against the candidate.

Stack the probes. On the worst-case slice of the calibration set — pairwise cases shown with the GPT output in position 2 — kappa drops from 0.62 baseline to 0.54. That is the headline. The aggregate kappa hides this; the slice kappa exposes it.

4) Refining the judge prompt — regaining kappa = 0.71

Three concrete changes to the judge system prompt, each motivated by one probe.

PROMPT V1 (baseline, kappa 0.62 / worst-slice 0.54)

  "Score the following reply on a 0-3 scale using the rubric below.
   Provide your score and a one-sentence justification."

PROMPT V2 (refined, kappa 0.71 / worst-slice 0.66)

  "You are scoring replies against an anchored rubric. Apply each
   anchor strictly; if the anchor for score 2 says 'omits one
   required disclaimer', a reply missing that disclaimer cannot
   receive 3 regardless of fluency or length.

   Ignore answer length when scoring; rubrics never reward verbosity.

   In pairwise comparisons, evaluate both answers against the
   rubric independently before comparing. Do not let presentation
   order influence the decision.

   Provide score, anchor string you matched, and justification
   in this exact JSON shape: {score, anchor_matched, reason}."

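Three deliberate moves: (a) restate the strict-anchor rule the rubric already implies, because the judge under-weights it without reminder; (b) explicit verbosity nullifier; (c) force the judge to name the anchor string it matched, which structurally constrains the score. The pairwise paragraph targets the positional probe: scoring each answer against the rubric on its own before comparing leaves less room for presentation order to decide the winner. The JSON forcing is from chapter 06's judge-output discipline.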
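Naming the anchor only constrains the score if something checks it. The sketch below validates a judge reply in the shape prompt v2 asks for. It assumes the reply is a JSON object with the three keys as strings; only the score-2 anchor text is quoted from chapter 07, and the other anchor strings, like the `parse_judge_reply` name, are hypothetical placeholders.

```python
import json

# Anchor text per score. Only the score-2 wording comes from the chapter;
# the other three strings are hypothetical placeholders.
ANCHORS = {
    0: "cites no clause or the wrong clause",
    1: "cites the right clause but misstates it",
    2: "cites the right clause but omits one required disclaimer",
    3: "cites the right clause with all required disclaimers",
}


def parse_judge_reply(raw, anchors=ANCHORS):
    """Parse a judge reply and reject it if score and anchor disagree."""
    reply = json.loads(raw)  # json.JSONDecodeError is a ValueError subclass
    if not isinstance(reply, dict):
        raise ValueError("judge reply must be a JSON object")
    missing = {"score", "anchor_matched", "reason"} - set(reply)
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    score = reply["score"]
    if isinstance(score, bool) or not isinstance(score, int) or score not in anchors:
        raise ValueError(f"score out of range: {score!r}")
    if str(reply["anchor_matched"]).strip() != anchors[score]:
        raise ValueError("score and anchor_matched disagree")
    return reply


good = parse_judge_reply(
    '{"score": 2, "anchor_matched": "cites the right clause but omits one '
    'required disclaimer", "reason": "14-day disclaimer missing"}'
)
print(good["score"])
```

A reply that claims score 3 while quoting the score-2 anchor raises `ValueError`, so the contradiction is caught before it reaches the aggregate.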

Re-run the full calibration suite on prompt v2:

JUDGE-VS-HUMAN AGREEMENT, prompt v2

Dimension              Kappa    MAE     Change from v1
─────────────────────────────────────────────────────────
Policy correctness     0.71     0.32    +0.09
Handoff completeness   0.74     0.30    +0.06
Brand-voice match      0.58     0.51    +0.07
Hallucination (0/1)    0.86     n/a     +0.02

Worst-slice kappa:     0.66     -       +0.12 (was 0.54)
First-position excess: 3 pts    -       -4 pts
Verbosity gap:         0.1      -       -0.3
Self-preference:       2 pts    -       -1 pt (smaller effect)

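Kappa rose from 0.62 to 0.71 on policy correctness and from 0.54 to 0.66 on the worst-case slice; the macro average across the four dimensions moved from 0.66 to 0.72. Policy correctness and handoff completeness now sit well inside the substantial-agreement band, and hallucination stays in almost perfect. Brand-voice stays in moderate: that one needs more rubric work, not more judge work, because human–human kappa was only 0.71 on that dimension.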

Lock prompt v2 under version control. The judge artifact in the eval system now looks like:

judge_id:           refund_judge_v2
model:              claude-3-5-sonnet-20241022
prompt_hash:        sha256:a4f1...
rubric_version:     refund_rubric_v3
calibration_kappa:  0.71 (policy correctness), 0.66 (worst slice)
calibration_set:    refund_cal_50_jan2026
locked_on:          2026-01-15
recalibrate_when:   rubric_version changes, judge_model changes,
                    application surface changes, kappa drift > 0.05

When chapter 09 introduces drift detection, this artifact is the anchor. Drift is measured against the calibration kappa, not against last week's aggregate.
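A sketch of how the artifact can be hashed and checked against its own `recalibrate_when` conditions. The field names mirror the artifact above; `application_surface` and its value are hypothetical stand-ins for "application surface changes", and the prompt text is truncated to its first sentence, so the hash printed here is not the one recorded in the artifact.

```python
import hashlib


def prompt_hash(prompt_text):
    """Content hash in the same 'sha256:<hex>' style as the artifact."""
    return "sha256:" + hashlib.sha256(prompt_text.encode("utf-8")).hexdigest()


def recalibration_reasons(locked, current, max_kappa_drift=0.05):
    """List every recalibrate_when condition that the current state trips."""
    reasons = []
    for field in ("rubric_version", "model", "application_surface"):
        if current[field] != locked[field]:
            reasons.append(f"{field} changed")
    if abs(current["kappa"] - locked["calibration_kappa"]) > max_kappa_drift:
        reasons.append("kappa drift")
    return reasons


locked = {
    "judge_id": "refund_judge_v2",
    "model": "claude-3-5-sonnet-20241022",
    "rubric_version": "refund_rubric_v3",
    "application_surface": "refund_chat",  # hypothetical label
    "calibration_kappa": 0.71,
    # First sentence of prompt v2 only; illustrative, not the real hash input.
    "prompt_hash": prompt_hash("You are scoring replies against an anchored rubric."),
}

this_week = {
    "model": "claude-3-5-sonnet-20241022",
    "rubric_version": "refund_rubric_v4",
    "application_surface": "refund_chat",
    "kappa": 0.64,
}
print(recalibration_reasons(locked, this_week))
```

An empty list means the locked calibration still applies; anything else names the condition that invalidated it.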

5) Mental model — the judge as a model that needs evaluation

        THE EVAL STACK BEFORE CALIBRATION
        ─────────────────────────────────
        production output ──→ rubric ──→ judge ──→ score
                                    (assumed honest)

        THE EVAL STACK AFTER CALIBRATION
        ─────────────────────────────────
        production output ──→ rubric ──→ judge ──→ score
                                           ▲
                                           │
                          human-labelled calibration set
                            kappa, MAE, bias probes
                          locked judge prompt + artifact

The model the team is evaluating used to be just the refund chatbot. After calibration, the team is evaluating two models — the chatbot and the judge. The judge sits inside the eval loop, and any drift in the judge contaminates every decision downstream. Chapter 06 introduced the judge as a measurement instrument. This chapter adds: instruments need calibration certificates, just like a kitchen scale needs to be checked against a known weight before you trust the bag of flour it measured.

The spot check — sampling representative conversations — is what built the calibration set. The rubric is the contract being audited. The kitchen log — per-case scores with position labels and length labels — is what makes bias probes possible.

6) Alternative comparison — single judge vs ensemble vs human-anchored

Three reasonable architectures for getting trustworthy numbers. The choice depends on cost, stakes, and rubric maturity.

Single judge, uncalibrated

What it costs: free beyond the judge inference. What it buys: nothing trustworthy. What it breaks: every decision rests on an unmeasured instrument. Use when: prototype only, no shipping decisions ride on the number.

Single judge, human-anchored calibration

What it costs: 50–100 human-labelled examples per quarter, ~20 person-hours per refresh, plus the judge inference. What it buys: a kappa number with footnotes, locked prompt, three bias probes passed. What it breaks: catches systematic miscalibration; does not protect against family-specific bias on cross-vendor comparisons. Use when: within-family eval, kappa target 0.65–0.75, normal product stakes.

Cross-family ensemble + human-anchored calibration

What it costs: 3x judge inference, plus the calibration overhead. What it buys: bias cancellation when judges from different families disagree on direction, plus a kappa-per-judge view that flags the rogue judge. What it breaks: 3x cost, more complex tie-breaking, slower turn-around. Use when: cross-vendor comparisons, kappa target 0.75+, regulated or high-stakes domain.

Architecture                     Setup cost            Per-eval cost  Kappa ceiling  What it catches                                  What it misses
──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Single judge, uncalibrated       ~0                    $0.05/case     unknown        nothing                                          everything
Single judge, calibrated         ~20 person-hours/qtr  $0.05/case     0.70–0.80      systematic bias, drift after refresh             cross-family self-preference
Ensemble (3 judges), calibrated  ~60 person-hours/qtr  $0.15/case     0.78–0.85      family-specific bias, divergence between judges  rubric ambiguity (all 3 inherit it)
Human + judge hybrid             ~5x judge             $1–3/case      0.85+          most failures                                    cost scales with sample

The refund chatbot landed on single judge, calibrated. The product is mid-stakes (consumer refunds, no regulatory bar), the rubric is mostly within-family-stable, and the team's cost budget for evals is $200/week. Ensemble was rejected when the kappa-vs-cost curve plateaued — the marginal kappa gain from a second judge was 0.04 for a 2x cost increase.

7) Operational signals — what calibration looks like healthy, sick, and dying

Healthy: kappa stays within ±0.03 of the calibration baseline across weekly re-runs on the calibration set. Bias probes pass — positional excess under 5 points, verbosity gap under 0.2 Likert, self-preference within 5 points of even. The judge artifact has not been touched since the last rubric change. The aggregate score and the calibration kappa appear on the same dashboard.
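The healthy thresholds translate directly into a weekly check. A minimal sketch, assuming the weekly probe results arrive as a dictionary with the hypothetical keys shown; the thresholds are the ones stated in the paragraph above.

```python
def failing_signals(week, baseline_kappa):
    """Return the names of the health checks that this week's numbers fail."""
    checks = {
        "kappa": abs(week["kappa"] - baseline_kappa) <= 0.03,
        "positional_excess": week["positional_excess_pts"] < 5,
        "verbosity_gap": week["verbosity_gap_likert"] < 0.2,
        "self_preference": abs(week["self_preference_pts"]) <= 5,
    }
    return [name for name, ok in checks.items() if not ok]


healthy_week = {
    "kappa": 0.70,
    "positional_excess_pts": 3,
    "verbosity_gap_likert": 0.1,
    "self_preference_pts": 2,
}
sick_week = dict(healthy_week, kappa=0.66, positional_excess_pts=7)

print(failing_signals(healthy_week, baseline_kappa=0.71))
print(failing_signals(sick_week, baseline_kappa=0.71))
```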

First metric to degrade: positional-bias excess creeping above 5 points on the weekly probe. This is usually the leading indicator because pairwise mode is the most sensitive to small prompt drift. The misleading metric beginners watch is aggregate judge agreement — the raw fraction of cases where the judge picks the same winner as another judge. This number can stay high while kappa drops, because the easy cases dominate and the hard cases are the ones that move kappa.

The experienced graph is the kappa-over-time plot with bias-probe band overlaid: kappa on the y-axis, week on the x-axis, with a band showing acceptable positional-excess range. When kappa drifts down or positional excess drifts up, the plot shows which one moved first. That tells the team whether to refresh the rubric, refresh the judge prompt, or refresh the calibration set.

The signal that calibration discipline is rotting: the team has not re-run the calibration suite in three months. The kappa number on the dashboard is from January; it is now May. The rubric has changed twice in that time. The number is decorative.

8) Boundary of applicability — when kappa 0.6 is fine, when you need 0.8+

Kappa 0.6 is a fine acceptance bar when three conditions hold: the decisions the judge feeds are coarse (ship vs don't-ship at 5-point granularity), the costs of a wrong decision are bounded and recoverable (small refunds, retryable replies), and the user-facing harm of any individual miscall is low. A consumer-facing chatbot for marketing copy variants lives here. A B2B internal search assistant for non-confidential docs lives here.

Kappa 0.8+ is required when decisions are fine-grained (small score deltas drive A/B winners), costs are irreversible (regulatory filing, medical recommendation, legal draft), or harm is asymmetric (one false negative is worse than 100 false positives). Healthcare chatbots, legal-research assistants, financial-advice systems, and content moderation at scale all need this band. The cost is real — getting from 0.65 to 0.85 typically means doubling the calibration set, ensembling 2–3 judges, and adding human-in-the-loop sampling on borderline cases. Plan for 5–10x the eval budget.

The pathology of "always aim for 0.85+" is two-fold. First, you spend disproportionate effort calibrating dimensions that don't move the ship decision. Second, you exhaust the rubric's natural ceiling: if human–human kappa is 0.71 on brand-voice, no amount of judge tuning can push judge–human kappa past 0.71 on that dimension. Trying anyway means you're tuning the judge to agree with one human's quirks, not with consensus. Calibration targets must be set against the human–human ceiling, dimension by dimension.
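Setting targets against the ceiling is a per-dimension subtraction. The sketch below uses the human–human figures from section 1 and the prompt v2 judge figures from section 4; the helper names are illustrative.

```python
human_human = {
    "policy_correctness": 0.78,
    "handoff_completeness": 0.82,
    "brand_voice_match": 0.71,
    "hallucination": 0.91,
}
judge_human_v2 = {
    "policy_correctness": 0.71,
    "handoff_completeness": 0.74,
    "brand_voice_match": 0.58,
    "hallucination": 0.86,
}


def headroom(ceiling, judge):
    """Kappa left between the judge and the human-human ceiling, per dimension."""
    return {dim: round(ceiling[dim] - judge[dim], 2) for dim in ceiling}


def over_ceiling(ceiling, judge):
    """Dimensions where the judge 'beats' the humans: a sign of overfitting."""
    return [dim for dim in ceiling if judge[dim] > ceiling[dim]]


gaps = headroom(human_human, judge_human_v2)
print(gaps, over_ceiling(human_human, judge_human_v2))
```

Brand-voice shows the largest gap and the lowest ceiling at once, which is why more judge tuning there buys less than more rubric work.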

At scale, ensemble calibration becomes attractive because the marginal cost per eval drops as volume rises but the kappa gain stays. A team running 50K evals per week can afford a 3-judge ensemble; a team running 500 cannot.

9) Common wrong mental model — "frontier judges don't need calibration"

The seductive belief is that GPT-5 or Claude Opus 4 will be so capable that calibration becomes unnecessary. "The model is smart enough to follow the rubric. We checked. Look at the scores." This belief is wrong for three stacked reasons.

First, frontier reasoning improves the judge's capability ceiling but not its bias floor. Positional bias, verbosity bias, and self-preference are pretraining artifacts. They scale slowly with capability. A more capable model is less noisy but no less biased — the directional skew remains.

Second, even if a hypothetical perfect judge existed, calibration would still be required to know which rubric dimensions it was applying correctly. The team's brand-voice rubric might be ambiguous. The judge can perfectly follow an ambiguous rubric in three different incompatible ways. Without human-anchored measurement, you cannot detect which way the judge is leaning this week.

Third, "calibrating once is enough" is the sibling wrong belief. Drift requires recurring calibration. The judge model is updated by the vendor without notice; the rubric is revised by the product team; the user-input distribution shifts as the product evolves. Each of these can degrade kappa silently. A judge calibrated in January and trusted through July is a judge whose number you no longer know.

Replace the wrong mental model: judges are instruments that drift; rubrics are contracts that evolve; calibration is the recurring audit that keeps the instrument honest against the contract. The frontier helps with noise. Only calibration helps with validity.

10) Failure catalog — six other shapes calibration shortcuts produce

  • Single-rater calibration set. One human labels the 50 examples. Kappa-with-judge looks great because the judge is implicitly tuned to one human's quirks. Cross-validates against zero people.
  • Adjudication theatre. Two humans disagree on 18 of 50 cases. The senior reviewer picks the "tiebreaker" answer for all 18 without recording the reasoning. The gold labels now encode the senior reviewer's taste, not the rubric.
  • Reused calibration set. The same 50 examples are used to calibrate, evaluate, and report. The team is now testing on its training set. Real-world kappa is 0.15 lower than reported.
  • Calibration without bias probes. Kappa is 0.72, which looks great, but positional excess is 12 points. Pairwise comparisons in production are skewed toward whichever answer is shown first. Aggregate hides the slice.
  • Locked prompt, unlocked model. The judge prompt is in version control; the underlying API points at claude-3-5-sonnet-latest. Anthropic ships a new snapshot. Kappa silently drifts. The team discovers via CSAT three months later.
  • Rubric drift unseen by judge calibration. Product team revises the rubric anchors slightly. The judge keeps running the old prompt against the new rubric definition. Calibration kappa, measured against new gold labels, falls from 0.71 to 0.58. Nobody notices for two weeks.

Each shape is a violation of one of the three pillars: human-anchored, bias-probed, version-locked. The fix in every case is mechanical — restore the missing pillar — but you cannot restore what you cannot see, which is why the artifact in section 4 has all three explicitly named.
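The "locked prompt, unlocked model" shape is cheap to guard against mechanically. The check below is a heuristic sketch for the naming style that appears in this chapter, where a pinned snapshot ends in an eight-digit date and a moving alias ends in `-latest`. Other vendors name snapshots differently, so the pattern is an assumption to adapt, not a general rule.

```python
import re

# Assumption: pinned snapshot ids end in a YYYYMMDD date suffix.
_DATED_SNAPSHOT = re.compile(r"-\d{8}$")


def is_pinned_snapshot(model_id):
    """True if the model id looks like a dated snapshot, not a moving alias."""
    return bool(_DATED_SNAPSHOT.search(model_id))


print(is_pinned_snapshot("claude-3-5-sonnet-20241022"))
print(is_pinned_snapshot("claude-3-5-sonnet-latest"))
```

Running this against the judge artifact at load time turns a silent three-month drift into an immediate configuration error.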

11) Pressure transfer — where this discipline reappears

  • Same pressure as chapter 01. Shipping on vibes is using an uncalibrated demo as evidence. Trusting an uncalibrated judge is the same category error, one layer down: the judge becomes the "demo" of the eval system, and the slice table you skipped at launch is the kappa you skipped at calibration.
  • Same shape as data quality in chapter 03. A golden eval set with no labelling audit and a judge with no calibration audit are two instances of one failure: trusting a measurement instrument that was never measured.
  • Same constraint as drift detection in chapter 09. Both chapters fight slow degradation in a number you have stopped looking at. Calibration sets the baseline; drift detection watches the delta. Without the first, the second has no anchor.
  • Same family as A/B testing in chapter 10. A/B comparisons amplify judge bias if the bias is correlated with the candidate being measured. A self-preferring judge running an A/B between Claude and GPT will reliably pick wrong.

12) A fast self-test before you trust a judge score

  • Can you name the kappa, MAE, and AUC of the judge against humans on a calibration set built this quarter?
  • Can you name the positional-bias excess, verbosity gap, and self-preference excess for the judge?
  • Is the judge prompt under version control with a hash, the model snapshot pinned, and the rubric version recorded?
  • Is the calibration set held out from the regular eval set and the training-of-prompts loop?
  • Would a 0.05 drop in kappa next week be visible to anyone without manual checking?

Five yeses means the judge has a calibration certificate. One or more nos means the judge score is decoration.

Where judge calibration shows up in shipped products

The teams that produce trustworthy numbers all reinvent this discipline. The shapes are similar enough that a short tour teaches the role calibration plays at each.

  • Anthropic Claude evals cookbook — publishes the recipe of human-anchored kappa for every internal judge before model-card numbers ship; the calibration cost is part of the model launch budget, not an afterthought.
  • OpenAI evals platform — supports judge-vs-human agreement reports as a first-class artifact, because customers running production evals consistently asked for the agreement number alongside the score.
  • Chatbot Arena (LMSYS) — uses ELO computed from millions of paired human votes; the entire platform is a calibration ground truth for downstream model-as-judge claims, and the team published positional-bias correction methodology in 2023.
  • MT-Bench (Zheng et al.) — the foundational study that measured 3–10 point positional bias in GPT-4 judges and published the recipe of swap-and-average that most teams now copy.
  • AlpacaEval 2 — explicitly corrects verbosity (length) bias via length-controlled win rate, after the team measured a 12-point verbosity skew in the original AlpacaEval.
  • Vectara HHEM (Hallucination Evaluation Model) — calibrated against TruthfulQA and SummEval human labels; the company exists because customers asked for a faithfulness judge with a published kappa.
  • Galileo's bias-detection suite — productises positional, verbosity, and self-preference probes for enterprise customers running judge-based evals at scale.
  • Patronus AI Lynx — open-weights hallucination judge with published agreement numbers against SimpleQA and HaluEval human labels; calibration certificate ships with the model.
  • Braintrust Autoevals — provides judge templates with human-comparison harness built in; the docs lead with kappa, not with the judge prompt.
  • LangSmith evaluator runs — surfaces per-case disagreement between judge and human label, so teams can inspect calibration gaps without writing a kappa pipeline.
  • Arize Phoenix evaluations — bundles MAE and kappa visualisations alongside the judge score, making the calibration footnote a default view.
  • Cohere's Rerank evals — calibrates the reranker judge against TREC-Robust04 nDCG labels; calibration is a release gate, not a one-time setup.
  • G-Eval (Liu et al.) — chain-of-thought judge that explicitly addresses verbosity bias by forcing the judge to produce per-criterion reasoning before scoring.
  • PandaLM — fine-tuned LLM judge trained explicitly to reduce positional and self-preference biases, calibrated against ~252K human-labelled pairwise samples.
  • Prometheus 2 (KAIST) — open-weights judge with published agreement-with-GPT-4 and agreement-with-humans numbers across multiple rubric types; calibration data is part of the release.
  • Scale AI's SEAL leaderboards — uses contracted human evaluators with measured inter-rater kappa as the ground-truth layer for model rankings; treats the human raters themselves as instruments that need agreement audits.
  • HHH (Helpful, Honest, Harmless) Anthropic eval — required pairwise calibration before judge-based scoring became publishable; the failure of single-judge HHH was the empirical motivation for ensemble judging in safety evals.
  • NIST AI risk-management evals — explicitly require human-anchored calibration for any model-based judge used in regulated contexts; calibration certificates feed audit trails.
  • GitHub Copilot Chat experiments — internal eval pipelines run position-swapped pairwise on every model swap, after the team measured an early-2024 ordering effect that flipped a release decision.
  • Cursor's tool-call benchmarks — calibrate the LLM judge against a human-labelled 200-example tool-call set; the kappa number gates whether judge scores can be used to ship a model change.
  • Perplexity citation-accuracy eval — uses a citation-checking judge calibrated against editorial reviewers; the judge prompt is locked under the same change control as the production model.
  • Casetext CoCounsel — legal-drafting judge calibrated against senior associate review; the calibration cost is itemised in product economics because regulators audit it.
  • Stripe Radar fraud-judge feedback loop — calibrates the judge against confirmed-chargeback labels (the ground truth) rather than against human reviewers; the unique-to-fraud variant of the same discipline.

The pattern is consistent. Wherever a judge produces a number that drives a real decision, somebody has built a calibration set, measured agreement, probed biases, and locked the artifact. The teams that skip this step ship the number alone and discover later that the number meant something else.

Pause and recall

  1. What three operations turn a judge score from decoration into evidence?
  2. Why is raw agreement an unreliable substitute for Cohen's kappa?
  3. In the refund-chatbot trace, which bias dropped kappa from 0.62 to 0.54 on the worst slice?
  4. Why does increasing judge model capability not remove positional bias?
  5. What three artifacts must be version-locked in a calibrated-judge setup?
  6. Why is human–human kappa the ceiling for judge–human kappa?
  7. When is kappa 0.6 a reasonable acceptance bar and when do you need 0.8+?
  8. Name one operational signal that calibration discipline is rotting.

Interview Q&A

Q1. A judge shows 80% raw agreement with humans on a 0–3 rubric. Is that enough to trust the scores?

A. No. Raw agreement includes the agreement two raters would reach by chance, and that share is large when the label distribution is skewed: a constant-output judge can score 70%+ on a typical eval where one class dominates. The honest number is Cohen's kappa, which subtracts the chance baseline. 80% raw agreement might be kappa 0.45 (moderate, not trustworthy) or kappa 0.75 (substantial). Without kappa, the number is uninterpretable. Common wrong answer to avoid: "80% sounds high, let's ship."
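A minimal sketch of that gap, assuming made-up labels on a 0–3 rubric where 80 of 100 conversations deserve a 3. The `human` and `constant_judge` lists are hypothetical data, not the refund-chatbot calibration set. The kappa formula is the standard one: observed agreement minus chance agreement, divided by one minus chance agreement.

```python
from collections import Counter


def raw_agreement(a, b):
    """Fraction of items on which the two raters give the same label."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cohen_kappa(a, b):
    """Cohen's kappa: (p_o - p_e) / (1 - p_e) for two raters on the same items."""
    n = len(a)
    p_o = raw_agreement(a, b)
    count_a, count_b = Counter(a), Counter(b)
    # Chance agreement: probability that both raters pick the same label
    # if each drew independently from their own label distribution.
    p_e = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if p_e == 1.0:
        return 0.0  # both raters constant and identical: kappa is undefined, report 0
    return (p_o - p_e) / (1 - p_e)


# Hypothetical skewed labels: one class dominates.
human = [3] * 80 + [2] * 10 + [1] * 6 + [0] * 4
constant_judge = [3] * 100  # a "judge" that always answers 3

print(raw_agreement(human, constant_judge))  # 0.8
print(cohen_kappa(human, constant_judge))    # 0.0
```

The constant judge reaches 80% raw agreement while carrying no information about the conversations, and kappa reports exactly that.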

Q2. Your judge has kappa 0.72 overall but positional excess of 8 points in pairwise mode. How do you respond?

A. Trust the judge for non-pairwise scoring; do not trust it for pairwise A/B decisions until the positional bias is reduced. The mitigation is to run every pairwise case twice with positions swapped and accept only consistent winners. The 0.72 kappa hides the slice where bias bites. The aggregate is misleading because pairwise cases are precisely the ones that drive ship decisions on close calls. Common wrong answer to avoid: "Kappa is in the substantial band, the judge is good."
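The swap-and-accept mitigation can be sketched as follows. The judge signature here is an assumption made for illustration: a callable that receives the two answers in display order and returns the string "first" or "second". A real judge call would wrap an API request; the two stand-in judges below exist only to show the behaviour.

```python
from typing import Callable, Optional

# Hypothetical judge interface: (first_shown, second_shown) -> "first" | "second".
PairwiseJudge = Callable[[str, str], str]


def consistent_winner(judge: PairwiseJudge, a: str, b: str) -> Optional[str]:
    """Run the pair in both orders; return "a" or "b" only if both runs agree."""
    forward = judge(a, b)   # a is shown first
    backward = judge(b, a)  # b is shown first
    winner_forward = "a" if forward == "first" else "b"
    winner_backward = "b" if backward == "first" else "a"
    if winner_forward == winner_backward:
        return winner_forward
    return None  # verdict flipped with position: treat as no decision


def always_first(first: str, second: str) -> str:
    """Stand-in for a judge with pure positional bias."""
    return "first"


def prefers_longer(first: str, second: str) -> str:
    """Stand-in for a judge whose verdict depends on content, not position."""
    return "first" if len(first) >= len(second) else "second"


print(consistent_winner(always_first, "x", "y"))                            # None
print(consistent_winner(prefers_longer, "short", "a much longer answer"))   # b
```

Counting how often the result is `None` across the calibration pairs gives a rough view of how many pairwise verdicts depend on position alone.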

Q3. The team wants to skip calibration because they're using Claude Opus, "the most capable judge available." What do you tell them?

A. Frontier capability reduces noise; it does not remove directional bias. Positional, verbosity, and self-preference biases are pretraining artifacts that shrink only slowly as model capability grows. Calibration is also the only way to detect rubric ambiguity: a perfectly capable judge can follow an ambiguous rubric in three incompatible ways, and without human-anchored measurement, you cannot tell which way it is leaning. The cost of calibration is ~20 person-hours per quarter; the cost of skipping it is shipping decisions on an unmeasured instrument. Common wrong answer to avoid: "Frontier judges don't need calibration."

Q4. You calibrate a judge in January at kappa 0.71. In May, the dashboard still shows 78% aggregate pass rate. What is the gap?

A. The aggregate has not been compared to humans since January. The kappa today might still be 0.71 or it might be 0.55. The rubric may have been revised, the judge API may have rolled to a new snapshot, the user-input distribution may have shifted. Each of these can degrade kappa silently. Re-run the calibration suite quarterly at minimum, and whenever the rubric, judge model, or application surface changes. A January kappa on a May score is a number with an expired footnote. Common wrong answer to avoid: "If the aggregate is stable, calibration is fine."

Q5. Cumulative — your chapter 07 rubric defines brand-voice on a 0–3 scale. Human–human kappa is 0.71 on that dimension. Your judge kappa is 0.68. Should you invest in a better judge?

A. No. The judge is approaching the human–human ceiling of 0.71. The bottleneck is the rubric — two humans agree only at 0.71 because the anchors for brand-voice are interpretable in different ways. The investment should go into refining brand-voice anchors with more concrete language and example outputs, not into a more powerful judge. If you tune the judge past 0.71, you are tuning it to agree with one human's specific taste, not with consensus. Cumulative diagnosis across chapter 07 and chapter 08: rubric-ceiling problem, not judge-capability problem. Common wrong answer to avoid: "Switch to a larger judge model."
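One way to make the ceiling check routine is a small per-dimension triage function. This is a sketch under stated assumptions: the `near` tolerance of 0.05 and the three label strings are illustrative choices, not values from the chapter.

```python
def diagnose(human_human: float, judge_human: float, near: float = 0.05) -> str:
    """Classify where to invest for one rubric dimension.

    `near` is an assumed tolerance for "the judge is approaching the ceiling".
    """
    if judge_human > human_human:
        # Judge agrees with the gold labels more than humans agree with each other.
        return "overfit-risk"
    # Round to avoid floating-point noise at the boundary.
    if round(human_human - judge_human, 6) <= near:
        return "rubric-ceiling"
    return "judge-gap"


print(diagnose(0.71, 0.68))  # rubric-ceiling: sharpen the anchors
print(diagnose(0.71, 0.58))  # judge-gap: the judge prompt has room to improve
print(diagnose(0.71, 0.85))  # overfit-risk: likely fitting one labeller's taste
```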

Q6. When does cross-family ensemble calibration justify its 3x cost?

A. When the eval is comparing models from different families (Claude vs GPT vs Gemini) and self-preference bias would distort the verdict, or when stakes require kappa 0.78+ that a single judge cannot reach, or when volume is high enough that the marginal per-eval cost is amortised. For a within-family eval at mid-stakes — most product evals — a single calibrated judge wins on the kappa-vs-cost curve. The plateau test: if adding a second judge raises kappa by less than 0.05, the ensemble is not justified. Common wrong answer to avoid: "Always ensemble — it can only help."

Q7. Your dashboard shows kappa 0.71 stable, but CSAT is falling. What do you investigate?

A. The rubric, not the judge. A stable judge–human kappa means the judge is faithfully scoring against the rubric. Falling CSAT alongside stable judge–human agreement is the canonical Goodhart signal: the team is optimising to a rubric that no longer measures what users value. Pull recent low-CSAT conversations, score them by the current rubric, and inspect cases where the rubric says acceptable but the user said unhappy. The rubric needs new dimensions. This is chapter 07's pressure resurfacing under chapter 08's instrumented gaze. Common wrong answer to avoid: "The judge must be broken — recalibrate it."

Q8. A new vendor's judge model claims 0.85 kappa on a public eval benchmark. Should you adopt it?

A. Run it on your calibration set against your gold labels, with your judge prompt adapted to the new model. Public benchmark kappa is real for that benchmark and unknown for your task. The 0.85 may come from a rubric whose anchors are sharper than yours, or from a benchmark whose label distribution flatters that model. The decision rests on your own kappa number, not the vendor's. The replacement cost is one calibration run, ~6–8 hours of work. Common wrong answer to avoid: "0.85 beats our 0.71, switch."

Apply now (10 min)

Step 1 — model the exercise. Here is the refund-judge calibration certificate you should be able to produce.

| Field | Value |
| --- | --- |
| judge_id | refund_judge_v2 |
| model | claude-3-5-sonnet-20241022 |
| prompt_hash | sha256:a4f1... |
| rubric_version | refund_rubric_v3 |
| calibration_set_id | refund_cal_50_jan2026 (50 examples, 2 reviewers, adjudicated) |
| human–human kappa | 0.78 / 0.82 / 0.71 / 0.91 by dimension |
| judge–human kappa | 0.71 / 0.74 / 0.58 / 0.86 (macro 0.72) |
| positional excess | 3 points (under 5-point threshold, pass) |
| verbosity gap | 0.1 Likert (under 0.2 threshold, pass) |
| self-preference excess | 2 points (under 5-point threshold, pass) |
| locked_on | 2026-01-15 |
| recalibrate_when | rubric, model, application change, or kappa drift > 0.05 |

This is the artifact. Anyone reading it knows the judge's strengths and weaknesses without rerunning anything.
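A certificate like this can also be held as a small data structure with an automated expiry check. The sketch below is illustrative only: `JudgeCertificate`, `hash_prompt`, `expiry_reasons`, and the `PROMPT` text are hypothetical names and content, and the prompt hash is computed from that placeholder prompt rather than from the real judge prompt. Drift is treated as an absolute change in macro kappa, which is an assumption about how the 0.05 rule is applied. The "application change" trigger is left to human review because it cannot be read off the artifact.

```python
import hashlib
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class JudgeCertificate:
    judge_id: str
    model: str
    prompt_hash: str
    rubric_version: str
    judge_human_kappa: tuple  # one kappa per rubric dimension
    locked_on: str

    @property
    def macro_kappa(self) -> float:
        """Unweighted mean of the per-dimension kappas."""
        return mean(self.judge_human_kappa)


def hash_prompt(prompt_text: str) -> str:
    """SHA-256 of the exact prompt text, prefixed the way the certificate shows it."""
    return "sha256:" + hashlib.sha256(prompt_text.encode("utf-8")).hexdigest()


def expiry_reasons(cert, model, prompt_text, rubric_version,
                   current_macro_kappa, max_drift=0.05):
    """Return the list of reasons the certificate no longer covers this setup."""
    reasons = []
    if model != cert.model:
        reasons.append("model changed")
    if hash_prompt(prompt_text) != cert.prompt_hash:
        reasons.append("prompt changed")
    if rubric_version != cert.rubric_version:
        reasons.append("rubric changed")
    if round(abs(cert.macro_kappa - current_macro_kappa), 6) > max_drift:
        reasons.append("kappa drift")
    return reasons


PROMPT = "Placeholder judge prompt text."  # hypothetical, not the real prompt
cert = JudgeCertificate(
    judge_id="refund_judge_v2",
    model="claude-3-5-sonnet-20241022",
    prompt_hash=hash_prompt(PROMPT),
    rubric_version="refund_rubric_v3",
    judge_human_kappa=(0.71, 0.74, 0.58, 0.86),
    locked_on="2026-01-15",
)

print(round(cert.macro_kappa, 2))  # 0.72
print(expiry_reasons(cert, cert.model, PROMPT, "refund_rubric_v4", 0.60))
```

An empty list means the certificate still applies; any entry means the quoted kappa no longer describes the judge that produced the score.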

Step 2 — your turn. Take one judge-based eval you run today. Build a 30-example calibration set this week (smaller-than-ideal is fine for the exercise). Have two people label each example on every rubric dimension. Compute human–human kappa and your judge's kappa against the consensus labels, and record raw agreement next to each so you can see how far the two numbers diverge. Then write down which positional-bias and verbosity-bias probes you would run against your specific eval shape. Compare your kappa estimates against the bands in section 2.

Step 3 — reproduce from memory. Without scrolling up, redraw the eval stack before vs after calibration diagram from section 5, then write the three operations that turn a judge score from decoration into evidence. Connect each operation to one bias it protects against. If you can do this cold, the chapter has landed.

What you should remember

This chapter explained why a tight rubric does not produce a trustworthy number on its own. The judge — the model scoring outputs against the rubric — is itself an instrument that drifts. Positional bias makes whatever appears first win 56–62% of the time. Verbosity bias rewards length the rubric never mentioned. Self-preference makes judges quietly favour their own model family. Prompt sensitivity shifts scores 5–10 points on cosmetic rewordings. Leniency drift creeps up over weeks. None of these biases is detectable from the aggregate; all of them poison the eval loop the team is about to bet on.

You learned the three operations that turn a judge score from decoration into evidence — measure judge–human agreement with kappa on a small human-labelled calibration set, probe the three biases with controlled experiments, and version-lock the judge artifact (prompt hash, model snapshot, rubric version, calibration timestamp). You watched the refund chatbot go from kappa 0.62 baseline, to 0.54 under positional swap, to 0.71 after a deliberate three-line prompt refinement, and finally to a locked judge artifact ready for the drift detection of chapter 09.

Carry this diagnostic forward: when somebody quotes a judge-based score, ask one question — "what's the kappa and what are the three bias probes saying?" If the answer is "we haven't measured that yet," you have just found the most leveraged half-week of work in the eval system. The rubric is the contract; calibration is the audit that the contract is being applied honestly. The frontier helps with noise. Only calibration helps with validity.

Remember:

  • A judge score without a kappa number is a number without a unit. Always quote agreement alongside the aggregate.
  • Frontier judges reduce noise but not directional bias. Positional, verbosity, and self-preference biases survive capability scaling.
  • Human–human kappa is the ceiling. If your rubric's human–human kappa is 0.71, a 0.85 judge–human kappa is fitting one human's taste, not consensus.
  • Lock three things together: judge prompt hash, judge model snapshot, rubric version. Drift any one and the calibration certificate expires.
  • Re-calibrate when the rubric changes, the judge model changes, the application surface changes, or kappa drifts more than 0.05 since the last refresh.
  • Cross-family ensembles pay for themselves only when stakes need kappa 0.78+ or when cross-vendor comparisons would otherwise be poisoned by self-preference.

The mnemonic that makes the discipline portable: MAP-Lock — Measure (kappa/MAE/AUC), Audit (positional, verbosity, self-preference), Probe (re-run on slices), Lock (artifact in version control). Skip any letter and the score is decoration.


Bridge. Once the judge has a calibration certificate — a measured kappa, three bias probes passed, a locked prompt and rubric version — you can finally watch quality over time and know what quality means. But that creates the next pressure: a stable calibrated judge applied weekly against a moving production distribution will produce a slowly changing number that nobody knows how to read. Drift detection is the discipline of separating real degradation from input-distribution drift from judge wobble, and of catching the slow descent before users feel it. Without calibration, drift detection has no anchor. With it, drift becomes the next observable.

09-drift-detection.md