07. Rubric design — when two careful readers score the same chat and disagree¶

~22 min read. A judge, human or model, is only ever as sharp as the words it scores against. Vague words make brilliant judges produce noisy scores.

Builds on 06-llm-as-judge.md. The rubric — the written scoring criteria a judge reads before every grade — is the load-bearing artifact of this entire module. The inspection is the act of sampling and scoring; the rubric is what the score actually means.

What the judge solved and what it still cannot fix on its own¶

In chapter 06 we replaced a thirty-rupee human grader with a Claude-Sonnet judge that scored ten thousand refund-chatbot conversations overnight for the price of a small dinner. The judge worked. Its agreement with two senior reviewers on a 100-case calibration set landed at 84% on a binary "policy correct" call. We declared the cost problem dead and moved on.

Then the trouble started. Two weeks later a PM ran the same judge on the same 100 cases with a slightly reworded prompt ("score how helpful this refund reply is, 1 to 5") and the scores became unrecognisable. Same model, same conversations, same temperature. The aggregate moved nine points. Worse, when we asked two human reviewers to grade 30 of those cases with the new instruction, they disagreed on 14 of them. The judge was not broken. The reviewers were not careless. The instruction was vague, so every reader filled in different content for the word "helpful". The rubric had failed before the judge ever ran.

This chapter teaches the discipline that closes that gap: how to write scoring criteria that two raters — two humans, two judges, or a human and a judge — score consistently on the same output. Named dimensions, one-sentence definitions, numeric anchors with concrete example outputs, an inter-rater reliability test, and a versioning habit. Without this, every later chapter — calibration, drift, A/B, alerts — measures a rubber ruler.

What this file solves¶

A judge with a vague rubric is worse than no judge at all, because it produces confident numbers nobody can trust. This file walks the refund chatbot from a one-line "is the reply good?" prompt to a 4-dimension anchored rubric — policy correctness, handoff completeness, brand tone, safe refusal — with concrete chat snippets at each anchor. We score 10 chats with two raters, compute Cohen's kappa, watch one disagreement drop from genuine ambiguity to a fixable anchor gap, and explain why rubrics must be versioned like code instead of edited in place.

When two careful readers score the same reply and disagree¶

Pull one conversation from last week's logs. The customer writes: "My order #4481 was delivered 6 days late and the carton was crushed. I want my money back." The refund bot replies:

"Hi — sorry about the delay. We can process a refund for order 4481 under our 7-day delivery guarantee. You will see the credit in 3–5 business days. Is there anything else I can help with?"

Ask two reviewers to score this reply "1 to 5, how helpful is it?" and watch what happens.

Reviewer A:  5 — polite, refunds, gives timeline, closes the loop.
Reviewer B:  3 — never confirmed the order details with the customer,
             never asked about the crushed carton, never offered
             escalation to a human for damage claims.

Both are reading the same text. Neither is being lazy. They are answering two different questions because "helpful" never specified which behaviours to look for. A wins on tone-and-timeline. B wins on completeness-and-handoff. Until the rubric names those as separate things with separate scores, every aggregate over a thousand such chats is an average of two disagreements.

Teacher voice. A rubric is not paperwork around the judge. It is the question the judge is being asked. Vague question, noisy answer — for humans, for models, and for the dashboard your PM watches on Tuesday.

The naive repair, the visible break, the diagnosis¶

Smart teams reach for one of three repairs first. None of them survive contact with real disagreement.

Repair 1 — "add more reviewers and average." Three reviewers instead of two, average their scores. Feels rigorous. But averaging a 5 and a 3 to get a 4 hides the fact that the two readers were scoring different things. The variance does not shrink; it gets relabelled as a smooth number. Six months later a PM asks "why did helpfulness drop from 4.1 to 3.6?" and nobody can answer because the score never had a fixed referent.

Repair 2 — "use a longer prompt." Pile detail into the judge prompt: "score how helpful, accurate, polite, complete, brand-aligned, safe, and customer-friendly the reply is, 1 to 5." This is worse, not better. Seven concepts collapsed into one number means every reviewer weights them differently and the same disagreement reappears one layer deeper.

Repair 3 — "trust senior reviewers." Let the lead grade everything and call it canonical. Works until the lead leaves. Worse, the lead's implicit standards drift quarter by quarter without anyone noticing, and the eval becomes a measurement of the lead's mood.

Not a reviewer problem. Not a model problem. Not a prompt-length problem. A specification problem. The judge — human or model — is scoring against a sentence that does not pin down what counts as a 5 and what counts as a 3. So the natural question becomes: "what does the rubric need to look like so two careful readers, given the same reply, land on the same score most of the time?"

When a four-dimension rubric makes the same reply score the same way twice¶

Same conversation. New rubric.

DIMENSION                 ANCHORS

policy_correctness        5 = cites the correct policy clause AND
                              applies it to the correct order details
                          3 = correct refund decision, but cites a
                              vague or generic policy
                          1 = invents a clause or refunds wrongly

handoff_completeness      5 = captures order id, issue type, customer
                              ask, and any escalation flags in a way
                              a human agent could continue cold
                          3 = captures order id and ask only
                          1 = missing order id or issue type

brand_tone                5 = warm, calm, plain English, no jargon
                          3 = polite but generic / boilerplate
                          1 = curt, defensive, or robotic

safe_refusal              5 = correctly refuses out-of-scope asks
                              (damage claims, legal threats) and
                              routes to a human with handoff context
                          3 = refuses but does not route
                          1 = either invents an answer or refuses
                              a request it should have handled

Reviewer A and Reviewer B now score the same refund reply.

                   A    B
policy_correctness 5    5    (cites 7-day guarantee, correct order)
handoff_completeness 3  3    (has order id, missing damage-claim flag)
brand_tone         5    4    (A: warm; B: warm but "Is there anything
                              else" feels boilerplate)
safe_refusal       1    1    (never handed off the damage claim)

Three out of four dimensions match exactly. The fourth differs by one point on tone, which is a known soft dimension. The aggregate now says something useful: the bot is policy-fine and tone-fine but unsafe on damage handoffs. That is a fix, not a feeling. The same reply that scored 5 from A and 3 from B under the old rubric now scores predictably under the new one, and the disagreement that remains points at the next anchor refinement.

Mini-FAQ. "Why four dimensions and not one big one?" Because the four dimensions move independently. Tone can improve while safety degrades. Policy can be right while handoff is empty. One number averages independent failures into invisibility. Four numbers make each failure findable.

The rule: a rubric scores observable behaviour at named anchors, never adjectives¶

State the load-bearing truth plainly. A usable rubric names a small set of independent dimensions, defines each one in a single observable sentence, and fixes numeric anchors to concrete example outputs. Everything else — kappa, calibration, judge prompts, drift alerts — depends on this. If the rubric scores helpfulness without anchoring it, no amount of statistical machinery downstream rescues the measurement.

Three sub-rules follow:

Observable, not internal. A reviewer must be able to point at the text and grade. "Confident" is internal; "uses hedge words like might or I'm not sure" is observable.
Anchored, not bare Likert. A 5-point scale without anchors is a Rorschach test. A 5-point scale where 5, 3, and 1 each show an example output is a measuring stick.
Independent dimensions. If two dimensions almost always move together, collapse them. If one dimension is doing two jobs, split it.

Why this rule exists. Two raters can only agree if the rubric forces them to look at the same evidence. Adjectives let them look at different evidence. Anchors force them to look at the same chat snippet and compare.

1) The anatomy of an anchored rubric — what each part is doing¶

A working rubric has four parts, and each one fails in a specific way when missing.

┌────────────────────────────────────────────────────────────┐
│ RUBRIC ANATOMY                                             │
├────────────────────────────────────────────────────────────┤
│ 1. dimension name        — what is being scored            │
│ 2. one-line definition   — what counts as this dimension   │
│ 3. anchored levels       — 1, 3, 5 with example outputs    │
│ 4. version + owner       — who maintains it, what changed  │
└────────────────────────────────────────────────────────────┘

Drop the dimension name and reviewers score whatever the example reminds them of. Drop the definition and handoff_completeness means six different things to six readers. Drop the anchors and you are back to the Likert Rorschach test. Drop version + owner and the rubric quietly mutates every sprint while last quarter's scores still sit in the same dashboard column. Each part removes a specific source of noise. None is decoration.

For the refund bot, written out:

dimension:  policy_correctness         v: 1.3      owner: priya@
definition: the reply cites the correct policy clause AND applies it
            to the specific order details the customer mentioned.

  5 — "Order 4481 was 6 days late; under our 7-day delivery guarantee
        (policy §4.2) we are processing a full refund of ₹1,250."
  3 — "We will process a refund for your late order."
  1 — "Under our 24-hour cancellation policy, you are entitled to
        a refund." (invented clause — no such policy exists)

Notice what the anchors do. A reviewer staring at a real reply can ask "is this closer to the 5 example or the 3 example?" and the answer is a comparison, not a feeling. That comparison is what two readers can do consistently.
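
One way to keep the four parts together is to store each dimension as a small record and render it into the text a rater reads. The sketch below is only an illustration: the `Dimension` class and its `render` method are hypothetical names, and the field values are copied from the policy_correctness example above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Dimension:
    """One rubric dimension: name, definition, anchored levels, version, owner."""
    name: str
    definition: str
    anchors: dict[int, str]  # level -> example output shown to the rater
    version: str
    owner: str

    def render(self) -> str:
        """Format the dimension as text a human or an LLM judge reads before scoring."""
        lines = [
            f"dimension: {self.name} (v{self.version}, owner: {self.owner})",
            f"definition: {self.definition}",
        ]
        # Highest anchor first, matching the layout used in this chapter.
        for level in sorted(self.anchors, reverse=True):
            lines.append(f"  {level} = {self.anchors[level]}")
        return "\n".join(lines)


policy_correctness = Dimension(
    name="policy_correctness",
    definition=(
        "the reply cites the correct policy clause AND applies it "
        "to the specific order details the customer mentioned."
    ),
    anchors={
        5: "Order 4481 was 6 days late; under our 7-day delivery guarantee "
           "(policy §4.2) we are processing a full refund of ₹1,250.",
        3: "We will process a refund for your late order.",
        1: "Under our 24-hour cancellation policy, you are entitled to a refund.",
    },
    version="1.3",
    owner="priya@",
)

print(policy_correctness.render())
```

Freezing the record is a deliberate choice in this sketch: changing an anchor means creating a new record with a new version, not editing the old one in place.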

2) The mental model — the rubric as a measuring stick, not a vibe meter¶

Picture two carpenters measuring a plank.

       VIBE METER                       MEASURING STICK
       ──────────                       ───────────────
   "Looks about right"              ┌──┬──┬──┬──┬──┬──┐
                                    │  │  │  │  │  │  │
   Carpenter A: 1.2 m               0  1  2  3  4  5  6
   Carpenter B: 1.4 m
                                    Carpenter A: 3.2
                                    Carpenter B: 3.2

   disagreement = real              disagreement = none

   no way to reconcile              shared physical reference

The measuring stick has tick marks at known intervals. Two carpenters land on the same number not because they share intuition but because they share a reference. Anchored rubrics are tick marks for behaviour. The dimension name is the axis; the definition is the unit; the anchors are the marks. Without all three, you have a vibe meter wearing a numeric disguise.

For the refund bot, policy_correctness is the axis, its one-line definition is the unit, and the three anchor examples are the tick marks. Each new reply is the plank. Two reviewers reading that reply lay the same stick against it.

3) The threaded example — building the refund rubric from disagreement upward¶

We stay with the refund chatbot from chapter 01's 38-point gap. The team has 100 live conversations from last week, a Claude-based judge from chapter 06, and the new problem: scores are bouncing because the prompt was "is the reply good?"

Attempt A — single-dimension Likert¶

The team writes one prompt: "rate this reply from 1 to 5 on overall quality." Two senior reviewers score 10 chats independently.

chat   reviewer A   reviewer B
─────  ──────────   ──────────
  1        5            4
  2        4            2
  3        3            5
  4        5            5
  5        2            3
  6        4            4
  7        5            2
  8        3            3
  9        4            5
 10        5            4

percent agreement (exact):   3/10 = 30%
percent agreement (±1):      7/10 = 70%

Three exact agreements out of ten. Chat 7 has the biggest gap — A scored 5 because the refund decision was correct; B scored 2 because the reply never acknowledged the damaged carton and never offered a human handoff. Both reviewers were right about what they saw. The rubric was not asking them to look at the same thing.
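
The two agreement figures can be recomputed directly from the Attempt A table. This is a minimal sketch; `percent_agreement` is a hypothetical helper name, and the score lists are copied from the table.

```python
# Scores copied from the Attempt A table (chats 1-10).
reviewer_a = [5, 4, 3, 5, 2, 4, 5, 3, 4, 5]
reviewer_b = [4, 2, 5, 5, 3, 4, 2, 3, 5, 4]


def percent_agreement(a: list[int], b: list[int], tolerance: int = 0) -> float:
    """Share of cases where the two scores differ by at most `tolerance`."""
    if len(a) != len(b):
        raise ValueError("both raters must score the same cases")
    matches = sum(abs(x - y) <= tolerance for x, y in zip(a, b))
    return matches / len(a)


exact = percent_agreement(reviewer_a, reviewer_b)
within_one = percent_agreement(reviewer_a, reviewer_b, tolerance=1)

# Largest gap and the chat where it occurs (chat ids are 1-based).
gap, chat = max(
    (abs(x - y), i) for i, (x, y) in enumerate(zip(reviewer_a, reviewer_b), start=1)
)

print(f"exact: {exact:.0%}, within one point: {within_one:.0%}")
print(f"largest gap: {gap} points on chat {chat}")
```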

Attempt B — four anchored dimensions¶

The team rewrites the rubric with the four dimensions and 1/3/5 anchors from earlier. Same 10 chats, same two reviewers, fresh scoring.

chat   policy   handoff   tone    safety        A-avg  B-avg
       A   B    A   B    A   B    A   B
─────  ─────   ─────    ─────   ─────         ─────  ─────
  1    5   5   5   5    5   4    5   5         5.0    4.75
  2    3   3   3   3    4   3    1   1         2.75   2.5
  3    5   5   3   3    5   5    3   3         4.0    4.0
  4    5   5   5   5    5   5    5   5         5.0    5.0
  5    1   1   3   3    4   3    1   1         2.25   2.0
  6    5   5   5   5    4   4    5   5         4.75   4.75
  7    5   5   1   1    4   3    1   1         2.75   2.5
  8    3   3   5   5    4   4    3   3         3.75   3.75
  9    5   5   5   5    5   5    3   5         4.5    5.0
 10    5   5   5   5    5   4    5   5         5.0    4.75

per-dimension exact agreement:
  policy_correctness:    10/10 = 100%
  handoff_completeness:  10/10 = 100%
  brand_tone:             5/10 =  50%  ← soft dimension
  safe_refusal:           9/10 =  90%

Three of four dimensions hit very high agreement. brand_tone remains noisy — that is the soft dimension where two reasonable people disagree on whether "Is there anything else I can help with?" is warm (5) or boilerplate (3). That tells us where the next anchor refinement is needed, not that the rubric is failing.

Notice chat 7. Under Attempt A, A said 5 and B said 2 — a 3-point chasm with no diagnosis. Under Attempt B, both reviewers landed on policy=5 and safety=1. The chasm was never about disagreement; it was about which dimension each reviewer was secretly grading. The new rubric makes that visible.
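
A per-dimension view also tells you which chats to pull when reviewers differ. The sketch below assumes the Attempt B scores are stored as (A, B) pairs per chat; the `disagreements` helper is a hypothetical name, and the numbers are copied from the table.

```python
# Attempt B scores, one (A, B) pair per chat, chats 1-10, copied from the table.
scores = {
    "policy_correctness":   [(5, 5), (3, 3), (5, 5), (5, 5), (1, 1),
                             (5, 5), (5, 5), (3, 3), (5, 5), (5, 5)],
    "handoff_completeness": [(5, 5), (3, 3), (3, 3), (5, 5), (3, 3),
                             (5, 5), (1, 1), (5, 5), (5, 5), (5, 5)],
    "brand_tone":           [(5, 4), (4, 3), (5, 5), (5, 5), (4, 3),
                             (4, 4), (4, 3), (4, 4), (5, 5), (5, 4)],
    "safe_refusal":         [(5, 5), (1, 1), (3, 3), (5, 5), (1, 1),
                             (5, 5), (1, 1), (3, 3), (3, 5), (5, 5)],
}


def disagreements(pairs: list[tuple[int, int]]) -> list[int]:
    """Return 1-based chat ids where the two reviewers gave different scores."""
    return [chat for chat, (a, b) in enumerate(pairs, start=1) if a != b]


for dimension, pairs in scores.items():
    differing = disagreements(pairs)
    agreed = len(pairs) - len(differing)
    print(f"{dimension:<22} {agreed}/{len(pairs)} exact, chats to pull: {differing}")
```

The list of chat ids is the useful output here: each one is a candidate for an anchor refinement rather than a reviewer retraining session.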

Attempt C — the one real disagreement and the anchor that fixes it¶

Chat 9 still differs on safe_refusal: A says 3, B says 5. Pull the chat.

Customer: "My package is fine, but I want to ask — if I had wanted to return it for store credit instead, would that be possible?"

Bot: "For store-credit conversions, I'll connect you with a human agent who handles returns. Please hold while I transfer you."

A scored 3 because "the bot refused but did not capture the original order details in the handoff." B scored 5 because "it correctly refused an out-of-scope ask and routed to a human." Both are reading the anchor. The anchor for 5 says "correctly refuses out-of-scope asks and routes to a human with handoff context." The phrase "with handoff context" is the ambiguity. Does it mean "the bot mentions the order" (A's reading) or "the bot transfers control" (B's reading)?

The fix is not retraining reviewers. The fix is rewriting the anchor.

safe_refusal v1 (ambiguous):
  5 = correctly refuses out-of-scope asks and routes to a human
      with handoff context

safe_refusal v2 (sharper):
  5 = correctly refuses out-of-scope asks AND the transfer message
      includes order id, issue type, and the customer's exact ask
      in a single line a human agent can act on
  3 = correctly refuses and routes, but the transfer message is
      generic (no order id or no exact ask)
  1 = either invents an answer or refuses a request it should have
      handled

Re-score chat 9 under v2. Both reviewers land on 3. The chat moves from disagreement to agreement, and the rubric has become slightly more accurate by becoming slightly more pedantic. That is the loop: disagreement is data; the anchor absorbs the data.

4) Why kappa beats percent agreement — chance correction¶

Percent agreement looks honest but lies when one anchor is much more common than others. Suppose 80% of your refund replies are policy-correct. Two reviewers who each hand out a 5 to a random 80% of chats and a 1 to the rest, without reading any of them, will still agree about 68% of the time (0.8 × 0.8 + 0.2 × 0.2). That agreement is a function of base rate, not skill.

Cohen's kappa corrects for this.

        observed_agreement − expected_by_chance
kappa = ────────────────────────────────────────
              1 − expected_by_chance
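
The formula can be written out in a few lines. The sketch below is an unweighted Cohen's kappa built from scratch, with `cohen_kappa` as a hypothetical helper name; the score lists are reviewer A's and B's brand_tone and safe_refusal columns from the Attempt B table.

```python
from collections import Counter


def cohen_kappa(a: list[int], b: list[int]) -> float:
    """Unweighted Cohen's kappa for two raters scoring the same cases."""
    if len(a) != len(b) or not a:
        raise ValueError("need two equal-length, non-empty score lists")
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement: for each label, the product of the two raters' usage rates.
    freq_a, freq_b = Counter(a), Counter(b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        # Both raters used one identical label throughout; kappa is undefined,
        # and this sketch chooses to report it as full agreement.
        return 1.0
    return (observed - expected) / (1 - expected)


# Attempt B scores for chats 1-10, copied from the table.
tone_a = [5, 4, 5, 5, 4, 4, 4, 4, 5, 5]
tone_b = [4, 3, 5, 5, 3, 4, 3, 4, 5, 4]
safety_a = [5, 1, 3, 5, 1, 5, 1, 3, 3, 5]
safety_b = [5, 1, 3, 5, 1, 5, 1, 3, 5, 5]

print(f"brand_tone   kappa = {cohen_kappa(tone_a, tone_b):.2f}")
print(f"safe_refusal kappa = {cohen_kappa(safety_a, safety_b):.2f}")
```

For ordered 1/3/5 scores, a weighted kappa that penalises a 5-versus-1 split more than a 5-versus-4 split is a common alternative. This sketch uses the unweighted form because that is what the formula above describes.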

For the 4-dimension table above:

dimension            % agree   kappa   interpretation
─────────────────    ───────   ─────   ─────────────────────────
policy_correctness     100%    1.00    perfect
handoff_completeness   100%    1.00    perfect
brand_tone              50%    0.23    fair, needs work
safe_refusal            90%    0.85    near-perfect

Rough reading guide:

kappa < 0.20   poor      — rubric is broken
0.20 - 0.40    fair      — anchors are leaking
0.40 - 0.60    moderate  — ship if dimension is soft
0.60 - 0.80    good      — production-ready
0.80 - 1.00    excellent — diminishing returns past here
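
If the reading guide is going to label dashboard rows, it helps to encode it once. In this sketch `kappa_band` is a hypothetical helper, and treating each lower bound as inclusive is an assumption, since the guide does not say which band a value of exactly 0.60 belongs to.

```python
def kappa_band(kappa: float) -> str:
    """Map a kappa value to the rough reading guide (lower bounds inclusive)."""
    bands = [
        (0.80, "excellent"),
        (0.60, "good"),
        (0.40, "moderate"),
        (0.20, "fair"),
    ]
    for lower_bound, label in bands:
        if kappa >= lower_bound:
            return label
    return "poor"


for value in (0.18, 0.55, 0.62, 0.80):
    print(f"{value:.2f} -> {kappa_band(value)}")
```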

brand_tone, the lowest kappa in the table, is the chapter's next loop: refine the 3 anchor with a specific example of "boilerplate" versus "warm". Do not chase 1.00: for soft dimensions, 0.6 to 0.8 is the realistic ceiling, and the cost of pushing further usually shows up as anchor brittleness when the product evolves.

Teacher voice. Kappa is the question "how much more often did the reviewers match than two raters with the same scoring habits would match by chance?" Percent agreement is the question "how often did they happen to match given today's mix?" The first discounts the agreement that a lopsided traffic mix hands out for free. The second does not.
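
To see the base-rate effect in numbers, compare two contingency tables with the same percent agreement but a different traffic mix. Both tables below are hypothetical counts made up for illustration, and `kappa_from_table` is a hypothetical helper name.

```python
def kappa_from_table(table: list[list[int]]) -> tuple[float, float]:
    """Return (percent agreement, Cohen's kappa) for a square contingency table.

    table[i][j] counts cases where rater A chose label i and rater B chose label j.
    """
    n = sum(sum(row) for row in table)
    observed = sum(table[i][i] for i in range(len(table))) / n
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    expected = sum(r * c for r, c in zip(row_totals, col_totals)) / (n * n)
    return observed, (observed - expected) / (1 - expected)


# Hypothetical counts for 100 chats; labels are (pass, fail).
balanced = [[45, 5], [5, 45]]  # each rater passes half the chats
lopsided = [[85, 5], [5, 5]]   # each rater passes 90% of the chats

for name, table in (("balanced", balanced), ("lopsided", lopsided)):
    agreement, kappa = kappa_from_table(table)
    print(f"{name:<9} agreement = {agreement:.0%}, kappa = {kappa:.2f}")
```

Both tables show the same raw agreement, yet kappa is much lower for the lopsided one because most of that agreement would have happened by chance.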

5) Alternative comparison — binary, Likert, or multi-dimensional¶

Three rubric shapes show up in production. Each fits a different workload.

SHAPE             FITS WHEN                              BREAKS WHEN
─────────────     ─────────────────────────────────      ─────────────────────────
Binary pass/fail  one hard requirement; safety gates;    quality is a spectrum
                  guardrails; "policy violated y/n"      (helpfulness, tone)

5-point Likert    single soft dimension; product CSAT-   multiple independent
(anchored)        adjacent score; tone alone             failures pile into one
                                                         number

Multi-dimensional 3-6 independent failure modes;         dimensions multiply until
+ anchors         debugging-grade evals; production      each misses its 2-rater
                  bots with diverse failure shapes       kappa target; keep fewer,
                                                         sharper dimensions

Binary is what you use for "did the bot leak a credit-card number, yes or no?" Two raters rarely disagree when the dimension is observable and binary, so kappa is close to trivial and the rubric is one sentence. Use it for hard guardrails.

Likert is what you use when one soft dimension is the whole story: "how warm did the reply feel?" Cheap, fast, useful for early product CSAT proxies. Attempt A above was this shape misapplied: an unanchored scale stretched over a compound question. That is the pathology.

Multi-dimensional is what you use the moment failures stop being one-shaped. The refund bot has at least four independent failure modes (policy, handoff, tone, safety). A single number cannot represent four orthogonal failures without averaging them into invisibility. Cost: roughly 4× the labeller time per case and a heavier judge prompt.

Pick by the shape of the failures, not by sophistication. A safety-only system on a binary rubric beats the same system on a 7-dimension grid with weak anchors.

6) The cost movement — what each rubric shape really costs¶

Concrete numbers for the refund-bot eval, 100 cases, two human reviewers, internal rates of ₹400/hour.

Rubric shape                        Time per case   100-case   Inter-rater   Diagnostic signal
                                    (one rater)     cost       kappa
──────────────────────────────────  ─────────────   ────────   ───────────   ────────────────────
One vague Likert ("is it good?")    30 sec          ₹670       0.18          almost none
Anchored single Likert (tone only)  45 sec          ₹1,000     0.62          one axis
4-dim anchored, 2 anchors (1/5)     90 sec          ₹2,000     0.55          per-axis, weak edges
4-dim anchored, 3 anchors (1/3/5)   120 sec         ₹2,670     0.80          per-axis, actionable
4-dim anchored, 5 anchors (1-5)     180 sec         ₹4,000     0.82          over-engineered

The interesting row is the jump from 2 anchors to 3 anchors. Adding a middle anchor (the 3 example) does most of the work: kappa moves from 0.55 to 0.80 for roughly 30 extra seconds of labeller time per case. Adding two more anchors barely moves kappa (0.80 to 0.82) and raises the cost by half (₹2,670 to ₹4,000). The 1/3/5 shape is the practical sweet spot for most multi-dimensional rubrics. Two anchors leave too much interpolation; five anchors invite spurious precision that humans cannot maintain.
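
The cost column follows from simple arithmetic: seconds per case, times cases, times raters, converted to hours and multiplied by the hourly rate. The sketch below recomputes it from the figures in this section; `labelling_cost` is a hypothetical helper name, and the table appears to round to the nearest ten rupees.

```python
RATE_PER_HOUR = 400  # rupees per reviewer hour, from the text
CASES = 100
RATERS = 2

# Time per case for one rater, in seconds, copied from the table.
seconds_per_case = {
    'One vague Likert ("is it good?")': 30,
    "Anchored single Likert (tone only)": 45,
    "4-dim anchored, 2 anchors (1/5)": 90,
    "4-dim anchored, 3 anchors (1/3/5)": 120,
    "4-dim anchored, 5 anchors (1-5)": 180,
}


def labelling_cost(seconds: int, cases: int = CASES, raters: int = RATERS,
                   rate_per_hour: float = RATE_PER_HOUR) -> float:
    """Total reviewer cost in rupees for one pass over the eval set."""
    hours = seconds * cases * raters / 3600
    return hours * rate_per_hour


for shape, seconds in seconds_per_case.items():
    print(f"{shape:<36} ₹{labelling_cost(seconds):,.0f}")
```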

The cost the rubric creates: now the labelling step is 2× to 4× slower per case. The subsystem absorbing the new cost is human reviewer time (or judge tokens, if the rubric is given to an LLM judge). The pressure relieved: noise in downstream A/B and drift detection. The pressure created: anchor maintenance — whoever owns the rubric now has a small artifact to keep alive across product changes.

7) Operational signals — what tells you the rubric is healthy or rotting¶

Healthy rubric: when you re-run last quarter's labelled set with the current rubric, kappa stays inside ±0.05 of the original. When two new reviewers onboard with the rubric, they hit production-grade kappa after one calibration session of about an hour. When a PM asks "why did tone drop 0.4 points this week?", somebody can point at three specific chats and explain which anchor they slid past.

The first metric that degrades is per-dimension kappa. When one dimension's kappa quietly falls from 0.75 to 0.55 over a quarter, the anchor has drifted: the product changed, edge cases changed, but the anchor example did not get updated. The misleading metric beginners watch first is the aggregate average score. The aggregate average can stay flat while policy_correctness kappa collapses, because the two reviewers' errors can cancel out across cases and leave the mean stable. The expert opens the per-dimension kappa-over-time graph first, with the rubric version overlaid as vertical lines. Kappa dropping inside one version is anchor drift; kappa jumping at a version boundary is intended.
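
The "drop inside one version" check is easy to automate. The sketch below is one possible shape, not a prescribed implementation: `KappaPoint` and `within_version_drops` are hypothetical names, the history is made-up data for a single dimension, and the 0.05 threshold borrows the ±0.05 tolerance mentioned above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class KappaPoint:
    week: int
    rubric_version: str
    kappa: float


def within_version_drops(points: list[KappaPoint],
                         threshold: float = 0.05) -> list[tuple[str, float]]:
    """Flag rubric versions whose kappa fell by more than `threshold` between
    the first and last measurement taken under that version."""
    first: dict[str, float] = {}
    last: dict[str, float] = {}
    for p in sorted(points, key=lambda p: p.week):
        first.setdefault(p.rubric_version, p.kappa)
        last[p.rubric_version] = p.kappa
    return [
        (version, round(first[version] - last[version], 2))
        for version in first
        if first[version] - last[version] > threshold
    ]


# Hypothetical history: a slow slide under v1.3, then a jump at the v1.4
# boundary after the anchors were refreshed.
history = [
    KappaPoint(1, "1.3", 0.75),
    KappaPoint(5, "1.3", 0.68),
    KappaPoint(9, "1.3", 0.55),
    KappaPoint(10, "1.4", 0.78),
    KappaPoint(13, "1.4", 0.76),
]

print(within_version_drops(history))
```

The jump from 0.55 to 0.78 at the version boundary is not flagged, because the comparison never crosses versions.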

The slowest-burning signal is the Goodhart drift — your judge's score keeps going up while user CSAT keeps going down. That is the rubric measuring something the product team has learned to optimise that no longer correlates with what users feel. The fix is not better judges. The fix is adding a new dimension or refreshing anchors with recent failed conversations.

Mini-FAQ. "What's a reasonable cadence for refreshing anchors?" Once per quarter for a stable product, once per major release for a fast-moving one. Refresh by pulling 30 conversations from the lowest-CSAT bucket and asking: do any of these score above 3 on the current rubric while the user was clearly unhappy? If yes, the anchor for 3 needs a new example.
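
The refresh query in the FAQ above is a sort and a filter. In this sketch the `ScoredChat` record, the 1-5 CSAT scale, and the sample data are all assumptions for illustration; only the "lowest-CSAT bucket, rubric score above 3" rule comes from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScoredChat:
    chat_id: str
    csat: int          # hypothetical 1-5 post-chat survey score
    rubric_score: int  # 1-5 score on one dimension under the current rubric


def refresh_candidates(chats: list[ScoredChat], sample_size: int = 30) -> list[ScoredChat]:
    """From the `sample_size` lowest-CSAT chats, keep those the rubric scored above 3."""
    lowest = sorted(chats, key=lambda c: c.csat)[:sample_size]
    return [c for c in lowest if c.rubric_score > 3]


# Made-up data; sample_size is shrunk to 3 so the example stays readable.
chats = [
    ScoredChat("c1", csat=1, rubric_score=5),
    ScoredChat("c2", csat=2, rubric_score=3),
    ScoredChat("c3", csat=1, rubric_score=4),
    ScoredChat("c4", csat=5, rubric_score=5),
    ScoredChat("c5", csat=2, rubric_score=1),
]

for chat in refresh_candidates(chats, sample_size=3):
    print(f"{chat.chat_id}: CSAT {chat.csat}, rubric {chat.rubric_score}")
```

Each chat this returns is a candidate new example for the 3 anchor: the user was unhappy, but the rubric did not notice.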

8) Boundary of applicability — when one binary is enough and when four dimensions are required¶

A multi-dimensional rubric is overkill when the system has exactly one failure mode you care about. A nuclear-reactor shutdown bot needs a binary "did it shut down when asked, y/n". Two anchors. One dimension. Anything more is rubric theatre.

The strong fit for the multi-dim shape is production assistants with diverse failure modes — refund bots, support bots, code assistants, legal drafting tools, enterprise search. These systems fail in three or four uncorrelated ways simultaneously, and a single number averages the failures into a flat lie.

The pathology is dimension inflation. Teams that have used multi-dim rubrics for a quarter start adding dimensions for every new failure they see — "factual depth", "creativity", "narrative arc", "ethical sensitivity" — until the rubric has 11 dimensions, no reviewer can hold it in their head, and per-dimension kappa drops everywhere. Five crisp dimensions beat twelve muddy ones. If a new failure shows up, first ask whether it is an anchor refinement on an existing dimension before adding a new dimension.

The scale limit is where 1/3/5 anchors stop capturing real variance. Rubrics for highly creative writing often need narrative anchors (rubric prose, not numeric levels) because numeric scoring of narrative quality tends to hit a kappa ceiling of around 0.5 however carefully the anchors are written. For those workloads, ranked pairwise comparison usually beats absolute scoring.

9) The Goodhart trap — when the score goes up and the users get unhappier¶

This is the failure most teams discover three months in. The rubric was honest at launch. The team tuned the bot, the judge ran weekly, and policy_correctness climbed from 3.6 to 4.4 over the quarter. CSAT dropped from 78 to 71 across the same window.

Pull the conversations. What you find: the bot has learned to cite policy clauses constantly, in every reply, including replies where the customer never asked about policy. The rubric rewards citation. The user experience punishes citation. The rubric is no longer measuring "did the reply correctly handle this conversation"; it is measuring "did the reply contain a policy citation." A reviewer scoring against the anchors will give 5s all day because the anchor says "cites the correct policy clause."

The fix is to update the anchor: "cites the correct policy clause when relevant to the customer's question AND applies it to the specific order details." Two words ("when relevant") slip a relevance check into the dimension. After the anchor update, the next eval pass drops policy_correctness from 4.4 back to 3.9 — which is the correct score for the bot's actual behaviour. The drop looks like a regression on the dashboard. It is not. It is the rubric becoming honest.

Goodhart's law applied to rubrics: when a measure becomes a target, it stops being a good measure. Every rubric in production becomes a target the moment the team starts optimising against it. The defence is not to stop measuring; it is to version the rubric, refresh anchors quarterly, and watch the eval/CSAT correlation, not just the eval.
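
Watching the eval/CSAT correlation can be as simple as a Pearson coefficient over weekly averages. The sketch below is a rough illustration: `pearson` is a hand-rolled helper, and the weekly series is hypothetical apart from its endpoints (3.6 to 4.4 and 78 to 71), which come from the story above.

```python
from math import sqrt


def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient of two equal-length series."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical weekly averages; only the first and last values come from
# the text, the intermediate ones are made up for illustration.
policy_correctness = [3.6, 3.8, 4.0, 4.1, 4.3, 4.4]
csat = [78, 77, 75, 74, 72, 71]

r = pearson(policy_correctness, csat)
if r < 0:
    print(f"eval and CSAT are moving in opposite directions (r = {r:.2f}); audit the anchors")
```

A negative coefficient does not say which anchor is being gamed. It only says the rubric and the users have stopped agreeing, which is the cue to pull conversations.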

10) Common wrong mental model — "raters know quality when they see it"¶

The seductive intuition is that two thoughtful, well-trained reviewers — or two thoughtful, well-prompted LLM judges — will converge on the same quality judgement if you just give them the same conversation. Quality is evident; experts spot it.

This is wrong, and the chapter has already shown why. Two senior reviewers scored chat 7 as 5 and 2 in Attempt A. Neither was wrong. They were grading different dimensions inside one number. The rubric, not the reviewer's expertise, decides whether two graders converge. Anchored multi-dim rubrics produce convergence; unanchored single numbers produce divergence; no amount of reviewer skill closes the gap that the rubric leaves open.

Replace the wrong model with the right one. Reviewers do not converge on quality; they converge on observable criteria with shared anchors. Quality is the output of a good rubric, not the input. When you onboard a new reviewer, you are not teaching them taste. You are teaching them to read the anchor and compare. The same is true of an LLM judge — you are not asking it to be wise; you are giving it a measuring stick.

11) Six recurring rubric failures¶

Adjective stacking. "Helpful, friendly, accurate, clear, complete" in one prompt. Five concepts averaged into one number; kappa collapses. Split into dimensions.
Likert without anchors. A 1-5 scale where 5 and 3 are not defined. Every reviewer invents their own scale. Add anchors at 1, 3, 5.
Overlapping dimensions. "Correctness" and "factual accuracy" almost always co-move. Two columns, one signal, double-counted. Collapse or sharpen the difference.
Internal-state dimensions. "Confidence", "intent", "helpfulness as felt by the user" — none are observable from the text alone. Replace with surface features.
Frozen anchors. Rubric was written nine months ago, product changed, anchors did not. Per-dimension kappa drifts. Refresh quarterly.
No version field. Last quarter's scores and this quarter's scores sit in the same column. Comparing them is comparing apples to slightly-different apples. Tag every score with the rubric version that produced it.
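
The last failure in the list is the cheapest to prevent: carry the rubric version on every score and never average across versions. The sketch below assumes a hypothetical `Score` record and `mean_by_version` helper. Chat 9's safe_refusal scores (5 under v1 from reviewer B, 3 under v2) come from the threaded example; the chat 1 rows are made up.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Score:
    chat_id: str
    dimension: str
    value: int
    rubric_version: str  # version of the rubric that produced this score


def mean_by_version(scores: list[Score], dimension: str) -> dict[str, float]:
    """Average one dimension separately for each rubric version, never across them."""
    buckets: dict[str, list[int]] = defaultdict(list)
    for s in scores:
        if s.dimension == dimension:
            buckets[s.rubric_version].append(s.value)
    return {version: sum(vals) / len(vals) for version, vals in buckets.items()}


scores = [
    Score("chat-9", "safe_refusal", 5, "v1"),
    Score("chat-1", "safe_refusal", 5, "v1"),  # hypothetical row
    Score("chat-9", "safe_refusal", 3, "v2"),
    Score("chat-1", "safe_refusal", 5, "v2"),  # hypothetical row
]

print(mean_by_version(scores, "safe_refusal"))
```

The same chat moving from 5 to 3 across versions is a rubric change, not a bot regression. Keeping the versions in separate buckets is what lets the dashboard say so.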

12) Where this pressure shows up again¶

Same invariant, judge layer. Chapter 06 made the judge cheap; chapter 08 makes the judge agree with the rubric. Both rely on the rubric being a measuring stick rather than a vibe meter. The judge is just another rater whose kappa with humans is the calibration target.
Same failure shape, drift layer. Chapter 09's drift detection assumes the rubric is stable. A drifting rubric and a drifting model look identical on the dashboard; only versioned rubrics let you tell them apart.
Same pressure, A/B layer. Chapter 10's A/B tests are kappa-bound from above. You cannot detect a 3-point quality difference if the rubric's kappa is 0.4 — the noise floor of the rubric eats the signal of the experiment.

13) A fast self-test before you trust your rubric¶

Can a brand-new reviewer score 10 cases after a one-hour calibration and hit ≥0.7 kappa with the team lead?
Can you point at a concrete example output for every dimension at each of anchors 1, 3, and 5?
Are your dimensions independent — does each one have at least one production case where it failed alone?
Is the rubric version stamped on every score in your dashboard?
When CSAT dropped last month, could you identify the dimension and the anchor that needed refinement, or did the team just say "we'll look into it"?

Five yeses means the rubric is doing its job. Any no is the next half-day of work.
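
The independence question can be checked mechanically once scores exist. The sketch below uses reviewer A's Attempt B scores from earlier in the chapter. The `failed_alone` helper is a hypothetical name, and treating a score of 3 or lower as a failure is an assumption made for illustration.

```python
DIMENSIONS = ("policy_correctness", "handoff_completeness", "brand_tone", "safe_refusal")

# Reviewer A's Attempt B scores per chat: (policy, handoff, tone, safety).
reviewer_a = {
    1: (5, 5, 5, 5), 2: (3, 3, 4, 1), 3: (5, 3, 5, 3), 4: (5, 5, 5, 5),
    5: (1, 3, 4, 1), 6: (5, 5, 4, 5), 7: (5, 1, 4, 1), 8: (3, 5, 4, 3),
    9: (5, 5, 5, 3), 10: (5, 5, 5, 5),
}
FAIL_AT_OR_BELOW = 3  # assumption: a 3 or lower counts as a failure


def failed_alone(scores: dict[int, tuple[int, ...]],
                 dimensions: tuple[str, ...] = DIMENSIONS,
                 threshold: int = FAIL_AT_OR_BELOW) -> dict[str, list[int]]:
    """Map each dimension to the chats where it was the only one at or below threshold."""
    result: dict[str, list[int]] = {name: [] for name in dimensions}
    for chat, values in scores.items():
        failing = [name for name, v in zip(dimensions, values) if v <= threshold]
        if len(failing) == 1:
            result[failing[0]].append(chat)
    return result


for dimension, chats in failed_alone(reviewer_a).items():
    print(f"{dimension:<22} failed alone on chats: {chats}")
```

An empty list for a dimension is not proof that it is redundant, especially on ten chats. It is a prompt to go looking for a production case where that dimension fails on its own.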

Where this lives in the wild¶

Teams that ship measurable AI tend to converge on anchored multi-dimensional rubrics. The vocabulary differs; the shape does not.

Anthropic Constitutional AI rubrics — each "constitution" principle is effectively an anchored dimension that a judge scores against, with concrete refusal/compliance examples standing in for the 1/3/5 anchors of this chapter.
OpenAI human-preference rubrics for RLHF — labellers compare pairs against a written rubric (helpful, harmless, honest) with anchor examples in the labeller manual; the manual is the rubric file.
BigLaw associate-review anchors — partner review of associate work uses concrete "this brief is a 5, this brief is a 3" memos pinned in firm intranets. The legal industry built anchored rubrics before AI did.
Casetext CoCounsel legal-accuracy rubric — separates citation accuracy, jurisdictional fit, and argument completeness as independent dimensions because a brief can cite correctly while reasoning poorly.
Harvey drafting dimensions — splits legal drafting into clause coverage, citation correctness, tone-appropriate-to-counterparty, and risk-flag completeness; one global "is the draft good" number is explicitly rejected.
Glean enterprise-search rubric — relevance, freshness, source-authority, and answer-completeness scored independently; the team learned that high relevance with stale sources is still a failure.
Notion AI helpfulness rubric — anchors include workspace-document examples so that 5 on "useful for the user's workspace" has a concrete referent rather than a feeling.
RAGAS metric definitions — faithfulness, answer relevance, context precision, context recall — published as implicit rubrics with formal definitions; the entire library is a multi-dim rubric in code.
Vectara HHEM — hallucination evaluation model trained on a rubric where each anchor is a real (output, source-document) pair labelled by humans.
Intercom Fin resolution rubric — separates resolution correctness, escalation appropriateness, and tone; ships with anchor examples from real ticket archives.
Cursor's tool-call rubric — dimensions for selected-correct-tool, called-with-correct-args, and stopped-at-correct-time scored separately; one dimension can fail without the others.
Khanmigo tutoring rubric — correctness, scaffolding (does it lead the learner instead of giving the answer), CEFR-appropriate vocabulary, and emotional support; each dimension has anchor transcripts.
BloombergGPT finance evals — domain-specific dimensions including numerical correctness, citation traceability, and regulatory-language compliance; finance refuses one-blob scoring outright.
Duolingo Max conversation rubric — language-correctness, cultural-appropriateness, and pedagogical-progression as separate dimensions; pedagogical-progression has anchors at "asks a follow-up", "advances difficulty", "rewinds when learner struggled".
GitHub Copilot Chat eval rubric — correctness, code safety, instruction adherence, explanation clarity; published anchors include concrete code-diff examples.
Replit Ghostwriter eval — anchored dimensions for compile correctness, behavioural correctness, and idiom match, because syntactically valid code can be behaviourally wrong.
Salesforce Einstein Copilot trust rubric — adds a grounding dimension specifically because adjacent dimensions like "accuracy" hid hallucinations grounded in nothing.
Slack AI summary rubric — coverage, faithfulness, brevity, and decision-extraction; each dimension has anchor summaries pulled from real channels.
Perplexity citation rubric — citation existence, citation correctness, citation sufficiency as three dimensions; an answer can have a real citation that does not actually support the claim.
Adobe Firefly safety rubric — anchored dimensions for trademark proximity, public-figure resemblance, and safe-content compliance with example outputs at each anchor level.
Galileo, Patronus, Arize Phoenix, LangSmith, Langfuse — five eval platforms whose product is roughly "host your rubric, run it, version it, alert on it"; the number of vendors in this space suggests how often rubrics are the binding constraint.
Goodhart in deployment — Bing Chat's early days — the system optimised against an internal helpfulness rubric and produced replies users found alarming; the rubric was anchored, but anchored on the wrong observable behaviours.
Air Canada's chatbot incident (2024) — the bot scored well on internal correctness rubrics but invented a refund policy; the rubric had no policy_invention dimension because the team had never seen the failure before. New failure shapes drive new dimensions.

The pattern is consistent. Industries that take measurement seriously — law, medicine, finance, education — all converged on anchored multi-dimensional rubrics decades before AI evals did. The AI industry is rediscovering grading.

Recall — can you reconstruct the chapter cold?¶

Why is a 1-5 Likert "how helpful is this reply" a broken rubric even when the reviewers are senior?
Name the four parts of a rubric anatomy and what fails when each is missing.
In the refund-bot example, which dimension was responsible for chat 7's 5-vs-2 disagreement under Attempt A?
Why does Cohen's kappa beat raw percent agreement when one anchor is much more common than others?
What is the practical sweet spot for number of anchors per dimension, and why?
State the Goodhart trap for rubrics in one sentence.
When is a binary pass/fail rubric the correct choice rather than a lazy one?
What is the wrong mental model this chapter explicitly replaces, and what replaces it?

Interview Q&A¶

Q1. Two reviewers score the same 100 conversations and agree on only 35%. Your manager wants to retrain the reviewers. What do you push back with?

A. Train the rubric, not the reviewers. Agreement of 35% is a specification problem, not a reviewer problem. Pull the disagreements, find which dimensions are doing two jobs at once, write 1/3/5 anchors with concrete example outputs, re-run on 30 cases, and measure per-dimension kappa. Retraining reviewers without sharper anchors lifts their agreement temporarily, but it drifts back within a quarter because the underlying ambiguity is in the words on the page. Common wrong answer to avoid: "Reviewers need calibration sessions until they agree" — calibration is downstream of a sharp rubric, not a substitute for it.

Q2. You added Cohen's kappa to your eval dashboard. Per-dimension kappa shows policy=0.95, handoff=0.9, tone=0.42. The PM asks if the rubric is broken. What do you say?

A. The rubric is not broken, but one dimension needs work. Tone is a soft dimension and 0.42 is moderate agreement — it is the dimension that most needs anchor refinement, but it is not catastrophic. Pull the disagreements on the tone dimension and check whether they cluster around one anchor (usually the 3). If they do, write a sharper example for that anchor. The other two dimensions are production-grade. Treat kappa as per-dimension; an aggregate kappa would average a working signal with a noisy one and tell you nothing. Common wrong answer to avoid: "Aggregate kappa is 0.76, we are fine" — aggregate hides the dimension that is actually failing.
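The arithmetic behind that wrong answer is short enough to check by hand. The sketch below assumes the "aggregate" is a plain mean of the three per-dimension kappas, and it borrows 0.7 as a per-dimension floor; both are assumptions for illustration, and the variable names are hypothetical.

```python
KAPPA_FLOOR = 0.7  # assumed per-dimension bar; tune to your own target

# Per-dimension kappas quoted in the question.
per_dimension_kappa = {"policy": 0.95, "handoff": 0.90, "tone": 0.42}

# The naive aggregate: a plain mean across dimensions.
aggregate = sum(per_dimension_kappa.values()) / len(per_dimension_kappa)

# The per-dimension view: which dimensions fall below the floor?
failing = sorted(d for d, k in per_dimension_kappa.items() if k < KAPPA_FLOOR)

print(f"aggregate kappa: {aggregate:.2f}")
print("below floor:", failing)
```

The mean comes out at 0.76, which looks healthy, while the per-dimension view still singles out tone.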

Q3. The judge's score on policy_correctness has climbed from 3.6 to 4.4 over the quarter. CSAT dropped 7 points. What is the first thing you investigate?

A. The rubric, specifically the 5 anchor on policy_correctness. The pattern — rising eval, falling CSAT — is the canonical Goodhart signal. Pull 30 recent low-CSAT conversations and score them by the current rubric. If many score 4 or 5 on policy_correctness while the user was unhappy, the anchor rewards a behaviour the user does not value. Likely cause: the bot has learned to cite policy aggressively, including in conversations where citation is irrelevant or off-putting. Fix the anchor to require relevance to the customer's actual question before crediting the citation. Common wrong answer to avoid: "Push the judge to score more harshly" — the judge is doing what the rubric tells it to do; the rubric is the bug.
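The "pull low-CSAT conversations and look for high rubric scores" step can be sketched as a simple filter. The record layout, the field names, the 1-5 CSAT scale, and the cut-offs below are all assumptions for illustration; adapt them to however your logs store scores.

```python
def over_credited(conversations, dimension, max_csat=2, min_score=4):
    """Return conversations the rubric liked (score >= min_score) but the user did not."""
    return [
        c for c in conversations
        if c["csat"] <= max_csat and c["scores"][dimension] >= min_score
    ]


# Hypothetical log records: CSAT on a 1-5 scale, rubric scores on a 1-5 scale.
sample = [
    {"id": "c1", "csat": 1, "scores": {"policy_correctness": 5}},
    {"id": "c2", "csat": 2, "scores": {"policy_correctness": 4}},
    {"id": "c3", "csat": 1, "scores": {"policy_correctness": 2}},
    {"id": "c4", "csat": 5, "scores": {"policy_correctness": 5}},
    {"id": "c5", "csat": 2, "scores": {"policy_correctness": 3}},
]

low_csat = [c for c in sample if c["csat"] <= 2]
flagged = over_credited(sample, "policy_correctness")
share = len(flagged) / len(low_csat)

print("flagged:", [c["id"] for c in flagged])
print(f"share of low-CSAT cases the rubric over-credits: {share:.0%}")
```

In this toy sample, two of the four low-CSAT conversations score 4 or 5 on policy_correctness. Those flagged conversations are the ones to read by hand when rewriting the 5 anchor.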

Q4. Your team wants to add a "creativity" dimension to the refund-bot rubric. Why is that a bad idea?

A. Two reasons. First, "creativity" is internal, not observable — two reviewers will interpret it differently and per-dimension kappa will collapse. Second, "creativity" is not a failure mode the refund bot actually has; adding dimensions for failures you have not seen in logs is dimension inflation. The discipline is to add a dimension only after you have at least 10 production cases where that failure occurred alone — independent of the existing dimensions. If you cannot point at those cases, you do not need the dimension. Common wrong answer to avoid: "More dimensions means more thorough evals" — past 5 or 6 dimensions, per-dimension kappa starts dropping everywhere because reviewers cannot hold the rubric in their heads.

Q5. Cumulative — is this a chapter 03 dataset problem, a chapter 06 judge problem, or a chapter 07 rubric problem?

Symptom: "the LLM judge agrees with itself across re-runs (consistency 0.95), but disagrees with our two human raters (kappa 0.35)."

A. Chapter 07 rubric problem most likely, chapter 06 judge prompt second. The judge is consistent with itself, so the model is doing its job stably. It is disagreeing with humans, which means either the rubric is ambiguous enough that humans and judge fill it in differently, or the judge prompt is omitting a dimension humans treat as important. Check whether the humans are scoring against the exact same dimension definitions and anchors as the judge sees in its prompt. If yes, refine anchors. If no, align the judge prompt to the rubric text verbatim. Only after both should you suspect a chapter 03 dataset-sampling issue. Common wrong answer to avoid: "Switch to a stronger judge model" — a stronger judge will be more confidently wrong against a vague rubric, not closer to humans.

Q6. A teammate wants to use a 1-7 scale "because it gives finer resolution." What's your response?

A. Finer resolution only matters if the underlying signal supports it. Most production rubrics top out around kappa 0.8 even with 1/3/5 anchors; moving to 1-7 typically drops kappa by 0.1 or more because the four un-anchored intermediate values become Rorschach territory. Resolution that the reviewers cannot maintain is spurious precision. If you genuinely need finer resolution on one dimension, split the dimension into two narrower ones before widening the scale. Common wrong answer to avoid: "Finer scales reduce ceiling effects" — they only reduce ceiling effects when each scale point has an anchor; otherwise they add noise.

Q7. You inherit a rubric that hasn't been touched in 18 months. The product has changed substantially. Where do you start?

A. Pull 50 recent conversations from the lowest-CSAT bucket. Score them against the current rubric. Flag every case where the rubric says 4 or 5 but the user clearly disliked the reply — those are the anchor-drift cases. Cluster the flagged cases by which dimension is over-crediting them. For each cluster, write a sharper anchor example pulled from a real recent failure. Bump the rubric version, tag all new scores with the new version, and never compare scores across versions without a re-scoring of historical cases against the new rubric. Common wrong answer to avoid: "Just update the rubric in place and rerun the dashboard" — you lose the ability to distinguish model regressions from rubric tightening, and your A/B history becomes uninterpretable.
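One way to make the "never compare across versions" rule hard to break is to stamp the version on every score and have the comparison helper refuse mixed inputs. This is a minimal sketch; the `Score` record, the `mean_shift` helper, and the sample data are hypothetical.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class Score:
    conversation_id: str
    dimension: str
    value: int
    rubric_version: str  # stamped at scoring time, never edited afterwards


def mean_shift(before, after):
    """Difference in mean score (after - before); refuses to mix rubric versions."""
    versions = {s.rubric_version for s in before} | {s.rubric_version for s in after}
    if len(versions) != 1:
        raise ValueError(f"scores span rubric versions {sorted(versions)}; re-score first")
    return mean(s.value for s in after) - mean(s.value for s in before)


# Hypothetical scores for one dimension.
march_v1 = [Score("c1", "policy_correctness", 5, "v1"),
            Score("c2", "policy_correctness", 3, "v1")]
april_v2 = [Score("c3", "policy_correctness", 3, "v2"),
            Score("c4", "policy_correctness", 3, "v2")]
# The same March conversations re-scored under v2, which makes the comparison valid.
march_v2 = [Score("c1", "policy_correctness", 3, "v2"),
            Score("c2", "policy_correctness", 3, "v2")]

try:
    mean_shift(march_v1, april_v2)
    cross_version_blocked = False
except ValueError:
    cross_version_blocked = True

same_version_shift = mean_shift(march_v2, april_v2)
print("cross-version comparison blocked:", cross_version_blocked)
print("shift after re-scoring March under v2:", same_version_shift)
```

In the toy data, the apparent drop from a mean of 4 (March, v1) to 3 (April, v2) vanishes once March is re-scored under v2. That is rubric tightening, not a model regression, and the version stamp is what lets you tell the two apart.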

Q8. Define acceptable for the refund bot in one anchored sentence per dimension.

A. Policy correctness: 5 = cites the correct policy clause AND applies it to the specific order; 1 = invents a clause or refunds wrongly. Handoff completeness: 5 = a human agent could continue the conversation cold from the bot's last message; 1 = order id or issue type missing. Brand tone: 5 = warm, plain, no jargon; 1 = curt or robotic. Safe refusal: 5 = correctly refuses out-of-scope asks with a transferable handoff line containing order id and exact ask; 1 = invents an answer or refuses what it should have handled. Each sentence has an observable referent and a concrete example. Common wrong answer to avoid: "Acceptable means the customer is satisfied" — satisfaction is the outcome you are trying to predict from the rubric, not a definition the rubric can use.
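Writing those sentences down as data makes gaps visible. The sketch below stores the anchors from the answer above in a plain dictionary and lists which of the 1/3/5 anchors each dimension still lacks. The dictionary layout, the `missing_anchors` helper, the `handoff_completeness` key, and the whole-rubric "v2" label are assumptions; the only 3 anchor included is the safe_refusal wording quoted in the Apply now walkthrough.

```python
REQUIRED_ANCHORS = (1, 3, 5)

# Anchor text for 5 and 1 is copied from the answer above. The single 3 anchor
# is the safe_refusal wording from the chat 9 walkthrough. Nothing else is filled in.
RUBRIC = {
    "version": "v2",  # assumed label for illustration
    "dimensions": {
        "policy_correctness": {
            5: "cites the correct policy clause AND applies it to the specific order",
            1: "invents a clause or refunds wrongly",
        },
        "handoff_completeness": {
            5: "a human agent could continue the conversation cold from the bot's last message",
            1: "order id or issue type missing",
        },
        "brand_tone": {
            5: "warm, plain, no jargon",
            1: "curt or robotic",
        },
        "safe_refusal": {
            5: "correctly refuses out-of-scope asks with a transferable handoff "
               "line containing order id and exact ask",
            3: "refuses and routes, but transfer message is generic",
            1: "invents an answer or refuses what it should have handled",
        },
    },
}


def missing_anchors(rubric):
    """Map each dimension to the required anchor levels it has no text for."""
    gaps = {}
    for dim, anchors in rubric["dimensions"].items():
        absent = [level for level in REQUIRED_ANCHORS if level not in anchors]
        if absent:
            gaps[dim] = absent
    return gaps


gaps = missing_anchors(RUBRIC)
print(gaps)
```

One-sentence definitions at 5 and 1 are a start, not a finished rubric: the check reports that three of the four dimensions still need a 3 anchor before two raters can be expected to agree in the middle of the scale.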

Apply now (10 min)¶

Step 1 — model the exercise. Take chat 9 from the threaded example. Re-score it under the v2 safe_refusal anchor. Walk through the comparison:

chat 9 reply: "For store-credit conversions, I'll connect you with a
              human agent who handles returns. Please hold while I
              transfer you."

v2 anchor for 5: includes order id, issue type, AND customer's exact ask
                 in a single line a human agent can act on.

This reply mentions none of those three. So it cannot be a 5.
v2 anchor for 3: refuses and routes, but transfer message is generic.

This reply refuses (correctly — store credit is out of scope), routes
(handing off to human), but the transfer line is generic.

Score: 3. Both reviewers now land here.

That walkthrough is what mastery on this chapter looks like: the score follows from comparing the reply to the anchor sentence, not from a feeling.
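The same comparison can be written as a small decision ladder over things a reviewer can point at in the reply. This is a sketch under stated assumptions: it only applies to replies that refuse and route, it treats a transfer line with one or two of the three fields as generic, and the `RefusalObservations` record and `safe_refusal_score` helper are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class RefusalObservations:
    """What a reviewer can point at in the reply; every field is observable."""
    invented_answer: bool        # bot made up an answer instead of refusing
    refused_in_scope_ask: bool   # bot refused something it should have handled
    has_order_id: bool           # transfer line includes the order id
    has_issue_type: bool         # transfer line includes the issue type
    has_exact_ask: bool          # transfer line includes the customer's exact ask


def safe_refusal_score(obs):
    # Anchor 1: invents an answer or refuses what it should have handled.
    if obs.invented_answer or obs.refused_in_scope_ask:
        return 1
    # Anchor 5: the transfer line carries all three handoff fields.
    if obs.has_order_id and obs.has_issue_type and obs.has_exact_ask:
        return 5
    # Anchor 3: refuses and routes, but the transfer message is generic.
    # Assumption: a partially filled transfer line also counts as generic.
    return 3


# Chat 9: a correct refusal and a route, with none of the three fields present.
chat_9 = RefusalObservations(
    invented_answer=False,
    refused_in_scope_ask=False,
    has_order_id=False,
    has_issue_type=False,
    has_exact_ask=False,
)
chat_9_score = safe_refusal_score(chat_9)
print("chat 9 safe_refusal:", chat_9_score)
```

The function is not a replacement for a reviewer or a judge. Its value is that each branch names an observation, so a disagreement between two raters becomes a disagreement about one boolean rather than about the score.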

Step 2 — your turn. Take one AI feature from your own product. Write a 3-4 dimension rubric for it. For each dimension, write a one-sentence observable definition and a concrete example output for anchors 1, 3, and 5. Then take 5 real outputs from your logs and score each one under the rubric. Note any case where you waver between two anchors — that wavering is your next anchor refinement.

Step 3 — reproduce from memory. Without scrolling up, redraw the rubric anatomy diagram and the measuring-stick mental model. Then write the chapter's load-bearing rule about what a rubric scores. Then explain in one sentence why kappa beats percent agreement. If you can do all three cold, you carry the chapter into chapter 08.

What you should remember¶

This chapter explained why two careful readers — humans, judges, or one of each — disagree on the same conversation even when the model and the prompt do not move. The cause is almost never reviewer skill or model variance. The cause is the rubric: a sentence vague enough that each reader fills it with different content. The fix is the discipline of anchored, observable, multi-dimensional rubric design: name 3-6 independent dimensions, write a one-sentence observable definition per dimension, fix numeric anchors with concrete example outputs at 1, 3, and 5, and version the artifact like code.

You learned to test a rubric the way you would test code. Two raters, 10 cases, per-dimension agreement, Cohen's kappa with the chance-correction interpretation. You watched one disagreement on chat 9 stop being a reviewer dispute and become an anchor refinement, and you saw how brand_tone at kappa 0.42 is information about the rubric, not about the reviewers. You also learned to suspect every rising score: Goodhart's law applies to rubrics, and a rubric that becomes a target will quietly stop being a good measure unless you refresh anchors against recent low-CSAT conversations every quarter.

Carry this diagnostic forward: when two raters disagree, do not retrain the raters. Read the dimension they disagreed on, find the anchor that did the leaking, and refine it with a real example output. If you see the eval score climbing while CSAT falls, do not push the judge harder — pull 30 low-CSAT cases and look for the anchor that is over-crediting a behaviour users do not value. The rubric is the measuring stick. Keep it sharp, keep it versioned, and never trust a score whose rubric has no version stamp.

Remember:

A rubric is the question the judge is being asked. Vague question, noisy answer — for humans, models, and the dashboard.
Anchored 1/3/5 is the practical sweet spot. Two anchors leave too much interpolation; five anchors invite spurious precision.
Score per dimension, never as one blob. Independent failure modes need independent numbers.
Cohen's kappa beats percent agreement when one anchor dominates the distribution.
Rising eval score with falling CSAT is the Goodhart signal. Refresh anchors, do not push the judge.
Version every rubric. A score without a rubric version is a number without units.
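A small worked case shows why the kappa rule matters when one anchor dominates. The numbers are made up for illustration: two raters score 10 chats, each gives nine 5s and a single 1, but they place the 1 on different chats. They match on 8 of 10 cases, yet chance alone would produce more agreement than that:

$$
p_o = \frac{8}{10} = 0.80, \qquad
p_e = \frac{9}{10}\cdot\frac{9}{10} + \frac{1}{10}\cdot\frac{1}{10} = 0.82, \qquad
\kappa = \frac{p_o - p_e}{1 - p_e} = \frac{0.80 - 0.82}{0.18} \approx -0.11
$$

Here $p_o$ is observed agreement and $p_e$ is the agreement expected by chance from each rater's own score distribution. An 80% agreement figure would look fine on a dashboard; the kappa of roughly -0.11 says the raters did no better than guessing on the cases that actually differed.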

Bridge. A sharp rubric is necessary, but two judges — or a judge and a human — can still drift apart on the soft dimensions, and over time even a stable judge will start scoring differently as the world the rubric describes changes. We have solved what to score; we have not yet solved how to keep two scorers reading the same rubric the same way week after week. That is calibration: golden sets, periodic re-checks, agreement targets, and the bias corrections that keep a judge honest against its own past.

→ 08-judge-calibration.md