
07. Judging mechanism fit

Versioning is the artefact discipline. The next concern is the judging mechanism — for each case, the way the system's output is compared to the expected behaviour. Exact-match vs reference vs rubric, picked per case type. Fit is what makes the score meaningful.


A data engineer at a Pune SaaS company runs an eval against a customer-summarisation system. The set uses exact-match labels: the expected output is one specific string. The system's outputs differ slightly from the expected strings (different wording, same meaning), so the eval scores almost every case as a failure. The engineer questions her labelling approach. The summarisation outputs are not deterministic strings; they are paraphrases of a customer's history. Exact-match is the wrong mechanism. She switches to rubric grading, where the system's output is scored against criteria (mentions the right facts, accurate, appropriate tone). The score jumps from 0.12 to 0.84. The system was working well; the judging mechanism was the wrong fit.

This chapter is the discipline of picking the right mechanism per case. Mechanism fit drives whether the score reflects what the team thinks it does.


The three mechanisms

Recapped from chapter 03, in more depth.

Exact-match. The expected output is one specific value; the system output must equal it, either byte-for-byte or structurally (shape-equal: the same fields and values regardless of formatting). Used for structured outputs.

Reference comparison. Expected output is a known-good reference; system output is compared by similarity (semantic, edit-distance, or rubric-graded against the reference).

Rubric. Expected output is defined by criteria (must mention X, must not exceed N words, must not say Y); system output is graded by an LLM judge or human against the criteria.

Each fits different case types; mismatches produce misleading scores.


When each fits

Exact-match fits:

  • Classifications (intent, category, sentiment)
  • Structured extractions (specific field values from an input)
  • Decisions (approve / reject / escalate)
  • Tool-call arguments (the call must have argument X)
  • ID extractions (the right account_id, order_id)

The right answer is one value or a small set of values; either the system produces it or not.
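A minimal sketch of an exact-match comparator for these case types. The function names, the `accepted` parameter for the "small set of values" case, and the whitespace and case normalisation are illustrative assumptions, not part of any particular runner.

```python
def normalise(value, case_sensitive=False):
    """Strip surrounding whitespace from strings; leave other types untouched."""
    if isinstance(value, str):
        text = value.strip()
        return text if case_sensitive else text.casefold()
    return value


def exact_match(expected, actual, accepted=(), case_sensitive=False):
    """True if `actual` equals the label or one of the accepted alternatives.

    Dicts compare by content, so key order in tool-call arguments is ignored.
    """
    got = normalise(actual, case_sensitive)
    candidates = [expected, *accepted]
    return any(got == normalise(c, case_sensitive) for c in candidates)


# Classification label: formatting noise is tolerated, wording is not.
print(exact_match("refund_request", " Refund_Request "))
print(exact_match("approve", "I would approve this"))

# Tool-call arguments: same fields and values, different key order.
print(exact_match({"order_id": "A1", "qty": 2}, {"qty": 2, "order_id": "A1"}))
```

Case-insensitive comparison is a convenience for labels; for IDs where case carries meaning, `case_sensitive=True` is the safer setting.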

Reference comparison fits:

  • Translations (one good translation as reference; similar translations acceptable)
  • Summaries with a known-good summary
  • Code generation with a known-good implementation
  • Rewrites with a known-good rewrite

A canonical good answer exists; alternative phrasings can be acceptable; similarity to the reference is the proxy.
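As one possible sketch of similarity-as-proxy, the following uses token-level matching from Python's standard `difflib`. This is lexical similarity, not semantic similarity; an embedding-based comparison would be the semantic alternative. The function names and the 0.6 threshold are hypothetical and would need tuning per case type.

```python
from difflib import SequenceMatcher


def similarity(reference: str, output: str) -> float:
    """Token-level similarity in [0, 1]; 1.0 means identical token sequences."""
    ref_tokens = reference.lower().split()
    out_tokens = output.lower().split()
    return SequenceMatcher(None, ref_tokens, out_tokens).ratio()


def reference_match(reference: str, output: str, threshold: float = 0.6) -> bool:
    """Pass when the output is close enough to the known-good reference."""
    return similarity(reference, output) >= threshold


reference = "The order shipped on Monday"
print(similarity(reference, "the order shipped monday"))
print(reference_match(reference, "We cannot find that order"))
```

A lexical measure like this penalises valid paraphrases that share few words with the reference, which is the over-specific-reference failure in miniature.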

Rubric fits:

  • Open-ended responses to customer questions
  • Explanations and walkthroughs
  • Multi-criteria outputs (the answer must satisfy several conditions)
  • Cases where multiple correct answers exist that differ in surface form

Many correct answers; the criteria define what makes them correct.


When each fails

Exact-match fails on open-ended outputs. The chapter-opening case. The system produces semantically equivalent but textually different outputs; exact-match scores them all as failure.

Reference comparison fails when references are over-specific. A summary reference that captures one particular framing; alternative framings that are also good fail the similarity threshold. The reference encodes one valid answer as the valid answer.

Rubric fails when criteria are too loose or too tight. Loose criteria pass bad outputs (rubric: "addresses the question"); tight criteria fail acceptable outputs (rubric: "uses the phrase 'Dear customer'"). Calibration (chapter 03) adjusts the rubric in both directions until each criterion is specific enough to act on.


Mixed-mechanism sets

Most real sets mix mechanisms by case type. A typical customer-support set:

Case type                         | Mechanism
Intent classification             | Exact-match
Order ID extraction               | Exact-match
Response to "where is my order"   | Rubric
Refund decision (approve/refuse)  | Exact-match
Refund explanation message        | Rubric or reference
Long-form policy explanation      | Reference (known-good response)

The set's metadata records the mechanism per case; the runner applies the right comparator. A single mechanism across the whole set is rarely optimal.
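The per-case dispatch might look like the sketch below. The metadata keys (`id`, `mechanism`, `expected`, `threshold`), the comparator names, and the per-mechanism report are all assumptions made for illustration; the rubric comparator is left as a placeholder because it needs a judge.

```python
from difflib import SequenceMatcher


def exact(case, output):
    """Exact-match comparator: 1.0 only when the output equals the label."""
    return 1.0 if output == case["expected"] else 0.0


def reference(case, output):
    """Reference comparator: token-level similarity against a threshold."""
    ratio = SequenceMatcher(None, case["expected"].split(), output.split()).ratio()
    return 1.0 if ratio >= case.get("threshold", 0.6) else 0.0


def rubric(case, output):
    """Placeholder: a real runner would call an LLM judge or a human here."""
    raise NotImplementedError("plug in a judge for rubric-graded cases")


COMPARATORS = {"exact_match": exact, "reference": reference, "rubric": rubric}


def score_case(case, output, comparators=COMPARATORS):
    """Look up the comparator named in the case metadata and apply it."""
    mechanism = case["mechanism"]
    if mechanism not in comparators:
        raise ValueError(f"case {case['id']}: unknown mechanism {mechanism!r}")
    return comparators[mechanism](case, output)


def score_by_mechanism(cases, outputs, comparators=COMPARATORS):
    """Mean score per mechanism, so a mismatch shows up in its own bucket."""
    buckets = {}
    for case in cases:
        score = score_case(case, outputs[case["id"]], comparators)
        buckets.setdefault(case["mechanism"], []).append(score)
    return {name: sum(vals) / len(vals) for name, vals in buckets.items()}
```

Reporting per mechanism rather than one blended number makes a mechanism mismatch easier to spot: a near-zero exact-match bucket next to a healthy rubric bucket points at the comparator before it points at the system.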


LLM-as-judge for rubrics

For rubric grading, a human is expensive. LLM-as-judge is the production substitute. An LLM is prompted with the input, the system's output, and the rubric; it produces a score and a reasoning.

System prompt to the judge:
"You are evaluating an AI assistant's response to a customer question.
The rubric is below. Score the response on each criterion as pass or
fail. Provide a brief reasoning for each.

Rubric:
- Mentions the actual order status from the data
- Acknowledges the customer's order number
- Avoids promising specific times not in the data
- Is under 100 words

Input: { ... }
System output: { ... }

Score:"

The judge's output is structured (per-criterion pass/fail + reasoning); the score is the fraction of criteria passed (or a weighted version).
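A sketch of turning the judge's structured output into a score. The JSON shape used here (a `criteria` list with `name`, `verdict`, and `reasoning` fields) is a hypothetical schema; the prompt above does not fix an output format, so a real judge prompt would need to request it explicitly.

```python
import json


def rubric_score(judge_json: str, weights=None) -> float:
    """Fraction of criteria passed, optionally weighted per criterion.

    Criteria missing from `weights` default to a weight of 1.0.
    """
    criteria = json.loads(judge_json)["criteria"]
    weights = weights or {}
    total = 0.0
    passed = 0.0
    for item in criteria:
        weight = weights.get(item["name"], 1.0)
        total += weight
        if item["verdict"] == "pass":
            passed += weight
    if total == 0:
        raise ValueError("judge output contains no weighted criteria")
    return passed / total


judge_output = json.dumps({"criteria": [
    {"name": "mentions_status", "verdict": "pass", "reasoning": "States 'shipped'."},
    {"name": "acknowledges_order", "verdict": "pass", "reasoning": "Quotes the number."},
    {"name": "no_invented_times", "verdict": "fail", "reasoning": "Promises 'by Friday'."},
    {"name": "under_100_words", "verdict": "pass", "reasoning": "About 40 words."},
]})

print(rubric_score(judge_output))                              # 0.75
print(rubric_score(judge_output, {"no_invented_times": 3.0}))  # 0.5
```

Weighting lets a criterion the team cares about more (here, not inventing delivery times) pull the score down harder when it fails.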

Module 00_ai_evals_release_gates covers judge design, calibration, and bias mitigation in depth. The summary: judges are useful but require calibration against human labels, with periodic re-validation.


Multi-judge for noisy mechanisms

Rubric grading has variance. Two runs of the same judge on the same case can produce different scores. Mitigation: run the judge N times (typically 3) and aggregate (majority vote for pass/fail; average for numeric).

For high-stakes evals, several different judge models (typically three) can be ensembled: each case is graded by different LLMs, and the verdicts are combined by unanimous agreement or majority vote. The variance reduction is meaningful; the cost is N× judge inference.

For most platforms, a single judge with periodic human spot-checks is the practical balance.
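The aggregation step can be sketched as follows, whether the N verdicts come from repeated runs of one judge or from different judge models. Treating a tie as a fail is an assumption chosen here to be conservative; the function names are illustrative.

```python
from collections import Counter
from statistics import mean


def majority_verdict(verdicts):
    """Most common verdict across runs; a tie resolves to 'fail' (assumption)."""
    if not verdicts:
        raise ValueError("need at least one verdict")
    ranked = Counter(verdicts).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return "fail"
    return ranked[0][0]


def aggregate_runs(runs):
    """Majority vote per criterion over runs shaped like {criterion: verdict}."""
    return {name: majority_verdict([run[name] for run in runs]) for name in runs[0]}


def mean_score(scores):
    """Average for numeric judge scores."""
    return mean(scores)


runs = [
    {"mentions_status": "pass", "under_100_words": "pass"},
    {"mentions_status": "pass", "under_100_words": "fail"},
    {"mentions_status": "fail", "under_100_words": "fail"},
]
print(aggregate_runs(runs))
```

An odd N avoids ties for binary verdicts, which is one reason three runs is a common choice.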


What the mechanism does not solve

  • Bad labels. No mechanism saves a wrong label. The mechanism scores against the label; if the label is wrong, the score is wrong.
  • Bad rubric. A rubric that does not capture what the team values produces a score the team distrusts.
  • The cost of judging. Each rubric-graded case costs an LLM call (or a human); at scale, this is real cost. Chapter 08 covers economics.

Common mistakes

One mechanism for the whole set. Forced exact-match on open-ended outputs (the chapter-opening case) or forced rubric on classifications (over-engineering and noise).

Reference comparison with one rigid reference. Alternative-but-correct answers fail; the score under-states the system's quality.

Rubric without calibration. Judge variance is unmonitored; scores have hidden noise.

LLM judge biased by the system being evaluated. If the judge is the same model that produced the output, scores skew in favour of that model's own outputs.

Single-judge without spot-checks. Drift in the judge's behaviour goes unnoticed; periodic calibration against human labels is what detects it and re-aligns the judge.
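A spot-check against human labels can be reduced to a few numbers, sketched below. Raw agreement and the pass-rate gap follow directly from the text; Cohen's kappa is an addition here, a standard chance-corrected agreement statistic. The function names are illustrative, and what counts as acceptable agreement is a team decision.

```python
def agreement_rate(judge, human):
    """Fraction of cases where the judge and the human gave the same verdict."""
    if len(judge) != len(human) or not judge:
        raise ValueError("need two non-empty label lists of equal length")
    return sum(j == h for j, h in zip(judge, human)) / len(judge)


def pass_rate_gap(judge, human):
    """Judge pass-rate minus human pass-rate; positive means a lenient judge."""
    return judge.count("pass") / len(judge) - human.count("pass") / len(human)


def cohens_kappa(judge, human):
    """Agreement corrected for chance: (p_o - p_e) / (1 - p_e)."""
    n = len(judge)
    observed = agreement_rate(judge, human)
    labels = set(judge) | set(human)
    expected = sum((judge.count(l) / n) * (human.count(l) / n) for l in labels)
    if expected == 1.0:
        return 1.0  # both sides used one identical label throughout
    return (observed - expected) / (1 - expected)


judge_labels = ["pass", "pass", "fail", "fail"]
human_labels = ["pass", "fail", "fail", "fail"]
print(agreement_rate(judge_labels, human_labels))  # 0.75
print(pass_rate_gap(judge_labels, human_labels))   # 0.25
print(cohens_kappa(judge_labels, human_labels))    # 0.5
```

Tracking these numbers across spot-check rounds is what makes drift visible: a falling kappa or a widening pass-rate gap signals that the judge prompt or the rubric needs another look.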


Interview Q&A

Q1. The eval scores everything as failure even though the system's outputs look right. What is the likely cause?

Mechanism mismatch. The labels use exact-match; the outputs are open-ended (different wording for the same meaning); exact-match scores them as failure. The fix is to switch to rubric or reference for those cases. The labels were not necessarily wrong; the mechanism was. The investigation is to look at a few cases manually: if the outputs look correct, the score is an artefact of the mechanism.

Wrong-answer notes: "the system regressed" without checking the mechanism is the chapter-opening misdiagnosis.

Q2. Walk through how you would pick a mechanism for each case type in a customer-support eval.

Intent classifications and ID extractions: exact-match. Decisions (approve/refuse/escalate): exact-match. Short responses to specific questions ("what is the status of order X"): rubric with 3–5 criteria covering completeness and accuracy. Long explanations or paraphrased responses: reference comparison if a canonical good answer exists, otherwise rubric. The set's metadata records the mechanism per case; the runner applies the right comparator. The mix is the normal state.

Wrong-answer notes: one mechanism for everything is the over-uniform failure.

Q3. The team uses LLM-as-judge with one model. The scores are noisy. What do you do?

Three options. First, run the judge N times per case and aggregate (majority for binary criteria; average for numeric); this reduces variance. Second, use multiple judge models and ensemble them; this reduces single-model bias. Third, calibrate the judge against human labels on a subset; if the judge's pass-rate on the subset diverges from the human labels, refine the judge prompt or the rubric. Most platforms start with a single judge run three times per case and add an ensemble only for high-stakes cases.

Wrong-answer notes: "accept the noise" loses signal that matters.

Q4. You use the same model as both the system being evaluated and the judge. What is the risk?

Self-evaluation bias. The judge may favour outputs the same model would have produced; scores look better than they would with an independent judge. The mitigation is to use a different model as the judge, ideally a more capable model than the system, so the judge's evaluations are at least as discerning as the system's outputs. Alternatively, use a model from a different family. The judge's independence is the discipline.

Wrong-answer notes: "the judge is fine because it's the same model" misses the bias direction.


What to do differently after reading this

  • Pick the mechanism per case type, not per set. Mix is normal.
  • Record the mechanism in the case metadata; the runner applies the right comparator.
  • Use rubric with calibration for open-ended outputs.
  • Use a different model as judge than the system being evaluated.
  • Run multi-judge or aggregate across judge runs for variance reduction.

Bridge. Mechanism fit makes scores meaningful. Operating the set at scale is the next discipline — the economics of running evals on every change, every nightly job, every production sample. → 08-cost-and-throughput.md