LLM Judge — Analysis

The three biases this implementation addresses

1. Position bias. Models given two answers tend to prefer whichever comes first (or sometimes second, depending on the model). The defence: run the comparison twice with reversed order; surface disagreements.

2. Length bias. Models often prefer longer answers (more detail = more confident-looking). Counter-strategy: include conciseness in the rubric so length isn't conflated with quality; calibrate per-criterion.

3. Self-enhancement bias. Models prefer text that resembles their own style. Mitigation: use a different judge model from the generator; or use multiple judges and aggregate.

The shuffle defence in detail

Run 1: judge(query, A, B) → picks "first arg" or "second arg"
Run 2: judge(query, B, A) → picks "first arg" or "second arg"
Translate to canonical (A vs. B in original order):
  Run 1: winner is A iff judge picked first arg.
  Run 2: winner is A iff judge picked second arg.
If Run 1 and Run 2 agree → high-confidence winner.
If they disagree → position bias detected; report "tied" or flag for human review.

The test test_compare_disagreement_surfaces checks that this catches a position-biased judge (one that always picks whichever answer is presented first, regardless of content). Under the shuffle, such a judge picks "first arg" in both runs, which translates to A in Run 1 and B in Run 2, so the disagreement is caught.
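The walkthrough above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation's actual Judge.compare: the function name compare_with_shuffle, the JudgeFn signature, and the two toy judges are all assumptions made for the example.

```python
from typing import Callable, Literal

# Hypothetical judge signature: returns which positional argument it preferred.
Pick = Literal["first", "second"]
JudgeFn = Callable[[str, str, str], Pick]


def compare_with_shuffle(judge: JudgeFn, query: str, a: str, b: str) -> str:
    """Return "A", "B", or "tied" after judging in both orders."""
    run1 = judge(query, a, b)  # A is shown first
    run2 = judge(query, b, a)  # B is shown first

    # Translate positional picks back to the canonical labels.
    winner1 = "A" if run1 == "first" else "B"
    winner2 = "A" if run2 == "second" else "B"

    # Agreement means the verdict survived the order swap.
    return winner1 if winner1 == winner2 else "tied"


def always_first(query: str, first: str, second: str) -> Pick:
    """A deliberately position-biased judge: ignores content entirely."""
    return "first"


def prefers_longer(query: str, first: str, second: str) -> Pick:
    """A crude content-sensitive judge: picks the longer answer."""
    return "first" if len(first) >= len(second) else "second"
```

With always_first, the two runs translate to A and then B, so the result is "tied". With prefers_longer, the verdict follows the content and survives the swap. Note that prefers_longer is itself length-biased; the shuffle only addresses position.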

Why scoring beats pairwise (and vice versa)

Scoring (Judge.score) gives multi-dimensional rubric output: correctness, helpfulness, conciseness, format. Useful for:

  • Tracking a single answer's quality over time.
  • Diagnosing what is wrong (high correctness, low conciseness → answer is right but bloated).
  • Aggregating across many answers.

Pairwise (Judge.compare) gives a single winner. Useful for:

  • A/B testing two model versions.
  • Tournament-style ranking.
  • Cases where absolute scoring is hard (judges struggle with "is this a 3 or a 4?" but are good at "is A better than B?").

Production teams typically use both: scoring for regression testing, pairwise for model selection.
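To make the diagnostic use of rubric scores concrete, here is a small sketch of what a multi-dimensional result could look like. The RubricScore class, the 1-5 scale, and both helper names are assumptions for illustration; the real return type of Judge.score is not shown in this document.

```python
from dataclasses import dataclass, fields
from statistics import mean


@dataclass(frozen=True)
class RubricScore:
    """Hypothetical per-answer rubric result on an assumed 1-5 scale."""
    correctness: int
    helpfulness: int
    conciseness: int
    format: int

    def weakest(self) -> str:
        """Name of the lowest-scoring criterion (first one wins ties)."""
        return min(fields(self), key=lambda f: getattr(self, f.name)).name


def mean_by_criterion(scores: list[RubricScore]) -> dict[str, float]:
    """Average each criterion across many answers."""
    return {
        f.name: mean(getattr(s, f.name) for s in scores)
        for f in fields(RubricScore)
    }


# High correctness, low conciseness: the answer is right but bloated.
bloated = RubricScore(correctness=5, helpfulness=4, conciseness=2, format=4)
print(bloated.weakest())  # conciseness
```

Keeping the criteria as separate fields is what allows both per-answer diagnosis (weakest) and aggregation across a regression set (mean_by_criterion); a single scalar score would support neither.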

What this implementation deliberately omits

  • Real LLM call. The mock simulates a length-biased judge. In production, plug in a judge that is stronger than, or at least different from, the generator (for example, GPT-4 judging GPT-3.5 output, or Claude judging GPT-4 output).
  • Multi-judge ensembling. Run N independent judges and aggregate their verdicts. This is a common production technique.
  • Calibration to human ratings. A judge's scores should correlate with human ratings on a held-out set. Without calibration, you trust scores you haven't verified.
  • Chain-of-thought reasoning. Asking the judge to "think step by step" before scoring often improves quality.

Interview probes

  • "Name three known LLM-judge biases and how you mitigate each."
  • "When does pairwise comparison beat absolute scoring?"
  • "How would you calibrate a judge to human ratings?"
  • "What if the judge model is the same as the generator?"
  • "Walk through the shuffle defence for position bias."

The test suite mirrors what an interviewer would push on, particularly test_compare_disagreement_surfaces, which demonstrates that the structural defence works on a deliberately biased judge.