LLM Judge: Analysis

The three biases this implementation addresses

1. Position bias. Models given two answers tend to prefer whichever comes first (or sometimes second, depending on the model). The defence: run the comparison twice with reversed order; surface disagreements.

2. Length bias. Models often prefer longer answers, because more detail reads as more confident. Counter-strategy: make conciseness an explicit rubric criterion so that length is not conflated with quality, and calibrate each criterion separately.

3. Self-enhancement bias. Models tend to prefer text that they generated themselves or that resembles their own style. Mitigation: use a different judge model from the generator, or use multiple judges and aggregate their verdicts.
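The "multiple judges, aggregate" mitigation can be sketched as a strict-majority vote. This is an illustration only: the `PairwiseJudge` signature and the `majority_verdict` helper are hypothetical names, not part of this implementation.

```python
from collections import Counter
from typing import Callable, Sequence

# Hypothetical judge signature (an assumption for this sketch):
# (query, answer_a, answer_b) -> "A", "B", or "tie".
PairwiseJudge = Callable[[str, str, str], str]


def majority_verdict(
    judges: Sequence[PairwiseJudge], query: str, a: str, b: str
) -> str:
    """Return the verdict backed by a strict majority of judges, else "tie"."""
    if not judges:
        raise ValueError("at least one judge is required")
    votes = Counter(judge(query, a, b) for judge in judges)
    verdict, count = votes.most_common(1)[0]
    # Require more than half of the votes so a split panel is not reported as a win.
    return verdict if count * 2 > len(judges) else "tie"


# Toy judges standing in for real model calls.
prefers_a = lambda query, a, b: "A"
prefers_b = lambda query, a, b: "B"

print(majority_verdict([prefers_a, prefers_a, prefers_b], "q", "x", "y"))
print(majority_verdict([prefers_a, prefers_b], "q", "x", "y"))
```

A strict majority is one possible aggregation rule; a weighted vote or a mean over rubric scores would fit the same shape.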

The shuffle defence in detail

Run 1: judge(query, A, B) → picks "first arg" or "second arg"
Run 2: judge(query, B, A) → picks "first arg" or "second arg"
Translate to canonical (A vs. B in original order):
  Run 1: winner is A iff judge picked first arg.
  Run 2: winner is A iff judge picked second arg.
If Run 1 and Run 2 agree → high-confidence winner.
If they disagree → position bias detected; report "tied" or flag for human review.
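The steps above can be sketched in Python as follows. The `RawJudge` signature (returning "first" or "second") and the `compare_with_shuffle` helper are assumptions made for illustration; the actual `Judge.compare` may be shaped differently.

```python
from typing import Callable

# Hypothetical raw judge (an assumption for this sketch): given
# (query, first_answer, second_answer), it returns "first" or "second".
RawJudge = Callable[[str, str, str], str]


def compare_with_shuffle(judge: RawJudge, query: str, a: str, b: str) -> dict:
    """Run the judge in both orders and translate each pick back to A or B."""
    run1 = judge(query, a, b)  # A is shown first
    run2 = judge(query, b, a)  # B is shown first
    for pick in (run1, run2):
        if pick not in ("first", "second"):
            raise ValueError(f"unexpected judge output: {pick!r}")
    winner1 = "A" if run1 == "first" else "B"
    winner2 = "A" if run2 == "second" else "B"
    agreed = winner1 == winner2
    return {
        "winner": winner1 if agreed else "tie",
        "position_bias_suspected": not agreed,
        "runs": (winner1, winner2),
    }


# A judge that always picks whatever is shown first is caught as a disagreement.
always_first = lambda query, first, second: "first"
print(compare_with_shuffle(always_first, "q", "answer A", "answer B"))
```

Returning both per-run winners alongside the final verdict keeps the disagreement visible, so a caller can route it to human review instead of silently reporting a tie.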

The test test_compare_disagreement_surfaces demonstrates that this catches a position-biased judge (one that always picks whichever answer is presented first, regardless of content). With the shuffle, the judge picks "first input" twice, which translates to A in run 1 and B in run 2, so the disagreement is caught.

Why scoring beats pairwise (and vice versa)

Scoring (Judge.score) gives multi-dimensional rubric output: correctness, helpfulness, conciseness, format. Useful for:

Tracking a single answer's quality over time.
Diagnosing what is wrong (high correctness, low conciseness → answer is right but bloated).
Aggregating across many answers.
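To make the diagnostic and aggregation uses concrete, here is a minimal sketch of a rubric result. The `RubricScore` type, the 1-to-5 scale, and both helper functions are assumptions for illustration; the real return type of `Judge.score` is not shown in this document.

```python
from dataclasses import dataclass, fields
from statistics import mean


@dataclass(frozen=True)
class RubricScore:
    """Hypothetical rubric result; a 1-5 scale per criterion is assumed."""
    correctness: int
    helpfulness: int
    conciseness: int
    format: int


def weakest_criterion(score: RubricScore) -> str:
    """Name the lowest-scoring criterion (first one wins on ties)."""
    return min(fields(score), key=lambda f: getattr(score, f.name)).name


def mean_per_criterion(scores: list[RubricScore]) -> dict[str, float]:
    """Aggregate many answers into one mean per criterion."""
    return {
        f.name: mean(getattr(s, f.name) for s in scores)
        for f in fields(RubricScore)
    }


# Right but bloated: high correctness, low conciseness.
bloated = RubricScore(correctness=5, helpfulness=4, conciseness=2, format=4)
print(weakest_criterion(bloated))
print(mean_per_criterion([bloated, RubricScore(3, 4, 4, 2)]))
```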

Pairwise (Judge.compare) gives a single winner. Useful for:

A/B testing two model versions.
Tournament-style ranking.
Cases where absolute scoring is hard (judges struggle with "is this a 3 or a 4?" but are good at "is A better than B?").
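Tournament-style ranking can be built from pairwise verdicts alone. The sketch below uses a round-robin with one point per win and half a point per tie. The `Compare` signature and `round_robin_ranking` are hypothetical names, and the scoring rule is one simple choice among several (Elo or Bradley-Terry ratings are common alternatives).

```python
from itertools import combinations
from typing import Callable

# Hypothetical pairwise comparator (an assumption for this sketch):
# (answer_a, answer_b) -> "A", "B", or "tie".
Compare = Callable[[str, str], str]


def round_robin_ranking(
    candidates: dict[str, str], compare: Compare
) -> list[tuple[str, float]]:
    """Rank named answers by points: 1 per win, 0.5 each per tie."""
    points = {name: 0.0 for name in candidates}
    for left, right in combinations(candidates, 2):
        verdict = compare(candidates[left], candidates[right])
        if verdict == "A":
            points[left] += 1.0
        elif verdict == "B":
            points[right] += 1.0
        else:
            points[left] += 0.5
            points[right] += 0.5
    # Highest score first; the stable sort keeps input order among equals.
    return sorted(points.items(), key=lambda item: item[1], reverse=True)


# Toy comparator that prefers the longer answer, standing in for a real judge.
def longer_wins(a: str, b: str) -> str:
    if len(a) == len(b):
        return "tie"
    return "A" if len(a) > len(b) else "B"


print(round_robin_ranking({"v1": "aa", "v2": "aaaa", "v3": "a"}, longer_wins))
```

A round-robin needs n(n-1)/2 comparisons, so for many candidates a sampled or bracketed tournament may be cheaper.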

Production teams typically use both: scoring for regression testing, pairwise for model selection.

What this implementation deliberately omits

Real LLM call. The mock simulates a length-biased judge. In production, plug in a judge that is stronger than the generator or comes from a different model family (for example, GPT-4 judging GPT-3.5 output, or Claude judging GPT-4 output).
Multi-judge ensembling. Run N independent judges and aggregate their verdicts, for example by majority vote or by averaging scores. This is a common production technique.
Calibration to human ratings. A judge's scores should correlate with human ratings on a held-out set. Without calibration, you trust scores you haven't verified.
Chain-of-thought reasoning. Asking the judge to "think step by step" before scoring often improves quality.
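As an illustration of the calibration item, one way to check a judge against human ratings is a rank correlation on a held-out set. The sketch below computes Spearman's coefficient in plain Python. The function names are hypothetical, and the choice of Spearman is an assumption; other agreement measures, such as Cohen's kappa, are also used.

```python
from math import sqrt
from typing import Sequence


def average_ranks(values: Sequence[float]) -> list[float]:
    """1-based ranks, with tied values sharing the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # average rank of the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


def spearman(judge_scores: Sequence[float], human_scores: Sequence[float]) -> float:
    """Spearman rank correlation between judge and human ratings."""
    if len(judge_scores) != len(human_scores) or len(judge_scores) < 2:
        raise ValueError("need two equal-length samples with at least 2 items")
    return pearson(average_ranks(judge_scores), average_ranks(human_scores))


# Made-up ratings for the same five held-out answers, for illustration only.
print(spearman([4, 2, 5, 3, 1], [5, 2, 4, 3, 1]))
```

A coefficient near 1 suggests the judge orders answers the way humans do; what threshold counts as good enough is a project decision, not something this sketch settles.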

Interview probes

"Name three known LLM-judge biases and how you mitigate each."
"When does pairwise comparison beat absolute scoring?"
"How would you calibrate a judge to human ratings?"
"What if the judge model is the same as the generator?"
"Walk through the shuffle defence for position bias."

The test suite mirrors what an interviewer would push on, particularly test_compare_disagreement_surfaces, which shows that the structural defence works on a deliberately biased judge.