Exercise 12 — LLM-as-Judge Loop

Timebox: 45 minutes

Goal

Implement a small rubric-based judge loop with bias mitigations. This is a miniature version of Module 11's assignment, sized for a live coding rep.

Work in

  • judge.py

Tasks

  1. Judge.score(query, answer, rubric) returns a score for each rubric criterion plus the judge's reasoning, using a single LLM call (or a mock for an offline rep).
  2. Multi-dimensional rubric (correctness, helpfulness, conciseness, format).
  3. Pairwise evaluation: Judge.compare(query, a, b) returns the preferred answer and the reasoning.
  4. Bias mitigation: shuffle the order in which a and b are shown in compare, run the comparison twice, and report any disagreements between the two runs.
  5. Aggregate over a list of (query, answer) pairs into a CSV report.
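
One possible starting shape for judge.py, covering the five tasks above, is sketched below. Everything beyond Judge.score and Judge.compare is an assumption made for illustration: the mock_llm function, the JSON prompt/reply protocol, the _ask helper, the report_csv function, and the 1-5 score scale are all hypothetical. The mock is deliberately position-biased so that the double run has something to catch, and the second run uses the opposite order from the first, which is one reading of "shuffle, then run twice".

```python
import csv
import io
import json
import random

RUBRIC = ("correctness", "helpfulness", "conciseness", "format")


def mock_llm(prompt):
    """Offline stand-in for a real model call (hypothetical JSON protocol)."""
    req = json.loads(prompt)
    if req["task"] == "score":
        # Toy heuristic: 3 for every criterion; short answers earn 5 for conciseness.
        scores = {c: 3 for c in req["rubric"]}
        if "conciseness" in scores and len(req["answer"]) <= 40:
            scores["conciseness"] = 5
        return json.dumps({"scores": scores, "reasoning": "mock heuristic"})
    # Deliberately position-biased: always prefers whichever answer is shown first.
    return json.dumps({"winner": "first", "reasoning": "mock always picks first"})


class Judge:
    def __init__(self, llm=mock_llm, seed=0):
        self.llm = llm                  # any callable: prompt str -> JSON reply str
        self.rng = random.Random(seed)  # seeded so shuffles are reproducible

    def score(self, query, answer, rubric=RUBRIC):
        """One LLM call; returns {criterion: score, ..., 'reasoning': str}."""
        prompt = json.dumps({"task": "score", "query": query,
                             "answer": answer, "rubric": list(rubric)})
        reply = json.loads(self.llm(prompt))
        return {**reply["scores"], "reasoning": reply["reasoning"]}

    def _ask(self, query, a, b, a_first):
        """Single pairwise call; maps the positional verdict back to 'a' or 'b'."""
        first, second = (a, b) if a_first else (b, a)
        prompt = json.dumps({"task": "compare", "query": query,
                             "first": first, "second": second})
        reply = json.loads(self.llm(prompt))
        picked_first = reply["winner"] == "first"
        preferred = "a" if picked_first == a_first else "b"
        return {"preferred": preferred, "reasoning": reply["reasoning"]}

    def compare(self, query, a, b):
        """Random order for run 1, opposite order for run 2; flag disagreement."""
        a_first = self.rng.random() < 0.5
        runs = [self._ask(query, a, b, a_first),
                self._ask(query, a, b, not a_first)]
        agree = runs[0]["preferred"] == runs[1]["preferred"]
        return {"preferred": runs[0]["preferred"] if agree else None,
                "reasoning": runs[0]["reasoning"],
                "disagreement": not agree}


def report_csv(judge, pairs, rubric=RUBRIC):
    """Score each (query, answer) pair and return the report as CSV text."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=["query", "answer", *rubric, "reasoning"])
    writer.writeheader()
    for query, answer in pairs:
        writer.writerow({"query": query, "answer": answer,
                         **judge.score(query, answer, rubric)})
    return buf.getvalue()
```

With the biased mock, Judge().compare(...) reports a disagreement on every pair, because the verdict follows position rather than content. Swapping in a real model only requires passing a different llm callable that speaks the same (assumed) JSON protocol.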

Done when

  • A mock judge passes a hand-crafted test set
  • Pairwise comparison runs with shuffle and disagreements are surfaced
  • You can articulate three known judge biases (position, length, self-enhancement) and the mitigations you implemented

Stretch

  • Add a self-consistency loop: judge n times, return modal score with confidence
  • Calibrate against a 10-row human-scored set; print the Spearman rank correlation between judge scores and human scores