Exercise 12: LLM-as-Judge Loop
Timebox: 45 minutes
Goal
Implement a small rubric-based judge loop with bias mitigations. A miniature version of Module 11's assignment, sized for a live coding rep.
Work in
judge.py
Tasks
- Judge.score(query, answer, rubric) -> {criterion: score, reasoning} using a single LLM call (or a mock for offline rep).
- Multi-dimensional rubric (correctness, helpfulness, conciseness, format).
- Pairwise eval Judge.compare(query, a, b) returning preferred answer + reasoning.
- Bias mitigation: shuffle order in compare, then run twice and report disagreements.
- Aggregate over a list of (query, answer) pairs into a CSV report.
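One possible shape for judge.py is sketched below, assuming an offline mock instead of a real model. The `llm(task, payload)` callable, `mock_llm`, `write_report`, and the dict layouts are all hypothetical names chosen for illustration; a real backend would render the payload into a prompt and parse the model's reply. The sketch reads "shuffle, then run twice" as: pick a random order for the first run, then repeat with the order swapped.

```python
import csv
import io
import random

RUBRIC = ["correctness", "helpfulness", "conciseness", "format"]


def mock_llm(task, payload):
    """Offline stand-in for a model call (hypothetical interface, crude heuristics)."""
    if task == "score":
        query_words = set(payload["query"].lower().split())
        answer = payload["answer"]
        on_topic = bool(query_words & set(answer.lower().split()))
        scores = {
            "correctness": 5 if on_topic else 2,
            "helpfulness": 4 if on_topic else 2,
            "conciseness": 5 if len(answer.split()) <= 20 else 3,
            "format": 5 if answer.strip().endswith(".") else 3,
        }
        return {"scores": {c: scores.get(c, 3) for c in payload["rubric"]},
                "reasoning": "mock heuristic: word overlap, length, punctuation"}
    if task == "compare":
        # Prefers the shorter answer; ties go to the first slot (a position bias).
        first, second = payload["first"], payload["second"]
        winner = "first" if len(first) <= len(second) else "second"
        return {"winner": winner, "reasoning": "mock heuristic: shorter answer"}
    raise ValueError(f"unknown task: {task}")


class Judge:
    def __init__(self, llm=mock_llm, seed=0):
        self.llm = llm
        self.rng = random.Random(seed)

    def score(self, query, answer, rubric=RUBRIC):
        """One backend call; returns {criterion: score, ..., 'reasoning': str}."""
        reply = self.llm("score", {"query": query, "answer": answer,
                                   "rubric": list(rubric)})
        return {**reply["scores"], "reasoning": reply["reasoning"]}

    def _compare_once(self, query, a, b, a_first):
        first, second = (a, b) if a_first else (b, a)
        reply = self.llm("compare", {"query": query, "first": first,
                                     "second": second})
        picked_first = reply["winner"] == "first"
        # Map the positional verdict back to the caller's labels.
        return ("a" if picked_first == a_first else "b"), reply["reasoning"]

    def compare(self, query, a, b):
        """Run both presentation orders; flag a disagreement instead of hiding it."""
        a_first = self.rng.random() < 0.5  # random order for the first run
        run1, why1 = self._compare_once(query, a, b, a_first)
        run2, why2 = self._compare_once(query, a, b, not a_first)
        agree = run1 == run2
        return {"preferred": run1 if agree else None,
                "disagreement": not agree,
                "reasoning": [why1, why2]}


def write_report(judge, pairs, out, rubric=RUBRIC):
    """Score each (query, answer) pair and write one CSV row per pair."""
    writer = csv.DictWriter(out, fieldnames=["query", "answer", *rubric, "reasoning"])
    writer.writeheader()
    for query, answer in pairs:
        writer.writerow({"query": query, "answer": answer,
                         **judge.score(query, answer, rubric)})


judge = Judge()
pairs = [("What is the capital of France?", "The capital of France is Paris."),
         ("What is the capital of France?", "I like turtles")]
buffer = io.StringIO()
write_report(judge, pairs, buffer)
report = buffer.getvalue()
verdict = judge.compare("What is 2 + 2?", "4.", "The answer to this question is 4.")
tie = judge.compare("What is 2 + 2?", "four", "4 ok")  # equal lengths
print(report)
print(verdict["preferred"], tie["disagreement"])
```

Because the mock breaks length ties in favor of the first slot, the equal-length pair gets a different winner in each order, which is the kind of position-driven flip the swapped second run is meant to surface.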
Done when
- A mock judge passes a hand-crafted test set
- Pairwise comparison runs with shuffle and disagreements are surfaced
- You can articulate three known judge biases (position, length, self-enhancement) and the mitigations you implemented
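As a rough illustration of the first criterion, a hand-crafted test set can be a list of cases with an acceptable score range each, including a probe for length bias. Everything here is a hypothetical sketch: `mock_score`, `KNOWN_FACTS`, `TEST_SET`, and `run_test_set` are illustrative names, and the mock scores by keyword lookup instead of calling a model.

```python
# Hypothetical reference facts the mock "knows"; a real judge would call a model.
KNOWN_FACTS = {"capital of france": "paris", "2 + 2": "4"}


def mock_score(query, answer):
    """Return a 1-5 correctness score from a keyword lookup (mock only)."""
    for topic, fact in KNOWN_FACTS.items():
        if topic in query.lower():
            return 5 if fact in answer.lower() else 1
    return 3  # unknown topic: neutral score


TEST_SET = [
    # (query, answer, lowest acceptable score, highest acceptable score)
    ("What is the capital of France?", "Paris.", 4, 5),
    ("What is the capital of France?", "Lyon.", 1, 2),
    ("What is 2 + 2?", "4", 4, 5),
    # Length-bias probe: a long wrong answer must not score like a right one.
    ("What is 2 + 2?", "After careful thought, the result is clearly 5. " * 3, 1, 2),
]


def run_test_set(score_fn, cases):
    """Return the cases whose score falls outside the expected range."""
    failures = []
    for query, answer, low, high in cases:
        got = score_fn(query, answer)
        if not low <= got <= high:
            failures.append((query, answer, got))
    return failures


failures = run_test_set(mock_score, TEST_SET)
print("all passed" if not failures else failures)
```

The same harness accepts any scoring function with that signature, so the mock can later be swapped for a real judge without changing the cases.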
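Stretch
- Add a self-consistency loop: call the judge n times and return the modal score with a confidence estimate (for example, the share of runs that agree with the mode)
- Calibrate against a 10-row human-scored set; print the Spearman rank correlation between judge and human scores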
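Both stretch items fit in a few lines of standard-library Python. The sketch below makes two assumptions: confidence is taken to be the fraction of runs that agree with the mode, and Spearman is computed by hand as the Pearson correlation of average ranks (so no SciPy is needed). The function names and the ten score pairs are made-up placeholders, not real calibration data.

```python
import random
from collections import Counter


def self_consistent_score(judge_once, n=5):
    """Call judge_once() n times; return (modal score, share of runs that agree).

    If several scores tie for the mode, the one seen first wins.
    """
    votes = Counter(judge_once() for _ in range(n))
    score, count = votes.most_common(1)[0]
    return score, count / n


def average_ranks(values):
    """1-based ranks, with tied values sharing the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman correlation as Pearson on ranks (undefined if either list is constant)."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Simulated noisy judge for the self-consistency loop (placeholder behavior).
rng = random.Random(0)
score, confidence = self_consistent_score(lambda: rng.choice([4, 4, 4, 5]), n=7)
print(f"modal score {score}, confidence {confidence:.2f}")

# Placeholder 10-row calibration set: human scores vs. judge scores.
human = [5, 4, 4, 3, 2, 5, 1, 3, 2, 4]
judged = [5, 5, 4, 3, 2, 4, 1, 2, 2, 4]
print(f"Spearman: {spearman(human, judged):.3f}")
```

With only ten rows the correlation is a coarse signal, so it is better read as a sanity check on the judge than as a precise calibration number.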