AI Engineering Playbook

Exercise 12 — LLM-as-Judge Loop

Timebox: 45 minutes

Goal

Implement a small rubric-based judge loop with bias mitigations. This is a miniature version of Module 11's assignment, sized for a live coding rep.

Work in

judge.py

Tasks

Implement Judge.score(query, answer, rubric) -> {criterion: score, reasoning} using a single LLM call (or a mock for an offline rep).
Use a multi-dimensional rubric (correctness, helpfulness, conciseness, format).
Add pairwise evaluation: Judge.compare(query, a, b) returns the preferred answer plus reasoning.
Mitigate bias: shuffle the order in which a and b are presented in compare, run the comparison twice, and report any cases where the two runs disagree.
Aggregate scores over a list of (query, answer) pairs into a CSV report.
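
As a starting point, the scoring and reporting tasks above could be sketched as follows. This is only one possible shape, not the reference solution: the prompt wording, the JSON reply format, the 1 to 5 scale, and the names mock_llm, CRITERIA, RUBRIC, and write_report are all assumptions made for illustration. The mock scores by word count alone so the sketch runs offline.

```python
import csv
import io
import json
from typing import Callable

CRITERIA = ("correctness", "helpfulness", "conciseness", "format")


def mock_llm(prompt: str) -> str:
    """Offline stand-in for a model call (hypothetical): scores by word count only."""
    answer = prompt.split("ANSWER:", 1)[1].strip()
    scores = {name: 3 for name in CRITERIA}
    scores["conciseness"] = 5 if len(answer.split()) <= 10 else 2
    return json.dumps({"scores": scores, "reasoning": "mock: word count only"})


class Judge:
    def __init__(self, llm: Callable[[str], str] = mock_llm):
        # Any callable mapping a prompt string to a JSON reply string can be swapped in.
        self.llm = llm

    def score(self, query: str, answer: str, rubric: dict[str, str]) -> dict:
        """Score one answer on every rubric criterion with a single LLM call."""
        rubric_text = "\n".join(f"- {name}: {desc}" for name, desc in rubric.items())
        prompt = (
            "Score the answer from 1 to 5 on each criterion. "
            "Reply as JSON with keys 'scores' and 'reasoning'.\n"
            f"RUBRIC:\n{rubric_text}\nQUERY: {query}\nANSWER: {answer}"
        )
        reply = json.loads(self.llm(prompt))
        result = {name: int(reply["scores"][name]) for name in rubric}
        result["reasoning"] = reply["reasoning"]
        return result


def write_report(judge: Judge, pairs, rubric: dict[str, str], out) -> None:
    """Write one CSV row per (query, answer) pair, one column per criterion."""
    writer = csv.DictWriter(out, fieldnames=["query", "answer", *rubric, "reasoning"])
    writer.writeheader()
    for query, answer in pairs:
        writer.writerow({"query": query, "answer": answer, **judge.score(query, answer, rubric)})


# Placeholder rubric descriptions and made-up pairs, for illustration only.
RUBRIC = {name: f"How well the answer does on {name}" for name in CRITERIA}
pairs = [
    ("What is 2 + 2?", "4"),
    ("What is 2 + 2?", "The answer, after careful thought and much deliberation on the matter, is 4."),
]
buf = io.StringIO()
write_report(Judge(), pairs, RUBRIC, buf)
report = buf.getvalue()
print(report)
```

Passing the LLM in as a plain callable keeps the mock and a real client interchangeable, so the same Judge can run in tests and against a live model.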

Done when

A mock judge passes a hand-crafted test set
Pairwise comparison runs with shuffle and disagreements are surfaced
You can articulate three known judge biases (position, length, self-enhancement) and the mitigations you implemented
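
One way to check that disagreements are actually surfaced is to run compare against a mock that is deliberately position-biased. The sketch below assumes a judge that replies with just "1" or "2"; the prompt layout, the return keys, and both mock functions are hypothetical. It shuffles which answer takes the first slot, reruns with the slots swapped, and flags the pair when the two votes differ.

```python
import random
from typing import Callable


def first_slot_biased_llm(prompt: str) -> str:
    """Mock with pure position bias: always prefers whichever answer is shown first."""
    return "1"


def shorter_answer_llm(prompt: str) -> str:
    """Mock with a content-based preference: picks the shorter answer in either slot."""
    first = prompt.split("ANSWER 1:", 1)[1].split("ANSWER 2:", 1)[0].strip()
    second = prompt.split("ANSWER 2:", 1)[1].strip()
    return "1" if len(first) <= len(second) else "2"


class Judge:
    def __init__(self, llm: Callable[[str], str], seed=None):
        self.llm = llm
        self.rng = random.Random(seed)  # seeded so a test run is reproducible

    def compare(self, query: str, a: str, b: str) -> dict:
        """Ask twice with the slot order swapped; agree or flag a disagreement."""
        order = [("a", a), ("b", b)]
        self.rng.shuffle(order)  # randomize which answer is shown first
        votes = []
        for (label1, text1), (label2, text2) in (order, order[::-1]):
            prompt = (
                "Which answer is better? Reply with 1 or 2.\n"
                f"QUERY: {query}\nANSWER 1: {text1}\nANSWER 2: {text2}"
            )
            pick = self.llm(prompt).strip()
            votes.append(label1 if pick == "1" else label2)
        agree = votes[0] == votes[1]
        return {
            "preferred": votes[0] if agree else None,
            "votes": votes,
            "disagreement": not agree,
            "reasoning": (
                f"both orderings chose {votes[0]}"
                if agree
                else "orderings disagreed: possible position bias"
            ),
        }


query, a, b = "What is 2 + 2?", "4", "It is 4, I think."
biased = Judge(first_slot_biased_llm, seed=0).compare(query, a, b)
stable = Judge(shorter_answer_llm, seed=0).compare(query, a, b)
print(biased)
print(stable)
```

Returning None as the preferred answer on disagreement is a design choice in this sketch; a real loop might instead record both votes and break ties with a third call.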

Stretch

Add a self-consistency loop: judge n times, return modal score with confidence
Calibrate against a 10-row human-scored set; print the Spearman rank correlation between judge and human scores
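
Both stretch items can be sketched in plain Python. This assumes that "confidence" means the share of runs that agree with the modal score, and it computes Spearman by hand as the Pearson correlation of average ranks (scipy.stats.spearmanr would be the usual library alternative). The function names and the ten score pairs are made-up placeholders, not real calibration data.

```python
from collections import Counter


def self_consistent_score(score_once, n: int = 5):
    """Call score_once n times; return (modal score, share of runs matching it)."""
    counts = Counter(score_once() for _ in range(n))
    mode, hits = counts.most_common(1)[0]
    return mode, hits / n


def average_ranks(values):
    """Rank values from 1, giving tied values the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def spearman(xs, ys) -> float:
    """Spearman correlation: Pearson correlation computed on the ranks.

    Raises ZeroDivisionError if either input is constant.
    """
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((p - mx) * (q - my) for p, q in zip(rx, ry))
    var_x = sum((p - mx) ** 2 for p in rx)
    var_y = sum((q - my) ** 2 for q in ry)
    return cov / (var_x * var_y) ** 0.5


# Simulated judge outputs for one answer across five runs (placeholder values).
runs = iter([4, 3, 4, 5, 4])
mode, confidence = self_consistent_score(lambda: next(runs), n=5)
print(f"modal score {mode}, confidence {confidence:.2f}")

# Placeholder 10-row calibration set: human scores vs. judge scores.
human = [5, 4, 4, 3, 2, 5, 1, 3, 2, 4]
judge = [5, 4, 3, 3, 2, 4, 1, 2, 2, 5]
print(f"Spearman: {spearman(human, judge):.2f}")
```

With only ten rows the correlation is a rough sanity check on the judge rather than a precise calibration figure.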