AI Evals Release Gates

The chapters in this module, in reading order.

#	Chapter
00	Evals & Production — The Five-Year-Old Version
01	Shipping on vibes — when a flawless demo hides a 38-point quality drop
02	Eval taxonomy — four axes, one decision per cell
03	Golden datasets — the labelled tray that turns every eval claim into evidence
04	Synthetic generation — manufacturing the cases you cannot hand-write fast enough
05	The metrics zoo — three families, one honest truth, many lying numbers
06	LLM as judge — verification is cheaper than generation
07	Rubric design — when two careful readers score the same chat and disagree
08	Judge calibration — the rubric is anchored, but the judge still drifts
09	Drift detection — when 78% quietly becomes 64% and nobody pages
10	A/B testing — when the offline winner loses the live argument
11	Logging & tracing — when the A/B winner has a 9% bug nobody can name
12	Alerting & dashboards — turning ten thousand traces into one glance and one page
13	Eval-driven development — when the test is written before the prompt
14	Honest admission — five things evals still cannot do for you