00. Evals & Production: The Five-Year-Old Version

Module 10 built agent teams. This module teaches you how to tell whether they actually work.

Imagine a busy restaurant on a Saturday night. Orders keep coming, customers keep waiting, and the chef is the model turning each ticket into a dish. The dining room is production, and the customers waiting at their tables are your users. One pretty plate at the pass does not tell you whether the kitchen will hold up for the next four hours.

Now enter the health inspector. The inspector does not clap for one pretty plate; the inspector checks whether the whole restaurant runs safely and consistently under load. That is what evals do for AI products. They do not ask, "Can the chef cook one great dish?" They ask, "Can this kitchen keep doing the right thing under pressure?" A demo is one table served nicely. Production is the full dining room during rush hour, and the gap between the two is where most AI products quietly fail.

So what goes wrong? Teams often trust the smiling waiter at the front and never inspect the kitchen behind the door. The system looks magical in a meeting on Monday and disappoints real users by Friday. The inspector's job is to break that habit by using checklists (rubrics), spot checks (sampling representative conversations instead of reading everything), kitchen logs (logging and tracing the hidden steps inside an agent), and shift-change reviews (watching what happens when you deploy a new model or prompt).

A restaurant can fail in many ways: the food can be wrong, late, or allergy-blind, or it can look fine while still being unsafe. AI systems fail in the same shapes: answers can be wrong, slow, or policy-violating, or they can sound polished while being harmful. That is why this module exists: to give you the inspector's toolkit before your kitchen starts losing tables.

The placeholders you will see called back

Placeholder	Meaning
the inspection	the eval — systematic quality assessment
the rubric	scoring criteria — observable, anchored
the spot check	sample-based eval — representative subset
the kitchen log	logging and tracing — internal visibility
the shift change	deployment — rolling out a new version
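
To make the rubric and the spot check concrete, here is a minimal Python sketch. Everything in it is an assumption for illustration: the `Criterion` class, the three example criteria, and the helper names `score`, `spot_check`, and `pass_rate` are hypothetical, and uniform random sampling is only the simplest stand-in for picking a representative subset.

```python
import random
from dataclasses import dataclass
from typing import Callable


@dataclass
class Criterion:
    """One observable rubric line: a name plus a pass/fail check."""
    name: str
    check: Callable[[str], bool]


# Hypothetical rubric. Real criteria would come from your own
# definition of an acceptable answer.
RUBRIC = [
    Criterion("non_empty", lambda answer: bool(answer.strip())),
    Criterion("under_500_chars", lambda answer: len(answer) <= 500),
    Criterion("no_placeholder_text", lambda answer: "TODO" not in answer),
]


def score(answer: str) -> dict[str, bool]:
    """Apply every rubric criterion to one answer."""
    return {c.name: c.check(answer) for c in RUBRIC}


def spot_check(answers: list[str], k: int, seed: int = 0) -> list[dict[str, bool]]:
    """Score a random sample of k answers instead of reading everything."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    sample = rng.sample(answers, min(k, len(answers)))
    return [score(a) for a in sample]


def pass_rate(results: list[dict[str, bool]]) -> float:
    """Fraction of sampled answers that pass every criterion."""
    if not results:
        return 0.0
    return sum(all(r.values()) for r in results) / len(results)
```

The rubric stays visible because each criterion has a name, so a failing spot check tells you which line of the checklist was missed, not just that something went wrong.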

See the restaurant flow

customers arrive ──→ dining room fills
                      │
                      ▼
chef cooks dish ──→ plate leaves kitchen
                      │
                      ▼
inspector checks rubric, spot checks, logs, and shift changes

When quality drops, the inspector wants evidence, not confidence. That same attitude keeps AI teams honest. If users complain, you need the dish, the checklist, and the kitchen log. If a new release lands, you need to inspect the shift change. If a dashboard looks green, you still need good spot checks. The habit to build is simple: measure before, during, and after launch.
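
One way to picture the kitchen log is a small recorder that writes one structured entry per internal step, all tied together by a shared trace ID. The sketch below is only an illustration: the `KitchenLog` class, its field names, and the step names in the usage lines are hypothetical, not the API of any tracing library.

```python
import json
import time
import uuid


class KitchenLog:
    """Collects one structured record per internal step of a single request."""

    def __init__(self) -> None:
        self.trace_id = uuid.uuid4().hex  # ties every step to one request
        self.steps: list[dict] = []

    def record(self, step: str, **details) -> None:
        """Append one step with a timestamp and any extra details."""
        self.steps.append({
            "trace_id": self.trace_id,
            "step": step,
            "timestamp": time.time(),
            **details,
        })

    def dump(self) -> str:
        """Serialize the steps as JSON Lines, one step per line."""
        return "\n".join(json.dumps(s) for s in self.steps)


# Hypothetical usage: two hidden steps behind one user-visible answer.
log = KitchenLog()
log.record("retrieve", query="allergy menu", documents_found=3)
log.record("generate", latency_ms=820)
print(log.dump())
```

When a user complains, the trace ID lets you pull every step behind that one dish instead of guessing which part of the kitchen failed.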

Memory map

Concept	Prerequisite	Pressure family	Recurs later as	Layer touched
the inspection (the eval)	a shipped feature with real users	shipping reliability under sampling	every chapter in this module	product / users
the rubric	a clear definition of "acceptable" output	rater consistency, ambiguity	judge calibration, EDD regression sets	scoring
the spot check	live traffic logs	cost vs coverage trade-off	drift detection, A/B sampling	sample
the kitchen log	structured logging on every step	post-hoc diagnosis of hidden failures	tracing, alerting, dashboards	runtime
the shift change	a release process you control	blast radius of a new prompt or model	A/B testing, post-deploy drift	deploy
the golden dataset	curated examples with known answers	preventing regression over time	EDD inner loop, judge agreement	data
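
The last two rows of the table can be combined into a small regression gate: run the old and new versions against the golden dataset before a shift change, and block the release if quality drops. This is a minimal sketch under loud assumptions: the three `GOLDEN` examples are invented, exact string matching stands in for whatever scoring you actually use, and `accuracy` and `safe_to_ship` are hypothetical helper names.

```python
from typing import Callable

# Hypothetical golden dataset: curated inputs with known-good answers.
GOLDEN = [
    {"input": "Is the soup vegan?", "expected": "yes"},
    {"input": "Does the cake contain nuts?", "expected": "yes"},
    {"input": "Is the bread gluten-free?", "expected": "no"},
]


def accuracy(system: Callable[[str], str], dataset: list[dict]) -> float:
    """Share of golden examples where the system's answer matches exactly."""
    hits = sum(
        system(ex["input"]).strip().lower() == ex["expected"]
        for ex in dataset
    )
    return hits / len(dataset)


def safe_to_ship(
    old: Callable[[str], str],
    new: Callable[[str], str],
    dataset: list[dict],
    tolerance: float = 0.0,
) -> bool:
    """Allow the shift change only if the new version does not regress."""
    return accuracy(new, dataset) >= accuracy(old, dataset) - tolerance
```

Here `old` and `new` are any callables that take a question and return an answer, for example thin wrappers around two prompt or model versions. Exact matching is rarely enough for free-form text, which is where the later chapters on rubrics and judges come in.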

What's coming

01-shipping-on-vibes.md — Why demo success hides production failure.
02-eval-taxonomy.md — The main eval types and when each one helps.
03-golden-datasets.md — How to build trusted evaluation sets with owners and versions.
04-synthetic-generation.md — How to generate many useful test cases quickly.
05-metrics-reference.md — Which metrics fit text similarity, behavior, and product outcomes.
06-llm-as-judge.md — How one model can score another model responsibly.
07-rubric-design.md — How to write crisp scoring criteria with anchors.
08-judge-calibration.md — How to make judges agree and reduce bias.
09-drift-detection.md — How to catch quality decay after shipping.
10-ab-testing.md — How to compare versions safely in live traffic.
11-logging-tracing.md — How to see what happened inside an LLM workflow.
12-alerting-dashboards.md — What to watch, alert on, and review daily.
13-eval-driven-development.md — How to make evals the inner loop of improvement.
14-honest-admission.md — What evals still miss, and why humility matters.

Top resources

OpenAI Evals guide — Practical starting point for building repeatable AI evaluations.
Anthropic test and evaluate docs — Clear framing on defining success criteria before prompt tweaking.
LangSmith evaluation docs — Good product-style examples for offline and online eval workflows.
Braintrust evals overview — Useful for dataset versioning, experiment tracking, and regression checks.
Arize Phoenix evaluations — Helpful for tracing, drift, and LLM-as-judge patterns together.
promptfoo docs — Lightweight way to run prompt comparisons and assertion-based checks locally.

Evals are just disciplined inspection. Rubrics make the inspection visible. Spot checks keep it affordable. Kitchen logs make hidden failures explainable. Shift changes make deployments safer. That is the whole restaurant game. So first we must feel the pain of skipping all this.

Bridge. Before taxonomy and metrics, you need one uncomfortable lesson: many teams ship because a demo felt good. That mistake is the fastest way to disappoint a busy dining room. → 01-shipping-on-vibes.md