00. Evals & Production: The Five-Year-Old Version
Module 10 built agent teams. This module teaches you how to know if they actually work.
Imagine a busy restaurant on a Saturday night. Orders keep coming, customers keep waiting, and the chef is the model turning each ticket into a dish. The dining room is production, and the customers waiting at their tables are your users. One pretty plate at the pass does not tell you whether the kitchen will hold up for the next four hours.
Now enter the health inspector. The inspector does not clap for one pretty plate; the inspector checks whether the whole restaurant runs safely and consistently under load. That is what evals do for AI products. They do not ask, "Can the chef cook one great dish?" They ask, "Can this kitchen keep doing the right thing under pressure?" A demo is one table served nicely. Production is the full dining room during rush hour, and the gap between the two is where most AI products quietly fail.
So what goes wrong? Teams often trust the smiling waiter at the front and never inspect the kitchen behind the door. The system looks magical in a meeting on Monday and disappoints real users by Friday. The inspector's job is to break that habit by using checklists (rubrics), spot checks (sampling representative conversations instead of reading everything), kitchen logs (logging and tracing the hidden steps inside an agent), and shift-change reviews (watching what happens when you deploy a new model or prompt).
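To make the checklist and the spot check concrete, here is a minimal sketch that scores a random sample of conversations against a tiny rubric. Everything in it is an assumption made for illustration: the `Conversation` shape, the three criteria in `RUBRIC`, and uniform random sampling as a stand-in for "representative". Later chapters cover real rubric and sampling design.

```python
import random
from dataclasses import dataclass


@dataclass
class Conversation:
    """One logged exchange (hypothetical shape, for illustration only)."""
    user_message: str
    answer: str
    latency_s: float


# A toy rubric: each criterion is an observable yes/no check.
# These three criteria are assumptions, not a recommended rubric.
RUBRIC = {
    "non_empty": lambda c: bool(c.answer.strip()),
    "under_5s": lambda c: c.latency_s < 5.0,
    "no_apology_loop": lambda c: c.answer.lower().count("sorry") <= 1,
}


def score(conv):
    """Return a pass/fail result for each rubric criterion."""
    return {name: check(conv) for name, check in RUBRIC.items()}


def spot_check(conversations, k, seed=0):
    """Score a random sample of up to k conversations instead of all of them."""
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    sample = rng.sample(conversations, min(k, len(conversations)))
    if not sample:
        return {}
    results = [score(c) for c in sample]
    # Pass rate per criterion across the sample.
    return {
        name: sum(r[name] for r in results) / len(results)
        for name in RUBRIC
    }
```

Calling `spot_check(conversations, k=50)` returns one pass rate per criterion, so a drop in any single check is visible without reading every ticket.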
A restaurant can fail in many ways: food can be wrong, late, allergy-blind, or look fine while still being unsafe. AI systems fail in the same ways: answers can be wrong, slow, policy-violating, or polished-sounding yet harmful. That is why this module exists: to give you the inspector's toolkit before your kitchen starts losing tables.
The placeholders you will see called back
| Placeholder | Meaning |
|---|---|
| the inspection | the eval — systematic quality assessment |
| the rubric | scoring criteria — observable, anchored |
| the spot check | sample-based eval — representative subset |
| the kitchen log | logging and tracing — internal visibility |
| the shift change | deployment — rolling out a new version |
See the restaurant flow
customers arrive ──→ dining room fills
│
▼
chef cooks dish ──→ plate leaves kitchen
│
▼
inspector checks rubric, spot checks, logs, and shift changes
When quality drops, the inspector wants evidence, not confidence, and that same attitude keeps AI teams honest. If users complain, you need the dish (the output), the checklist (the rubric), and the kitchen log (the trace). If a new release lands, you need to inspect the shift change. If a dashboard looks green, you still need good spot checks, because an aggregate can hide failures in individual conversations. The habit to build is simple: measure before, during, and after launch.
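The kitchen log only counts as evidence if every step of a request can be tied back to the same ticket. A minimal sketch of that idea, using only the Python standard library, is below. The function names, the record fields, and the uppercase "model call" are placeholders invented for this example, not a prescribed logging schema.

```python
import json
import time
import uuid


def log_step(trace_id, step, **fields):
    """Emit one structured log record as a JSON line and return it."""
    record = {"trace_id": trace_id, "step": step, "ts": time.time(), **fields}
    line = json.dumps(record)
    print(line)  # a real system would send this to a log sink instead
    return line


def handle_request(user_message):
    """Process one request, logging each hidden step under a shared trace id."""
    trace_id = uuid.uuid4().hex  # one id per request ties the steps together
    log_step(trace_id, "received", chars=len(user_message))
    answer = user_message.upper()  # stand-in for the real model call
    log_step(trace_id, "answered", chars=len(answer))
    return trace_id, answer
```

When a user complains, filtering the log by that request's `trace_id` recovers what happened at each step, which is the raw material the tracing chapter builds on.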
Memory map
| Concept | Prerequisite | Pressure family | Recurs later as | Layer touched |
|---|---|---|---|---|
| the inspection (the eval) | a shipped feature with real users | reliability across real traffic, not one demo | every chapter in this module | product / users |
| the rubric | a clear definition of "acceptable" output | rater consistency, ambiguity | judge calibration, EDD regression sets | scoring |
| the spot check | live traffic logs | cost vs coverage trade-off | drift detection, A/B sampling | sample |
| the kitchen log | structured logging on every step | post-hoc diagnosis of hidden failures | tracing, alerting, dashboards | runtime |
| the shift change | a release process you control | blast radius of a new prompt or model | A/B testing, post-deploy drift | deploy |
| the golden dataset | curated examples with known answers | preventing regression over time | EDD inner loop, judge agreement | data |
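The last two rows of the table meet at release time: the golden dataset is what you run before a shift change. The sketch below shows that regression check in miniature. The two `GOLDEN` examples, exact-match scoring, and the `safe_to_ship` rule are assumptions chosen for brevity; real golden sets are larger and versioned, and they usually need softer metrics than string equality.

```python
# Hypothetical golden examples: inputs with known-good answers.
GOLDEN = [
    {"input": "2 + 2", "expected": "4"},
    {"input": "capital of France", "expected": "Paris"},
]


def pass_rate(model_fn, dataset):
    """Fraction of golden examples the model answers exactly right."""
    hits = sum(
        model_fn(ex["input"]).strip() == ex["expected"] for ex in dataset
    )
    return hits / len(dataset)


def safe_to_ship(old_fn, new_fn, dataset, tolerance=0.0):
    """Allow the shift change only if the new version does not regress."""
    return pass_rate(new_fn, dataset) >= pass_rate(old_fn, dataset) - tolerance
```

Here `old_fn` and `new_fn` are any callables that map an input string to an answer string, such as thin wrappers around the current and candidate prompt or model.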
What's coming
- 01-shipping-on-vibes.md — Why demo success hides production failure.
- 02-eval-taxonomy.md — The main eval types and when each one helps.
- 03-golden-datasets.md — How to build trusted evaluation sets with owners and versions.
- 04-synthetic-generation.md — How to generate many useful test cases quickly.
- 05-metrics-reference.md — Which metrics fit text similarity, behavior, and product outcomes.
- 06-llm-as-judge.md — How one model can score another model responsibly.
- 07-rubric-design.md — How to write crisp scoring criteria with anchors.
- 08-judge-calibration.md — How to make judges agree and reduce bias.
- 09-drift-detection.md — How to catch quality decay after shipping.
- 10-ab-testing.md — How to compare versions safely in live traffic.
- 11-logging-tracing.md — How to see what happened inside an LLM workflow.
- 12-alerting-dashboards.md — What to watch, alert on, and review daily.
- 13-eval-driven-development.md — How to make evals the inner loop of improvement.
- 14-honest-admission.md — What evals still miss, and why humility matters.
Top resources
- OpenAI Evals guide — Practical starting point for building repeatable AI evaluations.
- Anthropic test and evaluate docs — Clear framing on defining success criteria before prompt tweaking.
- LangSmith evaluation docs — Good product-style examples for offline and online eval workflows.
- Braintrust evals overview — Useful for dataset versioning, experiment tracking, and regression checks.
- Arize Phoenix evaluations — Helpful for tracing, drift, and LLM-as-judge patterns together.
- promptfoo docs — Lightweight way to run prompt comparisons and assertion-based checks locally.
Evals are just disciplined inspection. Rubrics make the inspection visible. Spot checks keep it affordable. Kitchen logs make hidden failures explainable. Shift changes make deployments safer. That is the whole restaurant game. So first we must feel the pain of skipping all this.
Bridge. Before taxonomy and metrics, you need one uncomfortable lesson: many teams ship because a demo felt good. That mistake is the fastest way to disappoint a busy dining room. → 01-shipping-on-vibes.md