
13. Honest admission

Twelve chapters of discipline. None of them solve the problem entirely. This chapter lists the limits a thoughtful lead is transparent about — with their team, their stakeholders, and themselves — when telling the story of operating a golden set.


The set is a workbench. The workbench is necessary; the workbench is not the work. The admissions below mark the limits within which a golden set operates and the gaps the discipline does not close.


1 — The set measures the set, not the world

The score is honest about the cases in the set. It is silent on cases outside the set. A platform that conflates "we scored well" with "we are correct in production" makes a category error. The set is a sample; the population is what the system actually sees; the relationship between them is statistical and approximate.
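
To make "statistical and approximate" concrete, here is a minimal sketch that puts a Wilson score interval around a pass rate. The function name and the numbers (270 passes out of 300 cases) are illustrative assumptions, not figures from this module. The interval also assumes the set is a random sample of production traffic, which a curated, stratified set is not, so the real uncertainty is wider than what it reports.

```python
import math
from typing import Tuple


def wilson_interval(passes: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a pass rate of passes/n (z=1.96 is about 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = passes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half


# Hypothetical run: 270 of 300 cases pass (a 90% score).
low, high = wilson_interval(passes=270, n=300)
print(f"observed 0.900, plausible range {low:.3f} to {high:.3f}")
```

Even under the generous random-sample assumption, a 90% score on 300 cases is compatible with a true rate several points either side of it. That is the gap between "we scored well" and "we are correct in production".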


2 — Coverage is always partial

Even with stratification and refresh, the set misses cases. New failure modes appear faster than refreshes can catch them. Long-tail rare cases (1 in 10,000) are unlikely to be in a 300-case set. The mitigation is the production-traffic eval and the customer-impact monitoring; the set alone is incomplete.
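
The long-tail claim can be checked with a few lines of arithmetic. The sketch below assumes cases are drawn independently and at random from production traffic; a set that deliberately adds known rare cases will do better than this, but only for the rare cases the team already knows about. The function names are illustrative.

```python
import math


def prob_set_contains(rate: float, set_size: int) -> float:
    """Chance that at least one of set_size random draws is a case of this rarity."""
    return 1.0 - (1.0 - rate) ** set_size


def size_for_confidence(rate: float, confidence: float = 0.95) -> int:
    """Smallest random sample that contains such a case with the given probability."""
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - rate))


rate = 1 / 10_000
print(f"300-case set contains one: {prob_set_contains(rate, 300):.1%}")
print(f"cases needed for a 95% chance: {size_for_confidence(rate):,}")
```

Under these assumptions a 300-case set has about a 3% chance of containing a 1-in-10,000 case, and a random sample would need roughly 30,000 cases to contain one with 95% probability.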


3 — Labels age

Policies change; rubrics sharpen; the "correct" answer for a case may differ today from the label assigned six months ago. The discipline of label updates (chapter 11) reduces the lag; it does not eliminate it. Stale labels produce false alarms (a correct output scored as wrong) or hidden regressions (a wrong output scored as correct); the discipline catches most, not all.


4 — Judge variance is real

LLM-as-judge introduces noise. Multi-judge and calibration reduce it; they do not eliminate it. Two runs of the same eval on the same set may produce slightly different scores. Trends matter more than single readings; the discipline acknowledges this rather than hides it.
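
One way to act on "trends matter more than single readings" is to repeat each eval a few times and compare the gap between two versions with the run-to-run noise. The sketch below is one simple heuristic under stated assumptions, not a method prescribed by this module: the scores are made up, and the threshold of two standard errors is an arbitrary choice.

```python
import math
import statistics
from typing import Sequence


def exceeds_noise(baseline: Sequence[float], candidate: Sequence[float], k: float = 2.0) -> bool:
    """True if the gap in mean score is larger than k standard errors of the difference.

    Each sequence holds scores from repeated runs of the same eval on the same set
    (at least two runs each, so a standard deviation can be computed).
    """
    gap = abs(statistics.mean(candidate) - statistics.mean(baseline))
    se = math.sqrt(
        statistics.stdev(baseline) ** 2 / len(baseline)
        + statistics.stdev(candidate) ** 2 / len(candidate)
    )
    return gap > k * se


# Hypothetical scores from five repeated runs of each version.
baseline_runs = [0.86, 0.88, 0.87, 0.85, 0.87]
candidate_runs = [0.88, 0.86, 0.89, 0.87, 0.88]
print(exceeds_noise(baseline_runs, candidate_runs))
```

In this made-up example the candidate averages one point higher, but the runs within each version vary by about as much, so the check reports that the gap is within noise.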


5 — The set's strata are the team's hypotheses

Stratification is the team's model of what matters. A failure mode the team has not anticipated is not a stratum; cases in it are absent. The set's coverage reflects the team's understanding of the system; it does not exceed that understanding.


6 — Production-traffic eval lags by sampling rate

The production-traffic eval scores a small fraction of production calls. Failures in a sub-population that is rare relative to the sampling rate are likely to be missed by this signal. A 1% sample leaves 99% of calls unscored, so a failure confined to a small sub-population can go unsampled for a long time; larger samples cost more. The trade-off is platform-specific and unavoidable.
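
The sampling trade-off can be estimated before choosing a rate. This sketch assumes uniform random sampling and independent failures; the traffic volume, failure rate, and function names are illustrative assumptions.

```python
def expected_sampled_failures(calls: int, failure_rate: float, sample_rate: float) -> float:
    """Average number of failing calls that land in the scored sample."""
    return calls * failure_rate * sample_rate


def prob_sampled_at_least_once(calls: int, failure_rate: float, sample_rate: float) -> float:
    """Chance that at least one failing call is both produced and sampled."""
    return 1.0 - (1.0 - failure_rate * sample_rate) ** calls


# Hypothetical day: 50,000 calls, a failure mode hitting 1 in 10,000 calls, 1% sampling.
calls, failure_rate, sample_rate = 50_000, 1 / 10_000, 0.01
print(f"expected sampled failures: {expected_sampled_failures(calls, failure_rate, sample_rate):.2f}")
print(f"chance of seeing any: {prob_sampled_at_least_once(calls, failure_rate, sample_rate):.1%}")
```

With these numbers about five calls fail each day, yet the sampled eval sees one on roughly one day in twenty. Raising the sample rate tenfold raises both the detection chance and the scoring cost roughly tenfold.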


7 — Synthetic substitution may erase signal

Synthetic-only sets protect privacy. They may also erase signal: a system's behaviour on a real customer's specific data may differ from its behaviour on synthetic-substituted data. For most platforms, the structural similarity is enough; for some (highly personalised systems), the substitution removes too much of what mattered. The mitigation is targeted use of redacted-real data in restricted-access stores; it is a partial mitigation, not a clean fix.


8 — Cross-team governance is people work

The cross-team forum, the conflict resolution, the priority allocation — these are organisational disciplines as much as technical ones. A platform with strong technical eval and weak cross-team governance has uneven coverage; a platform with strong governance and weak technical setup has good conversations and noisy scores. Both are required; neither is enough.


9 — Cost is bounded by attention

The eval cost can be optimised technically; the labelling and review cost is bounded by domain expert time. A platform that grows its eval coverage may find that the set grows faster than labelling can keep up; the set's quality then decays even with technical investment. The honest answer is to scope the set's growth to sustainable labelling capacity.


10 — A perfect set still cannot prevent the next surprise

The system will surprise the team. A new model, a new prompt, a new input distribution, a new edge case the set did not include. The set reduces the rate of surprise; it does not eliminate it. The discipline is to make surprise informative — the production-traffic eval, the customer-impact monitoring, the incident workflow — so each surprise improves the set.


What this module does not teach

  • The internals of judge design and calibration (covered in 00_ai_evals_release_gates)
  • The eval framework itself (CI integration, runners, dashboards — operational tooling)
  • Eval metrics in detail (precision/recall, BLEU, ROUGE, etc. — depends on domain)
  • Specific labelling tools (Labelbox, Scale, custom — operational choice)
  • Synthetic data generation techniques in depth (a separate discipline)

This module is the operating posture for the set itself; the neighbours fill in around it.


How to use this module after reading it

  1. Audit the platform's eval set against chapter 12. Identify the top three reds.
  2. If items 1 or 17 are red (role clarity or ownership), fix those first. The substantive minimum.
  3. Build the sourcing, labelling, coverage discipline (items 2–6). A real set.
  4. Establish refresh, versioning, judging fit (items 7–11). A maintained set.
  5. Operate at scale (items 12–15) and govern (items 16–20) as the platform matures.
  6. Re-read this honest admission every quarter. Limits surface as the platform evolves.
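
The ordering in steps 2 to 5 can be sketched as a small triage helper. The item groupings come from the steps above; the status dictionary, the phase labels, and the function name are hypothetical, and the actual checklist items live in chapter 12.

```python
from typing import Dict, List, Tuple

# Phases in fix-first order, following steps 2 to 5 above.
# Item 17 appears in the first phase and in the 16-20 range; it is handled once, at its earliest phase.
PHASES = [
    ("role clarity and ownership", [1, 17]),
    ("sourcing, labelling, coverage", range(2, 7)),
    ("refresh, versioning, judging fit", range(7, 12)),
    ("operate at scale", range(12, 16)),
    ("govern", range(16, 21)),
]


def fix_order(status: Dict[int, str]) -> List[Tuple[int, str]]:
    """Return the red checklist items as (item, phase) pairs, in the order to fix them."""
    seen = set()
    ordered = []
    for phase, items in PHASES:
        for item in items:
            if item in seen:
                continue
            seen.add(item)
            if status.get(item) == "red":
                ordered.append((item, phase))
    return ordered


# Hypothetical audit result; items not listed are treated as not red.
audit = {1: "green", 3: "red", 9: "red", 14: "amber", 17: "red"}
for item, phase in fix_order(audit):
    print(f"item {item}: {phase}")
```

For this made-up audit the helper lists item 17 first, then 3, then 9, which matches the rule that role clarity and ownership come before anything else.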

Closing

The golden set is the team's accumulated judgement of what the system should do, encoded in cases and labels, operated as a versioned artefact, refreshed against production, governed across teams, and defended against incidents.

It is not a guarantee of correctness. It is the most concrete artefact the team has of "what good looks like" — and the discipline this module taught makes that artefact reliable enough to drive day-to-day decisions about the system.

That is what production-grade eval-set operation looks like.


Bridge. This module covered the set itself. The next module, 02_telemetry_feedback_loops, is about capturing production signals into the eval and prompt processes — the feedback loop that grows the set's coverage from what the system actually sees in the wild. → ../02_telemetry_feedback_loops/00-eli5.md