00. Dataset golden-set operations — First-principles overview
Module 00 of this category — 00_ai_evals_release_gates — taught you that evals exist as release gates. This module is the discipline of operating the dataset behind the evals: the golden set itself, its sourcing, labelling, refresh, ownership, and lifecycle.
A data lead at a Pune SaaS company audits the team's evals six months after a successful launch. The eval scores have stayed steady at 0.84 since launch. The team is pleased. The audit asks two questions: when was the eval set last updated, and how representative is it of current production traffic? Both answers are unsettling. The set has not been touched since launch. The cases were drawn from a one-week production sample six months ago. The set has 80 happy-path cases and 20 failure cases — many of which represent failure modes that have since been fixed and are no longer common. Meanwhile, the failure modes seen in customer support today are not in the set at all. The eval score of 0.84 is real but increasingly hollow: it measures performance on yesterday's distribution against yesterday's failure shapes. Today's quality is unknown.
This is the operational problem of golden sets. The set is not a one-time artefact; it is a living dataset that drifts away from reality unless it is operated. This module is the discipline.
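The audit's second question (how representative is the set of current production traffic?) can be turned into a number. The following is a minimal sketch, not a prescribed method: it assumes every golden case and every sampled production interaction has already been tagged with a category, and it compares the two mixes using total variation distance. The category names and the production counts are hypothetical; only the 80/20 split of the set comes from the story above.

```python
from collections import Counter


def category_shares(labels):
    """Return each category's share of the total (shares sum to 1.0)."""
    if not labels:
        raise ValueError("labels must not be empty")
    counts = Counter(labels)
    total = sum(counts.values())
    return {category: n / total for category, n in counts.items()}


def total_variation_distance(set_labels, production_labels):
    """Half the summed absolute share difference: 0.0 = same mix, 1.0 = disjoint."""
    p = category_shares(set_labels)
    q = category_shares(production_labels)
    categories = set(p) | set(q)
    return 0.5 * sum(abs(p.get(c, 0.0) - q.get(c, 0.0)) for c in categories)


def uncovered_categories(set_labels, production_labels):
    """Categories seen in production that have no case in the golden set."""
    return sorted(set(production_labels) - set(set_labels))


# Hypothetical tags: the set as frozen at launch vs. a recent production sample.
golden_set = ["happy_path"] * 80 + ["timeout"] * 12 + ["bad_format"] * 8
production = (
    ["happy_path"] * 70
    + ["timeout"] * 2
    + ["wrong_language"] * 18
    + ["stale_pricing"] * 10
)

drift = total_variation_distance(golden_set, production)
missing = uncovered_categories(golden_set, production)
print(f"drift={drift:.2f}, uncovered={missing}")
```

With these made-up numbers the drift is about 0.28, and two production failure categories have no case in the set at all. What threshold should trigger a refresh is a judgement call for the set owner, not something this sketch decides.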
What golden-set operations is
Golden-set operations is the discipline of keeping the eval dataset representative, well-labelled, current, versioned, and owned — so that evals continue to mean what they claim to mean.
Six surfaces.
| Surface | One-liner | Pressure it answers |
|---|---|---|
| Sourcing | Where new cases come from (production sample, expert authoring, synthetic generation) | coverage: the set must represent what the system actually sees |
| Labelling | Who labels expected behaviour, by what process, with what calibration | accuracy: the labels are the spec; wrong labels make evals meaningless |
| Coverage | Stratification across failure modes, segments, edge cases | representativeness: averages hide problems in slices |
| Refresh | Cadence of adding new cases and retiring stale ones | currency: production drifts; the set must follow |
| Versioning | The set as a versioned artefact with changelog | reproducibility: score comparisons across time require frozen versions |
| Ownership | Who owns the set; who can change it; who reviews | accountability: an unowned set decays |
The chapters of this module build each surface.
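The Coverage row's claim that averages hide problems in slices is easy to see with a small worked example. This sketch assumes each eval result is recorded as a `(stratum, passed)` pair; the stratum names and the pass counts are hypothetical, chosen so that the overall score matches the 0.84 from the opening story.

```python
from collections import defaultdict


def scores_by_stratum(results):
    """results: iterable of (stratum, passed) pairs. Returns {stratum: pass rate}."""
    totals = defaultdict(int)
    passes = defaultdict(int)
    for stratum, passed in results:
        totals[stratum] += 1
        passes[stratum] += int(passed)
    return {stratum: passes[stratum] / totals[stratum] for stratum in totals}


# Hypothetical run: 100 cases, 84 of which pass overall.
results = (
    [("happy_path", True)] * 78
    + [("happy_path", False)] * 2
    + [("refund_dispute", True)] * 6
    + [("refund_dispute", False)] * 14
)

overall = sum(passed for _, passed in results) / len(results)
per_stratum = scores_by_stratum(results)

print(f"overall: {overall:.2f}")
for stratum, score in sorted(per_stratum.items()):
    print(f"{stratum}: {score:.3f}")
```

In this illustration the headline score is 0.84 while the hypothetical `refund_dispute` stratum passes only 30% of its cases. Reporting per-stratum scores next to the average is one way to keep a weak slice visible.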
What this module is not about
Two distinctions to keep clear.
Not the judge. Module 00_ai_evals_release_gates covers LLM-as-judge and rubric design. This module is about the cases the judge evaluates, not the judging mechanism itself.
Not the eval framework. The runner that executes evals (CI integration, regression checks, dashboards) is operational tooling. This module is about the data the framework runs on.
The set is the workbench. The judge is the inspector. The framework is the rig. This module covers the workbench.
The recurring vocabulary
| Name | Surface | What it is |
|---|---|---|
| the case | Sourcing | One input-expected-behaviour pair in the set |
| the rubric | Labelling | The criteria a good output for a case must satisfy |
| the source distribution | Sourcing | The mix of where cases come from (production, expert, synthetic) |
| the failure-mode stratum | Coverage | A category of failure the set explicitly covers |
| the refresh cadence | Refresh | How often new cases are added and stale ones retired |
| the set version | Versioning | A snapshot of the set, frozen, with a changelog |
| the set owner | Ownership | The team or person responsible for the set's quality |
| the calibration session | Labelling | The structured exercise that aligns multiple labellers' judgement |
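The vocabulary above maps naturally onto a small data model. The sketch below is one possible shape, not the schema this module prescribes: the field names, the example cases, and the owner name are all hypothetical. It also shows one way to make "frozen" checkable, by fingerprinting a set version's contents so that any change to a case or its rubric produces a different hash.

```python
import hashlib
import json
from collections import Counter
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class Case:
    case_id: str
    input_text: str
    rubric: tuple[str, ...]  # criteria a good output must satisfy
    source: str              # "production", "expert", or "synthetic"
    stratum: str             # the failure-mode stratum this case covers


@dataclass(frozen=True)
class SetVersion:
    version: str
    owner: str
    changelog: str
    cases: tuple[Case, ...]

    def source_distribution(self) -> Counter:
        """Count of cases per source."""
        return Counter(case.source for case in self.cases)

    def fingerprint(self) -> str:
        """SHA-256 over the cases, sorted by id so ordering does not matter."""
        ordered = sorted(self.cases, key=lambda case: case.case_id)
        payload = json.dumps([asdict(case) for case in ordered], sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()


# Hypothetical cases and versions, for illustration only.
refund = Case(
    "c-001",
    "I was charged twice, please refund one payment.",
    ("acknowledges the duplicate charge", "does not promise a refund date"),
    "production",
    "billing_dispute",
)
greeting = Case(
    "c-002",
    "Hi, what can you help me with?",
    ("lists supported tasks",),
    "expert",
    "happy_path",
)
relabelled = Case(
    refund.case_id,
    refund.input_text,
    refund.rubric + ("offers escalation to a human",),
    refund.source,
    refund.stratum,
)

v1 = SetVersion("1.0.0", "eval-data-team", "Initial set.", (refund, greeting))
v1_reordered = SetVersion("1.0.0", "eval-data-team", "Initial set.", (greeting, refund))
v2 = SetVersion("1.1.0", "eval-data-team", "Tightened rubric for c-001.", (relabelled, greeting))

print(v1.fingerprint() == v1_reordered.fingerprint())  # same cases, same hash
print(v1.fingerprint() == v2.fingerprint())            # rubric changed, hash differs
```

Recording the fingerprint alongside each eval score is one way to confirm that two scores were computed against the same set version before comparing them.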
The journey
This module has three acts.
Act 1 — Build the set (files 01–04). What a golden set is for, where cases come from, how labels are made, how coverage is structured.
Act 2 — Keep it alive (files 05–08). Refresh, versioning, how the judging mechanism fits each case type, and the cost and throughput of operating the set.
Act 3 — Govern it (files 09–11). Privacy in the set, cross-team ownership, what happens when the set is wrong (incidents).
Synthesis (files 12–13). Architect checklist and honest admission.
Memory map
| # | File | Surface | What it adds |
|---|---|---|---|
| 01 | what-a-golden-set-is-for | — | the role of the set in the eval discipline |
| 02 | sourcing-cases | Sourcing | where cases come from and how to balance the sources |
| 03 | labelling-discipline | Labelling | who labels, with what process and calibration |
| 04 | coverage-and-stratification | Coverage | how to ensure the set represents what matters |
| | — milestone: the set is real — | | |
| 05 | refresh-and-drift | Refresh | the cadence that keeps the set current |
| 06 | versioning-the-set | Versioning | the artefact discipline |
| 07 | judging-mechanism-fit | Labelling+Coverage | rubric vs reference vs exact-match per case type |
| 08 | cost-and-throughput | Cross-cutting | the operational economics |
| | — milestone: the set is operated — | | |
| 09 | privacy-in-the-golden-set | Cross-cutting | personal data and the set |
| 10 | cross-team-ownership | Ownership | how multiple teams contribute and review |
| 11 | eval-set-incident-response | All | what to do when the set is wrong |
| | — milestone: the set is governed — | | |
| 12 | architect-checklist | Synthesis | 20 items |
| 13 | honest-admission | Boundaries | what golden sets cannot solve |
How this module relates to its neighbours
- 00_ai_evals_release_gates — sibling module on evals as release gates and judge design. This module is the dataset side.
- 02_telemetry_feedback_loops — the next module, on capturing production signals into evals. Feeds the sourcing in chapter 02.
- 14_legacy_ai_modernization — chapter 03 there is "the eval backstop"; this module is what makes that backstop sustainable over months.
- 03_ai_security_safety/03_data_access_governance — chapter 05 (PII) and chapter 09 (RTBF) apply to the golden set itself.
- 13_prompt_lifecycle_operations — prompt changes are gated against the golden set; the set is the substrate.
Top resources
- Sculley et al., "Hidden Technical Debt in Machine Learning Systems" — https://papers.nips.cc/paper/2015/hash/86df7dcfd896fcaf2674f757a2463eba-Abstract.html
- Snorkel — programmatic labelling — https://snorkel.ai/
- OpenAI evals — example designs — https://github.com/openai/evals
- Anthropic — building evals for LLM apps — https://docs.anthropic.com/en/docs/build-with-claude/develop-tests
- Promptfoo — eval framework — https://www.promptfoo.dev/
What's coming
- 01-what-a-golden-set-is-for.md — The role of the set; what it does and what it does not do.
- 02-sourcing-cases.md — Production sample, expert authoring, synthetic generation; how to balance.
- 03-labelling-discipline.md — Who labels, with what process and calibration.
- 04-coverage-and-stratification.md — Failure modes, segments, edge cases as strata.
- 05-refresh-and-drift.md — The cadence; what triggers new cases.
- 06-versioning-the-set.md — The set as a versioned artefact with changelog.
- 07-judging-mechanism-fit.md — Rubric vs reference vs exact-match; per case type.
- 08-cost-and-throughput.md — The economics of operating evals at scale.
- 09-privacy-in-the-golden-set.md — Synthetic substitution, redaction, RTBF.
- 10-cross-team-ownership.md — Multiple teams contribute; how to govern.
- 11-eval-set-incident-response.md — When the set is wrong: false positives, missed cases.
- 12-architect-checklist.md — Twenty items.
- 13-honest-admission.md — What golden sets cannot solve.
Bridge. Before designing sourcing or labelling, the team needs to be clear on what the golden set is and is not for. The first chapter is the role: what evaluation problem the set solves, what problems it does not, and the trap of treating it as a test suite. → 01-what-a-golden-set-is-for.md