00. Dataset golden-set operations — First-principles overview

Module 00 of this category — 00_ai_evals_release_gates — taught you that evals exist as release gates. This module is the discipline of operating the dataset behind the evals: the golden set itself, its sourcing, labelling, refresh, ownership, and lifecycle.

A data lead at a Pune SaaS company audits the team's evals six months after a successful launch. The eval scores have stayed steady at 0.84 since launch. The team is pleased. The audit asks two questions: when was the eval set last updated, and how representative is it of current production traffic? Both answers are unsettling. The set has not been touched since launch. The cases were drawn from a one-week production sample six months ago. The set has 80 happy-path cases and 20 failure cases — many of which represent failure modes that have since been fixed and are no longer common. Meanwhile, the failure modes seen in customer support today are not in the set at all. The eval score of 0.84 is real but increasingly hollow: it measures performance on yesterday's distribution against yesterday's failure shapes. Today's quality is unknown.

This is the operational problem of golden sets. The set is not a one-time artefact; it is a living dataset that drifts away from reality unless it is operated. This module teaches the discipline of operating it.
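The audit's second question, how representative the set is of current production traffic, can be turned into a number. The following is a minimal sketch, not a prescribed method: it compares the failure-mode mix of the golden set with that of recent production using total variation distance. The failure-mode tags and counts are hypothetical, and the choice of distance measure is an assumption made for illustration.

```python
from collections import Counter


def mix(labels):
    """Return the share of each failure-mode label in a list of labels."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


def total_variation(p, q):
    """Total variation distance between two discrete distributions.

    0.0 means the mixes are identical; 1.0 means they share no labels.
    """
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)


# Hypothetical tags: the 20 failure cases frozen at launch ...
golden_set_modes = ["timeout"] * 12 + ["wrong_currency"] * 8
# ... versus 20 failures triaged from support this month.
production_modes = (
    ["timeout"] * 2 + ["stale_policy"] * 10 + ["hallucinated_sku"] * 8
)

set_mix = mix(golden_set_modes)
prod_mix = mix(production_modes)

drift = total_variation(set_mix, prod_mix)
# Failure modes seen in production that the set does not cover at all.
missing = sorted(set(prod_mix) - set(set_mix))

print(f"drift={drift:.2f}, missing from set: {missing}")
```

With these made-up counts the distance is 0.90 and two production failure modes are absent from the set, which is the shape of the problem the audit uncovered. Any threshold for "too much drift" is a team decision that this sketch does not settle.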

What golden-set operations is

Golden-set operations is the discipline of keeping the eval dataset representative, well-labelled, current, versioned, and owned — so that evals continue to mean what they claim to mean.

Six surfaces.

Surface	One-liner	Pressure it answers
Sourcing	Where new cases come from (production sample, expert authoring, synthetic generation)	coverage: the set must represent what the system actually sees
Labelling	Who labels expected behaviour, by what process, with what calibration	accuracy: the labels are the spec; wrong labels make evals meaningless
Coverage	Stratification across failure modes, segments, edge cases	representativeness: averages hide problems in slices
Refresh	Cadence of adding new cases and retiring stale ones	currency: production drifts; the set must follow
Versioning	The set as a versioned artefact with changelog	reproducibility: score comparisons across time require frozen versions
Ownership	Who owns the set; who can change it; who reviews	accountability: an unowned set decays

The chapters of this module build each surface.
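The Coverage row's claim that averages hide problems in slices can be shown with a small calculation. This is an illustrative sketch only: it assumes the eval score is a simple pass rate, and the per-case outcomes are invented so that the aggregate matches the 0.84 from the opening example. The stratum names are hypothetical.

```python
from collections import defaultdict


def pass_rate_by_stratum(results):
    """results: iterable of (stratum, passed) pairs. Returns {stratum: pass rate}."""
    totals = defaultdict(int)
    passes = defaultdict(int)
    for stratum, passed in results:
        totals[stratum] += 1
        passes[stratum] += int(passed)
    return {stratum: passes[stratum] / totals[stratum] for stratum in totals}


# Invented outcomes for a 100-case set: 80 happy-path cases, 20 failure cases.
results = (
    [("happy_path", True)] * 76
    + [("happy_path", False)] * 4
    + [("failure_mode", True)] * 8
    + [("failure_mode", False)] * 12
)

# Aggregate pass rate across the whole set.
overall = sum(passed for _, passed in results) / len(results)
# The same outcomes, broken out per stratum.
by_stratum = pass_rate_by_stratum(results)

print(f"overall={overall:.2f}")
for stratum, rate in by_stratum.items():
    print(f"{stratum}: {rate:.2f}")
```

Here the aggregate is 0.84 while the failure-mode stratum passes only 40% of its cases. Reporting per-stratum scores alongside the aggregate is one way to keep that gap visible.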

What this module is not about

Two distinctions to keep clear.

Not the judge. Module 00_ai_evals_release_gates covers LLM-as-judge and rubric design. This module is about the cases the judge evaluates, not the judging mechanism itself.

Not the eval framework. The runner that executes evals (CI integration, regression checks, dashboards) is operational tooling. This module is about the data the framework runs on.

The set is the workbench. The judge is the inspector. The framework is the rig. This module covers the workbench.

The recurring vocabulary

Name	Surface	What it is
the case	Sourcing	One input-expected-behaviour pair in the set
the rubric	Labelling	The criteria a good output for a case must satisfy
the source distribution	Sourcing	The mix of where cases come from (production, expert, synthetic)
the failure-mode stratum	Coverage	A category of failure the set explicitly covers
the refresh cadence	Refresh	How often new cases are added and stale ones retired
the set version	Versioning	A snapshot of the set, frozen, with a changelog
the set owner	Ownership	The team or person responsible for the set's quality
the calibration session	Labelling	The structured exercise that aligns multiple labellers' judgement
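One way to make this vocabulary concrete is to sketch it as data. The following is a hypothetical schema, not the format this module prescribes: every field name is an assumption, and the content fingerprint is one possible way to show that a set version is frozen.

```python
import hashlib
import json
from dataclasses import asdict, dataclass
from typing import Tuple


@dataclass(frozen=True)
class Case:
    """One input-expected-behaviour pair (hypothetical field names)."""
    case_id: str
    input_text: str
    rubric: Tuple[str, ...]  # criteria a good output must satisfy
    source: str              # "production", "expert", or "synthetic"
    failure_mode: str        # the failure-mode stratum this case covers


@dataclass(frozen=True)
class SetVersion:
    """A frozen snapshot of the set with its changelog and owner."""
    version: str
    owner: str
    changelog: str
    cases: Tuple[Case, ...]

    def fingerprint(self) -> str:
        """SHA-256 over the cases only, independent of case order."""
        ordered = sorted(self.cases, key=lambda c: c.case_id)
        blob = json.dumps([asdict(c) for c in ordered], sort_keys=True)
        return hashlib.sha256(blob.encode("utf-8")).hexdigest()


case_a = Case("c-001", "Refund for order placed twice?",
              ("cites refund policy", "asks for order id"),
              "production", "duplicate_order")
case_b = Case("c-002", "Cancel my plan today",
              ("confirms cancellation date",),
              "expert", "cancellation")

v1 = SetVersion("1.0.0", "eval-team", "Initial set", (case_a, case_b))

# Tightening one rubric produces a new version with a different fingerprint.
case_a2 = Case("c-001", case_a.input_text,
               case_a.rubric + ("states refund timeline",),
               case_a.source, case_a.failure_mode)
v2 = SetVersion("1.1.0", "eval-team", "Tightened rubric for c-001",
                (case_a2, case_b))

print(v1.fingerprint() != v2.fingerprint())
```

Recording the fingerprint next to each eval score lets a later reader confirm that two scores were computed on the same cases before comparing them. The fingerprint deliberately ignores the version string, owner, and changelog, so it changes only when the cases themselves change.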

The journey

This module has three acts.

Act 1 — Build the set (files 01–04). What a golden set is for, where cases come from, how labels are made, how coverage is structured.

Act 2 — Keep it alive (files 05–08). Refresh, versioning, how the set interacts with the judging mechanism, and the cost and throughput of operating the set.

Act 3 — Govern it (files 09–11). Privacy in the set, cross-team ownership, what happens when the set is wrong (incidents).

Synthesis (files 12–13). Architect checklist and honest admission.

Memory map

#	File	Surface	What it adds
01	what-a-golden-set-is-for	—	the role of the set in the eval discipline
02	sourcing-cases	Sourcing	where cases come from and how to balance the sources
03	labelling-discipline	Labelling	who labels, with what process and calibration
04	coverage-and-stratification	Coverage	how to ensure the set represents what matters
	— milestone: the set is real —
05	refresh-and-drift	Refresh	the cadence that keeps the set current
06	versioning-the-set	Versioning	the artefact discipline
07	judging-mechanism-fit	Labelling+Coverage	rubric vs reference vs exact-match per case type
08	cost-and-throughput	Cross-cutting	the operational economics
	— milestone: the set is operated —
09	privacy-in-the-golden-set	Cross-cutting	personal data and the set
10	cross-team-ownership	Ownership	how multiple teams contribute and review
11	eval-set-incident-response	All	what to do when the set is wrong
	— milestone: the set is governed —
12	architect-checklist	Synthesis	20 items
13	honest-admission	Boundaries	what golden sets cannot solve

How this module relates to its neighbours

00_ai_evals_release_gates — sibling module on evals as release gates and judge design. This module is the dataset side.
02_telemetry_feedback_loops — the next module, on capturing production signals into evals. It feeds the sourcing covered in chapter 02 of this module.
14_legacy_ai_modernization — chapter 03 there is "the eval backstop"; this module is what makes that backstop sustainable over months.
03_ai_security_safety/03_data_access_governance — chapter 05 (PII) and chapter 09 (RTBF, the right to be forgotten) apply to the golden set itself.
13_prompt_lifecycle_operations — prompt changes are gated against the golden set; the set is the substrate.

Top resources

Sculley et al., "Hidden Technical Debt in Machine Learning Systems" — https://papers.nips.cc/paper/2015/hash/86df7dcfd896fcaf2674f757a2463eba-Abstract.html
Snorkel — programmatic labelling — https://snorkel.ai/
OpenAI evals — example designs — https://github.com/openai/evals
Anthropic — building evals for LLM apps — https://docs.anthropic.com/en/docs/build-with-claude/develop-tests
Promptfoo — eval framework — https://www.promptfoo.dev/

What's coming

01-what-a-golden-set-is-for.md — The role of the set; what it does and what it does not do.
02-sourcing-cases.md — Production sample, expert authoring, synthetic generation; how to balance.
03-labelling-discipline.md — Who labels, with what process and calibration.
04-coverage-and-stratification.md — Failure modes, segments, edge cases as strata.
05-refresh-and-drift.md — The cadence; what triggers new cases.
06-versioning-the-set.md — The set as a versioned artefact with changelog.
07-judging-mechanism-fit.md — Rubric vs reference vs exact-match; per case type.
08-cost-and-throughput.md — The economics of operating evals at scale.
09-privacy-in-the-golden-set.md — Synthetic substitution, redaction, RTBF.
10-cross-team-ownership.md — Multiple teams contribute; how to govern.
11-eval-set-incident-response.md — When the set is wrong: false positives, missed cases.
12-architect-checklist.md — Twenty items.
13-honest-admission.md — What golden sets cannot solve.

Bridge. Before designing sourcing or labelling, the team needs to be clear on what the golden set is and is not for. The first chapter is the role: what evaluation problem the set solves, what problems it does not, and the trap of treating it as a test suite. → 01-what-a-golden-set-is-for.md