
00. Dataset golden-set operations — First-principles overview

Module 00 of this category — 00_ai_evals_release_gates — taught you that evals exist as release gates. This module is the discipline of operating the dataset behind the evals: the golden set itself, its sourcing, labelling, refresh, ownership, and lifecycle.


A data lead at a Pune SaaS company audits the team's evals six months after a successful launch. The eval scores have stayed steady at 0.84 since launch. The team is pleased. The audit asks two questions: when was the eval set last updated, and how representative is it of current production traffic? Both answers are unsettling. The set has not been touched since launch. The cases were drawn from a one-week production sample six months ago. The set has 80 happy-path cases and 20 failure cases — many of which represent failure modes that have since been fixed and are no longer common. Meanwhile, the failure modes seen in customer support today are not in the set at all. The eval score of 0.84 is real but increasingly hollow: it measures performance on yesterday's distribution against yesterday's failure shapes. Today's quality is unknown.

This is the operational problem of golden sets. The set is not a one-time artefact; it is a living dataset that drifts away from reality unless it is operated. This module teaches the discipline of operating it.
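The two audit questions can be made mechanical. The following is a minimal sketch, assuming every case in the set and every recent production failure carries a failure-mode tag. The function name `audit_golden_set`, the tag names, the dates, and the 90-day staleness threshold are all hypothetical choices made for illustration, not something this module prescribes.

```python
from collections import Counter
from datetime import date


def audit_golden_set(set_modes, production_modes, last_updated, today, max_age_days=90):
    """Answer the two audit questions: how old is the set, and which
    failure modes seen in production are absent from it?"""
    age_days = (today - last_updated).days
    set_counts = Counter(set_modes)
    prod_counts = Counter(production_modes)
    # Failure modes customers hit today that the set never exercises.
    missing = sorted(m for m in prod_counts if m not in set_counts)
    # Failure modes the set still tests that no longer show up in production.
    no_longer_seen = sorted(m for m in set_counts if m not in prod_counts)
    return {
        "age_days": age_days,
        "stale": age_days > max_age_days,  # threshold is an assumption
        "missing_from_set": missing,
        "no_longer_seen": no_longer_seen,
    }


# Hypothetical failure-mode tags for the 20 failure cases in the set
# and for a recent sample of production failures.
set_modes = ["timeout"] * 12 + ["bad_date_parse"] * 8
production_modes = ["timeout"] * 3 + ["wrong_currency"] * 9 + ["hallucinated_policy"] * 5

report = audit_golden_set(
    set_modes,
    production_modes,
    last_updated=date(2024, 1, 15),
    today=date(2024, 7, 15),
)
print(report)
```

With these made-up inputs the report flags the set as stale and lists two production failure modes it does not cover, which is the shape of the finding in the audit described above.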


What golden-set operations is

Golden-set operations is the discipline of keeping the eval dataset representative, well-labelled, current, versioned, and owned — so that evals continue to mean what they claim to mean.

Six surfaces.

| Surface | One-liner | Pressure it answers |
| --- | --- | --- |
| Sourcing | Where new cases come from (production sample, expert authoring, synthetic generation) | coverage: the set must represent what the system actually sees |
| Labelling | Who labels expected behaviour, by what process, with what calibration | accuracy: the labels are the spec; wrong labels make evals meaningless |
| Coverage | Stratification across failure modes, segments, edge cases | representativeness: averages hide problems in slices |
| Refresh | Cadence of adding new cases and retiring stale ones | currency: production drifts; the set must follow |
| Versioning | The set as a versioned artefact with changelog | reproducibility: score comparisons across time require frozen versions |
| Ownership | Who owns the set; who can change it; who reviews | accountability: an unowned set decays |
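The Coverage row's claim that averages hide problems in slices is easy to see with numbers. The sketch below assumes each eval result is tagged with the stratum its case belongs to. The function `pass_rates`, the stratum names, and the pass/fail counts are hypothetical, chosen so the overall score matches the 0.84 from the opening story.

```python
from collections import defaultdict


def pass_rates(results):
    """results: iterable of (stratum, passed) pairs.
    Returns the overall pass rate and a dict of per-stratum pass rates."""
    totals = defaultdict(int)
    passes = defaultdict(int)
    for stratum, passed in results:
        totals[stratum] += 1
        passes[stratum] += int(passed)
    overall = sum(passes.values()) / sum(totals.values())
    by_stratum = {s: passes[s] / totals[s] for s in totals}
    return overall, by_stratum


# Illustrative results: 80 happy-path cases and 20 cases in one failure stratum.
results = (
    [("happy_path", True)] * 78
    + [("happy_path", False)] * 2
    + [("refund_edge_case", True)] * 6
    + [("refund_edge_case", False)] * 14
)

overall, by_stratum = pass_rates(results)
print(f"overall: {overall:.2f}")
for stratum, rate in by_stratum.items():
    print(f"{stratum}: {rate:.2f}")
```

Here the overall pass rate is 0.84 while the `refund_edge_case` stratum passes only 0.30 of its cases. Reporting per stratum is one way to keep a large happy-path majority from masking a failing slice.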

The chapters of this module build each surface.


What this module is not about

Two distinctions to keep clear.

Not the judge. Module 00_ai_evals_release_gates covers LLM-as-judge and rubric design. This module is about the cases the judge evaluates, not the judging mechanism itself.

Not the eval framework. The runner that executes evals (CI integration, regression checks, dashboards) is operational tooling. This module is about the data the framework runs on.

The set is the workbench. The judge is the inspector. The framework is the rig. This module covers the workbench.


The recurring vocabulary

| Name | Surface | What it is |
| --- | --- | --- |
| the case | Sourcing | One input-expected-behaviour pair in the set |
| the rubric | Labelling | The criteria a good output for a case must satisfy |
| the source distribution | Sourcing | The mix of where cases come from (production, expert, synthetic) |
| the failure-mode stratum | Coverage | A category of failure the set explicitly covers |
| the refresh cadence | Refresh | How often new cases are added and stale ones retired |
| the set version | Versioning | A snapshot of the set, frozen, with a changelog |
| the set owner | Ownership | The team or person responsible for the set's quality |
| the calibration session | Labelling | The structured exercise that aligns multiple labellers' judgement |
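Several of these terms map naturally onto a small data model. The sketch below is one possible shape, not the representation later chapters necessarily use: the class names `Case` and `SetVersion`, every field name, and the idea of fingerprinting a version with a SHA-256 hash are assumptions made for illustration.

```python
import hashlib
import json
from collections import Counter
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class Case:
    case_id: str
    input_text: str
    rubric: tuple   # criteria a good output for this case must satisfy
    source: str     # "production", "expert", or "synthetic"
    stratum: str    # the failure-mode stratum this case covers


@dataclass(frozen=True)
class SetVersion:
    version: str
    owner: str            # the set owner: team or person responsible
    cases: tuple
    changelog: tuple = ()

    def fingerprint(self):
        """Hash the cases in a canonical order so two eval runs can
        confirm they scored the same frozen snapshot."""
        ordered = sorted(self.cases, key=lambda c: c.case_id)
        payload = json.dumps([asdict(c) for c in ordered], sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def source_distribution(self):
        """Share of cases per source (production, expert, synthetic)."""
        counts = Counter(c.source for c in self.cases)
        return {src: n / len(self.cases) for src, n in counts.items()}


# Hypothetical cases and owner name.
a = Case("c-001", "Cancel my plan and refund this month",
         ("confirms cancellation", "states refund policy"),
         "production", "refund_edge_case")
b = Case("c-002", "What is your uptime SLA?",
         ("cites the documented SLA",), "expert", "happy_path")

v1 = SetVersion("1.0.0", "eval-data-team", (a, b), ("1.0.0: initial snapshot",))
reordered = SetVersion("1.0.0", "eval-data-team", (b, a))

print(v1.fingerprint() == reordered.fingerprint())
print(v1.source_distribution())
```

Freezing the dataclasses and hashing only the sorted cases means the fingerprint changes when a case or its rubric changes, and not when cases are merely reordered or the changelog text is edited. Whether the changelog should be part of the hash is a design choice left open here.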

The journey

This module has three acts.

Act 1 — Build the set (files 01–04). What a golden set is for, where cases come from, how labels are made, how coverage is structured.

Act 2 — Keep it alive (files 05–08). Refresh, versioning, how the judging mechanism fits each case type, and the cost and throughput of operating the set.

Act 3 — Govern it (files 09–11). Privacy in the set, cross-team ownership, what happens when the set is wrong (incidents).

Synthesis (files 12–13). Architect checklist and honest admission.


Memory map

| # | File | Surface | What it adds |
| --- | --- | --- | --- |
| 01 | what-a-golden-set-is-for | | the role of the set in the eval discipline |
| 02 | sourcing-cases | Sourcing | where cases come from and how to balance the sources |
| 03 | labelling-discipline | Labelling | who labels, with what process and calibration |
| 04 | coverage-and-stratification | Coverage | how to ensure the set represents what matters |
| | milestone: the set is real | | |
| 05 | refresh-and-drift | Refresh | the cadence that keeps the set current |
| 06 | versioning-the-set | Versioning | the artefact discipline |
| 07 | judging-mechanism-fit | Labelling+Coverage | rubric vs reference vs exact-match per case type |
| 08 | cost-and-throughput | Cross-cutting | the operational economics |
| | milestone: the set is operated | | |
| 09 | privacy-in-the-golden-set | Cross-cutting | personal data and the set |
| 10 | cross-team-ownership | Ownership | how multiple teams contribute and review |
| 11 | eval-set-incident-response | All | what to do when the set is wrong |
| | milestone: the set is governed | | |
| 12 | architect-checklist | Synthesis | 20 items |
| 13 | honest-admission | Boundaries | what golden sets cannot solve |

How this module relates to its neighbours


Top resources

  • Sculley et al., "Hidden Technical Debt in Machine Learning Systems" — https://papers.nips.cc/paper/2015/hash/86df7dcfd896fcaf2674f757a2463eba-Abstract.html
  • Snorkel — programmatic labelling — https://snorkel.ai/
  • OpenAI evals — example designs — https://github.com/openai/evals
  • Anthropic — building evals for LLM apps — https://docs.anthropic.com/en/docs/build-with-claude/develop-tests
  • Promptfoo — eval framework — https://www.promptfoo.dev/

What's coming

  1. 01-what-a-golden-set-is-for.md — The role of the set; what it does and what it does not.
  2. 02-sourcing-cases.md — Production sample, expert authoring, synthetic generation; how to balance.
  3. 03-labelling-discipline.md — Who labels, with what process and calibration.
  4. 04-coverage-and-stratification.md — Failure modes, segments, edge cases as strata.
  5. 05-refresh-and-drift.md — The cadence; what triggers new cases.
  6. 06-versioning-the-set.md — The set as a versioned artefact with changelog.
  7. 07-judging-mechanism-fit.md — Rubric vs reference vs exact-match; per case type.
  8. 08-cost-and-throughput.md — The economics of operating evals at scale.
  9. 09-privacy-in-the-golden-set.md — Synthetic substitution, redaction, right-to-be-forgotten (RTBF) requests.
  10. 10-cross-team-ownership.md — Multiple teams contribute; how to govern.
  11. 11-eval-set-incident-response.md — When the set is wrong: false positives, missed cases.
  12. 12-architect-checklist.md — Twenty items.
  13. 13-honest-admission.md — What golden sets cannot solve.

Bridge. Before designing sourcing or labelling, the team needs to be clear on what the golden set is and is not for. The first chapter covers that role: what evaluation problem the set solves, what problems it does not solve, and the trap of treating it as a test suite. → 01-what-a-golden-set-is-for.md