01. What a golden set is for

Before designing sourcing or labelling, the team needs to be clear on what the golden set is for and what it is not for. This chapter defines that role: the problem the set solves, the problems it does not, and the trap of treating it as a test suite.


A platform engineer at a Mumbai SaaS company shows the team a spreadsheet of 200 eval cases. The PM asks: "Is the AI working?" The engineer points at the eval score of 0.86. The PM is satisfied; the executive is satisfied; the deploy ships. Two weeks later, customer complaints reveal a failure mode the eval set did not cover: when a customer asks a question that combines two topics, the agent answers one and ignores the other. The score was 0.86 on the cases the set contained; production had a gap the set could not see by construction.

The misunderstanding: the team treated the eval like a unit test ("if it passes, the code works"). The set is not a test suite. It is a sample of the input distribution, against which the system's behaviour is graded. It tells you about the cases in the sample; it does not warrant behaviour outside the sample. This chapter is about holding that distinction clearly.


What the golden set does

Three things, each load-bearing.

1. It produces a comparable number. Two versions of the system can be scored on the same set; the difference in scores is meaningful. Without the set, "is the new version better?" has no operational answer.

2. It catches regressions on known cases. A change that drops the score on a case the team had judged "correct" is a regression worth investigating. The set is the memory of "we already decided this case should pass."

3. It encodes the team's behavioural spec. The labels — what counts as a good output — are the team's accumulated judgement of what the system should do. Reading the set tells you what the team has decided.

These three together are why the set exists. Anything else the set is asked to do should be examined against these.
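
The first two jobs can be sketched in a few lines. This is a minimal illustration, not a prescribed framework: `CaseResult`, `mean_score`, `regressions`, the case ids, and the scores are all hypothetical. It scores two versions on the same cases and lists the known cases that got worse.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CaseResult:
    case_id: str
    score: float  # graded score in [0, 1] for one case


def mean_score(results: list[CaseResult]) -> float:
    """Average per-case score: the single comparable number."""
    return sum(r.score for r in results) / len(results)


def regressions(
    baseline: list[CaseResult],
    candidate: list[CaseResult],
    tolerance: float = 0.0,
) -> list[str]:
    """Case ids whose score dropped by more than `tolerance`."""
    base = {r.case_id: r.score for r in baseline}
    cand = {r.case_id: r.score for r in candidate}
    if base.keys() != cand.keys():
        raise ValueError("both versions must be scored on the same set")
    return sorted(cid for cid in base if base[cid] - cand[cid] > tolerance)


# Hypothetical results for two versions of the system on the same four cases.
baseline = [
    CaseResult("refund-01", 1.0),
    CaseResult("two-topic-07", 1.0),
    CaseResult("billing-03", 0.0),
    CaseResult("login-02", 1.0),
]
candidate = [
    CaseResult("refund-01", 1.0),
    CaseResult("two-topic-07", 0.0),
    CaseResult("billing-03", 1.0),
    CaseResult("login-02", 1.0),
]

print(mean_score(baseline), mean_score(candidate))  # 0.75 0.75
print(regressions(baseline, candidate))             # ['two-topic-07']
```

In this made-up data the aggregate is unchanged while one known case regressed, which is why the per-case view is worth keeping alongside the single number.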


What the golden set does not do

Three traps the team falls into when treating the set as more than it is.

It does not prove the system is correct. A 0.95 score means the system handles the cases in the set well. It does not mean the system handles cases outside the set well. Cases outside the set have not been measured.

It does not detect novel failure modes. A failure shape the set does not include is invisible to the eval. New failures surface in production first; the set can catch them only after a refresh adds cases for them.

It is not a test suite. Tests are pass/fail and deterministic. Eval scores are distributions of judgements with variance, run against non-deterministic models. The right comparison is statistical, not binary; the right response to a small score drop is investigation, not an automatic block.
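
One way to make "statistical, not binary" concrete is a paired bootstrap over per-case score differences. This is a sketch of a standard resampling technique, not something the chapter prescribes; the function names, the 2,000 resamples, the 95% interval, and the verdict wording are assumptions. It captures only the noise from which cases happen to be in the set; variance from re-running a non-deterministic model would need repeated runs on top.

```python
import random


def paired_bootstrap_delta(baseline, candidate, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for mean(candidate - baseline).

    `baseline` and `candidate` are per-case scores in the same case order.
    Returns (observed_delta, lower_bound, upper_bound).
    """
    if len(baseline) != len(candidate) or not baseline:
        raise ValueError("need the same non-empty set of cases for both versions")
    diffs = [c - b for b, c in zip(baseline, candidate)]
    n = len(diffs)
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    # Resample cases with replacement and recompute the mean difference each time.
    means = sorted(sum(rng.choices(diffs, k=n)) / n for _ in range(n_resamples))
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return sum(diffs) / n, lower, upper


def verdict(lower, upper):
    """Turn the interval into an action, not a pass/fail."""
    if upper < 0:
        return "likely regression: investigate"
    if lower > 0:
        return "likely improvement"
    return "within noise: keep watching"
```

An interval that straddles zero is the case the text describes: a small drop that is a question to investigate, not a reason to block automatically.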


The set as a sample

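The set is a sample, not the population. The population is "all the inputs the system will ever see"; the set is "the 100 or so cases the team curated." The relationship between the score on the sample and the system's true quality is statistical: the score estimates quality on inputs like those in the set, with uncertainty that depends on how many cases there are and how they were chosen.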

Two consequences.

Coverage matters more than size. A 100-case set that spans the failure modes is more useful than a 1,000-case set drawn from the happy path. Chapter 04 covers stratification; this is its premise.

Drift in the population breaks the sample. If the system's input distribution changes (new feature, new users, new market), the sample becomes less representative. Chapter 05 covers refresh; this is its premise.

The team that holds the set as the population — "we tested everything" — produces the chapter-opening surprise. The team that holds the set as a sample — "this is what we tested" — designs for what the sample cannot cover.
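
A small coverage report makes the "this is what we tested" framing visible. The sketch below is illustrative only: the stratum labels, the `min_cases` threshold of 5, and both function names are assumptions, and Chapter 04 is where stratification is treated properly.

```python
from collections import defaultdict


def scores_by_stratum(cases, min_cases=5):
    """Group (stratum, score) pairs; return {stratum: (n, mean, is_thin)}.

    `is_thin` flags strata with fewer than `min_cases` cases, where the
    mean is too noisy to read much into.
    """
    buckets = defaultdict(list)
    for stratum, score in cases:
        buckets[stratum].append(score)
    return {
        stratum: (len(scores), sum(scores) / len(scores), len(scores) < min_cases)
        for stratum, scores in buckets.items()
    }


def uncovered(expected_strata, cases):
    """Strata the team knows about that have no case in the set at all."""
    seen = {stratum for stratum, _ in cases}
    return sorted(set(expected_strata) - seen)


# Hypothetical set: mostly single-topic questions, nothing combining two topics.
cases = [("single-topic", 1.0)] * 6 + [("refund", 1.0), ("refund", 0.0)]
expected = ["single-topic", "refund", "two-topic"]

print(scores_by_stratum(cases))
print(uncovered(expected, cases))  # ['two-topic']
```

The report can only list strata someone has already named; a failure shape nobody has thought of stays invisible until production surfaces it.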


Two flavours: regression set vs distribution set

Most platforms operate two sets, with different roles.

The regression set. A small, stable set of cases the team has decided must continue to work. Curated. Slow-changing. Used as a CI gate — a change that regresses any case is examined. Typically 50–200 cases.

The distribution set. A larger, refreshing set drawn from production traffic, covering the breadth of inputs the system actually sees. Stratified by failure mode and segment. Used for measuring overall quality. Typically 500–5,000 cases.

The regression set is the floor. The distribution set is the measurement. Most platforms benefit from both.

For this module, "the golden set" usually refers to the regression set unless context implies otherwise. The distribution set follows similar disciplines with looser labelling (often rubric-only).


When the set is too small or too big

Too small. Under 50 cases, single-case results dominate the average. In a 50-case set, one case flipping shifts the score by 2 percentage points; noise drowns signal.

Too big. Over a few thousand cases, the set becomes expensive to run and slow in CI, and the marginal case adds little signal. The work of maintaining the set exceeds its returns.

The sweet spot for a regression set is roughly 100–300 cases; for a distribution set, roughly 1,000–5,000. These are rough scales, not exact thresholds; workable sets exist somewhat below them, as the typical ranges given earlier (50–200 and 500–5,000) indicate.
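
The arithmetic behind these scales is short. The sketch below assumes each case is graded pass/fail and that cases are independent, so the binomial standard error applies; both assumptions are simplifications, and the function names are illustrative.

```python
import math


def single_flip_shift(n_cases: int) -> float:
    """How far the mean score moves when one pass/fail case flips."""
    return 1.0 / n_cases


def standard_error(score: float, n_cases: int) -> float:
    """Binomial standard error of a pass rate measured on n_cases."""
    return math.sqrt(score * (1.0 - score) / n_cases)


# 50 cases: one flip moves the score by 0.02 (2 percentage points).
for n in (50, 100, 300, 1000, 5000):
    print(
        f"n={n:>5}  one flip={single_flip_shift(n):.4f}  "
        f"std error at 0.86={standard_error(0.86, n):.4f}"
    )
```

The standard error shrinks with the square root of the set size, so going from 1,000 to 5,000 cases buys much less precision than going from 50 to 250, while run cost grows roughly linearly.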


The trap of the high score

A high eval score is satisfying. It is also dangerous. The team that sees 0.95 says "we're good" and stops investing in the set. The set ages; the score stays 0.95 because the cases in the set still pass; the production reality drifts away from the score; the executive's confidence is based on an artefact that has detached from reality.

The discipline:

  • Treat the score as one signal among several (production-traffic eval, customer complaints, dashboard metrics).
  • A stable high score is a signal to refresh the set, not to stop watching.
  • Score deltas matter more than absolute scores; a drop is a question regardless of the absolute level.
  • Compare absolute scores only when they come from the same, recently refreshed version of the set; a score on a stale set says little about current production.
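
One lightweight way to enforce the last two points is to store every score with the version of the set it was measured on and refuse to compare across versions. This is a sketch under assumptions: `ScoreRecord`, `score_delta`, and the version strings are hypothetical, not part of any particular eval framework.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScoreRecord:
    system_version: str
    set_version: str  # which snapshot of the golden set produced this score
    score: float


def score_delta(old: ScoreRecord, new: ScoreRecord) -> float:
    """Difference in score, allowed only when both used the same set version."""
    if old.set_version != new.set_version:
        raise ValueError(
            f"scores are not comparable: set {old.set_version!r} "
            f"vs set {new.set_version!r}; re-baseline on the newer set first"
        )
    return new.score - old.score


# Hypothetical records: same set version, so the delta is meaningful.
before = ScoreRecord("system-v1", "golden-set-v3", 0.86)
after = ScoreRecord("system-v2", "golden-set-v3", 0.88)
print(round(score_delta(before, after), 2))  # 0.02
```

After a refresh, the old system version would be re-scored on the new set to produce a fresh baseline before any delta is reported.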

The roles people fill around the set

Three roles, sometimes one person, often distributed.

The owner. Accountable for the set's quality. Decides what gets added, what gets retired, when to refresh. The owner is rarely an engineer alone; a product manager or domain expert is often co-owner.

The labellers. The people who decide what a good output is. Often domain experts (lawyers, doctors, support agents) more than engineers. Labelling discipline is chapter 03.

The runner. The engineer who maintains the eval framework and ensures the set runs against changes. The runner is the operator; the owner and labellers shape the substance.

Confusing the roles produces sets that are technically operational but substantively wrong. A runner alone produces sets that engineers can read but domain experts cannot validate. A labeller alone produces sets that may not run reliably in CI.


Common mistakes

Treating the eval as a test suite. "It passed" is the wrong framing. "The score is X on this version of the set" is the right one.

One static set forever. The set ages; the production distribution drifts; the score becomes meaningless. Chapter 05.

A set without an owner. Decisions about cases get made ad-hoc by whoever happens to be in the file; the set's coherence decays.

A set composed only of happy-path cases. Looks good, catches nothing meaningful. Chapter 04.

A set composed only of failure modes. Looks bad, encodes the wrong success criteria, demotivates the team. The set needs failure-mode coverage plus happy-path cases.

No comparable baseline. Without a baseline score on a stable set, a change's "improvement" is unclear. The set's first job is to produce a comparable baseline.


Interview Q&A

Q1. Why is the golden set not a test suite?

A test suite is deterministic and binary. The system being evaluated is non-deterministic; the same case can pass on many runs and fail on one. The judgement of what counts as success is often rubric-graded rather than exact-match. The right framing is statistical: the score is a distribution of judgements over the set, and changes are evaluated by score deltas with awareness of variance. Treating it as pass/fail produces false confidence on passes and false alarms on transient fails.

Wrong-answer notes: "it's like a test, just for AI" misses the statistical posture.

Q2. Walk through the two-set design and when each is used.

The regression set is small (50–200), stable, curated, and used as a CI gate. A change is checked against it pre-merge; regressions on it are examined before shipping. The distribution set is larger (1,000–5,000), refreshes from production traffic, is stratified by failure mode and segment, and is used to measure overall quality. The regression set is the floor; the distribution set is the measurement. The two roles are different; combining them into one set produces conflict: CI needs stability, while measurement needs currency.

Wrong-answer notes: "one set for everything" produces the conflict.

Q3. The set has been stable at 0.86 for six months. The team says "the system is performing well." What is your push-back?

The score is 0.86 on the cases in the set, which has not been refreshed. The cases reflect the world at the time of the last refresh. Production has likely drifted; new failure modes have likely appeared; cases for failure modes that have since been fixed may now pass trivially and inflate the score. The 0.86 is honest about what it measures; the team's claim "the system is performing well" extends beyond what the score warrants. The discipline: refresh the set, re-baseline, then make claims about current performance.

Wrong-answer notes: taking the score at face value is the chapter-opening trap.

Q4. The team is debating "100 cases or 1,000 cases?" for the regression set. What do you recommend?

100–300 cases is the sweet spot for a regression set. Under 50, single-case flips dominate the average and noise overwhelms signal. Over 1,000, the set is expensive to maintain and slow in CI, and the marginal case adds little. The work of maintaining 1,000 cases exceeds the work of maintaining 200 by more than 5×; the additional cases are usually unread, unverified, and silently rotting. For coverage of a wider distribution, use a separate distribution set that refreshes from production traffic, with rubric-based judging that does not require the same per-case curation.

Wrong-answer notes: "more is always better" misses the maintenance economics.


What to do differently after reading this

  • Hold the set as a sample, not the population. Communicate scores with that framing.
  • Design two sets if your platform's needs warrant: a stable regression set and a refreshing distribution set.
  • Treat score deltas as signals; quote absolute scores only with set-version provenance.
  • Assign the owner, the labellers, and the runner explicitly. Roles distributed across people are fine; ambiguity is not.
  • A stable high score is a signal to refresh, not to relax.

Bridge. With the role clear, the next question is concrete: where do the cases come from? Production traffic, expert authoring, synthetic generation — each has trade-offs. The next chapter is sourcing. → 02-sourcing-cases.md