03. Golden datasets — the labelled tray that turns every eval claim into evidence

A taxonomy without labelled data is a map without a territory. This chapter walks the refund-chatbot team from zero rows to a defensible 89-case golden set, and shows why every shortcut on the way produces a number nobody should trust.

Builds on the ELI5 in 00-eli5.md. The spot check is only as honest as the tray it samples from. The rubric only earns its reputation when the same tray gives stable answers across runs. And the inspection collapses to theater the moment the labelled set rots.


What chapters 01–02 settled, and the gap that still kills launches

Chapter 01 dismantled the demo. The refund chatbot scored 100% on five hand-picked prompts and 62% on 100 live ones, and the 38-point gap was a sampling failure, not a model failure. Chapter 02 broke evals into a taxonomy — offline against a frozen set, online against live traffic, single-turn vs trace-level, rule-based vs judge-based — and showed that each cell of the taxonomy answers a different shipping question. The decision frame is now clean.

The problem is that every cell of that taxonomy demands a labelled set, and without one, no eval is grounded. An offline regression check needs a fixed tray with known-good answers. A judge needs anchor examples to calibrate against. A drift detector needs a baseline distribution. Even an online metric like CSAT needs a labelled subset to validate that the metric tracks what humans actually want. Strip away the labelled tray and the entire taxonomy becomes vocabulary with nothing to measure.

The refund-chatbot team finished chapter 02 with a clear taxonomy and the same uncomfortable position they started in: they still cannot tell whether tomorrow's prompt change helps or hurts, because they have no fixed tray to run it against. This chapter is how that tray gets built, from the first 47 mined queries to an 89-case golden set with provenance, slices, ownership, and a version number that means something.

What this file solves

A "saved CSV of test cases" is not a golden dataset, and the difference shows up the first time a model swap moves the score by four points and nobody can tell whether the model improved, the data drifted, the labeller was tired, or three new examples got quietly added on Tuesday. This chapter shows the refund-chatbot team building the spot check's underlying tray from scratch (47 mined prod queries, 50 LLM-generated edge cases, SME validation that catches 8 mislabels, three slice tags per row, version v1.0) and names the six properties (provenance, labels, slices, freshness, ownership, version) that turn a CSV into an asset. By the end you can defend each row in the tray on its merits, refuse rows that fail the test, and know when 89 cases are enough vs when you need 1000.

Why a labelled tray has to come before any eval number

Picture the eval pipeline you want to run every Monday. A model candidate arrives, you point an offline runner at a CSV of test cases, the runner produces outputs, a scoring function compares each output against a label, and you get a number. That pipeline has three load-bearing inputs: the cases, the labels, and the scoring function. The scoring function is mechanical. The cases and labels are not — they encode a claim about what the system is supposed to do on inputs that look like real traffic. Every number that pipeline emits is implicitly a footnote that says "valid only on this case set, with these labels, scored this way." Change the case set silently and the footnote becomes a lie.

Now picture the next conversation. Your PM says "the new prompt scored 4 points higher, can we ship?" If the case set is undocumented and unowned, the only honest answer is "I don't know what that 4 points means." Maybe the prompt is better. Maybe somebody added six easy cases last week. Maybe the labels for the contested slice were never reviewed by a person who actually understands refund policy. The 4-point delta has no audit trail. The inspection depends on the tray being a stable, defensible artifact — otherwise every comparison is a story, not a measurement.

The need is sharpest the moment an eval has to be defended outside the team. Compliance asks "how do you know the bot doesn't invent refund exceptions for EU customers?" A skeptical engineering lead asks "what cases regressed when you upgraded from Claude 3.5 to Claude 4?" A new hire asks "where did this test case come from and why is it labelled this way?" Three different audiences, one underlying question: show me the labelled tray, its provenance, and its owner. A team that can answer that question can ship. A team that cannot is back on the inspection-without-substance trap from chapter 01.

Teacher voice. A golden dataset is not a folder. It is a contract. Every row says "on this kind of input, the right behaviour is this", and every row is signed by somebody who has the authority to make that claim.

The naive attempt every team tries first — and exactly how it falls apart

The first move a careful team makes is sensible. They scrape 200 recent production conversations, drop the obvious junk, and call it a golden set. The eval runner works. The numbers come out. For two weeks everyone is happy. Then three things go wrong, usually in this order.

First, the score plateaus around 73% and stops moving. Every prompt change adds half a point or subtracts half a point, and no signal is decisive. The team's intuition says "the model is fine, we're at a local maximum." The real diagnosis is that the 200 cases over-represent the easy half of traffic — the half that any reasonable bot would already handle — because production logs are dominated by simple queries. The set is unable to discriminate between a mediocre prompt and a good one, because both pass on the easy cases and the hard cases are too rare.

Second, the compliance team asks "what is the EU-customer pass rate?" Nobody can answer. The set has no slice labels. Splitting it post-hoc requires re-reading 200 transcripts and tagging jurisdiction, which takes a day and a half, during which the launch is blocked.

Third, the labels themselves are wrong on roughly 8% of rows. Nobody noticed because nobody validated them. An ML engineer, working at speed, marked "the bot should refuse this refund" on three cases where policy actually allows it, and "the bot should approve this refund" on five cases where policy forbids it. Every model that passes those rows is being rewarded for being wrong.

Not a coverage problem. Not a label-quality problem in isolation. Not a slice problem in isolation. A provenance and ownership problem — nobody owned the question "why is this row here, who validated the label, and what slice does it represent?" So the natural question becomes: what would the tray need to carry, per row, so that every score the eval emits is defensible all the way down to the row?

When a single bad label silently rewards the wrong behaviour

Here is one row from the team's first naive attempt. It looks fine.

case_id:     v0_037
prompt:      "I bought the wireless headphones two months ago.
              They stopped working last week. I want a refund."
label:       APPROVE_REFUND
labelled_by: <unowned>
slice:       <untagged>
source:      prod_log_2026_03_14

The model that approves this refund gets a green check. The model that refuses it loses a point. But the policy is clear: physical defects after the 30-day window route to the manufacturer warranty, not to a refund. The correct label is OFFER_WARRANTY_HANDOFF. The naive set is teaching the eval that the wrong behaviour is the right one. Every model now climbs toward the wrong target. The score goes up. The user experience does not.

That is the load-bearing failure mode of un-curated golden sets: they create a measurable false north. The team is not failing to measure; they are measuring something, repeatedly, that does not match the policy. The bug is not in the model and not in the metric. It is in the row.
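To make the false north concrete, here is a minimal sketch that scores two behaviours against the naive label and against the policy-correct label for case v0_037. The exact-match `score` function and the two model outputs are assumptions for illustration, not the team's actual runner.

```python
# Hypothetical exact-match scorer: 1 if the model's action equals the label.
def score(model_action: str, label: str) -> int:
    return int(model_action == label)

# Case v0_037 as first labelled, and as the policy says it should be labelled.
naive_label = "APPROVE_REFUND"
policy_label = "OFFER_WARRANTY_HANDOFF"

# Two illustrative model behaviours on the same prompt.
model_wrong = "APPROVE_REFUND"          # ignores the 30-day window
model_right = "OFFER_WARRANTY_HANDOFF"  # follows the policy

naive_scores = {
    "model_wrong": score(model_wrong, naive_label),
    "model_right": score(model_right, naive_label),
}
fixed_scores = {
    "model_wrong": score(model_wrong, policy_label),
    "model_right": score(model_right, policy_label),
}

print("scored against the naive label: ", naive_scores)
print("scored against the policy label:", fixed_scores)
```

Nothing about the models changed between the two runs. Only the row did, and the ranking flipped.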

The rule: every row has provenance, label, slice, freshness, ownership, version — or it isn't golden

State it plainly: a row earns its place in the golden tray only when six things travel with it — where it came from, what the right behaviour is, which slice it represents, when it was last reviewed, who owns it, and which version of the set it belongs to. If any of the six is missing, the row is a candidate, not an asset. If any of the six rots — stale freshness, departed owner, slice taxonomy that changed — the row falls back to candidate until it is re-validated.

This is the chapter's invariant. Every operational signal in section 8, every boundary in section 9, every wrong mental model in section 10, every interview question in the Q&A — all of it relates back to this six-tuple. The rubric from chapter 02 is what produces the label field. The spot check is what samples from the tray. The inspection is what runs the tray and reports. Without the six-tuple per row, all three collapse to vibes-with-a-CSV.

Why this rule exists. Every eval number is a claim about a population. The population is the union of rows in the tray. If you cannot defend a row, you cannot defend the population. If you cannot defend the population, you cannot defend the number.
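One way to make the six-tuple enforceable rather than aspirational is to encode it as a type and a check. The sketch below is illustrative only: the field names, the `max_age_days` freshness threshold, and the example review date are assumptions, not something the chapter prescribes.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional

# The six properties a row must carry. Field names are illustrative.
SIX_PROPERTIES = ("source", "label", "slice_tag", "last_reviewed", "owner", "set_version")

@dataclass(frozen=True)
class Row:
    case_id: str
    prompt: str
    source: Optional[str] = None          # provenance
    label: Optional[str] = None           # correct behaviour code
    slice_tag: Optional[str] = None       # slice the row represents
    last_reviewed: Optional[date] = None  # freshness
    owner: Optional[str] = None           # who signs the label
    set_version: Optional[str] = None     # version of the set it belongs to

def missing_properties(row: Row) -> List[str]:
    """Names of the six properties the row does not carry."""
    return [name for name in SIX_PROPERTIES if getattr(row, name) is None]

def is_golden(row: Row, today: date, max_age_days: int = 365) -> bool:
    """A row is golden only if all six are present and the review is fresh.

    max_age_days is an assumed threshold; pick one that fits your policy cadence.
    """
    if missing_properties(row):
        return False
    return (today - row.last_reviewed).days <= max_age_days

prompt = ("I bought the wireless headphones two months ago. "
          "They stopped working last week. I want a refund.")

# The naive row: it has a source and a label, and nothing else.
naive_row = Row("v0_037", prompt, source="prod_log_2026_03_14", label="APPROVE_REFUND")

# The same case after validation (owner, slice and dates are hypothetical).
validated_row = Row(
    "v0_037", prompt,
    source="prod_log_2026_03_14",
    label="OFFER_WARRANTY_HANDOFF",
    slice_tag="US/single-turn/medium",
    last_reviewed=date(2026, 3, 20),
    owner="policy_lead",
    set_version="v1.0",
)

print(missing_properties(naive_row))
print(is_golden(validated_row, today=date(2026, 4, 1)))
```

The same validated row stops being golden once its review date is older than the threshold, which is the "rots back to candidate" half of the rule.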


1) Where labels come from — four sources, four different lies

Labels are the hard part. A case without a label is just a prompt. A case with a wrong label is worse than no case at all, because it teaches the eval to reward the wrong behaviour. The team needs labels at scale, and there are only four places they actually come from. Each is partial, each lies in a different direction, and the only honest move is to use all four against each other.

Production log mining pulls real prompts from real users. Provenance is unbeatable — these are the questions the bot actually has to answer. The lie is that production logs are dominated by easy queries, biased toward users who do not give up, and missing the silent-failure cases entirely. Coverage is high on volume and low on edge cases.

LLM-generated bootstraps use a strong model to fabricate plausible variants — adversarial inputs, edge cases, multi-turn escalations, jurisdiction-specific phrasing. Coverage is fast and broad. The lie is style drift: LLM-generated prompts read smoother and more grammatical than real users, and they cluster around what the generator model thinks edge cases look like rather than what edge cases actually look like in your traffic. Chapter 04 goes deep on this. For golden-set construction, LLM cases are useful for coverage, dangerous as the only source.

SME labels — subject matter experts who write or validate the correct behaviour field. For a refund bot, this is someone from policy or legal who can read a prompt and write "this should refuse with reason code R-407." Quality is the highest available. The lie is cost and bias: SME time is expensive ($1–5 per row), SMEs see the world through the policy they wrote, and an SME-only set tends to over-represent policy edges and under-represent the boring middle of real traffic.

User feedback signals — thumbs-down, escalations to a human agent, refund-after-the-fact corrections. These are the most direct measure of "did this fail in production?" The lie is volume and bias: feedback is sparse (5–10% of failures, per chapter 01), skewed toward loud users, and arrives only on the obvious failures. Silent failures never reach this stream.

No source is enough alone. A team that uses only production logs gets a tray that cannot discriminate. A team that uses only LLM-generated cases ships a model that looks great on a smooth simulator and worse on real traffic. A team that uses only SME-written cases catches policy edges but misses the messy middle. A team that uses only user feedback chases the loudest 10% of failures while the silent debt accumulates. The four-source mix is not a best practice; it is the only honest tray.

flowchart LR
    A[Production logs<br/>real prompts, easy-biased] --> M[Candidate pool]
    B[LLM-generated<br/>broad coverage, style drift] --> M
    C[SME-written<br/>policy edges, expensive] --> M
    D[User feedback<br/>real failures, sparse + biased] --> M
    M --> V[SME validation<br/>label + slice + reject]
    V --> G[Golden set v1.0<br/>89 rows, 3 slice tags, owned]
    V -. 8 rows rejected .-> X[Candidate reject bin]

For the refund chatbot, the team's mix at v1.0 ends up at roughly 50% production-mined, 30% LLM-generated, 15% SME-written, 5% user-feedback-derived. The exact ratio is workload-dependent — a regulated domain pushes SME share higher, a long-tail consumer product pushes LLM-generated share higher — but the four-source shape is constant.

2) The core mental model — the tray, the candidate pool, and the validation gate

The picture to carry is three boxes and one gate.

   ┌─────────────────────────────────────┐
   │           CANDIDATE POOL            │
   │  prod logs · LLM-generated · SME    │
   │  drafts · user-feedback flags       │
   │  (cheap to add, untrusted)          │
   └──────────────────┬──────────────────┘
                      │
                      ▼
          ┌───────────────────────┐
          │   VALIDATION GATE     │
          │   (SME + rubric)      │
          │  reject · re-label    │
          │  · tag slice · sign   │
          └───────────┬───────────┘
                      │  promote
                      ▼
   ┌─────────────────────────────────────┐
   │         GOLDEN TRAY  (v1.x)         │
   │  provenance · label · slice ·       │
   │  freshness · owner · version        │
   │  (trusted, every row defensible)    │
   └──────────────────┬──────────────────┘
        ┌─────────────┼─────────────┐
        ▼             ▼             ▼
   PR sanity     release gate    drift baseline
   (50 rows)    (200+ rows)     (full set vs prod)

Three things to notice. First, the candidate pool is cheap and untrusted; the golden tray is expensive and trusted. The gate between them is where SMEs spend their time. Second, the same tray feeds three different consumers — fast PR checks, slower release gates, drift baselines — and each consumer pulls a different subset. Third, the arrows point one way: candidates get promoted, golden rows get retired, but rows never sneak into the tray without passing the gate. The moment they do, the tray stops being trusted.

This is what the spot check actually samples from. This is what the rubric is applied against. This is what the inspection runs every Monday. Lose the gate, and all three downstream mechanisms inherit the rot.
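The gate itself can be small. The following is a rough sketch, under the assumption that an SME decision arrives as a structured record; `Candidate`, `Decision`, and `Tray` are hypothetical names, and a real pipeline would persist these to a repo rather than hold them in memory.

```python
from dataclasses import dataclass
from typing import Dict, Optional

@dataclass
class Candidate:
    case_id: str
    prompt: str
    source: str                        # e.g. "prod_log" or "llm_generated"
    draft_label: Optional[str] = None  # untrusted first-pass label

@dataclass
class Decision:
    action: str                        # "promote" or "reject"
    validator: str                     # who signed the decision
    label: Optional[str] = None
    slice_tag: Optional[str] = None

class Tray:
    """Golden tray with a one-way gate: rows enter only through apply()."""

    def __init__(self, version: str) -> None:
        self.version = version
        self.rows: Dict[str, dict] = {}
        self.reject_bin: Dict[str, str] = {}

    def apply(self, cand: Candidate, decision: Decision) -> bool:
        if decision.action == "reject":
            self.reject_bin[cand.case_id] = decision.validator
            return False
        if decision.action != "promote":
            raise ValueError(f"unknown action: {decision.action}")
        if not (decision.label and decision.slice_tag and decision.validator):
            raise ValueError("promotion needs a label, a slice tag and a validator")
        self.rows[cand.case_id] = {
            "prompt": cand.prompt,
            "source": cand.source,
            "label": decision.label,   # the SME label replaces the draft label
            "slice": decision.slice_tag,
            "validator": decision.validator,
            "version": self.version,
        }
        return True

tray = Tray("v1.0")
headphones = Candidate("v0_037", "Headphones stopped working after two months.",
                       "prod_log", draft_label="APPROVE_REFUND")
implausible = Candidate("gen_012", "An implausibly phrased EU request.", "llm_generated")

tray.apply(headphones, Decision("promote", "policy_lead",
                                label="OFFER_WARRANTY_HANDOFF", slice_tag="US/single-turn"))
tray.apply(implausible, Decision("reject", "policy_lead"))
print(sorted(tray.rows), sorted(tray.reject_bin))
```

The design choice worth keeping is that the draft label never reaches the tray: only the label carried by a signed decision does.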

3) The refund chatbot builds its golden set — from 47 mined queries to 89 owned rows

Pick up the running example from chapters 01–02. The team has the 38-point demo-to-prod gap, a taxonomy that says they need an offline regression set, and zero labelled rows. Day one of construction.

Step 1 — log mining (Tuesday). An engineer pulls the last 30 days of refund-bot conversations. 12,847 chats total. After deduplication (same user repeating themselves), PII scrubbing (names, card numbers, addresses), and intent filtering (drop pure FAQ lookups), the candidate pool is 1,203 conversations. Manual triage by the support lead picks 47 that look representative of the hard cases — a mix of "I want a refund," "shipment never arrived," "wrong item delivered," "address auto-filled wrong," and the awkward edge cases.

Step 2 — LLM-generated coverage (Wednesday morning). The same engineer prompts Claude to generate 50 additional cases across slices the prod logs under-covered: EU jurisdiction cases (refund rights differ under EU consumer law), multi-turn escalations (3+ turns where the bot lost context), and adversarial framings ("my friend said you'd give me a free refund if I complain enough"). The candidate pool is now 97 rows.

Step 3 — SME validation (Wednesday afternoon to Thursday). The policy lead, the actual authority on what the bot is supposed to do, walks every row. She labels each one with the correct behaviour code from the policy doc (APPROVE, REFUSE_WITH_REASON, ESCALATE_TO_HUMAN, OFFER_WARRANTY_HANDOFF, REQUEST_MORE_INFO). She rejects 8 rows outright: 3 from prod logs that turned out to be duplicates of corner-case behaviours she did not want over-weighted, and 5 LLM-generated rows that were not actually plausible ("no real EU customer would phrase it that way"). She also flags 8 prod-log rows where the engineer's first-pass label was wrong, exactly the failure mode from the single-bad-label example earlier in this chapter. The candidate pool is now 89 validated rows.

Step 4 — slice tagging (Thursday). Each row gets three slice tags: jurisdiction (US / EU / other), turn count (single / multi), and policy risk (low / medium / high). The slice table looks like this:

| Slice | n | Source mix (prod / LLM / SME) | Expected pass-rate floor |
|---|---|---|---|
| US, single-turn, low risk | 31 | 28 / 2 / 1 | 90% |
| US, multi-turn, medium risk | 18 | 14 / 3 / 1 | 80% |
| EU, any-turn, any risk | 22 | 1 / 18 / 3 | 75% |
| Adversarial framings | 12 | 0 / 11 / 1 | 70% |
| User-flagged past failures | 6 | 4 / 0 / 2 | 60% |
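The pass-rate floors in the slice table can be turned into a per-slice check. This is a sketch only: the slice sizes and floors are copied from the table, while `slices_below_floor` and the example pass counts are made up for illustration.

```python
from typing import Dict, List, Tuple

# (row count, expected pass-rate floor) per slice, copied from the slice table.
SLICE_FLOORS: Dict[str, Tuple[int, float]] = {
    "US, single-turn, low risk": (31, 0.90),
    "US, multi-turn, medium risk": (18, 0.80),
    "EU, any-turn, any risk": (22, 0.75),
    "Adversarial framings": (12, 0.70),
    "User-flagged past failures": (6, 0.60),
}

def slices_below_floor(passes: Dict[str, int]) -> List[str]:
    """Return the slices whose observed pass rate is under the floor."""
    failing = []
    for name, (n, floor) in SLICE_FLOORS.items():
        if passes.get(name, 0) / n < floor:
            failing.append(name)
    return failing

total_rows = sum(n for n, _ in SLICE_FLOORS.values())

# Hypothetical results from one eval run (number of passing rows per slice).
example_passes = {
    "US, single-turn, low risk": 29,
    "US, multi-turn, medium risk": 14,
    "EU, any-turn, any risk": 17,
    "Adversarial framings": 9,
    "User-flagged past failures": 4,
}

print(total_rows)
print(slices_below_floor(example_passes))
```

In this made-up run the aggregate looks respectable (73 of 89), yet one slice sits under its floor, which is exactly what an aggregate-only report would hide.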

Step 5 — versioning and ownership (Friday). The set is committed as golden_set_v1.0. The policy lead owns labels. The ML engineer owns composition. The repo records every row's source, label, slice tags, validator initials, and timestamp. Future PRs run against v1.0. Future releases that change the bot's behaviour must re-validate any row the policy lead flags as affected.

That is the full week. 89 rows, six properties per row, three slice tags, two owners, one version. The team can now answer every defensibility question from earlier. Where did this row come from? The provenance field. Who said the label is right? The validator initials. Does this represent EU traffic? The slice tag. Did the set change since last Monday? The version number.

Mini-FAQ. "Why 89 and not 200 or 1000?" Because 89 was the honest output of one week of work. The sizing question (what 89 is enough for, what it isn't) is section 4. The rule is to start with what the week produces, ship the regression gate, and grow the set deliberately when specific decisions need more power.

4) Sizing — when 50 sanity rows beat 1000 noisy ones

How big does the tray need to be? The answer is workload-shaped, not best-practice-shaped. Three operating points cover most of the decisions a team actually faces.

~50 rows — the sanity tier. Catches catastrophic regressions: the prompt change that broke parsing, the model swap that disabled tool use, the config edit that silenced refusals. 50 rows takes a couple of minutes to run, fits inside a PR gate, and catches the large regressions, the ones that move pass rate by 10+ points. It is useless for distinguishing two prompts that differ by 2 points, because the sampling noise on 50 binary outcomes is large: the 95% margin on a single pass-rate estimate is roughly ±8 to ±14 points, depending on the pass rate. Use it as the cheap inner loop, not the decision gate.

~200 rows — the regression tier. Resolves changes that move pass rate by 3–5 points. Sliced into 4–5 segments, each slice has 30–50 rows, which is enough to flag a collapsed slice but not enough to declare a slice "fixed." This is the right size for the release gate of a single-feature product. The refund-chatbot team's 89-row v1.0 is undersized for this tier; they need to grow it to ~200 before they trust release decisions on it.

1000+ rows — the confident-swap tier. The size at which a model swap, a major prompt rewrite, or a retrieval-pipeline change can be evaluated with slice-level statistical confidence. Each slice carries enough rows that a 2-point pass-rate move is real. This is the size at which Claude 3.5 → Claude 4 can be defended on internal data instead of vendor benchmarks. Cost: dozens of SME hours, and a labelling pipeline that runs continuously instead of in bursts.

The mistake teams make is jumping straight to 1000+ before they have a working 50. The right order is sanity → regression → confident-swap, with each tier earning its budget by proving its predecessor is insufficient for an actual decision the team faced.
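How much noise each tier carries can be estimated with the usual normal approximation to a binomial proportion. The sketch below assumes independent binary pass/fail outcomes; `margin_95` is an illustrative helper, and the pass rates are chosen only to show the spread.

```python
import math

def margin_95(n: int, pass_rate: float) -> float:
    """Half-width of a normal-approximation 95% interval, in percentage points."""
    return 1.96 * math.sqrt(pass_rate * (1.0 - pass_rate) / n) * 100.0

# Margin at each tier size for a few plausible pass rates.
for n in (50, 200, 1000):
    margins = [round(margin_95(n, p), 1) for p in (0.5, 0.75, 0.9)]
    print(f"n={n}: margins at 50% / 75% / 90% pass rate = {margins}")
```

These are margins for a single pass-rate estimate. Comparing two candidates on the same rows is a paired comparison and is typically less noisy, so treat the numbers as a conservative guide rather than a hard rule.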

| Tier | Size | What it catches | What it cannot catch | Cost to build | Run cost |
|---|---|---|---|---|---|
| Sanity | ~50 | 10+ point regressions | 2–5 point deltas | ~1 day | ~$5 / run |
| Regression | ~200 | 3–5 point deltas, slice collapse | model-vs-model fine grain | ~1 week | ~$20 / run |
| Confident-swap | 1000+ | 2-point slice-level moves | rare long-tail edges (<1%) | ~1–2 months | ~$100 / run |
| Long-tail audit | 5000+ | true distribution of edge cases | nothing about typical traffic | ongoing | ~$500 / run |

Cost numbers assume LLM-judge scoring at roughly $0.001–0.02 per row plus occasional SME spot-validation. SME-only scoring runs 100–500× higher.

5) Production-mined vs LLM-generated vs SME-written vs synthetic-only — picking the mix per workload

The four-source rule says use all four. The mix changes by workload. Three concrete profiles.

High-volume consumer product, low per-error cost. Think a casual-chat assistant or a summarisation tool. The dominant pressure is coverage of the long tail. Mix leans heavy on production-mined (60%) and LLM-generated (30%), light on SME (5%) and user-feedback (5%). SMEs are expensive and the per-error cost does not justify their hours.

Regulated domain, high per-error cost. Refund bot, medical-triage, legal drafting. Per-error cost is huge — one wrong refund decision can be a tribunal case (Air Canada, chapter 01). SME share rises sharply: 15–25%. Production-mined drops to ~40% because real traffic is dominated by easy cases that do not stress policy edges. LLM-generated rises to ~30% to cover adversarial and jurisdictional variants. User-feedback rises to ~10% because past failures are the most credible test cases.

Long-tail enterprise search. Glean-style internal search across documents nobody has read. Production-mined is the only source that knows what queries actually happen. Mix is 80% production-mined, 10% LLM-generated (for adversarial query types), 10% SME (for the few high-stakes departmental queries). Synthetic-only would be disastrous here because the generator does not know the workspace.

The wrong-default-for-every-workload is synthetic-only. It looks attractive — cheap, fast, broad — and it produces a tray that the model can pass without ever seeing a real user input. Every team that has shipped synthetic-only has rediscovered the same lesson: the production distribution is shaped like nothing the generator imagined.

For the refund-chatbot team in section 3, the regulated-domain profile is the right one, which is why their v1.0 mix landed at roughly 50/30/15/5 and why the policy lead's 8 rejections were not waste — they were the cost of being in a regulated domain.
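When planning the next version of the tray, a target mix can be turned into row counts per source. This is a minimal sketch: the percentages come from the profiles in this section, while the profile keys and the `allocate` helper (largest-remainder rounding) are assumptions for illustration.

```python
from typing import Dict

# Target source mix in percent, taken from the profiles described above.
PROFILES: Dict[str, Dict[str, int]] = {
    "consumer": {"prod": 60, "llm": 30, "sme": 5, "feedback": 5},
    "enterprise_search": {"prod": 80, "llm": 10, "sme": 10, "feedback": 0},
    "refund_bot_v1": {"prod": 50, "llm": 30, "sme": 15, "feedback": 5},
}

def allocate(total_rows: int, mix: Dict[str, int]) -> Dict[str, int]:
    """Split total_rows across sources so the counts sum exactly to total_rows."""
    if sum(mix.values()) != 100:
        raise ValueError("mix percentages must sum to 100")
    exact = {src: total_rows * pct / 100 for src, pct in mix.items()}
    counts = {src: int(value) for src, value in exact.items()}
    leftover = total_rows - sum(counts.values())
    # Hand the remaining rows to the sources with the largest fractional parts.
    by_remainder = sorted(mix, key=lambda src: exact[src] - counts[src], reverse=True)
    for src in by_remainder[:leftover]:
        counts[src] += 1
    return counts

plan_200 = allocate(200, PROFILES["refund_bot_v1"])
plan_89 = allocate(89, PROFILES["refund_bot_v1"])
print(plan_200)
print(plan_89)
```

The output is a sourcing budget, not a guarantee: rows still have to pass the validation gate, so the candidate pool for each source needs to be larger than its target count.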

6) The four-source mix as defence against single-source pathology

The reason all four matter is that each one fails in a different direction, and the failures partially cancel. Production logs miss edge cases; LLM generation invents them. LLM generation drifts in style; production logs anchor it. SMEs over-weight policy edges; production logs swamp them with the easy middle. User feedback misses silent failures; the other three sources mine for them.

When the refund-chatbot team's v1.0 was being assembled, the policy lead caught 8 mislabels and rejected 8 rows (section 3) precisely because the four-source structure made the contradictions visible. Some prod-log cases looked like clean refund approvals to the engineer; the policy lead, holding the policy doc open, saw they were warranty-handoff cases. Five LLM-generated EU cases sounded plausible to the engineer; the policy lead, who deals with actual EU customers, said the phrasing was wrong and rejected them. The mix did not just produce coverage. It produced cross-validation: every source got checked against the others, and the disagreements surfaced the bugs.

A single-source tray loses this property entirely. Production-only has nothing to validate the labels against. LLM-only has nothing to anchor to real users. SME-only has nothing to test against typical traffic. The four-source mix is the same epistemic move as triangulating a position with three landmarks instead of one — each landmark is imprecise, but the intersection is sharp.

7) Leakage discipline — when the test set is secretly in the prompt

The single most expensive mistake in golden-set construction is leakage: the eval rows have somehow ended up inside the system the eval is testing. Three flavours show up regularly.

Few-shot leakage. The team adds example conversations to the system prompt to teach the bot how to respond. Six months later, somebody pulls those same conversations into the golden set as "real cases we should pass." The bot now passes them because it has memorised them, not because it generalises. The slice pass rate looks great. Production behaviour does not change. The fix is to keep the prompt examples in a separate file with a do-not-use-in-eval tag, and to enforce that tag in the dataset-loading code.

Fine-tune leakage. The model was fine-tuned on a corpus that included production transcripts. The golden set, drawn from the same transcripts, is now testing memorisation. This is harder to catch because the corpus is upstream of the team's view. The discipline is provenance metadata on every row: if a row's source timestamp predates the model's training cutoff, it is at risk of leakage and should be replaced or held out.

Judge leakage. A judge model (chapter 06) is given the rubric and the golden answers in its system prompt, and is then asked to score model outputs. The judge now matches outputs to canned answers rather than evaluating them on the rubric. This collapses when a new model gives a correct-but-differently-worded answer that the judge marks wrong.

For the refund-chatbot team, the discipline at v1.0 is: golden rows are stored in a file the prompt-builder never reads, fine-tuning corpus (if any) is filtered against the golden-row IDs before training, and the judge prompt contains only the rubric — never the canonical answer.
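Two of these checks are easy to automate in the dataset-loading code. The sketch below assumes each golden row is a dict with `case_id`, `prompt`, and `source_date` keys (hypothetical field names). It only catches verbatim copies after whitespace and case normalisation; paraphrased leaks would need fuzzy matching.

```python
import hashlib
import re
from datetime import date
from typing import Dict, Iterable, List

def fingerprint(text: str) -> str:
    """Hash of the text after lowercasing and collapsing whitespace."""
    normalised = re.sub(r"\s+", " ", text.strip().lower())
    return hashlib.sha256(normalised.encode("utf-8")).hexdigest()

def leaked_case_ids(golden_rows: List[Dict], protected_texts: Iterable[str]) -> List[str]:
    """Golden rows whose prompt also appears in few-shot examples or a training corpus."""
    protected = {fingerprint(t) for t in protected_texts}
    return [r["case_id"] for r in golden_rows if fingerprint(r["prompt"]) in protected]

def at_risk_of_finetune_leak(golden_rows: List[Dict], training_cutoff: date) -> List[str]:
    """Rows whose source predates the model's training cutoff."""
    return [r["case_id"] for r in golden_rows if r["source_date"] < training_cutoff]

# Illustrative data, not taken from the team's set.
golden = [
    {"case_id": "g1", "prompt": "Where is my refund?", "source_date": date(2026, 3, 14)},
    {"case_id": "g2", "prompt": "Wrong item delivered, what now?", "source_date": date(2025, 1, 5)},
]
few_shot_examples = ["  where is   my REFUND? ", "How do I change my address?"]

print(leaked_case_ids(golden, few_shot_examples))
print(at_risk_of_finetune_leak(golden, training_cutoff=date(2025, 6, 1)))
```

Running a check like this at load time, and failing the eval when it returns anything, is one way to enforce the do-not-use-in-eval rule mechanically instead of by convention.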

Teacher voice. Treat the golden set like a held-out test set in classical ML. If you can show me the row before evaluation, you can train against it. If you can train against it, the eval is no longer measuring generalisation.

8) Operational signals — the tray is healthy, stale, or leaked

A team that uses the golden set for a year will watch three signal layers.

The healthy signature has four properties. The set's row count is growing — slowly, deliberately, with each addition logged. The label-disagreement rate when SMEs spot-validate is under 5%. The set's aggregate pass rate moves when the model genuinely changes and stays flat when nothing meaningful changed. And the slice-level pass rates correlate with user-visible quality on the corresponding slice (EU pass rate up, EU CSAT up).

The first signal of rot is also the easiest to catch: the set has not been updated in 60 days, and production traffic has visibly shifted (new product lines, new policy, new regions). The set is now testing yesterday's product. The score keeps moving and the team keeps celebrating, but the inspection is measuring the wrong distribution. The fix is a quarterly refresh ritual: pull 30 recent prod conversations, walk them through the validation gate, decide which deserve promotion to the tray.

The second signal is more subtle: the pass rate climbs steadily over months while user CSAT does not. This is the Goodhart signal from chapter 01, applied one layer down. The team has been tuning to a set that no longer represents what users care about, either because the set's slice mix drifted from production slice mix, or because the labels themselves encode an older version of the policy.

The third signal, the deepest and hardest to spot, is leakage. The diagnostic is an unreasonably high score: the golden set pass rate is 95%+ while live sampling shows 70%. A 25-point gap between a held-out set and the live distribution is very hard to produce by chance; the most common cause is that the held-out set is no longer held out. Inspect the few-shot examples in the system prompt, the fine-tuning corpus, and the judge prompt. Something has leaked.

The metric a beginner watches first is golden-set pass rate. The metric an experienced team watches first is golden-set pass rate minus live-sample pass rate. The dashboard an expert opens first is the slice-level delta plot over the last 8 weeks: one small sparkline per slice, showing whether any slice is drifting independently of the aggregate.
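The staleness and leakage signals reduce to two comparisons. Here is a small sketch using the 60-day and 25-point figures from this section as default thresholds; the function name, the flag strings, and the example dates are assumptions.

```python
from datetime import date
from typing import Dict

def tray_signals(
    golden_pass_pct: float,
    live_pass_pct: float,
    last_updated: date,
    today: date,
    stale_after_days: int = 60,
    leak_gap_points: float = 25.0,
) -> Dict[str, object]:
    """Flag a tray as stale and/or possibly leaked from two simple comparisons."""
    gap = round(golden_pass_pct - live_pass_pct, 1)
    flags = []
    if (today - last_updated).days > stale_after_days:
        flags.append("stale")
    if gap >= leak_gap_points:
        flags.append("possible_leak")
    return {"gap_points": gap, "flags": flags}

# Illustrative readings: one suspicious tray, one healthy tray.
suspicious = tray_signals(95.0, 70.0, last_updated=date(2026, 1, 5), today=date(2026, 4, 1))
healthy = tray_signals(82.0, 79.0, last_updated=date(2026, 3, 20), today=date(2026, 4, 1))
print(suspicious)
print(healthy)
```

A flag here is a prompt to investigate, not a verdict: a large gap can also come from slice-mix drift, so the per-slice view is still the next thing to open.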

9) Boundary of applicability — when 89 hand-curated rows are enough, and when they're not

A small hand-curated set is the right tool when three conditions hold. The product is narrow enough that the failure surface (roughly, the number of distinct ways the bot can be wrong) fits in a person's head. The per-error cost is bounded enough that a 3-point slice miss is recoverable. And the team has direct access to an SME who can validate labels in hours, not weeks. For an internal tool, a single-feature consumer product, or the v1.0 of a regulated bot, 50–200 hand-curated rows is honest evidence.

The boundary breaks the moment any of those conditions fails. A product that handles a million conversations a day has a failure surface that does not fit in one person's head; 200 rows will miss long-tail failures that hit thousands of users daily. A regulated bot where a single wrong answer is a tribunal case cannot defend a 5-point slice estimate; the bar is statistical confidence, which requires 1000+ rows per high-stakes slice. A team without SME access cannot validate labels fast enough to keep the tray fresh, so the tray rots before the model does.

At the far end, the pathology is the frozen prestige set. A team builds a beautiful 500-row golden set, ships it, and never updates it because updating it is politically expensive — every change invalidates historical comparisons. Two years later the set is testing a product that no longer exists. The discipline is to budget staleness as an operational cost from day one, and to publish version diffs (what was added, what was retired, why) every quarter.

The scale limit the refund-chatbot team will eventually hit is the EU slice. 22 rows is enough to detect a complete EU regression. It is not enough to declare an EU fix successful — the confidence interval on 22 binary outcomes is too wide. When the team needs to ship a major EU-specific change, they will need to grow that slice to ~200 rows before the decision can be defended.
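To see how wide the interval on a 22-row slice is, a Wilson score interval is a common choice for small binomial samples. The pass counts below are invented for illustration; only the slice sizes (22 today, roughly 200 as the target) come from the text.

```python
import math
from typing import Tuple

def wilson_interval(passes: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 gives about 95%)."""
    p = passes / n
    denom = 1.0 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1.0 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half

# Hypothetical results at a similar pass rate on the current and target slice sizes.
eu_22 = wilson_interval(17, 22)
eu_200 = wilson_interval(155, 200)

for name, (lo, hi) in (("22 rows", eu_22), ("200 rows", eu_200)):
    print(f"{name}: {lo:.2f} to {hi:.2f} (width {(hi - lo) * 100:.0f} points)")
```

At 22 rows the interval spans tens of points, which is enough to notice a collapse but not enough to tell a fix from noise; at 200 rows it is several times narrower.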

10) Wrong mental model — "the golden set is a fixed asset"

The seductive belief is that a golden set is built once, frozen, and run forever. The same tray, every release, for years. Stability of the tray equals stability of the measurement. The team that resists adding rows feels disciplined; the team that resists removing rows feels rigorous.

This is wrong in a way that is initially invisible. Production traffic is not static. The product changes. The user base changes. The policy changes. The model itself changes. A tray frozen for a year is measuring a product distribution that no longer exists. The score is stable because the test is stable; the user experience is drifting because the test is not stable with respect to reality. The team is measuring fidelity to an old snapshot of the product and calling it quality.

Replace the wrong model with the right one: the golden set is a living instrument that drifts when traffic drifts, and the team's job is to keep the instrument aligned with the territory it is supposed to measure. Concretely, that means a quarterly refresh ritual, a documented retirement policy (rows older than 18 months are reviewed for relevance, rows representing retired product features are retired), a slice-mix audit against production slice mix (is the EU share of the golden set still close to the EU share of traffic?), and version diffs published with every release.
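Two parts of that ritual, the age review and the slice-mix audit, lend themselves to a short script. This is a sketch under assumptions: rows are dicts with `case_id` and `last_reviewed`, a month is approximated as 30 days, and the traffic shares are invented numbers.

```python
from datetime import date
from typing import Dict, List

def rows_due_for_review(rows: List[Dict], today: date, max_age_months: int = 18) -> List[str]:
    """Case ids whose last review is older than the retirement-review window."""
    cutoff_days = max_age_months * 30  # coarse month length, an assumption
    return [r["case_id"] for r in rows if (today - r["last_reviewed"]).days > cutoff_days]

def slice_mix_drift(golden_counts: Dict[str, int], prod_share_pct: Dict[str, float]) -> Dict[str, float]:
    """Golden-set share minus production share per slice, in percentage points."""
    total = sum(golden_counts.values())
    slices = sorted(set(golden_counts) | set(prod_share_pct))
    return {
        s: round(golden_counts.get(s, 0) / total * 100 - prod_share_pct.get(s, 0.0), 1)
        for s in slices
    }

# Illustrative inputs.
rows = [
    {"case_id": "old_row", "last_reviewed": date(2024, 1, 10)},
    {"case_id": "fresh_row", "last_reviewed": date(2026, 1, 10)},
]
golden_counts = {"US": 60, "EU": 25, "other": 15}
prod_share_pct = {"US": 40.0, "EU": 35.0, "other": 25.0}

print(rows_due_for_review(rows, today=date(2026, 4, 1)))
print(slice_mix_drift(golden_counts, prod_share_pct))
```

A positive drift means the tray over-represents that slice relative to traffic; whether that is a problem depends on whether the over-weighting is deliberate, as it often is for high-risk slices.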

The asymmetry to internalise: it is much easier to detect a model regression than a dataset regression. When the model regresses, the score drops. When the dataset regresses — the rows go stale, the labels go wrong, the slice mix drifts — the score does not drop. It just stops being meaningful. The drop is silent.

11) Six other failure shapes a golden set can produce

  • Label drift inside a "stable" set. A row's label was right under the old policy and wrong under the new one. Nobody updates it. The eval rewards old behaviour.
  • Slice-mix drift. The golden set is 60% US even though traffic moved to 40% US after a global launch. The aggregate score is no longer representative.
  • Ownership collapse. The policy lead who validated labels leaves the company. Nobody inherits authority. New rows pile up unvalidated.
  • Vibes-graduated rows. Somebody adds a "real customer complaint" to the set without SME validation because it felt obviously wrong. Three months later the label turns out to be incorrect.
  • Adversarial-only inflation. The team is proud of its 200 adversarial cases and forgets that 95% of real traffic is non-adversarial. Adversarial pass rate becomes the headline; aggregate quality on typical traffic degrades unmonitored.
  • The eternal v1.0. The version is never bumped. The set has changed silently dozens of times. No historical comparison is possible. Every score is its own footnote.

Each of these is a specific failure of the six-tuple rule. Each one disappears when provenance, label, slice, freshness, owner, and version are enforced as gate conditions, not optional metadata.
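
One way to read "gate conditions, not optional metadata" is a check that refuses any row with a missing field before it reaches the tray. The sketch below assumes rows are plain dicts keyed by the six field names plus a `case_id`; that row shape and the function names are illustrative assumptions, not the team's actual tooling.

```python
REQUIRED_FIELDS = ("provenance", "label", "slice", "freshness", "owner", "version")


def missing_fields(row: dict) -> list[str]:
    """Names of six-tuple fields that are absent or blank on this row."""
    return [f for f in REQUIRED_FIELDS if not str(row.get(f) or "").strip()]


def gate(candidates: list[dict]) -> tuple[list[dict], list[tuple[str, list[str]]]]:
    """Split candidates into admitted rows and (case_id, missing fields) rejections."""
    admitted, rejected = [], []
    for row in candidates:
        missing = missing_fields(row)
        if missing:
            rejected.append((row.get("case_id", "<no id>"), missing))
        else:
            admitted.append(row)
    return admitted, rejected


# Hypothetical candidates: the second row was added on vibes and has no owner.
candidates = [
    {"case_id": "a1", "provenance": "prod log", "label": "REQUEST_MORE_INFO",
     "slice": "US/multi/low", "freshness": "new", "owner": "policy lead",
     "version": "v1.1"},
    {"case_id": "a2", "provenance": "user escalation", "label": "ESCALATE_TO_HUMAN",
     "slice": "US/multi/high", "freshness": "new", "owner": "", "version": "v1.1"},
]
admitted, rejected = gate(candidates)
print(rejected)
```

A presence check like this catches the vibes-graduated row and the eternal v1.0, because neither can supply an owner or a bumped version. It cannot tell whether a label is correct; that remains the SME's job.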

12) Where this pressure recurs

  • Same invariant, later module. Chapter 04 (synthetic generation) is the same pressure as this one, sharpened: when you generate test cases instead of mining them, the provenance and style drift problems get worse, and the validation gate becomes load-bearing. The four-source mix is the upstream defence; chapter 04 is the downstream technique.
  • Failure geometry recurs. The leakage pathology in section 7 is the same geometry as overfitting in classical ML (05_model_evaluation). Different layer, same shape: the test set has secretly become part of the training input.
  • Constraint echo. The ownership-collapse failure (label drift, retired SMEs) is the same operational pressure as on-call ownership in production systems — when nobody owns the artifact, the artifact rots, regardless of how good v1.0 was.

13) A fast self-test before you call a tray "golden"

  • Can every row name its provenance, label, slice, freshness, owner, and version?
  • Can the policy/domain owner sign off on every label without re-reading the policy doc?
  • Does the slice mix match production traffic within 10 percentage points per slice?
  • Have rows been added or retired in the last 90 days, with logged reasons?
  • Is the tray stored in a file that the system prompt cannot read?

Five yeses means you have a golden set. One no means the next eval number it produces is partly fiction.


Where labelled trays earn their keep in the wild

The market reveals who has done this work and who has skipped it.

  • Anthropic's evals cookbook — publishes scaffolding for building labelled eval sets specifically because every customer team rediscovers that without the tray, alignment-tuned models look indistinguishable from base models on real tasks.
  • OpenAI Evals — the registry model is a labelled-tray-as-code pattern; rows are versioned in git, labels travel with provenance, slices are first-class.
  • LangSmith Datasets — exists as a product because every team running LangChain agents rebuilt this layer themselves; the dataset object enforces version + ownership + slice tags.
  • Braintrust — dataset versioning and experiment comparison are the headline features, because the customers who buy Braintrust have already been burned by silent dataset changes.
  • Promptfoo — local-first golden set runner; the asset format is YAML with explicit per-row metadata because the team learned that CSV-only flows lose provenance fast.
  • Vectara HHEM — exists because customer deployments kept missing real hallucinations with faithfulness-only scoring on un-curated test sets; the company is the operationalisation of "your golden set is not enough."
  • RAGAS — provides synthetic test-set generation paired with grounding-aware metrics; the explicit warning in the docs is that synthetic-only trays drift in style from real queries.
  • BEIR and MTEB — public retrieval benchmarks with versioned labelled queries; the slice-by-domain structure is exactly the four-source-mix problem at academic scale.
  • Casetext CoCounsel (post-Avianca) — the golden set for citation accuracy is owned jointly by ex-litigators and ML engineers, because the Mata v. Avianca incident proved that ML engineers cannot own legal labels alone.
  • Harvey — uses BigLaw associate review as the SME validation gate; every row in the golden set is signed by an attorney, not a labeller.
  • Hebbia — finance-domain golden sets are owned by ex-buy-side analysts, not data labellers, because the per-error cost of a wrong earnings-extraction makes SME labelling cost-justified.
  • Glean — enterprise search golden sets are per-customer-tenant because what counts as a "good" answer differs by org; the multi-tenant tray is the operationalisation of "the slice mix must match production traffic."
  • Notion AI Q&A — workspace-specific golden sets feed pre-release evals; the team learned that a generic Q&A set could not catch workspace-shape failures.
  • GitHub Copilot Chat — repo-shaped golden sets cover multi-language and multi-style code; each repo type is a slice, with separate pass-rate targets.
  • Cursor — tool-call success rate is measured on a held-out repo benchmark with explicit version pins; the team rejects releases that regress on a single named slice.
  • Perplexity — citation-accuracy golden set is human-validated and refreshed weekly; the team treats freshness as a primary operational property of the tray.
  • Intercom Fin — golden tickets are sampled per customer-tenant and validated by customer support leads; the deflection-rate claim is footnoted by the per-tenant tray.
  • Salesforce Einstein Copilot — adversarial-injection golden cases are owned by the trust layer team, separate from the standard-behaviour cases owned by feature teams.
  • AWS Bedrock Knowledge Bases — the retrieval-failure-analysis tooling exists because customers needed slice-level diagnostics on labelled trays they built themselves.
  • Air Canada (2024) — counter-example. There is no public evidence of a golden set covering policy-violation edge cases. The tribunal's finding implied the missing slice directly.

The pattern is consistent. Teams that ship reliably have a labelled tray with provenance, ownership, slices, and a version number. Teams that do not are running the inspection on a CSV.

Recall — can you reproduce the chapter cold?

  1. Name the six properties that turn a CSV of test cases into a golden dataset.
  2. Why is a four-source mix more honest than any single source, even a great single source?
  3. In the refund-chatbot example, why did the SME reject 8 rows during validation, and what did each rejection teach?
  4. State the chapter's load-bearing rule about what makes a row "golden."
  5. At what tray size can you trust a 2-point pass-rate move on a single slice?
  6. Name three flavours of leakage and how each is detected.
  7. What is the first operational signal that a golden set has gone stale?
  8. Why is "the golden set is a fixed asset" the wrong mental model, and what is the right one?

Interview Q&A

Q1. Your team has 200 saved transcripts labelled by a junior engineer. The PM calls it a golden set. Is it?

A. Not yet. A golden set is the six-tuple per row (provenance, label, slice, freshness, owner, version), and a labelled transcript dump usually has at most two of those: source and label. The first move is to walk the 200 rows through SME validation, expect 5–15% label corrections, add slice tags, name an owner for labels and an owner for composition, and freeze the result as v1.0. Until that work happens, the 200 rows are a candidate pool, and any eval number they produce is partly fiction. The most dangerous version is the one nobody questions because "we have a golden set." Common wrong answer to avoid: "If it has labels, it's a golden set."

Q2. Why not just use 5,000 LLM-generated cases? The model is cheap and the coverage is broad.

A. Three failure modes stack. Style drift — LLM-generated prompts read smoother than real users, so the model gets rewarded for handling cleaner inputs than it will actually face. Coverage hallucination — the generator covers what it imagines edge cases look like, not what edge cases look like in your traffic. And the validation problem — 5,000 LLM-generated rows still need labels, and labelling 5,000 rows is the expensive step regardless of where the prompts came from. The honest mix is four-source — production logs anchor style, LLM generation fills coverage, SMEs fix labels, user feedback flags real failures. Each source corrects a different lie in the others. Common wrong answer to avoid: "Synthetic data is now good enough to be the only source."

Q3. Your golden set pass rate is 94%. Your live sample pass rate is 71%. What do you investigate first?

A. Leakage. A 23-point gap between a held-out tray and a live sample is mechanically hard to produce honestly; the most common cause is that the held-out tray has secretly entered the system. Check three places: the few-shot examples in the system prompt (are any of them in the golden set?); the fine-tuning corpus (does it predate the golden set's row provenance?); and the judge prompt (does it contain canonical answers, not just the rubric?). If all three are clean, the second hypothesis is slice-mix drift — the golden set's slice mix no longer matches production. If that is also clean, you have an unusually good model on test cases and an unusually bad one on real ones, which means the test cases are not representative. Common wrong answer to avoid: "The model is great; we just need to retrain on production data."
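
The first of those three checks is easy to automate in a crude form. The sketch below flags golden rows whose prompt text appears, after light normalisation, inside any text the system can see (a system prompt, few-shot examples, a judge prompt). All names and example strings are hypothetical, and exact-substring matching is an assumption made for brevity: it misses paraphrased leaks, so a clean result is weak evidence rather than proof.

```python
import re


def normalise(text: str) -> str:
    """Lowercase, drop punctuation, collapse whitespace."""
    no_punct = re.sub(r"[^\w\s]", "", text.lower())
    return re.sub(r"\s+", " ", no_punct).strip()


def leaked_rows(golden_prompts: dict[str, str], exposed_texts: list[str]) -> list[str]:
    """Case ids whose normalised prompt is a substring of any exposed text."""
    haystacks = [normalise(t) for t in exposed_texts]
    leaked = []
    for case_id, prompt in golden_prompts.items():
        needle = normalise(prompt)
        if needle and any(needle in h for h in haystacks):
            leaked.append(case_id)
    return leaked


# Hypothetical golden rows and a hypothetical system prompt with one few-shot example.
golden_prompts = {
    "g_001": "Can I get a refund after 45 days?",
    "g_002": "My order arrived damaged, what now?",
}
system_prompt = "Example: can I get a refund after 45 days? -> REFUSE_WITH_REASON"
print(leaked_rows(golden_prompts, [system_prompt]))
```

The same function can be pointed at the fine-tuning corpus or the judge prompt by passing those texts in `exposed_texts`.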

Q4. You inherit a 6-month-old golden set. It looks well-organised. What's your first audit step?

A. Pull the version log and the slice-mix comparison against current production. Two things usually rot first. The slice mix — if the product launched in EU two months ago and the golden set is 95% US, the aggregate score is no longer representative of the user experience. And the freshness — rows older than 6 months should be re-validated against the current policy and product, because labels that were right at v1.0 may be wrong at v1.x without anyone updating them. The third audit step is ownership — if the SME who validated labels has left, the trust chain is broken until somebody re-signs. Common wrong answer to avoid: "If the set is well-organised, it's still good."
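
The freshness and ownership parts of that audit reduce to two filters over the row metadata. This sketch assumes each row carries a `last_validated` date and an `owner` name, and that the team keeps a roster of people who still hold label authority; those field names, the roster, and the 183-day stand-in for "6 months" are all illustrative assumptions.

```python
from datetime import date, timedelta


def audit_rows(
    rows: list[dict],
    today: date,
    active_owners: set[str],
    max_age_days: int = 183,
) -> dict[str, list[str]]:
    """Flag rows due for re-validation and rows whose owner is no longer active."""
    cutoff = today - timedelta(days=max_age_days)
    stale = [r["case_id"] for r in rows if r["last_validated"] < cutoff]
    orphaned = [r["case_id"] for r in rows if r["owner"] not in active_owners]
    return {"stale": stale, "orphaned": orphaned}


# Hypothetical rows: one validated long ago, one signed by someone who has left.
rows = [
    {"case_id": "g_010", "last_validated": date(2025, 10, 1), "owner": "policy lead"},
    {"case_id": "g_011", "last_validated": date(2026, 3, 1), "owner": "former SME"},
]
print(audit_rows(rows, today=date(2026, 6, 1), active_owners={"policy lead"}))
```

Stale rows go back to the label owner for re-validation; orphaned rows need a new signature before their labels can be trusted again.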

Q5. Your PM says "we need a 1000-row golden set before launch." It's Tuesday and launch is next Tuesday. What's the right move?

A. Push back with the tier ladder. 1000 rows is the confident-swap tier and takes 1–2 months of SME time; nobody builds it in a week without ruining label quality. The right move for next Tuesday is the regression tier at ~200 rows, sliced into 4–5 segments, with the policy owner spending 1–2 days validating. That catches 3–5 point regressions and any slice collapse, which is enough to ship safely. The 1000-row build starts next sprint and serves the next major model swap, not this one. The wrong move is to rush 1000 rows with thin labels — you ship a tray that looks defensible and produces unreliable numbers for years. Common wrong answer to avoid: "Yes, we'll get 1000 rows done by Monday."

Q6. Cumulative — chapter 01 said complaints are not a measurement signal. So why include user feedback as a source for the golden set?

A. Because complaints fail as a measurement signal (sparse, biased, late) but succeed as a case-mining signal. A user who complained loudly enough to escalate has handed the team a real failure that is worth promoting to a labelled test case. The flow is one-way: complaints are mined for cases, the cases are validated by an SME, the validated cases enter the golden set, and the golden set is what produces measurement. Treating complaints as cases (yes) is different from treating them as numbers (no). The four-source mix uses each signal for what it is good at, not for what it is bad at. Common wrong answer to avoid: "Chapter 01 said never use complaints."

Q7. The team wants to add 30 new edge cases to the golden set this week. The eng manager says "lock it, comparisons will break." Who's right?

A. Both are partially right, which is why versioning exists. Locking the set entirely means it goes stale and stops representing the product; adding rows freely means historical comparisons are meaningless. The correct move is to add the 30 rows as v1.1, publish the diff (which rows were added, why, which slices they cover, who validated), and run both versions in parallel for one cycle so the team can re-anchor scores. After re-anchoring, v1.1 is the active set and v1.0 is archived. The cost is one cycle of running both; the benefit is a tray that stays aligned with reality without losing comparability. Common wrong answer to avoid: "Lock the set forever to preserve comparability."
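
The published diff can be generated rather than hand-written. Below is a minimal sketch that compares two versions of a tray by `case_id` and reports added, retired, and relabelled rows; the dict-per-row shape and the example ids are hypothetical, and a real diff would also carry the reason, slice, and validator for each change.

```python
def diff_versions(old_rows: list[dict], new_rows: list[dict]) -> dict[str, list[str]]:
    """Case ids added, retired, or relabelled between two versions of a tray."""
    old = {r["case_id"]: r for r in old_rows}
    new = {r["case_id"]: r for r in new_rows}
    shared = old.keys() & new.keys()
    return {
        "added": sorted(new.keys() - old.keys()),
        "retired": sorted(old.keys() - new.keys()),
        "relabelled": sorted(
            cid for cid in shared if old[cid]["label"] != new[cid]["label"]
        ),
    }


# Hypothetical v1.0 and v1.1: one row retired, one relabelled, one added.
v1_0 = [
    {"case_id": "g_001", "label": "APPROVE"},
    {"case_id": "g_002", "label": "REFUSE"},
    {"case_id": "g_003", "label": "ESCALATE_TO_HUMAN"},
]
v1_1 = [
    {"case_id": "g_001", "label": "APPROVE"},
    {"case_id": "g_002", "label": "REQUEST_MORE_INFO"},
    {"case_id": "g_004", "label": "ESCALATE_TO_HUMAN"},
]
print(diff_versions(v1_0, v1_1))
```

Scores on the unchanged rows are the ones that stay directly comparable across versions, which is why the parallel-run cycle re-anchors on them.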

Q8. Define "ownership" for a golden set in one sentence, and name the two roles.

A. "Two named people who have the authority to approve a row's label and the composition of the set, respectively." The label owner is the SME — for the refund bot, the policy lead — who can defend why this row's correct behaviour is what it is. The composition owner is usually the ML or eval engineer who decides which slices need more coverage and when the set needs to be refreshed. Conflating the roles fails because ML engineers do not have policy authority, and SMEs do not have eval-statistics intuition. Splitting them produces two independent quality gates. Common wrong answer to avoid: "The ML team owns the dataset."

Apply now (10 min)

Step 1 — model the exercise. Pick five rows the refund-chatbot team should add to v1.1 next sprint. For each row, fill the six-tuple:

| case_id | provenance | label | slice | freshness | owner |
| --- | --- | --- | --- | --- | --- |
| v1_1_001 | prod log 2026-05-03 | OFFER_WARRANTY_HANDOFF | US/single/medium | new | policy lead |
| v1_1_002 | LLM-generated, EU GDPR variant | APPROVE_WITH_GDPR_NOTICE | EU/single/high | new | policy lead |
| v1_1_003 | user escalation 2026-04-29 | ESCALATE_TO_HUMAN | US/multi/high | new | policy lead |
| v1_1_004 | SME draft, jurisdictional edge | REFUSE_WITH_REASON_R407 | other/single/high | new | policy lead |
| v1_1_005 | prod log 2026-05-10 | REQUEST_MORE_INFO | US/multi/low | new | policy lead |

Notice the source mix — prod, LLM, user escalation, SME, prod again. Notice the slice spread — three jurisdictions, two turn counts, three risk levels. Notice that every row has a named owner. This is what one week of v1.1 work looks like.

Step 2 — your turn. Pick one AI feature in your own product (or one from a previous chapter's running example). Write the first 10 rows of its golden set in the same six-column shape. Decide the source mix in advance (e.g., 5 prod, 3 LLM, 2 SME) and stick to it. Name the label owner and the composition owner before you write a single row.

Step 3 — reproduce from memory. Without scrolling up, redraw the candidate-pool → validation-gate → golden-tray ASCII diagram from section 2, and write the six-tuple invariant from the rule heading in one sentence. Then connect it backward to the 38-point demo-to-prod gap from chapter 01 in one sentence. If you can do all three cold, you carry the chapter.

What you should remember

This chapter explained why every eval number — the 62% pass rate from chapter 01, the slice-tier estimates from chapter 02, every score chapter 04 onward will quote — collapses to fiction the moment the labelled tray behind it cannot be defended row by row. The refund-chatbot team built a defensible v1.0 in one week from 47 mined prod queries, 50 LLM-generated edge cases, SME validation that caught 8 mislabels, three slices, and two named owners. The set is small (89 rows), but every row passes the six-tuple test, and every score it produces has a footnote that says here is the population this claim covers and here is who signed for it.

You learned the four-source mix as the only honest sourcing strategy — production logs anchor style, LLM generation fills coverage, SMEs fix labels, user feedback flags real failures — and the sizing tiers (50 sanity, 200 regression, 1000+ confident-swap) that match tray size to the decision the team actually faces. You learned the three flavours of leakage that silently destroy a tray's meaning, and the difference between a fresh, healthy tray and a frozen prestige set whose score has stopped tracking reality.

Carry this diagnostic forward: when somebody quotes an eval number, ask one question — "show me five rows from the set, with provenance, label, slice, freshness, owner, and version." If they cannot, the number is a story. The inspection is only as honest as the tray it runs on. The spot check is only as representative as the slice mix the tray enforces. The rubric is only as trustworthy as the SME who signed the labels. The labelled tray is the load-bearing infrastructure for everything else in this module.

Remember:

  • A row is golden only when six things travel with it: provenance, label, slice, freshness, owner, version. Anything less is a candidate, not an asset.
  • Use four sources together (production, LLM, SME, user feedback). Each one lies in a different direction; the mix cross-validates.
  • Size to the decision: ~50 for sanity, ~200 for regression, 1000+ for confident swaps. Skipping tiers wastes SME hours.
  • A frozen tray rots silently. The score does not drop; it just stops being meaningful. Budget staleness as an operational cost.
  • A 20+ point gap between golden-set pass rate and live-sample pass rate is the leakage signature. The held-out set is no longer held out.

Bridge. Solving the labelled tray exposes the next pressure: SME-validated rows are expensive, slow to produce, and bottlenecked on humans, which means the four-source mix tilts toward LLM-generated cases whenever coverage has to scale faster than SME hours allow. The next chapter is how to generate those cases without inheriting the synthetic-only pathology — what good synthetic looks like, what bad synthetic does to your eval, and how to keep generated rows from drifting in style or memorising the prompt.

04-synthetic-generation.md