02. Sourcing cases

The set's role is clear. The next question is where its cases come from. Three sources — production sample, expert authoring, synthetic generation — each have trade-offs. Balancing them is the sourcing discipline.

A data engineer at a Bengaluru fintech is asked to build the first eval set for the credit-decisioning agent. Her first instinct is to ask the team for "representative cases." She receives 20 hand-authored examples that look reasonable but feel rehearsed. She runs the system against them; everything passes. She is suspicious. She pulls a one-week sample of real production calls instead. Of 200 samples, 14 produce outputs the team agrees are wrong. The hand-authored cases were the team's imagination of the system's challenges; the production sample is the system's actual challenges. The set she builds combines all three sources: production for representativeness, expert authoring for known edge cases, and synthetic for cases the production sample is too small to cover.

This chapter covers the sourcing discipline. Each source has a role; the balance varies by platform.

The three sources

Source	What it provides	Cost
Production sample	Real cases the system actually sees	Cheap to collect; expensive to label; privacy concerns
Expert authoring	Curated cases targeting known patterns	Slow; high quality; bounded by the expert's imagination
Synthetic generation	Cases at scale, covering combinations the natural data misses	Fast; lower fidelity; requires careful design

A useful starting mix for a regression set: 60% production sample, 25% expert authoring, 15% synthetic — but this varies widely by domain.

Production sample

The default and the most important source. Cases drawn from real production traffic.

The procedure:

Sample. Random sample over a representative window (a week is typical for steady-state platforms). 200–500 cases is a useful starting pool.
Stratify. Within the pool, group by feature, by tenant segment, by recognised failure shape. The eventual set draws from each stratum.
De-identify. Apply PII redaction per 03_ai_security_safety/03_data_access_governance chapter 05. Replace direct identifiers with synthetic substitutes; leave structure intact.
Label. Domain experts label expected behaviour per chapter 03.
Add to set. Cases that pass labelling enter the set with provenance metadata (source: production-sample, sampled_at, original_audit_id where retained).

The biggest sourcing decision is the sampling window. Too short (one day) misses cyclical patterns; too long (one quarter) mixes in traffic from before any drift, so the pool stops reflecting current inputs. One to two weeks is a useful default.
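The sample and stratify steps of the procedure, plus the provenance fields added at the end, can be sketched roughly as follows. This is a minimal illustration, not a prescribed implementation: the record fields `called_at`, `stratum` and `audit_id` are hypothetical names for whatever the audit log actually stores, and the de-identify and label steps are omitted because they depend on other chapters.

```python
import random
from collections import defaultdict
from datetime import date, timedelta


def sample_pool(records, window_end, window_days=7, pool_size=200, seed=0):
    """Randomly sample up to pool_size records from the window ending at window_end."""
    window_start = window_end - timedelta(days=window_days)
    in_window = [r for r in records if window_start < r["called_at"] <= window_end]
    rng = random.Random(seed)  # fixed seed so the draw can be reproduced
    return rng.sample(in_window, min(pool_size, len(in_window)))


def draw_per_stratum(pool, per_stratum, seed=0):
    """Group the pool by stratum and draw up to per_stratum records from each group."""
    groups = defaultdict(list)
    for record in pool:
        groups[record["stratum"]].append(record)
    rng = random.Random(seed)
    drawn = []
    for stratum in sorted(groups):
        members = groups[stratum]
        drawn.extend(rng.sample(members, min(per_stratum, len(members))))
    return drawn


def with_provenance(record, sampled_at):
    """Attach the provenance fields this chapter names for production cases."""
    return {
        **record,
        "source": "production-sample",
        "sampled_at": sampled_at.isoformat(),
        "original_audit_id": record["audit_id"],  # hypothetical field name
    }
```

Drawing a fixed number per stratum, rather than proportionally, is a design choice made here so that rare failure shapes are not crowded out by common ones; a proportional draw would be the alternative if representativeness matters more than coverage.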

Expert authoring

Cases written by domain experts to target known patterns the production sample may not cover.

When to use:

Edge cases the team has discussed but not seen often. A regulatory edge case, a rare but high-stakes scenario.
Known failure modes the team has fixed. Authoring cases that exercise the fix prevents regression.
New features without production traffic yet. Pre-launch evaluation.
Difficult-to-sample-from-production cases. Sensitive cases that are hard to capture due to privacy constraints.

The risk: expert-authored cases reflect the expert's model of the system's failures, not the system's actual failures. They may target imagined problems while real problems escape. The discipline is to complement production sampling, not replace it.

Authored cases should be marked as such (source: expert-authored, author, intent). When the system shifts, authored cases may become irrelevant; the set's owner reviews them at refresh.

Synthetic generation

Cases generated programmatically — by templates, by an LLM acting as a case generator, by data augmentation of real cases.

When to use:

Stratification gaps. The production sample over-weights some segments; synthetic fills the under-represented ones.
Combinatorial coverage. Combinations of conditions the natural data is too small to include.
Adversarial cases. Inputs designed to probe specific failure modes (often generated by an LLM with a "find a hard case" prompt).
Privacy-sensitive cases. Synthetic data for scenarios where real production data is too sensitive to use.

The risk: synthetic cases reflect the generator's biases. An LLM generator may produce cases that look like training data, missing the long tail of real human input. The discipline is to validate synthetic cases against a known-real reference — do they produce similar score distributions to real cases of the same intent?

A useful pattern: use synthetic generation to complete the set's stratification, not to dominate it. If the production sample has 5 cases in a stratum and the set needs 20, generate 15 synthetic cases to fill — and verify they score similarly to the 5 real ones.
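The fill-and-verify pattern can be sketched as below. It assumes each case yields a numeric score (for example 1 for pass, 0 for fail) and compares mean scores, which is only one crude way to read "score similarly"; the `max_mean_gap` threshold of 0.1 is an illustrative assumption, not a recommended value. With only a handful of real cases per stratum the check is coarse and is best treated as a smoke test.

```python
from statistics import mean


def synthetic_needed(real_count, target_count):
    """How many synthetic cases a stratum needs to reach its target size."""
    return max(0, target_count - real_count)


def scores_look_similar(real_scores, synthetic_scores, max_mean_gap=0.1):
    """Coarse check: do real and synthetic cases have close mean scores?"""
    if not real_scores or not synthetic_scores:
        raise ValueError("need at least one score on each side")
    return abs(mean(real_scores) - mean(synthetic_scores)) <= max_mean_gap


# The stratum from the text: 5 real cases, 20 wanted.
to_generate = synthetic_needed(real_count=5, target_count=20)
```

If the check fails, the more likely culprit is the generator producing unrealistic inputs rather than the system behaving differently, so the generated cases are the first thing to inspect.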

The mix per platform

The right mix depends on what is available.

Situation	Suggested mix
Mature platform with rich production traffic	70% production, 20% authored, 10% synthetic
New platform with little traffic	30% production, 40% authored, 30% synthetic
Highly regulated; production data sensitive	30% production, 30% authored, 40% synthetic
Multi-tenant with heterogeneous segments	60% production stratified across segments, 30% authored for edge cases, 10% synthetic

Adjust over time as the platform matures.
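To turn a suggested mix into concrete targets, the percentages have to become whole-number case counts. A minimal sketch, assuming the mix is given as integer percentages keyed by source name (the key names are illustrative) and using largest-remainder rounding so the counts always sum to the set size:

```python
def counts_from_mix(total, mix):
    """Convert percentage shares into whole-number case counts that sum to total."""
    if sum(mix.values()) != 100:
        raise ValueError("mix percentages must sum to 100")
    exact = {source: total * pct / 100 for source, pct in mix.items()}
    counts = {source: int(value) for source, value in exact.items()}
    leftover = total - sum(counts.values())
    # Hand the remaining cases to the sources with the largest fractional parts.
    by_remainder = sorted(mix, key=lambda s: exact[s] - counts[s], reverse=True)
    for source in by_remainder[:leftover]:
        counts[source] += 1
    return counts


# Mature-platform row from the table, applied to a 200-case set.
mature = counts_from_mix(200, {"production": 70, "authored": 20, "synthetic": 10})
```

For the 200-case example this gives 140 production, 40 authored and 20 synthetic cases.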

Provenance metadata

Every case carries metadata about its source.

- id: case_001
  input: {...}
  expected: {...}
  source: production-sample
  sampled_at: 2026-05-10
  original_audit_id: aud_...
  stratum: failure-mode/wrong-account
  added_to_set: 2026-05-20
  added_by: jane.doe@example.com

- id: case_002
  input: {...}
  expected: {...}
  source: expert-authored
  author: ravi.k@example.com
  intent: "Tests handling of multi-account customer with shared email"
  stratum: edge-case/multi-account

- id: case_003
  input: {...}
  expected: {...}
  source: synthetic
  generator: case-augmentor-v2
  base_case: case_001
  augmentation: "currency change INR -> USD"
  stratum: failure-mode/wrong-account

The provenance lets the owner reason about the set: how much is from each source, how stale is the production sample, which strata are over-represented by synthetic data. The metadata is what makes the set legible.
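The three questions above can each be answered with a few lines over the metadata. The sketch below assumes the cases have been loaded as a list of dicts with the field names shown in the example, and that `sampled_at` is an ISO date string (a YAML loader may hand back a date object instead, in which case the parsing step is unnecessary). The `max_share` cut-off is an illustrative assumption.

```python
from collections import Counter, defaultdict
from datetime import date


def source_shares(cases):
    """Fraction of the set that comes from each source."""
    counts = Counter(case["source"] for case in cases)
    return {source: n / len(cases) for source, n in counts.items()}


def oldest_production_sample_days(cases, today):
    """Age in days of the oldest production-sampled case, or None if there are none."""
    ages = [
        (today - date.fromisoformat(case["sampled_at"])).days
        for case in cases
        if case["source"] == "production-sample"
    ]
    return max(ages) if ages else None


def synthetic_heavy_strata(cases, max_share=0.5):
    """Strata where synthetic cases exceed max_share of the stratum."""
    per_stratum = defaultdict(Counter)
    for case in cases:
        per_stratum[case["stratum"]][case["source"]] += 1
    heavy = [
        stratum
        for stratum, counts in per_stratum.items()
        if counts["synthetic"] / sum(counts.values()) > max_share
    ]
    return sorted(heavy)
```

Run at each refresh, these give the owner a quick read on whether the set has drifted from its intended mix.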

What sourcing does not solve

Bad labelling. Sourcing produces inputs; chapter 03 produces labels. A perfectly sourced set with wrong labels is worse than no set.
Stale data. The freshest sourcing on day one ages by day 90. Chapter 05 handles refresh.
Coverage drift. New failure modes emerge in production; the set may not include them until the next refresh sweep.

Common mistakes

Production-only with no authoring. The set captures common cases; edge cases the team cares about are absent unless authored.

Authored-only with no production. The set captures the team's imagination; real cases are absent.

Synthetic-dominant without validation. Cases that look like training data; the score does not reflect production reality.

No provenance. The owner cannot tell which cases are real, authored, or synthetic; refresh decisions are blind.

Over-sampling stale production windows. Cases from six months ago that no longer represent the current input distribution.

Interview Q&A

Q1. The team has 100 hand-authored eval cases that all pass. The PM wants to ship. What do you say? "The hand-authored cases reflect what the team imagined the system would handle. Let's verify against production: pull a 200-case sample of real recent traffic; label expected outputs; check the system. If it passes those too, we have a stronger signal. The hand-authored set alone is a partial answer." This usually finds real-world cases the authoring missed. Wrong-answer notes: trusting hand-authored cases alone produces the chapter-opening surprise.

Q2. The production sample contains heavy PII. How do you build a set from it? Apply PII redaction per 03_ai_security_safety/03_data_access_governance chapter 05: replace direct identifiers with synthetic substitutes and keep structure intact. A case becomes "customer Ravi Kumar" → "customer [SYNTHETIC_NAME]" while preserving the shape of the query. The labelling refers to the structure, not the specific identifier. For cases where the PII is load-bearing (e.g., Aadhaar validation logic), use entirely synthetic cases that exercise the logic without holding real PII. Wrong-answer notes: "we'll just use the production sample" without redaction puts the set itself at risk.
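The substitution in Q2 can be illustrated with a small sketch. It assumes the direct identifiers for a case are already known from structured fields, so it does no PII detection of its own, and it is not the procedure from the referenced governance chapter. The function name and the `SYNTHETIC_ACCOUNT` label are hypothetical.

```python
def substitute_identifiers(text, identifiers):
    """Replace each known identifier value in text with a bracketed placeholder.

    identifiers maps a real value to the placeholder label that replaces it.
    """
    # Longest values first, so a short identifier never clobbers part of a longer one.
    for value in sorted(identifiers, key=len, reverse=True):
        text = text.replace(value, f"[{identifiers[value]}]")
    return text


redacted = substitute_identifiers(
    "customer Ravi Kumar asked about account 12345678",
    {"Ravi Kumar": "SYNTHETIC_NAME", "12345678": "SYNTHETIC_ACCOUNT"},
)
```

The sentence keeps its shape, which is what lets the label refer to structure rather than to the specific person. Plain string replacement would miss misspellings or partial mentions of a name, so a real intake step needs more than this.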

Q3. Walk through how you would use synthetic generation appropriately. Start after sampling production and identifying gaps in stratification: segments or failure modes under-represented in real data. Use a generator (LLM with a careful prompt, or a templated augmenter) to produce 5–20 cases per under-covered stratum. Validate against a small set of known-real cases in the same stratum: do the synthetic cases score similarly when run against the system? If yes, the synthetic cases are reasonable; if no, the generator is producing unrealistic inputs. Keep synthetic as a minority of the set; the majority is real or authored. Wrong-answer notes: "generate everything synthetically" loses the production-distribution fidelity.

Q4. The set has been running for a year. The team is debating whether to do a full refresh or incremental update. What is your view? Incremental for the regression set: cases are added as new failure modes appear; cases are retired when the production distribution no longer warrants them. The set evolves continuously without a single big refresh that disrupts comparability across versions. For the distribution set (chapter 01's distinction), full periodic refreshes are appropriate because that set is meant to reflect current production. Full refreshes of the regression set break the score continuity that the regression set's role depends on. Wrong-answer notes: "full refresh every quarter" loses the regression set's stability property.

What to do differently after reading this

Sample production as the primary source; supplement with authoring and synthetic where the production sample has gaps.
Stratify before adding to the set; ensure failure modes and segments are represented.
Apply PII redaction at source intake; synthetic substitutes preserve structure.
Record provenance metadata per case; the owner needs it for refresh decisions.
Validate synthetic cases against real references; do not let synthetic dominate.

Bridge. Cases sourced are inputs; labels are the spec. The next chapter is the labelling discipline — who labels expected behaviour, by what process, with what calibration. → 03-labelling-discipline.md