04. Coverage and stratification¶
A set with labels exists. The question is whether it covers what matters. An average score over the wrong distribution is the chapter-1 trap. Coverage and stratification are the disciplines that ensure the set represents the failure modes and segments the team cares about.
A data engineer at a Bengaluru consumer-tech company reviews her team's eval set after a month of use. The set has 200 cases; the average score is 0.88; the team feels good. She slices the score by customer segment and finds that the average is held up by 150 cases from the largest segment (where the system works well) and pulled down by 50 cases from three smaller segments: premium-tier customers, regulated-industry customers, and voice-input customers, with per-segment scores of 0.71, 0.65, and 0.59. The 0.88 average is hiding three segment-level problems. The team's discipline shifts: instead of one average, the eval reports per-segment scores, and the regression gate blocks a release that regresses any segment, not just the average.
This is the coverage problem. The set must represent strata — failure modes, segments, edge cases — not just a sampling that produces a single defensible number.
Why averages mislead¶
A single average score combines many populations. A change can improve performance overall while making it worse for an important sub-population. A flat overall score can mask a sub-population whose performance is collapsing.
The discipline:
- Report per-stratum scores, not just the overall average.
- Set regression gates per stratum, not just on the overall.
- Communicate scores by stratum to stakeholders, especially when segments matter to revenue, compliance, or reputation.
The set's design supports this: cases are tagged with strata; the runner aggregates per stratum; dashboards show the breakdown.
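The arithmetic behind the opening case can be sketched in a few lines. This is an illustration, not the team's actual data: the text gives only the combined size of the three smaller segments (50) and their scores, so the 20/15/15 split, the largest segment's score, and the "after" numbers are all assumptions.

```python
def overall_score(segments):
    """Case-weighted average over {name: (n_cases, score)}."""
    total_cases = sum(n for n, _ in segments.values())
    return sum(n * score for n, score in segments.values()) / total_cases

# Segment sizes for the three smaller segments are assumed; the text gives
# only their combined total (50) and their per-segment scores.
before = {
    "largest": (150, 0.955),   # assumed; roughly what a 0.88 overall implies
    "premium": (20, 0.71),
    "regulated": (15, 0.65),
    "voice": (15, 0.59),
}

# Hypothetical change: the largest segment improves, voice input gets worse.
after = dict(before, largest=(150, 0.975), voice=(15, 0.45))

print(f"overall before: {overall_score(before):.3f}")
print(f"overall after:  {overall_score(after):.3f}")
for name in before:
    delta = after[name][1] - before[name][1]
    print(f"{name:10s} {before[name][1]:.2f} -> {after[name][1]:.2f} ({delta:+.2f})")
```

Under these assumed numbers the overall score rises slightly while the voice segment falls by 0.14, which is the pattern a single average cannot show.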
The strata to design for¶
Four common dimensions of stratification.
Failure modes. Categories of failure the team has identified (chapter 02 of 14_legacy_ai_modernization covered the audit; the failure modes from there are strata here). Each failure mode has cases in the set; the score per failure mode tells you whether the system handles each.
User or customer segments. Different segments may have different needs or different input distributions. Premium vs free; enterprise vs SMB; English-language vs multi-language; regulated industry vs general. Per-segment scores surface segment-level issues.
Input shape. Long vs short inputs, structured vs free-text, with-tools vs without-tools, with-context vs without-context. The system may handle one shape well and another poorly.
Task subtype. Within a feature, multiple sub-tasks. A support agent might handle order queries, billing queries, technical queries; each is a stratum.
Most platforms operate with 2–4 dimensions, producing 10–30 strata. The cases distribute across the strata (chapter 02's sourcing ensures the sample covers the strata).
Stratum sizes¶
Each stratum needs enough cases to produce a stable score. Rules of thumb:
- Under 5 cases per stratum. Single-case flips dominate; the per-stratum score is too noisy.
- 5–15 cases. Useful for known critical strata where the cases are expensive to produce; the per-stratum score is somewhat noisy but trends are visible.
- 15–50 cases. The sweet spot for most strata; scores are stable enough that small changes are meaningful.
- Over 50 cases. Over-investment for most strata; the marginal case adds little.
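A regression set of 100–300 cases (chapter 01's range) is spread across 10–20 strata, with the largest strata getting more cases and the smaller strata getting at least 5–10. Because a case can sit in several strata at once (see the tagging below), each case counts toward more than one stratum's total.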
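A rough way to see why these size bands matter, assuming each case is scored pass/fail and cases are independent (both simplifications that the chapter does not state): one flipped case moves a stratum's score by 1/n, and the binomial standard error shrinks only with the square root of n.

```python
import math

def flip_impact(n_cases):
    """Change in a pass-rate score when a single case flips."""
    return 1 / n_cases

def standard_error(pass_rate, n_cases):
    """Binomial standard error of a pass rate over n independent cases."""
    return math.sqrt(pass_rate * (1 - pass_rate) / n_cases)

# Stratum sizes chosen to sit in each of the bands listed above.
for n in (3, 5, 15, 50):
    print(f"n={n:3d}  one flip moves score by {flip_impact(n):.3f}  "
          f"std error at 0.80 pass rate: {standard_error(0.80, n):.3f}")
```

With graded (non-binary) rubric scores the numbers differ, but the shape of the trade-off is the same: very small strata move a lot on a single case.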
Tagging cases¶
Each case carries stratum tags, and a case can belong to several strata at once. In this example the case has one tag per dimension:
- id: case_042
  input: { customer_query: "voice-input transcript of a billing question..." }
  expected: { ... }
  strata:
    failure_mode: hallucinated_account_number
    segment: premium
    input_shape: voice
    subtask: billing
The runner reads the tags; per-stratum scores are produced by filtering on each tag.
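A minimal sketch of that aggregation, assuming the runner has already attached a numeric `score` to each case. The function name `per_stratum_scores`, the result layout, and the cases other than case_042 are hypothetical.

```python
from collections import defaultdict

def per_stratum_scores(results):
    """Group scored cases by (dimension, value) and average each group.

    `results` is a list of dicts with a `strata` mapping and a numeric `score`.
    Returns {(dimension, value): (mean_score, n_cases)}.
    """
    buckets = defaultdict(list)
    for case in results:
        # A case contributes to every stratum it is tagged with.
        for dimension, value in case["strata"].items():
            buckets[(dimension, value)].append(case["score"])
    return {key: (sum(scores) / len(scores), len(scores))
            for key, scores in buckets.items()}

# Hypothetical scored results; only case_042's tags come from the example above.
results = [
    {"id": "case_042", "score": 0.0,
     "strata": {"failure_mode": "hallucinated_account_number", "segment": "premium",
                "input_shape": "voice", "subtask": "billing"}},
    {"id": "case_043", "score": 1.0,
     "strata": {"segment": "premium", "input_shape": "text", "subtask": "billing"}},
    {"id": "case_044", "score": 1.0,
     "strata": {"segment": "free", "input_shape": "text", "subtask": "order"}},
]

for (dimension, value), (mean, n) in sorted(per_stratum_scores(results).items()):
    print(f"{dimension}={value}: {mean:.2f} (n={n})")
```

Returning the case count next to each mean lets a dashboard show how much weight a stratum's score can bear.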
Building the stratification¶
A reasonable workflow.
1. List the strata. The failure modes from the audit, the segments the business cares about, the input shapes the system supports, the subtasks the feature covers.
2. Inventory existing cases against strata. For each existing case, tag it with the strata it belongs to. Surface strata with zero or near-zero cases.
3. Source additional cases for under-covered strata. Chapter 02's mix (production, authored, synthetic) — sample more from production in those strata, author specifically for them, synthesise to fill if needed.
4. Validate stratum sizes. Each critical stratum has at least 10–15 cases; non-critical strata at least 5.
5. Run the eval. Report per-stratum scores. Investigate any stratum below threshold.
The build is one-time, then maintained at refresh (chapter 05) — new strata may appear (a new segment, a new feature), and the set is rebalanced.
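Steps 2 and 4 of the workflow can be sketched as a coverage check. This is one possible shape, not a prescribed tool: `coverage_gaps`, its parameters, and the sample inventory are hypothetical, and the default floors (10 for critical strata, 5 for the rest) take the low end of the ranges given in step 4.

```python
from collections import Counter

def coverage_gaps(cases, expected_strata, critical, min_critical=10, min_other=5):
    """Return {(dimension, value): (have, need)} for under-covered strata.

    `expected_strata` lists every stratum from step 1, so strata with zero
    cases are reported too. `critical` is the subset held to the higher floor.
    """
    counts = Counter(
        (dimension, value)
        for case in cases
        for dimension, value in case["strata"].items()
    )
    gaps = {}
    for stratum in expected_strata:
        need = min_critical if stratum in critical else min_other
        if counts[stratum] < need:
            gaps[stratum] = (counts[stratum], need)
    return gaps

# Hypothetical inventory: 12 premium cases, 4 regulated, none for voice input.
cases = (
    [{"strata": {"segment": "premium"}} for _ in range(12)]
    + [{"strata": {"segment": "regulated"}} for _ in range(4)]
)
expected = [("segment", "premium"), ("segment", "regulated"), ("input_shape", "voice")]
critical = {("segment", "regulated")}

for stratum, (have, need) in coverage_gaps(cases, expected, critical).items():
    print(f"{stratum}: have {have}, need at least {need}")
```

Checking against the full list of expected strata, rather than only the tags already present, is what surfaces strata with zero cases.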
The two-set design and stratification¶
The regression set (chapter 01) prioritises stratification — every important stratum has cases. The distribution set draws from production traffic, naturally stratified by the production distribution.
For the regression set, you over-sample strata that are important regardless of frequency. A regulated-industry segment with 1% of traffic but 20% of legal risk gets disproportionate cases in the regression set. The set is designed, not random.
For the distribution set, the sampling reflects production naturally; the per-stratum scores reflect what happens to most users.
Both views are useful. The regression set says "we are correct on the cases we care about"; the distribution set says "we are correct on the cases users see."
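The difference between the two sets can be illustrated with a small allocation sketch. A per-stratum floor is only one simple way to express "designed, not random"; the function names, the 200-case budget, the floor of 15, and the traffic shares other than the 1% regulated figure are assumptions.

```python
def proportional_allocation(traffic_pct, budget):
    """Cases per stratum if the set simply mirrored production traffic."""
    return {name: budget * pct // 100 for name, pct in traffic_pct.items()}

def floored_allocation(traffic_pct, budget, floor):
    """Give every stratum `floor` cases, then split the rest by traffic share.

    Integer division can leave a few cases of the budget unassigned.
    """
    remaining = budget - floor * len(traffic_pct)
    if remaining < 0:
        raise ValueError("budget is too small for the requested floor")
    return {name: floor + remaining * pct // 100
            for name, pct in traffic_pct.items()}

# Hypothetical traffic shares in percent; only the 1% regulated figure
# comes from the text.
traffic_pct = {"general": 80, "premium": 19, "regulated": 1}

print("distribution-style:", proportional_allocation(traffic_pct, budget=200))
print("regression-style:  ", floored_allocation(traffic_pct, budget=200, floor=15))
```

With these numbers, mirroring traffic leaves the regulated segment with 2 cases, while the floored allocation gives it 16, enough for a usable per-stratum score.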
When a stratum changes¶
Strata are not fixed. Three causes for change.
A new failure mode appears. Production complaints reveal a category the team did not previously track. Add the stratum to the set; source cases for it.
A failure mode is fixed and no longer common. Cases for it remain (as regression-prevention), but new cases are not added.
A new segment matters. A business decision adds a new customer segment; the set adds the segment as a stratum and sources cases.
The strata evolve with the platform; the set's owner (chapter 10) manages the evolution.
Common mistakes¶
One average, no strata. The chapter-opening case: hidden sub-population problems.
Strata too small. 1–3 cases per stratum; per-stratum scores are noise.
Strata defined by engineers only. The business or product team's important segments are missed.
Static strata. New strata that should exist are not added; the set ages out of relevance.
Per-stratum gates that block too much. Every stratum's score has variance, so a gate that fires on any drop in any stratum produces false-positive blocks. Stratum gates typically trigger on a count of regressed cases (for example, three or more in a stratum) rather than on any decrease in the score.
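A count-based gate of this kind might look like the sketch below. It assumes pass/fail results per case for a baseline and a candidate run; the function names, the `zero_tolerance` parameter (for strata where any regression blocks, such as regulated data), and the sample data are hypothetical.

```python
def regressions_by_stratum(baseline, candidate, strata_of):
    """Count cases that passed on baseline but fail on candidate, per stratum.

    `baseline` and `candidate` map case id -> bool (passed).
    `strata_of` maps case id -> {dimension: value}.
    A case missing from `candidate` is treated as failing.
    """
    counts = {}
    for case_id, passed_before in baseline.items():
        if passed_before and not candidate.get(case_id, False):
            for stratum in strata_of[case_id].items():
                counts[stratum] = counts.get(stratum, 0) + 1
    return counts

def gate_blocks(regressions, block_at=3, zero_tolerance=frozenset()):
    """Return the strata that block the release.

    A stratum blocks when it has `block_at` or more regressed cases, or when
    it is in `zero_tolerance` and has any regression at all.
    """
    return sorted(
        stratum for stratum, count in regressions.items()
        if count >= block_at or stratum in zero_tolerance
    )

# Hypothetical runs: two cases regress, one of them in the regulated segment.
baseline = {"c1": True, "c2": True, "c3": True, "c4": False}
candidate = {"c1": False, "c2": False, "c3": True, "c4": True}
strata_of = {
    "c1": {"segment": "premium"},
    "c2": {"segment": "regulated"},
    "c3": {"segment": "premium"},
    "c4": {"segment": "free"},
}

regressions = regressions_by_stratum(baseline, candidate, strata_of)
blocked = gate_blocks(regressions, zero_tolerance={("segment", "regulated")})
print("regressions:", regressions)
print("blocking strata:", blocked)
```

Counting regressed cases instead of comparing scores also avoids a net delta hiding regressions that were offset by unrelated improvements in the same stratum.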
Interview Q&A¶
Q1. The eval set scores 0.88 overall. Why is that not a complete answer?
Because the 0.88 averages across whatever distribution the set has. A change that improves overall performance while making it worse for a small but important sub-population would still show 0.88 or better. Per-stratum scores reveal sub-population issues that the average hides. The discipline is to report per stratum and gate per stratum, not just on the average.
Wrong-answer notes: trusting the average is the chapter-opening trap.
Q2. Walk through how you would stratify a customer-support agent's eval set.
Four dimensions. Failure modes from the audit: wrong-account, missing-context, refusal-of-valid-request, etc. Customer segments: premium, free, enterprise, regulated. Input shapes: short queries, multi-turn, voice-transcribed. Subtasks: order, billing, technical, general. Tag each case with all applicable strata. Build the matrix and ensure each critical stratum has 10+ cases. Run the eval, report per stratum, and investigate any stratum below threshold.
Wrong-answer notes: stratifying by one dimension misses the multi-dimensional reality.
Q3. A stratum has only 3 cases. The scores swing wildly between runs. What do you do?
Source more cases for that stratum, aiming for at least 10–15. Until the stratum has enough cases, the per-stratum score is noise. If the stratum is important and cases are hard to source (rare in production), supplement with expert authoring or synthetic generation per chapter 02. Until coverage is adequate, report the stratum as "low confidence" rather than as a number that pretends to be reliable.
Wrong-answer notes: "use the noisy number anyway" produces wrong decisions.
Q4. The regression gate fails because one stratum dropped from 0.80 to 0.78. How do you decide whether to block the ship?
Look at the cases in the stratum that flipped. A small net drop can hide several regressions offset by improvements elsewhere in the stratum, so count the regressed cases rather than reading the delta alone. If 3+ cases regressed (or any regulated-data case regressed), block. If 1 case flipped and it is a borderline rubric judgement, noise is a plausible explanation and the team may proceed while watching that stratum. The gate is a signal, not an absolute; the team's judgement on the specific cases informs the action. The discipline is to look at the cases, not just the score, for every gate failure.
Wrong-answer notes: "score threshold passed or failed" without case-level review is mechanical.
What to do differently after reading this¶
- Stratify the set along 2–4 dimensions that matter for the platform.
- Tag every case with its strata. The runner reports per stratum.
- Ensure each critical stratum has at least 10–15 cases.
- Set regression gates per stratum, not just on the overall.
- Update strata as new failure modes and segments emerge.
Bridge. Coverage on day one is one thing. Coverage at month six requires refresh — production drifts, new failure modes appear, old ones are fixed. The next chapter is the refresh discipline that keeps the set current. → 05-refresh-and-drift.md