05. Refresh and drift
Coverage on day one is one thing. Coverage at month six requires refresh. Production drifts; new failure modes appear; old ones are fixed; segments grow and shrink. The refresh discipline keeps the set current without destroying the regression-set property of comparability over time.
A platform engineer at a Mumbai healthcare-tech company finds her eval set six months out of date. The platform's user base has grown 4×; a new feature has launched; a new model version is in production. The eval set still reflects the world at launch. Refreshing is not a one-time activity; she designs a quarterly refresh process. Each quarter: pull a fresh production sample, identify new failure modes from complaints, source 20–40 new cases, label them, retire the cases that no longer represent current production. After two cycles the set tracks reality with a 90-day lag — short enough that drift is bounded, long enough that the team's labelling effort is sustainable.
This chapter is that discipline. Cadence, triggers, retire vs add, and the version discipline that lets refreshed sets remain comparable.
What refresh is for
Two pressures on the set.
The distribution drifts. Production input changes — new features, new segments, new behaviour. The set may not represent the current distribution.
The system changes. Models change; prompts change. A case that used to fail now passes and is no longer informative; a case that used to pass may now fail for different reasons.
Refresh addresses both. Cases are added when new patterns emerge; cases are retired when they no longer represent useful regression signal.
The cadence
A reasonable default: monthly review, quarterly refresh.
Monthly review — a 30-minute look at: production failures in the last month; eval-score changes per stratum; new features shipped; new failure modes from customer support. Output: a list of candidate cases to add and a list of cases to consider retiring.
Quarterly refresh — the actual work: source 30–60 new cases addressing the candidates; retire 10–30 cases that are stale; re-baseline the score; bump the set version.
Some platforms refresh more frequently (monthly) if production is changing fast; some less (twice yearly) if it is stable. The cadence is a knob.
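The cadence can be written down as a small piece of configuration. The following is a minimal sketch, not a prescribed implementation: the `Cadence` class, the `due_activities` helper, and the day counts (30 and 90 for the default, 182 for twice yearly, 14 for a faster review) are illustrative assumptions standing in for "monthly", "quarterly", and the faster or slower variants.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class Cadence:
    """Hypothetical cadence settings, expressed in days."""
    review_every_days: int = 30    # assumed stand-in for "monthly review"
    refresh_every_days: int = 90   # assumed stand-in for "quarterly refresh"


# Illustrative presets for the two ends of the knob.
FAST_MOVING = Cadence(review_every_days=14, refresh_every_days=30)
STABLE = Cadence(review_every_days=30, refresh_every_days=182)


def due_activities(today: date, last_review: date, last_refresh: date,
                   cadence: Cadence = Cadence()) -> list:
    """Return the scheduled activities that are due as of `today`."""
    due = []
    if today - last_review >= timedelta(days=cadence.review_every_days):
        due.append("review")
    if today - last_refresh >= timedelta(days=cadence.refresh_every_days):
        due.append("refresh")
    return due
```

Keeping the cadence in configuration rather than in a calendar reminder makes a change of cadence a reviewable decision.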
What triggers a non-cadenced refresh
Sometimes the cadence is interrupted by events.
- A new failure mode in production at material rate — customers are seeing the failure; the eval should include it now, not at the next quarterly.
- A major feature launch — the eval should include cases for the new feature before the launch is considered stable.
- A model migration — the eval should be confirmed current before the migration's promotion gate.
- An incident postmortem identifying a missing eval case — the case is added immediately.
- A regulatory change — new required behaviour; new cases to verify the system handles it.
The triggered refresh adds cases as needed; the next quarterly takes care of broader cleanup.
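The trigger list above can be encoded so that the off-cycle decision is explicit rather than ad hoc. This is a sketch under assumptions: the `Trigger` and `Event` names, and in particular the `MATERIAL_RATE` threshold of 0.5% of traffic, are hypothetical; what counts as a "material rate" is a judgment each team has to make for itself.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Trigger(Enum):
    NEW_FAILURE_MODE = "new_failure_mode"
    FEATURE_LAUNCH = "feature_launch"
    MODEL_MIGRATION = "model_migration"
    POSTMORTEM_GAP = "postmortem_gap"
    REGULATORY_CHANGE = "regulatory_change"


@dataclass
class Event:
    trigger: Trigger
    detail: str
    production_rate: Optional[float] = None  # share of traffic affected, if known


MATERIAL_RATE = 0.005  # hypothetical threshold for "material rate"


def needs_off_cycle_refresh(event: Event) -> bool:
    """A new failure mode qualifies only at material rate; other triggers always do."""
    if event.trigger is Trigger.NEW_FAILURE_MODE:
        return (event.production_rate is not None
                and event.production_rate >= MATERIAL_RATE)
    return True


def off_cycle_queue(events: list) -> list:
    """Return the details of events that should not wait for the next quarterly."""
    return [e.detail for e in events if needs_off_cycle_refresh(e)]
```

Events that do not qualify are not discarded; they remain candidates for the next monthly review.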
What to add at refresh
A reasonable refresh batch:
- 5–15 cases per new failure mode (depending on importance).
- 5–10 cases for each new segment that has become operationally meaningful.
- 5–15 cases per new feature with eval coverage.
- Targeted additions to under-covered strata identified in the monthly review.
The cases follow the sourcing discipline (chapter 02): production sample where possible, expert authoring for edge cases, synthetic for stratification gaps.
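The per-item ranges above make it easy to estimate the size of a batch before sourcing starts. A minimal sketch, assuming a hypothetical `batch_size_range` helper; the ranges are the ones listed above, and `targeted_additions` is a free parameter because the text gives no range for under-covered strata.

```python
# (low, high) cases per item, taken from the list above.
PER_ITEM_RANGE = {
    "failure_mode": (5, 15),
    "segment": (5, 10),
    "feature": (5, 15),
}


def batch_size_range(new_failure_modes: int, new_segments: int,
                     new_features: int, targeted_additions: int = 0) -> tuple:
    """Return the (low, high) number of cases a refresh batch would contain."""
    counts = {
        "failure_mode": new_failure_modes,
        "segment": new_segments,
        "feature": new_features,
    }
    low = sum(n * PER_ITEM_RANGE[kind][0] for kind, n in counts.items())
    high = sum(n * PER_ITEM_RANGE[kind][1] for kind, n in counts.items())
    return low + targeted_additions, high + targeted_additions
```

For example, two new failure modes, one new segment, and one new feature give `batch_size_range(2, 1, 1) == (20, 55)`, which helps size the labelling effort for the quarter.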
What to retire
Cases come out when they no longer add value.
Stale cases. The case represents a failure mode that has been fixed and is no longer common in production. The case was useful as regression-prevention for a while; now it is verifying behaviour the system reliably handles.
Duplicate cases. The set has accumulated near-duplicates over time; some can be merged or removed.
Mis-labelled cases. Cases whose labels have been found to be wrong (chapter 03) and that are no longer useful even with corrected labels.
Cases for retired features. A feature that has been deprecated or removed; its cases are no longer relevant.
The retirement is recorded in the set's changelog with the reason. Retired cases are not deleted from history (chapter 06's versioning), but they are not part of the active set.
What not to retire
Regression-prevention cases. Even if a case is now boring (the system handles it well), keeping it ensures no future change accidentally breaks it. Retire only when the case is no longer representative of any production behaviour, not just because it passes.
Diverse-segment cases. Cases from small segments are easy to retire ("they don't fail much"); keeping them maintains segment coverage for the rare regression. Chapter 04's stratification depends on these.
The default is keep unless retirement is justified; the active set is a curated subset of all history.
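The retire and keep rules can be sketched as a function that returns a reason or nothing, so that every retirement lands in the changelog with its justification. The `Case` fields and the changelog entry shape below are hypothetical; the point of the sketch is that "always passes" never appears as a retirement reason.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Case:
    case_id: str
    always_passes: bool = False
    represents_production: bool = True   # still matches some live behaviour
    duplicate_of: Optional[str] = None
    label_unfixable: bool = False        # not useful even with a corrected label
    feature_retired: bool = False


def retirement_reason(case: Case) -> Optional[str]:
    """Return a reason to retire the case, or None to keep it (the default)."""
    if case.feature_retired:
        return "feature retired"
    if case.label_unfixable:
        return "mis-labelled and not useful after correction"
    if case.duplicate_of is not None:
        return f"duplicate of {case.duplicate_of}"
    if not case.represents_production:
        return "stale: no longer represents production behaviour"
    return None  # always_passes alone is deliberately not a reason


def apply_retirements(cases: list, changelog: list) -> list:
    """Return the active cases; append one changelog entry per retired case."""
    active = []
    for case in cases:
        reason = retirement_reason(case)
        if reason is None:
            active.append(case)
        else:
            changelog.append(
                {"case_id": case.case_id, "action": "retire", "reason": reason}
            )
    return active
```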
Re-baselining
After a refresh, the score on the new set is not directly comparable to the score on the old set — different cases, different labels in some places. Re-baseline:
- Run the new set against the current production model and prompt; record the new baseline score per stratum.
- The next regression check is against this new baseline.
- The dashboard graph shows the version bump explicitly (a discontinuity, with a note about the refresh).
The discontinuity is honest; the alternative (silent drift in what the score means) is misleading.
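Re-baselining amounts to recomputing a per-stratum pass rate on the refreshed set and checking later runs against it. A minimal sketch, assuming results arrive as `(stratum, passed)` pairs and that a regression is a drop of more than a `tolerance` (an illustrative 0.02 here) below the baseline; both are assumptions, not something the chapter specifies.

```python
from collections import defaultdict


def baseline_per_stratum(results) -> dict:
    """Compute the pass rate per stratum from (stratum, passed) pairs."""
    totals = defaultdict(int)
    passes = defaultdict(int)
    for stratum, passed in results:
        totals[stratum] += 1
        passes[stratum] += int(passed)
    return {stratum: passes[stratum] / totals[stratum] for stratum in totals}


def regressions(baseline: dict, candidate: dict, tolerance: float = 0.02) -> list:
    """Return strata where the candidate falls more than `tolerance` below baseline.

    A stratum missing from the candidate is treated as scoring 0.0.
    """
    return sorted(
        stratum
        for stratum, base in baseline.items()
        if candidate.get(stratum, 0.0) < base - tolerance
    )
```

The baseline should be stored together with the set version it was computed on, so that a gate never compares a run against a baseline from a different set.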
How refresh interacts with versioning
Every refresh bumps the set version (chapter 06 covers this). The change log records what was added, what was retired, what was re-labelled. A score reported with the set version is comparable to other scores on the same version.
A platform doing four refreshes a year operates v1 in Q1, v2 in Q2, v3 in Q3, v4 in Q4. The score graph shows the version transitions; comparisons across versions require care.
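One way to enforce "comparable only on the same version" is to carry the set version with every score and refuse to subtract across versions. This is a sketch; the `ScoreRecord` shape and the `score_delta` helper are hypothetical names, and raising an error is only one possible policy.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScoreRecord:
    set_version: str   # e.g. "v2"
    model: str         # identifier of the system under test
    score: float


def score_delta(before: ScoreRecord, after: ScoreRecord) -> float:
    """Return after.score - before.score, but only within one set version."""
    if before.set_version != after.set_version:
        raise ValueError(
            f"scores on {before.set_version} and {after.set_version} "
            "are not directly comparable; re-run both on the same set version"
        )
    return after.score - before.score
```

A cross-version comparison then has to be made deliberately, for instance by re-running the older system on the newer set, rather than by reading two points off a graph.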
Drift detection on the set itself
A practical signal: the production-traffic eval (see chapter 11, on observability, in 01_model_gateway_provider_ops) scores production samples against the set's rubric or labels. If production-traffic eval scores drift below the regression-set scores, the set is no longer representative: production includes cases the set does not cover. That drift is the trigger for an off-cycle refresh.
Conversely, if production-traffic eval scores match regression-set scores closely, the set is representative and the cadence can continue.
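The comparison can be sketched as a gap check between the two scores. The chapter describes the signal at the level of overall scores; breaking it down per stratum and using a `tolerance` of 0.05 are assumptions made here for illustration, as is the `drift_gaps` name.

```python
def drift_gaps(regression_scores: dict, production_scores: dict,
               tolerance: float = 0.05) -> dict:
    """Return strata where production-traffic eval trails the regression set.

    Only strata present in both inputs are compared. A gap is reported when
    the regression-set score exceeds the production score by more than
    `tolerance`; reported gaps are rounded to four decimals.
    """
    gaps = {}
    for stratum, regression_score in regression_scores.items():
        production_score = production_scores.get(stratum)
        if production_score is None:
            continue
        gap = regression_score - production_score
        if gap > tolerance:
            gaps[stratum] = round(gap, 4)
    return gaps
```

An empty result means the two scores agree within tolerance and the regular cadence can continue; a non-empty result names the strata to sample first in an off-cycle refresh.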
Common mistakes
No refresh. The chapter-opening case. The set ages; the score loses meaning.
Refresh without retirement. The set grows indefinitely; cases age and stop being relevant; the maintenance work grows with them.
Big-bang refresh. All cases replaced at once; comparability is broken; the team cannot tell if the new score reflects new cases or new system behaviour.
No re-baseline. Cases changed; baseline did not; gates fire on the wrong reference.
Refresh decisions in the engineer's head. No record of what changed; future investigations cannot tell why the score moved.
Interview Q&A
Q1. The eval set has been stable for nine months. Score is 0.86. Production complaints are climbing. What is happening? The set is stale. The production distribution has likely drifted; the cases in the set still pass (the score holds) but the cases the system fails on are not in the set. The score is a real measurement of yesterday's distribution, not today's. The fix is a refresh: source from current production, identify new failure modes from the complaints, add cases, retire what no longer applies. The score after refresh may drop; the drop is information, not failure. Wrong-answer notes: "the score is fine, complaints must be wrong" inverts the priority.
Q2. Walk through a quarterly refresh. The monthly reviews have built a list of candidates: new failure modes, new features, under-covered strata. The quarterly refresh sources 30–60 cases from a production sample plus authoring and synthetic generation per chapter 02. Cases are labelled per chapter 03. Existing cases are reviewed: any that are stale (now-fixed failures), duplicate, or mis-labelled are retired per the discipline. The set version is bumped; the changelog records adds, retires, and re-labels. The new set is run against the current production model; new baseline scores per stratum are captured. Dashboards show the version transition. Wrong-answer notes: ad-hoc adds without retirement or re-baseline lose the discipline.
Q3. The team retires a case "because it always passes." Why is that the wrong reason? Because passing is the desired state; the case's role is regression-prevention. Retiring it means future changes might regress it without anyone noticing. The right reason to retire is "the case no longer represents production behaviour" — the failure mode is gone, the feature is removed, the segment is no longer served. "Always passes" is a feature of a working regression set, not a reason to remove cases. Wrong-answer notes: treating eval as a test suite and "trimming the green tests" is the misframe.
Q4. The production-traffic eval scores have drifted below the regression-set scores. What does this tell you? The regression set is no longer representative — production includes cases the set does not cover. New patterns are landing in production that the curated set has not absorbed. This is a signal to refresh off-cycle: sample the new patterns from production, identify the new failure modes, add cases. The regression set's score is correct for the cases it has; it is just incomplete relative to production. Wrong-answer notes: "the production eval is buggy" without investigation; the dual scores are designed to surface drift.
What to do differently after reading this
- Establish a refresh cadence; the default of "monthly review, quarterly refresh" is a useful starting point.
- Identify triggers for off-cycle refreshes; act on them.
- Retire stale cases; the active set is curated, not accumulated.
- Re-baseline after every refresh; the score is honest about the version it reflects.
- Monitor production-traffic eval against the regression set as a drift signal.
Bridge. Refresh changes the set. Versioning is what makes those changes auditable and comparisons possible across time. The next chapter is the set as a versioned artefact. → 06-versioning-the-set.md