12. Architect checklist
Twenty items. Source, label, cover, refresh, version, judge, cost, privacy, govern, respond. If you can answer all of them with an artefact, the eval-set operation is defensible. If you cannot, the gaps are the work.
This is the checklist a lead uses in design review, in eval setup, in quarterly governance, and at the first incident postmortem.
Build the set (1–6)
1. Role clarity. Is the set understood as a sample, not a test suite? Are regression and distribution roles distinguished if both apply? (Chapter 01.)
2. Sources balanced. Is the set sourced from production sample, expert authoring, and synthetic generation in proportions appropriate to the platform's maturity and data sensitivity? (Chapter 02.)
3. Labelling discipline. Are domain experts involved in labelling? Is there calibration across labellers? Is inter-labeller agreement measured? (Chapter 03.)
4. Coverage stratified. Is the set tagged across failure modes, segments, input shapes, and subtasks? Are per-stratum scores reported and gated? (Chapter 04.)
5. Stratum sizes adequate. Does each critical stratum have at least 10–15 cases? Is the total set size in the 100–300 range for regression, 1,000–5,000 for distribution? (Chapter 04.)
6. Mechanism fit. Is each case using the appropriate judging mechanism (exact-match, reference, rubric)? Is the mechanism per case recorded in metadata? (Chapter 07.)
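Items 4 and 5 lend themselves to an automated check. The sketch below is one way to do it, assuming each case is a dict with a hypothetical `stratum` field. The thresholds come from item 5; the function name, field name, and constants are illustrative and not part of any existing tooling.

```python
from collections import Counter

# Thresholds taken from item 5 of the checklist (lower bound of "10-15 cases").
MIN_CASES_PER_CRITICAL_STRATUM = 10
TOTAL_SIZE_RANGE = {"regression": (100, 300), "distribution": (1000, 5000)}


def check_stratum_sizes(cases, critical_strata, role="regression"):
    """Return a list of findings; an empty list means the sizes pass.

    `cases` is a list of dicts with a hypothetical "stratum" key.
    """
    findings = []
    counts = Counter(case["stratum"] for case in cases)

    # Every critical stratum needs enough cases to be scored on its own.
    for stratum in sorted(critical_strata):
        n = counts.get(stratum, 0)
        if n < MIN_CASES_PER_CRITICAL_STRATUM:
            findings.append(
                f"stratum '{stratum}' has {n} cases, "
                f"below {MIN_CASES_PER_CRITICAL_STRATUM}"
            )

    # The total size depends on whether the set plays the regression
    # or the distribution role (item 1).
    low, high = TOTAL_SIZE_RANGE[role]
    if not low <= len(cases) <= high:
        findings.append(
            f"total size {len(cases)} is outside {low}-{high} for a {role} set"
        )
    return findings


# Usage with made-up data: 120 refund cases and 4 escalation cases.
cases = [{"id": i, "stratum": "refund"} for i in range(120)]
cases += [{"id": 1000 + i, "stratum": "escalation"} for i in range(4)]
print(check_stratum_sizes(cases, {"refund", "escalation"}))
```

A check like this could run whenever the set changes, so that an undersized critical stratum surfaces in review rather than in a postmortem.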
Keep the set alive (7–11)
7. Refresh cadence. Is there a monthly review and a quarterly refresh? Are off-cycle refreshes triggered by new failure modes, features, or migrations? (Chapter 05.)
8. Retirement discipline. Are stale cases retired with reason in the changelog? Does retirement preserve regression-prevention coverage? (Chapter 05.)
9. Versioning. Is the set version bumped per semver convention? Is every eval run recorded with the set version? (Chapter 06.)
10. Re-baselining. After each refresh, is the new baseline captured per stratum? Are dashboards annotated with version transitions? (Chapter 06.)
11. Production-traffic eval drift signal. Does the platform run production-traffic eval against the rubric? Does drift between production-traffic and regression-set scores trigger refresh? (Chapter 05.)
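The drift signal in item 11, together with the version recording in item 9, can be sketched as a per-stratum comparison. This is a minimal illustration under stated assumptions: the `EvalRun` type, the `drift_report` function, and the 0.05 gap threshold are all hypothetical, and a real platform would pick its own threshold from observed score noise.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalRun:
    set_version: str  # semver string recorded with every run (item 9)
    scores: dict      # stratum name -> mean score in [0, 1]


def drift_report(regression: EvalRun, production_scores: dict,
                 threshold: float = 0.05) -> dict:
    """Compare regression-set scores with production-traffic scores.

    `threshold` is an assumed tolerance, not a value from the source.
    """
    gaps = {}
    for stratum, reg_score in regression.scores.items():
        prod_score = production_scores.get(stratum)
        if prod_score is None:
            continue  # no production traffic scored for this stratum
        gap = round(reg_score - prod_score, 4)
        if abs(gap) > threshold:
            gaps[stratum] = gap

    # Strata seen in production but absent from the set are coverage gaps.
    uncovered = sorted(set(production_scores) - set(regression.scores))

    return {
        "set_version": regression.set_version,
        "score_gaps": gaps,
        "uncovered_strata": uncovered,
        "refresh_recommended": bool(gaps or uncovered),
    }


# Usage with made-up numbers.
run = EvalRun("2.3.0", {"refund": 0.86, "escalation": 0.90})
print(drift_report(run, {"refund": 0.71, "escalation": 0.88, "new_feature": 0.50}))
```

Carrying `set_version` in the report keeps a drift finding tied to the set it was measured against, which matters once the set has been refreshed and re-baselined.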
Operate at scale (12–15)
12. Cost visibility. Is the eval cost tracked per run, per day, per team? Is there an explicit budget? (Chapter 08.)
13. Tiered runs. Is there a fast smoke set for pre-merge, regression for nightly, distribution for weekly? (Chapter 08.)
14. Caching. Are eval results cached on unchanged inputs? Is the cache hit rate monitored? (Chapter 08.)
15. Judge calibration. Is the judge calibrated against human labels periodically? Is judge drift detected via an eval-of-the-judge set? (Chapters 07 and 11.)
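For item 15, one way to quantify how well the judge tracks human labels is a chance-corrected agreement score. The checklist does not name a metric; Cohen's kappa is used here as one common choice, and the same function could serve the inter-labeller agreement question in item 3. The function name and the sample labels are illustrative.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters labelling the same cases, in order."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)

    # Observed agreement: fraction of cases where both raters agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Expected agreement if both raters labelled independently
    # according to their own label frequencies.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)

    if expected == 1.0:
        return 1.0  # both raters used the same single label throughout
    return (observed - expected) / (1 - expected)


# Usage with made-up labels on an eval-of-the-judge set of ten cases.
human = ["pass", "pass", "fail", "fail", "pass", "fail", "pass", "pass", "fail", "pass"]
judge = ["pass", "pass", "fail", "pass", "pass", "fail", "pass", "fail", "fail", "pass"]
print(round(cohens_kappa(human, judge), 3))
```

Tracking this number across calibration rounds gives a trend line for judge drift. What counts as an acceptable floor is a platform decision and is not specified here.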
Govern and respond (16–20)
16. Privacy discipline. Is real PII excluded from the set via synthetic substitution at intake? Does the set participate in RTBF? (Chapter 09.)
17. Set owner. Is there an accountable owner with authority and time? Is the owner empowered to mediate cross-team conflicts? (Chapter 10.)
18. Cross-team governance forum. Is there a quarterly sync with contributing teams? Are conflicts surfaced and resolved? (Chapter 10.)
19. Incident workflow. Is there a documented response for false positives, missed cases, mis-labels, and judge drift? Are incidents documented in the set's changelog? (Chapter 11.)
20. Case-level investigation. Does the team always look at the cases on a regression gate block, not just the score? (Chapter 11.)
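Item 20 asks the team to look at cases rather than the score. A minimal sketch of that step, assuming per-case pass/fail results are stored as a mapping from a hypothetical case id to a boolean, is a diff between the baseline run and the candidate run:

```python
def cases_behind_score_change(baseline, candidate):
    """Diff two runs at case level.

    `baseline` and `candidate` map case id -> bool (True means passed).
    Both the shape and the names are assumptions for illustration.
    """
    regressed = sorted(
        cid for cid, passed in baseline.items()
        if passed and candidate.get(cid) is False
    )
    fixed = sorted(
        cid for cid, passed in baseline.items()
        if not passed and candidate.get(cid) is True
    )
    # Cases present in only one run cannot be compared; this often
    # signals a set version mismatch between the two runs.
    not_comparable = sorted(set(baseline) ^ set(candidate))
    return {"regressed": regressed, "fixed": fixed, "not_comparable": not_comparable}


# Usage with made-up results.
baseline = {"c1": True, "c2": True, "c3": False, "c4": True}
candidate = {"c1": True, "c2": False, "c3": True, "c5": True}
print(cases_behind_score_change(baseline, candidate))
```

The `regressed` list is where the review starts: each case there is either a real regression, a mis-label, or a judge problem, and only reading the case tells them apart.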
How to use the checklist
In setup: walk the items; most are red on day one; that is normal. Schedule the path to green.
At three months: items 1–6 should be green or near-green; the set is real. Items 7–11 (refresh discipline) are starting.
At six months: items 7–11 are green; items 12–15 (scale) are being operated.
At twelve months: items 16–20 (govern and respond) are routine; the operation is sustainable.
Common postmortem-to-checklist mappings
- "Score was 0.86 but production was failing" → items 1, 4 (role clarity, coverage)
- "We labelled wrong" → items 3 (labelling discipline), 11 (eval drift signal)
- "Set was stale" → items 7 (cadence), 11 (drift signal)
- "Score dropped after refresh; no one knew why" → items 9 (versioning), 10 (re-baselining)
- "Eval cost runaway" → items 12, 13, 14
- "Real PII in source control" → item 16
- "Team A's cases dominated; Team B silent" → items 17, 18
- "Regression gate blocked good change" → item 20 (case-level investigation)
Interview Q&A
Q1. You inherit a platform's eval set. Which three items do you check first? Item 1 (role clarity) — is the set understood correctly. Item 7 (refresh cadence) — has the set been kept current. Item 17 (owner) — is anyone accountable. These three determine whether the set is producing useful signal or comfortable noise. The other items follow. Wrong-answer notes: starting with item 12 (cost) misses the deeper questions about whether the set is meaningful.
Q2. The team objects that the checklist is "too heavy for our small platform." How do you respond? A small platform can land items 1, 3, 4, 6, 9, and 17 first: role, labelling, coverage, mechanism, versioning, owner. The rest (cost optimisation, governance forum, judge calibration) become relevant as the platform scales. The checklist is a triage tool; not all items apply uniformly. These six are the substantive minimum. Wrong-answer notes: dismissing the checklist entirely produces unmonitored eval drift.
Q3. Which item on this checklist is most under-appreciated? Item 20 (case-level investigation on every regression gate block). The eval's value depends on the team's willingness to look at the cases when the score moves. A team that trusts the score blindly ships past false positives or rolls back good changes. The discipline of always looking at the cases is the difference between a useful eval and a number that decays. Wrong-answer notes: any specific item is defensible; what distinguishes is the reasoning about discipline-over-time.
Q4. The set has a 0.86 score that has been steady for nine months. The team is comfortable. What three items do you investigate? Item 7 (refresh cadence): has the set been refreshed? If not, the steady score may reflect a stale measure. Item 11 (drift signal): does production-traffic eval show a different number, indicating coverage drift? Item 4 (coverage stratification): is the steady score the average of healthy strata, or does it hide a mix of strong and failing strata? The investigation usually reveals that the score is honest about the set it measures but the set is no longer measuring what the team cares about. Wrong-answer notes: taking the score at face value is the Chapter 01 trap.
Bridge. The checklist is the engineer's defence. The last chapter is the honest opposite — what golden sets cannot solve, where the discipline is bounded, and the limits a thoughtful lead is transparent about. → 13-honest-admission.md