06. Versioning the set¶

Refresh changes the set. Versioning is what makes those changes auditable and what makes scores comparable across time. The set is an artefact like any other; it has a version, a changelog, an owner, and a release process.

A platform engineer at a Chennai SaaS company is asked to explain why the eval score dropped from 0.84 to 0.79 in the last week. She investigates. The drop is not from a system change; it is from a refresh that added 15 new failure-mode cases the system handles less well than the existing happy-path cases. The set version changed; the score did not adjust for the version change in the dashboard. The team had panicked over a system regression that was actually a set refresh. The fix is straightforward: the set's version is recorded with every score; the dashboard distinguishes "score on v3 of the set" from "score on v4 of the set"; the discontinuity is visible and explained.

This chapter is the discipline. The set is an artefact; treating it as one makes scores legible across time.

What versioning is for¶

Three uses.

Comparability. Two scores on the same set version are directly comparable. Two scores on different set versions are not; comparing them requires care.

Audit. When a question is asked about a historical decision (a model promotion, an incident), the set version at that decision must be recoverable.

Communication. A score reported with version provenance is honest about what it measures.

Without versioning, scores blur — "the eval went up" hides whether the system improved or the set got easier.

What the version captures¶

A version is a frozen snapshot of:

The case list (which cases are in the set)
The labels (the expected behaviour for each case)
The strata tags (chapter 04)
The rubric or judging mechanism per case (chapter 07)
The provenance metadata (source, sampled_at, author)
The retired cases as a historical reference (not in the active set, but findable)

Schema-wise, the set is a directory of YAML files in version control, or a small service backed by storage. The version is a git tag or a service-level identifier.

Semver for sets¶

A reasonable convention.

Bump	What changed	Comparable to previous?
Patch (3.1.0 → 3.1.1)	Label corrections; metadata fixes; no case adds or retires	Yes, for most purposes
Minor (3.1.0 → 3.2.0)	Case additions; targeted retirements; no rubric changes	Roughly, with attention
Major (3.x → 4.0.0)	Rubric changes; large refreshes; reclassification of strata	Not directly; re-baseline required

The convention is similar to API versioning. The point is that the version tells you the comparison posture.

The changelog¶

Every version bump has a changelog entry.

versions:
  - version: 3.2.0
    date: 2026-05-25
    summary: |
      Q2 refresh: added 38 cases (12 for new voice-input failure modes,
      8 for premium segment under-coverage, 18 from production sample).
      Retired 11 stale cases (3 for retired feature, 8 fixed failure
      modes that no longer occur in production).
    added: [case_201, case_202, ...]
    retired: [case_023, case_045, ...]
    relabelled: [case_087]
    baseline_score: 0.82            # new baseline after refresh
    previous_score: 0.86            # for context; v3.1.0 score
    owner_signoff: jane.doe@example.com

  - version: 3.1.0
    date: 2026-04-15
    summary: |
      Patch: corrected labels on 4 cases discovered during incident review.
    added: []
    retired: []
    relabelled: [case_032, case_044, case_077, case_098]
    baseline_score: 0.86

The changelog is readable; the changes are explicit; the score continuity is annotated.

Score dashboards with versioning¶

A dashboard showing scores over time must annotate version transitions.

Eval score (smart-reasoner) — last 6 months
+----------------------------------------------------+
|                                                    |
| 0.90 +                                             |
|      |       o                                     |
| 0.86 +    o-----o   o      *                       |
|      |   /        \ /     /                        |
| 0.82 +  o         o ---- o-----o                   |
|      |                          \                  |
| 0.78 +                            \                |
|      |                              o              |
|      +----+----+----+----+----+----+----+----+    |
|        v2.5  v2.6  v3.0      v3.1  v3.2           |
|                   (major)              (minor)    |
+----------------------------------------------------+

Notes:
- v3.0 was a major refresh with rubric changes. Re-baseline.
- v3.2 added voice-input cases (chapter 04); score drop reflects
  new stratum being harder, not system regression.

The version transitions are marked; major bumps have a re-baseline note; the team reads the chart with awareness of what shifts mean.

Reproducibility¶

When a question is asked about a historical decision ("we promoted model X to production at date Y; what was the eval score?"), the answer requires the exact set version at that time.

Store the set version on every eval run. The eval framework records set_version, run_at, model_version, prompt_version, score_per_stratum. A historical decision can be reproduced: pull the set version, re-run against the model and prompt versions, verify the score.

This is the production-grade discipline. Most platforms get here after the first time they cannot reproduce a historical score; the eval framework's per-run record is the cheap insurance against the cost of that one investigation.

Tagging the set as released¶

A version is released by tagging it. CI builds the set artefact (the YAML files plus the metadata) and tags it. Subsequent eval runs against that tag are reproducible.

A pre-release or "draft" version may exist for iteration — the team adding new cases, labellers labelling, calibration sessions. The pre-release is not used as the comparison baseline; only released versions are.

The release process: - The owner reviews the proposed set against the changelog. - The CI tests pass (set loads correctly; cases parse; no schema violations). - The version is tagged. - The eval framework's default set pointer moves to the new version (with the previous version still available for comparison).

What to keep, what to retire from version history¶

Released versions are kept forever (or per the platform's audit retention). They are the basis for any historical question.

The active version is what current evals run against. The set's "head" pointer.

Retired cases (no longer in the active set) are still in the historical versions where they were active; the set's history is intact.

Storage is cheap; retain the history.

Common mistakes¶

No version on scores. A graph of scores without version annotation is misleading; transitions look like system changes.

Big-bang refresh without version bump. Adding 50 cases in a "patch" loses the comparability the patch bump implies.

Version on the file but not on the run record. Scores cannot be reproduced because the version at run time is unknown.

Skipping the changelog. "What changed in v3.2?" requires reading git diffs and labelling sessions; the changelog is the readable answer.

Single eternal version. "Set v1" forever, with cases coming and going without bumps. Comparability is silently broken.

Interview Q&A¶

Q1. The eval score dropped from 0.86 to 0.79 in a week. The system did not change. What happened? A set refresh added cases the system handles less well than the existing distribution. The score drop is from the set, not the system. The discipline: the version is bumped; the changelog explains the new cases; the dashboard annotates the version transition; the previous and new baseline are both visible. The investigation goes to the changelog in 30 seconds. Wrong-answer notes: "the model regressed" without checking the set version is the chapter-opening panic.

Q2. Walk through what semver means for an eval set. Patch: label corrections, metadata fixes, no case changes. Comparable to previous. Minor: case additions, targeted retirements, no rubric changes. Roughly comparable, with attention to the stratum where additions landed. Major: rubric changes, large refreshes, reclassification of strata. Not directly comparable; re-baseline required. The convention parallels API versioning; the use is the same — readers know the comparison posture from the version. Wrong-answer notes: treating set versions as purely informational misses that they signal comparability.

Q3. Why store the set version on every eval run record? Reproducibility. A historical decision (a model promotion, an incident postmortem) may ask "what was the eval score at that time, and what did the set look like?" The version on the run record points to the exact set; pull the set; re-run if needed. Without the version on the record, the historical question is unanswerable beyond the score itself. The cost of recording the version on every run is small; the value the first time a reproduction is required is large. Wrong-answer notes: "we'll figure it out" without the version is the path to unanswerable historical questions.

Q4. The team has 18 months of eval scores. They want a year-over-year improvement number. How do you produce it honestly? Identify the set versions used at the start and end of the year. If they are the same, the comparison is direct. If they are different (likely; 18 months of evolution), the comparison requires care: re-run the older set against current model/prompt, re-run the current set against the old model/prompt, compare each pair to its own baseline; the deltas show change attributable to system vs to set. The honest year-over-year is "system improved by X on the older set; system improved by Y on the newer set; the difference between X and Y reflects what changed in the set's coverage." Wrong-answer notes: "compare end score to start score" produces a number that conflates set and system changes.

What to do differently after reading this¶

Version the set every refresh. Semver bumps per the convention.
Maintain a changelog readable by anyone in the team.
Record the set version on every eval run.
Annotate version transitions on score dashboards.
Retain version history; storage is cheap; reproducibility depends on history.

Bridge. Versioning is the artefact discipline. The next chapter is about the judging mechanism — for a given case type, exact-match vs reference vs rubric. The fit between case and mechanism is what makes the score meaningful. → 07-judging-mechanism-fit.md