03. Versioning and rollback — every change has a SHA, a diff, and a way back

~15 min read. The point of versioning is not history-for-history's-sake. It is the ability to undo a bad deploy in under five minutes. Everything else in this chapter exists to make that one number true.

Builds on 02-the-prompt-registry.md. The registry stores versions. This chapter is what you do with them — how to diff, when to promote, and how to roll back when the dashboard turns red.


1) Hook — a five-minute rollback as it actually happens

Look. The clock reads 14:32 on a weekday afternoon. The on-call dashboard shows a sharp dip in the refund agent's "correct intent extraction" score, from a steady-state 96% to 78% over the last fifteen minutes. The alert has just fired. Here is what the next five minutes look like.

14:32  alert fires: refund_intent_accuracy dropped 18pp in 15min
14:32  on-call pulls a failed trace
       └─ trace metadata shows prompt_sha = b1d7e4c2
14:33  on-call runs:  promptctl history billing.classifier.refund_intent
       └─ sees b1d7e4c2 deployed at 14:18, parent a8c3f971
       └─ eval_score on promotion: 91.4 (vs 96.7 on parent)
14:34  on-call runs:  promptctl rollback billing.classifier.refund_intent \
                          --to a8c3f971 --reason "live regression on tone eval"
       └─ deployment pointer flips: production → a8c3f971
14:34  config propagates; runtime caches refresh within ~30s
14:36  refund_intent_accuracy recovers to 94%, climbing
14:37  on-call posts to #incidents with the rollback SHA, parent SHA,
       eval diff, and the link to the offending PR
14:42  postmortem ticket filed; the rolled-back SHA is preserved as
       status=rolled_back for replay analysis tomorrow

Five minutes from alert to recovery. The two ingredients that made it possible were (a) the trace knew its prompt SHA, and (b) the rollback was a single pointer flip, not a code deploy. Both are properties of the registry. Without them, the same incident is a two-hour git-archeology session under pressure, with the production team trying to remember what shipped today and which engineer owns the wording.
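Ingredient (a) can be sketched in a few lines. This is a minimal illustration, not the registry's actual implementation: it assumes a SHA is the first eight hex characters of a SHA-256 over the prompt text, and `prompt_sha_of`, `Trace`, and `run_with_trace` are hypothetical names invented for the example.

```python
import hashlib
from dataclasses import dataclass, field
from datetime import datetime, timezone


def prompt_sha_of(text: str) -> str:
    """Content-addressed identity: the same text always yields the same SHA.

    Assumption: an 8-hex-character prefix of SHA-256, chosen only to
    resemble the short SHAs used in this chapter.
    """
    return hashlib.sha256(text.encode("utf-8")).hexdigest()[:8]


@dataclass
class Trace:
    prompt_name: str
    prompt_sha: str  # stamped at call time, never reconstructed later
    output: str
    started_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )


def run_with_trace(prompt_name: str, prompt_text: str, call_model) -> Trace:
    """Call the model and record which exact prompt text produced the output."""
    output = call_model(prompt_text)
    return Trace(prompt_name, prompt_sha_of(prompt_text), output)


# Usage with a stand-in model function (hypothetical).
trace = run_with_trace(
    "billing.classifier.refund_intent",
    "Classify the refund intent of the message.",
    call_model=lambda text: '{"intent": "refund_request"}',
)
```

Because the SHA is derived from the text at call time, the on-call never has to guess which version was live when a given trace was produced.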

The rest of this chapter is the discipline that gets you to the five-minute number, every time.


2) The metaphor — git for the recipe book

The bakery has been doing this for a year now. The recipe book has hundreds of revisions across dozens of recipes. The head baker treats the book like git treats source code. Every revision is a commit. Every commit has a parent. Every recipe has a deployed pointer that says "today's sourdough is at this revision". When tomorrow's sourdough goes wrong, the fix is not to bake another batch — it is to flip the pointer back to yesterday's revision and put it on the shelves. The rollback is a ten-second operation, faster than baking a new loaf.

The mental model that transfers cleanly is git. Branch, edit, diff, review, merge, tag, revert. The vocabulary maps almost one-to-one onto prompt ops. What does not transfer cleanly is the diff — and that is the section that deserves the most attention, because prompt diffs lie in ways code diffs do not.


3) The diff problem — text changed, behavior changed differently

A code diff says this character changed. The reviewer reads the diff, knows the change, predicts the impact. The mapping from diff to impact is usually tight. Same character pattern, same behavioral effect.

A prompt diff says this word changed. The mapping from diff to impact is much looser. "Always greet the user" vs "Greet the user" is one word. The behavior shift is real but small. "Be helpful" vs "Be helpful and concise" is two words. The behavior shift can be enormous: the agent stops asking clarifying questions, gives shorter answers, and sometimes refuses to engage with ambiguous requests. The diff looks microscopic. The behavior is twenty percent different.
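To make the text side concrete, here is a minimal sketch of a word-level diff over the two example pairs, using Python's standard `difflib`. The function name `word_diff` and the (op, text) output shape are assumptions for illustration; a real registry may diff by character or sentence instead.

```python
import difflib


def word_diff(old: str, new: str) -> list[tuple[str, str]]:
    """Return (op, text) pairs where op is 'equal', 'delete', or 'insert'."""
    old_words, new_words = old.split(), new.split()
    matcher = difflib.SequenceMatcher(a=old_words, b=new_words, autojunk=False)
    result = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            result.append(("equal", " ".join(old_words[i1:i2])))
            continue
        # 'replace' is reported as a delete followed by an insert.
        if i2 > i1:
            result.append(("delete", " ".join(old_words[i1:i2])))
        if j2 > j1:
            result.append(("insert", " ".join(new_words[j1:j2])))
    return result


diff = word_diff("Be helpful", "Be helpful and concise")
# [('equal', 'Be helpful'), ('insert', 'and concise')]
```

The output is tiny in both cases, which is the point: nothing in it indicates how much the behavior moved.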

Two diffs, then.

SYNTACTIC DIFF          SEMANTIC DIFF
──────────────          ─────────────
text change             behavior change
shown as red/green      shown as eval-score delta
"is this prose I want"  "does the agent still do what I want"
read by humans          measured by the eval suite
fast (immediate)        slow (eval run)
necessary               necessary

A mature prompt-review surface shows both. The syntactic diff is the familiar red-and-green text view — the registry computes it character by character (or word by word, or sentence by sentence) and renders it. The semantic diff is the eval differential — both the old SHA and the new SHA are run against the same eval suite, the per-test outcomes are compared, and the differences are shown as a behavior-change table.

EVAL DIFFERENTIAL: b1d7e4c2 vs a8c3f971 (excerpt, 6 of 25 cases)

case_id                  parent passed   new passed   delta
refund.basic_001         yes             yes          0
refund.basic_002         yes             yes          0
refund.tone_011          yes             no           regression
refund.tone_012          yes             no           regression
refund.edge_021          no              yes          improvement
refund.edge_022          yes             yes          0
─────────────────────────────────────────────────────────────
totals (all 25 cases)    24/25 (96.0%)   21/25 (84.0%) -3 net

regressions found; promotion blocked.

The semantic diff is what catches the regression at promotion time rather than at 14:32 in production. The syntactic diff is what tells the reviewer why the regression happened: "ah, removing 'and concise' is what dropped the tone scores".
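The eval differential itself is a small computation once both SHAs have been run against the same cases. The sketch below assumes per-case outcomes are available as a mapping from case_id to pass/fail, and that the gate policy is "any regression blocks promotion"; `CaseDelta`, `eval_differential`, and `promotion_blocked` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CaseDelta:
    case_id: str
    parent_passed: bool
    new_passed: bool

    @property
    def delta(self) -> str:
        if self.parent_passed and not self.new_passed:
            return "regression"
        if self.new_passed and not self.parent_passed:
            return "improvement"
        return "0"


def eval_differential(
    parent: dict[str, bool], new: dict[str, bool]
) -> list[CaseDelta]:
    """Compare per-case outcomes of two SHAs run against the same suite."""
    if parent.keys() != new.keys():
        raise ValueError("both SHAs must be run against the same eval cases")
    return [CaseDelta(cid, parent[cid], new[cid]) for cid in sorted(parent)]


def promotion_blocked(deltas: list[CaseDelta]) -> bool:
    """Assumed gate policy: any regression blocks promotion."""
    return any(d.delta == "regression" for d in deltas)


# The six cases shown in the table above.
parent_results = {
    "refund.basic_001": True, "refund.basic_002": True,
    "refund.tone_011": True, "refund.tone_012": True,
    "refund.edge_021": False, "refund.edge_022": True,
}
new_results = {
    "refund.basic_001": True, "refund.basic_002": True,
    "refund.tone_011": False, "refund.tone_012": False,
    "refund.edge_021": True, "refund.edge_022": True,
}
deltas = eval_differential(parent_results, new_results)
regressions = [d.case_id for d in deltas if d.delta == "regression"]
```

Comparing per-case outcomes rather than aggregate scores matters here: one improvement can mask one regression in the total, but not in the table.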

Mini-FAQ. "Can we skip the syntactic diff and rely on eval scores?" No. Eval scores tell you behavior changed; they do not tell you which wording change caused it. Reviewers need both — the eval delta to know the change matters, the text diff to know which words to argue about.


4) Anatomy of a rollback

A rollback is one operation against one resource — the deployment pointer. The content store is untouched. The audit trail records a new event. The runtime picks up the change within its config-refresh window. That is the whole shape.

┌───────────────────────────────────────────────────────────────────────┐
│ ROLLBACK OPERATION                                                    │
├───────────────────────────────────────────────────────────────────────┤
│ resource         deployment_pointer[billing.classifier.refund_intent] │
│ from_sha         b1d7e4c2  (current production)                       │
│ to_sha           a8c3f971  (target — previously deployed)             │
│ environment      production                                           │
│ actor            on-call@acme.com                                     │
│ reason           "live regression: tone eval -18pp"                   │
│ eval_diff        attached: 3 regressions vs parent                    │
│ executed_at      2026-02-11T14:34:12Z                                 │
│ propagation_sla  30s (runtime cache refresh)                          │
│ status           applied | reverted | rolled_forward                  │
└───────────────────────────────────────────────────────────────────────┘

Three properties of this shape are worth dwelling on.

The unit of work is a single prompt. Rollback acts on billing.classifier.refund_intent, not on "the system" or "the deploy". This is critical because real production has many prompts, and a regression in one should not require touching the others. Rolling back the refund classifier should not touch the greeter, the summarizer, or the search rewriter.

The target is a SHA, not a relative position. "Roll back the last deploy" sounds intuitive and is a bug factory. If two deploys happened between the regression and the rollback, "the last deploy" is now ambiguous. Rolling back by SHA is unambiguous — "production will run text whose content hashes to a8c3f971". The history determines the SHA; clicking does not.

The audit trail is mandatory, not optional. Every rollback records who, what SHA, from what SHA, why, and what the eval diff was. The reason field is not a comment — it is the input to the postmortem that follows. Teams that allow blank reason fields lose half their incident analysis to "I don't remember why we rolled this back".
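The three properties fit in a short in-memory sketch. It is an illustration under stated assumptions, not how promptctl or any particular registry is implemented: `Registry` and `RollbackEvent` are hypothetical, pointers are keyed by (prompt name, environment), and a rollback target must have been deployed to that environment before.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class RollbackEvent:
    prompt_name: str
    environment: str
    from_sha: str
    to_sha: str
    actor: str
    reason: str
    executed_at: datetime


class Registry:
    def __init__(self):
        self.pointers = {}       # (prompt_name, environment) -> sha
        self.ever_deployed = {}  # (prompt_name, environment) -> set of shas
        self.audit_log = []      # append-only list of RollbackEvent

    def deploy(self, name, env, sha):
        self.pointers[(name, env)] = sha
        self.ever_deployed.setdefault((name, env), set()).add(sha)

    def rollback(self, name, env, to_sha, actor, reason):
        # Property 3: the audit trail is mandatory, so a blank reason fails.
        if not reason.strip():
            raise ValueError("reason is mandatory")
        # Property 2: the target is an explicit SHA that once shipped.
        if to_sha not in self.ever_deployed.get((name, env), set()):
            raise ValueError(f"{to_sha} was never deployed to {env}")
        from_sha = self.pointers[(name, env)]
        # Property 1: one prompt, one pointer. Nothing else changes.
        self.pointers[(name, env)] = to_sha
        event = RollbackEvent(
            name, env, from_sha, to_sha, actor, reason,
            datetime.now(timezone.utc),
        )
        self.audit_log.append(event)
        return event


registry = Registry()
registry.deploy("billing.classifier.refund_intent", "production", "a8c3f971")
registry.deploy("billing.classifier.refund_intent", "production", "b1d7e4c2")
event = registry.rollback(
    "billing.classifier.refund_intent", "production",
    to_sha="a8c3f971", actor="on-call@acme.com",
    reason="live regression on tone eval",
)
```

Rejecting a blank reason at the call site is the cheapest place to enforce the audit rule; by the time the postmortem starts, it is too late to ask.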


5) Worked example — three rollbacks of different shapes

Three real-shaped scenarios. Each one ends in a rollback, but each one exercises a different part of the system.

Scenario 1 — the clean rollback

The hook scenario. New SHA deployed at 14:18, regression detected at 14:32, rollback to parent SHA at 14:34, recovery at 14:36. Under five minutes from alert to recovery. The new SHA goes to status rolled_back. The parent SHA returns to status deployed.

This is the easy case because the parent SHA was healthy and recent. Production runs it again with no further work.

Scenario 2 — the rollback-where-the-old-version-no-longer-fits

Two weeks ago, a downstream change happened. The intent classifier began emitting a new intent category — subscription_pause — that the parent prompt did not anticipate. The new SHA was the one that taught the model to use the new category. Rolling back to the parent SHA removes the regression and the new capability. Downstream code that now depends on subscription_pause as a possible intent will start receiving other instead and route those tickets to the wrong queue.

This is why rollback is not free. The decision is "do I want to lose the last two weeks of capability to fix the regression, or do I want to forward-fix the regression on the new SHA?".

FORWARD-FIX vs ROLLBACK — DECISION

Is the regression bigger than the capability loss?
   ┌───┴───┐
  yes      no
   │       │
   ▼       ▼
ROLLBACK   FORWARD-FIX
           (cut a new SHA that
            keeps the new capability
            and fixes the regression)

A rollback that loses capability needs a paired plan — either accept the capability loss until forward-fix lands, or restore capability through some other path. Either way, the rollback is not "the end of the incident".

Scenario 3 — the rollback-without-code-rollback

The new SHA changed the prompt's output format. The old SHA returned {"intent": ..., "confidence": ...}. The new SHA was supposed to return {"intent": ..., "confidence": ..., "rationale": ...}. The downstream parser was updated in the same release to expect the new format — specifically, to read the rationale field. Two PRs went out: a code PR that updated the parser, and a prompt PR that updated the prompt. Both shipped on the same day.

Now there is a regression and the on-call rolls back the prompt. The prompt now returns the old format. The parser, still expecting the new format, throws KeyError: 'rationale' on every call. The incident gets worse.

The rule that protects against this is sharp. Rolling back a prompt requires asking what code depends on the prompt's new shape, and rolling that back too. The registry surface that handles this well will warn the on-call — "this SHA changed the output format from X to Y; code at parsers/refund.py@abc123 was updated to expect Y; rolling back to the old format will break that code". The warning gives the on-call a choice — roll back both, or forward-fix on the new SHA. Either is fine. Doing only one is not.

The general lesson is that prompt and code are coupled at the contract, and the rollback unit needs to respect the coupling. Many teams discover this the hard way the first time they roll a prompt back. Add it to your runbook.
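A coupling warning of the kind described above can be sketched as a set comparison. The assumptions are loud ones: the registry knows which output fields each SHA emits, and someone has recorded which fields each consumer reads. `coupling_warnings` and the `consumers` mapping are hypothetical.

```python
def coupling_warnings(
    target_fields: set[str],
    consumers: dict[str, set[str]],
) -> list[str]:
    """Return one warning per consumer that reads a field the target omits."""
    warnings = []
    for consumer in sorted(consumers):
        missing = consumers[consumer] - target_fields
        if missing:
            warnings.append(
                f"{consumer} reads {sorted(missing)}, "
                "which the rollback target does not emit"
            )
    return warnings


# Output fields per SHA, taken from Scenario 3.
old_format = {"intent", "confidence"}
new_format = {"intent", "confidence", "rationale"}

# Hypothetical dependency note: which fields each consumer reads.
consumers = {"parsers/refund.py": {"intent", "confidence", "rationale"}}

rollback_warnings = coupling_warnings(old_format, consumers)
stay_warnings = coupling_warnings(new_format, consumers)
```

A non-empty result does not forbid the rollback; it tells the on-call that the parser has to move in the same step.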


Mid-content recall

  1. What is the difference between a syntactic and a semantic prompt diff?
  2. Why is "roll back the last deploy" a worse instruction than "roll back to SHA a8c3f971"?
  3. When does rolling back a prompt require also rolling back the code that reads its output?

6) The rollback runbook — what the team actually executes

A registry without a runbook is a registry that gets used wrong under pressure. The runbook is short, fits on one page, and lives somewhere the on-call can find it in seconds.

PROMPT REGRESSION — ROLLBACK RUNBOOK

1. CONFIRM
   - Open a failed trace from the regression window.
   - Note prompt_name and prompt_sha.

2. LOOK
   - Run: promptctl history <prompt_name>
   - Find the previous deployed SHA (status was deployed before current).
   - Note its eval_score on promotion.

3. CHECK COUPLING
   - Read the changelog of the current SHA.
   - Does it change output format, add tools, remove tools, change tone?
   - If output format or tool list changed, check whether code depending on
     the new shape needs to be rolled back too.
   - If yes, coordinate the dual rollback with the service owner.

4. EXECUTE
   - Run: promptctl rollback <prompt_name> --to <target_sha> \
            --reason "<one-line reason>"
   - Wait for runtime cache refresh window (typically 30-90s).

5. VERIFY
   - Watch the regression metric for 5-10 minutes.
   - If recovered, post to #incidents with: from_sha, to_sha, reason,
     eval_diff link, recovery time.
   - If not recovered, the regression is not (only) the prompt — escalate
     to engineering on-call.

6. RECORD
   - File postmortem ticket within 24 hours.
   - Include: eval scores before/after, reviewers on the original promotion,
     why the eval gate missed it.

7. FORWARD-FIX
   - Once stable, the prompt owner cuts a new SHA that restores intent
     without the regression.
   - The rolled-back SHA stays as status=rolled_back for replay analysis.

The runbook reads short because it is supposed to. The decisions live in the registry and the eval surface; the runbook just sequences them.
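Steps 2 and 5 of the runbook can be sketched as two small helpers. This assumes the history is available as a list of deployments ordered oldest to newest, and that recent traces expose their prompt_sha; `Deployment`, `rollback_target`, and `share_on_sha` are hypothetical names, and the SHA 0f3e9d11 is invented for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Deployment:
    sha: str
    eval_score: float
    rolled_back: bool = False  # True if this SHA was itself pulled from prod


def rollback_target(deployments: list[Deployment], current_sha: str) -> Deployment:
    """Step 2 (LOOK): newest earlier deployment that was not itself rolled back.

    `deployments` is ordered oldest to newest.
    """
    shas = [d.sha for d in deployments]
    idx = shas.index(current_sha)  # raises ValueError for an unknown SHA
    for candidate in reversed(deployments[:idx]):
        if not candidate.rolled_back:
            return candidate
    raise LookupError("no earlier healthy deployment to roll back to")


def share_on_sha(trace_shas: list[str], target_sha: str) -> float:
    """Step 5 (VERIFY): fraction of recent traces served by target_sha."""
    if not trace_shas:
        return 0.0
    return sum(1 for sha in trace_shas if sha == target_sha) / len(trace_shas)


history = [
    Deployment("0f3e9d11", 95.9),  # invented older SHA, for illustration
    Deployment("a8c3f971", 96.7),
    Deployment("b1d7e4c2", 91.4),
]
target = rollback_target(history, "b1d7e4c2")
share = share_on_sha(["a8c3f971", "a8c3f971", "b1d7e4c2", "a8c3f971"], target.sha)
```

Checking the share of traces on the target SHA separates "the rollback has not propagated yet" from "the rollback propagated and the metric is still bad".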


7) The "rolled back but the eval still passes" surprise

A pattern that catches teams off-guard the first time. A prompt SHA is deployed. Two weeks pass. Production behavior drifts in a way that depends on the new prompt's framing — agents start trusting a specific phrase, ops teams build a triage workflow around the new output format, customer support scripts reference the new tone. A regression is detected. Rollback runs. The eval suite — designed before any of these external adaptations — still says the old SHA is correct. The eval passes. The metrics dip not because the rollback was wrong but because the world has moved.

The lesson is that an eval suite measures intent, not equilibrium. A rollback restores the intent the suite encodes. The equilibrium production adapted to is something else. When the two diverge enough, the right move is sometimes not to roll back — sometimes the right move is to accept that the new SHA has become load-bearing for downstream workflows, and to forward-fix the regression on it instead.

The decision is a judgment call, but it follows a clean rule. If the regression is one the eval gate should have caught at promotion time, roll back. That the gate missed it means either the suite has a gap (file an eval task) or the regression sits in a dimension the suite does not cover (file a triage task). Either way, rolling back puts you on solid eval ground while you decide what to do next.


8) Failure modes — where rollback breaks

SYMPTOM                                  FIX
───────                                  ───
rollback takes > 30 min                  make rollback one registry command;
                                         fix runtime cache refresh path
"which version was live at 14:31"        stamp prompt_sha on every trace
rollback breaks downstream parser        check coupling before rollback;
                                         dual-roll code if needed
rollback target is also broken           skip the parent, roll further back;
                                         use status=deployed history
two engineers race to roll back          registry must serialize pointer
                                         updates (atomic CAS)
nobody recorded why we rolled back       reason field on rollback is
                                         mandatory; lint at the CLI
forward-fix and rollback fight each      lock the prompt while rollback is
  other                                  in-flight; release after verify
rollback hit prod, but staging didn't    per-environment pointers; apply the
                                         rollback to staging explicitly
"rolled back to a SHA that never         allow rollback only to SHAs whose
  shipped"                               status was once deployed
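The "atomic CAS" fix for racing engineers in the table above can be sketched with a compare-and-set on the pointer. This is an in-memory stand-in that assumes a single process and a lock; a real registry would rely on its datastore's conditional write. `PointerStore` is a hypothetical name and 0f3e9d11 is an invented SHA.

```python
import threading


class PointerStore:
    """In-memory stand-in for the registry's deployment pointers."""

    def __init__(self):
        self._lock = threading.Lock()
        self._pointers = {}

    def get(self, key):
        with self._lock:
            return self._pointers.get(key)

    def compare_and_set(self, key, expected, new):
        """Set key to new only if it still holds expected; return success."""
        with self._lock:
            if self._pointers.get(key) != expected:
                return False
            self._pointers[key] = new
            return True


store = PointerStore()
key = ("billing.classifier.refund_intent", "production")
store.compare_and_set(key, None, "b1d7e4c2")

# Both engineers read the same starting point before acting.
seen_by_first = store.get(key)
seen_by_second = store.get(key)

first_ok = store.compare_and_set(key, seen_by_first, "a8c3f971")
# The second write is rejected because the pointer has already moved.
second_ok = store.compare_and_set(key, seen_by_second, "0f3e9d11")
```

The losing engineer gets an explicit failure and re-reads the pointer, instead of silently overwriting a rollback that was already working.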

The leak that surprises new teams is cache refresh latency. The registry flips the pointer in milliseconds. The runtime that reads the registry may cache the resolved SHA for thirty seconds, sixty seconds, or longer. Rollback "happens" at registry-flip time; rollback takes effect at cache-refresh time. The runbook needs to know the difference. The deployment SLA is the cache window.
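The gap between registry-flip time and cache-refresh time is easy to reproduce. The sketch assumes the runtime resolves the pointer through a simple time-to-live cache; `CachedResolver` is hypothetical, and the clock is injected so the example does not need to sleep.

```python
import time


class CachedResolver:
    """Resolve a deployment pointer, re-reading it at most once per TTL."""

    def __init__(self, lookup, ttl_seconds=30.0, clock=time.monotonic):
        self._lookup = lookup
        self._ttl = ttl_seconds
        self._clock = clock
        self._cached_sha = None
        self._fetched_at = None

    def resolve(self):
        now = self._clock()
        expired = self._fetched_at is None or now - self._fetched_at >= self._ttl
        if expired:
            self._cached_sha = self._lookup()
            self._fetched_at = now
        return self._cached_sha


pointer = {"sha": "b1d7e4c2"}
fake_now = [0.0]
resolver = CachedResolver(
    lambda: pointer["sha"], ttl_seconds=30.0, clock=lambda: fake_now[0]
)

before = resolver.resolve()   # t=0s: caches b1d7e4c2
pointer["sha"] = "a8c3f971"   # registry flip happens right after
fake_now[0] = 10.0
during = resolver.resolve()   # t=10s: cache still serves b1d7e4c2
fake_now[0] = 30.0
after = resolver.resolve()    # t=30s: cache expired, serves a8c3f971
```

For the whole TTL window the registry and the runtime disagree, which is why the verify step should wait at least one cache window before concluding the rollback failed.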


Where this lives in the wild

The pattern of versioned artifact, deployment pointer, one-command revert is well-trodden across many systems.

  • Langfuse — labels (production, staging) act as deployment pointers; flipping a label is the rollback operation.
  • PromptLayer — "releases" map a prompt template to a specific version per environment; rollback flips the release pointer.
  • Pezzo — environments and version pinning; one-click rollback in the UI.
  • Helicone — replay-against-prior-version flow makes rollbacks testable before they are applied.
  • Vellum — explicit promote and rollback actions with audit trail and reason fields.
  • Braintrust — eval differential is rendered alongside the diff; the semantic-diff workflow is native.
  • LangSmith — prompt versions pinned by hash; clients re-resolve on every cache refresh.
  • PromptHub — git-backed registry; rollback is a revert PR.
  • OpenAI Stored Prompts — version pins via the API; rollback by switching the pinned version.
  • Anthropic Workbench — version history per workspace; manual rollback by copying an earlier SHA.
  • Vercel AI SDK — registry-agnostic, defers rollback to the backing store.
  • GitHub PR review — when the registry is a git repo, the rollback is a revert commit on the deployments file.
  • GitLab Merge Requests — same flow on GitLab.
  • Phabricator — revert-commit flow in large monorepos using the older tool.
  • ReviewBoard — review surface for the revert PR.
  • dbt — model version history and rollback by deploying an earlier revision; same mental model.
  • HashiCorp Vault — secret versions and one-command rollback for the secret half.
  • AWS Parameter Store — version history on every parameter; rollback by updating the pointer.
  • AWS Secrets Manager — versioned secrets with rollback support.
  • ConfigCat — flag-value rollback via dashboard; works when a prompt is modeled as a flag.
  • LaunchDarkly — flag rollouts and rollbacks; same mental model for prompt pointer flips.
  • Statsig — config rollback with experiment-history context.
  • Flagsmith — open-source flag rollback pattern.
  • Optimizely — variant rollback for prompt A/B experiments.
  • Split.io — flag-and-rollout history with one-click revert.
  • Kubernetes deployments — kubectl rollout undo is the same shape; pointer flip with audit trail. The mental model transfers cleanly.
  • MLflow model registry — stage transitions and rollback; the closest ML-adjacent pattern.

The shared property across all of these is versioned artifact + deployment pointer + one-command revert + audit trail. If a registry pitch lacks any of those, the rollback story is not yet complete.


Pause and recall

  1. What is the unit of work in a rollback?
  2. Why does a rollback target a SHA rather than a relative position?
  3. What is an eval differential and what does it tell you that a text diff does not?
  4. Give two reasons a forward-fix can be better than a rollback.
  5. What is the "rolled back but the eval still passes" surprise about?
  6. What is the typical SLA from registry pointer flip to runtime effect?
  7. Why does rolling a prompt back sometimes require rolling code back too?

Interview Q&A

Q1. Walk me through a prompt rollback from alert to recovery.
A. Alert fires on a regression metric. On-call pulls a failed trace, reads the prompt_sha. Checks the registry's history for that prompt name; finds the previous deployed SHA. Verifies no breaking coupling with code (output format, tool list). Runs the rollback command with a reason. Watches the metric for 5-10 minutes. If recovered, posts the SHA pair, eval diff, and recovery time. If not recovered, escalates. Files a postmortem within 24 hours. Forward-fix later.
Trap: Skipping the coupling check. Rolling a prompt back without rolling back the code that reads its new shape makes the incident worse.

Q2. Why are prompt diffs harder than code diffs?
A. Code diffs have tight mapping between text change and behavior change. Prompt diffs do not. Two words added to a prompt can shift behavior by double-digit percentage points; or change nothing. The right review surface shows both a syntactic diff (text red/green) and a semantic diff (eval differential) so the reviewer can read both layers.
Trap: Reviewing only the text diff. The text looks fine; the eval catches what the eye missed.

Q3. How long should a rollback take?
A. Under five minutes end-to-end, from alert to recovery. The registry pointer flip is sub-second. The runtime cache refresh is typically 30-90 seconds. The decision, history lookup, and coupling check take the rest. If a rollback takes longer, the failure is in the tooling, not the discipline.
Trap: "We aim for an hour SLO." For a wrong-prompt regression, that is fifty-five minutes of bad answers to customers.

Q4. What is the difference between rolling back and forward-fixing?
A. Rolling back flips the deployment pointer to a previous SHA. Forward-fixing cuts a new SHA that preserves the current direction but corrects the regression. Roll back when the previous SHA's behavior is what you want. Forward-fix when the new SHA introduced capability you cannot afford to lose. The two are not mutually exclusive — many teams roll back first to stop the bleeding and forward-fix afterward.
Trap: "Always roll back." Sometimes the rollback loses two weeks of capability and the right move is to forward-fix in twenty minutes.

Q5. How do you record a rollback?
A. Every rollback event captures who, the source SHA, the target SHA, the environment, the reason, and the eval diff between the SHAs. The rolled-back SHA's status changes from deployed to rolled_back. The target SHA's status changes back to deployed. The audit log keeps both as distinct events.
Trap: Blank reason fields. The reason is the input to the next postmortem.

Q6. How do you handle a rollback that breaks downstream code?
A. Check coupling before rolling back. If the new SHA changed output format, added a field, removed a tool, or shifted shape in a way code relied on, that code needs to roll back in lockstep. The registry should surface coupling warnings ("this SHA changed shape X; code at Y depends on it"). The runbook step is "rollback both or neither".
Trap: Treating the prompt as if it lived in isolation. The prompt's output contract is part of the system; rollback respects the contract.

Q7. Why is content-addressed identity the foundation of rollback?
A. Because rollback is "production points at this exact text". Without content-addressing, "this exact text" is a moving target — v17 today is different from v17 last week. With SHAs, the target is a content fingerprint that cannot lie. Every rollback knows exactly what it is restoring.
Trap: Numbering versions and editing in place. The first rollback after that pattern reveals the broken link from SHA to text.

Q8. What is the most common rollback failure mode you have seen?
A. Cache refresh latency. The team flips the pointer, expects instant recovery, watches the dashboard for two minutes, panics, rolls forward again. Meanwhile the original rollback was working — the runtime cache just had not refreshed yet. The fix is to publish the cache window as part of the runbook, set the watch interval accordingly, and add a verify-on-trace step that confirms a new SHA is actually in use.
Trap: Assuming the registry change is the runtime change. It is the registry's change; the runtime catches up on its own schedule.


Apply now (5 min)

Step 1 — write your rollback runbook. One page. Steps from alert to recovery. The seven-step skeleton in section 6 is a starting point. Adapt to your registry's specific commands and your runtime's cache window.

Step 2 — measure your worst case. Pick one production prompt. Time yourself doing a fake rollback in staging — alert to verified recovery. Note where the minutes go. If you are above five minutes, the failure is usually either runtime cache or the coupling check.

Step 3 — add coupling checks. For each production prompt, write a one-line note of what code depends on its output shape. Keep the notes with the prompt entry in the registry. Next time someone proposes a shape change, the dependency is in the diff.


Bridge. Rollback is the safety net. It only catches you if the eval gate did not. And the eval gate only fires if a review happened. The registry, the SHA, the diff — they all assume someone is looking at the change before it ships. The next chapter is who that someone is, and what they should be looking at.

04-review-gates.md