05. Rollback discipline
The canary is the gradual rollout. Rollback is what happens when the canary or production shows the change is wrong. The discipline is to make rollback tested, fast, and rehearsed — so it works when the team needs it most.
A platform engineer at a Hyderabad SaaS company is paged: a model migration's canary at 25% is showing degraded feedback. The rollback procedure exists on paper. She executes it and runs into a configuration mismatch; the rollback takes 45 minutes instead of the expected 5. During those 45 minutes, the regressed model handles thousands of additional conversations. The postmortem finds that the rollback was never tested; the discipline existed in documentation only. The team adopts a monthly rollback drill: the rollback steps are executed end to end in a non-production environment, confirming that the production procedure still works. The next time a rollback is needed, it takes 4 minutes.
This chapter is the rollback discipline. Test it; rehearse it; make it fast; verify it.
What rollback is
Rollback is the reversal of a release to a known-good prior state, tested in advance, executed fast when triggered, and verified after execution.
The discipline is in the "tested in advance" and "verified after execution" — many teams have a rollback in concept but not in practice.
What gets rolled back
| Change type | Rollback mechanism |
|---|---|
| Prompt change | Route weight to 0; previous version of prompt continues |
| Model change | Alias mapping to previous model; route weight shift |
| Agent code | Standard software rollback (redeploy previous version) |
| Eval change | Revert the set version; re-baseline |
| Data change | Revert to previous data version (depending on the storage); rebuild affected indexes |
Each type's rollback uses the type's normal mechanism. Prompt and model rollbacks are typically fastest (weight flips). Agent code rollback is software-typical. Eval and data rollbacks are slower (re-baselining, rebuilding).
The rollback flag
For prompt and model changes, the rollback is a flag flip:
    aliases:
      smart-reasoner:
        candidates:
          - id: anthropic:claude-sonnet-4-6:ap-south-1
            version_pin: claude-sonnet-4-6
            weight: 0    # <-- rolled back, previously 25
          - id: anthropic:claude-sonnet-4-5:ap-south-1
            version_pin: claude-sonnet-4-5
            weight: 100  # <-- previous version restored
The flag lives in the routing policy (02_ai_infrastructure/01 chapter 03), so the change is fast (seconds to minutes). New traffic starts hitting the previous version as soon as the updated policy takes effect.
The previous version must still be available — not retired. Chapter 6 (versioning) ensures versions are retained for the rollback window.
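The flag flip and the availability requirement can be sketched in a few lines of Python. This is a minimal illustration, assuming the routing policy has been loaded into a plain dictionary with the same shape as the YAML above. The function name `roll_back_alias` and the `available_ids` set (standing in for a registry lookup) are hypothetical, not part of any real routing API.

```python
from copy import deepcopy

GOOD = "anthropic:claude-sonnet-4-5:ap-south-1"  # known-good previous version
BAD = "anthropic:claude-sonnet-4-6:ap-south-1"   # version being rolled back


def roll_back_alias(policy, alias, good_id, available_ids):
    """Return a copy of the policy with all traffic on good_id.

    available_ids is a hypothetical stand-in for "versions the registry
    still retains"; a retired target makes the rollback impossible.
    """
    if good_id not in available_ids:
        raise ValueError(f"rollback target {good_id} is not available")
    new_policy = deepcopy(policy)  # leave the live policy object untouched
    candidates = new_policy["aliases"][alias]["candidates"]
    if good_id not in {c["id"] for c in candidates}:
        raise KeyError(f"{good_id} is not a candidate of alias {alias}")
    for candidate in candidates:
        candidate["weight"] = 100 if candidate["id"] == good_id else 0
    return new_policy


policy = {
    "aliases": {
        "smart-reasoner": {
            "candidates": [
                {"id": BAD, "version_pin": "claude-sonnet-4-6", "weight": 25},
                {"id": GOOD, "version_pin": "claude-sonnet-4-5", "weight": 75},
            ]
        }
    }
}

rolled_back = roll_back_alias(policy, "smart-reasoner", GOOD, {GOOD, BAD})
```

Returning a new policy rather than mutating the old one keeps the pre-rollback state available for the incident record.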
Testing the rollback
A rollback is tested before it is needed. Two approaches:
Pre-release. Before the change ships, the rollback procedure is documented; in a non-production environment, the rollback is executed; it works. The release ships with the verified rollback path.
Periodic drill. Once per quarter (or monthly for high-stakes platforms), the team runs through the rollback procedure in a staging environment as a drill. The drill catches drift in the procedure (a config change has broken the rollback; a flag has moved; documentation is stale).
Most platforms do both: per-release verification plus periodic drill.
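A drill can be as simple as running the documented steps in order, timing them, and checking the end state. The sketch below assumes each step is a plain callable and that staging state is an in-memory dictionary. `run_rollback_drill` and the step names are hypothetical; a real drill would call the team's own tooling.

```python
import time


def run_rollback_drill(steps, verify, target_seconds, clock=time.monotonic):
    """Run (name, callable) steps in order, time them, then verify the result."""
    started = clock()
    for name, step in steps:
        try:
            step()
        except Exception as exc:  # a stale step is exactly what a drill should surface
            return {"passed": False, "failed_step": name, "error": str(exc),
                    "elapsed_seconds": clock() - started}
    elapsed = clock() - started
    verified = bool(verify())
    return {"passed": verified and elapsed <= target_seconds,
            "failed_step": None if verified else "verify",
            "error": None, "elapsed_seconds": elapsed}


# Hypothetical staging state: route weights for the new and previous versions.
staging = {"new": 25, "previous": 75}


def zero_new_weight():
    staging["new"] = 0


def restore_previous_weight():
    staging["previous"] = 100


drill_steps = [("zero new weight", zero_new_weight),
               ("restore previous weight", restore_previous_weight)]
result = run_rollback_drill(
    drill_steps,
    verify=lambda: staging == {"new": 0, "previous": 100},
    target_seconds=300,
)
```

A failed step reports its name, so the drill output points directly at the part of the procedure that has drifted.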
Speed targets
The rollback should be fast — ideally minutes from decision to traffic-restored.
| Change type | Target rollback time |
|---|---|
| Prompt change | < 5 minutes |
| Model change | < 10 minutes |
| Agent code | Software rollback time (typically 10-30 minutes) |
| Eval change | Minutes to revert the set version; re-baselining follows |
| Data change | Depends on data; some take hours (index rebuild) |
The speed is a design property; it requires the registry/routing infrastructure (chapter 06) plus the tested procedure. Without infrastructure or testing, the rollback is slow.
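Measuring against the targets is simple arithmetic on two timestamps: the decision and the moment traffic was restored. The sketch below encodes the numeric targets from the table. Treating 30 minutes as the bound for agent code is an assumption (the upper end of "typically 10-30 minutes"), and the names `ROLLBACK_TARGETS` and `meets_target` are illustrative only.

```python
from datetime import datetime, timedelta

# Targets from the table; eval and data changes have no fixed numeric target.
ROLLBACK_TARGETS = {
    "prompt": timedelta(minutes=5),
    "model": timedelta(minutes=10),
    "agent_code": timedelta(minutes=30),  # assumed upper bound
}


def meets_target(change_type, decided_at, traffic_restored_at):
    """True if decision-to-restored time is strictly under the target."""
    duration = traffic_restored_at - decided_at
    return duration < ROLLBACK_TARGETS[change_type]


decided = datetime(2024, 1, 1, 10, 0)  # illustrative timestamp
untested = meets_target("model", decided, decided + timedelta(minutes=45))
rehearsed = meets_target("model", decided, decided + timedelta(minutes=4))
```

With the chapter-opening numbers, the 45-minute rollback misses the model-change target and the rehearsed 4-minute rollback meets it.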
Who decides to roll back
The decision authority:
- On-call engineer. Authorised for prompt and model rollbacks (the highest-velocity changes); decisions in minutes.
- Platform lead. Authorised for broader actions (e.g., rollback of multiple coordinated changes); decisions in tens of minutes.
- Senior leadership. For strategic decisions (rolling back a feature that has been heavily marketed); rare.
The on-call has the authority for the common case; escalation paths exist for harder cases. The decision is fast because the discipline is fast.
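The authority levels above can be written down as data so that tooling, not memory, answers "who can approve this?". This is a sketch under two assumptions not stated in the text: that a more senior role can authorise anything a less senior role can, and that actions are identified by the hypothetical names below.

```python
from enum import IntEnum


class Role(IntEnum):
    ON_CALL = 1
    PLATFORM_LEAD = 2
    SENIOR_LEADERSHIP = 3


# Minimum role required for each kind of rollback (action names are illustrative).
REQUIRED_ROLE = {
    "prompt_rollback": Role.ON_CALL,
    "model_rollback": Role.ON_CALL,
    "coordinated_rollback": Role.PLATFORM_LEAD,
    "marketed_feature_rollback": Role.SENIOR_LEADERSHIP,
}


def can_authorise(role, action):
    """True if the role meets or exceeds the minimum for the action."""
    return role >= REQUIRED_ROLE[action]


def escalation_target(role, action):
    """Return the role to escalate to, or None if no escalation is needed."""
    return None if can_authorise(role, action) else REQUIRED_ROLE[action]
```

For example, `escalation_target(Role.ON_CALL, "model_rollback")` returns `None`, so the on-call engineer can act without waiting.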
After rollback
The rollback is not the end; it is the start of investigation.
- Verify. Traffic shifted back; metrics returned to baseline; the user impact has ceased.
- Document. The incident record includes the rollback time, the decision, the verification.
- Postmortem. What caused the regression; what the gate or canary missed; what discipline improvement is needed (chapter 11).
The rollback restores; the postmortem prevents recurrence.
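The verify step can be made mechanical. The sketch below checks two things: the rolled-back candidate carries no traffic, and each tracked metric is back within a tolerance of its baseline. The 5% relative tolerance, the metric name, and the function `verify_rollback` are assumptions for illustration.

```python
def verify_rollback(weights, rolled_back_id, metrics, baseline, tolerance=0.05):
    """Return a list of problems; an empty list means the rollback is verified."""
    problems = []
    if weights.get(rolled_back_id, 0) != 0:
        problems.append(f"{rolled_back_id} still receives traffic")
    for name, expected in baseline.items():
        current = metrics.get(name)
        if current is None:
            problems.append(f"metric {name} is missing")
        elif abs(current - expected) > tolerance * abs(expected):
            problems.append(f"metric {name} has not returned to baseline")
    return problems


# Illustrative values only.
weights_after = {"new-model": 0, "previous-model": 100}
baseline = {"positive_feedback_rate": 0.80}

recovered = verify_rollback(weights_after, "new-model",
                            {"positive_feedback_rate": 0.79}, baseline)
still_degraded = verify_rollback(weights_after, "new-model",
                                 {"positive_feedback_rate": 0.70}, baseline)
```

Returning the list of problems, rather than a single boolean, gives the incident record something concrete to quote.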
Roll forward instead of back
Sometimes rolling forward to a fixed version is faster than rolling back. For prompt changes:
- A new prompt version (the previous version's text with a small targeted fix) ships.
- The new version goes to canary on a fast track (skipped gates may need bypass discipline).
- If the fix holds, rollback is unnecessary; production has been fixed forward.
Roll-forward is appropriate when the fix is small and confident. Rollback is appropriate when the fix is unclear or the regression is severe.
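The rule of thumb above reduces to a small decision function. This is only a sketch of the stated criteria; the three boolean inputs are judgement calls made by the person handling the incident, and the function name is hypothetical.

```python
def choose_response(fix_is_small, fix_is_confident, regression_is_severe):
    """Return "roll_forward" or "rollback" following the criteria above."""
    if regression_is_severe:
        return "rollback"  # restore first; investigate afterwards
    if fix_is_small and fix_is_confident:
        return "roll_forward"
    return "rollback"  # an unclear fix defaults to the known-good state
```

Defaulting to rollback whenever any input is in doubt keeps the known-good state as the safe choice.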
Common mistakes
Rollback documented but never tested. The chapter-opening case: 45-minute rollback instead of 5.
Previous version retired. The rollback target is gone; the rollback is impossible.
Slow rollback infrastructure. Software-deploy-based rollback for what should be a registry flip; minutes become hours.
On-call lacks rollback authority. Decision is escalated; minutes become tens of minutes.
No post-rollback verification. Traffic shifted but the team did not verify; the regression may continue subtly.
Interview Q&A
Q1. Why is "documented rollback" insufficient? Because the documentation can drift. A config change makes the procedure fail; a flag has moved; the previous version was retired. The documentation says "do X, Y, Z"; in production, X has changed name and the rollback fails. The chapter-opening case is this exactly. The discipline is tested rollback — periodic drills in staging confirm the procedure still works. Documentation is the baseline; testing is the verification. Wrong-answer notes: "we have a runbook" without testing is the chapter-opening pattern.
Q2. The team's rollback time target for a prompt change is "as soon as possible." Why is that not a target? Targets are measurable. "As soon as possible" produces variable performance and no improvement signal. A specific target (< 5 minutes) is measurable; periodic measurement against it produces a baseline; deviations prompt investigation. The target also informs design: a registry/routing infrastructure that supports < 5 minute rollback is different from one that requires a deploy. Specific targets drive specific engineering. Wrong-answer notes: "we move fast" without measurement is unfalsifiable.
Q3. Walk through what the team does after a rollback. Verify the rollback executed (traffic shifted, metrics returned to baseline). Document in the incident record (when triggered, when executed, when verified). Maintain the rollback for a stability window (don't immediately try to re-ship the change). Conduct the postmortem (chapter 11): what caused the regression, what the gate or canary missed, what discipline improvement prevents recurrence. The rollback is the immediate response; the postmortem is the systemic response. Wrong-answer notes: "rollback and move on" misses the postmortem step that prevents recurrence.
Q4. When is roll-forward better than rollback? When the fix is small and confident, and the original change was substantially correct except for a specific issue. Example: a prompt change shipped that performs well overall but uses one phrase that customers find odd; instead of rolling back the whole change, roll forward with the prompt updated to fix the phrase. Roll-forward is faster than a rollback followed by later refinement; it requires the fix to be quick and verified. Rollback is better when the original change is fundamentally questioned or the regression is severe. Wrong-answer notes: "always roll back" misses the speed benefit of roll-forward in some cases.
What to do differently after reading this
- Test rollback in advance of every release.
- Schedule periodic rollback drills in staging.
- Design infrastructure for fast rollback (registry/routing flips).
- Give on-call the authority for common rollbacks; define escalation paths for harder cases.
- Always verify after a rollback and conduct a postmortem.
Bridge. Rollback restores. The infrastructure that makes rollback fast — versioned prompts and models as artefacts — is the next chapter. → 06-versioning-prompts-and-models.md