06. Rollback and kill switches — the firebreak must already exist

~12 min read. The worst time to design a kill switch is after the model has already started approving the wrong thing.

Continues from 05-war-room-roles-and-comms.md. The status board is clear. Now the fire captain needs a firebreak that reduces user risk faster than a full root-cause fix can.

The previous chapter gave the team one timeline and one decision surface. That solved coordination, but users are still exposed until the system changes behavior. This chapter moves from talking clearly about the fire to pulling the right lever without pretending containment is the same as repair.

1) The wall — AI systems have more rollback surfaces

Rolling back a normal service often means deploying the previous build. Rolling back an AI system may involve the application build, prompt template, model route, retrieval index, guardrail threshold, tool permission, workflow flag, cache, or memory write policy.

That is why "we can roll back" is not a plan.

AI rollback surfaces
  ├─ prompt version
  ├─ model/provider route
  ├─ retrieval index or corpus slice
  ├─ reranker / query rewrite flag
  ├─ tool permission or action mode
  ├─ guardrail threshold
  ├─ memory read/write switch
  ├─ workflow step
  └─ full feature flag
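One way to make these surfaces concrete is to pin each one in a versioned snapshot, so that rolling back a single surface means restoring one field from a known-good snapshot. The sketch below is a minimal illustration under that assumption. The `ReleaseSnapshot` class, its field names, and every version string are hypothetical and not taken from any particular platform.

```python
from dataclasses import dataclass, fields, replace


@dataclass(frozen=True)
class ReleaseSnapshot:
    """Pinned state of each surface that shapes AI behavior (hypothetical names)."""
    app_build: str
    prompt_version: str
    model_route: str
    retrieval_index: str
    reranker_enabled: bool
    tool_mode: str  # e.g. "execute", "read_only", "disabled"
    guardrail_threshold: float
    memory_writes_enabled: bool


def changed_surfaces(current: ReleaseSnapshot, known_good: ReleaseSnapshot) -> list[str]:
    """List the surfaces that differ between two snapshots."""
    return [
        f.name
        for f in fields(ReleaseSnapshot)
        if getattr(current, f.name) != getattr(known_good, f.name)
    ]


def rollback_surface(
    current: ReleaseSnapshot, known_good: ReleaseSnapshot, surface: str
) -> ReleaseSnapshot:
    """Return a copy of `current` with exactly one surface restored."""
    if surface not in {f.name for f in fields(ReleaseSnapshot)}:
        raise ValueError(f"unknown rollback surface: {surface!r}")
    return replace(current, **{surface: getattr(known_good, surface)})


# Illustrative values only.
known_good = ReleaseSnapshot(
    "build-41", "refund-v7", "route-a", "policy-index-1", True, "execute", 0.8, True
)
current = ReleaseSnapshot(
    "build-42", "refund-v8", "route-a", "policy-index-2", False, "execute", 0.8, True
)

suspects = changed_surfaces(current, known_good)
rolled_back = rollback_surface(current, known_good, "prompt_version")
```

Here `suspects` names four changed surfaces, only one of which is the application build. That is the practical reason "redeploy the previous build" leaves most of the behavior untouched.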

The firebreak must match the suspected harm. If the refund issue is tied to tool execution, disable the tool. If it is tied to stale retrieval, force current-policy source filtering. If it is tied to a model route, roll back the route.

| Firebreak | Use when | Cost |
|---|---|---|
| Prompt rollback | bad prompt or template release | may reintroduce old issues |
| Model route rollback | provider/model regression | quality or latency may change |
| Tool disable | action risk or runaway loop | product becomes read-only |
| Degraded mode | answers need safer shape | less helpful UX |
| Retrieval index rollback | stale/bad corpus release | loses newer documents |
| Guardrail threshold tighten | safety bypass or risky content | more false refusals |
| Tenant/workflow flag off | narrow blast radius | affected users lose feature |
| Rate limit | abuse or runaway cost | legitimate users slowed |
| Full kill switch | high-severity broad harm | availability hit |

A sound design has these levers in place before launch. In design review, a lead should ask: "If this goes wrong at 3 AM, which lever do we pull?"
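The 3 AM question can be turned into a pre-launch check: list the harm classes a feature can cause and confirm each one maps to a lever that already exists. The sketch below is one possible shape for that check. The harm-class labels, the `Firebreak` type, and the `memory_poisoning` entry are assumptions made for illustration; the levers and costs mirror a few rows of the table above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Firebreak:
    lever: str
    cost: str


# Harm class -> pre-built lever. Keys are hypothetical labels.
FIREBREAKS: dict[str, Firebreak] = {
    "bad_prompt_release": Firebreak("prompt_rollback", "may reintroduce old issues"),
    "model_regression": Firebreak("model_route_rollback", "quality or latency may change"),
    "unsafe_tool_action": Firebreak("tool_disable", "product becomes read-only"),
    "stale_retrieval": Firebreak("retrieval_index_rollback", "loses newer documents"),
    "runaway_cost": Firebreak("rate_limit", "legitimate users slowed"),
    "broad_high_severity_harm": Firebreak("full_kill_switch", "availability hit"),
}


def missing_firebreaks(
    harm_classes: list[str], registry: dict[str, Firebreak]
) -> list[str]:
    """Return harm classes that have no lever registered."""
    return [harm for harm in harm_classes if harm not in registry]


# Harm classes a hypothetical refund assistant could cause.
feature_harms = ["unsafe_tool_action", "stale_retrieval", "memory_poisoning"]
gaps = missing_firebreaks(feature_harms, FIREBREAKS)
incident_ready = not gaps
```

In this example `gaps` contains `memory_poisoning`, so the feature would not count as incident-ready until a memory read/write switch is built and registered.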

3) Worked example — refund firebreaks

For the refund incident, the team has three plausible firebreaks:

1. Disable the refund action tool.
2. Degrade enterprise refund answers to cited policy excerpts only.
3. Re-enable reranker and filter retrieval to current-policy documents.

The first one stops unauthorized money movement. The second stops the assistant from giving directive advice. The third may fix quality but is closer to root-cause repair.

The incident response order is:

disable action risk -> degrade answer mode -> test retrieval/reranker fix -> restore gradually

That order protects users while the team tests whether the suspected root cause is real.
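The first two steps of that order can be sketched as runtime flags checked on the request path, so that flipping them needs no deploy. This is a minimal sketch, not the team's actual handler: `ContainmentFlags`, `handle_refund_request`, the mode names, and the placeholder strings are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ContainmentFlags:
    refund_tool_enabled: bool = True
    answer_mode: str = "full"  # "full" or "cited_excerpts_only"


def handle_refund_request(
    flags: ContainmentFlags,
    wants_action: bool,
    draft_answer: str,
    policy_excerpts: list[str],
) -> dict:
    """Apply containment flags before any tool call or answer is released."""
    if wants_action and flags.refund_tool_enabled:
        action = "refund_tool_called"
    elif wants_action:
        action = "blocked"  # firebreak 1: no money moves
    else:
        action = "none"

    if flags.answer_mode == "cited_excerpts_only":
        answer = "\n".join(policy_excerpts)  # firebreak 2: no directive advice
    else:
        answer = draft_answer
    return {"action": action, "answer": answer}


flags = ContainmentFlags()
excerpts = ["[current refund policy excerpt]"]
draft = "You should issue the refund."

before = handle_refund_request(flags, True, draft, excerpts)

flags.refund_tool_enabled = False  # step 1: disable action risk
after_tool_disable = handle_refund_request(flags, True, draft, excerpts)

flags.answer_mode = "cited_excerpts_only"  # step 2: degrade answer mode
after_degrade = handle_refund_request(flags, True, draft, excerpts)
```

After step 1 the action is blocked but the directive draft still reaches the user, which is why the second flag exists. Neither flag touches retrieval or the reranker, so the suspected root cause can be tested separately.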

4) Why not patch forward immediately

The tempting alternative is to ship a quick prompt or retrieval patch forward. It feels productive because the bad example starts passing.

It fails when the patch is unreviewed, unmeasured, or narrower than the incident. A forward patch can hide the original failure while creating a new one.

Rollback is often safer than invention during an incident because the older behavior has known history. Patch forward only when rollback is impossible or the old state is also unsafe.
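The rule in the previous paragraph is small enough to write down as a decision function, which can be useful in a runbook where nobody wants to argue it at 3 AM. This is only a sketch of that rule; the function name and return labels are made up for illustration.

```python
def incident_action(rollback_possible: bool, known_good_is_safe: bool) -> str:
    """Prefer rollback; patch forward only when rollback cannot help."""
    if rollback_possible and known_good_is_safe:
        return "rollback"
    # A forward patch still has to be reviewed and measured before it ships.
    return "patch_forward_with_review"


default_choice = incident_action(rollback_possible=True, known_good_is_safe=True)
unsafe_old_state = incident_action(rollback_possible=True, known_good_is_safe=False)
```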

Teacher voice. During an incident, prefer known-good over clever-new unless known-good is itself dangerous.

5) Production signals — firebreak quality

The first metric is exposed risky actions after containment. If the firebreak worked, traffic on the harmful path should drop to zero or be served only in a safe degraded mode.
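A minimal sketch of that metric, assuming action events are logged as records with a timestamp and an action name. The event shape, the action labels, and the timestamps are illustrative assumptions, not a real log schema.

```python
from datetime import datetime, timezone


def exposed_after_containment(
    events: list[dict], contained_at: datetime, risky_action: str
) -> list[dict]:
    """Return risky actions that still executed at or after containment."""
    return [
        event
        for event in events
        if event["action"] == risky_action and event["at"] >= contained_at
    ]


def utc(hour: int, minute: int) -> datetime:
    """Build an illustrative UTC timestamp on a fixed day."""
    return datetime(2024, 1, 1, hour, minute, tzinfo=timezone.utc)


contained_at = utc(3, 0)
events = [
    {"at": utc(2, 50), "action": "refund_tool_called"},  # before the firebreak
    {"at": utc(3, 5), "action": "blocked"},              # firebreak working
    {"at": utc(3, 10), "action": "refund_tool_called"},  # leak: some path bypassed it
]

leaks = exposed_after_containment(events, contained_at, "refund_tool_called")
firebreak_holding = len(leaks) == 0
```

In this made-up log one risky call lands ten minutes after containment, so the firebreak is not holding, whatever the original failing example now does.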

The misleading metric is "the original example now passes." One example passing does not prove the blast radius is contained.

The expert signal is a restore plan: what evals, traces, human review, and traffic ramp are required before the firebreak is lifted.
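A restore plan can be expressed as a gate that refuses to raise traffic until the evidence exists and drops back to zero on any regression. The sketch below assumes a simple percentage ramp. `RestoreEvidence`, `RAMP_STEPS`, and the step values are hypothetical choices, not recommended numbers.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class RestoreEvidence:
    evals_passed: bool
    traces_reviewed: bool
    human_review_done: bool


RAMP_STEPS = (1, 5, 25, 100)  # percent of traffic; illustrative values


def unmet_criteria(evidence: RestoreEvidence) -> list[str]:
    """Name every restore criterion that has not been satisfied yet."""
    return [name for name, ok in asdict(evidence).items() if not ok]


def next_ramp_percent(
    current_percent: int, evidence: RestoreEvidence, regression_detected: bool
) -> int:
    """Advance one ramp step, or return 0 to keep (or re-apply) the firebreak."""
    if regression_detected or unmet_criteria(evidence):
        return 0
    for step in RAMP_STEPS:
        if step > current_percent:
            return step
    return current_percent  # already fully restored


complete = RestoreEvidence(True, True, True)
partial = RestoreEvidence(True, True, False)

first_step = next_ramp_percent(0, complete, regression_detected=False)
held_back = next_ramp_percent(0, partial, regression_detected=False)
re_contained = next_ramp_percent(25, complete, regression_detected=True)
```

The design choice worth noting is that a regression at any step returns to zero rather than to the previous step, which keeps rollback ready during the whole ramp.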

6) Boundary — firebreaks are not fixes

A firebreak stops spread. It does not prove root cause, repair trust, or prevent recurrence.

The pathology is closure after rollback. The team disables the risky tool, the incident calms down, and nobody writes the eval lock. The same failure class returns next quarter through a different tool or prompt.

Recall checkpoint

Why does AI rollback have many surfaces?
Which firebreak matches unauthorized tool action?
Why is patch-forward risky during an incident?
What proves a firebreak worked?

Interview Q&A

Q: What kill switches should a production AI agent have? A: Feature flag, tool disable, model route rollback, prompt rollback, degraded read-only mode, retrieval/index rollback, guardrail threshold control, tenant/workflow scoping, and rate limits.

Common wrong answer to avoid: "Just redeploy the previous app build." The dangerous behavior may live outside the app build.

Q: When do you patch forward instead of roll back? A: When rollback is impossible, the old state is also unsafe, or the fix is a narrow configuration change that can be validated faster than rollback.

Common wrong answer to avoid: "Always fix forward." AI patches can create new semantic failures under pressure.

Q: How do you safely restore after a kill switch? A: Require eval pass, trace review, affected-slice replay, human spot check for high-risk flows, and gradual traffic ramp with rollback ready.

Common wrong answer to avoid: "Turn it back on after the example passes." One example is not a restore criterion.

Apply now (10 min)

Model the exercise. Write three firebreaks for the refund incident and rank them by safety versus product availability.

Your turn. For one AI feature, list every rollback surface outside application code.

Reproduce from memory. Explain why a firebreak is containment, not closure.

What you should remember

This chapter explained rollback and kill switches for AI incidents. The important idea is that AI behavior lives across prompts, models, retrieval, tools, memory, guardrails, and workflow flags, so rollback must be designed across those surfaces.

Carry this diagnostic forward: if you cannot name the 3 AM firebreak before launch, the feature is not incident-ready.

Remember:

AI rollback is multi-surface.
Firebreaks should match harm class and blast radius.
Known-good usually beats clever-new during incidents.
Restore requires evals, trace review, and gradual ramp.

Bridge. Firebreaks stop obvious dangerous paths, but many AI incidents are not obvious. Next we detect soft failures: plausible answers that dashboards miss. → 07-soft-failure-detection.md