10. Emergency changes

Coordinated releases are the planned multi-change discipline. Emergency changes are the unplanned override discipline — when the normal gates, canaries, and windows must give way to urgent action. The discipline is to handle emergencies without abandoning all discipline.


A platform engineer at a Bengaluru SaaS company is paged at 22:00 with a critical security finding: the agent has been responding to an internal API in a way that exposes PII. The fix is straightforward (a prompt revision that adds an explicit redaction); shipping requires bypassing the normal change window (it is night) and shortening the canary (the harm is ongoing). The discipline: the bypass is approved by the security lead and the platform lead; the canary is compressed to 30 minutes at 100% with intensive monitoring; the rollback is verified ready; the customer-facing communication is queued. The fix ships in 2 hours total; the harm stops; the postmortem documents the emergency response.

This chapter is that discipline.


What an emergency change is

An emergency change is a release that bypasses one or more normal discipline elements (gate, canary, window, freeze) because the harm of not shipping exceeds the harm of bypassing.

Categories:

  • Security vulnerabilities with active exploitation or imminent exposure.
  • Regulatory deadlines that cannot be missed.
  • Production incidents whose resolution requires the change.
  • Customer-blocking issues with no workaround.

The category determines the bypass scope.


The bypass scope

Not all discipline elements get bypassed; only what is necessary.

| Element | Bypass justification | Compensation |
| --- | --- | --- |
| Eval gate | Speed required; eval would slow | Skip with documented reason; run eval after promotion; rollback if regression |
| Canary | Harm is ongoing; canary speed cannot match | Compress canary (faster steps) or skip; intensive monitoring at 100% |
| Change window | Time-of-day necessity | Document the bypass; ensure on-call coverage |
| Freeze period | Cannot wait for freeze end | Document the bypass; senior approval |
| Multi-change coordination | Single change is the fix | Standard sequence |

Each bypass is documented; the reason is specific; the compensation is in place.
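
The documentation rule above can be checked mechanically. The following is a minimal sketch, assuming a hypothetical `Bypass` record and `validate_bypasses` helper (neither name comes from any particular tool); it only confirms that every bypassed element carries a reason and a compensation, not that either one is adequate.

```python
from dataclasses import dataclass
from enum import Enum


class Element(Enum):
    """Discipline elements that an emergency change may bypass."""
    EVAL_GATE = "eval gate"
    CANARY = "canary"
    CHANGE_WINDOW = "change window"
    FREEZE_PERIOD = "freeze period"


@dataclass(frozen=True)
class Bypass:
    """Hypothetical record of one bypassed element."""
    element: Element
    reason: str        # the specific justification for this bypass
    compensation: str  # the control that stands in for the skipped one


def validate_bypasses(bypasses: list[Bypass]) -> list[str]:
    """Return a list of problems; an empty list means the record is complete."""
    problems = []
    seen = set()
    for b in bypasses:
        if b.element in seen:
            problems.append(f"{b.element.value}: listed more than once")
        seen.add(b.element)
        if not b.reason.strip():
            problems.append(f"{b.element.value}: missing reason")
        if not b.compensation.strip():
            problems.append(f"{b.element.value}: missing compensation")
    return problems
```

A release tool could refuse to proceed while `validate_bypasses` returns anything, which keeps the paperwork from being deferred until after the emergency.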


The approval path

Emergency changes need approval — not the normal CI gate, but a human decision.

  • Senior engineer. For changes within their expertise; the on-call engineer typically has this authority for prompt changes.
  • Platform lead. For changes that affect platform-wide operation (model migrations, broad agent changes).
  • Security lead. For security-driven emergencies.
  • Executive. For changes with broad customer or business impact.

The approval is recorded with the bypass documentation. The discipline ensures the bypass is authorised, not just convenient.
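
One way to make the approval path enforceable is a lookup from emergency category to the roles that must sign off. The sketch below is illustrative only: the role names and the `REQUIRED_APPROVERS` mapping are assumptions, and each team would substitute its own policy.

```python
from enum import Enum


class Category(Enum):
    SECURITY = "security"
    REGULATORY = "regulatory"
    INCIDENT = "production incident"
    CUSTOMER_BLOCKING = "customer blocking"


# Hypothetical policy: which roles must approve each category of emergency.
REQUIRED_APPROVERS = {
    Category.SECURITY: {"security_lead", "platform_lead"},
    Category.REGULATORY: {"platform_lead", "executive"},
    Category.INCIDENT: {"senior_engineer"},
    Category.CUSTOMER_BLOCKING: {"senior_engineer", "platform_lead"},
}


def missing_approvals(category: Category, approved_by: set[str]) -> set[str]:
    """Return the roles that still need to approve; empty means authorised."""
    return REQUIRED_APPROVERS[category] - approved_by
```

For example, `missing_approvals(Category.SECURITY, {"security_lead"})` returns `{"platform_lead"}`, so the bypass is not yet authorised.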


Compressed canary

Emergencies that bypass the normal canary still run a compressed version:

  • Faster steps. 1% → 10% → 50% → 100% in hours instead of days.
  • Tighter monitoring. Every metric watched continuously; rollback triggered on any concerning signal.
  • Human attention. Engineers actively watch the rollout rather than relying on dashboards alone.
  • Ready rollback. The rollback path is verified before the canary starts.

The compressed canary is still a canary; "ship to 100% immediately" without observation is reserved for the most urgent cases (active exfiltration, immediate regulatory deadline).
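
The compressed canary can be sketched as a small control loop. This is a sketch under stated assumptions, not a real deployment API: `set_traffic`, `is_healthy`, and `rollback` are hypothetical callables supplied by the caller, and `is_healthy` is assumed to block for the soak period of each step before returning.

```python
from typing import Callable

STEPS = (1, 10, 50, 100)  # percent of traffic, matching the schedule above


def run_compressed_canary(
    set_traffic: Callable[[int], None],
    is_healthy: Callable[[int], bool],
    rollback: Callable[[], None],
    rollback_verified: bool,
) -> bool:
    """Advance through STEPS and roll back on the first concerning signal.

    Returns True if the change reached 100%, False if it was rolled back.
    """
    # The rollback path must be verified before any traffic is shifted.
    if not rollback_verified:
        raise RuntimeError("verify the rollback path before starting the canary")
    for percent in STEPS:
        set_traffic(percent)
        # Any concerning signal at this step ends the rollout immediately.
        if not is_healthy(percent):
            rollback()
            return False
    return True
```

Refusing to start without a verified rollback encodes the "ready rollback" rule as a precondition rather than a reminder.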


Post-emergency discipline

After the emergency ships:

  • Verify resolution. The harm has stopped; the fix is in production; the metrics confirm.
  • Run the bypassed gates. The eval that was skipped is run after promotion; any regression is investigated and may require additional fix.
  • Document the emergency. What happened; what was bypassed; why; what compensated; what the outcome was.
  • Postmortem. The standard incident postmortem (chapter 11) covers the emergency; the action items address both the original cause and the discipline gaps revealed.
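
The four follow-ups above lend themselves to a simple open-items check. The sketch below assumes a hypothetical `EmergencyRecord` structure; the field names are illustrative and not taken from any tracking system.

```python
from dataclasses import dataclass, field


@dataclass
class EmergencyRecord:
    """Hypothetical record of an emergency change and its follow-ups."""
    bypassed_gates: set[str]
    gates_run_after: set[str] = field(default_factory=set)
    resolution_verified: bool = False
    documented: bool = False
    postmortem_done: bool = False


def outstanding_items(rec: EmergencyRecord) -> list[str]:
    """List the post-emergency steps that are still open."""
    items = []
    if not rec.resolution_verified:
        items.append("verify resolution")
    # Every gate that was bypassed must be run after promotion.
    for gate in sorted(rec.bypassed_gates - rec.gates_run_after):
        items.append(f"run bypassed gate: {gate}")
    if not rec.documented:
        items.append("document the emergency")
    if not rec.postmortem_done:
        items.append("hold postmortem")
    return items
```

An emergency would be considered closed only when `outstanding_items` returns an empty list.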


What "emergency" does not include

The label is restrictive on purpose. Routine pressure, scheduling convenience, missed deadlines — these are not emergencies. Calling them emergencies erodes the discipline.

  • "We promised customers by Friday and missed the window" — not an emergency; reschedule communication.
  • "The PM wants this feature now" — not an emergency; standard discipline.
  • "It would be inconvenient to wait" — not an emergency; ordinary planning.

True emergencies are rare; if the team has multiple per month, the bar is too low or upstream issues are creating false urgency.
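
The rate check is easy to automate from a log of emergency dates. The following sketch assumes a hypothetical list of dates and a default threshold of one emergency per month (so "multiple per month" trips it); the right threshold is a team decision.

```python
from collections import Counter
from datetime import date


def months_over_threshold(emergency_dates: list[date], max_per_month: int = 1) -> list[str]:
    """Return the months (YYYY-MM) with more emergencies than the threshold."""
    per_month = Counter(d.strftime("%Y-%m") for d in emergency_dates)
    return sorted(m for m, n in per_month.items() if n > max_per_month)
```

A flagged month is a prompt to review whether each declared emergency was genuine, not a verdict that any of them was not.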


Maximum compression: shipping straight to 100%

For the rare case where every minute counts (active security exploit, regulatory enforcement starting):

  • The change ships to 100% immediately with intensive monitoring.
  • Rollback is staffed and ready.
  • A senior engineer or platform lead is the decision-maker for any rollback.
  • Customer communication runs in parallel: customers are informed while the fix ships.

This is the maximum compression, reserved for the most urgent cases. Skipping the staged canary entirely carries real risk; the discipline acknowledges that risk rather than hiding it.


Common mistakes

Routine framed as emergency. The bypass becomes the norm; discipline erodes.

Emergency without approval. The engineer acts unilaterally; accountability is unclear.

Emergency without compressed canary. "Ship to 100% immediately" when a 30-minute compressed canary would have caught issues.

Emergency without rollback ready. The change ships and cannot be reversed cleanly.

Emergency without post-incident discipline. The bypassed gates are not run after; regressions stay hidden.


Interview Q&A

Q1. What qualifies as an emergency change?

Security vulnerabilities with active exploitation or imminent exposure; regulatory deadlines that cannot be missed; production incidents whose resolution requires the change; customer-blocking issues with no workaround. The bar is "harm of not shipping exceeds harm of bypassing discipline." Routine pressure, scheduling convenience, and missed deadlines do not qualify. True emergencies are rare; if they are frequent, the bar is too low.

Wrong-answer notes: "anything urgent" produces routine bypass and discipline erosion.

Q2. Walk through a compressed canary for an emergency security fix.

The change ships behind a compressed canary: 1% → 10% → 50% → 100% in hours instead of days. Each step gets intensive monitoring (every metric watched, with human attention rather than dashboards alone). The rollback path is verified before the canary starts. Approval from the security lead and platform lead is recorded. If any step shows concerning signals, roll back immediately. After promotion, the bypassed gates (eval, etc.) run as soon as possible; any regression triggers an additional fix. The discipline preserves what can be preserved while shipping fast.

Wrong-answer notes: "ship to 100% immediately" is the most compressed option; reserve it for the rarest cases.

Q3. The team has had three "emergencies" this month. What does that suggest?

The bar for emergency is too low, or upstream issues are creating false urgency. Investigate: Are the emergencies genuine? Are they recurring patterns that suggest a systemic problem? Are non-emergencies being labelled as emergencies for convenience? The value of the emergency label depends on its rarity; frequent invocations erode the discipline. The fix is to address the upstream causes (more discipline in routine releases? upstream security audits to catch issues before they become emergencies?).

Wrong-answer notes: "accept that emergencies happen" without examining the rate misses the systemic question.

Q4. The emergency change shipped, bypassing the eval. After promotion, the eval is run and shows a regression. What is the response?

The emergency change stopped the original harm; the regression is now the question. Investigate the regression: is it real (the fix has a side effect) or false (the eval set has a case the new behaviour intentionally changes)? If real, plan the fix: refine the original change to address both the security issue and the regression. If false, the eval may need to be updated to reflect the new intended behaviour. The post-promotion discipline catches what the emergency bypass missed; the response converts it into a planned follow-up.

Wrong-answer notes: "the regression is acceptable because the fix is necessary" without investigating may leave a real problem unaddressed.


What to do differently after reading this

  • Define what qualifies as an emergency; keep the bar high.
  • Document the bypass scope: which gates skipped, why, what compensates.
  • Set an approval path with named authorities per category.
  • Make the compressed canary the default for emergencies; reserve an immediate 100% rollout for the rarest cases.
  • Follow post-emergency discipline: verify, run bypassed gates, hold the postmortem.

Bridge. Emergencies are handled. The next discipline is the postmortem for when a release goes wrong — emergency or routine. → 11-release-postmortem.md