06. Incident Response
⏱️ Estimated time: 23 min | Level: advanced
ELI5 callback: In the hospital analogy, the monitor alarm declares urgency, the medical chart grounds facts, and the playbook keeps the room coordinated.
1) Detection and incident declaration
Incidents start when user harm becomes likely or active. Open the thermometer first so severity is grounded in impact.
Detection can come from alerts, support tickets, or internal reports.
Do not wait for perfect certainty before declaring.
Early declaration creates shared focus and clear ownership.
See. Confusion wastes more time than a false start.
Severity should reflect customer impact, not engineer stress.
Write down the trigger and the first observed symptoms.
This gives the room one factual starting point.
┌────────┐  detect  ┌──────────┐  declare   ┌─────────────┐
│ Signal │ ───────→ │  Triage  │ ─────────→ │  Incident   │
└────────┘          └──────────┘            ├─────────────┤
                                            │ roles set   │
                                            │ updates set │
                                            └─────────────┘
Then open the thermometer again to see whether mitigation helped.
- Use a simple declaration template with scope, symptom, and severity.
- Record time of detection and time of declaration separately.
- Escalate early if the ownership boundary is unclear.
- Keep one chat room or bridge for shared coordination.
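The declaration template above can be sketched as a small record. This is a minimal sketch, not a standard format: the class name, field names, the "SEV2" label, and the sample values are all illustrative assumptions. It keeps detection time and declaration time as separate fields so the gap between them stays visible.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class IncidentDeclaration:
    """Hypothetical declaration record; every field name is an assumption."""
    scope: str             # who or what is affected
    symptom: str           # first observed symptoms
    severity: str          # label from your own severity scale
    trigger: str           # alert, support ticket, or internal report
    detected_at: datetime  # when the signal was first seen
    declared_at: datetime  # when the incident was formally declared

    def time_to_declare(self) -> timedelta:
        """Gap between detection and declaration."""
        return self.declared_at - self.detected_at

    def render(self) -> str:
        """One-message summary for the shared incident room."""
        return (
            f"[{self.severity}] {self.symptom}\n"
            f"Scope: {self.scope}\n"
            f"Trigger: {self.trigger}\n"
            f"Detected: {self.detected_at.isoformat()}\n"
            f"Declared: {self.declared_at.isoformat()}"
        )


# Illustrative values only.
detected = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
declaration = IncidentDeclaration(
    scope="checkout requests in one region",
    symptom="elevated 5xx responses",
    severity="SEV2",
    trigger="error-rate alert",
    detected_at=detected,
    declared_at=detected + timedelta(minutes=4),
)
print(declaration.render())
print("Time to declare:", declaration.time_to_declare())
```

Keeping both timestamps lets the later review ask how long the team hesitated before declaring, separately from how long detection took.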
2) Triage means scope before cleverness
Triage asks how bad, how wide, and how fast.
Which users are affected?
Which regions, routes, or tenants are affected?
Is the issue still growing or already stable?
So what should you do first?
Bound the blast radius before searching for an elegant root cause. Use the X-ray when the failing path spans several teams.
A ten-minute scope estimate often saves an hour of random debugging.
Good triage also sets the communication frequency and the stakeholder list.
- Check customer impact, financial impact, and data integrity separately.
- Compare current graphs with the last healthy baseline.
- Note recent deploys, config changes, and dependency incidents.
- Keep a visible list of confirmed facts versus open guesses.
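A quick scope estimate can be as simple as grouping recent requests by region or tenant and comparing error rates with the last healthy baseline. The sketch below assumes request records are plain dicts with an `ok` flag; the record shape, the function names, and the 5-point margin are assumptions chosen for illustration, not recommended thresholds.

```python
from collections import defaultdict


def error_rate_by_key(requests, key):
    """Return {key_value: error_rate} for request dicts carrying an 'ok' flag."""
    totals = defaultdict(int)
    errors = defaultdict(int)
    for req in requests:
        totals[req[key]] += 1
        if not req["ok"]:
            errors[req[key]] += 1
    return {k: errors[k] / totals[k] for k in totals}


def affected(current, baseline, margin=0.05):
    """Keys whose error rate exceeds the healthy baseline by more than margin."""
    return sorted(
        k for k, rate in current.items()
        if rate - baseline.get(k, 0.0) > margin
    )


# Illustrative sample; real input would come from logs or metrics.
sample = [
    {"region": "eu", "tenant": "a", "ok": False},
    {"region": "eu", "tenant": "b", "ok": False},
    {"region": "eu", "tenant": "a", "ok": True},
    {"region": "us", "tenant": "c", "ok": True},
    {"region": "us", "tenant": "c", "ok": True},
]
baseline = {"eu": 0.01, "us": 0.01}

rates = error_rate_by_key(sample, "region")
print(affected(rates, baseline))
```

Running the same grouping with `key="tenant"` answers the "which tenants" question without new code.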
3) Roles create thinking space
One person should lead the incident directly.
That is the incident commander, or IC.
Another person, the comms lead, should handle stakeholder updates.
A third person, the scribe, may keep notes and the timeline.
Another X-ray can show whether retries or timeouts are the true issue.
Simple, no? Shared chaos needs divided duties.
The IC protects focus, assigns owners, and decides the next step.
The comms lead protects trust outside the debug loop.
The scribe protects memory for postmortem and handoff.
- Keep the IC out of deep keyboard work when possible.
- Rotate roles on long incidents to prevent fatigue.
- Use explicit handoffs when leadership changes.
- Record key decisions with timestamps, not fuzzy recollection.
The medical chart should preserve the timeline for the incident review.
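The scribe's timeline can be sketched as an append-only log where every entry carries a timestamp, a kind, and an author. This is a minimal sketch under assumptions: the class names, the entry kinds, and the sample people (`ic-1`, `ic-2`) are hypothetical. Handoffs are recorded as explicit entries so leadership changes are never implicit.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone


@dataclass
class TimelineEntry:
    at: datetime
    kind: str    # assumed kinds: "fact", "guess", "decision", "handoff"
    text: str
    author: str


@dataclass
class Timeline:
    entries: list = field(default_factory=list)

    def record(self, kind, text, author, at=None):
        """Append an entry; default to the current UTC time."""
        if at is None:
            at = datetime.now(timezone.utc)
        entry = TimelineEntry(at, kind, text, author)
        self.entries.append(entry)
        return entry

    def handoff(self, role, outgoing, incoming, at=None):
        """Record an explicit role handoff as its own entry."""
        return self.record("handoff", f"{role}: {outgoing} -> {incoming}", outgoing, at)

    def ordered(self):
        """All entries sorted by timestamp."""
        return sorted(self.entries, key=lambda e: e.at)

    def of_kind(self, kind):
        """Entries of one kind, in time order."""
        return [e for e in self.ordered() if e.kind == kind]


# Illustrative timeline.
t0 = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
timeline = Timeline()
timeline.record("fact", "5xx rate rose on checkout", "scribe", at=t0)
timeline.record("guess", "recent deploy may be involved", "scribe",
                at=t0 + timedelta(minutes=2))
timeline.record("decision", "roll back the latest deploy", "ic-1",
                at=t0 + timedelta(minutes=5))
timeline.handoff("IC", "ic-1", "ic-2", at=t0 + timedelta(minutes=60))

for e in timeline.ordered():
    print(e.at.isoformat(), e.kind, e.text)
```

Tagging entries as fact or guess also gives the room the confirmed-versus-open list that triage asks for.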
4) Mitigation comes before perfect explanation
During active harm, the first goal is service restoration.
Root cause matters, but mitigation matters first.
Roll back, shed load, fail closed, or disable one risky path.
Buy time for safer diagnosis.
Now watch. Teams often chase elegance while customers suffer.
Mitigation options should be prepared before the incident.
That includes feature flags, traffic shifts, and dependency bypasses.
Document what changed so recovery does not create a second incident.
A monitor alarm must map clearly to severity and ownership.
- Choose the lowest-risk action that reduces harm quickly.
- Re-check metrics after every mitigation, not only at the end.
- Be careful with cache flushes and broad restarts; they can widen damage.
- Announce mitigation state clearly in stakeholder updates.
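The "lowest risk first, re-check after each step" habit can be sketched as a loop. Everything here is an assumption for illustration: the risk numbers are team-assigned guesses, the action names are hypothetical, and `read_error_rate` stands in for whatever metric query your monitoring offers. Real mitigation choices also weigh speed and reversibility, which a single risk number does not capture.

```python
def mitigate(actions, read_error_rate, target):
    """Apply actions from lowest to highest risk, re-checking the metric each time.

    actions: list of (name, risk, apply_fn) tuples; risk is a team-assigned number.
    Returns [(name, error_rate_after)] for every action that was applied.
    """
    log = []
    for name, _risk, apply_fn in sorted(actions, key=lambda a: a[1]):
        apply_fn()
        rate = read_error_rate()   # re-check after every mitigation
        log.append((name, rate))   # doubles as a record of what changed
        if rate <= target:
            break                  # stop before reaching riskier actions
    return log


# Simulated service state so the sketch runs on its own.
state = {"rate": 0.30}


def disable_flag():
    state["rate"] = 0.12


def roll_back():
    state["rate"] = 0.01


def restart_everything():
    state["rate"] = 0.0


actions = [
    ("restart all instances", 5, restart_everything),
    ("disable risky feature flag", 1, disable_flag),
    ("roll back latest deploy", 2, roll_back),
]

log = mitigate(actions, lambda: state["rate"], target=0.02)
for name, rate in log:
    print(f"{name}: error rate now {rate:.2f}")
```

In this simulated run the broad restart is never reached, because the rollback already brings the error rate under the target.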
5) Resolution is not the end; learning is
Resolution means the user impact has stopped and risk is controlled.
It does not mean every question is answered.
Afterward, run a blameless postmortem.
Focus on contributing conditions, not personal shame.
See. People are part of the system, not the whole cause.
Ask what signals were late, what decisions were hard, and what tooling failed.
Convert findings into owners and due dates.
Then review whether the fixes changed reliability meaningfully.
- Keep the timeline factual and timestamped.
- Separate trigger, contributing factors, and root cause chain.
- Track follow-ups until done, not until memory fades.
- Share broad lessons so other teams avoid the same trap.
The playbook should define IC, comms, and rollback authority.
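Turning findings into owners and due dates can be sketched as a small tracker that reports what is overdue. The class name, field names, teams, and dates below are illustrative assumptions; most teams would keep these items in their existing ticket system.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class FollowUp:
    """Hypothetical postmortem action item."""
    finding: str
    owner: str
    due: date
    done: bool = False


def overdue(items, today):
    """Open items whose due date has passed."""
    return [i for i in items if not i.done and i.due < today]


# Illustrative items.
items = [
    FollowUp("alert fired 10 minutes late", "observability team", date(2024, 1, 15)),
    FollowUp("rollback steps were undocumented", "platform team", date(2024, 3, 1)),
    FollowUp("no owner for the config repo", "engineering manager",
             date(2024, 1, 10), done=True),
]

for item in overdue(items, today=date(2024, 2, 1)):
    print(f"OVERDUE: {item.finding} (owner: {item.owner}, due {item.due})")
```

A recurring report like this keeps follow-ups alive until they are done, not until memory fades.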
Where this lives in the wild
- Consumer platforms use IC and comms roles during login or checkout outages.
- B2B SaaS teams coordinate incident rooms across support, product, and engineering.
- Cloud platform teams triage region incidents by blast radius before root-cause depth.
- Fintech systems treat data integrity incidents differently from simple latency spikes.
- Mature orgs use blameless postmortems to improve runbooks, tooling, and ownership boundaries.
Pause and recall
- Why is early incident declaration better than waiting for perfect certainty?
- What questions define good triage in the first minutes?
- Why should mitigation come before perfect root-cause explanation?
- What makes a postmortem genuinely blameless and still useful?
Interview Q&A
Q: Why separate IC, comms, and scribe roles during serious incidents?
A: Because coordination, stakeholder trust, and memory capture are all demanding tasks that interfere with each other under pressure.
Common wrong answer to avoid: "Because big companies like ceremony" - the purpose is cognitive load control, not theater.
Q: Why is mitigation usually prioritized over root cause during active impact?
A: Stopping user harm quickly creates safer time and space for diagnosis, while prolonged damage makes every later step worse.
Common wrong answer to avoid: "Because root cause does not matter" - it matters deeply, just not before urgent stabilization.
Q: How do you decide incident severity well?
A: Base it on customer impact, business impact, and data risk, not on how noisy the alert channel feels.
Common wrong answer to avoid: "Use the loudest engineer’s judgment" - severity needs shared criteria, not personality.
Q: What should a strong postmortem produce?
A: A factual timeline, contributing-condition analysis, and tracked actions that reduce recurrence or speed future recovery.
Common wrong answer to avoid: "A final person to blame" - blame hides system weaknesses and blocks learning.
Apply now (5 min)
Pick one incident your team could face this month. Write the declaration template, the first three triage questions, and the likely IC, comms, and scribe roles. Then list one mitigation you could execute within five minutes. If that mitigation is unclear, your runbook is not ready yet.
Bridge. Incidents managed. But how do we roll back safely? → 07