
11. Feedback incident response

Closing the loop is the routine work. Sometimes the feedback signal indicates not a per-case issue but a systemic problem: a sudden spike, a sustained degradation, a cross-platform pattern. This chapter covers the incident response for when feedback shows a systemic concern.


A platform engineer at a Pune SaaS company is paged on a Tuesday afternoon: the negative-feedback rate on the support agent has jumped from 5% to 18% over the past two hours. The team's first move is containment, which starts with one question: what changed? A deploy went out at 10:00; the rate started climbing at 10:30. The team rolls back the deploy, and the rate drops back to 6% within the hour. Investigation shows that the deploy included a prompt change intended to make the agent more concise; it produced responses users found unhelpful. The rollback contains the incident; the prompt change is shelved pending refinement. The incident is closed within 4 hours; the postmortem identifies that the prompt change shipped without a canary (the discipline of 13_prompt_lifecycle_operations chapter 06 was bypassed). Tightening that discipline prevents recurrence.

This chapter is that response. Detection from feedback signal; containment; investigation; remediation; postmortem.


What "feedback incident" means

A feedback incident is a feedback signal that indicates a systemic concern, not a per-case issue. Distinguishing markers:

  • Sudden change. Negative feedback rate doubles in hours; explicit thumbs-down rate spikes; abandonment rises.
  • Sustained pattern. Multiple days of degraded signal across a feature or segment.
  • Cross-segment. The signal moves across multiple user segments simultaneously, suggesting platform-wide cause.
  • Correlated with platform changes. Deploys, model migrations, prompt changes that align temporally with the signal shift.

Each is a candidate incident. Incident response is the team's structured handling of it; this chapter is the feedback-driven version of the general AI incident process in module 05_ai_incident_operations.


The five phases

| Phase | Time window | Action |
| --- | --- | --- |
| Detection | Minutes to hours | Alarm fires from feedback monitor; on-call triages |
| Containment | Minutes to a few hours | Roll back the suspect change; freeze the affected feature |
| Investigation | Hours to days | Map what changed; correlate with the feedback shift; identify root cause |
| Remediation | Days to weeks | Fix the cause; ship with discipline (canary, eval gates) |
| Postmortem | Within 1-2 weeks | Document; identify systemic improvements; close action items |

The phases are similar to general AI incident response; the distinguishing feature is that feedback is the trigger.


Detection from feedback

The alarms (chapter 09's metrics, monitored on the on-call dashboard):

  • Negative-feedback rate above baseline (typically 2σ for a sustained period).
  • Implicit-signal anomalies — abandonment up, repeat-ask up, copy-rate down.
  • Calibration agreement drop (the judge and users disagreeing more than usual).
  • Customer-support escalation rate up.

Each alarm has a threshold and a response policy. Severe alarms (negative rate doubling in an hour) page; medium alarms (sustained 2σ over a day) alert the on-call channel; low alarms (per-segment deviations) surface in the weekly review.
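The tiering above can be sketched as a small classifier. This is a minimal illustration, not a reference implementation: the names `Window` and `classify_alarm`, the `min_events` guard, and the use of a list of historical daily rates as the baseline are all assumptions made for the example. Low alarms (per-segment deviations) are left out because they go to the weekly review rather than to on-call.

```python
from dataclasses import dataclass
from statistics import mean, stdev


@dataclass
class Window:
    """Feedback counts over one time window (hypothetical structure)."""
    negative: int  # negative-feedback events in the window
    total: int     # all feedback events in the window

    @property
    def rate(self) -> float:
        return self.negative / self.total if self.total else 0.0


def classify_alarm(baseline_rates, last_hour, last_day, min_events=50):
    """Return 'page', 'alert', or 'none' for the platform-wide signal.

    baseline_rates: historical negative-feedback rates (at least two values).
    min_events: assumed guard so tiny samples do not trigger alarms.
    """
    mu = mean(baseline_rates)
    sigma = stdev(baseline_rates)
    # Severe: the hourly rate is at least double the baseline.
    if last_hour.total >= min_events and last_hour.rate >= 2 * mu:
        return "page"
    # Medium: the day-long rate sits more than 2 sigma above the baseline.
    if last_day.total >= min_events and last_day.rate > mu + 2 * sigma:
        return "alert"
    return "none"
```

With a baseline around 5%, an hourly window at 18% (the opening scenario) would page, while a day-long rate of 7% would only alert the on-call channel.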


Containment

The first hour. Roll back recent changes that correlate with the signal. The discipline:

  • If a deploy went out before the signal shift, roll back fully or partially.
  • If a prompt change was canaried, drop the canary weight to zero.
  • If a model migration was in progress, pause and revert the alias mapping.
  • If the cause is unclear, freeze further changes while investigating.

Containment is not the same as understanding. Stop the bleeding; understand after.
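The containment rules above amount to a lookup from change type to action, with a freeze as the fallback. A minimal sketch follows; the `Change` record, the `kind` labels, and the 6-hour lookback are assumptions for illustration, and a real system would read changes from its own deploy and canary tooling.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Change:
    """A recent platform change (hypothetical record)."""
    kind: str      # "deploy", "prompt_canary", or "model_migration"
    name: str
    at: datetime


# Containment action per change kind, taken from the list above.
ACTIONS = {
    "deploy": "roll back the deploy (fully or partially)",
    "prompt_canary": "drop the canary weight to zero",
    "model_migration": "pause the migration and revert the alias mapping",
}
FREEZE = "freeze further changes while investigating"


def containment_plan(changes, shift_at, lookback=timedelta(hours=6)):
    """List containment actions for changes that preceded the signal shift."""
    suspects = [c for c in changes if shift_at - lookback <= c.at <= shift_at]
    # Most recent change first: it is usually the first thing to revert.
    suspects.sort(key=lambda c: c.at, reverse=True)
    if not suspects:
        return [FREEZE]
    return [f"{c.name}: {ACTIONS.get(c.kind, FREEZE)}" for c in suspects]


if __name__ == "__main__":
    shift = datetime(2024, 1, 2, 10, 30)
    recent = [Change("deploy", "support-agent deploy", datetime(2024, 1, 2, 10, 0))]
    print(containment_plan(recent, shift))
```

The plan only names candidates for rollback; it does not claim that any of them is the cause, which is the job of the investigation phase.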


Investigation

After containment, map what happened.

  • Timeline. When did the signal shift; what changes happened in the preceding window.
  • Cohort analysis. Which segments are affected; does the pattern match any deploy's target.
  • Case-level read. Pull recent negative-feedback cases; what is the system doing or failing to do.
  • Cross-system check. Did the model gateway log unusual provider behaviour; did the data layer change; did upstream sources shift.

The investigation produces a root cause: what specifically caused the signal shift.
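The cohort-analysis step can be sketched as a before/after comparison of negative-feedback rates per segment. This is an illustrative sketch under assumptions: the event shape `(timestamp, segment, is_negative)` and the function name `segment_shift` are hypothetical, and timestamps can be any comparable values.

```python
from collections import defaultdict


def segment_shift(events, shift_at):
    """Compare negative-feedback rates per segment before and after a shift.

    events: iterable of (timestamp, segment, is_negative) tuples.
    Returns (segment, rate_before, rate_after) tuples, largest increase first.
    Segments with no events on one side of shift_at are skipped.
    """
    # counts[segment][side] = [negative_count, total_count]
    counts = defaultdict(lambda: {"before": [0, 0], "after": [0, 0]})
    for ts, segment, is_negative in events:
        side = "after" if ts >= shift_at else "before"
        counts[segment][side][0] += int(is_negative)
        counts[segment][side][1] += 1

    rows = []
    for segment, c in counts.items():
        (neg_b, tot_b), (neg_a, tot_a) = c["before"], c["after"]
        if tot_b == 0 or tot_a == 0:
            continue
        rows.append((segment, neg_b / tot_b, neg_a / tot_a))
    # Largest increase in negative rate first.
    rows.sort(key=lambda r: r[2] - r[1], reverse=True)
    return rows
```

If one segment dominates the top of the list and matches a deploy's target, that deploy becomes the leading hypothesis; if every segment moves together, a platform-wide cause is more likely. Either way, the case-level read is still needed to confirm.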


Remediation

Fix the root cause; ship with discipline.

  • Refine the change (e.g., revise the prompt that failed in production).
  • Re-test against the regression eval; verify it does not regress.
  • Canary; monitor feedback during canary.
  • Promote to 100% only if feedback profile holds.

The remediation may take days to weeks depending on the complexity. Containment holds in the meantime.
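One way to make "promote only if the feedback profile holds" concrete is a one-sided two-proportion z-test between control and canary traffic. The chapter does not prescribe a test, so this choice is an assumption, as are the function name `canary_holds`, the 1.645 cutoff (roughly a 5% one-sided level), and the `min_events` floor.

```python
import math


def canary_holds(control_neg, control_total, canary_neg, canary_total,
                 z_max=1.645, min_events=200):
    """Return True if the canary's negative rate is not significantly worse.

    Uses a pooled two-proportion z-test; returns False when either arm has
    too few feedback events to judge (assumed floor: min_events).
    """
    if control_total < min_events or canary_total < min_events:
        return False  # not enough data to justify promotion
    p_control = control_neg / control_total
    p_canary = canary_neg / canary_total
    pooled = (control_neg + canary_neg) / (control_total + canary_total)
    se = math.sqrt(pooled * (1 - pooled) * (1 / control_total + 1 / canary_total))
    if se == 0:
        # Both arms are all-negative or all-positive; compare rates directly.
        return p_canary <= p_control
    z = (p_canary - p_control) / se
    return z < z_max
```

Passing this gate only means no significant regression was detected at the chosen sample size; it is not proof that the canary is equivalent, so it complements rather than replaces the case-level read during canary.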


Postmortem

Within 1-2 weeks. The blameless postmortem covers:

  • Timeline. When detected, when contained, when remediated, when closed.
  • Root cause. The specific cause that produced the signal shift.
  • Contributing factors. Disciplines bypassed; monitors that should have caught earlier; tests that should have failed.
  • Blast radius. Number of affected users; estimated impact; any customer escalations.
  • Response assessment. What worked; what was slow; what improvement is warranted.
  • Action items. Specific systemic improvements with owners and dates.

A feedback-driven incident often points to discipline issues: a change shipped without canary, an eval gap that did not catch the failure mode, a monitor that lagged the signal. The action items address these structurally.


Common patterns in feedback-driven incidents

Frequent root causes:

  • Prompt change shipped without canary. Most common; the team felt the change was small.
  • Model migration without sufficient feedback baseline. The migration proceeded despite a feedback profile that should have triggered caution.
  • Upstream change affecting the system's input. A data source's shape changed; the agent's response degraded; users felt it.
  • Cumulative drift hitting a tipping point. Several small changes; each individually fine; the combination produces user-visible degradation.
  • Provider behaviour shift. The model provider updated something the gateway did not catch; the system's outputs degraded.

Each has a remediation. Cumulative drift is the hardest: there is no single change to roll back, so the response is to re-evaluate the cumulative state and identify which changes to revert.


What feedback incident response does not solve

  • Slow degradations. A signal that drops 1% per month for 6 months may not trigger any alarm; only the cumulative drift becomes visible.
  • Cross-platform causes outside the agent. A change to the broader product (pricing, UX, signup flow) may shift feedback for reasons unrelated to the agent.
  • Bias in the incident itself. The responding team may misattribute cause based on what they hypothesise; the case-level investigation grounds the response.
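For the slow-degradation case, a long-window trend monitor can complement the short-window alarms. The sketch below fits a least-squares line to weekly negative-feedback rates and flags a sustained upward slope. The function names, the 0.5-percentage-point-per-week threshold, and the 4-week minimum are assumptions for illustration, and rates are treated as fractions (0.05 for 5%).

```python
def weekly_slope(rates):
    """Least-squares slope of weekly rates, in rate units per week."""
    n = len(rates)
    if n < 2:
        raise ValueError("need at least two weekly rates")
    x_mean = (n - 1) / 2
    y_mean = sum(rates) / n
    num = sum((x - x_mean) * (y - y_mean) for x, y in enumerate(rates))
    den = sum((x - x_mean) ** 2 for x in range(n))
    return num / den


def trend_alarm(rates, max_slope=0.005, min_weeks=4):
    """Flag a sustained upward trend that short-window alarms would miss."""
    return len(rates) >= min_weeks and weekly_slope(rates) > max_slope
```

A series that climbs one percentage point per week for eight weeks has a slope of about 0.01 and would trip this monitor even though no single week looks like a spike. The threshold would need tuning against the platform's own week-to-week noise.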

Common mistakes

No feedback alarms. Incidents detected by support escalations days later.

Containment without investigation. Rolled back; never understood; the same incident recurs.

Investigation without postmortem action items. Same issues; same disciplines bypassed next time.

Conflating per-case issues with systemic incidents. Acting on per-case complaints as if each is an incident; effort scattered.

Slow response to "sustained" patterns. Waiting for a full day of data before responding; the bleeding continues.


Interview Q&A

Q1. Walk through what happens in the first hour of a feedback incident.

Alarm fires; on-call assesses. Identify the signal (negative-rate spike, abandonment up, etc.); correlate with recent changes. Roll back the suspect change(s); freeze further changes. Verify the signal stabilises after rollback. Open the incident; assemble the response team. Investigation begins in parallel with containment. The first hour is about stopping the bleeding, not understanding the root cause. Wrong-answer notes: "investigate first, contain second" lets harm continue.

Q2. The negative-feedback rate jumped 2 hours after a prompt change. The change had a canary at 25% before the spike. What is the response?

Drop the canary to 0% immediately (containment). The signal should stabilise as the new prompt's traffic disappears. Investigate: what specifically did users not like about the new prompt; the case-level review reveals the pattern. Refine the prompt; re-test against the eval; re-canary at lower weight (5%) with closer monitoring. Promote only if the feedback profile holds at the new lower-weight canary. Postmortem: the eval may have missed the failure mode; add cases representing it. Wrong-answer notes: "let the canary run" continues to harm 25% of users.

Q3. The negative-feedback rate has been climbing 1% per week for 8 weeks. No single alarm has fired. What is the discipline?

Slow degradations evade alarms tuned for sudden changes. The discipline includes longer-window monitors (weekly or monthly trend alarms) in addition to short-window. The weekly cadence (chapter 09) is also where slow trends should be noticed: the reviewer compares the week's metrics to the prior month and surfaces the trend even without an alarm. The cumulative-drift root cause may be hard to identify; re-evaluate recent changes individually; consider rolling back the most recent. Wrong-answer notes: "no alarm so no incident" misses cumulative drift.

Q4. The postmortem identifies that the eval set did not catch the failure mode. What action items follow?

Add cases representing the failure mode to the eval set (chapter 05 of this module; chapter 04 of 01_dataset_golden_set_operations). Refine the rubric if the failure mode reveals a criterion the rubric was missing. Add a monitor that would have caught the failure mode in production sooner. Review whether the discipline (canary, eval gates) was bypassed; if so, document the bypass and adjust the process to prevent recurrence. The action items address the systemic cause, not just the immediate fix. Wrong-answer notes: "we shipped the fix" without the systemic improvements produces the next incident.


What to do differently after reading this

  • Establish feedback-driven alarms, not just a per-case dashboard.
  • Containment first; investigation second.
  • Postmortems address systemic causes; the action items prevent recurrence.
  • Distinguish per-case issues (chapter 10 closing the loop) from systemic incidents (this chapter).
  • Include long-window trend monitors for slow degradations.

Bridge. Eleven chapters of discipline. The last two synthesise. The next chapter is the architect's checklist — twenty items that distinguish a defensible feedback-loop operation from one that is not. → 12-architect-checklist.md