
11. Prompt incidents and rollback — the five-minute loop

~16 min read. When a prompt change causes a regression, the work is not heroic. It is procedural: detect, identify, decide, roll back, verify, write it up, update the eval. The shorter the loop, the calmer the team.

Builds on 07-prompt-observability.md, 08-prompt-eval-suites.md, and 10-prompt-feature-flags.md. Now we use the bakery log to find the bad SHA, run the rollback to a known-good version, and feed the regression case back into the taste test.


1) Hook — the Friday-afternoon incident, told twice

Two versions of the same incident. Same team. Same prompt. Same regression. Different bakeries.

Old bakery. Friday at 14:00 someone merges a "small wording fix" to the customer-support prompt. The deploy ships at 14:05. By 14:25 the support inbox has eleven complaints — "the bot stopped using my name", "it feels colder", "why is it being so short with me?" The team scrambles. Nobody is sure what changed. The Slack thread is forty messages long. Someone finds the commit. Someone else opens a revert PR. CI takes nineteen minutes. The deploy adds another eight. By 15:25 — eighty-five minutes after the regression started — the old prompt is back. Three hundred conversations went out with the bad version. Two enterprise customers email their account manager. The postmortem is scheduled for Tuesday.

New bakery. Friday at 14:00 the prompt change ships, flag-gated at 5%. At 14:08 the complaint-rate dashboard sees a 3-sigma deviation on a specific tenant cohort. A pager fires. The on-call engineer opens the dashboard. The split-by-SHA panel shows v18 traffic has a 4x complaint rate versus v17. She clicks the flag link, flips the killswitch. The flag value reverts in eleven seconds. By 14:12 — twelve minutes after the regression started — every new conversation is back on v17. About forty conversations went out with v18. None of the customers were enterprise. The postmortem starts at 15:00.

Eighty-five minutes versus twelve. Three hundred bad conversations versus forty. The difference is not skill. It is the loop. This chapter is the loop.


2) The metaphor — the smoke alarm and the labeled fuse box

Picture a bakery with two pieces of safety equipment that did not exist before. A smoke alarm sits in the ceiling, wired to the oven. Each batch coming out is scored — color, smell, texture — and any batch that scores wrong sets the alarm. The alarm rings loud enough that the night shift hears it from the storeroom.

When the alarm rings, the head baker walks to the wall and opens the fuse box. Inside, every recipe in production has its own labeled switch — v17, v18, v19. Switching v18 off and v17 back on takes one motion. Within minutes the next batch out of the oven is the safe recipe again.

After the alarm calms, the team meets at the kitchen table. They do not blame the baker who wrote v18. They write the bad batch into the recipe book's test list. Now any future v19, v20, v21 must pass the bad-batch test before it can ship.

The smoke alarm is production monitoring tied to prompt SHA. The fuse box is the feature flag console. The labeled switches are versioned SHAs. The meeting at the kitchen table is the postmortem. Adding the bad batch to the test list is the eval suite gaining the regression case.

That last move is the part most teams skip. It is the most important one.


3) The detection-to-rollback loop

The loop has seven discrete steps. Each one has a target time.

┌───────────────────────────────────────────────────────────────┐
│  STEP                          TARGET     OWNER                │
├───────────────────────────────────────────────────────────────┤
│  1. detect                     < 5 min    monitoring           │
│  2. identify the prompt SHA    < 1 min    on-call (trace)      │
│  3. decide rollback or fwd-fix < 2 min    on-call              │
│  4. execute rollback           < 1 min    on-call (flag)       │
│  5. verify metrics recover     < 3 min    on-call (dashboards) │
│  6. write incident note        < 30 min   on-call              │
│  7. update eval suite          < 24 hr    owning team          │
└───────────────────────────────────────────────────────────────┘

Total time from regression start to flag flipped is about twelve minutes in the new-bakery story, and most of that is detection. The mean-time-to-rollback target of under five minutes is best read as alert-to-flag-flipped: steps 2 through 4, which sum to under four minutes in the table. The first time you measure it on your team, expect thirty or forty. The shrinkage comes from each step being practiced, not from heroism.
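One way to measure the loop on your own team is to record a timestamp when each step finishes and compare the gaps with the targets in the table. The sketch below does that with the standard library only; the drill timestamps and the function names are made up for illustration, and minute-resolution stamps mean a step counts as late only when it exceeds its target.

```python
from datetime import datetime

# Target minutes per step, copied from the table above (steps 1-5 only).
TARGET_MINUTES = {"detect": 5, "identify": 1, "decide": 2, "execute": 1, "verify": 3}


def step_durations(marks):
    """Minutes spent in each step.

    `marks` is an ordered list of (step_name, "HH:MM") pairs. The first pair is
    the regression start; every later pair records when that step finished.
    """
    times = [(name, datetime.strptime(stamp, "%H:%M")) for name, stamp in marks]
    return {
        name: (done - previous).total_seconds() / 60
        for (_, previous), (name, done) in zip(times, times[1:])
    }


def over_target(durations):
    """Steps that took longer than their target."""
    return [step for step, minutes in durations.items() if minutes > TARGET_MINUTES[step]]


# Hypothetical drill timestamps, not taken from a real incident.
drill = [
    ("regression_start", "09:00"),
    ("detect", "09:04"),
    ("identify", "09:05"),
    ("decide", "09:08"),
    ("execute", "09:09"),
    ("verify", "09:11"),
]
durations = step_durations(drill)
print(over_target(durations))  # ['decide']
```

Running this after every drill gives a per-step trend, which is more useful than a single end-to-end number when deciding what to practice next.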

The loop is linear, but the dependency between steps matters. You cannot execute the rollback (step 4) without identifying the SHA (step 2). You cannot identify the SHA without traces that log it (chapter 7). You cannot verify recovery (step 5) without per-variant dashboards (chapter 7 again). Half of incident response is upstream observability.
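The dependency on SHA-tagged traces can be made concrete with a small grouping step. This is a sketch that assumes each trace record is a dict with a `prompt_sha` field and a boolean `complaint` field; those field names are illustrative, not any particular observability platform's schema.

```python
from collections import defaultdict


def complaint_rate_by_sha(traces):
    """Complaint rate per prompt SHA, from an iterable of trace dicts."""
    totals = defaultdict(int)
    complaints = defaultdict(int)
    for trace in traces:
        sha = trace["prompt_sha"]
        totals[sha] += 1
        if trace["complaint"]:
            complaints[sha] += 1
    return {sha: complaints[sha] / totals[sha] for sha in totals}


# Hypothetical sample: 100 conversations on v17, 25 on v18.
sample = (
    [{"prompt_sha": "v17", "complaint": i < 2} for i in range(100)]
    + [{"prompt_sha": "v18", "complaint": i < 2} for i in range(25)]
)
rates = complaint_rate_by_sha(sample)
print(rates)  # v17 at 2%, v18 at 8%
```

Without the `prompt_sha` field there is nothing to group by, which is why identifying the SHA depends on what the traces log.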


4) Detection — three triggers, three responses

Not all prompt regressions look the same. There are three patterns and each has a different shape.

Hard trigger — CI eval failed. The new prompt SHA never reaches production. The eval suite from chapter 8 caught it. No incident, no rollback. The owning team iterates on the prompt until the eval passes. This is the cheapest outcome. Most prompt changes should be filtered here.

Soft trigger — live eval regression. The prompt shipped, the eval passed in CI, but a sampled production eval running against live traffic is showing a drift. Maybe the rude-language eval is catching 0.3% versus a baseline of 0.05%. Nobody has complained yet. The on-call engineer has time to think. The decision is whether to pause the rollout at the current percentage or roll back to the previous SHA. Soft triggers are the chance to act before customers feel it.

Incident trigger — production complaint surge. Customers are complaining, csat is dropping, the support inbox is filling, or the safety filter rate has jumped. The decision is not whether to roll back. It is how fast. This is the path the smoke alarm story walks.

TRIGGER          SOURCE                              RESPONSE
─────────────────────────────────────────────────────────────────
hard             CI evals on PR                      block merge
soft             live eval on sampled prod traffic   pause ramp,
                                                     investigate
incident         complaint surge, csat dip,          rollback now
                 safety filter spike, error rate

The detection layer for incident triggers is usually four signals stitched together.

The first is csat per variant — does the new SHA's customer satisfaction differ from the old's by more than 1-2 points? The second is complaint-rate-per-thousand-conversations — is the user-flagged complaint rate elevated? The third is safety filter trip rate — is the upstream safety classifier rejecting more responses than before? The fourth is downstream error rate — has the next system in the chain (a parser, a router, a UI renderer) started failing more often?

A 3-sigma deviation on any single signal is the typical alert threshold. A 2-sigma deviation across two signals is also enough. Tune for your traffic level — at 100 conversations per hour the signal is noisy; at 100,000 it is sharp.
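The two-threshold rule above can be sketched as a single function. It assumes each signal has already been converted to a z-score signed so that positive means worse (for csat, the drop below baseline); the signal names are illustrative.

```python
def z_score(value, baseline_mean, baseline_std):
    """Deviation from baseline in standard deviations (0.0 if the baseline is flat)."""
    if baseline_std == 0:
        return 0.0
    return (value - baseline_mean) / baseline_std


def should_page(z_scores, single_threshold=3.0, pair_threshold=2.0):
    """Page on one signal at 3 sigma, or on two signals at 2 sigma."""
    values = list(z_scores.values())
    if any(z >= single_threshold for z in values):
        return True
    return sum(1 for z in values if z >= pair_threshold) >= 2


print(should_page({"csat_drop": 1.0, "complaints": 3.2, "safety": 0.4, "downstream": 0.1}))  # True
print(should_page({"csat_drop": 2.1, "complaints": 2.4, "safety": 0.3, "downstream": 0.1}))  # True
print(should_page({"csat_drop": 2.5, "complaints": 1.1, "safety": 0.3, "downstream": 0.1}))  # False
```

Both thresholds are parameters because the right values depend on traffic volume, as noted above.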


5) The rollback — what actually happens in one minute

When the on-call engineer decides to roll back, the procedure is short and unambiguous.

ROLLBACK PROCEDURE — customer_support_agent
──────────────────────────────────────────────────────────────────
Step 1.  open the flag console
         (Statsig / LaunchDarkly / Split / Flagsmith)
         flag: support_prompt_rollout

Step 2.  flip killswitch ON
         every request now serves the control variant
         propagation: ~10 seconds (SSE-based SDK)

Step 3.  watch the SHA distribution in the dashboard
         confirm v18 traffic drops to ~0%
         confirm v17 traffic returns to ~100%

Step 4.  watch the complaint-rate dashboard
         confirm metric returns toward baseline within
         3-5 minutes (TTL of in-flight conversations)

Step 5.  post in #incident-prompt-rollout
         "killswitch ON, prompt back on v17 at HH:MM,
          investigating cause"

Step 6.  page the prompt owner if not already on the thread

Six steps. Practiced, this takes a minute. The whole point of the flag is that step 2 is a click, not a deploy.
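To see why step 2 is a click, it helps to look at what a flag-gated prompt lookup might do on each request. The class below is a toy, in-memory stand-in: `PromptFlag`, `sha_for`, and the hash-bucketing scheme are hypothetical and do not reflect the API of Statsig, LaunchDarkly, Split, or Flagsmith.

```python
import hashlib


class PromptFlag:
    """Toy percentage rollout with a killswitch (not a real flag SDK)."""

    def __init__(self, control_sha, treatment_sha, rollout_percent):
        self.control_sha = control_sha
        self.treatment_sha = treatment_sha
        self.rollout_percent = rollout_percent
        self.killswitch = False

    def sha_for(self, conversation_id):
        # Killswitch wins over every other rule: serve control to everyone.
        if self.killswitch:
            return self.control_sha
        # Stable bucket in [0, 100) so a conversation keeps the same variant.
        digest = hashlib.sha256(conversation_id.encode()).hexdigest()
        bucket = int(digest, 16) % 100
        return self.treatment_sha if bucket < self.rollout_percent else self.control_sha


flag = PromptFlag(control_sha="v17", treatment_sha="v18", rollout_percent=5)
ids = [f"conv-{i}" for i in range(1000)]
before = sum(flag.sha_for(cid) == "v18" for cid in ids)

flag.killswitch = True  # the one-click rollback
after = sum(flag.sha_for(cid) == "v18" for cid in ids)
print(before, after)
```

The rollback changes one boolean that is read at request time, so no build or deploy is involved. Real SDKs add propagation, caching, and targeting rules on top of this idea.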

There is an alternate path when the flag was never wired. The team must revert the change in the prompt registry, which means pinning the live label back to the previous SHA. The registry from chapter 2 has a current pointer that names the live SHA. Re-pointing it back to v17 is one operation. Workers running long-poll fetches pick up the change in seconds; workers using a cached pull pick it up at the next refresh interval. This is slower than a flag flip but still much faster than reverting a code PR.
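The registry path can be sketched the same way. This is a minimal in-memory model of a registry with a live pointer per prompt; the class and method names are assumptions for illustration, and a real registry would persist versions and notify workers.

```python
class PromptRegistry:
    """In-memory stand-in for a prompt registry with a live pointer per prompt."""

    def __init__(self):
        self._versions = {}   # (name, sha) -> prompt text
        self._live = {}       # name -> SHA currently served
        self._previous = {}   # name -> SHA served before the last promote

    def publish(self, name, sha, text):
        self._versions[(name, sha)] = text

    def promote(self, name, sha):
        if (name, sha) not in self._versions:
            raise KeyError(f"unknown version {name}@{sha}")
        if name in self._live:
            self._previous[name] = self._live[name]
        self._live[name] = sha

    def rollback(self, name):
        """Re-point the live label to the previously served SHA."""
        self.promote(name, self._previous[name])

    def live(self, name):
        sha = self._live[name]
        return sha, self._versions[(name, sha)]


registry = PromptRegistry()
registry.publish("customer_support_agent", "v17", "Always greet the user by name. ...")
registry.publish("customer_support_agent", "v18", "Be concise. ...")
registry.promote("customer_support_agent", "v17")
registry.promote("customer_support_agent", "v18")
registry.rollback("customer_support_agent")
print(registry.live("customer_support_agent")[0])  # v17
```

Because old versions are never deleted, the rollback is a pointer move rather than a rewrite, which is what keeps it to one operation.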

                          ┌─────────────────────┐
                          │ flag killswitch ON  │
                          └──────────┬──────────┘
                                     │ ~10s
                          ┌─────────────────────┐
                          │  all new requests   │
                          │  return v17 SHA     │
                          └──────────┬──────────┘
                          ┌─────────────────────┐
                          │  in-flight requests │
                          │  finish on v18      │
                          │  (max ~30-60s)      │
                          └──────────┬──────────┘
                          ┌─────────────────────┐
                          │  steady state on    │
                          │  v17. metrics       │
                          │  recover within     │
                          │  3-5 min            │
                          └─────────────────────┘

The metric recovery lag — the three to five minutes between the flag flip and the dashboard returning to normal — confuses fresh engineers in their first incident. The reason is in-flight requests. A conversation that started thirty seconds before the killswitch was flipped is still finishing on v18 when the metric is read. The dashboard catches up within one conversation lifetime. Nothing is wrong; it is the system catching its breath.
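A dashboard that averages over a rolling window adds to the lag described above: the window still contains pre-flip minutes. The sketch below uses made-up per-minute counts (100 conversations a minute, 4 complaints a minute on the bad prompt, 1 a minute on the good one) and a five-minute window, all of which are assumptions.

```python
def rolling_rate(complaints, conversations, window):
    """Complaint rate over the trailing `window` minutes, one value per minute."""
    rates = []
    for i in range(len(conversations)):
        start = max(0, i - window + 1)
        total = sum(conversations[start:i + 1])
        bad = sum(complaints[start:i + 1])
        rates.append(bad / total if total else 0.0)
    return rates


# Minutes 0-2 are served by the bad prompt; the killswitch flips before minute 3.
complaints = [4, 4, 4, 1, 1, 1, 1, 1]
conversations = [100] * 8
rates = rolling_rate(complaints, conversations, window=5)
for minute, rate in enumerate(rates):
    print(minute, f"{rate:.2%}")
```

In this toy series the panel only reads the 1% baseline again at minute 7, once the last bad minute has left the window, even though every conversation from minute 3 on was healthy.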


6) Rollback or forward-fix — a decision tree

Sometimes rolling back is not the right move. The classic case: v18 fixed a bug v17 had, so rolling back to v17 reintroduces the original bug. There, forward-fixing (patching v18 into v19 quickly) can be the right call.

The decision tree is short.

                  ┌──────────────────────────────┐
                  │ is the regression worse than │
                  │ what v17 had?                │
                  └────────┬─────────┬───────────┘
                           │         │
                          yes        no
                           │         │
              ┌────────────▼─┐       ▼
              │ roll back to │   keep v18 ramping,
              │ v17. fix and │   address with a
              │ ship v19.    │   non-blocking fix
              └──────────────┘
                  ┌────────▼─────────────────────┐
                  │ can you ship v19 within an   │
                  │ hour with a passing eval?    │
                  └────────┬─────────┬───────────┘
                           │         │
                          yes        no
                           │         │
              ┌────────────▼─┐       ▼
              │ forward-fix  │   roll back to v17
              │ v18 → v19.   │   anyway. ship v19
              │ keep flag on │   the next day with
              │ pause until  │   the original v17 fix
              │ v19 ships.   │   plus regression
              └──────────────┘   coverage.

Three questions decide. Is the new regression worse than what v17 had? Can you ship a fixed v19 fast, with a passing eval? Is there any path to forward-fix that does not cut corners on the eval?

The default — and the safer choice when in doubt — is roll back. Forward-fixing under incident pressure is how you ship v18, v19, v20 all broken in different ways. The roll-back-and-think path is slower but correct more often.
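The three questions and the rollback default can be written down as a small function. This is one possible reading of the tree, not a definitive encoding: it treats forward-fix as available only when the regression is confirmed worse, a fixed v19 can ship within the hour, and that v19 passes the full eval. Any unknown answer (`None`) falls through to rollback.

```python
from typing import Optional


def decide(
    regression_worse_than_v17_bug: Optional[bool],
    can_ship_v19_within_hour: Optional[bool],
    v19_passes_full_eval: Optional[bool],
) -> str:
    """Return 'keep_ramping', 'forward_fix', or 'rollback'."""
    if regression_worse_than_v17_bug is False:
        # v18 is still the better version; fix without blocking the ramp.
        return "keep_ramping"
    if (
        regression_worse_than_v17_bug is True
        and can_ship_v19_within_hour is True
        and v19_passes_full_eval is True
    ):
        return "forward_fix"
    # Anything unknown or negative: take the safer default.
    return "rollback"


print(decide(False, None, None))  # keep_ramping
print(decide(True, True, True))   # forward_fix
print(decide(True, True, None))   # rollback
```

Encoding "unknown" explicitly is the point of the sketch: under incident pressure most answers are unknown, and the function then returns the same default the text recommends.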


Mid-content recall

  1. What is the difference between a hard, soft, and incident trigger?
  2. Why does the complaint-rate dashboard lag three to five minutes after the killswitch flips?
  3. When does forward-fix beat rollback?

7) Verifying the rollback worked

A flipped flag is not a recovered system. The verification step has three checks, plus a quick confirmation that in-flight conversations have drained.

SHA distribution in traces. Open the trace dashboard. Filter to the last five minutes. Group by prompt SHA. The chart should show v17 climbing to 100% and v18 falling to 0% within thirty seconds. If v18 traffic persists, the flag did not propagate everywhere — likely a caching layer holding stale rule values. Flush the cache or restart affected workers.

Complaint metric returning to baseline. Open the complaint-rate dashboard. Confirm the line bends back toward its pre-incident level within five minutes. If it does not, the regression is not driven by the prompt — or the rollback was incomplete. Time to widen the investigation.

Downstream errors clearing. Check the next system in the chain. If a parser was failing on v18's output, the parser error rate should drop with the rollback. Errors persisting after the prompt is reverted usually mean state was poisoned — corrupted cache entries, malformed records in a downstream queue.

VERIFICATION CHECK              TARGET             ACTION IF FAIL
────────────────────────────────────────────────────────────────────────────────────
SHA distribution in traces      v17 100% in 30 s   flush flag cache, restart workers
complaint rate                  baseline in 5 min  look for non-prompt cause
downstream error rate           baseline in 5 min  check for poisoned state
in-flight conversations         drain in 60 s      verify TTLs and timeouts

The verification step is fifteen minutes of attention after the flag flips. It is what separates "we think we fixed it" from "we know we fixed it." Skip it and you are back at 16:00 wondering why the dashboard is still red.
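The first three checks can be scripted so that the on-call reads a list of failures instead of eyeballing three panels. The sketch assumes the inputs have already been pulled from the dashboards; the 99% share cutoff and the 1.2x tolerance over baseline are placeholder values to tune, not recommendations.

```python
def verify_rollback(
    sha_counts,
    good_sha,
    complaint_rate,
    complaint_baseline,
    error_rate,
    error_baseline,
    min_good_share=0.99,
    tolerance=1.2,
):
    """Return a list of failed checks with the suggested action; empty means recovered."""
    failures = []
    total = sum(sha_counts.values())
    good_share = sha_counts.get(good_sha, 0) / total if total else 0.0
    if good_share < min_good_share:
        failures.append("sha_distribution: flush flag cache, restart workers")
    if complaint_rate > complaint_baseline * tolerance:
        failures.append("complaint_rate: look for non-prompt cause")
    if error_rate > error_baseline * tolerance:
        failures.append("downstream_errors: check for poisoned state")
    return failures


# Hypothetical readings five minutes after the flip.
print(verify_rollback({"v17": 480, "v18": 0}, "v17", 0.011, 0.010, 0.002, 0.002))  # []
print(verify_rollback({"v17": 300, "v18": 180}, "v17", 0.030, 0.010, 0.002, 0.002))
```

An empty list is the "we know we fixed it" state; anything else names the next action from the table.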


8) The postmortem — what it must produce

The postmortem is not the incident's end. It is the eval suite's start.

A useful postmortem produces five outputs.

The timeline. What happened, in clock time. When the change shipped, when the first signal appeared, when the page fired, when the rollback executed, when metrics recovered. This is the basic narrative.

The cause. What about the new prompt produced the regression? "Removed the line 'Always greet the user by name' from the system prompt. Result: the model defaulted to a generic opener, which customers perceived as cold." The cause should be specific enough that someone reading it three months later understands what to avoid.

The detection gap. What the CI eval missed. "Our greeting-warmth eval checked only that a greeting was present. The new prompt still passed because it greeted users, just without their name. The eval did not measure greeting personalization." The detection gap is the part that improves the eval suite.

The new eval case. The regression itself, formalized into a test. The bad output from the production trace becomes an example. The desired output becomes the gold answer. The eval suite now blocks any future prompt that fails the personalization test.

Action items. What the team will do differently. New eval coverage. New rollout discipline. Sometimes new flag-system features, such as "add a killswitch on csat dip". Sometimes new training for engineers reviewing prompt PRs.

INCIDENT POSTMORTEM — support prompt v18 regression
─────────────────────────────────────────────────────────────────
date:           2026-05-08
duration:       12 minutes (v18 deploy to rollback complete)
impact:         ~40 conversations served v18; csat -2.4 in cohort
                no enterprise customers affected

timeline:
  14:00 — v18 deployed at 5% via support_prompt_rollout flag
  14:08 — complaint-rate alert fires (3-sigma, tenant_cohort_B)
  14:09 — on-call paged
  14:10 — on-call confirms regression in dashboard, identifies SHA
  14:11 — killswitch flipped ON
  14:12 — flag propagation complete; SHA distribution back to 100% v17
  14:16 — complaint rate returns to baseline
  14:30 — incident channel update; postmortem scheduled
  15:00 — postmortem started

cause:
  v18 removed the line "Always greet the user by name" from the
  system prompt. Without it, the model produced generic openings
  that customers perceived as colder.

detection gap:
  Greeting-warmth eval measured presence of a greeting, not
  personalization. v18 still greeted users — just without names.

new eval case:
  Added 12 examples to greeting_personalization.eval, each with
  user_name available in context. Gold answer requires the name
  appear in the first sentence of the response. Eval gate added
  to CI for any change touching the role or tone sections.

action items:
  AI-2026-119  Add greeting_personalization.eval to CI gate
  AI-2026-120  Add csat-dip-based killswitch trigger to flag
  AI-2026-121  Add prompt-PR template requiring "what eval gates
               this change?" answer

A postmortem in that shape produces an asset — the new eval — that pays interest forever. The next engineer to remove a greeting line will be stopped at the CI gate.
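The new eval case in the postmortem reduces to a deterministic check: the user's name must appear in the first sentence of the response. A minimal sketch follows; the sentence splitter is deliberately naive (it would split after "Dr."), and the function names are illustrative rather than taken from any eval framework.

```python
import re


def first_sentence(text):
    """Text up to the first sentence-ending punctuation mark followed by whitespace."""
    return re.split(r"(?<=[.!?])\s+", text.strip(), maxsplit=1)[0]


def passes_personalization(response, user_name):
    """True if `user_name` appears as a whole word in the first sentence."""
    pattern = rf"\b{re.escape(user_name)}\b"
    return re.search(pattern, first_sentence(response), re.IGNORECASE) is not None


print(passes_personalization("Hi Priya, thanks for reaching out. How can I help?", "Priya"))  # True
print(passes_personalization("Hello! Thanks for reaching out, Priya.", "Priya"))              # False
print(passes_personalization("Hi there. I can help with billing.", "Priya"))                  # False
```

The second case is the important one: the name is present, but not in the opening sentence, so a presence-only check would still pass it.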


9) The "fix the eval, not just the prompt" rule

This is the rule that separates teams that have one incident from teams that have ten.

When a regression ships, two things must change before the team moves on. The prompt — obviously. And the eval — less obviously. If only the prompt is fixed, the same class of bug will ship again the next time someone touches that section. The eval is the institutional memory. The prompt is just the current state.

The rule has a corollary. Every prompt incident produces a new eval example. Not "consider adding one." Not "if it makes sense." A new example, from the production trace that triggered the incident, added to the eval suite, gated on CI. No exceptions.

This is what makes a prompt-ops practice compound. Year one, your eval suite has the cases you imagined. Year two, it has those plus every regression you actually shipped. Year three, it is the most informative document on the team: a complete inventory of what has gone wrong and how it gets caught.

Mini-FAQ. "What if the regression case is hard to express as a deterministic eval?" Use a model-graded eval. Capture the bad output, write a rubric, have a judge model score future outputs against it. Imperfect, but vastly better than no coverage.
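A model-graded case can be sketched without committing to a vendor by passing the judge in as a plain callable. Everything here is an assumption for illustration: the rubric wording, the PASS/FAIL protocol, and `fake_judge`, which stands in for a wrapper around whatever model client you actually use.

```python
from typing import Callable

RUBRIC = (
    "You are grading a customer-support reply for warmth.\n"
    "Known bad example from production: {bad_example}\n"
    "Reply to grade: {candidate}\n"
    "Answer PASS if the reply is clearly warmer and more personal than the "
    "bad example, otherwise answer FAIL."
)


def model_graded_case(candidate: str, bad_example: str, judge: Callable[[str], str]) -> bool:
    """Ask the judge to score `candidate` against the captured bad output."""
    prompt = RUBRIC.format(bad_example=bad_example, candidate=candidate)
    verdict = judge(prompt).strip().upper()
    return verdict.startswith("PASS")


def fake_judge(prompt: str) -> str:
    """Deterministic stand-in so the sketch runs offline; not a real grader."""
    graded = prompt.split("Reply to grade: ")[1].split("\n")[0]
    return "PASS" if "Priya" in graded else "FAIL"


bad = "Hello. What is your issue?"
print(model_graded_case("Hi Priya, happy to help with that!", bad, fake_judge))  # True
print(model_graded_case("State your issue.", bad, fake_judge))                   # False
```

Keeping the judge injectable also makes the eval itself testable: the harness can be checked with a fake before any model budget is spent.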


10) Cascading incidents — when the rollback is not enough

Sometimes the prompt change broke something downstream. Rolling back the prompt fixes the symptom for new requests, but a queue full of malformed records is still poisoned. The router that learned a bad pattern from v18's responses is still routing badly.

A cascading incident has a second wave after the obvious one.

WAVE 1: v18 ships, csat drops, killswitch flipped, metric recovers.
WAVE 2: downstream parser keeps failing on records emitted before
        the killswitch. Customer-facing queue stays clogged. Support
        tickets keep arriving from conversations that happened
        during the bad window.

The response has to fix both. Wave 1 — the prompt rollback. Wave 2 — replay or repair the poisoned records, retry the queue, notify affected customers if needed.
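The wave 2 repair starts with finding which records to replay. The sketch assumes downstream records carry the prompt SHA that produced them and an emission timestamp; the `prompt_sha` and `emitted_at` field names are hypothetical, and if your records lack them, that gap is itself a finding for the postmortem.

```python
from datetime import datetime


def poisoned_records(records, bad_sha, window_start, window_end):
    """Records emitted by the bad SHA inside the bad window, in original order."""
    return [
        r for r in records
        if r["prompt_sha"] == bad_sha and window_start <= r["emitted_at"] <= window_end
    ]


# Hypothetical queue contents around the incident window.
queue = [
    {"id": 1, "prompt_sha": "v17", "emitted_at": datetime(2026, 1, 1, 14, 3)},
    {"id": 2, "prompt_sha": "v18", "emitted_at": datetime(2026, 1, 1, 14, 6)},
    {"id": 3, "prompt_sha": "v18", "emitted_at": datetime(2026, 1, 1, 14, 11)},
    {"id": 4, "prompt_sha": "v17", "emitted_at": datetime(2026, 1, 1, 14, 13)},
]
to_repair = poisoned_records(
    queue, "v18", datetime(2026, 1, 1, 14, 0), datetime(2026, 1, 1, 14, 12)
)
print([r["id"] for r in to_repair])  # [2, 3]
```

Filtering on both SHA and time window keeps the replay small: healthy v17 records from the same minutes are left alone.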

Detection of wave 2 needs the downstream-error signal from section 4. If the parser error rate stays high after the prompt SHA distribution recovers, you are in a cascading incident. The work has expanded from "roll back the prompt" to "drain the poisoned state."

Most cascading prompt incidents come from one of three sources. The prompt changed an output format and a downstream parser expected the old format. The prompt's reply went into a database column with a length limit. The prompt's reply was used to populate a search index, and now the index has corrupted entries that need rebuilding. Each has a different repair playbook.


11) Failure modes — where the rollback loop leaks

SYMPTOM                                  ROOT CAUSE                           FIX
────────────────────────────────────────────────────────────────────────────────────────────────────────────
"It took us an hour to roll back"        No flag wrapping the prompt change   Always ship prompt changes
                                                                              behind a flag
"We rolled back but customers still      Downstream state poisoned            Add cascading-incident
complained for two days"                                                      playbook; drain queues
"The eval still passes after the         Eval too narrow; missed the          Add the regression case to
incident; same bug will return"          regression class                     the eval suite as new example
"We don't know which SHA caused this"    Trace did not log prompt SHA         Always log SHA in trace
"Flag was flipped but half the workers   Flag SDK caches, no streaming        Use streaming flag SDK; cache
still served v18"                                                             flush on killswitch
"Killswitch flipped, metric never        Regression had a non-prompt cause    Don't assume the prompt is
recovered"                                                                    the cause; widen investigation
"We rolled back, then realized v17 had   Forward-fix would have been better   Use rollback/forward-fix
its own bug that v18 fixed"                                                   decision tree
"Postmortem skipped the eval update"     Team treats postmortem as            Make the new eval case a
                                         retrospective, not asset creation    required postmortem output
"Postmortem found that two earlier       Past postmortems' action items       Action-item audit; flag
incidents had the same root cause"       were never closed                    incomplete items each sprint
"On-call could not find the flag         No discovery surface for flags       Flag registry; link in alerts
console URL during the incident"
"Trace search took 8 minutes during      Observability stack too slow at      Pre-built incident dashboards;
incident pressure"                       incident pressure                    saved queries

Eleven leaks. The shape — the rollback loop is fast when each component has been pre-built, slow when it is being improvised at 14:00 on Friday. Pre-build everything.


Where this lives in the wild

Prompt-incident response patterns and the tooling around them.

  • Anthropic, OpenAI, Google DeepMind — internal teams keep model-version rollback playbooks separated from prompt-version rollback playbooks, with distinct mean-time-to-recovery targets.
  • Cursor, v0, Lovable, Replit Agent — frequent staged prompt rollouts mean frequent killswitch usage as new prompt versions occasionally regress on edge cases.
  • GitHub Copilot, Codeium, Tabnine — model-and-prompt incident response built into their staged-release pipelines.
  • Intercom Fin — observability surfaces split csat by prompt version, with rollback procedures tied to per-workspace metrics.
  • Zendesk AI — incident response for assistant regressions runs through their feature-flag stack.
  • Glean — enterprise-search prompt incidents surface as per-tenant relevance drops with documented rollback runbooks.
  • Notion AI, Slack AI — workspace-segment monitoring catches per-tenant regressions before they spread.
  • Salesforce Einstein, Microsoft Copilot Studio — internal incident-response procedures emphasize prompt-SHA logging for trace forensics.
  • LangSmith, Langfuse, Helicone — observability platforms that surface trace-by-SHA filters used during incidents.
  • Braintrust — production scoring streams integrated with prompt versioning for soft-trigger detection.
  • PromptLayer — release-group rollback with version pinning.
  • Vellum — environment-scoped rollback workflows for prompt versions.
  • Phoenix (Arize) — prompt-version-aware monitoring with drift detection.
  • Galileo, Patronus AI — guardrail-trip-rate dashboards used as incident signals.
  • Datadog, New Relic, Honeycomb — APM platforms where the prompt SHA appears as a span attribute for filterable incident dashboards.
  • PagerDuty, Opsgenie, Incident.io — paging surfaces wired to prompt-csat drops and complaint-rate thresholds.
  • Sentry — error tracking enriched with prompt SHA tags for AI-feature regressions.
  • LaunchDarkly, Statsig, Split.io, Flagsmith, Optimizely, ConfigCat, GrowthBook, Unleash — feature-flag systems whose killswitch acts as the prompt rollback lever.
  • GitHub Actions, GitLab CI, CircleCI — eval gates that prevent the hard-trigger class entirely.
  • Promptfoo, DeepEval, OpenAI Evals, Inspect AI — eval runners where the regression case lands after a postmortem.
  • AWS AppConfig, HashiCorp Consul, Doppler, AWS Parameter Store — config stores used to gate prompt SHAs and serve as the rollback target.
  • Slack incident channels, Linear incident tickets — coordination surfaces that turn the loop from chaos into procedure.
  • Booking.com, Airbnb, Stripe, Shopify — large applied-AI teams with internal post-incident review templates that require new eval coverage as a closed-loop output.
  • Anthropic's Claude Code, OpenAI's Codex, Cursor agents — agentic systems with their own prompt-incident playbooks tracking regressions in agent-loop behavior.

Pause and recall

  1. What is the target mean-time-to-rollback for a prompt incident?
  2. What four signals are most useful as incident triggers?
  3. Why does the complaint-rate dashboard lag the killswitch flip by several minutes?
  4. What is the difference between rollback and forward-fix?
  5. What is the most important asset produced by a prompt postmortem?
  6. Why does the prompt rollback sometimes not fix the customer experience?
  7. What is the "fix the eval, not just the prompt" rule?

Interview Q&A

Q1. Walk me through your response to a prompt-caused production incident.
A. Detect via per-variant dashboards on csat, complaint rate, safety-filter rate, or downstream error rate. Identify the bad SHA from a trace. Decide rollback or forward-fix using the decision tree. Execute the rollback via the feature-flag killswitch. Verify recovery via SHA distribution and metric return. Run a postmortem within hours that produces a new eval case covering the regression and adds the case to CI. Mean-time-to-rollback target is under five minutes.
Trap: "We revert the PR." That is the eighty-five-minute path. The flag is the fast path.

Q2. The killswitch flipped, but the complaint rate took five minutes to recover. Is the rollback working?
A. Yes. The lag is in-flight conversations finishing on the bad SHA. New requests are already on the safe SHA; confirm with the SHA-distribution panel in the trace dashboard. The complaint metric averages across a window that includes pre-flip conversations. Recovery within three to five minutes is normal. If the metric stays elevated past ten minutes, the prompt was not the cause or the rollback was incomplete.
Trap: Panicking and re-flipping the flag during the natural recovery window.

Q3. Should you always roll back when a regression is detected?
A. No. Forward-fix is right when the rolled-back version had its own bug that the new version was fixing. Three questions decide: is the new regression worse than what the old version had, can you ship a forward-fixed version within the hour with a passing eval, can the forward-fix path be done without cutting corners on the eval? Default to rollback under pressure. Forward-fix in tight cases.
Trap: Always forward-fixing because "rollback feels like failure." That ships multiple broken versions sequentially.

Q4. A prompt incident has happened. What must the postmortem produce?
A. Five outputs. A timeline. A specific cause. A detection gap: what the eval suite missed. A new eval case, written into the suite, gated on CI. Action items. The single most important output is the new eval case. Without it, the same regression class ships again.
Trap: Postmortem produces only a writeup. The artifact that matters is the new eval.

Q5. You rolled back the prompt but a downstream queue is still poisoned. What is happening?
A. A cascading incident. The bad prompt produced records that flowed downstream (into a parser, a queue, an index, a database column) and those records remain poisoned after the rollback. The rollback fixes new requests; it does not heal existing state. The response has a second track: drain the queue, repair the records, retry, notify affected customers if material.
Trap: "The flag is flipped, we are done." Half the incident may still be active downstream.

Q6. How do you make mean-time-to-rollback shorter on a team that is at thirty minutes today?
A. Pre-build the loop. The prompt SHA must already be in every trace. The flag must already wrap every prompt deploy. The dashboard splitting metrics by SHA must already exist. The runbook must be next to the flag. The on-call must have practiced the procedure. Each component takes time to build but turns the next incident from improvisation into procedure. Game-day exercises (simulated prompt regressions) surface the gaps before a real incident.
Trap: Trying to make the actual incident faster. The work is upstream of the incident.

Q7. The same regression class shipped twice in two months. What changed in process?
A. The first incident's postmortem did not produce a new eval case, or the case was added to the suite but not gated on CI, or the eval gate was bypassed. Audit the prior action items. Confirm the new eval exists, is in CI, blocks merges, and runs on every relevant code path. Treat any open action item from a prior postmortem as severity equivalent to a live incident.
Trap: Treating recurrence as bad luck. Recurrence is always a process failure.

Q8. What is the role of the on-call engineer during a prompt incident?
A. Run the procedure, not improvise it. Identify the bad SHA from the trace, flip the killswitch, verify metric recovery, post status updates, and write a first-draft incident note within thirty minutes. The on-call does not need to know why the prompt regressed during the incident; they just need to make it stop. The postmortem investigates cause later, calmly.
Trap: On-call tries to fix the prompt during the incident. That is the slow path. Roll back first; debug after.


Apply now (5 min)

Step 1 — model first. Take a hypothetical prompt incident. v18 of a billing-FAQ prompt has shipped at 10%, complaint rate has doubled, csat has dropped two points. Write out the seven steps of the loop with target times next to each.

Step 2 — your turn. Pick one prompt in your system. Imagine it regressed today. Map out — where is the dashboard that would alert you, where is the flag console, where is the runbook, who is the on-call. If any of those four things does not exist, that is your next ticket.

Step 3 — sketch from memory. Redraw the rollback-or-forward-fix decision tree from section 6. The three branches. The default when in doubt. If you can draw it without looking, you have the model.


Bridge. You have the registry, the version control, the review gates, the shadow, the A/B, the drift detection, the observability, the eval, the multi-tenant patches, the flag rollouts, and the incident loop. Now — what tools actually ship those capabilities, and which ones do you pick? The market is crowded. The tooling chapter is next. → 12-tooling-landscape.md