08. AI-specific incident patterns: the fires that keep returning

~13 min read. Mature teams do not treat every incident as unique. They recognize recurring AI failure shapes and keep ready-made firebreaks for each one.

Continues from 07-soft-failure-detection.md. We can now detect soft failures. The next step is recognizing the incident pattern quickly enough to choose the right firebreak.

The previous chapter showed why plausible outputs can be incidents even without stack traces. That gave us a vocabulary for detection, but it leaves a triage question open: what kind of AI fire is this? This chapter groups scattered symptoms into recurring patterns, each pointing to a first artifact to inspect and a first firebreak to apply.

1) The pattern map

Most AI incidents fit one or more of these patterns:

Pattern	Symptom	Usual firebreak
Prompt regression	answers change after template edit	prompt rollback
Model route regression	one model tier behaves differently	route rollback
Retrieval poisoning	bad/stale/unauthorized chunks enter context	index rollback or filter
Reranker/fusion regression	right evidence drops below prompt cutoff	re-enable previous ranking
Tool loop runaway	repeated actions, cost spike, or side effects	tool disable/rate limit
Memory contamination	stale or wrong user facts persist	memory read/write disable
Guardrail bypass	unsafe input/output passes checks	threshold tighten or block path
Judge drift	eval monitor stops matching human judgment	calibrate judge or human gate
Cost explosion	prompt expansion, retries, or loops increase spend	budget cap/routing/rate limit
Latency fallback harm	fallback path gives lower-quality answer	disable unsafe fallback

The table is useful because it maps each symptom to a first containment step. Root-cause analysis can proceed more slowly once the fire is contained.
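One way to keep the map usable during triage is to store it as data rather than prose. The following is a minimal sketch, assuming hypothetical pattern keys and a hypothetical `first_firebreaks` helper; the entries copy the rows of the table above and are not meant as a complete registry.

```python
# Hypothetical registry: pattern key -> first firebreak, copied from the table above.
FIREBREAKS = {
    "prompt_regression": "prompt rollback",
    "model_route_regression": "route rollback",
    "retrieval_poisoning": "index rollback or filter",
    "reranker_fusion_regression": "re-enable previous ranking",
    "tool_loop_runaway": "tool disable/rate limit",
    "memory_contamination": "memory read/write disable",
    "guardrail_bypass": "threshold tighten or block path",
    "judge_drift": "calibrate judge or human gate",
    "cost_explosion": "budget cap/routing/rate limit",
    "latency_fallback_harm": "disable unsafe fallback",
}


def first_firebreaks(suspected_patterns):
    """Return (pattern, firebreak) pairs for the suspected patterns, in order.

    Unknown patterns map to None so that a missing lever is visible, not silent.
    """
    return [(p, FIREBREAKS.get(p)) for p in suspected_patterns]
```

Returning `None` for an unrecognized pattern is a deliberate choice in this sketch: it surfaces the gap during triage instead of hiding it behind a default.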

2) Worked example: refund incident as a combined pattern

The refund incident is not one pattern. It is a combination:

reranker disabled for cost experiment
  -> stale policy outranks current policy
fallback model used after timeout
  -> follows stale chunk more strongly
prompt allows recommendation language
  -> answer becomes directive

That means the firebreak should be layered:

Degrade refund answer mode to cited policy only.
Re-enable reranker for refund flows.
Force current-policy retrieval filter.
Review fallback model behavior before restoring.

Pattern recognition prevents the team from blaming "the model" when the incident is really ranking plus fallback plus prompt authority.
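The layered firebreak can be pictured as a set of configuration overrides scoped to the refund flow. This is a sketch only: every key name and value below is an assumption made for illustration, and it reads the fourth step as keeping the fallback model off until its behavior has been reviewed.

```python
import copy

# Hypothetical per-flow settings describing the state during the incident.
DEFAULT_FLOW_CONFIG = {
    "answer_mode": "recommendation",   # prompt allows directive language
    "reranker_enabled": False,         # disabled for the cost experiment
    "retrieval_filter": None,          # no policy-version filter
    "fallback_model_enabled": True,    # fallback used after timeout
}

# One override per contributing pattern, in the order listed above.
REFUND_FIREBREAK = [
    ("answer_mode", "cited_policy_only"),
    ("reranker_enabled", True),
    ("retrieval_filter", "current_policy_only"),
    ("fallback_model_enabled", False),  # held off pending review (assumption)
]


def apply_firebreak(config, overrides):
    """Return a new config with overrides applied, plus a log of what changed."""
    contained = copy.deepcopy(config)
    changes = []
    for key, value in overrides:
        if contained.get(key) != value:
            # Record (key, old value, new value) for the incident timeline.
            changes.append((key, contained.get(key), value))
            contained[key] = value
    return contained, changes
```

Returning the change log alongside the new config keeps a record of which levers were pulled, which is useful when each layer has to be restored separately.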

3) The cost runaway pattern

Cost incidents deserve their own attention because they can be invisible to users until the bill arrives.

Common causes:

agent loop retries the same failed tool
query decomposition explodes one request into fifty calls
fallback model is more expensive than primary
logging or judge evaluation runs on every trace
retrieval overfetch plus reranking grows silently

The alarm bells are spend per request, tokens per workflow, tool calls per session, and retry depth. The firebreak is a budget cap, a loop depth limit, a rate limit, a cheaper route, or disabling the expensive stage.

The postmortem must ask why the cost guard did not trip earlier.
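A runtime cost guard can be sketched as a small per-request counter that trips on any of those signals. The class name, field names, and limits below are all hypothetical placeholders, not recommended values.

```python
from dataclasses import dataclass


class BudgetExceeded(Exception):
    """Raised when a request crosses one of its cost limits."""


@dataclass
class CostGuard:
    # Illustrative limits only; real values depend on the workflow.
    max_spend_usd: float = 0.50
    max_tool_calls: int = 20
    max_retry_depth: int = 3
    spend_usd: float = 0.0
    tool_calls: int = 0

    def record_call(self, cost_usd: float, retry_depth: int = 0) -> None:
        """Record one model or tool call and trip if any limit is crossed."""
        self.spend_usd += cost_usd
        self.tool_calls += 1
        if retry_depth > self.max_retry_depth:
            raise BudgetExceeded(f"retry depth {retry_depth} > {self.max_retry_depth}")
        if self.tool_calls > self.max_tool_calls:
            raise BudgetExceeded(f"tool calls {self.tool_calls} > {self.max_tool_calls}")
        if self.spend_usd > self.max_spend_usd:
            raise BudgetExceeded(f"spend {self.spend_usd:.2f} > {self.max_spend_usd:.2f}")
```

In this sketch the agent loop would create one `CostGuard` per request and call `record_call` before each model or tool invocation, so a runaway loop stops at the limit rather than at the invoice.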

4) The stale evidence pattern

RAG incidents often come from stale, missing, or unauthorized evidence.

source doc changed
  -> parser missed update
  -> index still has old chunk
  -> retriever finds stale chunk
  -> model gives confident old answer

The right firebreak is not always prompt rollback. It may be index rollback, source filter, freshness gate, or temporary answer refusal for affected policies.
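A freshness gate, one of the options above, can be sketched as a filter between retrieval and prompt assembly. The chunk fields `doc_id` and `indexed_at` and the `source_updated_at` lookup are assumptions for illustration; a real index may expose freshness differently.

```python
from datetime import datetime, timezone


def freshness_gate(chunks, source_updated_at):
    """Split retrieved chunks into (fresh, stale).

    chunks: list of dicts with hypothetical keys "doc_id" and "indexed_at".
    source_updated_at: dict mapping doc_id -> datetime of the last source change.
    A chunk is stale if its document changed after the chunk was indexed.
    Chunks whose document is unknown are treated as stale, to fail closed.
    """
    fresh, stale = [], []
    for chunk in chunks:
        updated = source_updated_at.get(chunk["doc_id"])
        if updated is not None and chunk["indexed_at"] >= updated:
            fresh.append(chunk)
        else:
            stale.append(chunk)
    return fresh, stale


# Usage: the refund policy changed after its chunk was indexed.
jan = datetime(2024, 1, 1, tzinfo=timezone.utc)
feb = datetime(2024, 2, 1, tzinfo=timezone.utc)
retrieved = [
    {"doc_id": "refund-policy", "indexed_at": jan},
    {"doc_id": "shipping-policy", "indexed_at": feb},
]
fresh, stale = freshness_gate(
    retrieved, {"refund-policy": feb, "shipping-policy": jan}
)
```

If `fresh` comes back empty for an affected policy, the temporary answer refusal mentioned above is the natural next step in this sketch.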

This is where modules 10, 11, 12, 13, and 14 re-enter the incident response module. Incident response is the operational wrapper around all those mechanisms.

5) Production signals: pattern recognition speed

The first metric is time to suspected pattern. The team does not need final root cause, but it should quickly say, "This looks like retrieval poisoning plus fallback route."

The misleading metric is the number of incident categories. Too many categories become taxonomy theater.

The expert signal is firebreak readiness per pattern. If the pattern is known but no lever exists, the architecture is immature.
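Both signals are simple to compute once the timestamps and levers are recorded. The sketch below assumes hypothetical inputs: two timestamps from the incident timeline and a mapping from pattern name to the levers that have actually been tested.

```python
from datetime import datetime


def time_to_suspected_pattern(detected_at, pattern_named_at):
    """Minutes between detection and the first stated pattern hypothesis."""
    return (pattern_named_at - detected_at).total_seconds() / 60


def firebreak_readiness(known_patterns, tested_levers):
    """Fraction of known patterns that have at least one tested lever.

    Returns the ratio and the sorted list of patterns with no lever.
    """
    missing = sorted(p for p in known_patterns if not tested_levers.get(p))
    ready = len(known_patterns) - len(missing)
    ratio = ready / len(known_patterns) if known_patterns else 0.0
    return ratio, missing
```

The `missing` list matters more than the ratio here: each entry names a known pattern for which the architecture still has no lever.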

6) Boundary: patterns are shortcuts, not verdicts

Patterns accelerate response. They can also bias investigation. If the team always sees retrieval poisoning, it may miss model route regression or guardrail bypass.

The mature move is to use patterns for initial containment and then test them against the snapshot package.

Recall checkpoint

Why are incident patterns useful?
Which patterns often combine in RAG failures?
What makes cost incidents different?
How can pattern recognition bias investigation?

Interview Q&A

Q: Name common AI incident patterns and first firebreaks. A: Prompt regression → prompt rollback; retrieval poisoning → index rollback or filter; tool loop → tool disable/rate limit; guardrail bypass → threshold tighten/block path; cost runaway → budget cap/routing/rate limit.

Common wrong answer to avoid: "Most AI incidents are hallucinations." That label hides the operational pattern and firebreak.

Q: How do you respond to a cost runaway incident? A: Cap budget, limit loop depth, rate-limit the expensive path, inspect token/tool-call counts by workflow, and add a cost eval or runtime guard.

Common wrong answer to avoid: "Switch to a cheaper model." The loop or prompt expansion may still run away on the cheaper route.

Q: Why does stale evidence create incidents even when the model is behaving normally? A: The model may faithfully answer from the wrong chunk. The failure is in source freshness, parsing, indexing, retrieval, or ranking.

Common wrong answer to avoid: "The model hallucinated." It may have been grounded in stale evidence.

Apply now (10 min)

Model the exercise. Classify the refund incident as reranker regression plus stale evidence plus fallback harm plus unsafe answer authority.

Your turn. Pick three AI incidents and map each to pattern, first artifact, and first firebreak.

Reproduce from memory. Explain why "hallucination" is usually too vague for incident response.

What you should remember

This chapter explained recurring AI incident patterns. The important idea is that pattern recognition maps symptoms to containment before full root cause is known.

Carry this diagnostic forward: replace vague labels like hallucination with operational patterns that imply artifacts and firebreaks.

Remember:

Patterns speed containment but must be tested.
Many incidents combine prompt, retrieval, model route, and tool failures.
Cost runaway is an incident even without user-visible errors.
Known patterns should have ready levers.

Bridge. Recognizing the pattern helps contain the fire. The next job is making sure this incident creates a lock, not just a story. → 09-postmortem-evals-and-locks.md