07. Soft failure detection — the incident with no stack trace

~12 min read. The hardest AI incidents do not crash. They sound reasonable while quietly violating the product contract.

Continues from 06-rollback-and-kill-switches.md. The firebreak can stop known dangerous paths. The harder problem is noticing when the alarm bell is a semantic smell instead of a hard error.

The previous chapter gave us rollback surfaces: prompt, model route, retrieval, tools, guardrails, memory, workflow, and feature flags. That helps when the team already knows the path is dangerous. This chapter steps earlier in time and asks how the alarm rings when the system is wrong but still looks operationally healthy.


1) The wall — plausible is not safe

The refund assistant answers with a polished paragraph, cites a policy page, and uses the right tone. The answer is still wrong because it cites the old renewal policy and ignores the enterprise exception.

No exception fired. No schema failed. No latency budget broke.

Soft failures are AI incidents where the output is syntactically valid and operationally successful but semantically wrong, unsafe, stale, misleading, or off-policy.

hard failure                  soft failure
JSON parse error              valid JSON with wrong decision
tool timeout                  tool succeeds on wrong entity
HTTP 500                      HTTP 200 with stale policy
guardrail block               guardrail passes unsafe nuance

The lead question is not "did the system run?" It is "did the system keep its product promise?"
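The gap between the two questions can be sketched in a few lines. This is a minimal illustration, not a production check: the field names (`policy_id`, `segment`, `enterprise_exception_applied`) and the policy identifiers are hypothetical, and it assumes the assistant returns a structured JSON body.

```python
import json

# Hypothetical identifier for the policy version that is currently in force.
CURRENT_POLICY_ID = "renewal-policy-v2"
REQUIRED_KEYS = {"decision", "policy_id", "segment"}

def operational_check(raw: str) -> bool:
    """'Did the system run?' Passes if the body parses and has the expected keys."""
    try:
        body = json.loads(raw)
    except json.JSONDecodeError:
        return False
    return isinstance(body, dict) and REQUIRED_KEYS.issubset(body)

def contract_check(raw: str) -> list[str]:
    """'Did the system keep its promise?' Returns product-contract violations."""
    body = json.loads(raw)
    violations = []
    if body["policy_id"] != CURRENT_POLICY_ID:
        violations.append("cites stale policy")
    if body["segment"] == "enterprise" and not body.get("enterprise_exception_applied", False):
        violations.append("ignores enterprise exception")
    return violations

# A polished, well-formed response that is still wrong.
response = json.dumps({
    "decision": "refund_approved",
    "policy_id": "renewal-policy-v1",
    "segment": "enterprise",
})

print(operational_check(response))  # the hard-failure check is green
print(contract_check(response))     # the contract check is not
```

The same response passes the first function and fails the second, which is the whole shape of a soft failure.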


2) Detection sources for soft failures

Soft failures need multiple detectors because none is complete.

Detector                 Catches                           Misses
Golden evals             known regressions                 novel failures
Slice evals              critical workflows                long-tail weirdness
LLM judge                scalable quality checks           judge bias and domain nuance
Human review             subtle policy/tone/fact issues    scale and latency
User reports             real impact                       underreporting and delay
Retrieval audits         missing/stale evidence            generation mistakes
Tool audits              unauthorized actions              bad advice without action
Cost/latency anomalies   loops and overload                quiet wrong answers

The mature response combines detectors and treats each one as partial.
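One way to keep each detector honest about its partial coverage is to let it return three values instead of two: pass, fail, or "I have nothing to say about this case". The sketch below assumes a hypothetical `case` dictionary and three toy detectors; real detectors would call eval harnesses, retrieval logs, and ticketing systems.

```python
from typing import Callable, Optional

# True = pass, False = fail, None = this detector has no coverage for the case.
Detector = Callable[[dict], Optional[bool]]

def golden_eval(case: dict) -> Optional[bool]:
    expected = case.get("golden_answer")
    return None if expected is None else case["answer"] == expected

def retrieval_audit(case: dict) -> Optional[bool]:
    return case["current_policy_id"] in case["retrieved_policy_ids"]

def user_report(case: dict) -> Optional[bool]:
    # Silence from the user is not a pass, only an absence of evidence.
    return False if case.get("user_flagged") else None

DETECTORS: dict[str, Detector] = {
    "golden_eval": golden_eval,
    "retrieval_audit": retrieval_audit,
    "user_report": user_report,
}

def run_detectors(case: dict) -> dict[str, Optional[bool]]:
    return {name: detector(case) for name, detector in DETECTORS.items()}

def summarize(verdicts: dict[str, Optional[bool]]) -> dict[str, list[str]]:
    return {
        "failed": sorted(n for n, v in verdicts.items() if v is False),
        "no_coverage": sorted(n for n, v in verdicts.items() if v is None),
    }

# A novel question: no golden answer exists, nobody complained, retrieval was stale.
case = {
    "answer": "Your renewal is refundable.",
    "current_policy_id": "renewal-policy-v2",
    "retrieved_policy_ids": ["renewal-policy-v1"],
}
summary = summarize(run_detectors(case))
print(summary)
```

Reporting `no_coverage` alongside `failed` makes the blind spots visible: here two of three detectors were silent, and only the retrieval audit caught the problem.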


3) Worked example — turning a complaint into a detector

The refund complaint should not remain one anecdote. Convert it into a detection slice:

slice: enterprise renewal refunds
must include:
  - renewal date
  - customer segment
  - refund window
  - current policy citation
  - no approval language unless tool confirms eligibility
detectors:
  - golden questions for 30/60/90-day cases
  - retrieval check for current-policy chunk
  - judge rubric for "does not recommend ineligible refund"
  - sampled human review for enterprise accounts

Now the incident creates a future alarm. The after-action lock starts forming before the postmortem is written.
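The "must include" rules in the slice can be turned into an executable check. This is a sketch under stated assumptions: it presumes the answer has already been parsed into a dictionary with hypothetical field names, and the approval phrase list is illustrative rather than complete. The judge rubric and human review from the slice are not modeled here.

```python
REQUIRED_FIELDS = ("renewal_date", "customer_segment", "refund_window", "policy_citation")
# Hypothetical phrase list; a real rule would need a broader, reviewed set.
APPROVAL_PHRASES = ("approved", "you are eligible", "will be refunded")

def check_slice_case(answer: dict, current_policy_id: str) -> list[str]:
    """Return the slice rules this answer violates (empty list means it passes)."""
    problems = [f"missing {field}" for field in REQUIRED_FIELDS if not answer.get(field)]

    citation = answer.get("policy_citation")
    if citation and citation != current_policy_id:
        problems.append("citation is not the current policy")

    text = answer.get("text", "").lower()
    uses_approval = any(phrase in text for phrase in APPROVAL_PHRASES)
    if uses_approval and not answer.get("tool_confirmed_eligible", False):
        problems.append("approval language without tool-confirmed eligibility")
    return problems

# The answer from the original complaint, reduced to the fields the slice cares about.
complaint_answer = {
    "text": "Your refund is approved under our renewal policy.",
    "renewal_date": "2024-01-15",
    "customer_segment": "enterprise",
    "policy_citation": "renewal-policy-v1",
    "tool_confirmed_eligible": False,
}
problems = check_slice_case(complaint_answer, current_policy_id="renewal-policy-v2")
for problem in problems:
    print(problem)
```

Running the check against the complaint that started the incident is a useful sanity test: if the new detector does not fire on the known bad answer, it will not fire on the next one either.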


4) Why not rely on user complaints

The tempting alternative is to wait for support tickets. That is cheap and real.

It fails because users often cannot tell the answer is wrong. They may trust the assistant, silently churn, follow bad advice, or report through a channel that never reaches engineering.

User complaints are high-signal but low-coverage. Treat them as incident triggers, not as the monitoring system.


5) Production signals — soft failure monitoring

The first metric is critical-slice failure rate: high-risk workflows measured separately from global quality.

The misleading metric is average answer rating. A broad average can improve while one regulated or high-value slice regresses.
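A small arithmetic sketch shows how this happens. The counts below are invented for illustration only; the slice name follows the worked example above.

```python
from collections import defaultdict

def failure_rates(results):
    """results: iterable of (slice_name, passed). Returns (per-slice rates, global rate)."""
    totals, fails = defaultdict(int), defaultdict(int)
    for slice_name, passed in results:
        totals[slice_name] += 1
        fails[slice_name] += 0 if passed else 1
    per_slice = {name: fails[name] / totals[name] for name in totals}
    overall = sum(fails.values()) / sum(totals.values())
    return per_slice, overall

def make_results(general_passes: int, enterprise_passes: int):
    # Illustrative traffic mix: 1000 general cases, 50 enterprise renewal refund cases.
    general = [("general", i < general_passes) for i in range(1000)]
    enterprise = [("enterprise_renewal_refunds", i < enterprise_passes) for i in range(50)]
    return general + enterprise

before_slices, before_global = failure_rates(make_results(900, 48))
after_slices, after_global = failure_rates(make_results(950, 35))

print(f"global failure rate: {before_global:.3f} -> {after_global:.3f}")
print("slice failure rate:  "
      f"{before_slices['enterprise_renewal_refunds']:.2f} -> "
      f"{after_slices['enterprise_renewal_refunds']:.2f}")
```

With these made-up numbers the global failure rate drops from roughly 9.7% to 6.2% while the enterprise slice goes from 4% to 30%. The small slice is simply outvoted by the large one.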

The expert signal is disagreement between detectors. If the LLM judge passes but human review fails, or retrieval looks good but user complaints rise, the system needs investigation.
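Disagreement is easy to surface once verdicts are stored per case and per detector. The following is a minimal sketch; the case IDs and detector names are hypothetical, and it only handles boolean verdicts.

```python
from itertools import combinations

def disagreements(verdicts: dict[str, dict[str, bool]]) -> list[tuple[str, str, str]]:
    """verdicts maps case_id -> {detector_name: passed}.

    Returns (case_id, passing_detector, failing_detector) for every pair that disagrees.
    """
    found = []
    for case_id, by_detector in sorted(verdicts.items()):
        for a, b in combinations(sorted(by_detector), 2):
            if by_detector[a] != by_detector[b]:
                passing, failing = (a, b) if by_detector[a] else (b, a)
                found.append((case_id, passing, failing))
    return found

verdicts = {
    "case-1": {"llm_judge": True, "human_review": True},
    "case-2": {"llm_judge": True, "human_review": False},
}
flagged = disagreements(verdicts)
print(flagged)
```

Each triple is a candidate for investigation: either the passing detector has a blind spot or the failing one is miscalibrated, and both outcomes are worth knowing.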


6) Boundary — detection cannot eliminate judgment

Soft failure detection helps find semantic incidents earlier. It does not eliminate human judgment for policy nuance, legal risk, brand tone, or high-stakes advice.

The pathology is metric worship. The team treats judge score as truth, then misses the exact class of failure the judge is biased to approve.
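One guard against metric worship is to measure the judge against a human-reviewed sample and look at its false-pass rate, not only its pass rate. The sample below is invented for illustration.

```python
def judge_pass_rate(samples):
    """samples: list of (judge_passed, human_passed) pairs."""
    return sum(judge for judge, _ in samples) / len(samples)

def judge_false_pass_rate(samples):
    """Of the answers human reviewers failed, what share did the judge pass?"""
    judge_on_human_fails = [judge for judge, human in samples if not human]
    if not judge_on_human_fails:
        return None  # no human-identified failures in the sample, so nothing to measure
    return sum(judge_on_human_fails) / len(judge_on_human_fails)

# Illustrative sample of 100 answers that received both a judge verdict and a human verdict.
samples = (
    [(True, True)] * 90      # both pass
    + [(True, False)] * 6    # judge passes, human fails
    + [(False, False)] * 4   # both fail
)

print(f"judge pass rate:       {judge_pass_rate(samples):.2f}")
print(f"judge false-pass rate: {judge_false_pass_rate(samples):.2f}")
```

In this made-up sample the judge passes 96% of answers, which looks healthy, yet it approves 6 of the 10 answers that human reviewers rejected. The headline score hides exactly the failures that matter.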


Recall checkpoint

  • What makes a failure "soft"?
  • Why are user complaints not enough?
  • What does a critical slice measure?
  • Why is detector disagreement valuable?

Interview Q&A

Q: How do you detect AI incidents that do not produce errors? A: Use critical-slice evals, retrieval audits, LLM judges, human review, user reports, tool audits, and anomaly monitoring together.

Common wrong answer to avoid: "Monitor exceptions and latency." Soft failures often show normal exception rates and normal latency.

Q: What do you do after one customer reports a plausible-but-wrong answer? A: Snapshot the trace, classify severity, create a slice around the failure, replay related cases, add eval/judge/human-review coverage, and choose containment if harm can spread.

Common wrong answer to avoid: "Fix that one prompt." The complaint may reveal a slice-level product failure.

Q: Why are LLM judges useful but dangerous? A: They scale semantic checking but inherit bias, style preferences, and domain blind spots. They need calibration and human sampling.

Common wrong answer to avoid: "The judge passed, so the answer is fine." Judge pass is evidence, not truth.


Apply now (10 min)

Model the exercise. Turn the refund complaint into a critical eval slice with five cases and one human-review rule.

Your turn. Pick one AI workflow and name three soft failures that would not show up in uptime dashboards.

Reproduce from memory. Explain why plausible output is not the same as safe output.


What you should remember

This chapter explained soft failure detection. The important idea is that AI incidents can be semantically red while operational dashboards stay green.

Carry this diagnostic forward: every high-risk AI workflow needs critical slices and detector diversity, not only uptime and error metrics.

Remember:

  • Soft failures are valid outputs that violate product truth or policy.
  • User complaints trigger investigation but cannot be the whole monitor.
  • Critical slices beat global averages.
  • Detector disagreement is a signal, not an annoyance.

Bridge. Soft failures give us the detection problem. Next we catalog the recurring incident patterns that create those failures in real AI systems. → 08-ai-specific-incident-patterns.md