02. Error detection — hear the alarm before the patient crashes

~14 min read. An AI stack cannot recover from failures it never notices.

Built on the ELI5 in 00-eli5.md. The vitals monitor — health signals that show trouble early — is what turns random breakage into manageable operations.


1) First picture: one request needs many monitors

A single LLM request has several ways to fail. Some failures are transport failures. Some are output failures. Some are policy failures. Some are quality failures.

user request
transport check ──→ timeout? connection reset? 5xx?
format check ─────→ valid JSON? expected fields?
policy check ─────→ refusal? blocked content? unsafe action?
quality check ────→ grounded? relevant? entity-correct?

The simple version: One green light is not enough. The vitals monitor needs multiple channels. If you only watch latency, you miss wrong content. If you only watch parse success, you miss a subtle refusal loop. If you only watch user complaints, you notice too late.

The practical response: Build layered detection. Let the triage desk receive several symptoms, not one.
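A minimal sketch of layered detection, assuming a response is represented as a plain dict. The field names (`status`, `parsed`, `refused`, `citations`) and the symptom labels are hypothetical, chosen only for illustration. The point is that every layer runs and every symptom is collected, rather than stopping at the first green light.

```python
from typing import Callable, Optional

# Each check returns a symptom string, or None when its layer looks healthy.
Check = Callable[[dict], Optional[str]]

def transport_check(resp: dict) -> Optional[str]:
    return "transport_failure" if resp.get("status", 200) >= 500 else None

def format_check(resp: dict) -> Optional[str]:
    return None if isinstance(resp.get("parsed"), dict) else "format_failure"

def policy_check(resp: dict) -> Optional[str]:
    return "refusal" if resp.get("refused") else None

def quality_check(resp: dict) -> Optional[str]:
    return None if resp.get("citations") else "ungrounded_answer"

LAYERS: list[Check] = [transport_check, format_check, policy_check, quality_check]

def collect_symptoms(resp: dict) -> list[str]:
    """Run every layer and return all symptoms, not just the first one."""
    return [s for check in LAYERS if (s := check(resp)) is not None]
```

A response with status 200 and a parseable body can still come back with `["ungrounded_answer"]`, which is the case a latency-only monitor would miss.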

2) Detecting loud failures first

Start with what machines can see clearly. Timeouts. Connection resets. HTTP 429. HTTP 5xx. Malformed JSON. Missing required fields. Tool exceptions. These are loud failures. They should be cheap to detect.

request start
   ├── no response in 8 s ─────────────→ timeout
   ├── response status = 503 ──────────→ provider failure
   ├── response body not parseable ────→ format failure
   └── tool call raised exception ─────→ tool failure
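
The branches above can be sketched as one small classifier. This is an illustration under stated assumptions, not a full client: the function name, its parameters, and the `rate_limited` label are hypothetical, and the 8 second budget is simply the figure from the diagram.

```python
import json
from typing import Optional

TIMEOUT_SECONDS = 8.0  # the 8 s budget from the diagram; tune per step

def classify_loud_failure(
    elapsed_s: float,
    status: Optional[int],          # None means no response arrived
    body: Optional[str],
    tool_error: Optional[Exception] = None,
) -> Optional[str]:
    """Return a failure class for a loud failure, or None if nothing loud happened."""
    if status is None or elapsed_s >= TIMEOUT_SECONDS:
        return "timeout"
    if status == 429:
        return "rate_limited"
    if status >= 500:
        return "provider_failure"
    try:
        json.loads(body or "")      # an empty body is also a format failure
    except json.JSONDecodeError:
        return "format_failure"
    if tool_error is not None:
        return "tool_failure"
    return None
```

Calling this once per step, rather than once per outer request, is what keeps the check cheap and still catches inner failures.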

This is basic engineering. But teams still miss it. Why? Because they detect only at the outer API boundary.

The production problem: The outer request may succeed, while one inner step quietly failed and got papered over. That is why each important span needs its own vitals monitor. For example, a retrieval-augmented answer path has four steps.

  1. Retrieve documents.
  2. Call model for synthesis.
  3. Parse citations.
  4. Render answer.

Suppose step 2 times out after 12 seconds. The frontend still returns, "We are having trouble right now." User-visible handling happened. Good. But detection still matters.

You must record:

  • model timeout count,
  • timeout latency bucket,
  • provider name,
  • prompt class,
  • whether fallback saved the request.

Otherwise tomorrow's outage looks like random support noise.
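
A minimal sketch of recording those fields, assuming an in-process counter stands in for a real metrics backend. The record type, the bucket edges, and the function names are illustrative assumptions.

```python
from collections import Counter
from dataclasses import dataclass

@dataclass(frozen=True)
class TimeoutRecord:
    provider: str
    prompt_class: str
    latency_bucket: str
    fallback_saved: bool

def latency_bucket(seconds: float) -> str:
    # Bucket edges are illustrative assumptions, not recommended values.
    if seconds < 5:
        return "<5s"
    if seconds < 10:
        return "5-10s"
    return ">=10s"

# Counting by the whole record gives the model timeout count per dimension.
timeout_counts: Counter = Counter()

def record_model_timeout(
    provider: str, prompt_class: str, seconds: float, fallback_saved: bool
) -> TimeoutRecord:
    rec = TimeoutRecord(provider, prompt_class, latency_bucket(seconds), fallback_saved)
    timeout_counts[rec] += 1
    return rec
```

For the 12 second synthesis timeout above, `record_model_timeout("model-x", "rag_synthesis", 12.0, True)` lands in the `>=10s` bucket with the fallback flag set, so the incident stays visible even though the user saw a graceful message.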

3) Malformed output is a first-class failure type

Now come to structured outputs. Many AI systems do not want prose. They want JSON, tool arguments, SQL, or function-call envelopes.

A pretty answer is useless if the shape is wrong.

model output
   ├── valid JSON? ─────────────── no ─→ reject
   ├── required keys present? ──── no ─→ reject
   ├── value types correct? ────── no ─→ reject
   └── semantic constraints pass? ─ no ─→ reject

The simple version: Notice the last line. Syntax is not enough. You may parse valid JSON, and still get nonsense. For example, expected tool call:

  • amount: positive integer,
  • currency: one of USD, EUR, INR,
  • order_id: existing order.

Model returns:

{"amount": -40, "currency": "RUPEES", "order_id": "ORD-999999"}

The JSON is valid. The answer is invalid. So your vitals monitor must do four checks.

  1. Syntax check.
  2. Schema check.
  3. Domain rule check.
  4. External existence check.

Now the triage desk can separate malformed output from bad business logic.
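
The four checks can be sketched for the tool call in this example. This is a hedged illustration: `validate_tool_call` and the `KNOWN_ORDERS` set are hypothetical, and the set stands in for a real lookup against the order system.

```python
import json

ALLOWED_CURRENCIES = {"USD", "EUR", "INR"}
KNOWN_ORDERS = {"ORD-1001", "ORD-1002"}  # hypothetical stand-in for an order lookup

def validate_tool_call(raw: str, known_orders: set = KNOWN_ORDERS) -> list[str]:
    """Return a list of labelled errors; an empty list means all four checks passed."""
    # 1. Syntax check.
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return ["syntax: not valid JSON"]

    # 2. Schema check: required keys with exact types (bool is not accepted as int).
    if not isinstance(data, dict):
        return ["schema: expected a JSON object"]
    expected = {"amount": int, "currency": str, "order_id": str}
    schema_errors = [
        f"schema: {key} missing or not {typ.__name__}"
        for key, typ in expected.items()
        if type(data.get(key)) is not typ
    ]
    if schema_errors:
        return schema_errors

    errors = []
    # 3. Domain rule check.
    if data["amount"] <= 0:
        errors.append("domain: amount must be positive")
    if data["currency"] not in ALLOWED_CURRENCIES:
        errors.append("domain: currency not allowed")
    # 4. External existence check.
    if data["order_id"] not in known_orders:
        errors.append("existence: order_id not found")
    return errors
```

The model output shown earlier passes the syntax and schema checks and then fails both domain rules and the existence check. The error prefixes are what let the triage desk tell a malformed output apart from a bad business value.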

4) Detecting refusals, policy conflicts, and hidden dead ends

Refusals are not always failures. Sometimes refusal is the correct behavior. But sometimes it is accidental. The model refuses safe content, or refuses due to prompt confusion, or loops between polite hedges.

The production problem: A refusal may arrive as normal text. HTTP 200. Low latency. Perfect JSON wrapper. Still operationally bad.

safe user request
model says "I cannot help with that"
     ├── policy truly requires refusal? ─ yes ─→ healthy refusal
     └── policy truly requires refusal? ─ no ──→ false refusal

See the distinction. A false refusal is a reliability defect. A healthy refusal is a safety success.

The practical response: Tag refusal reasons. Separate:

  • safety refusal,
  • capability refusal,
  • prompt-confusion refusal,
  • tool-unavailable refusal.

For example, a travel assistant gets: "Change my Friday flight to Saturday." The tool is down. The model replies, "I cannot modify flights due to policy restrictions."

That is misleading. The correct detection should say:

  • tool path unavailable,
  • policy not triggered,
  • fallback or human escalation required.

The senior doctor may not need to step in yet, but the detector must not mislabel the case.
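
One way to keep the label honest is to tag the refusal from system signals rather than from the model's own wording. The sketch below assumes three boolean signals are available from the policy layer and the tool runtime; the enum, the function, and the order of precedence are illustrative assumptions.

```python
from enum import Enum

class RefusalReason(Enum):
    SAFETY = "safety_refusal"
    CAPABILITY = "capability_refusal"
    PROMPT_CONFUSION = "prompt_confusion_refusal"
    TOOL_UNAVAILABLE = "tool_unavailable_refusal"

def tag_refusal(
    policy_triggered: bool,
    tool_available: bool,
    capability_supported: bool,
) -> RefusalReason:
    """Tag a refusal using observed system state, ignoring what the model claimed."""
    if policy_triggered:
        return RefusalReason.SAFETY
    if not tool_available:
        return RefusalReason.TOOL_UNAVAILABLE
    if not capability_supported:
        return RefusalReason.CAPABILITY
    # No policy hit, tool up, capability present: the refusal was accidental.
    return RefusalReason.PROMPT_CONFUSION
```

In the travel example, policy was not triggered and the tool was down, so `tag_refusal(False, False, True)` yields the tool-unavailable tag even though the model blamed policy.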

5) Quality detection needs proxies, not hope

Silent failures are harder. We often cannot know truth instantly. Still, we can build good proxies.

quality proxies
┌────────────────────────────────────┐
│ citation missing for factual claim │
│ entity not found in source data    │
│ confidence below threshold         │
│ answer contradicts retrieved chunk │
│ user re-asks same question         │
└────────────────────────────────────┘

These are not perfect truth tests. But they are useful vitals monitors.

Worked example. A support bot answers, "Your refund was sent on 12 March." Checks run after generation.

  • Does order system show a refund event?
  • Does the date exist in the order timeline?
  • Did the answer cite the fetched order object?

If all three fail, the answer should not ship automatically. This is where the triage desk and senior doctor meet. The machine detects suspicion. A human decides high-stakes exceptions.
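
The three checks from this worked example can be sketched as follows. The `Order` type, the ISO date strings, and the function names are hypothetical. The hold threshold is left as a parameter: the default of three failures matches the rule stated above, while a stricter deployment might hold on a single failed check.

```python
from dataclasses import dataclass

@dataclass
class Order:
    order_id: str
    events: list[tuple[str, str]]  # (event_type, ISO date), illustrative shape

def refund_claim_checks(
    order: Order, claimed_date: str, cited_order_ids: set[str]
) -> dict[str, bool]:
    """Run the three proxy checks; True means the check passed."""
    return {
        "refund_event_exists": any(kind == "refund" for kind, _ in order.events),
        "date_in_timeline": any(date == claimed_date for _, date in order.events),
        "cites_order_object": order.order_id in cited_order_ids,
    }

def should_hold(checks: dict[str, bool], min_failures: int = 3) -> bool:
    """Hold the answer for review once enough proxy checks have failed."""
    failures = sum(1 for passed in checks.values() if not passed)
    return failures >= min_failures
```

These remain proxies. A passing result means the answer is consistent with the order record, not that it is proven true.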

6) Detection must emit action-friendly events

Do not end at logging. Detection without action is only decoration. Every detector should say what happened, where, how severe, and what next action is possible.

Good event card:

detector_event
class = malformed_output
severity = medium
step = synthesis_call
provider = model-x
recoverable = yes
next_action = retry_with_backoff

Bad event card:

error = generation_failed

The difference is practical: The first helps automation. The second creates meetings.
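
The good event card maps naturally onto a small typed record that automation can route on. This is a sketch under assumptions: the class, the `route` function, and the `escalate_to_human` action are hypothetical names, and the escalation rule is one possible policy.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class DetectorEvent:
    event_class: str      # what happened
    severity: str         # how severe: "low", "medium", or "high"
    step: str             # where it happened
    provider: str
    recoverable: bool
    next_action: str      # what automation may try next

def route(event: DetectorEvent) -> str:
    """Pick the action to take; unrecoverable or high-severity events go to a human."""
    if not event.recoverable or event.severity == "high":
        return "escalate_to_human"
    return event.next_action

GOOD_CARD = DetectorEvent(
    event_class="malformed_output",
    severity="medium",
    step="synthesis_call",
    provider="model-x",
    recoverable=True,
    next_action="retry_with_backoff",
)
```

With the good card, `route(GOOD_CARD)` returns the retry action without anyone reading a log. The bad card, a bare `generation_failed`, carries no step, severity, or next action, so there is nothing for a router to act on.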


Where this lives in the wild

  • GitHub Copilot — completions platform engineer: detects malformed tool-call arguments and rejects them before any repository-modifying action runs.
  • Perplexity — answer quality engineer: checks every factual answer for citation structure and treats missing grounded evidence as a reliability signal, not just a UX issue.
  • Intercom Fin — conversation operations lead: separates safe policy refusals from false refusals so support teams can tune prompts without weakening safety rules.
  • Cursor — agent runtime engineer: records per-step timeout, schema, and tool-existence failures so an agent edit can fail gracefully before touching the codebase twice.
  • Khanmigo — educational AI safety owner: monitors for off-topic refusals on ordinary tutoring questions because over-refusal harms the learning experience.

Pause and recall

  • Why does one request need transport, format, policy, and quality checks?
  • Why is valid JSON still not enough for a structured AI action?
  • How do you distinguish a healthy refusal from a false refusal?
  • Why should detector outputs include a suggested next action?

Interview Q&A

Q: Why validate semantic constraints after schema validation instead of stopping at parse success?
A: Schema validation confirms shape, but operational safety depends on values making sense in the business domain.
Common wrong answer to avoid: "Because schema validators are too slow." Speed is not the main reason semantic checks exist.

Q: Why should refusals be tagged by reason rather than counted as one bucket?
A: Safety refusals, capability limits, and accidental refusals require different product responses and different owners.
Common wrong answer to avoid: "Because all refusals reduce conversion equally." Their impact and correctness vary sharply.

Q: Why is per-step detection better than only end-to-end API monitoring?
A: End-to-end status can hide inner failures that were masked by fallback or generic messaging, which weakens diagnosis and tuning.
Common wrong answer to avoid: "Because per-step monitoring always reduces latency." It usually adds visibility, not speed.

Q: Why use proxy checks for quality when they are imperfect?
A: Silent failures need early signals, and proxy checks reduce risk even when instant ground truth is unavailable.
Common wrong answer to avoid: "Because proxy checks are equivalent to human judgment." They are useful filters, not full truth.


Apply now (5 min)

Exercise. Pick one LLM workflow. Write four detectors for it: transport, format, policy, and quality. For each, name the event payload and the next recovery action.

Sketch from memory. Draw the layered vitals monitor pipeline. Mark where the triage desk receives signals, and where the senior doctor would be called for high-stakes suspicion.


Bridge. Detection tells us a failure happened. Next we decide whether a second attempt is medicine or poison, which is the job of the retry dose. → 03-retry-backoff.md