13. Honest admission — what we still do not know how to guarantee¶

~15 min read. Strong teams sound confident about process but stay honest about the limits of current AI reliability methods.

Built on the ELI5 in 00-eli5.md. Even with the triage desk, vitals monitor, sealed ward, and senior doctor, some reliability gaps remain fundamentally hard because AI behavior can look healthy while hiding failure.

1) Silent failures remain the hardest class¶

We can detect timeouts well. We can count malformed JSON. We can trip breakers on repeated 503s. But silent failures still evade us.

observable failure      hidden failure
┌───────────────────┐   ┌────────────────────────┐
│ timeout           │   │ fluent wrong answer    │
│ parse error       │   │ stale but plausible    │
│ missing tool ack  │   │ wrong entity selected  │
└───────────────────┘   └────────────────────────┘

The simple version: The machine can often see syntax faster than truth. That is the heart of the problem. The vitals monitor can be rich, and still miss the most expensive mistake. For example, a support assistant answers, "Your refund was approved yesterday." Every infrastructure metric is green. JSON is valid. Latency is normal. But the order belongs to a different customer.

Silent failure happened. Reliable truth detection for such cases is still hard.
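One narrow slice of this problem can be checked mechanically: whether the record an answer was grounded on belongs to the person asking. The sketch below is a minimal illustration, not a general truth detector. The names `DraftAnswer`, `owner_mismatch`, and the `ORDER_OWNER` lookup are hypothetical, and it assumes the assistant records which order it used.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DraftAnswer:
    text: str
    order_id: str  # hypothetical: the order the answer was grounded on


def owner_mismatch(draft: DraftAnswer, requester_id: str,
                   order_owner: dict[str, str]) -> bool:
    """Return True when the cited order does not belong to the requester.

    Unknown orders also count as a mismatch, so they are never auto-sent.
    """
    return order_owner.get(draft.order_id) != requester_id


# Hypothetical data: order "o-17" belongs to customer "c-2", not to "c-1".
ORDER_OWNER = {"o-17": "c-2", "o-18": "c-1"}

draft = DraftAnswer("Your refund was approved yesterday.", order_id="o-17")
blocked = owner_mismatch(draft, requester_id="c-1", order_owner=ORDER_OWNER)
```

Here `blocked` is `True` even though the text is fluent and the JSON would be valid. The check only covers entity ownership; a stale or factually wrong answer about the right order would still pass it.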

2) Eval scores and production reliability are not the same thing¶

Now what is the common illusion? A team gets strong benchmark scores, then assumes production reliability is strong too. That does not follow. Offline evals often use clean prompts, known datasets, and stable conditions. Production has:

messy user inputs,
changing tools,
partial outages,
stale memory,
adversarial edge cases,
time pressure.

offline eval
clean case → measured score

production
messy case + dependencies + users + delay → real reliability

The gap is practical: An eval can show model capability. Reliability needs system behavior under messy conditions. For example, a tool-calling agent scores 92% on offline tasks. In production, duplicate tool calls create costly side effects during network jitter. The benchmark never exercised that path. So the eval was not wrong. It was incomplete. The triage desk for live operations needs broader evidence than lab scores.
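The duplicate tool call case is one of the few gaps with a well-known control: idempotency keys. The sketch below is a minimal in-memory illustration under stated assumptions. `IdempotentExecutor`, `issue_refund`, and the key format are hypothetical, and a real system would need a durable, concurrency-safe store rather than a dict.

```python
from typing import Callable


class IdempotentExecutor:
    """Run a side-effecting action at most once per idempotency key."""

    def __init__(self) -> None:
        # In-memory only; assumed single process for illustration.
        self._results: dict[str, object] = {}

    def run(self, key: str, action: Callable[[], object]) -> object:
        if key in self._results:
            # A retry with the same key returns the stored result
            # instead of repeating the side effect.
            return self._results[key]
        result = action()
        self._results[key] = result
        return result


side_effects: list[str] = []


def issue_refund() -> str:
    # Hypothetical tool call with a costly side effect.
    side_effects.append("refund")
    return "refund-ok"


executor = IdempotentExecutor()
first = executor.run("refund:order-18", issue_refund)
retry = executor.run("refund:order-18", issue_refund)  # simulated retry after jitter
```

After both calls, `side_effects` holds a single entry and both calls return the same result. An offline eval would only catch the duplicate if it replays retries, which is the point of the paragraph above.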

3) Confidence remains poorly calibrated in many AI workflows¶

Teams want a clean threshold. If confidence > 0.8, auto-handle. Otherwise, escalate. The production problem: confidence in LLM systems is often indirect, unstable, or domain-shifted.

same model score = 0.82
case A: harmless summary → maybe okay
case B: refund decision   → maybe not okay

The simple version: Confidence is context-sensitive. It can drift after prompt changes, retrieval changes, or model upgrades. For example, a classifier feeding an agent was well-calibrated last month. Then the prompt changed to be more terse. Now the score distribution has shifted, but the thresholds are unchanged. Escalations drop. Missed risky cases rise.

The senior doctor rule became weaker without anyone noticing.
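One way to notice this kind of drift is to recompute, on a small labeled sample, how much traffic clears the threshold and how often those cases were actually right. The sketch below assumes such a sample exists. `auto_handle_stats` and the two sample lists are hypothetical, made-up illustrations rather than measured data.

```python
def auto_handle_stats(samples: list[tuple[float, bool]],
                      threshold: float) -> tuple[float, float]:
    """Return (share of cases auto-handled, accuracy among auto-handled).

    Each sample is (confidence score, whether the output was correct).
    """
    auto = [correct for score, correct in samples if score > threshold]
    if not auto:
        return 0.0, 0.0
    return len(auto) / len(samples), sum(auto) / len(auto)


# Hypothetical labeled samples before and after a prompt change.
last_month = [(0.90, True), (0.85, True), (0.70, False), (0.60, False)]
after_change = [(0.95, True), (0.90, False), (0.88, False), (0.85, True)]

before = auto_handle_stats(last_month, threshold=0.8)
after = auto_handle_stats(after_change, threshold=0.8)
```

In this made-up data the auto-handled share rises from half to all of the cases while accuracy among them falls from 1.0 to 0.5, with the threshold untouched. Rerunning a check like this after every prompt, retrieval, or model change is a mitigation, not a calibration guarantee.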

4) Fallback quality is still hard to reason about under stress¶

Fallbacks sound reassuring. But their real behavior during incidents can surprise teams. A smaller model may be weaker on exactly the cases that arrive during outages. A cached answer may be stale. A human queue may saturate.

fallback looks good in drills
      │
      └── real outage creates different traffic mix
               │
               └── fallback quality drops unexpectedly

Fallback evaluation is always partly conditional. It depends on traffic mix, incident type, and what users do under degraded experience. For example, a product falls back to a smaller model for general chat. During a billing incident, users ask more account-specific questions than usual. The fallback is much weaker there. Complaint rate spikes. The original fallback scorecard was too generic.

That is a real open challenge.
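The dependence on traffic mix can be made concrete by reweighting a per-slice scorecard. This is a sketch under assumptions: the slice names, quality numbers, and mixes are hypothetical, and `expected_quality` is an illustrative helper, not a standard metric.

```python
def expected_quality(quality_by_slice: dict[str, float],
                     traffic_mix: dict[str, float]) -> float:
    """Weight per-slice fallback quality by the share of traffic in each slice."""
    return sum(share * quality_by_slice[name]
               for name, share in traffic_mix.items())


# Hypothetical scorecard for the smaller fallback model.
fallback_quality = {"general_chat": 0.9, "account_specific": 0.5}

# Hypothetical traffic mixes: a routine drill versus a billing incident.
drill_mix = {"general_chat": 0.9, "account_specific": 0.1}
incident_mix = {"general_chat": 0.4, "account_specific": 0.6}

drill_score = expected_quality(fallback_quality, drill_mix)
incident_score = expected_quality(fallback_quality, incident_mix)
```

With these made-up numbers the same fallback scores about 0.86 under the drill mix and about 0.66 under the incident mix. The model did not change; only the traffic did. The hard part remains predicting the incident mix ahead of time.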

5) Causal diagnosis in AI incidents is messy¶

Now another hard truth. Many incidents have multiple contributing factors. Prompt update, provider latency, retrieval freshness issue, and queue contention may all matter.

incident graph
prompt change ─┐
provider lag ──┼──→ user-visible wrong answers
stale index ───┤
retry storm ───┘

The simple version: Post-mortems want one root cause. Reality often has interacting causes. For example, a research assistant gives wrong citations. Why?

reranker degraded,
fallback model ignored citation format more often,
incident response kept synthesis alive without verification.

Which one is the root cause? All matter. Sociotechnical honesty matters here. The vitals monitor may show symptoms, but causal certainty can remain partial.

6) Humans are not infinite fallbacks¶

Teams sometimes say, "If AI fails, humans will handle it." That is incomplete thinking. Human queues have limits. Humans get fatigued. Humans disagree. Humans may trust bad machine drafts too much.

AI incident
   │
   ▼
more human escalations
   │
   ▼
queue grows
   │
   ▼
review quality drops

The senior doctor is not magic. Human fallback itself needs capacity planning, training, and audit. For example, a support bot incident doubles review volume. Average handle time rises. Reviewers start rubber-stamping drafts. Escalation exists, but reliability still drops. This remains hard to model well.
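A rough capacity check makes the queue problem visible before the incident does. The sketch below uses a simple utilization ratio, work arriving divided by work the reviewers can complete. The function name and all numbers are hypothetical, and it ignores fatigue, variance, and shift patterns.

```python
def reviewer_utilization(cases_per_hour: float, minutes_per_case: float,
                         reviewers: int) -> float:
    """Fraction of total reviewer time demanded by incoming escalations.

    Values above 1.0 mean the queue grows for as long as the load lasts.
    """
    return cases_per_hour * minutes_per_case / (60.0 * reviewers)


# Hypothetical numbers: a normal day versus an incident that doubles
# review volume and also raises average handle time.
normal = reviewer_utilization(cases_per_hour=60, minutes_per_case=4, reviewers=5)
incident = reviewer_utilization(cases_per_hour=120, minutes_per_case=5, reviewers=5)
```

In this example utilization goes from about 0.8 to about 2.0, so the team would need to finish reviews in less than half the time to keep up. That pressure is where rubber-stamping tends to start. The ratio says nothing about review quality, which is the part that stays hard to model.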

7) Reliability for open-ended generation still lacks strong guarantees¶

Some domains are narrow. Policy answer quality can be checked against sources. But open-ended planning, creative brainstorming, or long autonomous coding remain less predictable.

narrow task                open-ended task
structured validation      fuzzy validation
clear success criteria     ambiguous success criteria

The production problem: You can monitor obvious failures, but strong guarantees are rare. For example, an autonomous coding agent may pass tests, yet still make the architecture worse.

Was that a reliability failure? Maybe. But the target itself is partly subjective. This is why some AI systems remain better as copilots than full autopilots. The stability kit may simply be the honest boundary.

8) What to say honestly in an interview¶

A strong answer sounds like this. We know how to improve AI reliability materially. We do not know how to guarantee truth, calibration, or safe autonomy across all open-ended settings. We combine:

layered detection,
risk-based routing,
strict side-effect controls,
human escalation,
rollback,
chaos drills,
incident learning.

But we remain humble about silent failures, eval gaps, and changing distributions. That is the right tone. Not pessimistic. Not fake certainty.

Where this lives in the wild¶

GitHub Copilot — staff engineer discussing autonomous coding limits: may admit that passing tests does not fully guarantee architectural correctness or long-term maintainability.
Intercom Fin — support AI leader: can show strong policy-grounded reliability, while still admitting that edge-case silent failures remain possible in long-tail conversations.
Perplexity — answer quality owner: can measure citation and freshness signals well, but still cannot perfectly guarantee that every synthesized conclusion is fully faithful.
Klarna assistant — payments risk engineer: can enforce strict action gates, yet still must admit that human review queues and model confidence calibration remain imperfect safeguards.
Healthcare AI product teams — clinical safety manager: often acknowledge that safe triage support is possible, while full diagnostic certainty remains outside present guarantees.

Pause and recall¶

Why are silent failures still the hardest AI reliability problem?
Why can strong offline eval scores still coexist with weak production reliability?
Why is confidence thresholding still fragile in many real systems?
Why is "humans will catch it" not a complete reliability strategy?

Interview Q&A¶

Q: Why is production reliability broader than offline evaluation quality?
A: Production includes dependency failures, messy inputs, traffic shifts, and side-effect risks that most offline evals do not exercise.
Common wrong answer to avoid: "Because offline evals are useless." They are useful, just insufficient alone.

Q: Why are silent failures harder than loud failures even with strong observability?
A: Observability can measure many symptoms, but truth and appropriateness are often only partially observable at response time.
Common wrong answer to avoid: "Because silent failures are always rare edge cases." They can be central operational risks.

Q: Why is human fallback not a universal guarantee?
A: Human capacity, consistency, and susceptibility to automation bias all limit how much risk human review can absorb.
Common wrong answer to avoid: "Because humans are slower than models." Speed matters, but capacity and judgment quality matter more.

Q: Why should senior engineers speak carefully about AI guarantees?
A: Overclaiming certainty creates unsafe designs and weak incident preparedness, while honest scope boundaries improve trust and decision-making.
Common wrong answer to avoid: "Because legal teams prefer vague language." The reason is technical and operational honesty.

Apply now (5 min)¶

Exercise. Write three things your AI system can guarantee reasonably well, and three things it cannot honestly guarantee yet. For each weak area, write one mitigation rather than pretending certainty.

Sketch from memory. Draw two columns: known controls and open gaps. Place the vitals monitor, sealed ward, senior doctor, and stability kit on the control side, and list silent failure, calibration drift, and eval gap on the open-gap side.

Bridge. You now have the reliability lens. The next pressure is cost and latency: once the system can fail safely, how do we keep it affordable and fast under real traffic? → ../05_agent_performance_economics/00-eli5.md