01. Opening failure — polished answers, weak evidence¶

~12 min read. Strong generation plus weak retrieval is how demos become disasters.

Builds on the ELI5 in 00-eli5.md. The confidence gate, which decides whether to answer or search again, matters most when the first search looks decent but is actually thin.

1) The wall — when fluent answers outrun evidence¶

Basic RAG gave us a useful baseline: embed the query, retrieve a few chunks, and generate from the result. The earlier advanced-RAG tools — the rewriter, the hypothesis, the multi-step plan, the cross-checker, and the confidence gate — add control points before generation.

The concrete failure here is sharper: a polished model can turn partial evidence into a confident lie. This page follows a retrieval trace with five chunks and a missing quarter table so you can see whether evidence sufficiency checking actually changes the evidence path or just adds another box to the architecture diagram.

The tempting repair is to prompt harder and trust the top-k set. That keeps the system simple, and on easy questions it may be right. It fails on this case: a revenue question needs both Q3 and Q4 figures. Retrieval finds Q4 plus three regional notes. The model can still write a Q3/Q4 comparison, but one side of the comparison was never supported.

Root cause: the evidence path has already lost information before the model writes a sentence. Rule: fluency is not evidence; every answer claim needs retrieved support.

Mini-FAQ. "What is the control point here?" The confidence gate. It is useful only when it creates a real decision: retrieve differently, rank differently, retry, reroute, answer, or refuse.

Consider the head researcher getting five pages back from the shelves. All five pages smell related. Only one page actually answers the question. A basic RAG pipeline cannot feel that difference. It sees “related enough” and moves on. That is the opening failure. The generator is strong, so it can turn weak evidence into smooth prose. That polish hides the retrieval mistake. The system sounds senior while thinking like an intern. Without the confidence gate, nobody asks whether the support is actually complete.

user question
     │
     ▼
┌──────────────┐
│ one retrieval│
└──────┬───────┘
       │
       ▼
 related chunks
       │
       ├── enough evidence ──→ maybe correct
       │
       └── partial evidence ─→ fluent nonsense

The bad answer is usually not random. It is built from nearby truth plus missing pieces. That makes it feel trustworthy.

3) Why basic RAG plateaus¶

Basic RAG does four cheap things. It embeds the raw question once. It retrieves the top few chunks once. It stuffs them into the prompt once. Then it answers once. That is fast. That is also fragile. Users rarely ask neat benchmark questions. They ask mixed questions with constraints, aliases, and comparisons. Retrieval misses one hidden part.

Generation fills the gap with style.

So the ceiling arrives early.

You improve the model prompt a little.

You improve chunking a little.

Still the same pattern returns.

Weak retrieval keeps leaking into strong generation.
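
The four one-shot steps above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not a real pipeline: `embed` is a toy bag-of-words counter standing in for an embedding model, `generate` is a stub standing in for an LLM call, and every function name here is hypothetical.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    # Toy stand-in for an embedding model: lowercase bag of words.
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, chunks: list[str], k: int = 2) -> list[str]:
    q = embed(query)  # embed the raw question once
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]  # retrieve the top few chunks once


def generate(prompt: str) -> str:
    # Stub for an LLM call (assumption); a real generator writes fluent prose.
    return f"ANSWER based on {prompt.count('[chunk]')} chunks"


def basic_rag(query: str, chunks: list[str], k: int = 2) -> str:
    top = retrieve(query, chunks, k)
    # Stuff the chunks into the prompt once.
    prompt = "\n".join(f"[chunk] {c}" for c in top) + f"\nQuestion: {query}"
    # Answer once. Nothing here asks whether a needed fact is missing.
    return generate(prompt)
```

Note what the sketch lacks: there is no step between `retrieve` and `generate` that could notice a gap, so whatever comes back is treated as sufficient.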

4) The worked example — trace the intermediate state¶

Question:

“Compare Q3 and Q4 revenue growth across all regions.”

Suppose the retriever returns these five chunks.

D1 — Q4 all-region summary — similarity 0.88
D2 — APAC Q3 table — similarity 0.84
D3 — EMEA sales note — similarity 0.82
D4 — annual CEO letter — similarity 0.80
D5 — LATAM cost report — similarity 0.77

Now see the hidden gap.

North America Q3 is missing.

LATAM Q3 is missing because D5 is about cost, not revenue growth.

The model still writes an answer.

Needed facts:
Q3 APAC   ✓ from D2
Q3 EMEA   partial from D3
Q3 LATAM  ✗ missing
Q3 NA     ✗ missing
Q4 APAC   ✓ from D1
Q4 EMEA   ✓ from D1
Q4 LATAM  ✓ from D1
Q4 NA     ✓ from D1

A confident model now summarizes all regions anyway.

It may say LATAM grew 9% because nearby text hinted at improvement.

It may say North America slowed because the CEO letter sounded cautious.

Neither claim was retrieved.

This is the arithmetic of failure.

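Retrieved support coverage = 6 facts with at least partial support out of 8 needed facts.

Coverage = 6 / 8 = 0.75, and that already counts the partial Q3 EMEA fact as supported. Counting only fully supported facts gives 5 / 8 = 0.625.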

Answer completeness sounded like 1.00.

That gap is where confident nonsense is born.

Without the confidence gate, the system never notices the 0.75 support.
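
The coverage arithmetic above can be reproduced directly from the needed-facts table. The sketch below is illustrative only; the status labels (`full`, `partial`, `missing`) and the function names are assumptions introduced here. The 0.75 figure counts the partial Q3 EMEA fact as supported, so a stricter count is shown next to it.

```python
# Support status for each needed fact, copied from the trace above.
NEEDED_FACTS = {
    "Q3 APAC": "full",      # from D2
    "Q3 EMEA": "partial",   # from D3
    "Q3 LATAM": "missing",
    "Q3 NA": "missing",
    "Q4 APAC": "full",      # from D1
    "Q4 EMEA": "full",      # from D1
    "Q4 LATAM": "full",     # from D1
    "Q4 NA": "full",        # from D1
}


def coverage(facts: dict[str, str], count_partial: bool = True) -> float:
    # Fraction of needed facts that have retrieved support.
    accepted = {"full", "partial"} if count_partial else {"full"}
    supported = sum(1 for status in facts.values() if status in accepted)
    return supported / len(facts)


def not_fully_supported(facts: dict[str, str]) -> list[str]:
    # Facts the answer must not state as if they were retrieved.
    return [name for name, status in facts.items() if status != "full"]


print(coverage(NEEDED_FACTS))                       # lenient: partial counts
print(coverage(NEEDED_FACTS, count_partial=False))  # strict: full support only
print(not_fully_supported(NEEDED_FACTS))
```

The list returned by `not_fully_supported` is more useful than the ratio itself, because it names the slots a follow-up search would have to target.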

5) Failure modes — three concrete tries that still fail¶

Failure one. Ask, “What changed for premium users in APAC after the cache rollout?”

The system retrieves global cache notes and premium pricing notes.

APAC-specific latency evidence is missing.

The answer still claims improvement.

Failure two. Ask, “Which contracts expiring next quarter need CFO approval?”

The system retrieves the contract list and the approval policy separately.

It misses the clause saying only renewals above a threshold need approval.

The answer overstates risk.

Failure three. Ask, “Did the Phoenix bug affect the retail app or the partner API?”

The system retrieves a Phoenix marketing campaign document and an old incident log.

Entity ambiguity stays unresolved.

The answer picks one meaning and speaks confidently.

See the pattern.

The failure is not only bad ranking.

It is bad completeness plus overconfident generation.

6) Production rules — what advanced RAG changes first¶

The first upgrade is not a bigger model.

The first upgrade is better control before answering.

You rewrite the query.

You expand or decompose it if needed.

You mix retrieval signals.

You rerank candidates with deeper interaction.

Then the confidence gate checks whether the support really covers the claim.

Advanced RAG accepts a humble rule.

Related is not enough.

Supported is the bar.

So what to do?

Start by fixing the query shape itself.

Many retrieval failures begin there.

That is why the rewriter enters next.
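
Before moving on, here is one way the confidence gate's decision could look in code. It is a sketch under assumptions: the slot dictionary, the `GateDecision` type, and the `max_attempts` default are all hypothetical, and a real gate would need a way to judge support per slot, which is not shown.

```python
from dataclasses import dataclass


@dataclass
class GateDecision:
    action: str          # "answer", "retry", or "abstain"
    missing: list[str]   # slots that still lack retrieved support


def confidence_gate(slots: dict[str, bool], attempts: int,
                    max_attempts: int = 2) -> GateDecision:
    # slots maps each needed fact to whether retrieval supports it.
    missing = [name for name, supported in slots.items() if not supported]
    if not missing:
        return GateDecision("answer", [])
    if attempts < max_attempts:
        # Retry with the missing slots named, so the next search can target them.
        return GateDecision("retry", missing)
    # Budget spent and support still incomplete: refuse rather than guess.
    return GateDecision("abstain", missing)
```

The point of returning `missing` alongside the action is that "related is not enough" becomes checkable: the gate says which fact is unsupported, not just that confidence is low.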

7) Why not bigger prompts or stronger generators under this workload¶

The plausible alternative is bigger prompts or stronger generators. It is attractive because it preserves a smaller pipeline and avoids another operational surface.

That tradeoff is correct for low-risk, obvious queries. It is wrong when a polished model can turn partial evidence into a confident lie. Under that workload, evidence sufficiency checking earns its cost by making the failure inspectable before generation.

Option	Works when	Fails when	Cost moves to
bigger prompts or stronger generators	evidence need is simple	a polished model can turn partial evidence into a confident lie	prompt wording and user trust
evidence sufficiency checking	failure is detectable before generation	checkpoint is unlogged or uncalibrated	retrieval orchestration, latency, evals

Mini-FAQ. "Should every query take this path?" No. The mechanism belongs on routes where the evidence risk justifies the extra latency and debugging surface.
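
A routing rule for "which queries take this path" might start as simply as the sketch below. The keyword sets are invented for illustration and are not a recommended vocabulary; a production router would more likely use a classifier or route metadata.

```python
import re

# Hypothetical term lists; tune or replace them for a real workload.
HIGH_RISK_TERMS = {"revenue", "contract", "approval", "policy", "incident"}
COMPARISON_TERMS = {"compare", "versus", "vs", "across", "between"}


def needs_evidence_check(query: str) -> bool:
    # Send a query through the gated path only if it looks risky or multi-part.
    words = set(re.findall(r"[a-z0-9]+", query.lower()))
    return bool(words & (HIGH_RISK_TERMS | COMPARISON_TERMS))
```

Queries that return `False` keep the cheap one-shot path, so the extra latency is paid only where the evidence risk is plausible.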

8) Production signals — know whether evidence sufficiency checking is working¶

A healthy trace shows which claim has no supporting chunk. The first metric to watch is unsupported-claim rate. Top-1 similarity is a weaker signal because it can improve while a binding constraint is still missing.

The review loop starts with false greens: cases where the system answered, but later inspection found unsupported claims. Those cases reveal whether the checkpoint is protecting the right boundary.

user complaint
   -> retrieval trace
   -> evidence check
   -> answer / retry / route / abstain
   -> false-green review
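
The unsupported-claim rate and the false-green review can both be computed from logged traces. The sketch below assumes a hypothetical trace shape in which each answer claim carries the ids of the chunks that support it; how claims get mapped to chunks is outside the sketch.

```python
def unsupported_claim_rate(claims: list[dict]) -> float:
    # Each claim looks like {"text": ..., "supporting_chunks": [chunk ids]}.
    if not claims:
        return 0.0
    unsupported = sum(1 for c in claims if not c["supporting_chunks"])
    return unsupported / len(claims)


def false_greens(traces: list[dict]) -> list[str]:
    # A false green: the system answered, yet some claim has no support.
    return [
        t["id"]
        for t in traces
        if t["action"] == "answer" and unsupported_claim_rate(t["claims"]) > 0
    ]


traces = [
    {"id": "t1", "action": "answer", "claims": [
        {"text": "APAC Q4 figure", "supporting_chunks": ["D1"]},
        {"text": "LATAM grew 9%", "supporting_chunks": []},
    ]},
    {"id": "t2", "action": "answer", "claims": [
        {"text": "EMEA Q4 figure", "supporting_chunks": ["D1"]},
    ]},
    {"id": "t3", "action": "abstain", "claims": []},
]
print(false_greens(traces))
```

Here only `t1` is flagged, because it answered while the LATAM claim had no supporting chunk.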

9) Boundary — where evidence sufficiency checking helps, hurts, or wastes budget¶

Strong fit: the bad evidence pattern is visible before generation. Weak fit: the corpus is missing the truth, metadata is stale, or the system has no alternative route to take. Pathology: the pipeline keeps retrying because it wants an answer, not because the current evidence revealed a new evidence need.

Scale limit: each checkpoint spends latency, money, logs, and operator attention. Route the mechanism to the queries that need it; do not make every query pay for the hardest query.
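
The retry pathology above suggests a simple guard: retry only while budget remains and the previous attempt actually changed what is missing. This is a sketch with hypothetical names and a hypothetical default budget, not a tuned policy.

```python
from typing import Optional


def should_retry(prev_missing: Optional[set[str]], curr_missing: set[str],
                 attempts: int, max_attempts: int = 3) -> bool:
    if not curr_missing:
        return False  # nothing left to find
    if attempts >= max_attempts:
        return False  # latency and cost budget exhausted
    if prev_missing is not None and curr_missing == prev_missing:
        return False  # the last retry changed nothing, so another is a loop
    return True
```

Passing `None` for `prev_missing` marks the first attempt, where there is no earlier state to compare against.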

10) Wrong model — advanced RAG means adding more retrieval¶

The wrong model says more variants, rerankers, loops, and confidence scores make the system safer by default.

The better model is narrower: each component must relieve one named pressure and create one visible decision. If the confidence gate does not change what the system does, it is decoration.

11) Failure taxonomy for evidence sufficiency checking¶

The checkpoint is not logged, so the failure cannot be replayed.
The threshold came from a demo set and does not match production traffic.
The retry repeats the same weak evidence instead of targeting a missing slot.
The metric improves while one required constraint stays unsupported.
Metadata or routing rules go stale and silently hide the right document.
Extra candidates crowd the prompt with duplicates.
Operators optimize the easy number instead of unsupported-answer rate.
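
The first item in the taxonomy, an unlogged checkpoint, is the cheapest to prevent. One possible shape for a replayable record is sketched below; the field names are assumptions chosen for this example, not a standard schema.

```python
import json


def trace_line(query: str, chunks: list[dict], slots: dict[str, bool],
               threshold: float, action: str) -> str:
    # Serialize one gate decision as a single JSON line for later replay.
    record = {
        "query": query,
        # Keep ids and scores only; chunk text can be re-fetched by id.
        "chunks": [{"id": c["id"], "score": c["score"]} for c in chunks],
        "slots": slots,
        "missing": sorted(name for name, ok in slots.items() if not ok),
        "threshold": threshold,  # logged so a stale threshold is visible later
        "action": action,
    }
    return json.dumps(record, sort_keys=True)
```

Appending one such line per gate decision to a log file gives the false-green review something concrete to replay, including the threshold that was in force at the time.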

12) Pattern transfer — same pressure, different system¶

Caching has the same shape: add the mechanism only when the access pattern earns it.
Observability has the same shape: traces matter because production bugs start as symptoms.
Distributed systems have the same shape: the useful part is the guarantee under failure.
Evals have the same shape: average wins do not matter if false greens leak to users.

13) Design review checklist¶

What exact pressure forced this mechanism to exist?
What artifact proves the mechanism changed retrieval?
Why are bigger prompts or stronger generators weaker on this workload?
Which metric should improve first?
Which cost rises first?
When should the system answer, retry, reroute, or abstain?

Where this lives in the wild¶

Perplexity — weak first-pass retrieval can still lead to polished summaries that hide missing evidence.
Glean — enterprise questions often fail when one hidden business constraint is not retrieved.
Notion AI — doc answers look fluent even when one page from the workspace never surfaced.
Datadog incident assistants — postmortem summaries break when one timeline fragment is absent.
Intercom support bots — policy answers sound complete even when the exception clause was missed.
Enterprise policy assistants — use the pattern when user wording hides policy scope, dates, or exception clauses.
Customer support copilots — need the control when a pleasant answer without full evidence can mislead a customer.
Incident copilots — benefit when service names, runbooks, dashboards, and postmortems disagree or use aliases.
Legal document search — requires visible evidence boundaries because one missing clause can reverse the answer.
Healthcare knowledge assistants — need stricter support checks because fluent synthesis is not a safety guarantee.
Financial research tools — use it to separate ticker-like literal anchors from semantic business questions.
Internal engineering search — exposes whether the system found the exact config, RFC, ticket, or deploy note.
Knowledge-base migration projects — reveal stale, duplicate, and missing documents before retrieval quality is blamed.
Evaluation harnesses — turn the mechanism into a measurable false-green and false-red review loop.
Agentic workflows — reuse the same control point before a tool call, follow-up search, or escalation step.

Recall checkpoint¶

Why does strong generation make weak retrieval more dangerous rather than less dangerous?
In the revenue example, which exact facts were missing even though the answer sounded complete?
Why is “related enough” a bad production standard for retrieval?
Which false-green case would you review first when auditing evidence sufficiency checking?
What cost does this mechanism move into retrieval orchestration or evaluation?
When would bigger prompts or stronger generators be acceptable instead?

Interview Q&A¶

Q: Why does basic RAG often plateau even after prompt tuning? A: Because the bottleneck is missing or partial evidence, and prompt tuning cannot invent retrieved support.

Common wrong answer to avoid: "The model just needs a smarter system prompt." — The prompt can shape tone, but it cannot recover facts that never reached context.

Q: What is the concrete danger of partial retrieval? A: The model fills unsupported gaps with plausible continuation, so the answer looks complete even when support coverage is incomplete.

Common wrong answer to avoid: "Partial retrieval is fine because the model can infer the rest." — In production, unsupported inference is exactly the risk.

Q: Why is a confidence gate a retrieval concept, not only a generation concept? A: Because it judges whether the retrieved set covers the needed claims before the answer is trusted.

Common wrong answer to avoid: "Confidence is only the model's own feeling." — Good systems compute confidence from evidence quality, coverage, and consistency.

Q: What trace would you inspect first when evidence sufficiency checking fails? A: Start with the retrieval trace, here the five chunks and the missing quarter table. Then compare it with the final answer claims.

Common wrong answer to avoid: "Start by rewriting the final prompt." — The prompt may be fine. The evidence path may already be broken.

Q: What cost does this add? A: Latency, logging, evaluation work, and one more place operators must understand.

Common wrong answer to avoid: "It is free because it is only another model call." — Model calls spend money, time, and failure budget.

Q: When should you skip it? A: Skip it when the question is low-risk, the evidence need is obvious, and the decision would not change.

Common wrong answer to avoid: "Never skip advanced components." — Advanced components are controls, not trophies.

Apply now (10 min)¶

Take one hard business question and list every fact that must be present for a safe answer.
Sketch from memory: draw the flow where one missing fact becomes a polished but unsupported claim.
Reproduce from memory: explain opening evidence failure in five sentences, including the pressure, mechanism, alternative, metric, and boundary.

What you should remember¶

Evidence sufficiency checking exists because a polished model can turn partial evidence into a confident lie. The mechanism is not valuable because it sounds advanced; it is valuable when it changes the evidence path before generation.

The artifact to inspect is a retrieval trace with five chunks and a missing quarter table. If that artifact does not explain why retrieval, ranking, retry, routing, or refusal changed, the component is not carrying its operational weight.

Remember:

Fluency is not evidence; every answer claim needs retrieved support.
Inspect a retrieval trace with five chunks and a missing quarter table before blaming the final prompt.
Prefer visible evidence controls over hidden prompt hope.
Review false greens before celebrating average retrieval scores.
Every advanced RAG component should relieve one pressure and create one decision.

Bridge. If the first search misses key constraints, the cheapest rescue is to reshape the question itself before retrieval. That is why query rewriting comes next. → 02-query-rewriting.md