14. Honest admission — what RAG does not solve¶
~12 min read. The previous 13 chapters showed you how RAG works. This one shows you where it stops working, and how to tell the difference at 2 AM when a customer is shouting.
Built on the ELI5 in 00-eli5.md. The reading desk, the limited space in front of the writer, is the constraint that never goes away. Retrieval alone has real limits, and experienced engineers stop pretending otherwise.
First — respect what basic RAG does solve¶
Before we list the failures, name the wins. Otherwise the chapter reads like a takedown.
Basic RAG is a strong solution for a specific shape of problem. If the question is direct, and the answer lives in one or two chunks, the system works well. Policy lookup. FAQ deflection. "What was our Q3 revenue?" against a quarterly report. "How do I configure SSO?" against your docs. "What's our refund policy for orders over 30 days?" — the running example from earlier chapters — works on the day you launch.
This is real value. Customer support deflection rates of 40% to 70% are commonly reported after a competent RAG rollout. Teams behind products like Glean, Notion AI, and GitHub Copilot ship retrieval-backed features that genuinely save engineers minutes per query. None of that is fake.
But then the harder questions arrive. The cross-document ones. The reasoning ones. The "we need to think about this" ones. That is where this chapter lives.
The honest list — five things RAG does not solve¶
Here is the centrepiece. Memorise it.
┌─────────────────────────────────────────────────────────┐
│ RAG does NOT solve                                      │
├─────────────────────────────────────────────────────────┤
│ 1. Multi-hop reasoning across chunks                    │
│ 2. Hard reasoning (math, logic, constraints, planning)  │
│ 3. Stale source data                                    │
│ 4. Vague, adversarial, or false-premise questions       │
│ 5. Evaluation completeness                              │
└─────────────────────────────────────────────────────────┘
We will walk each one with a concrete failure, the root cause, and the upgrade path.
1) Multi-hop questions¶
Here is the trap. "Which team lead was responsible for the Q3 refund policy change that affected enterprise customers?" It looks like one question. It is four.
               ┌────────────────────────┐
               │     user question      │
               └───────────┬────────────┘
                           │
       ┌───────────────────┼─────────────────────┐
       ▼                   ▼                     ▼
┌──────────────┐    ┌──────────────┐    ┌──────────────────┐
│ what policy  │    │ which        │    │ which team       │
│ change was   │    │ customer     │    │ owned the change?│
│ made in Q3?  │    │ segment was  │    │                  │
│              │    │ affected?    │    │                  │
└──────┬───────┘    └──────┬───────┘    └────────┬─────────┘
       │                   │                     │
       └───────────────────┼─────────────────────┘
                           ▼
                ┌──────────────────────┐
                │ who was the lead     │
                │ of that team in Q3?  │
                └──────────┬───────────┘
                           ▼
                  ┌─────────────────┐
                  │  final answer   │
                  └─────────────────┘
Each box needs its own retrieval. Naive RAG fires one embedding for the whole question, pulls the top-k from one shot, and hopes the answer is in there. Usually it is not — the four facts live in four different chunks that the single embedding cannot pull together.
Failure mode. The single-shot embedding averages across four sub-topics. The averaged vector lands in a generic neighbourhood. Top-k returns generic chunks. The model answers from the generic chunks, confidently and wrongly.
Upgrade path. Decompose the query into sub-questions, retrieve per sub-question, then synthesise. This is the core move of iterative and agentic RAG (covered in 09_advanced_rag_patterns/).
Mini-FAQ. "Can't a longer context window fix this?" No. A longer context lets the model read more chunks, but it does not help retrieval find the right four chunks in the first place. You can have a 1M-token window and still miss the policy chunk if your single-shot query never retrieved it. Retrieval is the bottleneck, not capacity.
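Sketch. The decompose-then-retrieve loop fits in a few lines of Python. This is a minimal sketch under loud assumptions: the corpus, the chunk ids, the name in the org chart, and the hand-written sub-questions are all hypothetical, and `retrieve` is a toy word-overlap ranker standing in for real vector search. In a real system a model would propose the sub-questions, and later ones would usually depend on earlier answers.

```python
# Hypothetical toy corpus: one fact per chunk, mirroring the four boxes above.
CORPUS = {
    "policy-log": "Q3 policy change: the refund window was shortened to 14 days.",
    "segments": "The shortened refund window affected enterprise customers.",
    "ownership": "The billing team owned the refund window change.",
    "org-chart": "Dana Reyes was lead of the billing team in Q3.",
}

# Hand-written sub-questions (an assumption; normally produced by a model).
SUB_QUESTIONS = [
    "What refund policy change was made in Q3?",
    "Which customers were affected by the shortened refund window?",
    "Which team owned the refund window change?",
    "Who was lead of the billing team in Q3?",
]


def tokens(text):
    """Lowercased word set with surrounding punctuation stripped."""
    return {w.strip(".,:?").lower() for w in text.split()}


def retrieve(query, k=1):
    """Rank chunk ids by word overlap with the query (stand-in for vector search)."""
    q = tokens(query)
    ranked = sorted(CORPUS, key=lambda cid: len(q & tokens(CORPUS[cid])), reverse=True)
    return ranked[:k]


def retrieve_decomposed(sub_questions, k=1):
    """Run one retrieval per sub-question and collect unique chunk ids in order."""
    evidence = []
    for sub_question in sub_questions:
        for chunk_id in retrieve(sub_question, k):
            if chunk_id not in evidence:
                evidence.append(chunk_id)
    return evidence


print(retrieve_decomposed(SUB_QUESTIONS))
# ['policy-log', 'segments', 'ownership', 'org-chart']
```

Each sub-question gets its own query, so each box in the diagram gets its own shot at the index. The synthesis step, where the writer combines the four chunks into one answer, is deliberately left out.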
2) Hard reasoning — retrieval supplies evidence, not thought¶
Retrieve five pricing chunks. Then ask: "Given a usage of 12,000 API calls, 2 TB storage, and 4 active seats, which plan is cheapest under our policy that storage cannot exceed 60% of plan cost?"
The librarian can put all five plans on the desk. The writer still has to do the math, apply the constraint, and pick a winner. That is reasoning, and retrieval does not do it for you.
┌──────────────────────────────────────────────┐
│ Evidence → Required reasoning → Answer │
├──────────────────────────────────────────────┤
│ Plan A: $99 + storage limits... │
│ Plan B: $149 + better storage... [math] │
│ Plan C: $299 + unlimited... │
└──────────────────────────────────────────────┘
Failure mode. The model picks the named answer (e.g., whatever plan is mentioned most fluently in the chunks) rather than the computed answer.
Upgrade path. Tool use. Give the model a calculator. Or a small Python sandbox. Or a structured plan-comparison function. The reasoning then happens in code, not in the model's head. (Covered in modules 01_agentic_system_design/ and 17_schema_driven_generation/.)
Mini-FAQ. "What about reasoning models like o3 or DeepSeek-R1 — don't they solve this?" They help, but they do not change the architecture lesson. A reasoning model with weak retrieval still fails on the wrong chunks. A reasoning model with strong retrieval is a sharper instance of the same pipeline. The retrieval upgrade and the reasoning upgrade are orthogonal.
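Sketch. A structured plan-comparison function might look like the following. Everything numeric here is hypothetical: only the $99 / $149 / $299 base fees echo the box above, and the per-seat, per-call, and per-TB prices are invented for illustration. The sketch also assumes one reading of the policy, namely that the storage charge may not exceed 60% of the plan's total monthly cost.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Plan:
    name: str
    base_usd: float          # flat monthly fee
    per_seat_usd: float      # per active seat
    per_1k_calls_usd: float  # per 1,000 API calls
    per_tb_usd: float        # per TB of storage


# Hypothetical price sheet; in the pipeline these values would be parsed
# from the retrieved pricing chunks.
PLANS = [
    Plan("A", 99, 5, 1.0, 100),
    Plan("B", 149, 10, 1.0, 80),
    Plan("C", 299, 20, 0.0, 0),
]


def monthly_cost(plan, calls, storage_tb, seats):
    """Return (total, storage_part) in USD for one month of usage."""
    storage = plan.per_tb_usd * storage_tb
    total = (
        plan.base_usd
        + plan.per_seat_usd * seats
        + plan.per_1k_calls_usd * calls / 1000
        + storage
    )
    return total, storage


def cheapest_plan(plans, calls, storage_tb, seats, max_storage_share=0.60):
    """Cheapest plan whose storage charge stays within the allowed share of total cost."""
    eligible = []
    for plan in plans:
        total, storage = monthly_cost(plan, calls, storage_tb, seats)
        if storage <= max_storage_share * total:
            eligible.append((total, plan.name))
    if not eligible:
        return None
    total, name = min(eligible)
    return name, total


print(cheapest_plan(PLANS, calls=12_000, storage_tb=2, seats=4))
# ('B', 361.0)
```

With these made-up prices, Plan A is the cheapest on raw cost ($331) but its storage charge is just over 60% of the total, so the constraint rules it out and Plan B wins. That is exactly the kind of answer a model tends to get wrong when it picks the most fluently described plan instead of computing.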
3) Stale source data¶
The librarian retrieves outdated chunks beautifully. The answer is wrong anyway.
This failure is so common that it has a name in the field: grounded-and-wrong. The citation looks correct. The source exists. The text in the source is what the model paraphrased. But the source was written 18 months ago and the policy changed last quarter.
┌────────────────────────────┐   ┌────────────────────────┐
│ Old policy doc (cached)    │   │ Reality (last quarter) │
│ "30 day refund window"     │   │ "14 day refund window" │
└──────────────┬─────────────┘   └────────────────────────┘
               │
               ▼
     ┌───────────────────┐
     │ User asks. Model  │
     │ confidently cites │
     │ the cached doc.   │
     └───────────────────┘
Failure mode. Index freshness lags reality. The chatbot is more confident than your CRM.
Upgrade path. Run re-indexing pipelines on a schedule. Track per-document last_modified. Add freshness as a retrieval filter or a re-rank feature. For news-shaped data, retrieve through a live search API instead of a cached vector store. (See 03_agent_observability_debugging/ for staleness monitoring.)
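Sketch. Freshness as a filter plus a re-rank feature can be illustrated as below. This is a minimal sketch, not a recommended scoring formula: the `Hit` record, the document ids, the one-year cutoff, and the 90-day half-life are all assumptions chosen for the example. Only the idea of tracking `last_modified` per document comes from the text.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class Hit:
    doc_id: str
    score: float             # similarity from the retriever, higher is better
    last_modified: datetime  # tracked per document at index time


def apply_freshness(hits, now, max_age_days=365, half_life_days=90):
    """Drop hits older than max_age_days, then decay the rest by age and re-rank."""
    rescored = []
    for hit in hits:
        age_days = (now - hit.last_modified).days
        if age_days > max_age_days:
            continue  # hard filter: too old to cite at all
        decay = 0.5 ** (age_days / half_life_days)  # halves every half_life_days
        rescored.append((hit.score * decay, hit.doc_id))
    return [doc_id for _, doc_id in sorted(rescored, reverse=True)]


now = datetime(2025, 1, 1, tzinfo=timezone.utc)
hits = [
    Hit("refund-policy-v0", 0.95, now - timedelta(days=540)),  # 18 months old
    Hit("refund-policy-v1", 0.92, now - timedelta(days=300)),
    Hit("refund-policy-v2", 0.85, now - timedelta(days=30)),
]
print(apply_freshness(hits, now))
# ['refund-policy-v2', 'refund-policy-v1']
```

The 18-month-old document had the best similarity score and is dropped anyway. The current version outranks the older one despite scoring lower on similarity. Whether a hard cutoff or a soft decay is right depends on how fast your corpus changes.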
4) Vague, adversarial, or false-premise questions¶
The user does not always ask well. They mix entities. They use the wrong term. They ask leading questions. They ask from a false premise.
| User says | What they mean | What naive RAG does |
|---|---|---|
| "the 30 days one" | refund policy after 30 days | retrieves some 30-day chunk |
| "is it true our SLA is 99.999%?" | (false premise — SLA is 99.9%) | finds 99.9% chunk, blends with the user's number, hedges |
| "what did Sam say about pricing last week?" | (no Sam in our company) | finds some "Sam" + some pricing chunk |
| "compare our plan to theirs" | (no "theirs" in our corpus) | retrieves our plan, invents the comparison |
Failure mode. Retrieval cannot reject a question. It always finds something. The model then dresses up the something as an answer.
Upgrade path. Query understanding upgrades — clarification, reference resolution, scope checks. (Covered in 09-query-and-retrieval.md earlier in this module.) Also, refusal rules in the prompt: "If the user's premise is not present in the context, ask for clarification instead of answering." (See 11-prompt-augmentation.md.)
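Sketch. One narrow scope check, catching a numeric premise that the retrieved context does not support, can be written with a regular expression. This is a deliberately small sketch: it only looks at percentages, it compares them as strings (so "99.9%" and "99.90%" would not match), and the action labels are hypothetical names for whatever your pipeline does next.

```python
import re

# Matches figures such as "99.9%" or "14%".
PERCENT = re.compile(r"\d+(?:\.\d+)?%")


def check_numeric_premise(question, context_chunks):
    """Flag percentages the user asserts that no retrieved chunk contains."""
    claimed = set(PERCENT.findall(question))
    supported = set()
    for chunk in context_chunks:
        supported.update(PERCENT.findall(chunk))
    unsupported = sorted(claimed - supported)
    action = "clarify_or_correct" if unsupported else "answer"
    return {"action": action, "unsupported": unsupported, "found": sorted(supported)}


result = check_numeric_premise(
    "is it true our SLA is 99.999%?",
    ["Our uptime SLA is 99.9% per calendar month."],  # hypothetical chunk
)
print(result)
# {'action': 'clarify_or_correct', 'unsupported': ['99.999%'], 'found': ['99.9%']}
```

A check like this runs before generation, so the prompt can be told that the user's figure is unsupported instead of leaving the model to blend the two numbers. Entity checks ("is there a Sam in our directory?") follow the same shape with a different extractor.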
5) Evaluation completeness¶
The hardest failure. No single eval score catches every failure class.
- Faithfulness scores miss the question of whether the answer was even useful.
- Answer relevance scores miss whether the answer was correct.
- Retrieval metrics miss whether the generation used the retrieved evidence.
- LLM-as-judge inherits the judge's biases (and the judge is often the same family as the generator).
A 0.9 RAGAS score does not mean the system is good. It means the measurable parts of the system look good. Production teams discover the unmeasurable parts at 2 AM, when a CEO forwards a screenshot.
Failure mode. Confidence inflation — your dashboards look fine while a long-tail failure class corrodes user trust silently.
Upgrade path. Multi-eval triangulation. Combine RAGAS-style automatic scores + human review on a stratified sample + production error analysis. Track failure classes, not just an aggregate number. (Module 00_ai_evals_release_gates/ goes deep on this.)
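Sketch. The human-review leg of the triangulation depends on a stratified sample, so rare query types are not drowned out by easy FAQ traffic. The snippet below is a minimal sketch under assumptions: the log records, the `query_type` field, and the reviewer labels are all hypothetical.

```python
import random
from collections import Counter, defaultdict


def stratified_sample(records, key, per_stratum, seed=0):
    """Pick up to per_stratum records from every stratum so rare classes get reviewed."""
    rng = random.Random(seed)  # seeded so the review queue is reproducible
    buckets = defaultdict(list)
    for record in records:
        buckets[record[key]].append(record)
    sample = []
    for stratum in sorted(buckets):
        items = buckets[stratum]
        sample.extend(rng.sample(items, min(per_stratum, len(items))))
    return sample


# Hypothetical production log: easy FAQ traffic dominates.
LOG = (
    [{"id": f"faq-{i}", "query_type": "faq"} for i in range(6)]
    + [{"id": f"hop-{i}", "query_type": "multi_hop"} for i in range(3)]
    + [{"id": "math-0", "query_type": "pricing_math"}]
)

review_queue = stratified_sample(LOG, key="query_type", per_stratum=2)
print(Counter(r["query_type"] for r in review_queue))

# After review, tally failure classes instead of averaging one score.
reviewed_labels = ["ok", "ok", "multi_hop", "staleness", "multi_hop"]  # hypothetical
print(Counter(reviewed_labels).most_common())
```

A uniform random sample of this log would be mostly FAQ questions. The stratified queue guarantees the single pricing-math query lands in front of a reviewer.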
Predict the failure taxonomy before reading the tree¶
Before reading further, predict the failure taxonomy — what decision tree would you build to diagnose any RAG failure? Sketch it. Then continue.
The failure-class decision tree¶
When something goes wrong, walk this tree, not your hunches.
A user got a bad answer.
│
├─ Is the needed fact even in the corpus?
│ ├─ no ────► CORPUS FAILURE — RAG cannot answer honestly
│ └─ yes
│
├─ Was the right chunk retrieved into top-k?
│ ├─ no ────► RETRIEVAL FAILURE — fix embeddings / query / hybrid
│ └─ yes
│
├─ Was the right chunk in the *selected* top-n (after rerank)?
│ ├─ no ────► RERANK or SELECTION FAILURE
│ └─ yes
│
├─ Does the answer need facts that span >1 chunk?
│ ├─ yes ────► MULTI-HOP — needs decomposition or iterative RAG
│ └─ no
│
├─ Does the answer need math / logic / planning?
│ ├─ yes ────► REASONING — needs tool use or stronger model
│ └─ no
│
├─ Is the source data current?
│ ├─ no ────► STALENESS — re-index pipeline broken
│ └─ yes
│
├─ Was the question well-formed and answerable?
│ ├─ no ────► QUERY/PREMISE — needs clarification, refusal rule
│ └─ yes
│
└─ Then the failure is at the PROMPT or MODEL layer.
├─ Missing refusal rule? Bad context order?
└─ Weak generator on hard task?
Print this. Pin it next to your monitor. "RAG is broken" is not a diagnosis. The tree gives you one.
Mini-FAQ. "My dashboard shows 95% faithfulness but users are still complaining. Where do I look?" Faithfulness only measures whether the answer is grounded in the retrieved context — not whether the retrieval found the right context, and not whether the question was even answerable. Walk the tree from the top. Most "high faithfulness, angry users" cases are corpus or multi-hop failures dressed up as generation problems.
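Sketch. The tree translates directly into a chain of checks, which is handy for tagging incidents in a failure log. The field names below are hypothetical, and someone still has to answer each question by reading logs and the corpus. The code only enforces the order of the walk.

```python
from dataclasses import dataclass


@dataclass
class Incident:
    # Answers gathered while investigating one bad answer.
    fact_in_corpus: bool = True
    retrieved_in_top_k: bool = True
    selected_after_rerank: bool = True
    spans_multiple_chunks: bool = False
    needs_math_or_logic: bool = False
    source_is_current: bool = True
    question_well_formed: bool = True


def diagnose(incident):
    """Walk the tree top to bottom and return the first failure class that applies."""
    if not incident.fact_in_corpus:
        return "CORPUS"
    if not incident.retrieved_in_top_k:
        return "RETRIEVAL"
    if not incident.selected_after_rerank:
        return "RERANK_OR_SELECTION"
    if incident.spans_multiple_chunks:
        return "MULTI_HOP"
    if incident.needs_math_or_logic:
        return "REASONING"
    if not incident.source_is_current:
        return "STALENESS"
    if not incident.question_well_formed:
        return "QUERY_OR_PREMISE"
    return "PROMPT_OR_MODEL"


print(diagnose(Incident(retrieved_in_top_k=False, source_is_current=False)))
# RETRIEVAL
print(diagnose(Incident()))
# PROMPT_OR_MODEL
```

The first example has two problems, a retrieval miss and a stale source, and the walk reports the upstream one first. Fix that, re-run, and the next class surfaces.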
The good-enough trap¶
A subtle risk worth its own section.
Basic RAG often impresses on day one. The demo lands. Leadership claps. The team moves on. Then improvement stalls — because medium-quality RAG looks great on easy questions.
The hard failures appear later. Cross-document fails. Edge cases fail. Freshness breaks silently. Users over-trust the confident tone — and that trust is what destroys the system when the inevitable wrong answer ships.
| Day 1 | Day 90 | Day 365 |
|---|---|---|
| Demo works on 10 questions | Production sees 100,000 questions | A wrong answer reaches a regulator |
| Metrics are vibes-based | Faithfulness dashboard exists | Failure taxonomy exists; on-call rotation knows the tree |
| "We have RAG" | "We have RAG with re-ranking" | "We have RAG, and here is what it does not do" |
The discipline. Track failure classes, not just aggregate scores. Sample hard queries weekly. Measure regressions. Have the conversation about what your RAG does not solve before a customer has it for you.
Teacher voice. Most teams stop at "the demo worked." Production teams keep a failure log. The failure log is where engineering happens. Without it, you are decorating.
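Sketch. "Track failure classes, not just aggregate scores" can be made concrete with a week-over-week comparison of the failure log. The labels, counts, and the 5-point threshold below are hypothetical, and the sketch assumes both weeks have at least one labelled answer.

```python
from collections import Counter


def class_rates(labels):
    """Share of answers that fall in each class (labels must be non-empty)."""
    counts = Counter(labels)
    return {cls: n / len(labels) for cls, n in counts.items()}


def class_regressions(previous, current, min_increase=0.05, ok_label="ok"):
    """Failure classes whose share of traffic rose by at least min_increase."""
    prev, curr = class_rates(previous), class_rates(current)
    flagged = {}
    for cls, rate in curr.items():
        if cls == ok_label:
            continue
        delta = rate - prev.get(cls, 0.0)
        if delta >= min_increase:
            flagged[cls] = round(delta, 3)
    return flagged


# Hypothetical failure logs: the aggregate pass rate is 90% in both weeks.
last_week = ["ok"] * 90 + ["retrieval"] * 8 + ["staleness"] * 2
this_week = ["ok"] * 90 + ["retrieval"] * 1 + ["staleness"] * 9

print(class_regressions(last_week, this_week))
# {'staleness': 0.07}
```

The aggregate number did not move, so a dashboard built on it stays green. The per-class view shows staleness more than quadrupled, which points straight at the re-index pipeline.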
What comes next — the upgrade path¶
Each of the five failures has a named pattern that addresses it. Together these form the advanced RAG and agentic RAG layers.
| Failure | Upgrade pattern | Module |
|---|---|---|
| Multi-hop | Sub-question decomposition, iterative retrieval | 09_advanced_rag_patterns/ |
| Hard reasoning | Tool use, code execution, calculators | 01_agentic_system_design/ |
| Staleness | Re-index pipelines, live search APIs, freshness filters | 06_evidence_data_pipelines/, 03_agent_observability_debugging/ |
| Vague / adversarial | Query rewrite, HyDE, fan-out, refusal rules | 09-query-and-retrieval.md, 11-prompt-augmentation.md |
| Eval gaps | Multi-eval triangulation, failure-class tracking, human review | 00_ai_evals_release_gates/ |
The point of naming these failures is not pessimism. It is a map. Now you know where to walk next.
The five-failure list across industry post-mortems¶
The five-failure list shows up across the industry. Names change; the shape does not.
- Anthropic — internal eval cookbook explicitly separates "retrieval failure" from "reasoning failure" in model cards.
- OpenAI — Assistants File Search teams document the difference between corpus gaps and retrieval gaps as separate metrics.
- Glean — enterprise search post-mortems track failure-class taxonomies, not just aggregate quality scores.
- Perplexity — citation-mismatch detection is a first-class quality signal; freshness gates routing between cached and live retrieval.
- Cursor / Windsurf — codebase retrieval freshness (post-edit re-indexing) is core engineering; stale embeddings cause "the file has moved" failures.
- GitHub Copilot Chat — multi-file reasoning across a repo is a known weak spot vs single-file completion; explicit upgrade path via agentic loops.
- Notion AI — workspace search teams document how vague queries ("the doc we wrote last week") require conversational rewriting before retrieval.
- Intercom Fin — bot deflection metrics are tracked alongside escalation reasons, which is effectively the failure taxonomy in production.
- Zendesk AI — answer suggestions are paired with a "needs human" signal trained from past cases where automated answers failed.
- Hebbia — finance RAG explicitly invests in multi-hop decomposition because single-shot retrieval cannot answer "what changed between these two filings."
- Harvey — legal AI publishes that contract-conflict reasoning requires structured comparison, not raw retrieval.
- Casetext / CoCounsel — citation accuracy was a launch blocker; un-grounded "legal hallucinations" became newsworthy (Mata v. Avianca, 2023).
- Microsoft Copilot for M365 — Graph-aware retrieval reduces vague-query failures by injecting org context; documented as a major win.
- Salesforce Einstein Copilot — CRM trust layer specifically addresses prompt injection and false-premise attacks at the wrapper, not the model.
- AWS Bedrock Knowledge Bases — observability product specifically markets retrieval failure analysis as a distinct workflow.
- Vertex AI Search — Google explicitly separates "search quality" metrics from "generation quality" metrics in customer dashboards.
- Azure AI Search — semantic ranker exists because hybrid retrieval alone hits a quality wall on enterprise corpora.
- Vectara HHEM — hallucination eval product exists because faithfulness scoring missed real failures in customer deployments.
- Galileo / Patronus AI Lynx — entire startups exist around the "RAG eval is unsolved" problem.
- TruLens / RAGAS / DeepEval — three independent eval frameworks all carry the same disclaimer: aggregate scores are necessary, not sufficient.
- LangSmith / Langfuse / Arize Phoenix — observability platforms exist because production teams need the failure-class view their RAG framework didn't give them.
- Stack Overflow's OverflowAI — chose conservative ground-truth retrieval over generative synthesis specifically to avoid grounded-and-wrong failures on technical content.
- DuckDuckGo / Brave Search AI — both pulled back from aggressive RAG synthesis after early hallucination incidents on news content.
- BloombergGPT, JPMorgan DocLLM, Goldman's internal tools — finance teams document that domain reasoning gaps persist even with strong domain-tuned retrieval.
- Mata v. Avianca (2023) — lawyers used an LLM that confidently cited cases that did not exist; the public failure of un-grounded generation.
- Air Canada (2024) — chatbot promised a bereavement fare refund that did not match policy; tribunal ruled the airline liable. A staleness + reasoning + refusal-rule failure.
These are not edge cases. These are the normal operating regime of production RAG. The teams that ship reliably are the teams that have named their failures.
Recall — failure classes and the diagnosis tree¶
- Why is naive RAG bad at multi-hop questions, even with a perfect retriever?
- What kind of work does the model still have to do that retrieval cannot replace?
- What is the failure pattern called when the citation is correct but the source is outdated?
- What is the good-enough trap, and what discipline prevents it?
- Walk the failure tree from memory — name the eight branches.
- Why is a high faithfulness score insufficient for a production-ready system?
- Which of the five failure classes does tool use address, and which does iterative retrieval address?
Interview Q&A¶
Q1. Does RAG solve hallucination? A. It reduces unsupported generation by grounding answers in evidence. It does not eliminate it. Reasoning errors, multi-hop synthesis errors, and confident citation of stale or wrong sources still produce outputs that read as hallucinations to the user. The mature framing is: RAG reduces some failure classes and exposes others. Common wrong answer to avoid: "Yes, once you add retrieval, hallucinations are gone."
Q2. Why are multi-hop questions difficult for naive RAG? A. The whole question gets averaged into one embedding, which lands in a generic vector neighbourhood. Top-k returns generic chunks rather than the specific four-or-five chunks that each sub-question would have retrieved separately. The fix is query decomposition and iterative retrieval, not a longer context window. Common wrong answer to avoid: "Because embeddings cannot represent long questions."
Q3. What problem do stale indexes create? A. Confident retrieval of outdated evidence. The system looks correct — the citation works, the source exists — but the underlying fact has changed. This is the grounded-and-wrong failure class. It is dangerous precisely because the audit trail looks clean. Common wrong answer to avoid: "Staleness only affects training data, not retrieval systems."
Q4. Why is RAG evaluation still partly unsolved? A. No single automatic metric covers all failure modes — faithfulness misses usefulness, answer relevance misses correctness, retrieval metrics miss generation quality, and LLM-as-judge inherits the judge's biases. Production systems triangulate across multiple metrics plus human-reviewed samples plus failure-class tracking. Common wrong answer to avoid: "RAGAS solved evaluation, so manual review is obsolete."
Q5. A user says 'is it true our SLA is 99.999%?' but the actual SLA is 99.9%. What should the system do? A. Detect the false premise and refuse, or ask for clarification. A naive system retrieves an SLA chunk and either parrots back 99.999% confidently or hedges in a way that confirms the wrong number to the user. The fix is at two layers: query understanding (detect false premise) and prompt rules (refuse if premise is not in the context). Common wrong answer to avoid: "The model will know to correct the user."
Q6. Your dashboard shows 92% faithfulness but customer complaints have doubled this month. What do you check first? A. Walk the failure tree from the top. The corpus may not contain the facts customers are asking about (corpus failure). Or retrieval is finding adjacent-but-wrong chunks that the answers faithfully summarise (retrieval failure looking like generation success). Or the questions changed shape — more multi-hop, more vague — and the system's weak spots are now hot. Faithfulness alone cannot diagnose this. Common wrong answer to avoid: "I would retrain the judge model."
Q7. When is naive single-shot RAG genuinely good enough? A. When the corpus is fresh, the questions are direct, the answer typically lives in one or two chunks, and the cost of an occasional wrong answer is bounded. Internal FAQ deflection, policy lookup, structured-doc QA on stable corpora — these often work well. The trap is generalising "it worked on FAQ" to "it will work for everything." Common wrong answer to avoid: "Naive RAG is never good enough — always use agentic RAG."
Q8. What is the good-enough trap? A. Basic RAG demos well on easy queries, which causes teams to ship and then under-invest in evaluation, failure-class tracking, and the long tail. Six months later, hard failure classes have accumulated unnoticed and trust corrodes. The fix is to track failure classes from day one, not just aggregate quality. Common wrong answer to avoid: "As long as our quality score keeps going up, we are fine."
Apply now (10 min)¶
Step 1 — I will model the exercise. Here is the failure-tree applied to a real bug we might see in our running example.
Bug report: "Chatbot told a customer they could get a full refund on a 45-day-old order. Finance overrode it. Sales is angry."
Walking the tree:
- Corpus? Refund policy doc exists. ✓
- Retrieved? Top-k contained the right doc. ✓
- Selected after rerank? Yes, the doc was in top-3. ✓
- Multi-hop? No, single fact ("orders over 30 days require manager approval"). ✓
- Reasoning? Mild: needs to map "45 days" to "over 30 days." ✓
- Stale data? No, the doc is current. ✓
- Question well-formed? Yes. ✓
- Prompt / model layer? Looking at the prompt, there is no rule that says "if the order is over 30 days, escalate." The model summarised the policy as if it were a yes/no answer.
Diagnosis: prompt layer. Add a refusal-and-escalate rule for the over-30-day case. Not a retraining issue. Not a retrieval issue.
Step 2 — your turn. Take a real bad-answer incident from your product (or invent a plausible one). Walk it through the eight-branch failure tree from this chapter. Name the failure class. Name the upgrade pattern. Write the one metric you would now log to catch this class of failure earlier.
Step 3 — sketch from memory. Redraw the five-failure list and the failure-class decision tree, without looking back. If you can do this cold, you have internalised the chapter.
What you should remember¶
This chapter named what basic RAG does not solve: multi-hop reasoning, hard logic and math, staleness, vague or adversarial queries, and the unsolved-eval gap. The five-failure list is not pessimism — it is the map of what the next module fixes and what the rest of the curriculum builds on. The dangerous failures are no longer "the model invented an answer"; they are "the answer is grounded in adjacent-but-wrong chunks" and "the citation works, the source is stale." Audit trails look clean while the system is wrong.
You also learned the decision tree that turns "RAG is broken" from a complaint into a diagnosis. Walk the tree from corpus to retrieval to rerank to multi-hop to reasoning to staleness to query premise to prompt/model — in that order — and the failure class names itself. Production teams that ship reliably are the teams that track failure classes, not aggregate scores.
Carry this diagnostic forward: when faithfulness is high and users are angry, the failure is upstream of the generator. Most "high faithfulness, angry users" cases are corpus or multi-hop failures dressed up as generation problems. The tree is faster than your hunches.
Remember:
- Faithfulness ≠ truth. Grounded-and-wrong is a real failure class and the dashboard hides it.
- Multi-hop needs decomposition, not a longer context window.
- Stale indexes produce confident wrong answers whose audit trail looks clean. Treat staleness as a first-class metric.
- Naive RAG is fine when the corpus is fresh, the question is direct, and the cost of a wrong answer is bounded. Otherwise plan for the upgrades.
- Track failure classes from day one, not just aggregate scores. Without a failure log, you are decorating.
Bridge. Basic RAG handled the easy half of the problem. The next module shows how to push past these five limits — query decomposition, iterative retrieval, HyDE, multi-vector search, hybrid retrievers, and the agentic loops that turn one-shot retrieval into a real reasoning system.