08. The RAG pipeline — from question to grounded answer¶

~12 min read. You asked your chatbot a simple question. The answer came back confident and wrong. By the end of this page you will know which of 8 stages failed, and why.

Builds on the ELI5 in 00-eli5.md. The librarian, the bookshelf, the reading desk, the answer brief — the same four placeholders walk the full path here.

A confident wrong answer comes back — which of eight boxes failed?¶

Your chatbot is in production. A user asked "What's our refund policy for orders over 30 days?" and got back "We offer full refunds within 30 days". The answer was fluent, cited a source, and was completely wrong. Where do you look first?

If your mental model of RAG is "retrieve and generate," you have only two places to blame, and both look fine. The real query-time path has eight distinct boxes between the user's words and the model's reply. Each one transforms the data into a different shape, and each one fails in its own way. Until you can name them and trace them, every wrong answer looks like a generation problem.

RAG actually has two pipelines, not one. Indexing-time runs offline, rarely: ingest → parse → clean → chunk → embed → store. Query-time runs constantly, per request — that is the eight-stage path below. This page is the query-time pipeline; indexing-time gets its own treatment in 03_vector_retrieval_infrastructure/. When stage 1 fires, assume the bookshelf is already loaded.

Rule. The visible failure at stage 7 is rarely the root cause. You cannot fix what you cannot separate. Instrument every box or every bug looks like a model bug.

1) The eight boxes hidden behind "retrieve and generate"¶

Most candidates will tell you RAG is "retrieve plus generate." They are right and useless. Two stages hide eight failure modes.

Here is the pipeline as a single column, with the data shape at every handoff. Notice the shape changes at each step — that is how you know it is a real stage, not a name.

┌───────────────────────────────────────────────────┐
│  USER QUERY  (string)                             │
│  "What's our refund policy for orders             │
│   over 30 days?"                                  │
└────────────────────────┬──────────────────────────┘
                         │
                         ▼
┌───────────────────────────────────────────────────┐
│  1. QUERY UNDERSTANDING                           │
│  rewrite, expand, resolve references              │
│  out → "refund policy enterprise orders 30+ days" │
└────────────────────────┬──────────────────────────┘
                         │
                         ▼
┌───────────────────────────────────────────────────┐
│  2. EMBED QUERY                                   │
│  text → vector                                    │
│  out → [0.21, -0.84, ..., 0.07]  (768 dims)       │
└────────────────────────┬──────────────────────────┘
                         │
                         ▼
┌───────────────────────────────────────────────────┐
│  3. RETRIEVE TOP-K                                │
│  nearest neighbours from the bookshelf            │
│  out → 20 candidate chunks with scores            │
└────────────────────────┬──────────────────────────┘
                         │
                         ▼
┌───────────────────────────────────────────────────┐
│  4. RERANK                                        │
│  cross-encoder rescans candidates carefully       │
│  out → 20 chunks, re-scored and re-ordered        │
└────────────────────────┬──────────────────────────┘
                         │
                         ▼
┌───────────────────────────────────────────────────┐
│  5. SELECT TOP-N                                  │
│  budget step — fit the reading desk               │
│  out → 3 chunks, de-duplicated                    │
└────────────────────────┬──────────────────────────┘
                         │
                         ▼
┌───────────────────────────────────────────────────┐
│  6. BUILD PROMPT (the answer brief)               │
│  system + question + evidence + rules             │
│  out → final prompt string                        │
└────────────────────────┬──────────────────────────┘
                         │
                         ▼
┌───────────────────────────────────────────────────┐
│  7. GENERATE ANSWER                               │
│  LLM reads the brief, writes the response         │
│  out → answer text                                │
└────────────────────────┬──────────────────────────┘
                         │
                         ▼
┌───────────────────────────────────────────────────┐
│  8. CITATION MAPPING                              │
│  link claims back to source chunks                │
│  out → answer + citations                         │
└───────────────────────────────────────────────────┘

Every box has one job. Every box has its own failure mode. You cannot fix what you cannot separate.

Mini-FAQ. "Why embed the query separately — aren't the docs already embedded?" Yes, the docs were embedded at indexing time. The query gets embedded now, at request time, with the same model. The retriever then finds doc vectors close to the query vector. Different times, same embedder.

2) Stages 1–4: how the librarian decides what reaches the desk¶

The first four stages exist to put the right chunks in front of the LLM. If the librarian fails to find or rank the right page, the rest is theatre — generation cannot recover information that never reached the reading desk.

Stage 1 — Query understanding¶

Users do not write clean queries. They write "the thing we said last quarter" or "the 30 days one." This stage rewrites vague wording, resolves references, and expands acronyms.

For our running example, "What's our refund policy for orders over 30 days?" becomes "refund policy enterprise orders 30+ days." The rewrite preserves the time constraint. That matters.

Failure mode. Drop a key term in the rewrite and the rest of the pipeline chases a different question. We will trace exactly this failure in section 4.

Stage 2 — Embed query¶

The question becomes the index card. Same embedding model that built the bookshelf. Different model = different geometry = nearest neighbours land in the wrong neighbourhood.

Failure mode. Domain mismatch. A generic embedder may not separate "refund" from "return" in a billing corpus. Domain-tuned embedders help.

Mini-FAQ. "Why not skip the embedder and just keyword-search?" You can — that's sparse retrieval (BM25). It wins on rare entities and exact-match. Dense embedding wins on paraphrase and synonym. Most production stacks run hybrid retrieval: BM25 + dense, scores fused by Reciprocal Rank Fusion (RRF). Whenever you hear "vector search only," that is usually a toy.

Stage 3 — Retrieve top-k¶

The bookshelf returns the k nearest chunks. Typical k is 20 to 100. Bigger k catches more recall, costs more downstream rerank time, dilutes precision.

Failure mode. k too small and useful evidence stays hidden. Wrong metadata filter and the right source is excluded before it gets scored. (Did you scope the query to the right tenant, the right time window, the right product line?)

Stage 4 — Rerank¶

The retriever uses a fast bi-encoder: embed query, embed chunk, dot product. Cheap and approximate. The reranker uses a cross-encoder: feed query and chunk together into a small transformer that outputs a single relevance score. Slower per pair, much sharper.

Failure mode. Skipping rerank in a noisy corpus. Weak chunks then occupy prompt space and crowd out the strong ones.

Mini-FAQ. "Cross-encoder vs bi-encoder — what's the actual difference?" Bi-encoder embeds each side independently → vectors compared by similarity → fast at scale, blunt at precision. Cross-encoder fuses both texts and reads them jointly → cannot be precomputed → slow at scale, sharp at precision. So you use bi-encoder to find the 20 candidates and cross-encoder to rank them.

Predict the four stages that turn evidence into prose¶

Before reading the next section, predict the four remaining stages. Write them down. Then continue and check.

3) Stages 5–8: how the answer brief turns chunks into a grounded reply¶

Now the LLM enters. The reading desk fills, the answer brief gets written, the writer writes, the citations close the loop. Each box still has its own failure mode — confident-wrong answers love these four stages just as much as the first four.

Stage 5 — Select top-n¶

The prompt cannot hold everything. The reading desk has a budget — typically 3 to 8 chunks. This stage is the difference between retrieval and useful retrieval.

A naive selection just takes the top-n from rerank. A real selection also de-duplicates: if three chunks paraphrase the same paragraph, only one earns its place. Diversity beats redundancy.

Failure mode. Keep all three near-duplicates. Now your "evidence" is one fact, repeated.

Stage 6 — Build prompt (the answer brief)¶

The brief is not "concatenate retrieved text." It has structure:

SYSTEM:  You answer using only the provided context.
         If the context is insufficient, say so.

CONTEXT: [chunk 1: doc=refund_policy.md, page=2]
         Enterprise annual plans may request a refund
         within 30 days of renewal.

         [chunk 2: doc=billing_faq.md, page=5]
         Refunds for orders past 30 days require
         manager approval and are issued as credit.

QUESTION: What's our refund policy for orders
          over 30 days?

Notice the rules at the top, the citations in the context, the question last. Order matters. LLMs weight the last instruction heavily — putting the question at the end is intentional.

Failure mode. Vague brief, no "answer only from these chunks" rule. The model blends evidence with its own training memory. Hallucination begins here.

Stage 7 — Generate answer¶

The visible box. Users see this and ignore the rest. But generation quality is capped by retrieval quality — feed nonsense chunks to GPT-4 and watch it confidently summarize nonsense.

Failure mode. Temperature too high → creative invention. Context too long → lost-in-the-middle (the middle chunks get under-read). Streaming half-baked outputs without re-checking → users see hallucinations land in real time.

Mini-FAQ. "If retrieval is perfect, does the model matter?" Yes, but less than people think. A weaker model with perfect retrieval often beats a stronger model with messy retrieval. This is why "upgrade to GPT-X" is not a fix for a broken RAG.

Stage 8 — Citation mapping¶

The doc that says "optional" lies to you. In any product where users trust the answer — Perplexity, Glean, every enterprise search — citations are mandatory. Optional means toy.

Citations work in two flavours:

Span-level: each claim points to the exact sentence in a chunk. Strong, expensive.
Chunk-level: each paragraph in the answer points to a source chunk. Cheaper, weaker.

Failure mode. Citation mismatch — the answer cites chunk 2 for a fact that actually came from the model's training data. This is citation hallucination, and it destroys trust faster than no citation.

4) The failure chain — one query traced through¶

Abstract failure cascades sound theoretical. Here is the same chain with one concrete query.

User asks: "What's our refund policy for orders over 30 days?"

Stage	What should happen	What goes wrong	Result
1. Rewrite	Preserve "over 30 days"	Drop time constraint → "refund policy"	Generic query
2. Embed	Vector tuned to "30+ days"	Vector tuned to "refunds" generally	Generic neighbourhood
3. Retrieve	Pull policy chunks for >30d	Pull generic refund policy chunks	Wrong evidence in top-k
4. Rerank	Push 30+d chunks up	No 30+d chunks present to push up	Generic chunks rerank highest
5. Select	Keep diverse 30+d chunks	Keep redundant generic chunks	Reading desk now hollow
6. Brief	"Answer from these chunks"	Same rule, but chunks are wrong	Hollow brief
7. Generate	"Refunds over 30 days require manager approval"	"We offer full refunds within 30 days"	Confidently wrong
8. Cite	Cite the >30d policy chunk	Cite a generic FAQ	Looks trustworthy

The visible failure is at stage 7. The root cause is at stage 1. This is why you instrument every stage. You cannot fix what you cannot separate.

5) Latency budget — generation dominates, but not always¶

Latency at each stage, as a bar chart. Width is proportional to time.

Query rewrite     ██▏                       20-80 ms
Embedding (local) █▏                         5-30 ms
Embedding (API)   ████▏                   100-400 ms
Retrieval         ██▏                       10-40 ms
Reranking         ██████▏                  30-150 ms
Selection         ▏                          <5 ms
Prompt build      ▏                          <5 ms
Generation        ████████████████████▏   300-1200 ms
Citation          ██▏                      20-100 ms
                  └──────────────────────────────────
Typical total: 400-1500 ms (local embed) | 600-2000 ms (API embed)

Stack matters. The numbers above assume self-hosted embedding and reranking with batched calls; cloud API stacks add round-trip latency at every IO stage. Always qualify your budget by stack before quoting it in an interview.

Generation eats 60-80% of the budget. Optimize there first. But only after measuring. If your rewrite stage uses a slow remote model, your "generation problem" might actually be a rewrite problem hiding in plain sight.

6) Cost — the parallel budget¶

Latency is what users feel. Cost is what your CFO feels. Both need budgets.

Stage	Cost dominator	Typical $/query at 1M qpd
Rewrite	LLM tokens (small)	$0.0001
Embedding	API calls × dims	$0.00002
Retrieval	Vector DB read	$0.00001
Reranking	Cross-encoder calls	$0.0002
Generation	LLM tokens (input + output)	$0.005 – $0.030
Citation	LLM tokens (small)	$0.0002

Rule of thumb. Generation is ~90% of your cost as well as your latency. The single biggest lever: send fewer tokens to the generator (better selection, prompt compression, smaller models for easy queries).

7) Three production wrappers the pipeline diagram hides¶

The 8-stage diagram is the happy path. Real systems add three wrappers around it.

Caching¶

Three layers, from outside in:

Semantic cache — has this query (or one like it) been answered recently? If yes, skip the whole pipeline.
Retrieval cache — has this exact rewritten query hit the bookshelf before? If yes, skip retrieve + rerank.
Prompt prefix cache — providers like Anthropic and OpenAI cache the static prefix of long prompts. Free latency cut, you do nothing.

Safety (input + output)¶

Input filter — block prompt-injection attempts ("ignore previous instructions...") and PII before the query reaches stage 1.
Output filter — scan the generated answer for PII leaks (from retrieved docs), unsafe content, or hallucinated citations.

Fallback when retrieval returns nothing useful¶

When the top-k scores are all below a threshold, you have three choices:

Answer with the base model only and flag "no sources found."
Ask the user a clarifying question.
Refuse — "I do not have information on this."

Pick a default. Toy systems silently fall back to the LLM and hallucinate. Production systems refuse.

The eight-stage shape across shipped systems¶

The 8-stage pipeline shows up across many systems, sometimes with different names. The shape is constant.

Perplexity AI — end-to-end retrieval and citation pipeline for grounded web answers; citations are span-level and central to the product.
Glean — enterprise search across SaaS apps; permission-aware retrieval is the differentiator.
Azure AI Search — vector retrieval + semantic ranker as a managed pipeline; reranking is a first-class stage.
Amazon Kendra — enterprise QA with answer ranking and source attribution.
Notion AI Q&A — pulls notes and docs into the brief before generation; chunking is workspace-aware.
GitHub Copilot Chat — repository context retrieved per query, ranked, and stuffed into the prompt.
Cursor / Windsurf — codebase-aware coding agents with retrieval over file embeddings.
Anthropic Claude Projects — user-supplied corpus indexed and retrieved per turn.
OpenAI ChatGPT with Connectors — third-party data retrieval and grounded answers.
Google NotebookLM — source-grounded QA over user-uploaded documents.
Cohere RAG / Coral — retrieval and grounded generation as a packaged API.
Vespa — search engine with built-in hybrid retrieval and reranking primitives.
Elastic + ELSER — sparse-dense hybrid retrieval inside an existing search stack.
Pinecone Assistant — managed RAG stack including chunking, embedding, retrieval.
Weaviate Hybrid Search — BM25 + dense fusion with reranking modules.
LlamaIndex production deployments — orchestration framework wrapping the same 8 stages.
LangChain / LangGraph RAG agents — agentic loops where retrieval is a tool the agent calls.
Vectara — RAG-as-a-service with citation and faithfulness scoring built in.
You.com / Andi / Phind — consumer search-grounded chat with web retrieval per query.
Hebbia / Harvey / Casetext — domain RAG for finance, legal; chunking is contract-aware.
Salesforce Einstein Copilot — CRM-grounded answers using customer data retrieval.
Microsoft Copilot for Microsoft 365 — Graph-aware retrieval over emails, docs, chats.
Slack AI search and summaries — retrieval over channel history, summary generation.
Intercom Fin — support chatbot grounded in help-center articles with citation back to source.
Zendesk AI agents — knowledge-base retrieval + grounded answers for ticket deflection.

The diagram does not change. The corpus, the reranker, the safety wrapper — those change.

Recall — can you name each box and its failure mode cold?¶

Which stage decides whether useful evidence enters the system at all?
Why does retrieval quality cap generation quality? Phrase it in one sentence.
Which stage usually dominates latency, and by roughly what percentage?
What happens if the selector keeps three near-duplicate chunks?
Bi-encoder vs cross-encoder — where does each one belong in the pipeline?
What is the difference between sparse and dense retrieval, and what does "hybrid" mean?
Why is citation called "optional" in some texts and "mandatory" in this one?

Interview Q&A¶

Q1. What are the main stages in a RAG pipeline? A. Query understanding, embedding, retrieval, reranking, selection, prompt building, generation, citation. Common wrong answer to avoid: "RAG is just retrieval followed by an LLM call."

Q2. Why can one weak stage ruin the whole answer? A. Each stage only operates on what earlier stages pass forward. Stage 7 cannot recover information stage 1 has already dropped. Common wrong answer to avoid: "A strong-enough generator can always recover."

Q3. What does top-k mean in retrieval, and how do you pick k? A. Top-k is the number of candidates fetched before reranking and selection. Pick k large enough to ensure recall, small enough to keep rerank cost bounded. Typical: k=20–100 retrieved, n=3–8 selected. Common wrong answer to avoid: "Top-k is the number of final tokens in the answer."

Q4. Why track latency per stage instead of just total p95? A. Because optimization requires the bottleneck, not the average. The total tells you whether you have a problem; the stage tells you where. Common wrong answer to avoid: "Just measure total p95 — stage-level instrumentation is overkill."

Q5. How would you debug a RAG system that gives confident wrong answers? A. Walk the pipeline backward. Check generation prompt → verify the right chunks reached it → check rerank scores → check retrieval top-k → check the embedded query → check the rewritten query. The bug is upstream of the visible failure. Common wrong answer to avoid: "I'd switch to a stronger model."

Q6. Sparse vs dense vs hybrid retrieval — when do you use each? A. Sparse (BM25) wins on rare entities and exact-match tokens. Dense (embedding) wins on paraphrase and synonym. Hybrid (RRF fusion) wins in most production settings because real queries mix both patterns. Common wrong answer to avoid: "Dense is strictly better than sparse."

Q7. Why is a cross-encoder reranker worth the extra latency? A. Bi-encoder retrieval is approximate — it ranks by independent similarity. A cross-encoder reads query and chunk jointly, so it scores actual relevance, not just geometric proximity. On a noisy corpus, the precision gain is large enough to justify 30–150 ms per query. Common wrong answer to avoid: "Reranking is just another retrieval pass."

Q8. What happens when no retrieved chunk is relevant? A. The right behaviour is to refuse or ask for clarification. Production systems use a score threshold to detect this; below threshold, the LLM is not given retrieved context. The wrong behaviour — silently falling back to the base model — produces confident hallucinations dressed up as grounded answers. Common wrong answer to avoid: "The LLM will know to say it doesn't know."

Apply now (10 min)¶

Step 1 — model the exercise. Here is the trace I would write for our example query, in one sentence per stage:

Stage	Input	Output	Failure I would log
1 Rewrite	Raw user question	Cleaned, expanded query	"Time constraint dropped" rate
2 Embed	Cleaned query	Query vector	Embedder version mismatch
3 Retrieve	Query vector	20 candidate chunks	Avg top-k score per query
4 Rerank	20 candidates	20 re-scored chunks	Reranker latency p95
5 Select	20 re-scored	3 chunks	Duplicate rate in selected set
6 Brief	3 chunks + question	Prompt string	Brief length distribution
7 Generate	Prompt	Answer text	Faithfulness score, drift
8 Cite	Answer + chunks	Answer + citations	Citation mismatch rate

Step 2 — your turn. Take a real query from your product. Write the same eight rows for it. For each row, mark one failure you have actually seen or could plausibly see, and write one metric you would log to catch it.

Step 3 — sketch from memory. Redraw the 8-stage diagram. Beside each box, write the data shape on the way out, not the description. Mark where the librarian, the bookshelf, the reading desk, and the answer brief appear. If you can do this cold, you understand the pipeline.

What you should remember¶

This chapter explained why "retrieve and generate" is the wrong mental model for debugging a RAG system. Behind those two words sit eight boxes — rewrite, embed, retrieve, rerank, select, brief, generate, cite — each transforming the data into a different shape, each with its own failure mode. The confident-wrong answer at stage 7 is almost never a stage-7 problem; it is upstream rot that the generator faithfully renders.

You learned to walk the chain backward when something looks wrong: read the answer, then the chunks that reached the reading desk, then the rerank scores, then the retrieved top-k, then the embedded query, then the rewritten query. The visible failure is rarely the cause. You also learned that latency and cost both concentrate at generation (60–90%), but optimizing there first without instrumenting the rest is how teams burn quarters on the wrong stage.

Carry this diagnostic forward: every stage gets its own metric and its own threshold. Faithfulness belongs to stage 7; recall@k belongs to stage 3; duplicate rate belongs to stage 5; citation mismatch belongs to stage 8. If you cannot point at the stage, you cannot point at the bug.

Remember:

Two pipelines, not one. Indexing-time builds the bookshelf; query-time runs the eight boxes per request.
Each box transforms the shape of the data. If the shape doesn't change, it is a name, not a stage.
The visible failure is rarely the root cause — walk the chain backward from the answer to the rewrite.
Retrieval quality caps generation quality. A stronger model writes more fluent nonsense from bad chunks.
Production wrappers — caching, safety, fallback — are part of the pipeline, not optional polish. "Silent fallback to the base model" is how grounded answers become hallucinations in disguise.

Bridge. The pipeline is clear. But every stage starts with the user's question — and users rarely ask clean questions. Query understanding is where the first weak link breaks. The next file goes deep on how to repair vague, ambiguous, or under-specified queries before they ever reach the bookshelf.

→ 09-query-and-retrieval.md