02. RAG Fundamentals — Narrative Explainer¶

Companion to 03_study_material.md. The study material gives the terms. This file gives the picture in your head.

Table of contents¶

ELI5 — the whole thing in kid words
The librarian, the bookshelf, the index card, the reading desk, the answer brief
Chapter 1: The first production failure
1.1 The confident wrong answer
1.2 Why closed-book LLMs fail
1.3 Why this matters to a lead
1.4 What RAG changes
1.5 The operating principle
Chapter 2: Chunking — the first real engineering decision
2.1 Why raw documents do not fit
2.2 Chunk size trade-offs
2.3 Why overlap exists
2.4 Recursive vs semantic splitting
2.5 One document chunked three ways
2.6 Practical chunking defaults
Chapter 3: Embeddings and vector search
3.1 What embeddings capture
3.2 Visualizing embedding space
3.3 Cosine vs dot product
3.4 Choosing embedding models
3.5 How vector databases search quickly
3.6 What still goes wrong
Chapter 4: The RAG pipeline end to end
4.1 The full pipeline picture
4.2 Query understanding
4.3 Query embedding
4.4 Retrieval
4.5 Reranking
4.6 Prompt augmentation
4.7 Generation
4.8 Retrieval prompts you can actually use
4.9 Honest admission
Chapter 5: Evaluation — because vibes will fool you
5.1 Why eval comes first
5.2 Recall@k
5.3 MRR
5.4 NDCG
5.5 Generation quality
5.6 RAGAS
5.7 Building an evaluation habit
Chapter 6: Recap and application
6.1 Failure-fix chain
6.2 Key points to remember
6.3 Important interview questions
6.4 Production experience
6.5 Apply now — exercises
6.6 Foundation-gap audit
6.7 Bridge to the next module

ELI5 — the whole thing in kid words¶

Imagine a very smart research assistant in a giant library. A person walks in and asks a question. "What did our company say about enterprise refunds last quarter?" The assistant does not try to remember everything from memory. That would be foolish. The library is huge. Instead, the assistant uses five simple helpers.

The placeholder map¶

The librarian = the retriever
The bookshelf = the vector store
The index card = the embedding
The reading desk = the context window
The answer brief = the augmented prompt

Keep these names in your head. They will make the rest of the module easy.

The story in slow motion¶

First, the question arrives. The librarian reads it carefully. Then the librarian makes an index card for the question. This card is not normal text. It is a compact meaning-signature. It says, roughly, "questions about refund policy, enterprise customers, last quarter."

Now the librarian goes to the bookshelf. But this bookshelf is magical. It is not organized alphabetically. It is organized by meaning. On that shelf, every chunk of every document also has its own index card. Questions about refunds sit near refund-policy paragraphs. Questions about pricing sit near pricing paragraphs. Questions about GPU kernels sit far away.

So the librarian does not search the whole building. The librarian pulls the five most relevant books from nearby places. Not the whole library. Only the most likely evidence. Then the librarian opens those books at the key pages. These pages go onto the reading desk. The reading desk is small. Only a limited number of pages fit there. So the librarian must choose carefully.

Finally, the librarian writes an answer brief. The answer brief contains:
- the user question,
- the selected evidence,
- instructions like "answer only from these pages,"
- and maybe a rule saying "if evidence is missing, say you do not know."

That brief goes to the writer. The writer is the language model. Now the model is no longer answering blindly. It is answering with pages open in front of it. That, in one picture, is RAG.

Why this picture matters¶

If the librarian picks the wrong books, the answer goes wrong. If the bookshelf is badly organized, retrieval goes wrong. If the index cards are poor, nearby meaning gets distorted. If the reading desk is overcrowded, signal gets buried in noise. If the answer brief is weak, the writer improvises. See the pattern. RAG is not one magic box. It is a chain of small engineering decisions. Each one can fail. Each one can be debugged. That is why good RAG work feels like systems engineering. Not just prompt writing.

Chapter 1: The first production failure¶

1.1 The confident wrong answer¶

Picture the scene. You ship an internal company chatbot. A senior leader asks: "What was our Q4 revenue last year?" The model replies instantly. It sounds polished. It says, "Your company reported Q4 revenue of $186 million." That number is wrong. Completely wrong. The true number was $143 million. No source was shown. No uncertainty was expressed. The answer looked professional. That made it more dangerous. Now imagine the CEO sees that answer in a demo. You do not get credit for fluent prose now. You get blamed for a system that invents financial facts. Career damage is not dramatic language here. It is the correct phrase.

1.2 Why closed-book LLMs fail¶

A foundation model answers from its weights. Those weights store patterns learned during training. That is useful. But it has limits. First, the model may never have seen your company data. Private dashboards were not in pretraining. Internal QBR notes were not in pretraining. Yesterday's policy update was not in pretraining. Second, even when the model saw something related, it does not store facts like a database. It stores compressed statistical patterns. That means it can sound right while being wrong. Third, the model is rewarded to continue text plausibly. It is not naturally rewarded to stop and say, "I have no verified evidence for this number." So when you ask a specific business question, and you provide no supporting documents, the model does what a clever but cornered student does. It guesses elegantly.

1.3 Why this matters to a lead¶

A toy demo can survive one wrong answer. A production system cannot. Once a model touches:
- revenue,
- legal policy,
- customer promises,
- account status,
- medical notes,
- or compliance rules,

you are in a different game. Now the job is not "say something impressive." The job is:
- answer from evidence,
- show what evidence was used,
- refuse when evidence is missing,
- and update quickly when documents change.

This is where leads get serious. They stop asking, "Which LLM is smartest?" They start asking:
- Where does the answer come from?
- How fresh is the source?
- How do we debug misses?
- How fast is retrieval?
- What is our faithfulness score?
- What happens when the document changed this morning?

Notice the shift. The problem became a system problem. Not just a model problem.

1.4 What RAG changes¶

RAG means Retrieval-Augmented Generation. The important word is not generation. The important word is augmented. We augment the model with external evidence at answer time. Instead of saying, "Model, remember everything yourself," we say, "Model, here are the best supporting passages I found right now. Use them." That one change does several useful things. It makes answers fresher. It makes answers more auditable. It makes updates cheaper than retraining. It reduces hallucination on knowledge questions. It also makes failures diagnosable. If the answer is bad, you can inspect retrieval. You can inspect chunking. You can inspect prompts. You can inspect ranking. A bad closed-book answer is often mysterious. A bad RAG answer is usually traceable. That is why teams love it.

1.5 The operating principle¶

Think of an exam. A closed-book exam tests memory. An open-book exam tests lookup plus reasoning. Naive LLM use is closed-book. RAG is open-book. But there is a twist. The book is too large. So before the exam, you need a fast librarian, a smart shelf layout, good index cards, a limited reading desk, and a crisp answer brief. That is the whole subject. You will now study each part. With one rule in mind. Whenever something feels abstract, come back to the library picture. If you can explain the library picture, you understand the technical term.

Chapter 2: Chunking — the first real engineering decision¶

2.1 Why raw documents do not fit¶

Newcomers often say, "Why not just stuff the whole PDF into the prompt?" Because reality is rude. Your documents are long. A product manual may be 80 pages. A policy handbook may be 200 pages. A customer ticket archive may be thousands of messages. The model context window is limited. Even when the context window is large, using it blindly is expensive and noisy. Suppose the user asks: "What is the cancellation notice period for enterprise annual plans?" The answer may live in one paragraph. If you dump 100 pages into the model, you pay for tokens you did not need, you increase latency, and you make it harder for the model to attend to the right place. The reading desk becomes cluttered. So we split documents into manageable pieces. These pieces are chunks. Chunking is not clerical work. It is the first real retrieval decision. If you chunk badly, every downstream component inherits the damage.

2.2 Chunk size trade-offs¶

Now think carefully. What makes a chunk good? A chunk should be small enough to retrieve precisely. But large enough to contain the necessary local context. That is the tension.

If chunks are too small¶

Small chunks often improve precision. A narrow question may map cleanly to one sentence. That sounds good. But very small chunks break context. The answer sentence may depend on the line before it. Or the definition may live across two paragraphs. Or the table header may be separated from the cell value. Then retrieval pulls a fragment that looks relevant, but the model still lacks the needed support. Typical symptoms:
- definitions without qualifiers,
- refund rules without exceptions,
- legal clauses without scope,
- table values without units,
- API examples without setup.

If chunks are too large¶

Large chunks preserve context. That also sounds good. But now each chunk contains multiple topics. The embedding becomes an average of many ideas. A question about enterprise refunds may retrieve a giant section about billing, taxes, pricing, refunds, trials, and support tiers all mixed together. The relevant paragraph is inside somewhere, but the chunk carries too much noise. That hurts ranking. It also wastes context window space later. Typical symptoms:
- relevant paragraph buried inside irrelevant bulk,
- many near-duplicate large chunks in top-k,
- slower generation due to prompt bloat,
- weaker citations because evidence is diffuse.

The central trade-off¶

Say it simply. Small chunks improve retrieval precision. Large chunks improve local context preservation. You are balancing these two. There is no universal perfect size. There is only a good size for your corpus, your query style, and your evaluation metric.

2.3 Why overlap exists¶

Now imagine a policy sentence cut exactly at the wrong place. Chunk A ends with: "Enterprise customers may request a refund within 30 days..." Chunk B begins with: "...only if the account has not exceeded 5,000 API calls." If you use zero overlap, each chunk alone is incomplete. One chunk has the rule without the exception. The other has the exception without the rule. Very bad. Overlap exists to protect boundary facts. You repeat a small slice of text across neighboring chunks. That way, important statements that straddle a boundary still appear intact somewhere.

Overlap diagram¶

original text
[-------------------------------A-------------------------------]
                    [-------------------------------B-------------------------------]

shared overlap
                    [========== repeated text ==========]

A practical picture:

chunk 1: lines 1-12
chunk 2: lines 10-22
chunk 3: lines 20-32

Lines 10-12 and lines 20-22 each appear in two chunks. That duplication is intentional.

Too little overlap¶

Too little overlap loses boundary meaning. This hurts recall. It especially hurts:
- definitions,
- step-by-step instructions,
- policy exceptions,
- contracts,
- FAQs with caveats.

Too much overlap¶

But do not become emotional about overlap. Too much overlap also hurts. You create many near-duplicate chunks. Retrieval returns repeated evidence. Your top-k becomes redundant. Storage grows. Indexing grows. Prompt stuffing grows. So overlap is a seatbelt. Not a blanket. For prose, 10-20% overlap is a strong default. Then benchmark.
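The overlap idea can be sketched in a few lines. This is a minimal illustration, assuming a plain list of tokens (here just placeholder strings); `chunk_with_overlap` and its parameters are hypothetical names, not part of any library.

```python
def chunk_with_overlap(tokens, chunk_size, overlap):
    """Split a token list into windows of chunk_size that share `overlap` tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reached the end of the text
    return chunks


# Toy input: ten placeholder tokens, windows of 4 with 1 shared token.
tokens = [f"t{i}" for i in range(10)]
chunks = chunk_with_overlap(tokens, chunk_size=4, overlap=1)
for chunk in chunks:
    print(chunk)
```

Each chunk starts with the last token of the previous one, so a fact that straddles a boundary survives intact in at least one chunk. For the 10-20% default, something like `chunk_size=400, overlap=60` would be a reasonable first setting to benchmark.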

2.4 Recursive vs semantic splitting¶

There are many chunking strategies. Two matter a lot early on. Recursive splitting. Semantic splitting.

Recursive splitting¶

Recursive splitting follows structure in stages. It tries a big boundary first. If the piece is still too large, it tries a smaller boundary. Then smaller again. Typical order:
- heading,
- paragraph,
- sentence,
- token or character fallback.

This works well for messy real-world documents. Why? Because real documents already have structure. Headings mean something. Paragraphs mean something. Bullet lists mean something. Recursive splitting respects that structure before using blunt cuts. It is a very strong baseline. Especially for markdown, docs, help centers, and internal knowledge bases.
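The staged fallback can be sketched as a short recursive function. This is a simplified illustration, not how any particular library implements it: it measures size in characters, assumes blank lines mark paragraphs, treats ". " as a sentence boundary, and skips heading detection entirely. The name `recursive_split` and the separator list are assumptions made for this example.

```python
def recursive_split(text, max_chars, separators=("\n\n", "\n", ". ", " ")):
    """Split on the coarsest separator first; use finer ones only for oversized pieces."""
    if len(text) <= max_chars:
        return [text] if text.strip() else []
    if not separators:
        # Last resort: blunt fixed-size character cuts.
        return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]
    sep, finer = separators[0], separators[1:]
    pieces = []
    for part in text.split(sep):
        pieces.extend(recursive_split(part, max_chars, finer))
    # Greedily glue small neighbours back together while they still fit.
    merged = []
    for piece in pieces:
        if merged and len(merged[-1]) + len(sep) + len(piece) <= max_chars:
            merged[-1] = merged[-1] + sep + piece
        else:
            merged.append(piece)
    return merged


doc = "Heading\n\nPara one is here.\n\nPara two is here."
chunks = recursive_split(doc, max_chars=30)
for chunk in chunks:
    print(repr(chunk))
```

Note that the separator is dropped wherever a cut lands, so a chunk cut at a sentence boundary loses its trailing period. A production splitter would usually keep separators and count tokens rather than characters.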

Semantic splitting¶

Semantic splitting looks for meaning shifts. Instead of saying, "cut every 500 tokens," it says, "cut where the topic changes." Usually this means scoring adjacent sentences or windows for similarity. When similarity drops sharply, that is a natural breakpoint. This is beautiful in theory. And often good in prose. Especially when topic transitions are smooth, and paragraph sizes are inconsistent. But it is slower. It is harder to reason about. And it can behave strangely on lists, tables, and semi-structured documents.
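The "cut where similarity drops" rule can be sketched with a toy similarity. This example assumes bag-of-words cosine as a stand-in for a real embedding model, and the threshold value is arbitrary; `bow`, `cosine`, and `semantic_split` are hypothetical helpers written for this illustration.

```python
import math
from collections import Counter


def bow(sentence):
    """Toy stand-in for an embedding: lowercase word counts."""
    return Counter(w.strip(".,").lower() for w in sentence.split())


def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def semantic_split(sentences, threshold):
    """Start a new chunk whenever adjacent sentences fall below the similarity threshold."""
    if not sentences:
        return []
    chunks = [[sentences[0]]]
    for prev, cur in zip(sentences, sentences[1:]):
        if cosine(bow(prev), bow(cur)) < threshold:
            chunks.append([cur])      # similarity dropped: topic shift
        else:
            chunks[-1].append(cur)    # same topic: keep growing the chunk
    return chunks


sentences = [
    "Refunds are available within 30 days.",
    "Refunds are not available after 5,000 calls.",
    "GPU kernels run on the device.",
]
chunks = semantic_split(sentences, threshold=0.2)
print(chunks)
```

The two refund sentences share enough words to stay together, while the GPU sentence shares none and starts a new chunk. With a real embedding model the structure is the same, only the similarity function changes.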

So which one should you start with?¶

For most practical systems, start with recursive or document-aware chunking. Why? It is easier to debug. It respects visible structure. It gives stable chunks. Then test semantic splitting if:
- your corpus is mostly prose,
- topic shifts matter,
- and your baseline misses are semantic rather than structural.

2.5 One document chunked three ways¶

Let us work one example properly. Here is a tiny policy excerpt.

The source text¶

Enterprise Annual Plan Refunds
Customers on annual enterprise plans may request a refund within 30 days of renewal.
Refunds are not available after the first 5,000 API calls made in the renewal period.
If the account has an active security incident review, billing changes pause until review closure.
Support response SLA remains active during the review period.

Now imagine the user asks: "When are enterprise annual refunds not available?" We will chunk the same text three ways.

Version A — fixed-size chunking¶

Chunk A1:
Enterprise Annual Plan Refunds Customers on annual enterprise plans may request
a refund within 30 days of renewal. Refunds are not available after
the first 5,000 API calls made in the renewal period.

Chunk A2:
If the account has an active security incident review, billing changes
pause until review closure. Support response SLA remains active during
the review period.

This is not terrible. The refund rule stayed together. The security-review note stayed separate. For this query, A1 will probably retrieve well.

Version B — recursive split by heading then paragraph¶

Chunk B1:
Heading: Enterprise Annual Plan Refunds
Customers on annual enterprise plans may request a refund within 30 days of renewal.
Refunds are not available after the first 5,000 API calls made in the renewal period.

Chunk B2:
If the account has an active security incident review, billing changes pause until review closure.
Support response SLA remains active during the review period.

This is better. Why? Because the chunk keeps the heading and the refund logic together. The semantic scope is cleaner. The chunk title itself helps retrieval.

Version C — semantic split by topic shift¶

Chunk C1:
Enterprise Annual Plan Refunds
Customers on annual enterprise plans may request a refund within 30 days of renewal.
Refunds are not available after the first 5,000 API calls made in the renewal period.

Chunk C2:
If the account has an active security incident review, billing changes pause until review closure.

Chunk C3:
Support response SLA remains active during the review period.

This looks sharp. But notice the subtle issue. C2 and C3 are now separated. For a query about review-period behavior, you may need both chunks together. Semantic splitting found a topic shift, but perhaps split too aggressively.

What did we learn?¶

Fixed-size chunking is simple. Recursive chunking is robust. Semantic chunking can be elegant, but elegance is not always better retrieval. Always test on your actual queries. Not on aesthetics.

2.6 Practical chunking defaults¶

Let us make this less mystical. Here are practical starting rules.

For product docs¶

Start around 300-500 tokens. Use 10-20% overlap. Prefer headings and paragraph boundaries. Store metadata like product, version, section, URL.

For blogs and long prose¶

Start around 400-700 tokens. Use light overlap. Semantic splitting is worth trying.

For API docs¶

Chunk by endpoint or method first. Keep request, response, and parameter tables near each other. If you separate them, you create misleading retrieval.

For code¶

Do not chunk like prose. Prefer function, class, or file-level structure. Sometimes a docstring plus signature plus a small body window works best.

For tables and contracts¶

Be careful. Tables lose meaning when headers separate from cells. Contracts lose meaning when clauses separate from conditions. Structure-aware chunking matters more here.

The practical test for a chunk¶

Ask one simple question. "Can this chunk answer a narrow question on its own?" If the answer is usually no, your chunking is probably weak. Chunk quality is not a philosophical concept. It is a retrieval convenience function. If retrieval improves, your chunking was better. That is it.

Chapter 3: Embeddings and vector search¶

3.1 What embeddings capture¶

An embedding is a vector representation of text. That sentence sounds dry. So let us translate it. An embedding is an index card for meaning. Not perfect meaning. Approximate meaning. When two texts talk about similar things, their vectors tend to land near each other. A query like: "How do refunds work for enterprise renewals?" should land near chunks like: "Annual enterprise plans may request a refund within 30 days of renewal." Even if the wording differs. That is the magic. Embeddings help us match paraphrase to paraphrase. Not just exact keyword overlap.

What embeddings are good at¶

They are often good at:
- synonyms,
- paraphrases,
- topical similarity,
- intent-level closeness,
- cross-lingual similarity if trained for it.

What embeddings are bad at¶

They are not reliable truth-machines. They can struggle with:
- exact numbers,
- negation,
- rare acronyms,
- highly specialized internal jargon,
- subtle logical relationships,
- access permissions,
- multi-hop connections across documents.

This matters. Because people often say, "We used embeddings, so retrieval should work." No. Embeddings are a powerful approximation. Not divine understanding.

3.2 Visualizing embedding space¶

Imagine a map. Not a real map. A meaning map. Chunks about billing cluster together. Chunks about pricing cluster nearby. Chunks about GPU kernels cluster elsewhere. A query enters that map. Then we ask, "Which stored chunks are closest?" ASCII picture:

embedding space

                 GPU / infra
                     x   x
                  x

   billing x   x    x
           x  Q
         x   x

   refunds x   x  x
         x   x

Q = query: "enterprise refund after renewal"

Notice something important. The axes do not mean anything human-readable here. Dimension 1 is not "refundness." Dimension 2 is not "enterprise-ness." The geometry matters. The labels do not. That confuses beginners. Do not worry. You only care that similar meaning lands nearby enough for search.

3.3 Cosine vs dot product¶

Once texts become vectors, we need a similarity score. The common choices are cosine similarity, dot product, and sometimes L2 distance.

Cosine similarity¶

Cosine asks: "How aligned are these vectors in direction?" It ignores magnitude entirely. For text retrieval, this is a common default. Because meaning direction usually matters more than raw length.

Dot product¶

Dot product mixes direction and magnitude. If vectors are normalized to unit length, dot product and cosine ranking become equivalent. That is why people sometimes use the terms loosely. But do not be lazy. The metric should match model expectations. Some embedding models assume normalization. Some libraries handle it for you. Some do not.

L2 distance¶

L2 is ordinary geometric distance. Also usable. Less common in text retrieval discussions, but common in ANN library internals.

Practical rule¶

If your embedding model documentation says cosine, use cosine. If it says dot product, follow that. If embeddings are normalized to unit length, cosine and dot product give the same scores, so they rank identically. Benchmark anyway.
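A tiny worked example makes the difference concrete. The two-dimensional vectors below are invented for illustration (real embeddings have hundreds of dimensions), and the helper names are hypothetical.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def norm(a):
    return math.sqrt(dot(a, a))


def cosine(a, b):
    return dot(a, b) / (norm(a) * norm(b))


def normalize(a):
    n = norm(a)
    return [x / n for x in a]


query = [1.0, 0.0]
docs = {
    "aligned_short": [1.0, 0.1],   # almost the same direction as the query, small magnitude
    "off_axis_long": [5.0, 5.0],   # 45 degrees away from the query, large magnitude
}

# Raw dot product rewards magnitude; cosine only looks at direction.
by_dot = sorted(docs, key=lambda name: dot(query, docs[name]), reverse=True)
by_cosine = sorted(docs, key=lambda name: cosine(query, docs[name]), reverse=True)

# After unit-normalization, dot product equals cosine, so the rankings agree.
unit_docs = {name: normalize(vec) for name, vec in docs.items()}
by_dot_normalized = sorted(
    unit_docs, key=lambda name: dot(normalize(query), unit_docs[name]), reverse=True
)

print(by_dot, by_cosine, by_dot_normalized)
```

The long off-axis vector wins on raw dot product, the well-aligned short vector wins on cosine, and once everything is normalized the two metrics agree.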

3.4 Choosing embedding models¶

Now comes a realistic engineering choice. Which embedding model should you use? People often choose by hype. That is a weak habit. Choose by your corpus, your latency budget, your deployment constraints, and your eval scores.

Questions to ask¶

Does the model understand my domain language?
Does it support my languages?
What is the latency at my expected volume?
What is the storage cost of its vector dimension?
Can I host it myself if needed?
Does it expect special prefixes for query and document text?
How does it perform on my gold set?

Common options¶

API models are easy to start with. You send text, you get vectors, you move fast. Open-source models are attractive when:
- privacy matters,
- cost matters,
- offline use matters,
- or you want more control.

Good open baselines include e5, BGE, and sentence-transformer families.

One important subtlety¶

Some retrieval models distinguish query embeddings from document embeddings. They may expect prefixes like:
- query: ...
- passage: ...

If you ignore that, your quality can drop for silly reasons. This is the sort of thing interviews love. Because it shows whether you have actually built retrieval, or only read tweets about it.
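One low-effort way to avoid forgetting the prefixes is to route all text through a single helper before embedding. This is a sketch only: the exact prefix strings depend on the model, so the values in `PREFIXES` are assumptions to be checked against the model's own documentation, and `add_prefix` is a hypothetical helper.

```python
# Assumed prefix strings; confirm them against your embedding model's documentation.
PREFIXES = {"query": "query: ", "passage": "passage: "}


def add_prefix(text, role):
    """Prepend the role-specific prefix so queries and documents are embedded consistently."""
    if role not in PREFIXES:
        raise ValueError(f"unknown role: {role!r}")
    return PREFIXES[role] + text.strip()


query_text = add_prefix("  refund window for enterprise plans? ", "query")
passage_text = add_prefix("Customers may request a refund within 30 days of renewal.", "passage")
print(query_text)
print(passage_text)
```

Keeping the prefix logic in one place means indexing code and query code cannot silently drift apart.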

3.5 How vector databases search quickly¶

Now assume you have millions of chunk vectors. A user query arrives. How do you find the nearest ones quickly? The naive answer is exact search. Compare the query vector to every stored vector. That works. It is also slow at scale. So practical systems use approximate nearest neighbor search. ANN. The two names you must know are HNSW and IVF.

Exact search¶

Exact search gives the real nearest neighbors. Best quality. Worst scalability. Fine for small corpora. Often good enough up to modest sizes. Do not over-engineer too early.
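Exact search is short enough to write out in full. The sketch below assumes a small in-memory dictionary of chunk vectors and cosine similarity; the vectors and chunk IDs are made up for illustration.

```python
import heapq
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def exact_top_k(query_vec, index, k):
    """Score every stored vector against the query and keep the k best."""
    scored = ((cosine(query_vec, vec), chunk_id) for chunk_id, vec in index.items())
    return [chunk_id for _, chunk_id in heapq.nlargest(k, scored)]


index = {
    "refund_policy": [0.9, 0.1],
    "pricing": [0.6, 0.6],
    "gpu_kernels": [0.0, 1.0],
}
top = exact_top_k([1.0, 0.0], index, k=2)
print(top)
```

The cost is one similarity computation per stored vector per query, which is why this stops being practical as the index grows into the millions.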

HNSW¶

HNSW stands for Hierarchical Navigable Small World. Big name. Simple intuition. Think of a graph. Each vector points to other nearby vectors. There are multiple layers. Top layers are sparse and fast to traverse. Lower layers are denser and more precise. Search starts high, jumps quickly to the right region, then refines lower down. ASCII sketch:

layer 2:      o ------ o
                \    /
layer 1:    o -- o -- o -- o
             \   |   / \  |
layer 0:  o -- o -- o -- o -- o -- o

Why people like HNSW:
- high recall,
- low latency,
- good default behavior,
- widely supported.

Important knobs:
- M = how many connections per node,
- ef_construction = build thoroughness,
- ef_search = search thoroughness.

Higher settings usually mean better recall, more memory, and more compute.

IVF¶

IVF stands for Inverted File. Simple picture. First cluster vectors into buckets. Then at query time, find the most promising buckets, and search mainly inside them. ASCII sketch:

all vectors
   ↓
cluster into cells

[cell 1]  [cell 2]  [cell 3]  [cell 4]
   x x       x x x      x          x x

query → choose nearest cells → search inside those cells

IVF can be very fast. But if the correct neighbor lives outside searched cells, recall drops. That is the trade-off.
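The recall trade-off is easy to reproduce in a toy. The sketch below assumes the clustering step has already happened: `centroids` and `cells` are hand-written rather than learned, and `nprobe` (the number of cells to search) is borrowed from common ANN library terminology as an assumption. It illustrates the idea, not any library's implementation.

```python
import math


def ivf_search(query, centroids, cells, nprobe, k):
    """Search only the nprobe cells whose centroids are closest to the query."""
    cell_order = sorted(range(len(centroids)), key=lambda i: math.dist(query, centroids[i]))
    candidates = []
    for i in cell_order[:nprobe]:
        candidates.extend(cells[i])
    candidates.sort(key=lambda item: math.dist(query, item[1]))
    return [chunk_id for chunk_id, _ in candidates[:k]]


# Two hand-made cells; each vector sits in the cell of its nearest centroid.
centroids = [[0.0, 0.0], [10.0, 0.0]]
cells = [
    [("a", [0.0, 0.0]), ("b", [1.0, 0.0])],
    [("c", [5.5, 0.0]), ("d", [10.0, 0.0])],
]

# The query is slightly closer to centroid 0, but its true nearest neighbour "c" lives in cell 1.
query = [4.9, 0.0]
one_cell = ivf_search(query, centroids, cells, nprobe=1, k=1)
two_cells = ivf_search(query, centroids, cells, nprobe=2, k=1)
print(one_cell, two_cells)
```

Searching one cell returns "b" and misses the true nearest neighbour; searching two cells finds "c" at the cost of scanning more vectors.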

So which one is the safer default?¶

For many modern text-retrieval systems, HNSW is the safer default. It tends to be easier to get strong recall with sane latency. IVF still matters. Especially in FAISS-heavy systems, very large indexes, or memory-sensitive setups. But if an interviewer asks, "What ANN index would you start with for RAG?" you can say HNSW confidently, and then justify it.

3.6 What still goes wrong¶

Even with decent embeddings and a solid ANN index, retrieval still misses. Why? Because the world is messy.

Failure type 1 — exact numbers¶

A question asks for a precise threshold. Embeddings are good at topic similarity. They are weaker at exact numeric matching. That is why hybrid retrieval often helps.

Failure type 2 — negation¶

"Supported" and "not supported" can land dangerously close. Semantic similarity is not full logical understanding.

Failure type 3 — acronyms and jargon¶

Your team says "TBR" or "ECA freeze" internally. A general embedding model may not understand that.

Failure type 4 — stale indexes¶

The document changed. Your vector index did not. Now the retriever is faithfully retrieving outdated truth. Very embarrassing.

Failure type 5 — permissions¶

The correct chunk exists. The user should not see it. This is not an embedding problem. It is an access-control problem. Never confuse the two. Good RAG engineers know where semantics end, and systems design begins.

Chapter 4: The RAG pipeline end to end¶

4.1 The full pipeline picture¶

Now let us connect everything. A minimal RAG pipeline looks like this.

user query
   ↓
(optional) rewrite / clean query
   ↓
embed query
   ↓
retrieve top-k chunks from vector store
   ↓
(optional) rerank candidates
   ↓
assemble augmented prompt
   ↓
LLM generates answer
   ↓
return answer with citations or abstain

Short diagram. Very easy to memorize. Harder to build well. Because each arrow can fail. We will walk slowly.
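The diagram maps almost one-to-one onto a small orchestration function. In this sketch every stage is passed in as a callable, so the skeleton stays independent of any particular embedding model, vector store, or LLM; `run_rag` and the stub stages below are hypothetical placeholders, not real components.

```python
ABSTAIN = "I could not find support in the retrieved context."


def run_rag(question, embed, search, build_prompt, generate, rewrite=None, rerank=None, k=5):
    """Wire the pipeline stages together; optional stages are skipped when not provided."""
    query = rewrite(question) if rewrite else question
    candidates = search(embed(query), k)
    if rerank:
        candidates = rerank(query, candidates)
    if not candidates:
        return ABSTAIN  # nothing retrieved: abstain instead of answering closed-book
    return generate(build_prompt(question, candidates))


# Stub stages, just enough to run the skeleton end to end.
def stub_embed(text):
    return [float(len(text))]


def stub_search(vector, k):
    return [{"id": "c1", "text": "Refunds may be requested within 30 days of renewal."}][:k]


def stub_build_prompt(question, chunks):
    context = "\n".join(f"[{c['id']}] {c['text']}" for c in chunks)
    return f"Question: {question}\nContext:\n{context}"


def stub_generate(prompt):
    return "ANSWER BASED ON:\n" + prompt


answer = run_rag("What is the refund window?", stub_embed, stub_search, stub_build_prompt, stub_generate)
empty = run_rag("What is the refund window?", stub_embed, lambda vec, k: [], stub_build_prompt, stub_generate)
print(answer)
```

Keeping each stage swappable is what makes the pipeline debuggable: you can log or replace one arrow at a time.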

4.2 Query understanding¶

The query arrives in natural language. Humans are messy. Users write:
- half-sentences,
- pronouns,
- vague references,
- misspellings,
- last-week context,
- or emotionally loaded nonsense.

Example: "What about that refund thing from last quarter?" This is not retrieval-friendly. What refund thing? Enterprise or self-serve? Quarter relative to what date? Renewal or original purchase? So some systems rewrite or normalize the query first. Even a tiny rewrite helps. Failure modes here:
- ambiguity preserved,
- important entity dropped,
- wrong date range inferred,
- jargon expanded incorrectly.

Symptom: good chunks exist, but retrieval still feels strangely off-topic.

4.3 Query embedding¶

Next the query becomes a vector. This sounds mechanical. But mistakes happen here too. Common issues:
- wrong model for the domain,
- missing query prefix,
- query text polluted by UI junk,
- language mismatch,
- poor normalization.

If query embeddings are low-quality, you are searching the bookshelf with the wrong index card. No librarian can fix that later.

4.4 Retrieval¶

Now the retriever pulls top-k candidate chunks. This is the librarian walking to the shelf. If chunking was bad, retrieval inherits the damage. If embeddings were weak, retrieval inherits the damage. If the ANN index is poorly tuned, retrieval inherits the damage. This stage is brutally important. Because generation can only use what retrieval found. Not what retrieval should have found. Common retrieval failures:
- relevant chunk not in top-k,
- duplicates crowd out coverage,
- metadata filter missing,
- stale data retrieved,
- keyword-heavy query missed by semantic-only retrieval.

One of the strongest beginner lessons in RAG is this: When the answer is wrong, check retrieval before blaming the LLM.

4.5 Reranking¶

Suppose retrieval gets you twenty plausible chunks. Good. But only five are truly strong. This is where reranking helps. The retriever is usually broad and cheap. The reranker is slower but sharper. A reranker reads the query and candidate chunk together, then scores actual relevance more precisely. This often improves top positions. That matters. Because the reading desk is limited. You want the best pages on the desk. Not just roughly related pages. Failure modes here:
- skipping reranking when corpus is noisy,
- reranking too few candidates,
- using a weak reranker,
- reranking duplicates and wasting slots.

Symptom: your relevant chunk appears in top-20, but not in top-3 where it matters.
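The retrieve-wide-then-rerank pattern can be sketched with a pluggable scorer. In practice the scorer would be a model that reads the query and chunk together; here `word_overlap` is a deliberately crude stand-in so the example runs on its own, and all names are hypothetical.

```python
def word_overlap(query, text):
    """Toy relevance score: fraction of query words that appear in the chunk."""
    q = set(query.lower().split())
    t = set(text.lower().split())
    return len(q & t) / len(q) if q else 0.0


def rerank(query, candidates, score_fn, top_n):
    """Drop exact-duplicate texts, rescore the rest against the query, keep the best top_n."""
    seen, unique = set(), []
    for cand in candidates:
        key = cand["text"].strip().lower()
        if key not in seen:  # do not spend reading-desk slots on duplicates
            seen.add(key)
            unique.append(cand)
    ranked = sorted(unique, key=lambda c: score_fn(query, c["text"]), reverse=True)
    return ranked[:top_n]


candidates = [
    {"id": "c1", "text": "pricing tiers for enterprise plans"},
    {"id": "c2", "text": "enterprise refund window after renewal"},
    {"id": "c3", "text": "Enterprise refund window after renewal"},  # duplicate of c2
    {"id": "c4", "text": "gpu kernel launch latency"},
]
top = rerank("enterprise refund after renewal", candidates, word_overlap, top_n=2)
print([c["id"] for c in top])
```

The retriever's original order put the pricing chunk first; after reranking, the refund chunk leads and the duplicate no longer takes a slot.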

4.6 Prompt augmentation¶

Now we build the answer brief. This is a deeply underrated step. You are deciding:
- which chunks go in,
- in what order,
- with what metadata,
- under what instructions,
- and what the model must do when evidence is missing.

A weak augmented prompt says: "Here are some docs. Answer the question." A better augmented prompt says: "Use only the provided context. Cite chunk IDs. If the answer is unsupported, say you could not find support in the retrieved context." This is not mere style. It changes system behavior. Failure modes here:
- too many chunks,
- noisy chunk ordering,
- missing source labels,
- no abstention rule,
- contradictory evidence dumped without instruction.

The reading desk must stay tidy. Do not pile every possibly relevant page on it. That is not thoroughness. That is panic.

4.7 Generation¶

Finally the LLM writes the answer. If the evidence is good, and the instructions are clear, generation often looks magical. If the evidence is bad, generation becomes dangerous again. Common generation failures:
- the model blends multiple chunks incorrectly,
- the model overgeneralizes beyond the context,
- the model answers confidently from prior knowledge,
- the model ignores a weak abstain instruction,
- the model produces citations that look real but are wrong.

Remember. Generation is the last step. Not the first step. The pipeline determines what generation can safely do.

4.8 Retrieval prompts you can actually use¶

You asked for useful prompts. Let us keep them practical. These are starter prompts. Not sacred text.

Prompt 1 — query rewrite for retrieval¶

You are a search-query rewriter for an internal knowledge base.
Rewrite the user's question into a concise retrieval query.
Keep key entities, dates, product names, and policy terms.
Do not answer the question.
Return only the rewritten query.

User question: {question}

Use this when user phrasing is noisy.

Prompt 2 — multi-query expansion¶

Generate 3 alternative retrieval queries for the question below.
Each rewrite should preserve intent but vary wording.
Include synonyms and likely documentation language.
Return a JSON list only.

Question: {question}

Use this when recall feels low. Then retrieve for all rewrites, dedupe, and merge candidates.
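The retrieve, dedupe, and merge step can be sketched as follows. It assumes each rewrite returns a list of `(chunk_id, score)` pairs and that scores from different rewrites are comparable, which holds when they come from the same embedding model and metric; `merge_candidates` is a hypothetical helper.

```python
def merge_candidates(result_lists, limit):
    """Merge results from several query rewrites, keeping each chunk's best score."""
    best = {}
    for results in result_lists:
        for chunk_id, score in results:
            if chunk_id not in best or score > best[chunk_id]:
                best[chunk_id] = score
    ranked = sorted(best.items(), key=lambda item: item[1], reverse=True)
    return [chunk_id for chunk_id, _ in ranked[:limit]]


# Results for two rewrites of the same question; "c2" is returned by both.
results_for_rewrites = [
    [("c1", 0.9), ("c2", 0.5)],
    [("c2", 0.8), ("c3", 0.4)],
]
merged = merge_candidates(results_for_rewrites, limit=2)
print(merged)
```

Keeping the maximum score per chunk is one simple choice; other merge rules exist and are worth comparing on your own eval set.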

Prompt 3 — decomposition for hard retrieval¶

Break the question into smaller searchable sub-questions.
Only split if the question clearly asks for multiple facts.
Return 1-3 short search queries.
Do not answer the question.

Question: {question}

Use this when questions are multi-part.

Prompt 4 — grounded answer prompt¶

Answer the user's question using only the provided context.
Cite the supporting chunk IDs inline.
If the context does not support an answer, say:
"I could not find support in the retrieved context."
Do not use outside knowledge.

Question: {question}
Context:
{chunks}

This is not a retrieval prompt exactly. It is the last guardrail before hallucination.
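Filling the grounded answer prompt is mostly string assembly, but the details (chunk labels, a cap on chunk count) are where mistakes creep in. The sketch below reuses the prompt text from above; the chunk dictionary shape, `build_grounded_prompt`, and `max_chunks` are assumptions made for illustration.

```python
GROUNDED_TEMPLATE = """Answer the user's question using only the provided context.
Cite the supporting chunk IDs inline.
If the context does not support an answer, say:
"I could not find support in the retrieved context."
Do not use outside knowledge.

Question: {question}
Context:
{chunks}"""


def build_grounded_prompt(question, chunks, max_chunks=5):
    """Label each chunk with its ID and cap how many reach the reading desk."""
    context = "\n\n".join(f"[{c['id']}] {c['text']}" for c in chunks[:max_chunks])
    return GROUNDED_TEMPLATE.format(question=question, chunks=context)


retrieved = [
    {"id": "refunds-1", "text": "Customers may request a refund within 30 days of renewal."},
    {"id": "refunds-2", "text": "Refunds are not available after the first 5,000 API calls."},
    {"id": "sla-1", "text": "Support response SLA remains active during the review period."},
]
prompt = build_grounded_prompt("When are enterprise annual refunds not available?", retrieved, max_chunks=2)
print(prompt)
```

Because every chunk is labeled with its ID, the citations the model produces can be checked against what was actually supplied.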

4.9 Honest admission¶

Now the important honesty section. RAG is powerful. RAG is useful. RAG is not magic reasoning dust. If the answer requires multi-hop reasoning across three distant documents, naive RAG may fail. If the answer requires symbolic computation, RAG may fail. If the answer depends on permissions, RAG alone does not solve that. If the answer requires checking live transactional state, you need tools or databases, not just retrieved text. And if retrieval missed the key fact, the generator cannot invent it safely. So say this clearly in interviews. RAG solves knowledge access and grounding. It does not automatically solve reasoning. Multi-hop is hard. Query rewriting helps. Decomposition helps. HyDE (hypothetical document embeddings) may help. Reranking helps. Agentic flows may help. But the problem remains hard. This is exactly why the next module exists.

Chapter 5: Evaluation — because vibes will fool you¶

5.1 Why eval comes first¶

A RAG demo can look wonderful. Ask one friendly question. It answers nicely. Everyone smiles. That tells you very little. Real systems fail on edge cases, hard phrasing, ambiguous wording, and stale knowledge. So you need evaluation. Without evaluation, you optimize for storytelling. With evaluation, you optimize for measurable quality. A good eval setup answers questions like:

- Did retrieval find the right chunk?
- How high was it ranked?
- Did the answer stay faithful to evidence?
- Did the answer address the question?
- Did the system abstain when support was missing?

5.2 Recall@k¶

Recall@k is the first retrieval metric many teams use. Simple idea. Did the relevant chunk appear anywhere in the top-k results? If yes, great. If not, bad. Formula intuition:

recall@k = relevant items retrieved in top-k / total relevant items

In many RAG settings, you simplify this to a hit rate: "Did at least one gold chunk appear in top-k?" Why recall@k matters: generation has no chance if retrieval never surfaced the evidence. But recall@k ignores rank inside the top-k. With k = 10, a relevant chunk at rank 1 and one at rank 10 count the same. That is why recall@k is necessary, but not sufficient.
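Both forms of the metric fit in a few lines. This is a minimal sketch: the function names, chunk IDs, and gold labels are made up for illustration, and scoring a query with no gold labels as 0 is an assumption, not a standard.

```python
from typing import List, Set


def recall_at_k(retrieved: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of the relevant (gold) chunks that appear in the top-k."""
    if not relevant:
        return 0.0  # assumption: queries with no gold labels score 0
    found = set(retrieved[:k]) & relevant
    return len(found) / len(relevant)


def hit_at_k(retrieved: List[str], relevant: Set[str], k: int) -> float:
    """Simplified variant: 1.0 if at least one gold chunk is in the top-k."""
    return 1.0 if set(retrieved[:k]) & relevant else 0.0


ranked = ["c9", "c2", "c5", "c1", "c7"]  # made-up retrieval order
gold = {"c1", "c2"}                      # made-up gold labels

recall_top3 = recall_at_k(ranked, gold, k=3)  # only c2 is in the top 3
recall_top5 = recall_at_k(ranked, gold, k=5)  # both gold chunks are in the top 5
hit_top3 = hit_at_k(ranked, gold, k=3)        # at least one gold chunk in the top 3
```

Note how the hit variant already reports success at k = 3 while recall shows that half the evidence is still missing.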

5.3 MRR¶

MRR means Mean Reciprocal Rank. This sounds more frightening than it is. Take the rank of the first relevant result. Invert it. Then average across queries. If the first relevant result is rank 1, you get 1.0. If it is rank 2, you get 0.5. If it is rank 5, you get 0.2. So MRR rewards systems that surface a relevant item early. This is useful in RAG. Because top positions dominate the context window. If the best evidence is buried deep, your answer step may never use it well.
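The same arithmetic as a small sketch. The function names and toy runs are illustrative. Scoring a query with no relevant result as 0 is a common convention and is treated as an assumption here.

```python
from typing import List, Set, Tuple


def reciprocal_rank(retrieved: List[str], relevant: Set[str]) -> float:
    """1 / rank of the first relevant result, using 1-based ranks."""
    for rank, chunk_id in enumerate(retrieved, start=1):
        if chunk_id in relevant:
            return 1.0 / rank
    return 0.0  # assumption: no relevant result counts as 0


def mean_reciprocal_rank(runs: List[Tuple[List[str], Set[str]]]) -> float:
    """Average the reciprocal rank over all queries."""
    if not runs:
        return 0.0
    return sum(reciprocal_rank(r, gold) for r, gold in runs) / len(runs)


# Each run is (retrieved IDs in rank order, gold IDs); all values made up.
runs = [
    (["a", "b", "c"], {"a"}),            # first relevant at rank 1
    (["a", "b", "c"], {"b"}),            # first relevant at rank 2
    (["a", "b", "c", "d", "e"], {"e"}),  # first relevant at rank 5
    (["a", "b"], {"z"}),                 # nothing relevant retrieved
]
mrr = mean_reciprocal_rank(runs)  # (1.0 + 0.5 + 0.2 + 0.0) / 4
```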

5.4 NDCG¶

Now we level up. NDCG stands for Normalized Discounted Cumulative Gain. You do not need to fear the full formula now. Remember the intuition. NDCG rewards:

- highly relevant items,
- appearing early,
- in the right order.

It is helpful when relevance is graded, not just binary. For example:

- exact answer chunk = 3,
- supporting chunk = 2,
- vaguely related chunk = 1,
- irrelevant chunk = 0.

Then NDCG tells you whether the ranking puts the best evidence first. This is a very realistic production metric. Because not all retrieved chunks are equally helpful.
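For readers who want to see the mechanics, here is a minimal sketch using the 0 to 3 grading scale. It assumes the linear-gain form of DCG, where each grade is divided by log2(position + 1). Another common variant uses 2^grade - 1 as the gain. The function names are illustrative.

```python
import math
from typing import List


def dcg(gains: List[float]) -> float:
    """Discounted cumulative gain: later positions count for less."""
    return sum(g / math.log2(pos + 1) for pos, g in enumerate(gains, start=1))


def ndcg_at_k(ranked_gains: List[float], all_gains: List[float], k: int) -> float:
    """DCG of the actual ranking divided by the DCG of the ideal ranking.

    ranked_gains: grades of the retrieved chunks, in retrieved order.
    all_gains: grades of every labeled chunk for this query (any order).
    """
    ideal_dcg = dcg(sorted(all_gains, reverse=True)[:k])
    if ideal_dcg == 0:
        return 0.0  # assumption: a query with no relevant chunks scores 0
    return dcg(ranked_gains[:k]) / ideal_dcg


labels = [3, 2, 1, 0]  # exact answer, supporting, vaguely related, irrelevant

best_first = ndcg_at_k([3, 2, 1, 0], labels, k=4)   # ideal order
worst_first = ndcg_at_k([0, 1, 2, 3], labels, k=4)  # same chunks, reversed
```

Both rankings contain the same four chunks, so recall@4 cannot tell them apart. NDCG can, because it is sensitive to order.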

5.5 Generation quality¶

Retrieval metrics are not enough. You also need answer-quality metrics. Two especially important ones are faithfulness and relevance.

Faithfulness¶

Faithfulness asks: "Is the answer supported by the retrieved context?" This is the anti-hallucination metric. A fluent wrong answer scores low here. A cautious answer tied to evidence scores high.

Answer relevance¶

Relevance asks: "Did the answer actually address the user's question?" An answer can be faithful but irrelevant. Example. The model cites a correct billing paragraph, but the question was about cancellation notice period. Good grounding. Wrong answer.

Context precision and context recall¶

Context precision asks: "Were the retrieved chunks mostly useful?" Context recall asks: "Did the retrieved chunks cover the needed evidence?" These help separate retrieval noise from retrieval miss. That distinction matters. Noise means you retrieved too much junk. Miss means you failed to retrieve key support. Different fixes.
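The noise-versus-miss split can be illustrated with plain set arithmetic over gold labels. This is a simplified, label-based approximation and an assumption of this sketch. Frameworks that report metrics with these names may compute them differently, for example with LLM judges or rank weighting. The chunk IDs are made up.

```python
from typing import List, Set


def context_precision(retrieved: List[str], needed: Set[str]) -> float:
    """Share of retrieved chunks that were actually useful (low = noise)."""
    if not retrieved:
        return 0.0
    useful = [chunk_id for chunk_id in retrieved if chunk_id in needed]
    return len(useful) / len(retrieved)


def context_recall(retrieved: List[str], needed: Set[str]) -> float:
    """Share of needed chunks that were retrieved (low = miss)."""
    if not needed:
        return 0.0  # assumption: queries with no labeled evidence score 0
    return len(set(retrieved) & needed) / len(needed)


retrieved = ["c1", "c8", "c9", "c4"]  # made-up retrieval result
needed = {"c1", "c2"}                 # made-up gold evidence

precision = context_precision(retrieved, needed)  # 1 useful out of 4 retrieved
recall = context_recall(retrieved, needed)        # 1 found out of 2 needed
```

Low precision with high recall points to noise. Low recall points to a miss, whatever the precision.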

5.6 RAGAS¶

RAGAS is a practical framework for evaluating RAG systems. People like it because it packages several useful metrics together. Common RAGAS-style metrics include:

- faithfulness,
- answer relevance,
- context precision,
- context recall.

This is helpful. Especially when you need a repeatable evaluation harness quickly. But be mature about it. RAGAS is not final truth. It often uses LLMs internally as judges. LLM judges can be useful. They can also be noisy. So use RAGAS as a strong instrument panel. Not as scripture. Human review still matters. Especially for:

- business-critical answers,
- citations,
- tone,
- policy compliance,
- and subtle domain errors.

5.7 Building an evaluation habit¶

Here is the healthy habit. Build a small gold set. Maybe 20-50 queries first. For each query, label the relevant chunk or chunks. Then measure retrieval. Then inspect failures manually. Then change one variable. Not five variables together. Maybe chunk size. Maybe overlap. Maybe embedding model. Maybe top-k. Maybe reranker. Then run eval again. This is how you actually improve RAG. Not by declaring, "This prompt felt better in my last demo."

One more crucial habit. Slice results. Do not only compute one average score. Break results by:

- query length,
- query type,
- document type,
- language,
- presence of numbers,
- presence of acronyms,
- recency-sensitive questions.

A single average hides real pain. Sliced eval exposes it.
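Slicing takes very little code. The sketch below assumes each evaluated query is stored as a dict with a slice label and a per-query score. The field names `query_type` and `hit` and all values are made up for illustration.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, List


def score_by_slice(rows: List[dict], slice_key: str, metric: str = "hit") -> Dict[str, float]:
    """Average a per-query metric separately for each value of slice_key."""
    buckets = defaultdict(list)
    for row in rows:
        buckets[row[slice_key]].append(row[metric])
    return {name: mean(values) for name, values in buckets.items()}


# Made-up per-query results: hit = 1.0 if a gold chunk was in the top-k.
rows = [
    {"query_type": "plain", "hit": 1.0},
    {"query_type": "plain", "hit": 1.0},
    {"query_type": "plain", "hit": 1.0},
    {"query_type": "has_numbers", "hit": 0.0},
    {"query_type": "has_numbers", "hit": 1.0},
    {"query_type": "has_numbers", "hit": 0.0},
]

overall = mean(row["hit"] for row in rows)    # one blended average
by_type = score_by_slice(rows, "query_type")  # one average per slice
```

In this toy data the blended average looks acceptable, while the slice with numbers fails two times out of three.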

Chapter 6: Recap and application¶

6.1 Failure-fix chain¶

Here is the whole subject in one table.

| Failure | What actually broke | Common fix | See |
|---|---|---|---|
| Hallucinated company fact | No grounded evidence supplied | Add retrieval + citations + abstain rule | §1.1, §4.6 |
| Relevant sentence missing | Chunk too small or split badly | Increase size or use better boundaries | §2.2, §2.4 |
| Boundary exception lost | No overlap | Add 10-20% overlap | §2.3 |
| Top-k is noisy | Chunks too large / retrieval too broad | Reduce chunk size or rerank | §2.2, §4.5 |
| Semantically close but wrong chunk | Weak embedding fit | Benchmark a better embedding model | §3.4 |
| Relevant chunk buried low | Ranking quality weak | Add reranker or tune search depth | §4.5 |
| Query wording misses docs | Messy user phrasing | Rewrite or expand query | §4.2, §4.8 |
| Numbers / negation mismatch | Embedding semantics are imperfect | Hybrid search or filters | §3.6 |
| Good retrieval, bad answer | Prompt augmentation weak | Tighten instructions and source usage | §4.6 |
| Nice demo, poor production | No real evaluation | Build gold set and track metrics | §5.1-§5.7 |

If you memorize nothing else, memorize this chain.

6.2 Key points to remember¶

RAG exists because closed-book LLMs guess on private facts.
Chunking is the first quality gate.
Chunk size is a precision-versus-context trade-off.
Overlap protects facts that cross boundaries.
Recursive chunking is a strong default.
Embeddings capture meaning similarity, not truth.
Cosine similarity is the usual text-retrieval baseline.
HNSW is a common ANN default for strong recall and speed.
Retrieval quality constrains answer quality.
Reranking sharpens the top of the list.
Prompt augmentation decides what evidence the model actually sees.
Faithfulness matters more than fluency in production RAG.
Recall@k, MRR, and NDCG answer different ranking questions.
RAGAS is helpful, but human review still matters.
RAG does not magically solve reasoning or multi-hop synthesis.

6.3 Important interview questions¶

These are good interview questions. Practice answering them aloud.

1. Why is RAG preferable to fine-tuning for rapidly changing knowledge bases?
2. Chunk size trade-off: what breaks when chunks are too small or too large?
3. Why do we use overlap, and why not make it huge?
4. Recursive chunking vs semantic chunking: when would you start with each?
5. What do embeddings capture well, and where do they fail?
6. Cosine similarity vs dot product: when do they become equivalent?
7. HNSW vs IVF: what makes each fast, and which would you start with?
8. Why does reranking help even after retrieval already happened?
9. Recall@k vs MRR vs NDCG: what exactly does each metric reward?
10. Why can a RAG system still hallucinate even with retrieval?
11. Why is naive RAG weak on multi-hop reasoning?
12. If faithfulness is low, where would you debug first?

If you can answer these cleanly, you are interview-ready for basic RAG.

6.4 Production experience¶

Now let us talk like engineers. These are ballpark numbers. Not laws of nature. But they are useful for intuition.

Example latency budget¶

A practical single-turn RAG system might look like this:

- query rewrite: 20-80 ms if you use a small model
- query embedding: 5-30 ms
- vector retrieval: 10-40 ms
- reranking top-20: 30-150 ms local, or 80-250 ms via API
- prompt assembly: 1-5 ms
- answer generation: 300-1200 ms depending on model and answer length

That gives a rough end-to-end range of:

~400 ms on a very lean system
~800-1500 ms on a realistic API-based system

If your product is chat, that may be fine. If your product is voice or search-autocomplete, you must be stricter.
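To see where the rough totals come from, you can add up the per-stage ranges. The sketch assumes the stages run strictly one after another, with no parallelism or caching. The dictionary and function names are illustrative. The numbers are the ballpark figures from the list above.

```python
from typing import Dict, Tuple

Range = Tuple[int, int]  # (low_ms, high_ms)

# Ballpark per-stage latencies from the budget above, in milliseconds.
STAGES_MS: Dict[str, Range] = {
    "query rewrite": (20, 80),
    "query embedding": (5, 30),
    "vector retrieval": (10, 40),
    "prompt assembly": (1, 5),
    "answer generation": (300, 1200),
}
RERANK_MS: Dict[str, Range] = {"local": (30, 150), "api": (80, 250)}


def end_to_end(rerank: str) -> Range:
    """Sum best-case and worst-case latency, assuming sequential stages."""
    stages = list(STAGES_MS.values()) + [RERANK_MS[rerank]]
    low = sum(lo for lo, _ in stages)
    high = sum(hi for _, hi in stages)
    return (low, high)


local_total = end_to_end("local")  # all-best-case to all-worst-case, local reranker
api_total = end_to_end("api")      # same, with an API reranker
```

The all-best-case sum lands near the lean figure. The all-worst-case sum is an upper bound that a real request rarely hits on every stage at once, which is why the quoted ranges are rounded.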

Example cost intuition¶

Again, ballpark only. Suppose you handle 100,000 queries per month. Very roughly:

- embeddings are usually cheap per query,
- vector retrieval is mostly infra cost,
- reranking adds noticeable API or GPU cost,
- generation is often the dominant variable cost.

In many systems, the answer model costs more than retrieval. But retrieval mistakes force longer prompts, more retries, and lower trust. So do not optimize only raw API price. Optimize total system usefulness.

Production lessons people learn the hard way¶

Bad chunking silently destroys downstream quality.
Stale indexes create confident old answers.
Missing metadata filters cause embarrassing data leaks.
Reranking often gives a bigger gain than prompt tweaking.
A refusal triggered by a clear evidence rule is often better UX than a guessed answer.
Small eval sets already beat gut-feel iteration.

6.5 Apply now — exercises¶

Easy¶

Take one FAQ page and chunk it three different ways.
For each version, predict which questions it will answer well.
Write one retrieval rewrite prompt and one grounded answer prompt.

Medium¶

Build a 20-query gold set.
Compare two chunk sizes and two overlap settings.
Report recall@5, recall@10, and MRR.
Write the top three retrieval failure categories.

Hard¶

Add reranking.
Compare top-k retrieval with and without reranking.
Evaluate whether NDCG improves.
Pick three multi-hop questions and explain why naive RAG struggles.

If you do the hard version honestly, you are already preparing for the next module.

6.6 Foundation-gap audit¶

Before you move to 09_advanced_rag_patterns, make sure these five basics are automatic.

- Basic RAG pipeline
- Embedding similarity concept
- Chunk size reasoning
- Vector search mechanics
- Retrieval metrics

If even one of these feels fuzzy, pause. Re-read the relevant section. Advanced RAG assumes these are already in your hands.

6.7 Bridge to the next module¶

Here is the exact bridge sentence. Next module — 09_advanced_rag_patterns — tackles the hard cases: multi-hop questions, query rewriting, HyDE, agentic RAG, and production hardening. That is where the subject gets interesting. This module gave you the floor. The next one tests the ceiling.