03. Week 7 — RAG Fundamentals
For deep understanding, see `02_explainer.md`: a narrative with worked examples, diagrams, failure modes, and retrieval prompts. This file is the quick-reference glossary: definitions, formulas, lookup tables, and implementation defaults.
Section 1 — Why RAG exists
LLMs store broad world knowledge in weights. That knowledge is:
- static,
- hard to update,
- weak on private company data,
- and dangerous when the model guesses.
RAG = retrieve relevant external context at query time, then generate an answer grounded in that context.
See explainer §1.1-§1.4.
Section 2 — The core pipeline
user query
↓
embed query
↓
retrieve top-k chunks
↓
(optional) rerank
↓
augment prompt with evidence
↓
generate answer with citations / abstain
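The flow above can be sketched end to end in a few lines. This is a minimal illustration, not a production design: `embed` is a toy bag-of-words stand-in for a real embedding model, the optional rerank step is skipped, and the function names, the `min_score` threshold, and the prompt wording are all assumptions made for the example.

```python
import math
from collections import Counter
from typing import Optional


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: lowercase bag-of-words counts."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norms = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norms if norms else 0.0


def retrieve(query: str, chunks: list, k: int = 2, min_score: float = 0.1) -> list:
    """Return up to k (chunk_index, score) pairs, best first, above min_score."""
    q = embed(query)
    scored = [(i, cosine(q, embed(c))) for i, c in enumerate(chunks)]
    scored = [(i, s) for i, s in scored if s >= min_score]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


def build_prompt(query: str, chunks: list, hits: list) -> Optional[str]:
    """Augment the prompt with numbered evidence; None means abstain."""
    if not hits:
        return None  # nothing relevant retrieved, so do not let the model guess
    evidence = "\n".join(f"[{i}] {chunks[i]}" for i, _ in hits)
    return (
        "Answer using only the evidence below and cite chunk ids like [0].\n"
        "If the evidence is insufficient, say you don't know.\n\n"
        f"Evidence:\n{evidence}\n\nQuestion: {query}"
    )


chunks = [
    "Refunds are issued within 14 days of purchase.",
    "GPU kernels are launched with a grid of thread blocks.",
    "Refund requests need the original order number.",
]
hits = retrieve("how do refunds work", chunks)
prompt = build_prompt("how do refunds work", chunks, hits)
no_hits = retrieve("quantum chromodynamics", chunks)
```

Note that the toy embedding misses the third chunk because "refund" and "refunds" are different tokens. A real embedding model would likely place them close together, which is the point of the embed step.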
See explainer §4.1.
Section 3 — Chunking cheat sheet
Documents rarely fit whole into the model's context window, so we split them into chunks that can be embedded and retrieved individually.
| Strategy | Best for | Strength | Weakness |
|---|---|---|---|
| Fixed-size | quick baseline | easy, reproducible | ignores meaning boundaries |
| Recursive | mixed markdown / docs | respects headings, then paragraphs, then sentences | still approximate |
| Semantic | long prose | follows topic shifts | slower, more complex |
| Document-aware | HTML, markdown, code | preserves structure | parser effort |
| Hierarchical | long reports | coarse + fine retrieval | more plumbing |
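As a rough sketch of the recursive strategy in the table: try the coarsest separator first, pack the pieces greedily up to a size limit, and fall back to finer separators only for pieces that are still too large. The separator list, the character-based limit, and the function name are assumptions for illustration. Real splitters usually count tokens and treat markdown headings as the first level.

```python
def recursive_split(text, max_chars=200, separators=("\n\n", "\n", ". ", " ")):
    """Split text into chunks of at most max_chars, preferring coarse boundaries."""
    if len(text) <= max_chars:
        return [text] if text.strip() else []

    # Pick the coarsest separator that actually occurs in this piece of text.
    for index, sep in enumerate(separators):
        if sep in text:
            finer = separators[index + 1:]
            break
    else:
        # No separator left: hard cut as a last resort.
        return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]

    chunks, current = [], ""
    for part in text.split(sep):
        candidate = current + sep + part if current else part
        if len(candidate) <= max_chars:
            current = candidate  # keep packing neighbours together
            continue
        if current:
            chunks.append(current)
        if len(part) > max_chars:
            # This piece alone is too big: recurse with finer separators.
            chunks.extend(recursive_split(part, max_chars, finer))
            current = ""
        else:
            current = part
    if current:
        chunks.append(current)
    return chunks


text = (
    "Alpha beta.\n\n"
    "Gamma delta epsilon zeta eta theta iota kappa lambda.\n\n"
    "Mu nu."
)
chunks = recursive_split(text, max_chars=40)
```

In this example the short paragraphs survive intact, and only the long middle paragraph is broken at word boundaries.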
Chunk size defaults
| Corpus type | Starting size | Overlap |
|---|---|---|
| Product docs | 300-500 tokens | 10-20% |
| Blog / prose | 400-700 tokens | 10-15% |
| API docs | section-based first | small overlap |
| Code | function / class aware | minimal overlap |
| Tables / contracts | structure-aware first | depends on layout |
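To make the size and overlap columns concrete, here is a minimal fixed-size chunker with overlap. It assumes the text is already tokenized into a list. The defaults of 400 tokens with 60 tokens of overlap (15%) are picked from the product-docs row purely as an example, and `chunk_tokens` is a hypothetical helper name.

```python
def chunk_tokens(tokens, size=400, overlap=60):
    """Slide a window of `size` tokens, stepping by `size - overlap`."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # the window already reached the end of the document
    return chunks


tokens = [f"t{i}" for i in range(1000)]  # stand-in for real tokenizer output
chunks = chunk_tokens(tokens, size=400, overlap=60)
```

Each chunk repeats the last 60 tokens of the previous one, so a sentence that straddles a boundary still appears whole in at least one chunk. The cost is a larger index, which is why overlap stays a small fraction of chunk size.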
See explainer §2.1-§2.6.
Section 4 — Chunking trade-offs
| If chunks are too small | If chunks are too large |
|---|---|
| Great local precision | More surrounding context |
| Loses surrounding evidence | Pulls irrelevant text |
| More index rows | Fewer but noisier rows |
| Weak for definitions split across sentences | Weak for exact retrieval |
Rule of thumb: a good chunk can usually answer one narrow question on its own.
See explainer §2.2.
Section 5 — Embeddings
An embedding converts text into a dense vector. Nearby vectors usually mean semantically related text.
"refund policy" ───── close to ───── "returns and reimbursement"
"refund policy" ───── far from ───── "GPU kernel launch error"
What embeddings capture well:
- semantic similarity,
- paraphrases,
- related topics,
- intent-level closeness.
What embeddings capture poorly:
- exact numbers,
- negation,
- access permissions,
- multi-hop logic,
- rare domain jargon without good training coverage.
See explainer §3.1-§3.2.
Section 6 — Similarity metrics
| Metric | Intuition | When used |
|---|---|---|
| Cosine similarity | angle between vectors | default for text retrieval |
| Dot product | angle + magnitude | use when the model was trained for it; on L2-normalized vectors it equals cosine similarity |
| L2 distance | geometric distance | some ANN libraries / exact search setups |
If embeddings are L2-normalized, cosine similarity and dot product produce the same ranking.
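A small numeric check of that statement, using hand-picked 2-D vectors (made up for this example, not real embeddings). On raw vectors the dot product rewards magnitude and can disagree with cosine. After L2 normalization the two rankings coincide.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def norm(a):
    return math.sqrt(dot(a, a))


def cosine(a, b):
    return dot(a, b) / (norm(a) * norm(b))


def l2_normalize(a):
    n = norm(a)
    return [x / n for x in a]


query = [1.0, 1.0]
docs = {
    "short_aligned": [0.9, 1.1],   # small magnitude, almost the same direction
    "long_off_axis": [10.0, 2.0],  # large magnitude, different direction
}

# Ranking on raw vectors: dot product favours the long vector.
rank_dot_raw = sorted(docs, key=lambda d: dot(query, docs[d]), reverse=True)
rank_cosine = sorted(docs, key=lambda d: cosine(query, docs[d]), reverse=True)

# Ranking after L2 normalization: dot product now equals cosine similarity.
unit_query = l2_normalize(query)
unit_docs = {name: l2_normalize(vec) for name, vec in docs.items()}
rank_dot_unit = sorted(unit_docs, key=lambda d: dot(unit_query, unit_docs[d]), reverse=True)
```

For unit vectors the squared L2 distance is `2 - 2 * dot(a, b)`, so sorting by ascending L2 distance also gives the same order.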
See explainer §3.3.
Section 7 — Choosing an embedding model
Choose by quality on your corpus, not by brand alone.
| Factor | What to ask |
|---|---|
| Domain fit | Does it understand your jargon? |
| Language support | English only, or multilingual? |
| Cost | What is price per million tokens or per document batch? |
| Latency | Can you meet your p95 target? |
| Dimension | Does storage / RAM matter? |
| Hosting | API or self-hosted? |
| Query/doc prefixes | Does the model expect special formatting? |
Typical starting options:
- OpenAI text-embedding models for fast API baselines
- BGE / e5 / sentence-transformers for open-source baselines
- Cohere for multilingual or retrieval-focused comparisons
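As one illustration of the query/doc prefix row in the table: the E5 family documents a `query: ` prefix for queries and a `passage: ` prefix for documents. The helper below is a hypothetical sketch of keeping that formatting in one place. Check the model card for the exact strings, since many models expect no prefix at all.

```python
# Prefix strings follow the E5 convention; other models may differ or need none.
PREFIXES = {"query": "query: ", "document": "passage: "}


def format_for_embedding(text: str, role: str) -> str:
    """Prepend the role-specific prefix before sending text to the embedder."""
    if role not in PREFIXES:
        raise ValueError(f"unknown role: {role!r}")
    return PREFIXES[role] + text.strip()


query_input = format_for_embedding(" refund policy ", "query")
doc_input = format_for_embedding("Refunds are issued within 14 days.", "document")
```

Using the query prefix for documents, or omitting prefixes entirely, is the kind of "query formatting" failure listed for the embedding stage in the failure map.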
See explainer §3.4.
Section 8 — Vector search internals
Exact search
- Compare query vector to every stored vector
- Best recall
- Too slow at large scale
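A minimal sketch of exact search, assuming a small in-memory dict of vectors (the ids and values are invented for the example). Every query scores every stored vector, which is why recall is perfect and cost grows linearly with corpus size.

```python
import heapq
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norms if norms else 0.0


def exact_top_k(query, vectors, k=3):
    """Score all N stored vectors against the query: O(N * d) work per query."""
    scored = ((cosine(query, vec), doc_id) for doc_id, vec in vectors.items())
    return [(doc_id, score) for score, doc_id in heapq.nlargest(k, scored)]


index = {
    "a": [1.0, 0.0],
    "b": [0.8, 0.6],
    "c": [0.0, 1.0],
    "d": [-1.0, 0.0],
}
top = exact_top_k([1.0, 0.1], index, k=2)
```

This is also a useful ground truth when tuning an approximate index: run both on a sample of queries and compare the overlap.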
HNSW
- Graph-based ANN
- Fast and high recall
- Common production default
- Key knobs: `M`, `ef_construction`, `ef_search`
IVF
- Cluster vectors into buckets
- Search likely buckets only
- Faster than exact search, often with lower recall than HNSW
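A toy version of the bucket idea, to show where the recall loss comes from. The centroids are fixed by hand here as an assumption; in practice they are typically learned with k-means. The `nprobe` parameter (the name FAISS uses for this knob) sets how many nearest buckets are searched.

```python
def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def build_ivf(vectors, centroids):
    """Assign every vector id to the bucket of its nearest centroid."""
    buckets = {i: [] for i in range(len(centroids))}
    for doc_id, vec in vectors.items():
        nearest = min(range(len(centroids)), key=lambda i: sq_dist(vec, centroids[i]))
        buckets[nearest].append(doc_id)
    return buckets


def ivf_search(query, vectors, centroids, buckets, k=2, nprobe=1):
    """Search only the nprobe buckets whose centroids are closest to the query."""
    probe = sorted(range(len(centroids)), key=lambda i: sq_dist(query, centroids[i]))[:nprobe]
    candidates = [doc_id for i in probe for doc_id in buckets[i]]
    return sorted(candidates, key=lambda d: sq_dist(query, vectors[d]))[:k]


vectors = {"a": [1.0, 0.0], "b": [4.0, 0.0], "c": [6.0, 0.0], "d": [9.0, 0.0]}
centroids = [[0.0, 0.0], [10.0, 0.0]]  # hand-picked for the example
buckets = build_ivf(vectors, centroids)

query = [4.9, 0.0]  # true nearest neighbours are "b" then "c"
one_bucket = ivf_search(query, vectors, centroids, buckets, k=2, nprobe=1)
two_buckets = ivf_search(query, vectors, centroids, buckets, k=2, nprobe=2)
```

With `nprobe=1` the search never looks in the bucket holding "c", so it returns "a" instead. Raising `nprobe` recovers the true neighbour at the cost of scanning more vectors.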
See explainer §3.5.
Section 9 — RAG pipeline failure map
| Stage | Typical failure |
|---|---|
| Query | ambiguous user wording |
| Embedding | wrong model or query formatting |
| Retrieval | relevant chunk not in top-k |
| Rerank | skipped or too shallow |
| Prompt augmentation | too much noisy context |
| Generation | model invents unsupported details |
| Evaluation | team tracks fluency, not grounding |
See explainer §4.2-§4.8.
Section 10 — Evaluation quick reference
Retrieval metrics
- Recall@k: what share of the relevant chunks appears in the top-k?
- MRR: how early does the first relevant chunk appear?
- NDCG: do highly relevant chunks appear near the top?
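The three retrieval metrics can be computed per query as sketched below. This assumes recall is measured as the fraction of relevant chunks found in the top-k (some teams instead report a binary hit rate), that MRR is the mean of the per-query reciprocal rank shown here, and that NDCG uses graded gains with a log2 position discount. The chunk ids and gain values are made up.

```python
import math


def recall_at_k(ranked, relevant, k):
    """Fraction of relevant ids that appear in the first k results."""
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & set(relevant)) / len(relevant)


def reciprocal_rank(ranked, relevant):
    """1 / rank of the first relevant result; average over queries to get MRR."""
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(ranked, gains, k):
    """DCG of the actual ranking divided by DCG of the ideal ranking."""
    def dcg(values):
        return sum(g / math.log2(pos + 2) for pos, g in enumerate(values))

    actual = dcg([gains.get(doc_id, 0) for doc_id in ranked[:k]])
    ideal = dcg(sorted(gains.values(), reverse=True)[:k])
    return actual / ideal if ideal else 0.0


ranked = ["c3", "c1", "c7", "c2"]  # retriever output, best first
gains = {"c1": 3, "c2": 1}         # graded relevance labels for this query
relevant = set(gains)

recall = recall_at_k(ranked, relevant, k=3)  # 0.5: c1 found, c2 missed
rr = reciprocal_rank(ranked, relevant)       # 0.5: first relevant hit at rank 2
ndcg = ndcg_at_k(ranked, gains, k=4)
```

Recall ignores order inside the top-k, reciprocal rank only looks at the first hit, and NDCG is the only one of the three that uses the graded labels.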
Generation metrics
- Faithfulness: answer supported by retrieved evidence
- Answer relevance: answer actually addresses the question
- Context precision: retrieved context is mostly relevant
- Context recall: retrieved context covers the needed evidence
RAGAS
A practical framework for evaluating RAG systems with LLM-assisted metrics. Use it as a starting point, not the final judge. Human review still matters.
See explainer §5.1-§5.6.
Reading list
- `02_explainer.md`, chapters 1-5
- Lewis et al. (2020), RAG
- Reimers and Gurevych (2019), Sentence-BERT
- Malkov and Yashunin (2018), HNSW
- Johnson et al. (2017), FAISS
- RAGAS docs / paper
Reference material
YouTube
- What is Retrieval-Augmented Generation (RAG)? — short overview of grounding and retrieval.
- Learn RAG From Scratch - Python AI Tutorial from a LangChain Engineer — practical build walkthrough.
Blogs
- Retrieval-Augmented Generation - Pinecone Learn — strong conceptual overview of vector search and RAG architecture.
- Deconstructing RAG — useful taxonomy of retrieval and advanced patterns.
Self-check
For full interview framing, see 02_explainer.md §6.3.
- Why do static model weights fail on private company facts? (§1.2)
- What makes a chunk "good" for retrieval? (§2.2, §2.6)
- Why use overlap, and why not make overlap huge? (§2.3)
- Semantic vs recursive splitting — when each? (§2.4)
- What do embeddings miss even when they capture meaning well? (§3.1)
- Cosine vs dot product — when same ranking? (§3.3)
- HNSW vs IVF — which one is the safer default? Why? (§3.5)
- Why does reranking often improve precision? (§4.5)
- Recall@k vs MRR vs NDCG — what does each reward? (§5.2-§5.4)
- RAGAS helps with what, exactly? What can it still miss? (§5.6)