03. Week 7: RAG Fundamentals

For deep understanding see 02_explainer.md — narrative with worked examples, diagrams, failure modes, and retrieval prompts. This file is the quick-reference glossary: definitions, formulas, lookup tables, and implementation defaults.

Section 1: Why RAG exists

LLMs store broad world knowledge in weights. That knowledge is:
- static,
- hard to update,
- weak on private company data,
- and dangerous when the model guesses.

RAG = retrieve relevant external context at query time, then generate an answer grounded in that context.

See explainer §1.1-§1.4.

Section 2: The core pipeline

user query
  ↓
embed query
  ↓
retrieve top-k chunks
  ↓
(optional) rerank
  ↓
augment prompt with evidence
  ↓
generate answer with citations / abstain

See explainer §4.1.
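
The stages above can be sketched end to end in a few lines. This is a minimal illustration under loud assumptions, not a production design: `embed` is a toy bag-of-words stand-in for a real embedding model, and `score`, `retrieve`, `build_prompt`, and the abstain threshold `min_score` are hypothetical names invented for this example. Reranking and the actual generation call are left out; the sketch stops at the augmented prompt or an abstention.

```python
import math
from collections import Counter

def embed(text):
    """Toy stand-in for an embedding model: L2-normalized word counts."""
    counts = Counter(text.lower().split())
    norm = math.sqrt(sum(c * c for c in counts.values())) or 1.0
    return {word: c / norm for word, c in counts.items()}

def score(a, b):
    """Dot product of two sparse unit vectors (equal to cosine similarity)."""
    return sum(weight * b.get(word, 0.0) for word, weight in a.items())

def retrieve(query, chunks, k=2):
    """Return the k highest-scoring (score, chunk) pairs."""
    q = embed(query)
    scored = [(score(q, embed(c)), c) for c in chunks]
    return sorted(scored, reverse=True)[:k]

def build_prompt(query, hits, min_score=0.1):
    """Augment the prompt with numbered evidence, or return None to abstain."""
    evidence = [c for s, c in hits if s >= min_score]
    if not evidence:
        return None  # nothing relevant enough: abstain instead of guessing
    lines = [f"[{i}] {c}" for i, c in enumerate(evidence, start=1)]
    return ("Answer using only the sources below and cite them by number.\n"
            + "\n".join(lines) + f"\nQuestion: {query}")

chunks = [
    "Refunds are issued within 14 days of purchase.",
    "GPU kernel launch errors usually indicate a bad grid size.",
    "Shipping takes 3 to 5 business days.",
]
hits = retrieve("when are refunds issued", chunks)
prompt = build_prompt("when are refunds issued", hits)
```

The `None` return is the "abstain" branch of the last pipeline stage: when no chunk clears the threshold, the caller can decline to answer rather than let the model guess.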

Section 3: Chunking cheat sheet

Documents rarely fit whole into the model's context window, so we split them into chunks.

Strategy	Best for	Strength	Weakness
Fixed-size	quick baseline	easy, reproducible	ignores meaning boundaries
Recursive	mixed markdown / docs	respects headings, then paragraphs, then sentences	still approximate
Semantic	long prose	follows topic shifts	slower, more complex
Document-aware	HTML, markdown, code	preserves structure	parser effort
Hierarchical	long reports	coarse + fine retrieval	more plumbing
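
As a rough illustration of the "Recursive" row, the sketch below tries the coarsest boundary first and only falls back to finer ones for pieces that are still too long. Everything here is an assumption made for the example: the name `recursive_split`, the `SEPARATORS` list, and measuring size in characters rather than tokens. It also drops the separators and does not merge small neighboring pieces back together, which real splitters usually do.

```python
SEPARATORS = ["\n# ", "\n\n", ". "]  # headings, then paragraphs, then sentences

def recursive_split(text, max_chars=200, separators=SEPARATORS):
    """Split on the coarsest separator first, recursing on oversized pieces."""
    text = text.strip()
    if len(text) <= max_chars:
        return [text] if text else []
    if not separators:
        # No boundaries left to try: fall back to a hard character cut.
        return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]
    head, *rest = separators
    chunks = []
    for piece in text.split(head):
        chunks.extend(recursive_split(piece, max_chars, rest))
    return chunks

doc = "# Returns\n\nRefunds take 14 days. Contact support first.\n\nShipping is separate."
pieces = recursive_split(doc, max_chars=30)
```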

Chunk size defaults

Corpus type	Starting size	Overlap
Product docs	300-500 tokens	10-20%
Blog / prose	400-700 tokens	10-15%
API docs	section-based first	small overlap
Code	function / class aware	minimal overlap
Tables / contracts	structure-aware first	depends on layout

See explainer §2.1-§2.6.
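
To make the size and overlap columns concrete, here is a minimal fixed-size chunker with proportional overlap. The function name `chunk_with_overlap` and its parameters are hypothetical, and the input is assumed to be an already-tokenized list (any tokenizer will do; integers stand in for token ids below).

```python
def chunk_with_overlap(tokens, size=400, overlap_ratio=0.15):
    """Return fixed-size windows; each window repeats the tail of the previous one."""
    if not 0 <= overlap_ratio < 1:
        raise ValueError("overlap_ratio must be in [0, 1)")
    overlap = round(size * overlap_ratio)
    step = max(1, size - overlap)  # how far the window advances each time
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # the last window already reached the end of the document
    return chunks

tokens = list(range(1000))  # stand-in for 1000 token ids
chunks = chunk_with_overlap(tokens, size=400, overlap_ratio=0.15)
print([len(c) for c in chunks])  # [400, 400, 320]
```

With 15% overlap the window advances 340 tokens at a time, so the 60 tokens at the end of one chunk reappear at the start of the next. Larger overlap means more index rows for the same document.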

Section 4: Chunking trade-offs

If chunks are too small	If chunks are too large
Great local precision	Keeps more surrounding context
Loses surrounding evidence	Pulls irrelevant text
More index rows	Fewer but noisier rows
Weak for definitions split across sentences	Weak for exact retrieval

Rule of thumb: a good chunk can usually answer a narrow question on its own.

See explainer §2.2.

Section 5: Embeddings

An embedding converts text into a dense vector. Nearby vectors usually mean semantically related text.

"refund policy"  ───── close to ─────  "returns and reimbursement"
"refund policy"  ───── far from  ─────  "GPU kernel launch error"

What embeddings capture well:
- semantic similarity,
- paraphrases,
- related topics,
- intent-level closeness.

What embeddings capture poorly:
- exact numbers,
- negation,
- access permissions,
- multi-hop logic,
- rare domain jargon without good training coverage.

See explainer §3.1-§3.2.

Section 6: Similarity metrics

Metric	Intuition	When used
Cosine similarity	angle between vectors	default for text retrieval
Dot product	angle + magnitude	okay when the model expects it; on L2-normalized vectors it ranks identically to cosine
L2 distance	geometric distance	some ANN libraries / exact search setups

If embeddings are L2-normalized, cosine similarity and dot product produce the same ranking.

See explainer §3.3.
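
The equivalence can be checked on a tiny example. For unit vectors, cosine similarity equals the dot product, and squared L2 distance equals 2 - 2(a·b), so all three give the same ordering. The vectors and helper names below are made up for illustration; `docs[0]` is deliberately long but slightly off-angle, so raw dot product favors it while cosine does not.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def norm(a):
    return math.sqrt(dot(a, a))

def cosine(a, b):
    return dot(a, b) / (norm(a) * norm(b))

def l2_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def normalize(a):
    n = norm(a)
    return [x / n for x in a]

def rank(query, docs, score, reverse=True):
    """Return document indices ordered best-first under the given score."""
    return sorted(range(len(docs)), key=lambda i: score(query, docs[i]), reverse=reverse)

query = [1.0, 0.2]
docs = [[10.0, 0.0], [0.9, 0.3], [0.1, 1.0]]

raw_dot_rank = rank(query, docs, dot)    # [0, 1, 2]: magnitude wins
cosine_rank = rank(query, docs, cosine)  # [1, 0, 2]: angle wins

q_unit = normalize(query)
d_unit = [normalize(d) for d in docs]
unit_dot_rank = rank(q_unit, d_unit, dot)                        # matches cosine_rank
unit_l2_rank = rank(q_unit, d_unit, l2_distance, reverse=False)  # smaller distance is better
```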

Section 7: Choosing an embedding model

Choose by quality on your corpus, not by brand alone.

Factor	What to ask
Domain fit	Does it understand your jargon?
Language support	English only, or multilingual?
Cost	What is price per million tokens or per document batch?
Latency	Can you meet your p95 target?
Dimension	Does storage / RAM matter?
Hosting	API or self-hosted?
Query/doc prefixes	Does the model expect special formatting?
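
For the "Dimension" row, a back-of-the-envelope estimate is often enough to decide whether storage matters. The helper below is a sketch with a hypothetical name; it assumes dense float32 vectors (4 bytes per value) and counts only the raw vectors, not ANN graph links, metadata, or replicas. The dimensions 384 and 1536 are used purely as illustrative sizes.

```python
def index_size_gib(num_vectors, dim, bytes_per_value=4):
    """Raw size of a dense vector index in GiB, excluding index overhead."""
    return num_vectors * dim * bytes_per_value / 2**30

small = index_size_gib(1_000_000, 384)    # about 1.43 GiB
large = index_size_gib(1_000_000, 1536)   # about 5.72 GiB
```

Memory scales linearly with dimension, so a 4x wider embedding costs 4x the RAM for the same corpus.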

Typical starting options:
- OpenAI text-embedding models for fast API baselines
- BGE / e5 / sentence-transformers for open-source baselines
- Cohere for multilingual or retrieval-focused comparisons

See explainer §3.4.

Section 8: Vector search internals

Exact search

Compare query vector to every stored vector
Best recall
Too slow at large scale
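
A brute-force scan is short enough to write out, which also makes the cost visible: one dot product per stored vector, per query. The function name `exact_top_k` is hypothetical, and dot product is assumed as the score (appropriate when vectors are normalized).

```python
import heapq

def exact_top_k(query, vectors, k=3):
    """Score every stored vector against the query and keep the k best."""
    scored = ((sum(q * v for q, v in zip(query, vec)), idx)
              for idx, vec in enumerate(vectors))
    return heapq.nlargest(k, scored)  # list of (score, index), best first

vectors = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
top = exact_top_k([1.0, 0.0], vectors, k=2)
top_ids = [idx for _, idx in top]  # [0, 2]
```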

HNSW

Graph-based ANN
Fast and high recall
Common production default
Key knobs: M, ef_construction, ef_search

IVF

Cluster vectors into buckets
Search likely buckets only
Faster than exact search; often lower recall than HNSW at similar latency
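
The bucket idea, and why it can cost recall, shows up even in a toy version. The sketch below is an illustration only: centroids are supplied by hand rather than trained with k-means, the function names are hypothetical, and `nprobe` borrows the name FAISS uses for "how many buckets to scan".

```python
def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def build_ivf(vectors, centroids):
    """Assign each vector index to the bucket of its nearest centroid."""
    buckets = {c: [] for c in range(len(centroids))}
    for idx, vec in enumerate(vectors):
        nearest = min(range(len(centroids)), key=lambda c: sq_dist(vec, centroids[c]))
        buckets[nearest].append(idx)
    return buckets

def ivf_search(query, vectors, centroids, buckets, k=1, nprobe=1):
    """Scan only the nprobe buckets whose centroids are closest to the query."""
    probe = sorted(range(len(centroids)), key=lambda c: sq_dist(query, centroids[c]))[:nprobe]
    candidates = [idx for c in probe for idx in buckets[c]]
    return sorted(candidates, key=lambda idx: sq_dist(query, vectors[idx]))[:k]

centroids = [[0.0, 0.0], [10.0, 0.0]]
vectors = [[1.0, 0.0], [4.9, 0.0], [9.0, 0.0], [5.1, 0.0]]
buckets = build_ivf(vectors, centroids)  # {0: [0, 1], 1: [2, 3]}

query = [5.05, 0.0]
narrow = ivf_search(query, vectors, centroids, buckets, k=2, nprobe=1)  # [3, 2]
wide = ivf_search(query, vectors, centroids, buckets, k=2, nprobe=2)    # [3, 1]
```

The query sits near a bucket boundary. With `nprobe=1` the true second-nearest neighbor (index 1) is never scanned because it lives in the other bucket; probing both buckets recovers it at the cost of scanning more vectors.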

See explainer §3.5.

Section 9: RAG pipeline failure map

Stage	Typical failure
Query	ambiguous user wording
Embedding	wrong model or query formatting
Retrieval	relevant chunk not in top-k
Rerank	skipped or too shallow
Prompt augmentation	too much noisy context
Generation	model invents unsupported details
Evaluation	team tracks fluency, not grounding

See explainer §4.2-§4.8.

Section 10: Evaluation quick reference

Retrieval metrics

Recall@k: what fraction of the relevant chunks appear somewhere in the top-k?
MRR: how early does the first relevant chunk appear (1/rank, averaged over queries)?
NDCG: do highly relevant chunks appear near the top, compared with the ideal ordering?
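
The three metrics are small enough to compute by hand for one query. The functions below are a sketch with hypothetical names; `ndcg_at_k` uses the linear-gain form of DCG (some libraries use 2^gain - 1 instead), and MRR is simply the mean of `reciprocal_rank` over a set of queries. The chunk ids and relevance grades are invented for the example.

```python
import math

def recall_at_k(ranked, relevant, k):
    """Fraction of the relevant ids that appear in the top-k results."""
    return len(set(ranked[:k]) & set(relevant)) / len(relevant)

def reciprocal_rank(ranked, relevant):
    """1 / rank of the first relevant id, or 0.0 if none was retrieved."""
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0

def ndcg_at_k(ranked, gains, k):
    """DCG of this ranking divided by the DCG of the ideal ordering."""
    def dcg(values):
        return sum(g / math.log2(pos + 1) for pos, g in enumerate(values, start=1))
    actual = dcg([gains.get(doc_id, 0) for doc_id in ranked[:k]])
    ideal = dcg(sorted(gains.values(), reverse=True)[:k])
    return actual / ideal if ideal else 0.0

ranked = ["c3", "c1", "c7", "c2"]   # retriever output, best first
gains = {"c1": 3, "c2": 1}          # graded relevance labels
relevant = set(gains)

r3 = recall_at_k(ranked, relevant, k=3)   # 0.5: only c1 is in the top 3
rr = reciprocal_rank(ranked, relevant)    # 0.5: first hit at rank 2
ndcg = ndcg_at_k(ranked, gains, k=4)      # about 0.64
```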

Generation metrics

Faithfulness: answer supported by retrieved evidence
Answer relevance: answer actually addresses the question
Context precision: retrieved context is mostly relevant
Context recall: retrieved context covers the needed evidence

RAGAS

A practical framework for evaluating RAG systems with LLM-assisted metrics. Use it as a starting point, not the final judge. Human review still matters.

See explainer §5.1-§5.6.

Reading list

02_explainer.md — chapters 1-5
Lewis et al. (2020), RAG
Reimers and Gurevych (2019), Sentence-BERT
Malkov and Yashunin (2018), HNSW
Johnson et al. (2017), FAISS
RAGAS docs / paper

Reference material

YouTube

What is Retrieval-Augmented Generation (RAG)? — short overview of grounding and retrieval.
Learn RAG From Scratch - Python AI Tutorial from a LangChain Engineer — practical build walkthrough.

Blogs

Retrieval-Augmented Generation - Pinecone Learn — strong conceptual overview of vector search and RAG architecture.
Deconstructing RAG — useful taxonomy of retrieval and advanced patterns.

Self-check

For full interview framing, see 02_explainer.md §6.3.

Why do static model weights fail on private company facts? (§1.2)
What makes a chunk "good" for retrieval? (§2.2, §2.6)
Why use overlap, and why not make overlap huge? (§2.3)
Semantic vs recursive splitting — when each? (§2.4)
What do embeddings miss even when they capture meaning well? (§3.1)
Cosine vs dot product — when same ranking? (§3.3)
HNSW vs IVF — which one is the safer default? Why? (§3.5)
Why does reranking often improve precision? (§4.5)
Recall@k vs MRR vs NDCG — what does each reward? (§5.2-§5.4)
RAGAS helps with what, exactly? What can it still miss? (§5.6)