03. Week 7 — RAG Fundamentals

For deep understanding see 02_explainer.md — narrative with worked examples, diagrams, failure modes, and retrieval prompts. This file is the quick-reference glossary: definitions, formulas, lookup tables, and implementation defaults.

Section 1 — Why RAG exists

LLMs store broad world knowledge in their weights. That knowledge is:

  • static (frozen at training time),
  • hard to update,
  • weak on private company data,
  • and risky when the model guesses instead of admitting a gap.

RAG = retrieve relevant external context at query time, then generate an answer grounded in that context.

See explainer §1.1-§1.4.

Section 2 — The core pipeline

  1. user query
  2. embed query
  3. retrieve top-k chunks
  4. (optional) rerank
  5. augment prompt with evidence
  6. generate answer with citations / abstain

See explainer §4.1.
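
The pipeline steps can be sketched end to end in a few lines. This is a minimal illustration under loud assumptions: `embed` is a toy bag-of-words stand-in for a real embedding model, reranking is skipped, and the sketch stops at prompt construction instead of calling an LLM. The names `embed`, `retrieve`, and `build_prompt` are hypothetical.

```python
import math
import re
from collections import Counter

def embed(text):
    """Toy stand-in for an embedding model (assumption): lowercase word counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query, chunks, k=2):
    """Embed the query, score every chunk, return the top-k chunks."""
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]

def build_prompt(query, evidence):
    """Augment the prompt with numbered evidence and an abstain instruction."""
    numbered = "\n".join(f"[{i}] {c}" for i, c in enumerate(evidence, start=1))
    return (
        "Answer using only the evidence below and cite sources like [1]. "
        "If the evidence is insufficient, say you do not know.\n\n"
        f"Evidence:\n{numbered}\n\nQuestion: {query}"
    )

chunks = [
    "Refunds are issued within 14 days of a return request.",
    "GPU kernel launch errors usually indicate a bad grid size.",
    "Refunds go back to the original payment method.",
]
query = "How long do refunds take?"
top = retrieve(query, chunks, k=2)
prompt = build_prompt(query, top)
print(prompt)
```

In a real system the `prompt` string would be sent to the generator model, and `embed` would be replaced by the same embedding model used to index the corpus.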

Section 3 — Chunking cheat sheet

Documents rarely fit whole into the model's context window, and retrieval works better on focused passages than on entire files. So we split them into chunks.

| Strategy | Best for | Strength | Weakness |
| --- | --- | --- | --- |
| Fixed-size | quick baseline | easy, reproducible | ignores meaning boundaries |
| Recursive | mixed markdown / docs | respects headings, then paragraphs, then sentences | still approximate |
| Semantic | long prose | follows topic shifts | slower, more complex |
| Document-aware | HTML, markdown, code | preserves structure | parser effort |
| Hierarchical | long reports | coarse + fine retrieval | more plumbing |
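
To make the recursive strategy concrete, here is a minimal sketch: split on the coarsest separator first and fall back to finer ones only for pieces that are still too long. Assumptions: size is measured in characters rather than tokens, the separator list is illustrative, and `recursive_split` is a hypothetical helper, not any library's implementation. Unlike production splitters, it drops the separators and does not merge small neighbours back together.

```python
def recursive_split(text, max_chars=200, separators=("\n# ", "\n\n", ". ")):
    """Split on headings, then paragraphs, then sentences, as needed."""
    text = text.strip()
    if len(text) <= max_chars:
        return [text] if text else []
    if not separators:
        # No separators left: fall back to a hard character cut.
        return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]
    head, rest = separators[0], separators[1:]
    chunks = []
    for piece in text.split(head):
        # Pieces that already fit are kept; longer ones recurse with finer separators.
        chunks.extend(recursive_split(piece, max_chars, rest))
    return chunks

doc = (
    "# Refunds\n\nRefunds are issued within 14 days. Keep your receipt.\n\n"
    "# Shipping\n\nOrders ship in 2 days. Tracking is emailed."
)
pieces = recursive_split(doc, max_chars=40)
for p in pieces:
    print(repr(p))
```

The tiny `max_chars=40` is only there to force all three separator levels to fire on a short example.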

Chunk size defaults

| Corpus type | Starting size | Overlap |
| --- | --- | --- |
| Product docs | 300-500 tokens | 10-20% |
| Blog / prose | 400-700 tokens | 10-15% |
| API docs | section-based first | small overlap |
| Code | function / class aware | minimal overlap |
| Tables / contracts | structure-aware first | depends on layout |

See explainer §2.1-§2.6.
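
The size and overlap defaults translate into a sliding window. The sketch below assumes the text is already tokenized into a list (a real pipeline would use the embedding model's tokenizer); `chunk_fixed` and the 400/60 values are illustrative choices within the ranges in the table, not prescribed settings.

```python
def chunk_fixed(tokens, size=400, overlap=60):
    """Slide a window of `size` tokens, stepping by `size - overlap`."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be >= 0 and smaller than size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # this window already reached the end of the document
    return chunks

# Placeholder tokens (assumption); 60 / 400 = 15% overlap.
tokens = [f"t{i}" for i in range(1000)]
chunks = chunk_fixed(tokens, size=400, overlap=60)
print([len(c) for c in chunks])
```

Each chunk repeats the last `overlap` tokens of the previous one, so a sentence cut at a boundary still appears whole in at least one chunk. Larger overlap raises index size roughly in proportion.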

Section 4 — Chunking trade-offs

| If chunks are too small | If chunks are too large |
| --- | --- |
| Great local precision | Better local context |
| Loses surrounding evidence | Pulls irrelevant text |
| More index rows | Fewer but noisier rows |
| Weak for definitions split across sentences | Weak for exact retrieval |

Rule of thumb: a good chunk should often answer a narrow question by itself.

See explainer §2.2.

Section 5 — Embeddings

An embedding converts text into a dense vector. Nearby vectors usually mean semantically related text.

"refund policy"  ───── close to ─────  "returns and reimbursement"
"refund policy"  ───── far from  ─────  "GPU kernel launch error"

What embeddings capture well:

  • semantic similarity,
  • paraphrases,
  • related topics,
  • intent-level closeness.

What embeddings capture poorly:

  • exact numbers,
  • negation,
  • access permissions,
  • multi-hop logic,
  • rare domain jargon without good training coverage.

See explainer §3.1-§3.2.

Section 6 — Similarity metrics

| Metric | Intuition | When used |
| --- | --- | --- |
| Cosine similarity | angle between vectors | default for text retrieval |
| Dot product | angle + magnitude | fine when the model was trained for it; with L2-normalized vectors it ranks identically to cosine |
| L2 distance | geometric distance | some ANN libraries / exact search setups |

If embeddings are L2-normalized, cosine similarity and dot product produce the same ranking.
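
A small check of that claim, using hand-picked 2-D vectors (illustrative data, not real embeddings). Raw dot product rewards the long vector; after L2 normalization the dot-product ranking matches the cosine ranking.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def l2_normalize(v):
    """Scale v to unit length."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

def rank(query, docs, score):
    """Indices of docs, best score first."""
    return sorted(range(len(docs)), key=lambda i: score(query, docs[i]), reverse=True)

query = [1.0, 1.0]
docs = [[10.0, 0.0], [1.0, 1.0], [1.0, 3.0]]  # doc 0 is long but points elsewhere

raw_dot_rank = rank(query, docs, dot)
cosine_rank = rank(query, docs, cosine)
unit_dot_rank = rank(l2_normalize(query), [l2_normalize(d) for d in docs], dot)

print(raw_dot_rank, cosine_rank, unit_dot_rank)
```

For unit vectors the squared L2 distance equals 2 − 2(a·b), so sorting by ascending L2 distance gives the same order as well.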

See explainer §3.3.

Section 7 — Choosing an embedding model

Choose by quality on your corpus, not by brand alone.

| Factor | What to ask |
| --- | --- |
| Domain fit | Does it understand your jargon? |
| Language support | English only, or multilingual? |
| Cost | What is price per million tokens or per document batch? |
| Latency | Can you meet your p95 target? |
| Dimension | Does storage / RAM matter? |
| Hosting | API or self-hosted? |
| Query/doc prefixes | Does the model expect special formatting? |

Typical starting options:

  • OpenAI text-embedding models for fast API baselines
  • BGE / E5 / sentence-transformers for open-source baselines
  • Cohere for multilingual or retrieval-focused comparisons

See explainer §3.4.
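
For the "Dimension" row, a back-of-envelope estimate is usually enough. The sketch assumes float32 storage (4 bytes per value) and counts raw vectors only; index structures and metadata add overhead on top. The corpus size and the two dimensions are illustrative, and `index_size_gib` is a hypothetical helper.

```python
def index_size_gib(num_vectors, dim, bytes_per_value=4):
    """Raw vector storage in GiB; ANN graph and metadata overhead are extra."""
    return num_vectors * dim * bytes_per_value / 2**30

small = index_size_gib(1_000_000, 384)    # illustrative smaller dimension
large = index_size_gib(1_000_000, 1536)   # illustrative larger dimension
print(f"{small:.2f} GiB vs {large:.2f} GiB")
```

Storage scales linearly with dimension, so a 4x wider embedding costs about 4x the RAM for the same corpus.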

Section 8 — Vector search internals

Brute force (exact search)

  • Compare query vector to every stored vector
  • Best recall
  • Too slow at large scale
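
A minimal sketch of exact search, assuming unit-length vectors scored by dot product. Every query touches every stored vector, which is why cost grows linearly with corpus size. `brute_force_search` and the sample vectors are hypothetical.

```python
import heapq

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def brute_force_search(query, vectors, k=2):
    """Score every stored vector, keep the k best: O(n * d) work per query."""
    scored = ((dot(query, v), i) for i, v in enumerate(vectors))
    return heapq.nlargest(k, scored)  # list of (score, index), best first

vectors = [[1.0, 0.0], [0.8, 0.6], [0.0, 1.0], [-1.0, 0.0]]
hits = brute_force_search([1.0, 0.0], vectors, k=2)
print(hits)
```

Because nothing is skipped, this result is also the ground truth against which approximate indexes are measured.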

HNSW

  • Graph-based ANN
  • Fast and high recall
  • Common production default
  • Key knobs: M, ef_construction, ef_search

IVF

  • Cluster vectors into buckets
  • Search likely buckets only
  • Faster than brute force; often lower recall than HNSW at comparable speed
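
The bucket idea, and why it can lose recall, shows up even in a toy example. Assumptions: the centroids are hand-picked here (real IVF indexes learn them, typically with k-means), the data is 2-D, and the helper names are hypothetical. The `nprobe` argument is named after the similarly named FAISS setting for how many buckets to scan.

```python
def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest_centroids(v, centroids, n):
    """Indices of the n centroids closest to v."""
    order = sorted(range(len(centroids)), key=lambda c: sq_dist(v, centroids[c]))
    return order[:n]

def build_ivf(vectors, centroids):
    """Assign each vector id to the bucket of its nearest centroid."""
    buckets = {c: [] for c in range(len(centroids))}
    for i, v in enumerate(vectors):
        buckets[nearest_centroids(v, centroids, 1)[0]].append(i)
    return buckets

def ivf_search(query, vectors, centroids, buckets, k=1, nprobe=1):
    """Scan only the nprobe buckets whose centroids are closest to the query."""
    probed = nearest_centroids(query, centroids, nprobe)
    candidates = [i for c in probed for i in buckets[c]]
    return sorted(candidates, key=lambda i: sq_dist(query, vectors[i]))[:k]

centroids = [[0.0, 0.0], [10.0, 0.0]]            # hand-picked (assumption)
vectors = [[1.0, 0.0], [4.9, 0.0], [5.2, 0.0], [9.0, 0.0]]
buckets = build_ivf(vectors, centroids)
query = [5.02, 0.0]                               # sits near the bucket boundary

one_probe = ivf_search(query, vectors, centroids, buckets, k=1, nprobe=1)
two_probes = ivf_search(query, vectors, centroids, buckets, k=1, nprobe=2)
print(buckets, one_probe, two_probes)
```

With one probe the search scans only the second bucket and misses the true nearest neighbour (vector 1), which lives just across the boundary. Raising `nprobe` recovers it at the cost of scanning more vectors.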

See explainer §3.5.

Section 9 — RAG pipeline failure map

| Stage | Typical failure |
| --- | --- |
| Query | ambiguous user wording |
| Embedding | wrong model or query formatting |
| Retrieval | relevant chunk not in top-k |
| Rerank | skipped or too shallow |
| Prompt augmentation | too much noisy context |
| Generation | model invents unsupported details |
| Evaluation | team tracks fluency, not grounding |

See explainer §4.2-§4.8.

Section 10 — Evaluation quick reference

Retrieval metrics

  • Recall@k: did we retrieve relevant chunks somewhere in top-k?
  • MRR: how early does the first relevant chunk appear?
  • NDCG: do highly relevant chunks appear near the top?
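
The three retrieval metrics can be computed for a single query as sketched below. Assumptions: chunk ids and relevance grades are made-up labels, NDCG uses the linear-gain form of DCG (one common variant), and the function names are hypothetical. MRR is the mean of `reciprocal_rank` over all queries in the evaluation set.

```python
import math

def recall_at_k(ranked, relevant, k):
    """Fraction of the relevant ids that appear in the top k."""
    return len(set(ranked[:k]) & set(relevant)) / len(relevant)

def reciprocal_rank(ranked, relevant):
    """1 / rank of the first relevant id, or 0.0 if none was retrieved."""
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0

def ndcg_at_k(ranked, gains, k):
    """gains maps chunk id -> graded relevance; unlisted ids count as 0."""
    dcg = sum(gains.get(d, 0) / math.log2(i + 1)
              for i, d in enumerate(ranked[:k], start=1))
    ideal = sorted(gains.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(i + 1) for i, g in enumerate(ideal, start=1))
    return dcg / idcg if idcg else 0.0

ranked = ["c7", "c2", "c9", "c4"]          # retriever output, best first
gains = {"c2": 3, "c4": 1, "c5": 2}        # labelled relevance (assumption)
relevant = set(gains)

r3 = recall_at_k(ranked, relevant, 3)
rr = reciprocal_rank(ranked, relevant)
ndcg = ndcg_at_k(ranked, gains, 4)
print(r3, rr, ndcg)
```

Here recall@3 is low because only one of three relevant chunks is retrieved, the reciprocal rank is 0.5 because the first hit sits at position 2, and NDCG is penalized both for the missing chunk and for the irrelevant chunk at rank 1.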

Generation metrics

  • Faithfulness: answer supported by retrieved evidence
  • Answer relevance: answer actually addresses the question
  • Context precision: retrieved context is mostly relevant
  • Context recall: retrieved context covers the needed evidence

RAGAS

A practical framework for evaluating RAG systems with LLM-assisted metrics. Use it as a starting point, not the final judge. Human review still matters.

See explainer §5.1-§5.6.

Reading list

  1. 02_explainer.md — chapters 1-5
  2. Lewis et al. (2020), RAG
  3. Reimers and Gurevych (2019), Sentence-BERT
  4. Malkov and Yashunin (2018), HNSW
  5. Johnson et al. (2017), FAISS
  6. RAGAS docs / paper

Self-check

For full interview framing, see 02_explainer.md §6.3.

  1. Why do static model weights fail on private company facts? (§1.2)
  2. What makes a chunk "good" for retrieval? (§2.2, §2.6)
  3. Why use overlap, and why not make overlap huge? (§2.3)
  4. Semantic vs recursive splitting — when each? (§2.4)
  5. What do embeddings miss even when they capture meaning well? (§3.1)
  6. Cosine vs dot product — when same ranking? (§3.3)
  7. HNSW vs IVF — which one is the safer default? Why? (§3.5)
  8. Why does reranking often improve precision? (§4.5)
  9. Recall@k vs MRR vs NDCG — what does each reward? (§5.2-§5.4)
  10. RAGAS helps with what, exactly? What can it still miss? (§5.6)