00. RAG Fundamentals — The Five-Year-Old Version
Module 06 covered changing the model. This module covers giving the model better evidence instead.
Imagine a very smart research assistant sitting in a giant library. A person walks in and asks, "What did our company say about enterprise refunds last quarter?" The assistant does not try to answer everything from memory. That would be foolish. The library is huge. Memories go stale. Instead, the assistant uses five simple helpers.
First, the question arrives. The librarian reads it carefully. Then the librarian makes an index card for the question. This card is not normal text. It is a compact meaning-signature. It says, roughly, "questions about refund policy, enterprise customers, last quarter."
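To make the index card concrete, here is a minimal Python sketch. It is a toy under stated assumptions: `toy_embed` and `DIM` are hypothetical names, and hashing words into buckets is only a stand-in for a trained embedding model, which would place texts with similar meaning near each other even when they share no words. The sketch shows only the shape of the idea: text goes in, a fixed-length list of numbers comes out.

```python
import hashlib
import math
import re

DIM = 16  # illustrative; real embedding models use far more dimensions


def toy_embed(text: str, dim: int = DIM) -> list[float]:
    """Map text to a fixed-length unit vector by hashing each word into a bucket."""
    vec = [0.0] * dim
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        # md5 gives a hash that is stable across runs (unlike built-in hash()).
        bucket = int(hashlib.md5(word.encode()).hexdigest(), 16) % dim
        vec[bucket] += 1.0
    norm = math.sqrt(sum(x * x for x in vec))
    # Normalise to length 1 so later comparisons depend on direction only.
    return [x / norm for x in vec] if norm else vec


card = toy_embed("What did our company say about enterprise refunds last quarter?")
print(len(card))  # the card always has DIM numbers, whatever the question length
```

Whatever the length of the question, the card has the same number of entries, which is what lets the bookshelf compare any question with any chunk.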
Now the librarian goes to the bookshelf. This bookshelf is magical. It is not organized alphabetically. It is organized by meaning. Chunks about refunds sit near other refund-policy chunks. Chunks about GPU kernels sit far away. The librarian pulls the five most relevant books from nearby places. Not the whole library. Only the most likely evidence.
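"Organized by meaning" can be sketched as a nearest-neighbour search over vectors. The example below is an illustration, not a production retriever: the names `cosine`, `top_k`, and `shelf` are hypothetical, the two-dimensional vectors are hand-made, and the search is exact brute force, where a real vector store would typically use an approximate index.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity: 1.0 means same direction, 0.0 means unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def top_k(query_vec: list[float], shelf: dict[str, list[float]], k: int = 5):
    """Return up to k (score, chunk_id) pairs, best match first."""
    scored = [(cosine(query_vec, vec), chunk_id) for chunk_id, vec in shelf.items()]
    return sorted(scored, reverse=True)[:k]


# Hand-made vectors: axis 0 is roughly "refund-ness", axis 1 "GPU-ness".
shelf = {
    "refund-policy": [0.9, 0.1],
    "refund-faq": [0.8, 0.3],
    "gpu-kernels": [0.05, 0.95],
}
query = [1.0, 0.2]  # a refund-flavoured question

results = top_k(query, shelf, k=2)
for score, chunk_id in results:
    print(f"{chunk_id}: {score:.2f}")
```

With `k=2`, the two refund chunks come back and the GPU chunk is left on the shelf, because its vector points in a different direction from the query.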
Then the librarian opens those books at the key pages. These pages go onto the reading desk. The reading desk is small. Only a limited number of pages fit. So the librarian must choose carefully.
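The small reading desk can be pictured as a token budget. The sketch below assumes a greedy policy (take the best-ranked pages that still fit), which is one reasonable choice among several. `pack_desk` is a hypothetical helper, and counting whitespace-separated words is a rough stand-in for a real tokenizer.

```python
def pack_desk(ranked_chunks, budget_tokens, count_tokens=lambda t: len(t.split())):
    """Greedily keep the best-ranked chunks that still fit the budget."""
    desk, used = [], 0
    for chunk in ranked_chunks:  # assumed to be sorted best-first
        cost = count_tokens(chunk)
        if used + cost > budget_tokens:
            continue  # does not fit; a shorter chunk further down still might
        desk.append(chunk)
        used += cost
    return desk, used


ranked = [
    "Enterprise annual plans may request a refund within 30 days of renewal.",
    "Refunds are not available after the first 5,000 API calls.",
    "GPU pricing follows per-second billing with no minimum commitment.",
]
desk, used = pack_desk(ranked, budget_tokens=25)
print(len(desk), used)
```

With a budget of 25 words, the first two pages fit and the third is left off the desk.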
Finally, the librarian writes the answer brief. The brief contains the user question, the selected evidence, and instructions like "answer only from these pages" and "if evidence is missing, say you do not know." That brief goes to the writer — the language model.
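The answer brief is, in the end, a string. A minimal sketch of assembling one follows. The function name `build_brief`, the page numbering, and the exact wording of the instructions are illustrative assumptions, not a prescribed template.

```python
def build_brief(question: str, pages: list[str]) -> str:
    """Combine instructions, numbered evidence, and the question into one prompt."""
    numbered = "\n".join(f"[{i}] {page}" for i, page in enumerate(pages, start=1))
    return (
        "Answer only from the numbered pages below.\n"
        "If the evidence is missing, say you do not know.\n\n"
        f"Pages:\n{numbered}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


brief = build_brief(
    "When can enterprise customers get a refund?",
    [
        "Enterprise annual plans may request a refund within 30 days of renewal.",
        "Refunds are not available after the first 5,000 API calls.",
    ],
)
print(brief)
```

Numbering the pages is a small design choice that makes it possible to ask the model to cite which page supports each claim.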
Now the model is no longer answering blindly. It is answering with pages open in front of it. That is RAG.
Here is the important warning. If the librarian picks the wrong books, the answer goes wrong. If the bookshelf is badly organized, retrieval goes wrong. If the index cards are poor, nearby meaning gets distorted. If the reading desk is overcrowded, signal gets buried. If the answer brief is weak, the writer improvises.
RAG is not one magic box. It is a chain of small engineering decisions, each one capable of failing and each one debuggable. That is why good RAG work feels like systems engineering, not just prompt writing.
A tiny worked example
Three chunks sit on the bookshelf:
- Chunk A: "Enterprise annual plans may request a refund within 30 days of renewal."
- Chunk B: "GPU pricing follows per-second billing with no minimum commitment."
- Chunk C: "Refunds are not available after the first 5,000 API calls."
User asks: "When can enterprise customers get a refund?"
    question ──→ index card ──→ bookshelf search
                                       │
                           ┌───────────┼───────────┐
                           ▼           ▼           ▼
                        Chunk A     Chunk C     Chunk B
                      score: 0.91 score: 0.87 score: 0.23
                           │           │
                           ▼           ▼
                    ┌─ reading desk ─────────────────┐
                    │ A: refund within 30 days       │
                    │ C: not after 5,000 API calls   │
                    └────────────┬───────────────────┘
                                 ▼
                           answer brief ──→ LLM ──→ grounded answer
Chunk B, about GPU pricing, sits far away on the meaning map (score 0.23 against 0.91 and 0.87), so it never reaches the reading desk. The model answers from A and C only. That is the whole picture.
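The worked example can be run end to end as a toy. The sketch below assumes a word-count "embedding" with crude plural stripping in place of a learned model, so its scores will not match the 0.91 / 0.87 / 0.23 figures in the diagram; only the ranking (A, then C, then B) is meant to carry over. `index_card`, `cosine`, and `retrieve` are hypothetical names.

```python
import math
import re
from collections import Counter

CHUNKS = {
    "A": "Enterprise annual plans may request a refund within 30 days of renewal.",
    "B": "GPU pricing follows per-second billing with no minimum commitment.",
    "C": "Refunds are not available after the first 5,000 API calls.",
}


def index_card(text: str) -> Counter:
    """Toy embedding: word counts, with a trailing 's' stripped from longer words."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return Counter(w[:-1] if len(w) > 3 and w.endswith("s") else w for w in words)


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)  # Counter returns 0 for missing words
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(question: str, k: int = 2) -> list[tuple[str, float]]:
    """Score every chunk against the question and keep the k best."""
    q = index_card(question)
    scored = [(cid, cosine(q, index_card(text))) for cid, text in CHUNKS.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


question = "When can enterprise customers get a refund?"
desk = retrieve(question)  # the reading desk holds the top two chunks
brief = (
    "Answer only from these pages.\n"
    + "\n".join(f"{cid}: {CHUNKS[cid]}" for cid, _ in desk)
    + f"\nQuestion: {question}"
)
print(brief)
```

Chunk B shares no words with the question, so its toy score is zero and it stays off the desk. A learned embedding would reach the same ranking through meaning rather than word overlap.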
The placeholders you will see called back
| Placeholder | Meaning |
|---|---|
| The librarian | The retriever — finds relevant chunks for the query. |
| The bookshelf | The vector store — all chunks indexed by meaning. |
| The index card | The embedding — a compact vector encoding of meaning. |
| The reading desk | The context window — limited space for evidence in the prompt. |
| The answer brief | The augmented prompt — question + evidence + instructions. |
Where you will see this in real products
The same five helpers — librarian, bookshelf, index card, reading desk, answer brief — sit inside almost every grounded-AI product shipping today. The names change. The shape does not.
- Perplexity AI — searches the web, picks the few most relevant pages, then writes a cited answer. Pure RAG, consumer-facing.
- ChatGPT with browsing / Connectors — same pattern, except the bookshelf is the live internet or your connected docs.
- Anthropic Claude with Projects — your uploaded files become the bookshelf; the librarian fetches per turn.
- Google NotebookLM — your source documents form a private bookshelf; answers are grounded only in those.
- GitHub Copilot Chat — your repository is the bookshelf; the librarian fetches the right files before suggesting code.
- Cursor / Windsurf — same pattern for code editors, with codebase-aware retrieval.
- Notion AI Q&A — your workspace pages are the bookshelf; the answer cites the notes.
- Glean — enterprise search across Slack, Drive, Confluence, Jira — one bookshelf made from many sources.
- Microsoft Copilot for M365 — the Microsoft Graph is the bookshelf; emails, docs, chats all retrievable.
- Slack AI — channel history is the bookshelf; the librarian summarises with citations.
- Intercom Fin — your help-centre articles are the bookshelf; the chatbot answers customer tickets from them.
- Zendesk AI agents — the same pattern for support deflection.
- Salesforce Einstein Copilot — CRM data is the bookshelf; answers reference customer records.
- HubSpot Breeze — marketing and CRM data as the bookshelf.
- Hebbia, Harvey, Casetext / CoCounsel — domain RAG for finance and legal, with contract-aware bookshelves.
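- BloombergGPT, JPMorgan DocLLM — finance-domain language models rather than retrieval products in themselves; listed because finance bookshelves built around such models need chunking that respects filings and tables.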
- AWS Bedrock Knowledge Bases — managed bookshelf + librarian as a cloud service.
- Azure AI Search + Azure OpenAI — Microsoft's managed retrieval stack.
- Google Vertex AI Search — Google's managed equivalent.
- OpenAI Assistants File Search — bookshelf-as-a-service inside the OpenAI platform.
- Cohere Coral / Compass — RAG-first products from Cohere.
- Vectara — RAG-as-a-service including citation and faithfulness scoring.
- Pinecone Assistant — bookshelf + librarian + answer brief, packaged.
- LlamaIndex + LangChain — open-source frameworks that wire the same five helpers in code.
The pattern is close to universal. Once you see the five helpers, you see them everywhere.
Memory map
| Concept | Prerequisite | Pressure family | Recurs later as | Layer touched |
|---|---|---|---|---|
| Chunking | document parsing | context vs precision | chunk-strategy choices, parent-document patterns | data → index |
| Embedding | chunking | geometry-of-meaning | embedder swap, dimension/cost trade-off, Matryoshka representation learning (MRL) | index → ANN |
| Vector store + ANN | embedding | latency, recall, filter scope | HNSW vs IVF vs DiskANN, filtered search | index → query |
| Reranking | retrieval candidates | precision-after-recall | cross-encoder, ColBERT, LLM-as-reranker | retrieval → desk |
| Prompt/answer brief | retrieved chunks | grounded generation, abstention | citation, refusal, prefix caching | desk → model |
| Eval (retrieval + faithfulness) | golden set | unsolved-eval, drift detection | per-class metrics, LLM-as-judge | output → audit |
The five helpers — librarian, bookshelf, index card, reading desk, answer brief — reappear in every chapter under new pressure. Use this table when a later page introduces a new term and ask: which helper does it belong to, and what failure did it solve?
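The Eval row depends on a golden set: questions paired with the chunks that should come back. As a minimal sketch of two common retrieval metrics, recall@k and mean reciprocal rank (MRR), the code below uses a made-up two-question golden set. The function names and the data are illustrative assumptions.

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant chunks that appear in the top k results."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def reciprocal_rank(retrieved: list[str], relevant: set[str]) -> float:
    """1 / rank of the first relevant chunk, or 0.0 if none was retrieved."""
    for rank, chunk_id in enumerate(retrieved, start=1):
        if chunk_id in relevant:
            return 1.0 / rank
    return 0.0


# Hypothetical golden set: what the librarian returned vs. what it should find.
golden = [
    {"retrieved": ["A", "C", "B"], "relevant": {"A", "C"}},
    {"retrieved": ["B", "A", "C"], "relevant": {"C"}},
]

mean_recall = sum(
    recall_at_k(g["retrieved"], g["relevant"], k=2) for g in golden
) / len(golden)
mrr = sum(reciprocal_rank(g["retrieved"], g["relevant"]) for g in golden) / len(golden)
print(f"recall@2 = {mean_recall:.2f}, MRR = {mrr:.2f}")
```

The second question drags both numbers down because its only relevant chunk sits at rank 3. That points at the librarian or the index card, not at the answer brief.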
Top resources
- RAG paper (Lewis et al.) — the original Retrieval-Augmented Generation paper.
- Pinecone RAG Guide — practical end-to-end walkthrough.
- LangChain RAG Tutorial — code-first RAG with LangChain.
- RAGAS Documentation — the evaluation framework every RAG team should know.
- Sentence Transformers — open-source embeddings that power many RAG systems.
- Cohere Reranking — cross-encoder reranking explained with API examples.
What's coming
- 01-confident-wrong-answer.md — why closed-book LLMs guess on private data.
- 02-open-book-answering.md — what RAG changes: evidence-based generation.
- 03-chunking-tradeoffs.md — chunk size and overlap: precision vs context.
- 04-chunking-strategies.md — recursive, semantic, and document-aware splitting.
- 05-embeddings.md — index cards for meaning: what embeddings capture.
- 06-similarity-and-models.md — cosine, dot product, choosing embedding models.
- 07-vector-stores-ann.md — HNSW, IVF, and approximate nearest neighbors.
- 08-rag-pipeline.md — the full pipeline from query to answer.
- 09-query-and-retrieval.md — query understanding and retrieval failures.
- 10-reranking.md — sharpening the top of the list.
- 11-prompt-augmentation.md — building the answer brief with citations and abstention.
- 12-retrieval-metrics.md — recall@k, MRR, NDCG.
- 13-faithfulness-ragas.md — faithfulness, relevance, and the RAGAS framework.
- 14-honest-admission.md — multi-hop limits and what RAG does not solve.
Bridge. The first thing that breaks is obvious in hindsight — a model that answers from memory on private data will eventually invent facts. → 01-confident-wrong-answer.md