05. Assignment 7: Retrieval-First RAG Baseline
Week 7. Build the retrieval layer properly before building a fancy chatbot.
Required reading first:
02_explainer.md, chapters 2-5. If you cannot explain chunking, embeddings, retrieval metrics, and the pipeline failure modes from explainer §2.1-§5.6, you are not ready to ship this hands-on lab.
Goal
Build a small but defensible RAG baseline over a corpus you choose. You must demonstrate:
- document ingestion
- chunking
- embedding
- vector retrieval
- at least one reranking or retrieval-improvement experiment
- retrieval evaluation on a gold set
Constraints
- You may use pgvector, Qdrant, FAISS, Chroma, or a simple in-memory baseline.
- Do not hide the retrieval logic behind a single framework call without understanding it.
- Keep the corpus small enough to iterate quickly: roughly 100-1000 documents.
- Build the eval harness before polishing UI.
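To keep the retrieval logic visible rather than hidden behind one framework call, the "simple in-memory baseline" option can be sketched in plain Python. This is only an illustration under stated assumptions: the function names `cosine` and `top_k`, the dict-based index layout, and the toy 3-dimensional vectors are invented for the example. In a real run the vectors would come from whichever embedding model you pick.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def top_k(query_vec, index, k=5):
    """Return the k (chunk_id, score) pairs with the highest cosine similarity.

    `index` maps chunk_id -> embedding vector (a hypothetical layout).
    """
    scored = [(chunk_id, cosine(query_vec, vec)) for chunk_id, vec in index.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Toy vectors standing in for real embeddings.
toy_index = {
    "doc1#0": [1.0, 0.0, 0.0],
    "doc1#1": [0.7, 0.7, 0.0],
    "doc2#0": [0.0, 0.0, 1.0],
}
results = top_k([1.0, 0.1, 0.0], toy_index, k=2)
```

A brute-force scan like this is usually fast enough at the 100-1000 document scale suggested above, and it gives you a reference to check a vector database against later.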
Suggested corpora
- Product docs from an open-source tool you actually care about
- Your own notes, blog posts, or markdown docs
- Wikipedia pages inside one narrow topic
- A mini support-doc corpus with FAQs and policies
Required architecture
documents
↓
chunker
↓
embeddings
↓
vector index
↓
user query
↓
query embedding
↓
retrieve top-k
↓
(optional) rerank
↓
answer with citations / abstain
Required deliverables
- ingest.py or notebook: load docs, chunk, embed, index
- search.py: query → top-k retrieval with similarity scores
- eval.py: retrieval metrics on a gold set
- gold_set.json: at least 20 query-to-relevant-chunk mappings
- README.md: corpus, chunk strategy, embedding model, metrics, failures, next steps
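The assignment does not prescribe a schema for gold_set.json. One possible layout, shown here purely as an assumption, is a list of objects with a `query` string and a `relevant_chunk_ids` list. The hypothetical `load_gold_set` helper below sketches how eval.py might parse and sanity-check such a file before computing any metrics.

```python
import json

def load_gold_set(text, min_queries=20):
    """Parse gold-set JSON and check its basic shape.

    Assumed layout (not prescribed by the assignment):
    [{"query": "...", "relevant_chunk_ids": ["doc#0", ...]}, ...]
    """
    entries = json.loads(text)
    if len(entries) < min_queries:
        raise ValueError(
            f"gold set has {len(entries)} queries, need at least {min_queries}"
        )
    for i, entry in enumerate(entries):
        if not entry.get("query", "").strip():
            raise ValueError(f"entry {i} has an empty query")
        if not entry.get("relevant_chunk_ids"):
            raise ValueError(f"entry {i} lists no relevant chunks")
    return entries

# Two made-up entries; a real gold set needs at least 20.
sample = json.dumps([
    {"query": "How do I reset my password?", "relevant_chunk_ids": ["faq#3"]},
    {"query": "What is the refund window?", "relevant_chunk_ids": ["policy#1", "policy#2"]},
])
gold = load_gold_set(sample, min_queries=2)
```

Keying the gold set on chunk ids means the mappings must be regenerated whenever the chunking strategy changes, so some people key on document ids or answer spans instead. Either choice is worth stating in the README.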
Minimum experiments
| Experiment | What to compare | Metric |
|---|---|---|
| Chunk size | 256 vs 512 vs 768 tokens | recall@5 / recall@10 |
| Overlap | 0 vs 50 vs 100 tokens | recall@10 + failure notes |
| Retrieval depth | k = 3 vs 5 vs 10 | recall@k and noise (irrelevant chunks in the top-k) |
| Optional reranker | off vs on | MRR / NDCG |
Run at least two of these.
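The chunk-size and overlap experiments are easier to run when both values are plain parameters of the chunker. A minimal sliding-window sketch follows. It assumes whitespace-separated words as a stand-in for model tokens, so real token counts will differ, and the name `chunk_tokens` is hypothetical.

```python
def chunk_tokens(tokens, size=512, overlap=50):
    """Split a token list into windows of `size` tokens sharing `overlap` tokens."""
    if size <= 0:
        raise ValueError("size must be positive")
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # this window already reaches the end of the document
    return chunks

# Whitespace split is an approximation of real tokenization.
tokens = "a b c d e f g h i j".split()
chunks = chunk_tokens(tokens, size=4, overlap=1)
```

With this shape, a sweep over the table's settings is a loop over `(size, overlap)` pairs that re-ingests the corpus and re-runs the eval each time. Note that this sketch ignores semantic boundaries such as headings and paragraphs, which is exactly the pitfall listed later, so treat it as the baseline to beat.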
Success criteria
- Gold set with at least 20 queries
- recall@10 reported clearly
- At least one ranking-sensitive metric reported: MRR or NDCG
- One concrete failure analysis section in the README
- One honest statement about what the system still cannot do
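The metrics named in these criteria can be computed per query from a ranked list of retrieved chunk ids and the gold set's relevant ids. The sketch below assumes binary relevance (a chunk is either relevant or not), which is one common simplification for NDCG rather than the only definition. The function names and sample ids are illustrative.

```python
import math

def recall_at_k(retrieved, relevant, k):
    """Fraction of relevant chunk ids that appear in the top-k retrieved ids."""
    if not relevant:
        return 0.0
    hits = len(set(retrieved[:k]) & set(relevant))
    return hits / len(set(relevant))

def reciprocal_rank(retrieved, relevant):
    """1 / rank of the first relevant id, or 0.0 if none was retrieved."""
    relevant = set(relevant)
    for rank, chunk_id in enumerate(retrieved, start=1):
        if chunk_id in relevant:
            return 1.0 / rank
    return 0.0

def ndcg_at_k(retrieved, relevant, k):
    """NDCG@k with binary relevance."""
    relevant = set(relevant)
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, chunk_id in enumerate(retrieved[:k], start=1)
        if chunk_id in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0

# Made-up ranking and gold labels for a single query.
retrieved = ["c7", "c2", "c9", "c4"]
relevant = ["c2", "c4", "c5"]
r3 = recall_at_k(retrieved, relevant, 3)    # one of three relevant chunks found
rr = reciprocal_rank(retrieved, relevant)   # first relevant chunk is at rank 2
```

MRR is the mean of `reciprocal_rank` over all gold-set queries, and the reported recall@10 is likewise the mean of the per-query values.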
What to document in the README
- Why you chose the corpus
- Why you chose the chunking strategy (link to explainer §2.2-§2.5)
- Why you chose the embedding model (link to explainer §3.4)
- Top failure modes by category
- What you would add in Week 8: query rewriting, HyDE, guardrails, or agentic routing
Common pitfalls
- Chunking by raw character count without checking semantic boundaries (explainer §2.4)
- Zero overlap when answers sit at section boundaries (explainer §2.3)
- Evaluating only generation, not retrieval (explainer §5.1)
- Using a top-k that is too small and then blaming the generator (explainer §4.4)
- Forgetting to make the answer abstain when evidence is missing (explainer §4.6)
- Treating a single nice demo as proof that the system works (explainer §5.1)
Stretch goal
Add a very small answer step after retrieval. The answer prompt must:
- cite source chunks
- quote relevant lines when possible
- say "I could not find support in the retrieved context" when evidence is missing
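One way to meet these prompt requirements is sketched below. It is an illustration under assumptions, not a prescribed design: the helper name `build_answer_prompt`, the `(chunk_id, score, text)` tuple layout, and the `min_score` threshold are all hypothetical, and the threshold value is a placeholder to tune on your gold set. The sketch abstains before calling any model when no retrieved chunk clears the threshold, and otherwise instructs the model to cite, quote, and abstain.

```python
ABSTAIN = "I could not find support in the retrieved context"

def build_answer_prompt(question, hits, min_score=0.3):
    """Build a citation-oriented prompt, or return None to signal abstention.

    `hits` is a list of (chunk_id, score, text) tuples. `min_score` is an
    illustrative threshold, not a recommended value.
    """
    supported = [(cid, text) for cid, score, text in hits if score >= min_score]
    if not supported:
        return None  # caller should reply with ABSTAIN without calling a model
    context = "\n".join(f"[{cid}] {text}" for cid, text in supported)
    return (
        "Answer using only the context below.\n"
        "Cite the chunk id in brackets after each claim and quote relevant lines when possible.\n"
        f'If the context does not support an answer, reply exactly: "{ABSTAIN}"\n\n'
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
    )

# Made-up retrieval results for a single question.
prompt = build_answer_prompt(
    "How do I reset my password?",
    [("faq#3", 0.8, "Reset it from the settings page."), ("faq#9", 0.1, "Unrelated text.")],
)
```

Returning `None` keeps the abstention decision in code, where it can be tested, instead of relying only on the model to follow the instruction.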
LinkedIn post template
"Built a retrieval-first RAG baseline this week.
Corpus: [what you chose]
Chunking: [strategy]
Embeddings: [model]
Best retrieval score: recall@10 = [x], MRR = [y]
Biggest lesson: most RAG failures started before generation.
Next step: Week 8 adds reranking, query rewriting, and guardrails.
Repo: [link]"
Why this hands-on lab matters
Most weak RAG projects jump directly to chatbot polish. Strong candidates debug retrieval first. If the wrong evidence enters the context window, the prettiest prompt in the world cannot save you. That is the entire point of this week.