05. Assignment 7 — Retrieval-First RAG Baseline

Week 7. Build the retrieval layer properly before building a fancy chatbot.

Required reading first: 02_explainer.md chapters 2-5. If you cannot explain chunking, embeddings, retrieval metrics, and the pipeline failure modes from explainer §2.1-§5.6, you are not ready to ship this hands-on lab.

Goal

Build a small but defensible RAG baseline over a corpus you choose. You must demonstrate:

  • document ingestion
  • chunking
  • embedding
  • vector retrieval
  • at least one reranking or retrieval-improvement experiment
  • retrieval evaluation on a gold set

Constraints

  • You may use pgvector, Qdrant, FAISS, Chroma, or a simple in-memory baseline.
  • Do not hide the retrieval logic behind a single framework call without understanding it.
  • Keep the corpus small enough to iterate quickly: roughly 100-1000 documents.
  • Build the eval harness before polishing UI.

Suggested corpora

  • Product docs from an open-source tool you actually care about
  • Your own notes, blog posts, or markdown docs
  • Wikipedia pages inside one narrow topic
  • A mini support-doc corpus with FAQs and policies

Required architecture

Indexing: documents → chunker → embeddings → vector index

Query: user query → query embedding → retrieve top-k → (optional) rerank → answer with citations / abstain
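
As a rough illustration of how the stages above connect, here is a minimal in-memory sketch. It is not a reference implementation: `toy_embed` is a bag-of-words stand-in for a real embedding model, the "index" is a plain Python list, and every name (`Chunk`, `chunk_words`, `build_index`, `search`) is hypothetical. Swapping in a real model and a vector store should leave the overall flow the same.

```python
import math
import re
from collections import Counter
from dataclasses import dataclass


@dataclass
class Chunk:
    chunk_id: str    # "<doc_id>#<position>", a made-up ID scheme
    text: str
    vector: Counter  # sparse term counts standing in for a dense embedding


def toy_embed(text: str) -> Counter:
    """Bag-of-words term counts. A placeholder, NOT a real embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def chunk_words(text: str, size: int = 40) -> list[str]:
    """Fixed-size word windows (whitespace words, not model tokens)."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


def build_index(docs: dict[str, str], size: int = 40) -> list[Chunk]:
    """documents -> chunker -> embeddings -> index (here, a plain list)."""
    index = []
    for doc_id, text in docs.items():
        for pos, piece in enumerate(chunk_words(text, size)):
            index.append(Chunk(f"{doc_id}#{pos}", piece, toy_embed(piece)))
    return index


def search(index: list[Chunk], query: str, k: int = 3) -> list[tuple[str, float]]:
    """user query -> query embedding -> top-k chunk IDs with similarity scores."""
    q = toy_embed(query)
    scored = [(chunk.chunk_id, cosine(q, chunk.vector)) for chunk in index]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Two tiny placeholder documents, invented for this example.
docs = {
    "install": "Install the tool with pip and verify the version",
    "auth": "Rotate API keys every ninety days and store secrets in the vault",
}
index = build_index(docs)
for chunk_id, score in search(index, "how do I rotate API keys", k=2):
    print(f"{chunk_id}\t{score:.3f}")
```

Keeping each stage as its own small function makes it easier to change one thing at a time (chunker, embedding, k) when you run the experiments later in this assignment.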

Required deliverables

  1. ingest.py or notebook — load docs, chunk, embed, index
  2. search.py — query → top-k retrieval with similarity scores
  3. eval.py — retrieval metrics on a gold set
  4. gold_set.json — at least 20 query-to-relevant-chunk mappings
  5. README.md — corpus, chunk strategy, embedding model, metrics, failures, next steps
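
One possible shape for gold_set.json and the core of eval.py is sketched below. The assignment only requires query-to-relevant-chunk mappings, so the field names (`query`, `relevant_chunk_ids`), the chunk ID format, and the `run` dictionary of retrieved IDs are all assumptions made for illustration.

```python
import json

# Hypothetical gold_set.json content, inlined so the example is self-contained.
GOLD_JSON = """
[
  {"query": "how do I rotate API keys", "relevant_chunk_ids": ["auth#0", "auth#1"]},
  {"query": "install on linux", "relevant_chunk_ids": ["install#0"]}
]
"""


def recall_at_k(retrieved_ids: list[str], relevant_ids: list[str], k: int) -> float:
    """Fraction of the relevant chunks that appear in the top-k retrieved IDs."""
    relevant = set(relevant_ids)
    if not relevant:
        raise ValueError("every gold query needs at least one relevant chunk")
    hits = relevant.intersection(retrieved_ids[:k])
    return len(hits) / len(relevant)


def mean_recall_at_k(gold: list[dict], run: dict[str, list[str]], k: int) -> float:
    """Average recall@k over all gold queries; `run` maps query -> ranked chunk IDs."""
    scores = [
        recall_at_k(run[item["query"]], item["relevant_chunk_ids"], k)
        for item in gold
    ]
    return sum(scores) / len(scores)


gold = json.loads(GOLD_JSON)

# Stand-in for the output of search.py: ranked chunk IDs per query.
run = {
    "how do I rotate API keys": ["auth#0", "faq#2", "install#0", "auth#1"],
    "install on linux": ["faq#1", "install#0"],
}

for k in (1, 2, 4):
    print(f"recall@{k} = {mean_recall_at_k(gold, run, k):.2f}")
```

Storing chunk IDs rather than chunk text in the gold set keeps the file small, but the IDs go stale whenever the chunking strategy changes, so plan to remap or re-label the gold set for each chunking experiment.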

Minimum experiments

| Experiment | What to compare | Metric |
| --- | --- | --- |
| Chunk size | 256 vs 512 vs 768 tokens | recall@5 / recall@10 |
| Overlap | 0 vs 50 vs 100 tokens | recall@10 + failure notes |
| Retrieval depth | k = 3 vs 5 vs 10 | recall and noise |
| Optional reranker | off vs on | MRR / NDCG |

Run at least two of these.
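
The chunk-size and overlap experiments can share one parameterized chunker. The sketch below is one way to do it, assuming a sliding window over a token list; `chunk_tokens` is a hypothetical helper, and whitespace splitting is used here only as an approximation of a real tokenizer.

```python
from itertools import product


def chunk_tokens(tokens: list, size: int, overlap: int) -> list[list]:
    """Sliding window over a token list; consecutive chunks share `overlap` tokens."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # this window already reached the end; skip a redundant tail
    return chunks


# Whitespace "tokens" are an approximation; a model tokenizer will give different counts.
tokens = ("lorem ipsum " * 500).split()  # 1000 placeholder tokens

for size, overlap in product((256, 512, 768), (0, 50, 100)):
    chunks = chunk_tokens(tokens, size, overlap)
    # In a real run: rebuild the index from `chunks`, then record recall@5 / recall@10.
    print(f"size={size} overlap={overlap} chunks={len(chunks)}")
```

Running the full grid is optional; the point is that each configuration should rebuild the index from scratch and be scored against the same gold queries, so the numbers are comparable.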

Success criteria

  • Gold set with at least 20 queries
  • recall@10 reported clearly
  • At least one ranking-sensitive metric reported: MRR or NDCG
  • One concrete failure analysis section in the README
  • One honest statement about what the system still cannot do
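
For the ranking-sensitive metric, the sketch below shows MRR and NDCG@k under a binary-relevance assumption (a chunk is either relevant or not). The function names are hypothetical, and the code assumes each retrieved list contains unique chunk IDs.

```python
import math


def reciprocal_rank(retrieved: list[str], relevant: set[str]) -> float:
    """1 / rank of the first relevant chunk, or 0.0 if none was retrieved."""
    for rank, chunk_id in enumerate(retrieved, start=1):
        if chunk_id in relevant:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """NDCG@k with binary gains: each relevant hit at rank r adds 1 / log2(r + 1)."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, chunk_id in enumerate(retrieved[:k], start=1)
        if chunk_id in relevant
    )
    # Ideal ranking puts every relevant chunk (up to k of them) at the top.
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg else 0.0


def mean_reciprocal_rank(runs: list[tuple[list[str], set[str]]]) -> float:
    """MRR: the average reciprocal rank over (retrieved, relevant) pairs."""
    return sum(reciprocal_rank(r, rel) for r, rel in runs) / len(runs)


# Invented rankings for two queries, purely for illustration.
runs = [
    (["faq#2", "auth#0", "auth#1"], {"auth#0", "auth#1"}),  # first hit at rank 2
    (["install#0", "faq#1"], {"install#0"}),                # first hit at rank 1
]
print(f"MRR = {mean_reciprocal_rank(runs):.2f}")
print(f"NDCG@3 (query 1) = {ndcg_at_k(*runs[0], k=3):.3f}")
```

Recall@10 tells you whether the evidence was retrieved at all; MRR and NDCG tell you whether it was ranked high enough to matter, which is why the reranker experiment is scored with them.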

What to document in the README

  • Why you chose the corpus
  • Why you chose the chunking strategy (link to explainer §2.2-§2.5)
  • Why you chose the embedding model (link to explainer §3.4)
  • Top failure modes by category
  • What you would add in Week 8: query rewriting, HyDE, guardrails, or agentic routing

Common pitfalls

  • Chunking by raw character count without checking semantic boundaries (explainer §2.4)
  • Zero overlap when answers sit at section boundaries (explainer §2.3)
  • Evaluating only generation, not retrieval (explainer §5.1)
  • Using too-small top-k and blaming the generator (explainer §4.4)
  • Forgetting to make the answer abstain when evidence is missing (explainer §4.6)
  • Treating a single nice demo as proof that the system works (explainer §5.1)
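
To illustrate the first pitfall, here is a small sketch of a chunker that respects paragraph boundaries instead of cutting at a raw character count. It is one simple option, not the explainer's method: `chunk_by_paragraph` is a hypothetical helper that treats blank lines as boundaries and measures size in whitespace-separated words.

```python
import re


def chunk_by_paragraph(text: str, max_words: int = 200) -> list[str]:
    """Greedily pack whole paragraphs into chunks of at most `max_words` words.

    A single paragraph longer than `max_words` is kept whole rather than split,
    so the limit is a soft one in this sketch.
    """
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    chunks, current, count = [], [], 0
    for para in paragraphs:
        n_words = len(para.split())
        if current and count + n_words > max_words:
            chunks.append("\n\n".join(current))  # flush before exceeding the budget
            current, count = [], 0
        current.append(para)
        count += n_words
    if current:
        chunks.append("\n\n".join(current))
    return chunks


sample = "Refunds take five days.\n\nContact support by email.\n\nShipping is free over fifty dollars."
for chunk in chunk_by_paragraph(sample, max_words=8):
    print(repr(chunk))
```

Comparing this against a fixed-size chunker on the same gold set is a reasonable way to check whether boundary-aware chunking actually helps on your corpus.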

Stretch goal

Add a very small answer step after retrieval. The answer prompt must:

  • cite source chunks,
  • quote relevant lines when possible,
  • and say "I could not find support in the retrieved context" when evidence is missing.
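
A minimal sketch of such an answer step follows. It only builds the prompt and decides whether to abstain; it does not call any model. The hit format, the function name `build_answer_prompt`, and the `min_score` threshold of 0.3 are assumptions for illustration, and the threshold in particular would need tuning against your own similarity scores.

```python
ABSTAIN = "I could not find support in the retrieved context"


def build_answer_prompt(question: str, hits: list[dict], min_score: float = 0.3):
    """Return a prompt string, or None when no hit clears `min_score`.

    Each hit is assumed to look like {"chunk_id": ..., "text": ..., "score": ...}.
    """
    supported = [h for h in hits if h["score"] >= min_score]
    if not supported:
        return None  # caller should answer with ABSTAIN and skip the model call
    context = "\n".join(f"[{h['chunk_id']}] {h['text']}" for h in supported)
    return (
        "Answer the question using only the context below.\n"
        "Cite the chunk ID in square brackets after every claim.\n"
        "Quote the relevant lines when possible.\n"
        f'If the context does not support an answer, reply exactly: "{ABSTAIN}"\n\n'
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
    )


# Invented retrieval results for illustration.
hits = [
    {"chunk_id": "auth#0", "text": "Rotate API keys every ninety days.", "score": 0.62},
    {"chunk_id": "faq#2", "text": "Billing questions go to the finance team.", "score": 0.05},
]
prompt = build_answer_prompt("How often should API keys be rotated?", hits)
print(prompt if prompt is not None else ABSTAIN)
```

Abstaining in code before the model is called, and again in the prompt instructions, gives two chances to avoid an unsupported answer.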

LinkedIn post template

"Built a retrieval-first RAG baseline this week.

Corpus: [what you chose]
Chunking: [strategy]
Embeddings: [model]
Best retrieval score: recall@10 = [x], MRR = [y]

Biggest lesson: most RAG failures started before generation.

Next step: Week 8 adds reranking, query rewriting, and guardrails.

Repo: [link]"

Why this hands-on lab matters

Most weak RAG projects jump directly to chatbot polish. Strong candidates debug retrieval first. If the wrong evidence enters the context window, the prettiest prompt in the world cannot save you. That is the entire point of this week.