05. Assignment 7: Retrieval-First RAG Baseline

Week 7. Build the retrieval layer properly before building a fancy chatbot.

Required reading first: 02_explainer.md chapters 2-5. If you cannot explain chunking, embeddings, retrieval metrics, and the pipeline failure modes from explainer §2.1-§5.6, you are not ready to ship this assignment.

Goal

Build a small but defensible RAG baseline over a corpus you choose. You must demonstrate:

- document ingestion,
- chunking,
- embedding,
- vector retrieval,
- at least one reranking or retrieval-improvement experiment,
- and retrieval evaluation on a gold set.

Constraints

You may use pgvector, Qdrant, FAISS, Chroma, or a simple in-memory baseline.
Do not hide the retrieval logic behind a single framework call without understanding it.
Keep the corpus small enough to iterate quickly: roughly 100-1000 documents.
Build the eval harness before polishing UI.

Suggested corpora

Product docs from an open-source tool you actually care about
Your own notes, blog posts, or markdown docs
Wikipedia pages inside one narrow topic
A mini support-doc corpus with FAQs and policies

Required architecture

documents
  ↓
chunker
  ↓
embeddings
  ↓
vector index
  ↓
user query
  ↓
query embedding
  ↓
retrieve top-k
  ↓
(optional) rerank
  ↓
answer with citations / abstain
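
The architecture above can be sketched end to end as a simple in-memory baseline. This is a minimal illustration, not a reference implementation: `embed` is a toy bag-of-words counter standing in for a real embedding model, the "vector index" is a plain dict, and the chunk IDs (`install#0` and so on) are hypothetical. The point is that every stage of the diagram is visible as its own function.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: lowercase bag-of-words counts (stand-in for a real model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_index(chunks: dict[str, str]) -> dict[str, Counter]:
    """Embed every chunk once; the 'vector index' is just a dict here."""
    return {chunk_id: embed(text) for chunk_id, text in chunks.items()}


def retrieve(query: str, index: dict[str, Counter], k: int = 3) -> list[tuple[str, float]]:
    """Embed the query, score every chunk, return the top-k (chunk_id, score) pairs."""
    q = embed(query)
    scored = [(chunk_id, cosine(q, vec)) for chunk_id, vec in index.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Hypothetical chunks; in the assignment these come out of your chunker.
chunks = {
    "install#0": "Install the package with pip and verify the version.",
    "config#0": "Set the API key in the config file before the first run.",
    "faq#0": "Refunds are processed within five business days.",
}
index = build_index(chunks)
results = retrieve("how do I set the api key", index, k=2)
print(results)
```

Swapping `embed` for a real model and the dict for pgvector, Qdrant, FAISS, or Chroma should leave the shape of `retrieve` unchanged, which makes this a convenient sanity baseline for the eval harness.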

Required deliverables

ingest.py or notebook — load docs, chunk, embed, index
search.py — query → top-k retrieval with similarity scores
eval.py — retrieval metrics on a gold set
gold_set.json — at least 20 query-to-relevant-chunk mappings
README.md — corpus, chunk strategy, embedding model, metrics, failures, next steps
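
The assignment does not prescribe a schema for gold_set.json, so the layout below is an assumption: a list of objects with a `query` string and a `relevant_chunk_ids` list. The hypothetical `validate_gold_set` helper checks the minimum size from the deliverables and flags mappings that point at chunk IDs the index does not contain.

```python
import json

MIN_QUERIES = 20  # from the deliverables: at least 20 mappings


def validate_gold_set(entries: list[dict], known_chunk_ids: set[str]) -> list[str]:
    """Return human-readable problems; an empty list means the set looks usable."""
    problems = []
    if len(entries) < MIN_QUERIES:
        problems.append(f"only {len(entries)} queries, need at least {MIN_QUERIES}")
    for i, entry in enumerate(entries):
        query = entry.get("query", "")
        relevant = entry.get("relevant_chunk_ids", [])
        if not query.strip():
            problems.append(f"entry {i}: empty query")
        if not relevant:
            problems.append(f"entry {i}: no relevant chunks listed")
        missing = [c for c in relevant if c not in known_chunk_ids]
        if missing:
            problems.append(f"entry {i}: unknown chunk ids {missing}")
    return problems


# Assumed file layout, shown inline instead of reading gold_set.json from disk.
example = json.loads("""
[
  {"query": "How do I set the API key?", "relevant_chunk_ids": ["config#0"]},
  {"query": "How long do refunds take?", "relevant_chunk_ids": ["faq#0", "faq#7"]}
]
""")
problems = validate_gold_set(example, known_chunk_ids={"config#0", "faq#0"})
print(problems)
```

Chunk IDs usually change when you re-chunk the corpus, so a check like this is worth running before every experiment; otherwise stale mappings silently depress recall.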

Minimum experiments

Experiment	What to compare	Metric
Chunk size	256 vs 512 vs 768 tokens	recall@5 / recall@10
Overlap	0 vs 50 vs 100 tokens	recall@10 + failure notes
Retrieval depth	k = 3 vs 5 vs 10	recall@k and noise (irrelevant chunks in the top-k)
Optional reranker	off vs on	MRR / NDCG

Run at least two of these.
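
The chunk size and overlap experiments in the table need a chunker that takes both as parameters. The sketch below is one way to write a sliding-window chunker. It assumes the input is already a list of tokens; the synthetic `t0, t1, ...` tokens are placeholders, and a real run would use the tokenizer that matches your embedding model.

```python
def chunk_tokens(tokens: list[str], size: int, overlap: int) -> list[list[str]]:
    """Split tokens into windows of `size`, each sharing `overlap` tokens with the previous one."""
    if size <= 0:
        raise ValueError("size must be positive")
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # this window already reaches the end of the document
    return chunks


# Placeholder tokens standing in for a tokenized document.
tokens = [f"t{i}" for i in range(1000)]

# Sweep a few of the settings from the experiment table.
for size, overlap in [(256, 0), (256, 50), (512, 50), (512, 100)]:
    chunks = chunk_tokens(tokens, size, overlap)
    print(f"size={size} overlap={overlap} chunks={len(chunks)} last_len={len(chunks[-1])}")
```

Logging the chunk count and the length of the final chunk for each setting helps explain recall differences later, since very short trailing chunks often retrieve poorly.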

Success criteria

Gold set with at least 20 queries
recall@10 reported clearly
At least one ranking-sensitive metric reported: MRR or NDCG
One concrete failure analysis section in the README
One honest statement about what the system still cannot do
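
The metrics named in the criteria above can be computed with a few short functions. This sketch assumes binary relevance (a chunk is either in the gold set for a query or not) and unique chunk IDs in each ranked list; the function names and the two example runs are illustrative only, and the explainer may define variants you should prefer.

```python
import math


def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant chunks that appear in the top-k results."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def reciprocal_rank(retrieved: list[str], relevant: set[str]) -> float:
    """1 / rank of the first relevant result, or 0.0 if none was retrieved."""
    for rank, chunk_id in enumerate(retrieved, start=1):
        if chunk_id in relevant:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Binary-relevance NDCG: DCG of this ranking divided by DCG of an ideal ranking."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, chunk_id in enumerate(retrieved[:k], start=1)
        if chunk_id in relevant
    )
    ideal = sum(1.0 / math.log2(rank + 1) for rank in range(1, min(len(relevant), k) + 1))
    return dcg / ideal if ideal else 0.0


# Two made-up queries: (ranked chunk ids from search, gold relevant ids).
runs = [
    (["a", "b", "c", "d"], {"b", "d"}),
    (["x", "y", "z"], {"x"}),
]
k = 3
mean_recall = sum(recall_at_k(r, g, k) for r, g in runs) / len(runs)
mrr = sum(reciprocal_rank(r, g) for r, g in runs) / len(runs)
mean_ndcg = sum(ndcg_at_k(r, g, k) for r, g in runs) / len(runs)
print(f"recall@{k}={mean_recall:.3f} MRR={mrr:.3f} NDCG@{k}={mean_ndcg:.3f}")
```

Recall ignores order inside the top-k, while MRR and NDCG reward placing relevant chunks earlier, which is why the criteria ask for one of each kind.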

What to document in the README

Why you chose the corpus
Why you chose the chunking strategy (link to explainer §2.2-§2.5)
Why you chose the embedding model (link to explainer §3.4)
Top failure modes by category
What you would add in Week 8: query rewriting, HyDE, guardrails, or agentic routing

Common pitfalls

Chunking by raw character count without checking semantic boundaries (explainer §2.4)
Zero overlap when answers sit at section boundaries (explainer §2.3)
Evaluating only generation, not retrieval (explainer §5.1)
Using a top-k that is too small and then blaming the generator (explainer §4.4)
Forgetting to make the answer abstain when evidence is missing (explainer §4.6)
Treating a single nice demo as proof that the system works (explainer §5.1)

Stretch goal

Add a very small answer step after retrieval. The answer prompt must:

- cite source chunks,
- quote relevant lines when possible,
- and say "I could not find support in the retrieved context" when evidence is missing.
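
One possible shape for that answer step is sketched below. Several things here are assumptions rather than requirements: the `min_score` cutoff of 0.2 is an arbitrary placeholder you would tune on your gold set, `generate` is a hypothetical callable wrapping whichever LLM you use, and the prompt wording is only an example. The sketch abstains in code when no chunk clears the threshold and also instructs the model to abstain when the context is insufficient.

```python
from typing import Callable, Optional

ABSTAIN = "I could not find support in the retrieved context"

# Each retrieved item is (chunk_id, chunk_text, similarity_score).
Retrieved = list[tuple[str, str, float]]


def build_answer_prompt(question: str, retrieved: Retrieved, min_score: float = 0.2) -> Optional[str]:
    """Return a prompt string, or None when no chunk clears the score threshold."""
    supported = [(cid, text) for cid, text, score in retrieved if score >= min_score]
    if not supported:
        return None
    context = "\n".join(f"[{cid}] {text}" for cid, text in supported)
    return (
        "Answer using only the context below.\n"
        "Cite the chunk id in square brackets after each claim.\n"
        "Quote the relevant lines when possible.\n"
        f'If the context does not support an answer, reply exactly: "{ABSTAIN}"\n\n'
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
    )


def answer(question: str, retrieved: Retrieved, generate: Callable[[str], str],
           min_score: float = 0.2) -> str:
    """Abstain without calling the model when retrieval found nothing usable."""
    prompt = build_answer_prompt(question, retrieved, min_score)
    if prompt is None:
        return ABSTAIN
    return generate(prompt)


# Usage with a stub generator, so the control flow can be tested without an LLM.
weak = [("faq#0", "Refunds are processed within five business days.", 0.05)]
print(answer("How do I set the API key?", weak, generate=lambda p: "unused"))
```

Keeping the abstain path outside the model call means the behavior can be unit-tested in the eval harness, independent of how well the generator follows instructions.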

LinkedIn post template

"Built a retrieval-first RAG baseline this week.

Corpus: [what you chose]
Chunking: [strategy]
Embeddings: [model]
Best retrieval score: recall@10 = [x], MRR = [y]

Biggest lesson: most RAG failures started before generation.

Next step: Week 8 adds reranking, query rewriting, and guardrails.

Repo: [link]"

Why this assignment matters

Most weak RAG projects jump directly to chatbot polish. Strong candidates debug retrieval first. If the wrong evidence enters the context window, the prettiest prompt in the world cannot save you. That is the entire point of this week.