RAG Hardening: Analysis
What "hardening" means here
The starter was a happy-path RAG: retrieve → generate → return. Production RAG fails in specific ways that need explicit defences. This implementation adds:
- Refusal on empty retrieval. No relevant docs → don't generate from thin air; return a clean refusal.
- Citation enforcement. The answer must carry at least one source ID; the test checks that the marker is present.
- Per-component error capture. Retriever or generator can throw; the response still completes with an error field rather than crashing the request.
- Structured result. RAGResult includes latency, source list, refused flag, citation flag, error: enough to audit the request.
- Configurable thresholds. RAGConfig controls top_k, refusal behaviour, citation requirements.
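The two types named in the list above might look roughly like this. It is a minimal sketch: the source only says what each type carries, so the exact field names, types and defaults (`refuse_on_empty`, `require_citation`, `latency_ms` and so on) are assumptions.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(frozen=True)
class RAGConfig:
    # Field names and defaults are illustrative assumptions.
    top_k: int = 3                  # how many documents go to the generator
    refuse_on_empty: bool = True    # refuse instead of generating with no context
    require_citation: bool = True   # flag answers that cite no source ID


@dataclass
class RAGResult:
    answer: str
    sources: List[str] = field(default_factory=list)  # IDs of retrieved docs
    latency_ms: float = 0.0
    refused: bool = False
    has_citation: bool = False
    error: Optional[str] = None     # set when a component raised


# A refusal is an ordinary result value, not an exception.
refusal = RAGResult(answer="I don't know.", refused=True)
```

Returning a refusal as a normal `RAGResult` means callers inspect one type for every outcome: answered, refused, or failed.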
Why each defence matters
Refusal on empty retrieval. Generators given no context tend to hallucinate. They have parametric knowledge and will use it. A clean refusal forces the user to refine the query or accept "I don't know" — both safer than a confidently wrong answer.
Citation enforcement. Citations make the answer auditable. If the user disagrees with the claim, they can open the source and verify. Without citations, the answer is a black box. The structural check has_citation() proves at least one source ID appears in the answer text.
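A structural check of this kind could be as small as the sketch below. The `has_citation` name comes from the text; its signature and the bracketed marker format (`[doc-1]`) are assumptions made for illustration.

```python
from typing import Iterable


def has_citation(answer: str, source_ids: Iterable[str]) -> bool:
    """Return True if at least one retrieved source ID is cited in the answer.

    Assumes citations are written as the source ID in square brackets,
    e.g. "[doc-1]". The marker format is an assumption of this sketch.
    """
    return any(f"[{sid}]" in answer for sid in source_ids)


cited = has_citation("Paris is the capital of France [doc-1].", ["doc-1", "doc-2"])
uncited = has_citation("Paris is the capital of France.", ["doc-1", "doc-2"])
```

Note what this does and does not establish: it confirms a marker for a retrieved source is present, not that the cited source supports the sentence it is attached to.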
Per-component error capture. A real RAG has many failure modes: the vector DB times out, the embedding service errors, the generator's API rate-limits, the network drops. Each can fail independently. Capturing per-component errors lets the caller decide whether to retry, escalate, or surface the failure to the user.
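One way to capture errors per component is to wrap each stage in its own `try` block and label the failure with the stage name. The function below is a sketch under that assumption; `run_pipeline`, the `(answer, error)` return shape and the error string format are hypothetical, not the implementation's actual API.

```python
from typing import Callable, List, Optional, Tuple


def run_pipeline(
    query: str,
    retriever: Callable[[str], List[str]],
    generator: Callable[[str, List[str]], str],
) -> Tuple[Optional[str], Optional[str]]:
    """Return (answer, error): error is None on success, answer is None on failure."""
    try:
        docs = retriever(query)
    except Exception as exc:  # broad on purpose: any retriever failure is captured
        return None, f"retriever: {type(exc).__name__}: {exc}"
    try:
        answer = generator(query, docs)
    except Exception as exc:  # generator failures are labelled separately
        return None, f"generator: {type(exc).__name__}: {exc}"
    return answer, None


def flaky_retriever(query: str) -> List[str]:
    # Stand-in for a vector DB call that times out.
    raise TimeoutError("vector DB timed out")


answer, error = run_pipeline("q", flaky_retriever, lambda q, docs: "unused")
```

Because the error names the failing stage, the caller can apply different policies, for example retrying a retriever timeout while surfacing a generator failure to the user.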
Latency tracking. RAG latency = retrieval + generation. Both contribute. Tracking per-request latency shows when SLOs slip; timing the two stages separately shows which component is responsible.
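A sketch of stage-level timing using `time.perf_counter`, the standard-library clock intended for measuring short durations. The `timed_rag` helper and the keys of the returned dictionary are assumptions for illustration.

```python
import time
from typing import Callable, Dict, List, Tuple


def timed_rag(
    query: str,
    retriever: Callable[[str], List[str]],
    generator: Callable[[str, List[str]], str],
) -> Tuple[str, Dict[str, float]]:
    """Run both stages and report how long each took, in milliseconds."""
    t0 = time.perf_counter()
    docs = retriever(query)
    t1 = time.perf_counter()
    answer = generator(query, docs)
    t2 = time.perf_counter()
    timings = {
        "retrieval_ms": (t1 - t0) * 1000.0,
        "generation_ms": (t2 - t1) * 1000.0,
        "total_ms": (t2 - t0) * 1000.0,
    }
    return answer, timings


answer, timings = timed_rag(
    "what is RAG?",
    lambda q: ["doc-1"],
    lambda q, docs: f"Answer based on {docs[0]}",
)
```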
What this version still doesn't do
- Real embedding-based retrieval. Keyword overlap is the placeholder; production uses dense vectors via OpenAI's embedding API or sentence-transformers. The interface stays the same; the retriever function is the swap point.
- Reranking. First-pass retrieval returns N candidates; a cross-encoder reranker scores them; top K go to the generator. Cuts noise dramatically.
- Hybrid search. Combine keyword and vector search; reciprocal rank fusion. Stronger than either alone.
- Caching. Identical queries should hit a cache before the retriever runs.
- Streaming generation. For UX, generators should stream tokens; this implementation returns the whole answer.
- Citation verification against document content. Stronger citation enforcement would parse claims out of the answer, match each against the retrieved docs, and reject the answer if any claim is unsupported.
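Of the gaps listed above, hybrid search is the easiest to illustrate in isolation. Reciprocal rank fusion scores each document as the sum of 1 / (k + rank) over every ranking it appears in, so it needs only the rank positions, not comparable scores. The sketch below assumes each retriever returns an ordered list of document IDs; the function name and the tie-break by ID are illustrative choices, and k = 60 is the commonly used default.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def reciprocal_rank_fusion(rankings: Sequence[Sequence[str]], k: int = 60) -> List[str]:
    """Fuse several ranked lists of doc IDs into one list, best first."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):  # ranks are 1-based
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by doc ID for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


keyword_hits = ["a", "b", "c"]   # e.g. from keyword overlap
vector_hits = ["b", "d", "a"]    # e.g. from dense retrieval
fused = reciprocal_rank_fusion([keyword_hits, vector_hits])
# fused == ["b", "a", "d", "c"]
```

Documents found by both retrievers ("a" and "b") rise above documents found by only one, which is the behaviour that makes the combination stronger than either list alone.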
The refusal pattern
The most under-implemented RAG behaviour. The starter generates an answer even from empty retrieval; the fixed version refuses. This single change has a large impact on perceived quality.
When to refuse:
- Retrieval returned nothing. Don't generate.
- Retrieval returned only low-relevance results. Every result scores below a configurable threshold.
- The generator itself signals uncertainty. "I'm not sure" responses can be detected and treated as refusals.
Refusal is the structural defence against hallucination on out-of-distribution questions. A generator that refuses gracefully, paired with a retriever that can report "no good matches", avoids the worst RAG failure mode: a confident answer with nothing behind it.
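The three refusal conditions above can be folded into one decision function. This is a sketch under stated assumptions: retrieval is assumed to return `(doc_id, score)` pairs with higher scores meaning more relevant, and the uncertainty phrases, reason strings and `refusal_reason` name are all hypothetical.

```python
from typing import List, Optional, Tuple

# Illustrative phrases only; a real list would be tuned to the generator.
UNCERTAINTY_MARKERS = ("i'm not sure", "i am not sure", "i don't know")


def refusal_reason(
    scored_docs: List[Tuple[str, float]],
    min_score: float,
    draft_answer: Optional[str] = None,
) -> Optional[str]:
    """Return why the request should be refused, or None to proceed."""
    if not scored_docs:
        return "no_results"
    if max(score for _, score in scored_docs) < min_score:
        return "low_relevance"
    if draft_answer is not None:
        lowered = draft_answer.lower()
        if any(marker in lowered for marker in UNCERTAINTY_MARKERS):
            return "generator_uncertain"
    return None


before = refusal_reason([("doc-1", 0.12)], min_score=0.5)
after = refusal_reason([("doc-1", 0.9)], min_score=0.5, draft_answer="I'm not sure.")
```

The function can be called twice per request: once before generation with `draft_answer=None`, so that no generator call is spent on empty or weak retrieval, and once afterwards on the draft.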
What this exercise teaches
- RAG hardening is mostly about handling the unhappy paths.
- The structural defences (refuse, cite, capture errors) are 30 lines of code that change the system's behaviour qualitatively.
- A structured result type is the audit surface; without it, every caller reinvents error handling.
- Citation enforcement is the simplest hallucination defence.
Interview probes
- "What does your RAG do when retrieval returns no relevant documents?"
- "How do you enforce that the answer cites its sources?"
- "What is the difference between keyword and vector retrieval?"
- "How does reranking improve RAG quality?"
- "What is the structural defence against a flaky retrieval service?"
- "Walk through hybrid search vs. pure dense retrieval."
Each has a one-paragraph answer drawn from the hardening choices in this implementation and from the gaps it still leaves open.