RAG Hardening — Analysis¶

What "hardening" means here¶

The starter was a happy-path RAG: retrieve → generate → return. Production RAG fails in specific ways that need explicit defences. This implementation adds:

Refusal on empty retrieval. No relevant docs → don't generate from thin air; return a clean refusal.
Citation enforcement. Every claim is backed by a source ID in the answer; the test checks the marker is present.
Per-component error capture. Retriever or generator can throw; the response still completes with an error field rather than crashing the request.
Structured result. RAGResult includes latency, source list, refused flag, citation flag, error — enough to audit the request.
Configurable thresholds. RAGConfig controls top_k, refusal behaviour, citation requirements.

Why each defence matters¶

Refusal on empty retrieval. Generators given no context tend to hallucinate. They have parametric knowledge and will use it. A clean refusal forces the user to refine the query or accept "I don't know" — both safer than a confidently wrong answer.

Citation enforcement. Citations make the answer auditable. If the user disagrees with the claim, they can open the source and verify. Without citations, the answer is a black box. The structural check has_citation() proves at least one source ID appears in the answer text.

Per-component error capture. A real RAG has many failure modes: the vector DB times out, the embedding service errors, the generator's API rate-limits, the network drops. Each can fail independently. Capturing per-component errors lets the caller decide whether to retry, escalate, or surface the failure to the user.

Latency tracking. RAG latency = retrieval + generation. Both contribute. Tracking per-request latency surfaces which component is slow when SLOs slip.

What this version still doesn't do¶

Real embedding-based retrieval. Keyword overlap is the placeholder; production uses dense vectors via OpenAI's embedding API or sentence-transformers. The interface stays the same; the retriever function is the swap point.
Reranking. First-pass retrieval returns N candidates; a cross-encoder reranker scores them; top K go to the generator. Cuts noise dramatically.
Hybrid search. Combine keyword and vector search; reciprocal rank fusion. Stronger than either alone.
Caching. Identical queries should hit a cache before the retriever runs.
Streaming generation. For UX, generators should stream tokens; this implementation returns the whole answer.
Citation verification at the document level. Strong citation enforcement: parse claims out of the answer; match each against retrieved docs; reject the answer if any claim is unsupported.

The refusal pattern¶

The most under-implemented RAG behaviour. The starter generates an answer even from empty retrieval; the fixed version refuses. This single change has a large impact on perceived quality.

When to refuse:

Retrieval returned nothing. Don't generate.
Retrieval returned only low-relevance results. Below a threshold; configurable.
The generator itself signals uncertainty. "I'm not sure" responses can be detected and confirmed.

Refusal is the structural defence against hallucination on out-of-distribution questions. Models trained to refuse gracefully are paired with retrieval systems trained to surface "no good matches" — together they avoid the worst RAG failure mode.

What this exercise teaches¶

RAG hardening is mostly about handling the unhappy paths.
The structural defences (refuse, cite, capture errors) are 30 lines of code that change the system's behaviour qualitatively.
A structured result type is the audit surface; without it, every caller reinvents error handling.
Citation enforcement is the simplest hallucination defence.

Interview probes¶

"What does your RAG do when retrieval returns no relevant documents?"
"How do you enforce that the answer cites its sources?"
"What is the difference between keyword and vector retrieval?"
"How does reranking improve RAG quality?"
"What is the structural defence against a flaky retrieval service?"
"Walk through hybrid search vs. pure dense retrieval."

Each has a one-paragraph answer drawn from the hardening choices in this implementation.