Skip to content

RAG Hardening — Analysis

What "hardening" means here

The starter was a happy-path RAG: retrieve → generate → return. Production RAG fails in specific ways that need explicit defences. This implementation adds:

  1. Refusal on empty retrieval. No relevant docs → don't generate from thin air; return a clean refusal.
  2. Citation enforcement. Every claim is backed by a source ID in the answer; the test checks the marker is present.
  3. Per-component error capture. Retriever or generator can throw; the response still completes with an error field rather than crashing the request.
  4. Structured result. RAGResult includes latency, source list, refused flag, citation flag, error — enough to audit the request.
  5. Configurable thresholds. RAGConfig controls top_k, refusal behaviour, citation requirements.

Why each defence matters

Refusal on empty retrieval. Generators given no context tend to hallucinate. They have parametric knowledge and will use it. A clean refusal forces the user to refine the query or accept "I don't know" — both safer than a confidently wrong answer.

Citation enforcement. Citations make the answer auditable. If the user disagrees with the claim, they can open the source and verify. Without citations, the answer is a black box. The structural check has_citation() proves at least one source ID appears in the answer text.

Per-component error capture. A real RAG has many failure modes: the vector DB times out, the embedding service errors, the generator's API rate-limits, the network drops. Each can fail independently. Capturing per-component errors lets the caller decide whether to retry, escalate, or surface the failure to the user.

Latency tracking. RAG latency = retrieval + generation. Both contribute. Tracking per-request latency surfaces which component is slow when SLOs slip.

What this version still doesn't do

  • Real embedding-based retrieval. Keyword overlap is the placeholder; production uses dense vectors via OpenAI's embedding API or sentence-transformers. The interface stays the same; the retriever function is the swap point.
  • Reranking. First-pass retrieval returns N candidates; a cross-encoder reranker scores them; top K go to the generator. Cuts noise dramatically.
  • Hybrid search. Combine keyword and vector search; reciprocal rank fusion. Stronger than either alone.
  • Caching. Identical queries should hit a cache before the retriever runs.
  • Streaming generation. For UX, generators should stream tokens; this implementation returns the whole answer.
  • Citation verification at the document level. Strong citation enforcement: parse claims out of the answer; match each against retrieved docs; reject the answer if any claim is unsupported.

The refusal pattern

The most under-implemented RAG behaviour. The starter generates an answer even from empty retrieval; the fixed version refuses. This single change has a large impact on perceived quality.

When to refuse:

  • Retrieval returned nothing. Don't generate.
  • Retrieval returned only low-relevance results. Below a threshold; configurable.
  • The generator itself signals uncertainty. "I'm not sure" responses can be detected and confirmed.

Refusal is the structural defence against hallucination on out-of-distribution questions. Models trained to refuse gracefully are paired with retrieval systems trained to surface "no good matches" — together they avoid the worst RAG failure mode.

What this exercise teaches

  • RAG hardening is mostly about handling the unhappy paths.
  • The structural defences (refuse, cite, capture errors) are 30 lines of code that change the system's behaviour qualitatively.
  • A structured result type is the audit surface; without it, every caller reinvents error handling.
  • Citation enforcement is the simplest hallucination defence.

Interview probes

  • "What does your RAG do when retrieval returns no relevant documents?"
  • "How do you enforce that the answer cites its sources?"
  • "What is the difference between keyword and vector retrieval?"
  • "How does reranking improve RAG quality?"
  • "What is the structural defence against a flaky retrieval service?"
  • "Walk through hybrid search vs. pure dense retrieval."

Each has a one-paragraph answer drawn from the hardening choices in this implementation.