08. Hallucination detection: arrival customs checks whether the declared facts match the manifest

~15 min read. A polite answer with fake grounding still means the safety layer is broken.

Builds on the ELI5 in 00-eli5.md. Arrival customs, the checkpoint that inspects what is leaving the model, must verify that the answer matches the evidence, not just that it sounds good.

Hallucination in production means unsupported claims

In a real product, hallucination is not a philosophical mystery. It usually means the system stated a fact that is unsupported, invented, or contradicted by its available evidence.

That matters even when the answer is harmless in tone. A fake policy citation, invented court case, wrong medical dose, or nonexistent internal feature can break trust quickly. The arrival customs layer should not release claims just because they are fluent.

question + evidence
       │
       ▼
     model
       │
       ▼
candidate claim set
       │
       ├── grounded? yes ──→ release
       └── grounded? no  ──→ revise, abstain, or refuse

See the important shift. We are not asking, "Is the sentence well written?" We are asking, "Can the system justify each important claim?"

Groundedness checks start with claim extraction

A long answer may contain many facts. We need to inspect them in smaller pieces. One practical approach is claim extraction. Break the answer into atomic statements. Then compare each statement to available sources.

Example answer: "Enterprise refunds require CFO approval, must be requested within ten business days, and exclude custom contracts."

That becomes three claims. Each claim can then be checked against policy text or retrieved evidence. This is much easier than scoring the whole paragraph as one fuzzy object.

The passport desk for facts is really a claim checker. Extract, compare, decide. Simple, no?
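The split in the example above can be done mechanically. Here is a minimal sketch, assuming the answer has the simple shape "subject, then comma-separated predicates". The function name split_compound_claim and the explicit subject argument are illustrative assumptions, not part of any library. Real answers are rarely this regular, so production systems usually lean on a parser or an LLM prompt for this step.

```python
import re


def split_compound_claim(sentence: str, subject: str) -> list[str]:
    """Naively split 'SUBJECT a, b, and c.' into one claim per predicate."""
    body = sentence.strip().rstrip(".")
    if not body.startswith(subject):
        raise ValueError("sentence does not start with the given subject")
    # Split the predicate list on commas, swallowing an optional 'and'.
    predicates = re.split(r",\s*(?:and\s+)?", body[len(subject):].strip())
    return [f"{subject} {p.strip()}." for p in predicates if p.strip()]


answer = (
    "Enterprise refunds require CFO approval, must be requested within "
    "ten business days, and exclude custom contracts."
)
for claim in split_compound_claim(answer, "Enterprise refunds"):
    print(claim)
# Enterprise refunds require CFO approval.
# Enterprise refunds must be requested within ten business days.
# Enterprise refunds exclude custom contracts.
```

Each printed line is now small enough to compare against one policy sentence at a time.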

Worked example: citation exists, support does not

Suppose a RAG assistant answers this.

"Policy section 4.2 says enterprise customers always receive automatic refunds after any outage longer than two hours."

The system even cites policy_v3.pdf, page 8. Looks impressive. Now inspect the source. The actual page says, "Credits may be issued at finance discretion for outages materially affecting service, excluding custom contracts." That is not the same claim.

answer claim
"automatic refunds after any outage > 2 hours"
        │
        ▼
cited source
"credits may be issued at finance discretion"
        │
        ▼
result: contradiction / unsupported extrapolation

So what to do? Mark the answer ungrounded. Ask the model to regenerate using only exact supported wording. If it cannot, send it to the no-fly desk for abstention.

Notice the pattern. Citation presence is weaker than citation support. A fabricated citation is bad. A real citation attached to the wrong claim is also bad.
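The gap between "citation exists" and "citation supports" can be shown in a few lines. This is a sketch under assumptions: PAGES is a hypothetical in-memory page store, and the model is assumed to return the wording it says it relied on. The page text is the one quoted in the worked example.

```python
import re

# Hypothetical page store keyed by (document, page).
PAGES = {
    ("policy_v3.pdf", 8): (
        "Credits may be issued at finance discretion for outages "
        "materially affecting service, excluding custom contracts."
    ),
}


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so formatting differences do not matter."""
    return re.sub(r"\s+", " ", text.lower()).strip()


def citation_exists(doc: str, page: int) -> bool:
    """Weak check: the cited document and page are in the store."""
    return (doc, page) in PAGES


def quote_on_page(doc: str, page: int, quote: str) -> bool:
    """Stronger check: the quoted wording really appears on the cited page."""
    page_text = PAGES.get((doc, page))
    return page_text is not None and normalize(quote) in normalize(page_text)


claimed_quote = "automatic refunds after any outage longer than two hours"
print(citation_exists("policy_v3.pdf", 8))               # True
print(quote_on_page("policy_v3.pdf", 8, claimed_quote))  # False
```

The first check passes and the second fails, which is exactly the failure in the example. Even a passing quote check only proves the wording is on the page. Whether that wording supports the claim still needs an alignment check.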

Practical detection methods

Method one: retrieval overlap checks. Does the answer reuse facts or spans actually present in retrieved evidence? This is cheap but weak. It can miss paraphrases and accept unsupported wording that overlaps superficially.
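To see why overlap is cheap but weak, here is a minimal sketch of a token-overlap score. The scoring rule (share of claim tokens found in the evidence) and the function names are assumptions chosen for illustration. The two example sentence pairs are invented.

```python
import re


def tokens(text: str) -> set[str]:
    """Lowercased alphanumeric tokens, as a set."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def overlap_score(claim: str, evidence: str) -> float:
    """Fraction of the claim's tokens that also appear in the evidence."""
    claim_tokens = tokens(claim)
    if not claim_tokens:
        return 0.0
    return len(claim_tokens & tokens(evidence)) / len(claim_tokens)


# Contradicted claim, yet every claim token is present in the evidence.
print(overlap_score(
    "Refunds are automatic for enterprise customers",
    "Refunds are not automatic for enterprise customers",
))  # 1.0

# Faithful paraphrase, yet no token is shared.
print(overlap_score(
    "Money is returned without review",
    "Refunds are issued automatically",
))  # 0.0
```

A single "not" flips the meaning without moving the score, and a clean paraphrase scores zero. Treat overlap as a first filter, not a verdict.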

Method two: citation verification. Require the answer to map each important claim to one or more cited spans, then verify the span exists and is semantically aligned.

Method three: NLI-style entailment. Use a verifier model to ask whether source text entails, contradicts, or does not support the claim. This is stronger for paraphrase-heavy answers. It is also slower and not perfect.
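The three-way verdict can be wired in without committing to a particular model. This is a sketch, assuming the verifier is any callable that takes (source_text, claim) and returns a label. canned_verifier below is a hand-written stand-in so the example runs; it is not a real NLI model, and the claims it knows about are invented.

```python
from enum import Enum
from typing import Callable


class Label(Enum):
    ENTAILS = "entails"
    CONTRADICTS = "contradicts"
    NOT_SUPPORTED = "not_supported"


# A verifier takes (source_text, claim) and returns a Label.
Verifier = Callable[[str, str], Label]


def label_claims(claims: list[str], source: str, verifier: Verifier) -> dict[str, Label]:
    """Run the verifier once per claim against the same source text."""
    return {claim: verifier(source, claim) for claim in claims}


def is_grounded(labels: dict[str, Label]) -> bool:
    """True only if every claim is entailed (also True when there are no claims)."""
    return all(label is Label.ENTAILS for label in labels.values())


def canned_verifier(source: str, claim: str) -> Label:
    """Stand-in for a real NLI model call: looks up hand-written labels."""
    canned = {
        "Credits are at finance discretion.": Label.ENTAILS,
        "Refunds are automatic after any outage.": Label.CONTRADICTS,
    }
    return canned.get(claim, Label.NOT_SUPPORTED)


source = (
    "Credits may be issued at finance discretion for outages "
    "materially affecting service, excluding custom contracts."
)
labels = label_claims(
    ["Credits are at finance discretion.", "Refunds are automatic after any outage."],
    source,
    canned_verifier,
)
print(is_grounded(labels))  # False
```

Keeping the verifier behind a plain callable makes it easy to swap models later and to test the release logic without one.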

Method four: tool authority checks. If a tool returned a canonical number or status, compare the answer directly against that field. Deterministic comparisons beat fuzzy model judgments when possible.
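A deterministic comparison can be as plain as field-by-field equality. This is a sketch, assuming the answer's claims have already been extracted into the same field names the tool returns. The field names refund_eligible and refund_window_days are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FieldCheck:
    field: str
    claimed: object
    canonical: object  # None when the tool returned no such field

    @property
    def ok(self) -> bool:
        return self.claimed == self.canonical


def check_against_tool(claimed_fields: dict, tool_result: dict) -> list[FieldCheck]:
    """Compare every claimed field to the tool's canonical value."""
    return [
        FieldCheck(name, value, tool_result.get(name))
        for name, value in claimed_fields.items()
    ]


tool_result = {"refund_eligible": False, "refund_window_days": 10}
claimed = {"refund_eligible": True, "refund_window_days": 10}

failures = [c.field for c in check_against_tool(claimed, tool_result) if not c.ok]
print(failures)  # ['refund_eligible']
```

No model judgment is involved here. A claimed field that the tool never returned also fails, because its canonical value is missing.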

A compact comparison looks like this.

cheap checks                  stronger checks
├── token overlap             ├── claim-to-citation match
├── citation exists           ├── entailment / contradiction
└── source mention present    └── exact tool-result comparison

Use the strongest deterministic method available. Use model-based verification where deterministic checks end.

Hallucination detection needs abstention paths

Now what is the operational mistake? Teams build a detector but forget the fallback. The checker says, "unsupported." Then what?

You need explicit outcomes. Revise the answer with tighter prompting. Ask for clarification. Return cited excerpts only. Or abstain with a short statement that the evidence is insufficient. The no-fly desk must have authority to stop release.

Example fallback reply: "I could not verify that claim from the retrieved policy text. I can quote the relevant section directly if you want."

That is honest and still useful. Better than pretending certainty.
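The "then what?" step can be made explicit in code. This is a minimal sketch, assuming a revise-then-abstain policy with a fixed revision budget. The budget of two revisions and the names Outcome and decide are illustrative assumptions. The fallback text is the example reply above.

```python
from enum import Enum


class Outcome(Enum):
    RELEASE = "release"
    REVISE = "revise"
    ABSTAIN = "abstain"


FALLBACK_REPLY = (
    "I could not verify that claim from the retrieved policy text. "
    "I can quote the relevant section directly if you want."
)


def decide(unsupported_claims: list[str], revisions_used: int, max_revisions: int = 2) -> Outcome:
    """Pick an outcome from the checker's result and the revision budget."""
    if not unsupported_claims:
        return Outcome.RELEASE
    if revisions_used < max_revisions:
        return Outcome.REVISE  # regenerate with tighter prompting
    return Outcome.ABSTAIN     # budget spent: stop release, send the fallback


print(decide([], revisions_used=0))                     # Outcome.RELEASE
print(decide(["automatic refunds"], revisions_used=0))  # Outcome.REVISE
print(decide(["automatic refunds"], revisions_used=2))  # Outcome.ABSTAIN
```

The point of the budget is that revision cannot loop forever. Once it is spent, the no-fly desk wins and the user gets FALLBACK_REPLY instead of a weak answer.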

Limits of hallucination detectors

Be honest. Groundedness checks are not perfect. Retrieval may be incomplete. Citations may point to stale documents. NLI models can be fooled by subtle wording. Claim extraction may miss implied facts.

So what to do? Combine methods. Use deterministic checks for tool outputs and structured facts. Use citation matching for text-heavy answers. Add sampling review for high-risk surfaces. Track bypasses in the control tower.

Look. Hallucination detection is not about eliminating all uncertainty. It is about forcing unsupported claims to work much harder before they escape arrival customs.

Where this lives in the wild

Perplexity-style answer engines — retrieval quality engineer: need citation verification so linked sources actually support the summarized claim.
Enterprise policy bots — knowledge systems lead: must check that every quoted refund or HR rule is grounded in the current document set.
Legal research assistants — litigation tech architect: need claim-level support checks because one fabricated precedent can poison the whole draft.
Clinical copilots — medical safety engineer: verify doses and contraindications against structured tools or trusted references before release.
Support RAG systems — QA analyst: compare generated promises against policy snippets and tool-returned eligibility fields.

Pause and recall

In production terms, what does hallucination usually mean?
Why is claim extraction helpful before verification?
Why is citation presence weaker than citation support?
What fallback actions should exist after an unsupported-claim detection?

Interview Q&A

Q: Why verify claim-to-source support instead of only checking that a citation exists?
A: Because a real citation can still be attached to an exaggerated, implied, or contradictory claim, so source presence alone is not grounding.
Common wrong answer to avoid: "Because citations are mostly a UI feature and do not matter operationally."

Q: Why prefer deterministic comparisons over model-based hallucination scoring when possible?
A: Because canonical tool fields and exact policy spans give lower-variance checks than another probabilistic model judging the first one.
Common wrong answer to avoid: "Because verifier models are always less accurate than generators."

Q: Why should unsupported-claim detection feed abstention logic rather than silent rewriting alone?
A: Because repeated silent rewriting can still release a weak answer, while abstention preserves honesty when evidence remains insufficient.
Common wrong answer to avoid: "Because users dislike clarifying questions in every case."

Q: Why is hallucination detection still needed in RAG systems?
A: Because retrieval supplies evidence but does not guarantee the model will quote, combine, or interpret that evidence faithfully.
Common wrong answer to avoid: "Because RAG eliminates unsupported claims automatically once citations appear."

Apply now (5 min)

Exercise. Write one short answer with three factual claims about a fake company policy. Then write one source snippet that supports only two of those claims. Mark which claim should fail arrival customs.

Sketch from memory. Draw the path: answer → claim extraction → source comparison → pass, revise, or abstain. Add one note on where the no-fly desk steps in.

Bridge. Fact checking improves trust. But even truthful systems can be abused through sheer volume, scraping, and cost attacks. Next we move to the control tower. → 09-rate-limiting-abuse.md