13. Faithfulness and RAGAS — measuring whether the answer obeys the evidence

~12 min read. Retrieval scored a perfect Recall@5. The chatbot still invented a fact. By the end of this page you will know exactly how to catch that, and which metric catches what.

Builds on the ELI5 in 00-eli5.md. The answer brief told the writer to stay on the desk. Faithfulness audits whether the writer obeyed.

Previous chapter: 12-retrieval-metrics.md — recall, MRR, NDCG for retrieval.

The opening trap — perfect retrieval, wrong answer

A medical assistant is asked, "Is paracetamol safe with my blood thinner?"

The retriever pulls three chunks. All three are correct. All three mention the drug interaction. Recall@5 is 1.0. MRR is 1.0. The dashboards are green.

The generated answer says, "Paracetamol is safe with warfarin at doses up to 4 grams per day. Studies show no interaction below this threshold."

Read it again. The first half is partially supported. The second half — "studies show no interaction below this threshold" — is invented. The retrieved chunks never said that. The model added a confident-sounding citation that does not exist.

This is the gap retrieval metrics cannot see. The librarian did her job. The writer cheated. We need a second report card. That is what this chapter builds.

The two report cards — librarian vs writer

Every RAG system has two workers. Two graders.

┌──────────────────────────────────────────────────────────┐
│              RAG quality has two layers                  │
├────────────────────────────┬─────────────────────────────┤
│  LIBRARIAN (retrieval)     │  WRITER (generation)        │
│                            │                             │
│  Did we fetch the right    │  Did the answer use what    │
│  chunks?                   │  was fetched?               │
│                            │                             │
│  Metrics:                  │  Metrics:                   │
│    Recall@k, MRR, NDCG,    │    Faithfulness,            │
│    Context Precision,      │    Answer Relevance,        │
│    Context Recall          │    Answer Correctness       │
└────────────────────────────┴─────────────────────────────┘

Grade only the librarian, you ship hallucinations. Grade only the writer, you cannot tell whether the bug is upstream. Real teams measure both. Always.

Mini-FAQ. "Why is context precision in the librarian column when it is scored the writer's way?" It is the bridge. RAGAS computes context precision from the question and the chunks, but it scores each chunk by how useful it is for answering. So it sits in retrieval but is judged the writer's way: claim-grounded. You will see this when we expand each metric below.

The running example — one query traced through

Same shape as the medical answer above. Here is the full setup we will score, claim by claim.

Query

What is the refund window for enterprise customers?

Retrieved context (3 chunks reached the reading desk)

[C1] refund_policy.md, p.2
     Enterprise annual plans may request a refund
     within 30 days of renewal. Requests must come
     from the billing admin on the account.

[C2] billing_faq.md, p.5
     Refunds for orders past 30 days require manager
     approval and are issued as credit, not cash.

[C3] support_runbook.md, p.11
     Customer Support is responsible for verifying
     refund eligibility before any approval workflow.

Generated answer

Enterprise customers can request a refund within 30 days of renewal. The request must come from the billing admin. Support verifies eligibility. Refund approvals typically complete within 24 hours.

Read it. Sounds clean. Sounds confident. But one of those four sentences was never in the context. That is the sentence faithfulness exists to catch.

Faithfulness — every claim must trace to a chunk

Faithfulness is one number with one rule: take the answer, split it into atomic claims, check each claim against the retrieved context.

generated answer
      │
      ▼ (decompose)
┌──────────────────────────────────────────────────────┐
│  Claim 1: "Enterprise refunds allowed within 30 d."  │ ── C1 ── supported
│  Claim 2: "Request must come from billing admin."    │ ── C1 ── supported
│  Claim 3: "Support verifies eligibility."            │ ── C3 ── supported
│  Claim 4: "Approvals complete within 24 hours."      │ ── ??? ── UNSUPPORTED
└──────────────────────────────────────────────────────┘

      faithfulness = supported claims / total claims
                   = 3 / 4
                   = 0.75

Three out of four. That fourth claim is the hallucination. The retrieved context does not say anything about 24 hours. The model invented a number that sounds plausible — and that is what makes it dangerous.
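
The arithmetic above can be sketched in a few lines of Python. This is a minimal illustration, not RAGAS's own code: the `ClaimVerdict` type and the `faithfulness` function are names made up for this example, and the verdicts are filled in by hand from the diagram rather than produced by a judge.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ClaimVerdict:
    claim: str
    source: Optional[str]  # id of the supporting chunk, or None if unsupported


def faithfulness(verdicts: list[ClaimVerdict]) -> float:
    """Supported claims divided by total claims."""
    if not verdicts:
        raise ValueError("cannot score an answer with no claims")
    supported = sum(1 for v in verdicts if v.source is not None)
    return supported / len(verdicts)


# Verdicts copied by hand from the worked example.
verdicts = [
    ClaimVerdict("Enterprise refunds allowed within 30 days.", "C1"),
    ClaimVerdict("Request must come from billing admin.", "C1"),
    ClaimVerdict("Support verifies eligibility.", "C3"),
    ClaimVerdict("Approvals complete within 24 hours.", None),
]

score = faithfulness(verdicts)
unsupported = [v.claim for v in verdicts if v.source is None]
print(score)        # 0.75
print(unsupported)  # ['Approvals complete within 24 hours.']
```

Keeping the unsupported claims next to the score matters in practice: the number tells you something is wrong, the list tells you which sentence to fix.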

Notice what claim decomposition forces. You cannot judge the answer as a blob. A long, fluent paragraph hides many small assertions. Each one needs its own audit. The blob is not the unit. The claim is the unit.

Typical faithfulness ranges, from the field:

System quality	Faithfulness score	Notes
Weak RAG (poor prompt, no abstention rule)	0.55 – 0.70	Many invented numbers, dates, names
Average production RAG	0.75 – 0.85	One unsupported claim per 4–6 claims
Well-tuned production RAG	0.88 – 0.94	Strict prompt, low temperature, citations
Closed-domain (legal, medical) target	> 0.95	Anything lower fails compliance review

Anything below 0.80 is leaking trust. Anything below 0.70 is shipping fiction.

Mini-FAQ. "How is faithfulness different from answer correctness?" Faithfulness asks, "Is every claim supported by the chunks we retrieved?" Answer correctness asks, "Is every claim true according to the ground-truth answer?" A faithful answer can still be factually wrong if the retrieved chunks were wrong. A correct answer can still be unfaithful if it brought in extra true facts the chunks did not mention. They overlap. They are not the same.

The five RAGAS metrics — what each one catches

RAGAS bundles five core metrics. Each one targets a different failure mode. Memorize the table — it is the most-asked interview shape in the entire module.

Metric	What it asks	Failure it catches	Needs ground truth?
Faithfulness	Are answer claims supported by context?	Hallucination	No
Answer Relevance	Does the answer address the question?	Off-topic answers	No
Context Precision	Are retrieved chunks ranked by usefulness?	Noisy top-k	Yes (or judge-based)
Context Recall	Did context cover the ground-truth facts?	Missing evidence	Yes
Answer Correctness	Does the answer match the ground truth?	Wrong facts overall	Yes

Three of them (faithfulness, answer relevance, and context precision in its judge-based mode) can run on production traffic without ground truth. That is why they end up in live dashboards. The other two need a labelled eval set, so they live in CI and offline runs.

Answer Relevance — staying on topic

Take the generated answer. Reverse-engineer it: "What question would this answer perfectly fit?" Compare that reconstructed question to the original. High similarity = relevant. Low similarity = the writer wandered.

Our example answer reconstructs to roughly "What is the enterprise refund process?", while the original asked specifically about the window. So the answer covers the window but also drifts into process. Relevance lands around 0.85.
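
Here is a rough sketch of the comparison step. It assumes the reconstructed question has already been produced by a judge, and it swaps in a bag-of-words cosine similarity as a crude stand-in for the embedding similarity a real evaluator would use. The function names are hypothetical.

```python
import math
import re
from collections import Counter


def bag_of_words(text: str) -> Counter:
    # Lowercased word counts; a stand-in for a real embedding.
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a: Counter, b: Counter) -> float:
    dot = sum(a[token] * b[token] for token in a)
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def answer_relevance(original_question: str, reconstructed_question: str) -> float:
    return cosine_similarity(
        bag_of_words(original_question),
        bag_of_words(reconstructed_question),
    )


original = "What is the refund window for enterprise customers?"
reconstructed = "What is the enterprise refund process?"

drifted = answer_relevance(original, reconstructed)
on_topic = answer_relevance(original, original)
print(round(drifted, 2))   # 0.72 with this crude tokenizer
print(round(on_topic, 2))  # 1.0
```

Word overlap understates similarity compared with embeddings, which is why this toy lands below the 0.85 quoted above. The shape is what carries over: the further the reconstructed question drifts from the original, the lower the score.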

Context Precision — top of the list earns its place

For each chunk in the retrieved context, ask: "Is this chunk useful for the answer?" Then weight by rank. A useful chunk at rank 1 scores more than the same useful chunk at rank 5.

In our example, C1 is the gold chunk and it is at rank 1. C2 is partially useful (it covers refunds past 30 days, which the question did not ask about). C3 is tangential. Precision lands around 0.7.
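
One simple way to make "weight by rank" concrete is shown below. Treat it as an illustration only: the reciprocal-rank weights and the 1.0 / 0.5 / 0.0 usefulness grades are assumptions chosen for this sketch, and it is not claimed to be the exact formula RAGAS implements.

```python
def rank_weighted_precision(usefulness: list[float]) -> float:
    """usefulness[i] is the grade (0.0 to 1.0) of the chunk at rank i + 1."""
    if not usefulness:
        raise ValueError("no chunks retrieved")
    # Rank 1 gets weight 1, rank 2 gets 1/2, rank 3 gets 1/3, and so on.
    weights = [1.0 / rank for rank in range(1, len(usefulness) + 1)]
    weighted = sum(w * u for w, u in zip(weights, usefulness))
    return weighted / sum(weights)


# Assumed grades: C1 useful, C2 partially useful, C3 tangential.
as_retrieved = rank_weighted_precision([1.0, 0.5, 0.0])
# Same three chunks with the gold chunk pushed to the bottom.
reversed_order = rank_weighted_precision([0.0, 0.5, 1.0])

print(round(as_retrieved, 2))    # 0.68
print(round(reversed_order, 2))  # 0.32
```

The same chunks score less than half as well when the gold chunk sits at rank 3. That is the whole point of the metric: the top of the list has to earn its place.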

Context Recall — did anything get missed

For each fact in the ground-truth answer, ask: "Is it present in the retrieved chunks?" If the ground truth says "requests must come from billing admin and renewal must be within 30 days", both facts must be findable in some chunk. Missing one = recall drops.
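
As a minimal sketch, assume a judge or a human has already mapped each ground-truth fact to the chunks that contain it. Recall is then a simple fraction. The `context_recall` function and the fact strings are made up for this example.

```python
def context_recall(fact_sources: dict[str, list[str]]) -> float:
    """Fraction of ground-truth facts found in at least one retrieved chunk."""
    if not fact_sources:
        raise ValueError("ground truth has no facts")
    found = sum(1 for chunks in fact_sources.values() if chunks)
    return found / len(fact_sources)


# Both ground-truth facts are attributable to chunk C1.
full_coverage = {
    "renewal must be within 30 days": ["C1"],
    "requests must come from billing admin": ["C1"],
}
# Hypothetical run where no retrieved chunk mentions the billing admin.
missing_admin = {
    "renewal must be within 30 days": ["C1"],
    "requests must come from billing admin": [],
}

print(context_recall(full_coverage))  # 1.0
print(context_recall(missing_admin))  # 0.5
```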

Answer Correctness — the overall pass/fail

The blended metric. Compare the generated answer to the ground-truth answer using both semantic similarity and a factual-overlap judge. Captures both "says the right thing" and "says it the right way." Most teams use this as their north-star number.
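
A rough sketch of the blend is below. It assumes the factual-overlap judge has already sorted claims into three buckets: in both answer and ground truth, only in the answer, only in the ground truth. The 0.75 / 0.25 weighting and the input numbers are assumptions for illustration, not the chapter's score sheet and not a statement of any library's defaults.

```python
def claim_f1(in_both: int, only_in_answer: int, only_in_truth: int) -> float:
    """F1 over claims: rewards matching facts, penalises extras and omissions."""
    denominator = in_both + 0.5 * (only_in_answer + only_in_truth)
    return in_both / denominator if denominator else 0.0


def answer_correctness(factual_f1: float, semantic_similarity: float,
                       factual_weight: float = 0.75) -> float:
    if not 0.0 <= factual_weight <= 1.0:
        raise ValueError("factual_weight must be between 0 and 1")
    return factual_weight * factual_f1 + (1.0 - factual_weight) * semantic_similarity


# Hypothetical judge output: 3 matching claims, 1 extra claim, 0 omissions.
f1 = claim_f1(in_both=3, only_in_answer=1, only_in_truth=0)
# Hypothetical embedding similarity between answer and ground truth.
blended = answer_correctness(f1, semantic_similarity=0.90)
print(round(f1, 3), round(blended, 3))
```

The factual term is what punishes an invented claim. The similarity term only checks that the answer reads like the reference.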

Predict which RAGAS metric fails the loudest

In our worked example, which RAGAS metric fails the loudest? Write your answer. Then continue.

(Answer: faithfulness, because of the invented 24-hour claim. The other four are mostly fine.)

LLM-as-judge — using a model to grade a model

How does RAGAS actually compute faithfulness? It calls an LLM.

┌──────────────────────────────────────────────────────────┐
│                  LLM-as-judge for faithfulness           │
├──────────────────────────────────────────────────────────┤
│                                                          │
│   STEP 1: Decompose                                      │
│   Prompt the judge: "Break this answer into atomic       │
│   claims, one per line."                                 │
│                                                          │
│   STEP 2: Verify                                         │
│   For each claim, prompt the judge: "Given this          │
│   context, is this claim entailed? Reply yes/no with     │
│   a one-sentence reason."                                │
│                                                          │
│   STEP 3: Aggregate                                      │
│   faithfulness = count(yes) / count(claims)              │
│                                                          │
└──────────────────────────────────────────────────────────┘

This is the part that makes engineers uneasy. We are using an LLM to grade another LLM. The judge can also hallucinate. The judge can also be biased. So why does this work at all?

It works because the judge has an easier job than the answerer. The answerer must produce truth from a question and chunks. The judge must only verify a claim against a chunk — and that is a much simpler decision. Verification is asymmetrically easier than generation. The same way fact-checking is easier than reporting.
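
The three steps can be sketched as a small pipeline with the judge calls passed in as plain functions. Everything named here is hypothetical. In a real setup `decompose` and `verify` would each wrap an LLM prompt; the toy versions below are deliberately crude stand-ins so the example runs without any API.

```python
import re
from typing import Callable

Decomposer = Callable[[str], list[str]]
Verifier = Callable[[str, str], bool]


def judge_faithfulness(answer: str, context: str,
                       decompose: Decomposer, verify: Verifier) -> tuple[float, list[str]]:
    claims = decompose(answer)                                   # STEP 1: decompose
    if not claims:
        raise ValueError("no claims to verify")
    unsupported = [c for c in claims if not verify(c, context)]  # STEP 2: verify
    score = (len(claims) - len(unsupported)) / len(claims)       # STEP 3: aggregate
    return score, unsupported


def toy_decompose(answer: str) -> list[str]:
    # Stand-in for the decomposition prompt: one claim per sentence.
    return [s.strip() for s in answer.split(".") if s.strip()]


def toy_verify(claim: str, context: str) -> bool:
    # Stand-in for the entailment prompt: a claim fails if it cites a
    # number that never appears in the context. A real judge does far more.
    context_numbers = set(re.findall(r"\d+", context))
    return all(n in context_numbers for n in re.findall(r"\d+", claim))


context = (
    "Enterprise annual plans may request a refund within 30 days of renewal. "
    "Requests must come from the billing admin on the account. "
    "Refunds for orders past 30 days require manager approval and are issued "
    "as credit, not cash. Customer Support is responsible for verifying "
    "refund eligibility before any approval workflow."
)
answer = (
    "Enterprise customers can request a refund within 30 days of renewal. "
    "The request must come from the billing admin. Support verifies eligibility. "
    "Refund approvals typically complete within 24 hours."
)

score, unsupported = judge_faithfulness(answer, context, toy_decompose, toy_verify)
print(score)        # 0.75
print(unsupported)  # ['Refund approvals typically complete within 24 hours']
```

Passing the judge in as a function is a design choice worth keeping: it lets you swap judge models, or replay recorded verdicts in tests, without touching the aggregation logic.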

Mini-FAQ. "Why use an LLM to judge an LLM?" Because the alternative is human raters at every check, which costs $1–$5 per evaluation and takes hours. An LLM judge runs at $0.001 to $0.02 per evaluation, in seconds. On a daily eval suite of 1,000 examples across 5 metrics, that is $5–$100 per day instead of $5,000+. Speed and cost win — when the judge is calibrated.

Real numbers for LLM-as-judge cost (approximate, 2025 pricing):

Judge model	Cost per eval (one metric)	Daily cost at 1k examples × 5 metrics
GPT-4o-mini / Haiku class	$0.0005 – $0.002	$2.50 – $10
GPT-4o / Sonnet class	$0.005 – $0.015	$25 – $75
GPT-4 / Opus class	$0.02 – $0.05	$100 – $250
Human rater	$1 – $5	$5,000 – $25,000
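
The daily-cost column is just multiplication, and it is worth having as a function so you can plug in your own volumes. The per-eval prices below are copied from the table; the `daily_cost` helper is a name made up for this sketch.

```python
def daily_cost(cost_per_eval: float, examples: int = 1_000, metrics: int = 5) -> float:
    """Cost of judging every example on every metric once per day."""
    return cost_per_eval * examples * metrics


# (low, high) cost per single-metric evaluation, in dollars, from the table.
price_ranges = {
    "GPT-4o-mini / Haiku class": (0.0005, 0.002),
    "GPT-4o / Sonnet class": (0.005, 0.015),
    "GPT-4 / Opus class": (0.02, 0.05),
    "Human rater": (1.0, 5.0),
}

for judge, (low, high) in price_ranges.items():
    print(f"{judge}: ${daily_cost(low):,.2f} to ${daily_cost(high):,.2f} per day")
```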

Human-judge agreement. When a frontier-class LLM judge scores faithfulness, it agrees with expert human raters about 80–90% of the time on clean datasets. The remaining 10–20% of disagreements is the residual error you must close with sampling and human review on the hard cases.
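
Measuring that agreement on your own data is straightforward once you have a sample labelled by both the judge and a human. A minimal sketch, with hypothetical labels and helper names:

```python
def agreement_rate(judge_labels: list[bool], human_labels: list[bool]) -> float:
    """Share of claims where judge and human gave the same verdict."""
    if not judge_labels or len(judge_labels) != len(human_labels):
        raise ValueError("need two non-empty label lists of equal length")
    matches = sum(j == h for j, h in zip(judge_labels, human_labels))
    return matches / len(judge_labels)


def disagreement_indices(judge_labels: list[bool], human_labels: list[bool]) -> list[int]:
    """Positions to send to the human review queue."""
    return [i for i, (j, h) in enumerate(zip(judge_labels, human_labels)) if j != h]


# Hypothetical verdicts on ten claims (True = supported).
judge = [True, True, False, True, True, True, False, True, True, True]
human = [True, True, False, True, False, True, False, True, True, False]

rate = agreement_rate(judge, human)
to_review = disagreement_indices(judge, human)
print(rate)       # 0.8
print(to_review)  # [4, 9]
```

Raw agreement is the simplest number to report. If one label dominates your sample, a chance-corrected statistic is a safer choice, but that is beyond this sketch.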

How a RAGAS-style metric fits in an eval pipeline

The metric is not a number you compute once. It is a control point in a larger system.

┌────────────────────────────────────────────────────────┐
│                   Eval pipeline shape                  │
└────────────────────────────────────────────────────────┘

  OFFLINE (golden set, runs in CI)
  ────────────────────────────────
    eval_set (200–2000 examples with ground truth)
       │
       ▼
    run RAG pipeline → answers + retrieved context
       │
       ▼
    RAGAS judge → faithfulness, relevance, precision,
                  recall, correctness
       │
       ▼
    compare to last commit → pass / fail / regression

  ONLINE (production traffic, runs hourly)
  ─────────────────────────────────────────
    sampled live answers (1–5% of traffic)
       │
       ▼
    judge → faithfulness, relevance (no ground truth needed)
       │
       ▼
    dashboard + alert if score drops below threshold
       │
       ▼
    flagged examples → human review queue (weekly)

The split matters. Offline catches regressions before they ship. Online catches drift after they ship. Both layers run all the time. The human review queue catches what the judge misses.
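
The "compare to last commit" step in the offline loop can be sketched as a small gate. The function name, the score dictionaries, and the 0.02 tolerance are all assumptions for illustration; pick a tolerance that matches the noise in your own judge.

```python
def regression_report(baseline: dict[str, float], current: dict[str, float],
                      tolerance: float = 0.02) -> dict[str, str]:
    """Label each baseline metric as pass, regression, or missing."""
    report = {}
    for metric, old_score in baseline.items():
        new_score = current.get(metric)
        if new_score is None:
            report[metric] = "missing"
        elif new_score < old_score - tolerance:
            report[metric] = "regression"
        else:
            report[metric] = "pass"
    return report


# Hypothetical scores from the last commit and from this one.
last_commit = {"faithfulness": 0.90, "answer_relevance": 0.88, "context_recall": 0.93}
this_commit = {"faithfulness": 0.84, "answer_relevance": 0.87}

report = regression_report(last_commit, this_commit)
build_passes = all(status == "pass" for status in report.values())
print(report)
print(build_passes)  # False
```

A small tolerance keeps the build from failing on judge noise alone, while a missing metric is treated as a failure so a silently broken evaluator cannot slip through.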

Failure modes — how RAGAS lies to you

Trust the framework. Verify the framework. These are the ways it goes wrong.

Judge model bias. A judge from one model family tends to favour answers from the same family. GPT-4 judges rate GPT-4 outputs higher than they should. Solve by using a cross-family judge or rotating judges.
Judge model too small. A Haiku-class judge scores simple claims reliably but fails on multi-step reasoning claims. Match judge capability to claim complexity. For legal text, do not use a tiny judge.
Leaked context. The judge prompt accidentally includes the ground-truth answer when it should only see the generated answer and the retrieved context. Now the score is artificially high. Audit your judge prompts.
Unverifiable claims dismissed wrongly. A claim like "this policy is fair" is opinion, not fact. A strict judge marks it unsupported. Faithfulness drops unfairly. Filter opinions before claim decomposition or instruct the judge to skip them.
Partial-credit scoring lost. "Refunds within 30 days" is supported; "Refunds within exactly 30 calendar days from purchase" is partially supported. Binary judges lose this nuance. Use graded scoring (0, 0.5, 1) when precision matters.
Claim decomposition errors. The judge splits one logical claim into three, or fuses three into one. Faithfulness goes up or down by accident. Sample 20 decompositions per release and eyeball them.
Prompt sensitivity. Change the wording of the judge prompt by one sentence and scores shift 5–10 points. Lock your judge prompts under version control like code.
Score inflation drift. As prompts get tuned to maximise the judge score, real quality may drop. The model learns to please the judge, not the user. Counter with periodic human-eval blind tests.
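
Two of the fixes in the list above, graded scoring and skipping unverifiable claims, fit in one small function. This is a minimal sketch under assumed conventions: grades are 0, 0.5, or 1, and `None` marks a claim the judge returned as not verifiable. The function name is hypothetical.

```python
from typing import Optional

ALLOWED_GRADES = {0.0, 0.5, 1.0}


def graded_faithfulness(grades: list[Optional[float]]) -> Optional[float]:
    """Mean grade over verifiable claims; None entries are skipped entirely."""
    scored = [g for g in grades if g is not None]
    if any(g not in ALLOWED_GRADES for g in scored):
        raise ValueError("grades must be 0, 0.5, or 1")
    if not scored:
        return None  # nothing verifiable, so there is no score to report
    return sum(scored) / len(scored)


# Hypothetical answer: supported, partially supported, opinion, unsupported.
grades = [1.0, 0.5, None, 0.0]

skipping_opinions = graded_faithfulness(grades)
counting_opinions_as_failures = graded_faithfulness(
    [0.0 if g is None else g for g in grades]
)
print(skipping_opinions)              # 0.5
print(counting_opinions_as_failures)  # 0.375
```

The gap between the two numbers is the unfair penalty an opinion sentence adds when a strict judge marks it unsupported.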

Mini-FAQ. "When does an LLM judge disagree with a human?" Most often on: claims that need outside knowledge ("does this drug interaction match clinical guidelines?"), claims with quantitative precision ("within 30 days" vs "30 calendar days"), claims with implicit context ("our customers" — which customers?), and claims with negation ("the policy does not apply to X"). Negation handling is the single biggest source of judge error.

Recall — faithfulness scoring cold

What is the one-line rule for faithfulness?
In our worked example, which claim is unsupported and why?
What is the difference between faithfulness and answer correctness?
Which three RAGAS metrics can run on production traffic without ground truth?
Why is using an LLM to judge an LLM defensible?
Roughly what fraction of the time does a frontier judge agree with humans?
Name three failure modes of LLM-as-judge.
Where does the offline RAGAS eval sit in the dev loop, and where does the online one sit?

Faithfulness scoring across eval frameworks

The frameworks below all implement faithfulness-style scoring or wrap RAGAS directly. The shape is constant; the brand differs.

RAGAS — the reference open-source framework. Faithfulness, answer relevance, context precision, context recall, answer correctness as Python metrics.
TruLens (TruEra) — open-source LLM app evals with groundedness, context relevance, and answer relevance as the "RAG triad."
DeepEval — pytest-style LLM eval framework with faithfulness, hallucination, contextual recall metrics.
Promptfoo — config-driven eval runner that supports RAGAS assertions and custom judges in CI.
Phoenix (Arize AI) — observability platform with built-in LLM-as-judge evaluators for faithfulness and relevance.
LangSmith (LangChain) — tracing + dataset-driven eval; ships RAG-specific evaluators out of the box.
LangFuse — open-source tracing and eval platform; faithfulness scores can be attached to traces.
Helicone — LLM observability with custom eval pipelines including hallucination scoring.
Patronus AI Lynx — Patronus's hallucination-detection model, purpose-built as a faithfulness judge.
Galileo (Genie) — enterprise LLM eval with their "Context Adherence" metric (RAGAS faithfulness under another name).
OpenAI Evals — open-source framework for grading model outputs; supports custom RAG eval graders.
Anthropic Eval Cookbook — recipes for claim-decomposition evals using Claude as judge.
Vectara HHEM — Vectara's Hughes Hallucination Evaluation Model, an open-weights faithfulness scorer.
Glean — enterprise search; internal grounding evals reject answers below a faithfulness threshold before display.
Perplexity — internal eval stack measures citation-claim alignment for every shipped answer.
Confident AI — DeepEval's commercial dashboard for tracking faithfulness over time.
Comet Opik — open-source LLM tracing and eval with built-in hallucination metric.
Weave (Weights & Biases) — eval logging with faithfulness scorers integrated.
MLflow LLM Evaluate — MLflow's eval module with faithfulness and toxicity metrics.
Inspect AI (UK AISI) — safety-focused eval framework that includes groundedness probes.
Athina AI — RAG eval platform marketed specifically around faithfulness and context quality.
Ragas Cloud — managed RAGAS service for teams that don't want to run the pipeline themselves.

Every team building serious RAG ends up with one of these in their stack. The metric is the same. The wrapper differs.

Interview Q&A

Q1. How do you measure hallucination in a RAG system? A. Compute faithfulness: decompose the generated answer into atomic claims, verify each claim against the retrieved context using an LLM judge, return supported / total. Below 0.80 is usually a problem; below 0.70 ships fiction. Common wrong answer to avoid: "You compare the answer to the ground-truth answer." That is answer correctness, not faithfulness.

Q2. What is the difference between faithfulness and answer correctness? A. Faithfulness compares the answer to the retrieved chunks. Answer correctness compares the answer to a ground-truth reference. A faithful answer can be wrong if retrieval was wrong; a correct answer can be unfaithful if it added true facts the chunks did not contain. Common wrong answer to avoid: "They are the same — both check if the answer is true."

Q3. Why is LLM-as-judge defensible when we know LLMs hallucinate? A. Because verification is asymmetrically easier than generation. The judge only checks claim-against-chunk entailment, not multi-hop synthesis. Frontier judges agree with expert humans 80–90% of the time at a fraction of the cost. Common wrong answer to avoid: "It's not defensible, you should always use humans."

Q4. What are the five RAGAS metrics and what does each catch? A. Faithfulness (hallucination), answer relevance (off-topic), context precision (noisy top-k), context recall (missing evidence), answer correctness (wrong overall). The first three need no ground truth; the last two do. Common wrong answer to avoid: "RAGAS just has one metric called faithfulness."

Q5. Where does RAGAS fit in your eval pipeline? A. Offline against a golden set in CI to catch regressions before shipping; online on sampled production traffic to catch drift after shipping. Both feed a human-review queue for the hardest cases. Common wrong answer to avoid: "You run RAGAS once at launch and forget it."

Q6. Faithfulness is 0.95 but users still complain. What do you check next? A. Answer relevance and context recall. The model may be grounded in the wrong chunks (high faithfulness, low recall) or answering a different question than asked (high faithfulness, low relevance). Faithfulness is necessary, not sufficient. Common wrong answer to avoid: "Switch to a stronger generator model."

Q7. What are three failure modes of LLM-as-judge that you would watch for? A. Judge-family bias (judge rates same-family models too high), prompt sensitivity (small wording changes shift scores 5–10 points), and weak handling of negation and quantitative precision. Counter with cross-family judges, locked prompt versions, and periodic human-blind audits. Common wrong answer to avoid: "LLM judges don't really fail — they just need a bigger model."

Q8. A claim is opinion, not fact — for example, "this is the best policy." How should the judge handle it? A. Either filter opinion claims out before decomposition, or instruct the judge to skip non-verifiable claims (return "N/A") so they neither raise nor lower the score. Counting opinions as unsupported wrongly punishes faithful answers. Common wrong answer to avoid: "Mark it unsupported — anything not in the chunks fails."

Apply now (10 min)

Step 1 — I model it first. Take the running example. Here is my full score sheet.

Metric	Score	Why
Faithfulness	0.75	3 of 4 claims supported; "24-hour approval" invented
Answer Relevance	0.85	Covers the window, drifts into process details
Context Precision	0.70	C1 useful (rank 1); C2 partial; C3 tangential
Context Recall	1.00	Both ground-truth facts (window, admin) are in chunks
Answer Correctness	0.80	Right window and process, but the 24-hour claim is not in the ground truth

The pattern: faithfulness alone surfaces the hallucination. The other four explain why the system is otherwise healthy.

Step 2 — your turn. Pick one RAG answer your system shipped this week. Decompose it into atomic claims. Mark each as supported, partially supported, or unsupported by the retrieved chunks. Compute faithfulness. If you have a ground-truth answer too, compute answer correctness. Compare. Which number is lower, and what does the gap tell you?

Step 3 — sketch from memory. Redraw the two-report-card diagram. List the five RAGAS metrics in their columns. For each one, write one failure mode it catches. Then add one line: which three can run live on production traffic without ground truth?

What you should remember

This chapter explained why retrieval metrics are necessary but not sufficient: the writer can have every right chunk on the reading desk and still produce a sentence the chunks do not support. Faithfulness measures exactly that gap — atomic-claim entailment against the retrieved context — and it is the metric that catches the fluent hallucination retrieval metrics cannot see. The five RAGAS metrics together cover the full surface: faithfulness, answer relevance, context precision, context recall, answer correctness. Three need no ground truth and can run on live traffic; two require a labelled answer.

You also learned why LLM-as-judge is defensible at all. Verification is asymmetrically easier than generation, frontier judges agree with expert humans 80–90% of the time, and the cost gap (≈$10/day vs $5,000/day) decides the argument. But the judge can drift, fail on negation, and rate same-family models too kindly. Lock judge prompts in version control, rotate across model families, and keep a small human-review queue for the hard cases.

Carry this diagnostic forward: when users complain but faithfulness is 0.95, look at answer relevance and context recall before touching the generator. The model may be grounded in the wrong chunks, or answering the wrong question. Faithfulness is necessary, not sufficient.

Remember:

Faithfulness ≠ correctness. A faithful answer can be wrong if retrieval was wrong.
Three of the five RAGAS metrics run online without ground truth. Use them.
The judge has an easier job than the answerer — verification is asymmetrically easier than generation.
Negation, quantitative precision, and implicit context are the judge's known weak spots. Sample by hand on these.
Pin judge prompts under version control. A one-sentence change moves scores 5–10 points.

Bridge. RAGAS measures what we can measure. But some RAG failures live in the gaps no metric covers — multi-hop reasoning, contradictions across chunks, the question that should never have been asked at all. The final chapter is honest about where RAG stops and what to tell users when it does.

→ 14-honest-admission.md