06. Explainability for LLMs — when the judge tells a nice story after the verdict

~15 min read. Language models can produce smooth explanations, but smooth is not the same as faithful.

Built on the ELI5 in 00-eli5.md. The verdict — the model's answer — may come with an explanation, but we still need to know whether the judge is describing real causes or only a plausible story.


Picture first: explanation can be evidence or theater

Imagine a judge finishing a verdict. Then a reporter asks, "Why did you decide that?" The judge can do two different things. One, reveal the actual reasoning path. Two, tell a polished story that sounds good afterward. LLM explanations live inside this tension.

Chain-of-thought feels attractive because it looks like transparent reasoning. The model writes intermediate steps. It sounds reflective. Sometimes it helps performance too. Good. But the existence of a chain does not prove the chain caused the answer.

question
┌──────────────────────┐
│ hidden model state   │  many internal activations
└──────────┬───────────┘
           ├──→ final answer
           └──→ spoken explanation

See the structural issue. The same judge generates both the answer and the explanation. If the model is biased, confused, or strategically polite, the explanation may inherit that behavior. So we care about faithfulness. Not only plausibility.

Chain-of-thought is useful, but not guaranteed faithful

Now what is faithfulness? An explanation is faithful if it tracks the real internal or external causes of the answer. A plausible explanation merely sounds reasonable to a human. Those are different standards.

Why can chain-of-thought fail faithfulness? Because the model may arrive at an answer through hidden pattern matching, then generate a clean rationale afterward. It may omit a shortcut. It may sanitize a biased association. It may mention one retrieved passage while actually depending on another.

This is why many teams do not expose full chain-of-thought to users. They prefer concise rationales, evidence citations, or tool traces. The case record should state which form of explanation is offered and what it does not guarantee. Simple, no? If you promise more than your explanation can support, you create a second trust problem.

Worked example: plausible citation, wrong real cause

Use a concrete setup. An LLM support assistant answers, "Deny the refund." It explains, "Because the item is outside the return window." That sounds clean.

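But the prompt actually contained two relevant facts. Fact A: a return-window sentence whose purchase date suggests the item may still be inside the window. Fact B: a fraud-risk flag from a tool, hidden from the user, is high. To test faithfulness, we ablate evidence. That means we remove one fact at a time, rerun the model, and compare the probability of denial.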

Case 1: remove the cited return-window sentence. Answer probability for denial changes from 0.82 to 0.79. Small drop. Case 2: keep that sentence, but remove the fraud-risk signal. Answer probability for denial drops from 0.82 to 0.31. Big drop.

Look. The spoken explanation pointed to Fact A. But the real decision leaned much more on Fact B. The explanation was plausible. It was not faithful. That is the core problem.

original denial probability = 0.82
      ├── remove cited reason ─────→ 0.79
      └── remove hidden tool signal → 0.31

Simple, no? The judge told a courtroom-friendly story. The actual causal driver was elsewhere. This can happen in LLM systems, and users usually cannot detect it from the explanation alone.
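
The arithmetic behind this comparison can be sketched in a few lines of Python. This is a minimal sketch, not a standard metric: the function names `ablation_drops` and `is_faithful_citation`, the evidence labels, and the 0.10 threshold for a "material" drop are all assumptions made for illustration. The probabilities are the ones from the worked example; in a real system they would come from rerunning the model with each piece of evidence removed.

```python
def ablation_drops(baseline, ablated):
    """Return how far the answer probability falls when each piece of evidence is removed.

    baseline: probability of the answer with the full prompt.
    ablated:  dict mapping an evidence label to the probability after removing it.
    """
    return {label: round(baseline - p, 4) for label, p in ablated.items()}


def is_faithful_citation(cited, drops, material=0.10):
    """Hypothetical rule: the cited evidence should cause a material drop
    and should be the largest driver among the evidence we tested."""
    top_driver = max(drops, key=drops.get)
    return drops[cited] >= material and top_driver == cited


# Numbers from the worked example (labels are illustrative).
drops = ablation_drops(
    baseline=0.82,
    ablated={"return_window_sentence": 0.79, "fraud_risk_flag": 0.31},
)
print(drops)
print(is_faithful_citation("return_window_sentence", drops))
```

Under these assumptions the cited return-window sentence accounts for a drop of 0.03 while the uncited fraud flag accounts for 0.51, so the check rejects the spoken explanation. The threshold is a judgment call that each team would need to set and justify for its own product.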

Better explanation patterns for LLM systems

So what to do? Prefer explanations grounded in observable artifacts. Citations to retrieved passages. Tool traces showing which API was called. Structured reason codes. Confidence intervals or uncertainty bands when available. Human-review escalation for high-stakes cases.

If you use chain-of-thought internally, evaluate it. Check consistency across paraphrases. Check whether removing the cited evidence changes the answer materially. Check whether different rationales produce the same answer anyway. Check whether the explanation omits sensitive or proxy features that the model seems to exploit. That is part of the appeal process for LLM systems.
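
The first of these checks, consistency across paraphrases, might look like the sketch below. It is one possible way to score consistency, not an established benchmark. The helper `paraphrase_consistency` is a name invented here, and `toy_assistant` is a deliberately brittle stand-in for a real model call so the example runs on its own.

```python
from collections import Counter


def paraphrase_consistency(answer_fn, paraphrases):
    """Return the most common answer and the fraction of paraphrases that produce it."""
    answers = [answer_fn(p) for p in paraphrases]
    majority_answer, majority_count = Counter(answers).most_common(1)[0]
    return majority_answer, majority_count / len(answers)


def toy_assistant(prompt):
    """Hypothetical stand-in for a model call: it flips its answer on the word 'please'."""
    return "approve" if "please" in prompt.lower() else "deny"


prompts = [
    "Can I get a refund for order 1001?",
    "I would like a refund for order 1001.",
    "Refund order 1001, please.",
    "Is order 1001 eligible for a refund?",
]
answer, rate = paraphrase_consistency(toy_assistant, prompts)
print(answer, rate)
```

A rate well below 1.0 suggests the answer depends on surface wording that no rationale is likely to mention, which is a reason to distrust the accompanying explanation.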

Also distinguish user-facing and auditor-facing explanations. Users need concise, safe, comprehensible reasons. Auditors need deeper evidence about prompts, retrieval, tools, and decision pathways. Do not confuse the two layers. One is a product surface. The other is a governance surface.
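
One way to keep the two layers apart is to store a single record and derive both views from it. The sketch below is illustrative only: the class `ExplanationRecord`, its field names, the reason code, and the tool name are all hypothetical, and the ablation figures are the drops implied by the worked example (0.82 to 0.79, and 0.82 to 0.31).

```python
from dataclasses import dataclass, field


@dataclass
class ExplanationRecord:
    answer: str
    reason_code: str                                     # structured reason code shown to users
    citations: list = field(default_factory=list)        # passages the user can inspect
    tool_trace: list = field(default_factory=list)       # which tools were called, with results
    ablation_drops: dict = field(default_factory=dict)   # faithfulness evidence for auditors

    def user_view(self):
        """Product surface: answer, reason code, and citations only."""
        return {
            "answer": self.answer,
            "reason_code": self.reason_code,
            "citations": list(self.citations),
        }

    def auditor_view(self):
        """Governance surface: the user view plus tool traces and evaluation evidence."""
        view = self.user_view()
        view["tool_trace"] = list(self.tool_trace)
        view["ablation_drops"] = dict(self.ablation_drops)
        return view


record = ExplanationRecord(
    answer="Deny the refund",
    reason_code="OUTSIDE_RETURN_WINDOW",                          # hypothetical code name
    citations=["return-window sentence"],
    tool_trace=[{"tool": "fraud_risk_check", "result": "high"}],  # hypothetical tool name
    ablation_drops={"return_window_sentence": 0.03, "fraud_risk_flag": 0.51},
)
print(record.user_view())
print(sorted(record.auditor_view()))
```

Deriving both views from one record means the user-facing reason can always be checked against what the auditor sees, including cases like this one where the cited reason is not the main driver.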

What honesty looks like in LLM explainability

An honest team says this clearly: "Our explanation describes evidence we can show. It improves reviewability. It does not expose the model's full inner algorithm. It may not be fully faithful to every hidden computation."

That is not weakness. That is precise communication. The case record should document explanation type, evaluation method, and failure modes. For some products, faithful explanation may matter more than fluent explanation. For other products, safe concise rationale may matter more. But pretending the model is transparently self-aware is a mistake.

Yes? The courtroom analogy helps again. Sometimes the judge can cite documents used in the verdict. That is good. It is still not the same as letting you read every internal thought that produced the verdict.


Where this lives in the wild

  • GitHub Copilot with repository context — developer tools researcher: explanation quality depends more on visible citations and code references than on polished free-form rationales.
  • Perplexity answer pages — search product evaluator: citations improve inspectability, but users still need checks that the cited passage actually supports the claim.
  • Intercom Fin support workflows — support automation lead: concise explanation plus ticket evidence is safer than exposing full private chain-of-thought.
  • Harvey legal drafting assistant — legal ops reviewer: must separate persuasive legal-style prose from faithful grounding in the actual cited source set.
  • Notion AI workflow actions — product safety engineer: tool traces and explicit action logs help more than generic natural-language rationales after side effects occur.

Pause and recall

  • What is the difference between a plausible explanation and a faithful explanation?
  • In the worked example, which ablation showed the true causal driver more clearly?
  • Why might teams avoid exposing raw chain-of-thought directly to users?
  • What kinds of observable artifacts make LLM explanations more trustworthy?

Interview Q&A

Q: Why prefer evidence citations or tool traces and not only chain-of-thought for product explanations?
A: Because citations and traces point to observable artifacts that can be audited, while chain-of-thought can be fluent yet causally unfaithful.
Common wrong answer to avoid: "Because chain-of-thought never helps model performance."

Q: Why can post-hoc rationales look convincing even when they are wrong?
A: Because the same language model that generated the answer is good at producing coherent stories after the fact, even if those stories omit real drivers.
Common wrong answer to avoid: "Because users are too unsophisticated to read explanations."

Q: Why distinguish user-facing explanations from auditor-facing explanations?
A: Because users need safe, concise, understandable reasons, while auditors need deeper evidence about retrieval, tool calls, and hidden dependencies.
Common wrong answer to avoid: "Because auditor-facing explanations should always reveal proprietary chain-of-thought tokens verbatim."

Q: Why evaluate explanation faithfulness by perturbing or ablating evidence?
A: Because changing the claimed cause is one of the clearest ways to test whether the answer actually depends on that cause.
Common wrong answer to avoid: "Because ablation proves the explanation is morally fair."


Apply now (5 min)

Exercise. Take one LLM answer you trust. Write the explanation it gave. Now imagine removing the cited evidence and removing an uncited but plausible hidden clue. Which removal would likely change the verdict more?

Sketch from memory. Draw a split between final answer and spoken explanation. Under it, write one sentence starting with, "Plausible is not the same as..." Then list two observable explanation artifacts you would trust more than raw chain-of-thought.


Bridge. Per-answer explanations help, but they are not enough. We also need a durable case record that documents intended use, limitations, and fairness results before others trust the judge. → 07-model-cards-documentation.md