Skip to content

05. Hidden Chain-of-Thought — The model thinks more than it shows, and you pay for both

~11 min read. OpenAI hides the chain. Anthropic shows it. Google summarizes. The choice shapes your billing, your debugging, and your faithfulness story.

Built on the ELI5 in 00-eli5.md. the thinking pause — can be hidden, visible, or summarised depending on the provider — but is always billed because the compute happens either way.


Visible answer vs internal work

A concise answer can come out of a long internal scratchpad. The model has been trained to spend many tokens before the user-facing reply. Whether you see those tokens depends on the provider.

user prompt
[reasoning tokens — hidden or visible]
final answer

OpenAI's o-series and GPT-5 thinking tier hide reasoning tokens. The Responses API returns usage.reasoning_tokens (a number you pay for) but the actual content does not appear in the response. Anthropic's extended thinking returns a thinking block in the message content — you read it, log it, route on it. Google's Gemini thinking returns a summary of the thinking rather than raw chain. xAI's Grok 4 returns reasoning_details with the visible chain.

That single design choice — hide vs show — drives many downstream engineering decisions.


Why OpenAI hides raw reasoning

Three reasons, all defensible.

Safety. Raw chains may contain unstable rationalizations, false intermediate guesses, or instruction-followed steps that look bad out of context (e.g. "the user is trying to bypass policy, but let me see..."). A user reading the raw chain can misinterpret it. OpenAI's safety team prefers a clean output surface.

Prompt-extraction defense. If your system prompt is reflected in the model's reasoning ("the system said to refuse if X"), hiding the chain stops that leakage. Visible CoT has been used to extract system prompts in red-team exercises.

Training perverse incentives. If the model is trained to expose CoT for human approval, it may learn to perform its reasoning for the audience rather than solve the task. RLHF on visible scratchpads risks training a "show-good-reasoning" head distinct from the "solve-task" head. Hiding the chain keeps the training signal honest.

OpenAI's position is: the model thinks, you pay for the compute, the reasoning is for the model not for you, debug via outputs and evidence.


Why Anthropic shows it

Three counter-arguments, also defensible.

Debuggability. Production agent loops need to know why the model picked tool X over tool Y. Visible thinking is your only window into that decision before it became an action. Anthropic explicitly markets extended thinking as a debugging surface for agents.

Interleaved tool use. Claude can run tools during the thinking block and use results in the same chain. To make this work, the chain must be addressable — you write a message with a tool result block that the model's thinking then attends to. Hidden chains can't do this cleanly.

Safety via transparency. If the model is plotting something problematic, you want to see it. Anthropic's CoT-monitorability paper (arXiv 2507.11473) argues visible reasoning is a "fragile opportunity for AI safety" — it lets you catch deceptive patterns before they reach the output.

So the same architectural feature (reasoning tokens) has two reasonable engineering interfaces. There is no universal "right answer."


Worked example: visible vs hidden token economics

Same task. Two providers. Same effort level.

Provider Reasoning tokens Visible thinking Output tokens Billed Cost
OpenAI o3, effort=medium ~6,000 hidden 0 chars to user ~800 6,800 @ $8/M output $0.054
Claude Sonnet 4.6, effort=medium ~6,000 visible shown in thinking block ~800 6,800 @ $15/M output $0.102
Gemini 3 Pro, thinkingLevel=HIGH summary only ~200-char summary returned ~800 summary chars + reasoning @ $12/M ~$0.090

Two implications.

First, the hidden tokens still cost real money. OpenAI charging $0.054 for content you never see can feel unfair until you realize the compute happened anyway. Anthropic charges $0.102 but lets you replay, log, and route on the chain — debugging gain vs raw price.

Second, your downstream pipeline shape changes. If you want to log reasoning for compliance (legal, finance), Anthropic is the natural fit. If you want a clean output surface for end users (consumer chatbots), OpenAI's hidden chain saves you from a sanitization step.


What users should still get

Even when raw chain is hidden, users still need calibration to trust the answer. Show:

  • The final answer, clearly delimited.
  • Citations for any retrieved facts. "According to §4.2 of the policy, ..."
  • A short rationale keyed to evidence — not a stream-of-consciousness chain.
  • Confidence or uncertainty markers for high-stakes outputs (legal, medical, financial).
  • Validation results if tools were run: "Schema check passed. Unit tests passed 12/12."

The pattern: internal reasoning is for the model, the user-facing rationale is for the human. Anthropic's extended thinking has a built-in summary mode for exactly this — show a short summary, not the raw thinking block, while keeping the full block in your logs.

See. A short audit trail keyed to evidence is more useful than a giant scratchpad dump. Faithfulness research (see chapter 13) tells us the raw chain may not be the real causal reasoning anyway. So show summary + evidence, log the full chain for debugging.


The engineering consequence

Do not depend on visible reasoning alone for trust. Whether the provider shows or hides the chain, add external checks:

  • Tool calls for any external state lookup.
  • Schema validation on structured outputs.
  • Unit-test / compile / lint on any generated code.
  • Citation verification — fetch the cited document, confirm the quoted text exists.
  • Verifier pass with a separate model graded against the same task.
  • Outcome logging — log inputs, outputs, eval scores, costs. Reasoning chains are interesting but not as load-bearing as outcomes.

Hidden reasoning + strong external verification is often safer than visible reasoning + naive trust. Judge systems by outputs, evidence, and reliability — not by how much private thinking they expose.


Where this lives in the wild

  • OpenAI Responses API — encrypted reasoning items — for stateless flows (no session memory), you can pass encrypted_reasoning_items from prior turns so the model resumes its reasoning without you re-paying. Hidden CoT preserved across requests.
  • Claude Code (Anthropic's coding CLI) — extended thinking visible in the tool UI; users can read why the model chose to edit file A before B, valuable for debugging agent loops.
  • Perplexity Pro Deep Research — runs internal reasoning hidden from the user, presents a curated step list ("searched X, read Y, synthesised Z") as the visible rationale. Internal reasoning supports the search policy; visible rationale supports user trust.
  • Vertex AI Gemini for enterprise audit — admins can opt into raw-thinking logs for compliance review while end users see only the summarized thinking block.
  • xAI Grok 4 reasoning_details — fully visible chain by default; some enterprises wrap a sanitizer middleware to strip the chain before user display, while keeping it in logs.

Pause and recall

  1. Name three reasons OpenAI hides raw reasoning tokens. Name three reasons Anthropic shows them.
  2. In the cost table, why does Anthropic appear ~2× more expensive than OpenAI for the same reasoning depth?
  3. What is OpenAI's encrypted_reasoning_items mechanism, and what does it save you from?
  4. Even with hidden CoT, what five surfaces should users see for trust?

Interview Q&A

Q: A regulator asks "show me how the model reached this denied-claim decision." Your provider hides CoT. What do you do? A: Hidden CoT is your inference-time cost; it does not let you escape audit obligations. You need a paper trail independent of the model's chain: (1) log the input, retrieved context, tool calls, final output, and usage.reasoning_tokens count as evidence the model reasoned at all; (2) run a separate verifier model — could be cheaper — that produces an auditable rationale grounded in the same retrieved evidence; (3) for regulated domains, prefer a provider that exposes the chain (Anthropic) and route those decisions there. Architecture choice is a compliance choice.

Common wrong answer to avoid: "Switch to a model that shows reasoning so we can hand the chain to the regulator" — the chain may not be faithful (Anthropic 2025 research). What auditors actually need is evidence-grounded reasoning artifacts, not raw scratchpads.

Q: You're logging Claude extended thinking blocks to S3. What goes wrong if you log naively? A: Three problems. First, PII and sensitive details — the thinking block often echoes user inputs verbatim including PII; you need redaction. Second, cost explosion — thinking blocks can be 50,000+ tokens; raw logging is expensive at $0.023/GB plus replay storage. Use sampling or summarization. Third, legal discoverability — internal reasoning is a fresh artifact your legal team may have to produce in litigation. Set a retention policy and document why you log. Most teams log a hash + 200-char summary plus full chain for 1-5% sampled traces.

Common wrong answer to avoid: "Just log everything for debugging" — naive logging of full thinking creates a PII and legal risk surface most teams don't appreciate until counsel flags it.

Q: Why doesn't OpenAI just train the model to never put sensitive content in hidden CoT, then expose it? A: Two reasons. First, RL training on visible-CoT objectives biases the chain toward what humans approve of rather than what solves the task — you get performative reasoning, lower task performance. Anthropic's interpretability research has shown this trade-off explicitly. Second, hidden CoT acts as a safety margin — when the model considers a problematic instruction, it can reason about refusal internally without surfacing the dangerous framing to the user. Exposing the chain removes that margin. The hidden-vs-visible choice is a deliberate alignment-vs-transparency trade-off, not laziness.

Common wrong answer to avoid: "OpenAI hides it for competitive reasons" — there are competitive reasons too (prompt-extraction defense), but the safety and training-incentive arguments are the load-bearing ones in their own documentation.

Q: You need to pass reasoning context across three sequential API calls without re-paying. What's the OpenAI pattern? A: Use the Responses API's encrypted_reasoning_items. Call 1 returns a reasoning artifact bundle; you pass it as input to call 2; the model resumes its prior chain without you re-billing for the hidden tokens. Without this, you'd lose the chain between calls and either pay twice or lose continuity. Anthropic's equivalent is to keep the assistant message (with thinking block) in the conversation history, but watch the context window — thinking blocks are not automatically compressed.

Common wrong answer to avoid: "Just send the full conversation history each time" — for OpenAI's hidden CoT that won't work (you don't have the tokens); for Anthropic it works but eats context aggressively.


Apply now (5 min)

Pick one production reasoning call from your stack. Inspect: how many reasoning tokens did it spend? Are they visible or hidden? Where are they logged? If you're using Anthropic, sample 5 thinking blocks and label whether they contain PII, sensitive customer details, or content you would not want in S3 cold storage. If you're using OpenAI, calculate what encrypted_reasoning_items would save you on a 3-turn task.

Sketch from memory: Draw the prompt → hidden/visible chain → answer pipeline for OpenAI, Anthropic, and Google side by side, annotated with what gets billed and what gets returned.


Bridge. Hidden, visible, summarized — the three closed-API answers. But there is a fourth: open-weight models where you control everything. That is the DeepSeek-R1 story and the open ecosystem. → 06-deepseek-r1-open-ecosystem.md