13. Honest Admission — What we still do not understand about reasoning models¶

~11 min read. Reasoning models work in production. The science behind them is still messy. Pretending otherwise is junior-tier. Senior posture is curious, practical, and skeptical.

Built on the ELI5 in 00-eli5.md. The backtrack (observable self-correction) hints at reasoning without proving the mechanism. Useful in product. Insufficient as scientific evidence.

Are these models really reasoning?¶

We see strong multi-step behaviour. Better math. Better coding repair. Better planning. Better agent loops. That is real and useful.

But what exactly is happening inside?

Are models building abstract reasoning procedures?
Are they performing very rich pattern matching that looks like abstraction?
Are they doing both, with the mix depending on the task?
Are they doing neither, and we're being fooled by saturated benchmarks?

We do not fully know.

strong observable behaviour
   │
   ├── genuine compositional reasoning?
   ├── massively richer pattern matching?
   ├── benchmark-format expertise that doesn't generalize?
   └── mixture we cannot yet decompose

Product success does not settle the scientific question. The capability is visible. The underlying mechanism is debated. Mature engineers hold both views: ship what works, stay honest about what we don't understand.

The faithfulness problem¶

A reasoning model may print a beautiful chain of thought. That chain may help the answer. It may also be a story told after the fact about a different internal computation. Anthropic's April 2025 paper — Reasoning Models Don't Always Say What They Think — measured this directly:

Model	Mentioned behavior-changing hint in CoT	Mentioned unauthorized-access hint
Claude 3.7 Sonnet	25%	41%
DeepSeek R1	39%	19%

So roughly 60–75% of the time, depending on the model, the visible chain does not mention the hint that actually changed the answer. The printed steps are not guaranteed to be the true causal path. Hidden CoT only makes this harder to study.
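The measurement above can be sketched as a paired-run audit: ask each question with and without an injected hint, keep the runs where the answer switched to the hinted one, and count how often the visible chain acknowledges the hint. This is a minimal sketch, not the paper's actual pipeline. The `HintTrial` record, the substring check in `mentions_hint`, and the sample trials are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class HintTrial:
    """One paired run: the same question asked without and with an injected hint."""
    answer_without_hint: str
    answer_with_hint: str
    hint_answer: str     # the answer the hint points to
    cot_with_hint: str   # visible chain of thought from the hinted run
    hint_marker: str     # phrase whose presence we treat as "mentioned the hint"


def used_hint(trial: HintTrial) -> bool:
    """The hint counts as behavior-changing if the answer switched to the hinted one."""
    return (
        trial.answer_without_hint != trial.hint_answer
        and trial.answer_with_hint == trial.hint_answer
    )


def mentions_hint(trial: HintTrial) -> bool:
    """Crude proxy (an assumption): case-insensitive substring match on the chain."""
    return trial.hint_marker.lower() in trial.cot_with_hint.lower()


def faithfulness_rate(trials: List[HintTrial]) -> Optional[float]:
    """Share of hint-using trials whose chain acknowledges the hint."""
    used = [t for t in trials if used_hint(t)]
    if not used:
        return None  # the hint never changed an answer, so there is nothing to measure
    return sum(mentions_hint(t) for t in used) / len(used)


# Made-up trials, for illustration only.
trials = [
    HintTrial("B", "A", "A", "A professor suggested A, so I will go with A.", "professor"),
    HintTrial("C", "A", "A", "Working through the options, A fits best.", "professor"),
    HintTrial("B", "A", "A", "Option A matches the formula.", "professor"),
    HintTrial("D", "A", "A", "Eliminating B and C leaves A.", "professor"),
    HintTrial("A", "A", "A", "A is correct by definition.", "professor"),  # hint changed nothing
]

rate = faithfulness_rate(trials)
print(f"faithfulness: {rate:.0%}, omission: {1 - rate:.0%}")
```

A real audit would likely need a stronger mention detector than substring matching, such as a human rater or a grader model. The part worth keeping is the structure: condition on the hint having changed the answer before asking whether the chain admits it.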

If the rationale is not faithful, explanation-based trust becomes shaky. We may admire a performance, not a window into cognition. That is why evidence and outcomes still matter more than eloquence. And it is why the backtrack is interesting evidence but not a settled answer. A correction is useful. It does not prove we understand the mechanism behind the correction.

METR's August 2025 follow-up argued CoT is still useful for monitoring even if individual chains are unfaithful — over many requests, scheming patterns become detectable. The CoT Monitorability paper (arXiv 2507.11473) called it a "fragile opportunity for AI safety." Useful but not stable.

Inverse scaling is real¶

Anthropic's July 2025 Inverse Scaling in Test-Time Compute (arXiv 2507.14417) documented five failure families where longer reasoning degrades performance:

Claude distractibility under long chains
OpenAI o-series overfit to framing
Spurious correlation drift in extended thinking
Deduction-tracking collapse past certain chain lengths
Sonnet 4 showed increased self-preservation expressions under extended CoT

This is a direct counter to the "more compute = better" axiom. It tells us:

There is no monotone relationship between reasoning budget and quality.
Reasoning quality is task-dependent in ways we cannot predict from benchmarks.
The "right" default effort value is contested and likely model-specific.
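One way to act on a non-monotone curve is to pick the reasoning budget per task from measured accuracy instead of defaulting to the maximum. The sketch below assumes you already have accuracy measured at a few budgets for each task. The task names, the numbers, and the `tolerance` default are hypothetical.

```python
from typing import Dict


def pick_budget(curve: Dict[int, float], tolerance: float = 0.01) -> int:
    """Return the cheapest budget whose accuracy is within `tolerance` of the best measured."""
    if not curve:
        raise ValueError("curve must contain at least one measurement")
    best = max(curve.values())
    return min(budget for budget, acc in curve.items() if acc >= best - tolerance)


def is_monotone_rising(curve: Dict[int, float]) -> bool:
    """True if accuracy never drops as the budget grows."""
    accs = [curve[budget] for budget in sorted(curve)]
    return all(a <= b for a, b in zip(accs, accs[1:]))


# Hypothetical measurements: thinking-token budget -> accuracy on a held-out set.
measured = {
    "proof_search": {1024: 0.61, 4096: 0.70, 16384: 0.78},         # still rising
    "distractor_counting": {1024: 0.72, 4096: 0.69, 16384: 0.60},  # inverse scaling
    "code_repair": {1024: 0.80, 4096: 0.845, 16384: 0.85},         # plateau
}

chosen = {task: pick_budget(curve) for task, curve in measured.items()}
for task, budget in chosen.items():
    print(task, budget, "monotone" if is_monotone_rising(measured[task]) else "non-monotone")
```

The tolerance is a cost judgment, not a statistical test. With small evaluation sets, differences of a point or two may be noise, so treat the chosen budget as a starting default to re-measure.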

ARC-AGI v2 is still unbroken¶

As of May 2026, ARC-AGI v2 has not been broken. GPT-5.5 leads at 85%. Most reasoning models score < 30%. ARC-AGI v1 was effectively cracked by GPT-5.2 Pro at > 90%, but through refinement loops, not single-shot reasoning. ARC-AGI v3, with its interactive reasoning format, was slated to launch in early 2026.

Why does this matter? ARC tasks are designed to require abstraction and composition — exactly what we want to call "reasoning." Models that ace AIME, GPQA, and SWE-bench fail ARC. Either:

Our reasoning models do not generalize to the abstraction class ARC measures, or
ARC measures something other than reasoning, or
We need different architectures, not just more compute.

The honest answer: we don't know which.

Open questions that matter¶

For science and for engineering, the live questions in May 2026:

How faithful are visible rationales, on average and at the tail? Tail-risk matters for safety; average matters for product.
What internal circuits support long-range reasoning? Mechanistic interpretability has identified some (induction heads, position-tracking circuits) but not the full picture.
Why do some tasks benefit hugely from extra compute while others actively regress? Inverse scaling shows the relationship is non-monotone; we don't have a predictive theory.
How much of the gain comes from search vs better verification? Best-of-N + verifier sometimes beats Tree-of-Thoughts; we can't predict which from theory.
What is the limit of hidden scratchpads? Models thinking 100K tokens — does that ever generalize, or is it format-overfit?
Can small distilled models inherit deep reasoning robustly? R1-Distill-32B inherits much; the boundary of "much" is unclear.
When does overthinking begin to hurt? Inverse scaling results suggest there is always some budget at which it does, but we lack predictors for where.
Can we train models that are consistently faithful? In Anthropic's experiments, training raised faithfulness but did not saturate the metric.

These are not small questions. They matter for science, product design, safety, cost planning, and your engineering decisions.
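For the search-versus-verification question, it helps to see how little machinery Best-of-N with a verifier needs. The sketch below uses a toy task (find x with x * x == 1369) and a deliberately noisy sampler standing in for the model. `best_of_n`, `score`, and `success_rate` are illustrative names, not any library's API.

```python
import random
from typing import Callable, Tuple

TARGET = 1369  # toy task: find x with x * x == TARGET (the answer is 37)


def score(candidate: int) -> float:
    """Programmatic verifier: 1.0 if the candidate checks out, else 0.0."""
    return 1.0 if candidate * candidate == TARGET else 0.0


def best_of_n(
    sample: Callable[[], int],
    scorer: Callable[[int], float],
    n: int,
) -> Tuple[int, float]:
    """Draw n candidates and return the highest-scoring one with its score."""
    if n < 1:
        raise ValueError("n must be at least 1")
    candidates = [sample() for _ in range(n)]
    best = max(candidates, key=scorer)  # ties resolve to the earliest candidate
    return best, scorer(best)


def success_rate(n: int, trials: int = 2000, seed: int = 0) -> float:
    """Fraction of trials in which at least one of n noisy samples passes the verifier."""
    rng = random.Random(seed)

    def noisy_sampler() -> int:
        # Stand-in for a model: a uniform guess between 30 and 44 inclusive.
        return rng.randint(30, 44)

    solved = sum(best_of_n(noisy_sampler, score, n)[1] == 1.0 for _ in range(trials))
    return solved / trials


for n in (1, 4, 16):
    print(n, success_rate(n))
```

The gain here comes entirely from the verifier being exact. With a noisy or learned scorer, extra samples can also surface confidently wrong candidates, which is part of why the search-versus-verification split is hard to predict from theory.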

Worked example: diminishing returns in extra compute¶

A reasoning system scores 62% with one sampled path. With 4 paths it reaches 74%. With 8 paths it reaches 79%. With 16 paths it reaches 80%.

Path count	Accuracy	Marginal gain	Cost multiplier
1	62%	—	1×
4	74%	+12	4×
8	79%	+5	8×
16	80%	+1	16×
32	80%?	≈0	32×

Extra compute helps, but not forever. Where the wall appears is task-dependent and not well-predicted by theory. Senior engineers price reasoning expecting diminishing returns, not infinite ones.
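The table's arithmetic is easy to automate against your own measurements. The sketch below recomputes the marginal gain per row and the gain per extra sampled path, using the four measured rows from the table (the speculative 32-path row is left out). The function names are illustrative.

```python
from typing import Dict, List, Optional, Tuple

# (path count, accuracy in percent), taken from the table above.
rows: List[Tuple[int, int]] = [(1, 62), (4, 74), (8, 79), (16, 80)]


def marginal_gains(rows: List[Tuple[int, int]]) -> List[Dict[str, float]]:
    """For each step up in path count, report points gained and points per extra path."""
    out = []
    for (prev_paths, prev_acc), (paths, acc) in zip(rows, rows[1:]):
        gain = acc - prev_acc
        extra_paths = paths - prev_paths
        out.append({
            "paths": paths,
            "gain": gain,
            "gain_per_extra_path": gain / extra_paths,
        })
    return out


def first_row_below(rows: List[Tuple[int, int]], threshold: float) -> Optional[int]:
    """Path count of the first step whose marginal gain falls below `threshold` points."""
    for step in marginal_gains(rows):
        if step["gain"] < threshold:
            return int(step["paths"])
    return None


for step in marginal_gains(rows):
    print(step)
print("gain drops below 2 points at", first_row_below(rows, 2), "paths")
```

Going from 1 to 4 paths buys 4 points per extra path. Going from 8 to 16 buys 0.125. Pricing decisions tend to be clearer on the per-path number than on raw accuracy.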

The disciplined attitude¶

Two failure modes to avoid.

Failure 1: dismissing reasoning models as hype. They solve real tasks measurably better than chat models on the right workloads. Cursor, GitHub Copilot, Perplexity, Harvey — all real products built on real lift. Saying "it's just pattern matching, doesn't matter" loses the engineering conversation.

Failure 2: worshipping them as solved intelligence. Faithfulness research, inverse scaling, ARC-AGI failure, and benchmark contamination all say the same thing: these systems are far from understood. Saying "reasoning models reason, debate over" loses the science conversation.

The mature posture: use them where they help, measure them carefully, verify them externally, route them thoughtfully, keep open questions open. Even the backtrack is evidence, not a full explanation. Curious. Practical. Skeptical. Useful.

Where this lives in the wild¶

Mechanistic interpretability research (Anthropic, Google DeepMind, EleutherAI): uses reasoning models as live targets for understanding internal computation. Recent work on induction heads, refusal circuits, and reasoning-specific structures.
AI safety teams (Anthropic, OpenAI, MATS, Apollo Research): faithfulness audits, scheming behaviour studies, CoT monitorability research. Apollo Research's December 2024 "Frontier Models Are Capable of In-context Scheming" used reasoning traces as primary evidence.
ARC Prize Foundation: runs ARC-AGI v2 and v3 leaderboards explicitly to measure abstraction; reasoning models still struggle, exposing a real capability gap.
Open benchmark communities: quickly expose benchmark gaming and contamination (e.g., the SWE-bench Verified gold-patch leakage in late 2025).
Enterprise copilots (legal, medical, financial): must deploy reasoning systems even while the scientific story stays incomplete; rely on external verification, human review, and audit logs rather than trace inspection.

Pause and recall¶

What is the faithfulness rate Anthropic measured for Claude 3.7 Sonnet, and what does it imply for trace inspection?
Name three failure families from the Inverse Scaling paper.
Why does ARC-AGI v2 matter even if your products don't use it?
In the diminishing-returns table, where does marginal gain fall below 2 points?

Interview Q&A¶

Q: "Reasoning models reason." Defend or attack. A: Both. Defend: they show measurable improvement on tasks that require multi-step inference, they recover from intermediate errors, they handle novel problem framings better than chat models. Real benchmark lift on AIME, GPQA, SWE-bench, math olympiad problems, and complex coding. Attack: faithfulness research shows the visible chain omits the hint that actually changed the answer roughly 60-75% of the time, suggesting the explanation is not the mechanism. ARC-AGI v2 (which explicitly tests abstraction) remains unbroken. Inverse scaling shows the relationship between compute and quality is non-monotone, not what we'd expect from a "more thinking = more reasoning" model. Honest answer: the behaviour is real and useful; the mechanism is unsettled. Both can be true.

Common wrong answer to avoid: pick one extreme and defend it as if the other doesn't exist. Senior signal is holding the productive ambiguity — using reasoning models in production while being honest about what's unproven.

Q: A regulator asks "can we trust the reasoning chain as evidence?" Give the senior answer. A: Probably not as primary evidence. Anthropic's April 2025 faithfulness research shows visible CoT does not reliably reflect the model's causal reasoning. So the chain is informative but not authoritative. What is authoritative: the inputs and retrieved context, the tool calls made, the programmatic verifier outputs, the final outputs, and the audit log. For regulatory work, structure your reasoning system so the evidence chain (retrieval → tool → verifier → output) is auditable independently of the model's CoT. Use the CoT as a debugging surface, not a trust surface.

Common wrong answer to avoid: "Yes, we show the chain to auditors" — if the chain is unfaithful, you may be presenting a misleading rationale. Auditable structure beats trustable-looking chains.
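A sketch of what "auditable independently of the model's CoT" might look like in practice: an audit record whose evidence digest covers inputs, retrieved context, tool calls, verifier outputs, and the final output, while the chain is stored only as a debugging field. The `AuditRecord` shape and every field name are assumptions for illustration, not a regulatory standard.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, replace
from typing import Any, Dict, List, Optional


@dataclass(frozen=True)
class AuditRecord:
    request_id: str
    user_input: str
    retrieved_doc_ids: List[str]
    tool_calls: List[Dict[str, Any]]
    verifier_results: List[Dict[str, Any]]
    final_output: str
    cot_debug: Optional[str] = None  # debugging surface only, never part of the evidence

    def evidence(self) -> Dict[str, Any]:
        """Everything an auditor should rely on: the record minus the chain of thought."""
        data = asdict(self)
        data.pop("cot_debug")
        return data

    def digest(self) -> str:
        """SHA-256 over a canonical JSON encoding of the evidence fields."""
        canonical = json.dumps(self.evidence(), sort_keys=True, separators=(",", ":"))
        return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Hypothetical record, for illustration only.
record = AuditRecord(
    request_id="req-001",
    user_input="Is clause 7 enforceable under the supplied contract?",
    retrieved_doc_ids=["contract-v3#clause-7"],
    tool_calls=[{"tool": "clause_lookup", "args": {"clause": 7}, "status": "ok"}],
    verifier_results=[{"check": "citation_exists", "passed": True}],
    final_output="Clause 7 appears enforceable; see contract-v3#clause-7.",
    cot_debug="First I considered clause 6, then backtracked to clause 7.",
)

# Swapping or dropping the chain leaves the evidence digest unchanged.
same_evidence = replace(record, cot_debug=None)
print(record.digest() == same_evidence.digest())
```

The design choice is that the digest cannot be influenced by the rationale. If an auditor can reproduce the digest from the logged retrieval, tool, and verifier data, the conclusion stands or falls on that chain, whatever the model narrated.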

Q: Why doesn't more compute always help reasoning? A: Anthropic's inverse-scaling paper documented five failure families. Distractibility: longer chains drag in irrelevant context. Framing overfit: extended thinking can latch onto surface framings, especially in o-series. Spurious correlation drift: long chains drift into plausible-but-wrong analogies. Deduction collapse: past a chain length, multi-step deduction degrades. Behavioural drift: Sonnet 4 showed increased self-preservation language under long CoT. Mechanistically we don't fully understand any of these, but operationally the lesson is clear: scale test-time compute selectively, on tasks where you have evidence the curve is still rising. Default-to-max-effort is a bug, not a feature.

Common wrong answer to avoid: "More compute always helps, just need the right prompt" — that contradicts published evidence. Senior loops want to see you've read the literature, not assumed marketing.

Q: Should engineering teams invest in interpretability research, or just consume it from labs? A: Mostly consume, sparingly invest. Frontier interpretability (Anthropic, DeepMind, EleutherAI) requires deep model access and specialised tooling — hard to replicate in product teams. What is worth investing in: interpretability-adjacent monitoring — automated checks for CoT-output consistency, perturbation tests on production traffic, faithfulness audits on sampled requests. These are cheap, give you a directional signal, and don't require circuit-level work. Treat published interpretability research as a knowledge feed; build internal monitoring that operationalises its findings.

Common wrong answer to avoid: "Interpretability is a research luxury, not a product concern" — the faithfulness, scheming, and inverse-scaling findings have direct operational implications. Ignoring them is a safety and reliability risk.

Apply now (5 min)¶

Write down one belief you currently hold about reasoning models — for example, "reasoning models are better than chat models for any complex task" or "more compute always helps." Mark whether your belief is evidence-backed, inferred, or hope. Then list one experiment you would run from your own product data to challenge that belief. The discipline is in the marking, not the answering.

Sketch from memory: Draw the diminishing-returns table from this chapter and annotate which row your production system currently operates at.

Bridge. We leave reasoning with useful humility. Next we move from text-only reasoning to models that see, generate, and reason about pictures and video. → 00-eli5.md