
03. Why CoT Works — Intermediate state is the real superpower

~10 min read. CoT is not "longer = smarter." It is the model writing its working memory to the output so the next tokens can read it.

Built on the ELI5 in 00-eli5.md. The thinking pause (intermediate state externalised into the output stream) explains the structural reason CoT lifts accuracy on some tasks and wastes tokens on others.


The whiteboard picture

A transformer doing next-token prediction has fixed working memory: the residual stream at the current position, conditioned on every previous token's residual. That's it. There is no separate "scratch register." If the model wants to remember a partial result across many steps, the only place to store it is in the tokens it has already emitted.

without CoT:
   prompt ───────────────────────────────→ answer
            (model must compute everything in one forward pass)

with CoT:
   prompt → subgoal 1 → state → subgoal 2 → state → answer
                  ▲              ▲              ▲
                  └──── each step reads the previous step's tokens

That is the thinking pause in mechanical terms. The model is literally using its own output as scratch memory. Each new token attends back to every prior token, so the partial result you wrote three sentences ago is still in scope for the answer at the end.

A hard task usually breaks when state disappears. CoT prevents that disappearance by making state explicit. The model does not "think harder" — it simply gets more positions to compute through, and each position can read every previous one.
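The "output as scratch memory" idea can be sketched with a toy loop. This is only an analogy, not a transformer: `step` and `generate` are hypothetical names, and the "model" is a stateless function that adds one number per call. What it illustrates is that nothing carries over between calls except the tokens already written.

```python
def step(context: list[str]) -> str:
    """Stateless toy 'model': all it knows is what is written in `context`."""
    numbers = [int(t) for t in context if t.isdigit()]  # the prompt tokens
    totals = [int(t.split("=")[1]) for t in context if t.startswith("total=")]
    done = len(totals)                      # numbers already folded in
    previous = totals[-1] if totals else 0  # partial result read back from tokens
    if done == len(numbers):
        return f"answer={previous}"
    return f"total={previous + numbers[done]}"


def generate(prompt: list[str], max_tokens: int = 50) -> list[str]:
    """Autoregressive loop: each emitted token is appended and re-read."""
    context = list(prompt)
    for _ in range(max_tokens):
        token = step(context)  # no variable survives between calls except `context`
        context.append(token)
        if token.startswith("answer="):
            break
    return context


print(generate(["3", "4", "5"]))
```

Delete the `total=` tokens from the context and `step` has no way to know how far it got, which is the "state disappears" failure described above.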


Three structural reasons CoT helps

1. Serial-computation depth. A transformer with L layers can do roughly L sequential reasoning steps inside one forward pass. With CoT, the model can do on the order of L × T steps, where T is the number of intermediate tokens. Feng et al. (2023) and Merrill & Sabharwal (2024) made this formal: under standard complexity-theoretic assumptions, CoT strictly expands the class of computations a transformer can express. For tasks requiring deeper composition than L allows, CoT is not an optimization; it is what makes the computation possible at all.

2. State externalization. A partial result computed at step 2 is hard to retrieve from layer activations at step 4. But if step 2 wrote subtotal = 2400 as tokens, attention at step 4 can re-read it cleanly. The model trades latent computation for explicit working memory.

3. Commitment delay. Without CoT, the model commits to a final-answer trajectory immediately. With CoT, early tokens shape but don't lock the final answer. This is why "draft then check" templates lift accuracy: the model gets a chance to notice its own early mistake before it has to commit.

So what to remember? CoT helps because it gives the model more steps and visible state — not because long output is magically wiser.
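The depth argument in reason 1 can be made concrete with a toy budget model. This is a deliberately crude analogy, not a claim about real transformer internals: `single_pass`, `with_scratchpad`, and `depth_budget` are hypothetical names, with `depth_budget` standing in for L and each written value standing in for an intermediate token.

```python
from typing import Callable, Optional

Op = Callable[[int], int]

# A chain of five dependent operations: each needs the previous result.
OPS: list[Op] = [
    lambda v: v + 1,
    lambda v: v * 2,
    lambda v: v - 3,
    lambda v: v * 5,
    lambda v: v + 7,
]


def single_pass(ops: list[Op], x: int, depth_budget: int) -> Optional[int]:
    """Apply the dependent ops in order, but only if they fit in one pass."""
    if len(ops) > depth_budget:
        return None  # toy stand-in for "chain too deep for one forward pass"
    for op in ops:
        x = op(x)
    return x


def with_scratchpad(ops: list[Op], x: int, depth_budget: int) -> tuple[int, list[int]]:
    """Run the same chain in chunks, writing the intermediate value after each."""
    written: list[int] = []
    for start in range(0, len(ops), depth_budget):
        chunk = ops[start:start + depth_budget]
        for op in chunk:          # each chunk fits inside the budget
            x = op(x)
        written.append(x)         # externalised state, readable by the next chunk
    return x, written


print(single_pass(OPS, 2, depth_budget=2))      # None: five steps, budget of two
print(with_scratchpad(OPS, 2, depth_budget=2))  # (22, [6, 15, 22])
```

The chain is identical in both calls; only the number of places to write an intermediate value changes.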


Worked example: profit after returns and shipping

A shop sells 80 items at ₹500 each. 10% are returned. Each successful sale has ₹60 shipping. Product cost is ₹280 per shipped item. Profit?

A fast model without CoT typically writes: "Revenue 40000, costs 22400, profit 17600." That treats all 80 items as sold (80 × 500 revenue, 80 × 280 cost): it ignores the returns and drops shipping entirely.

CoT forces the state through each transformation:

Step 1: Gross revenue     = 80 × 500       = 40000
Step 2: Returned items    = 80 × 0.10      = 8
Step 3: Successful sales  = 80 - 8         = 72
Step 4: Net revenue       = 72 × 500       = 36000  ← state changed
Step 5: Shipping cost     = 72 × 60        = 4320
Step 6: Product cost      = 72 × 280       = 20160
Step 7: Total cost        = 4320 + 20160   = 24480
Step 8: Profit            = 36000 - 24480  = 11520
Answer: 11520

Notice step 3 → step 4. The model had to write successful sales = 72 at step 3, and that token conditioned every multiplication after it. Without that explicit token, attention at step 5 would have a harder time recovering whether to multiply by 80 or 72.

The benefit is not the eloquent prose. The benefit is 72 written down as a token that step 5 can attend to.
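The eight steps can be checked mechanically. The sketch below recomputes them with the numbers from the problem statement; the function names are illustrative, and `shortcut_profit` reproduces the no-CoT answer of 17600 for comparison.

```python
def profit_steps(items: int = 80, price: int = 500, return_rate: float = 0.10,
                 shipping: int = 60, unit_cost: int = 280) -> dict[str, int]:
    """Recompute the worked example, naming every intermediate state."""
    gross_revenue = items * price
    returned = round(items * return_rate)
    successful = items - returned           # the state every later step depends on
    net_revenue = successful * price
    shipping_cost = successful * shipping
    product_cost = successful * unit_cost
    total_cost = shipping_cost + product_cost
    return {
        "gross_revenue": gross_revenue,
        "returned": returned,
        "successful": successful,
        "net_revenue": net_revenue,
        "shipping_cost": shipping_cost,
        "product_cost": product_cost,
        "total_cost": total_cost,
        "profit": net_revenue - total_cost,
    }


def shortcut_profit(items: int = 80, price: int = 500, unit_cost: int = 280) -> int:
    """The one-shot mistake: every quantity uses 80, returns and shipping vanish."""
    return items * price - items * unit_cost


print(profit_steps()["profit"])  # 11520
print(shortcut_profit())         # 17600
```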


When CoT actively hurts

Anthropic's July 2025 Inverse Scaling in Test-Time Compute paper documented five families where longer reasoning degrades performance:

Failure family                  Mechanism
Distraction                     Long CoT introduces irrelevant facts the model later treats as load-bearing
Framing overfit                 OpenAI o-series in particular overfits to surface framings during extended thinking
Spurious correlation drift      Long chains drift into plausible-but-irrelevant analogies
Deduction collapse              Tracking multi-step deductions actually degrades past a certain CoT length
Self-preservation expressions   Claude Sonnet 4 showed increased self-preservation language under extended CoT
So "always add CoT" is wrong. The right rule: CoT helps when the task has dependent steps that need explicit state. CoT hurts when the task is shallow, when the extra tokens dilute the signal, or when the model has been RL-trained to plan internally already.

Visible steps can also anchor the model. Once it has written "Assume the customer is in the US" as step 1, every later token stays consistent with that assumption even if it was wrong. That is why the backtrack matters — but the backtrack requires a verifier, not just more CoT tokens.


How to use CoT well

Ask for subproblems, not drama. Ask the model to name assumptions explicitly so a verifier can check them. Ask it to track units. Ask it to mark the final answer with a delimiter so downstream code can parse cleanly.
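As one way to make the delimiter advice concrete, here is a minimal parser. It assumes the model was told to end with a line of the form `Answer: <number>`, as in the worked example; the pattern and the name `extract_answer` are illustrative, not a standard.

```python
import re
from typing import Optional

# Matches a whole line such as "Answer: 11520" or "Answer: -3.5".
ANSWER_RE = re.compile(r"^Answer:[ \t]*(-?\d+(?:\.\d+)?)[ \t]*$", re.MULTILINE)


def extract_answer(completion: str) -> Optional[float]:
    """Return the last delimited answer, or None if the model never wrote one."""
    matches = ANSWER_RE.findall(completion)
    return float(matches[-1]) if matches else None


completion = "Step 8: Profit = 36000 - 24480 = 11520\nAnswer: 11520"
print(extract_answer(completion))  # 11520.0
```

Taking the last match is a design choice: if the chain restates or revises the answer, the final line wins. Returning `None` instead of guessing lets the caller treat a missing delimiter as a failure.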

For repeated workflows, show one or two demonstrations aligned to your real task — drift on style is the #1 few-shot bug. For one-off tasks, zero-shot beats poorly-chosen few-shot.

For native reasoning models (o3, GPT-5 thinking, Claude 4.5+ extended thinking, Gemini 2.5/3 thinking), do not add "think step by step." The model already plans internally. Give task instructions in plain prose and let the trained policy choose the reasoning shape.
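In a prompt-building layer this rule reduces to a single flag. The sketch below is hypothetical: `build_prompt` and `native_reasoning` are made-up names, and deciding which deployed models count as native reasoners is left to the caller.

```python
COT_SUFFIX = "Let's reason step by step. Track units. Answer on a new line."


def build_prompt(task: str, native_reasoning: bool) -> str:
    """Append the CoT nudge only for models that do not already deliberate."""
    task = task.strip()
    if native_reasoning:
        return task  # plain task instructions; the trained policy picks the shape
    return f"{task}\n\n{COT_SUFFIX}"


print(build_prompt("Compute the profit after returns and shipping.", native_reasoning=False))
```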

And remember the subtle point. If a model hides the scratchpad but still preserves state internally, the value of CoT remains even when you can no longer see it. That is what o3 and Gemini 2.5 do. Which leads to the next file.


Where this lives in the wild

  • Excel Copilot — finance formulas — when users ask "explain the variance," Copilot decomposes into named intermediate columns; the visible decomposition is also the explanation surface.
  • Claude Projects — policy analysis at law firms — IRAC-style few-shot CoT (Issue, Rule, Application, Conclusion) keeps Sonnet 4.6 base from collapsing case-law citations into a single sentence.
  • GitHub Copilot PR explanations — explicit decomposition into "cause → impact → fix" sections; reduces hallucinated-rationale rate measurably vs flat summaries.
  • Khan Academy Khanmigo — stepwise CoT prompting on a base GPT-4.1 family model; the step structure is also the pedagogy, so even faithfulness gaps don't break the product.
  • Notion AI — task breakdown — uses CoT to expand a goal into constraints and subgoals; the intermediate tokens become the user-visible plan, so CoT cost and product surface align.

Pause and recall

  1. Why does CoT structurally let a transformer compute deeper compositions than a single forward pass?
  2. In the profit example, what state-change token at step 3 makes step 5 produce the right multiplication?
  3. Name three families from the Inverse Scaling paper where longer CoT hurts.
  4. Why should you not add "think step by step" to o3 or Claude extended thinking prompts?

Interview Q&A

Q: Skeptic asks "isn't CoT just longer output?" Give the real argument. A: It is not about length, it is about serial computation depth. A transformer with L layers can do at most L sequential dependent steps in one forward pass. CoT lets the model do L × T steps where T is the intermediate token count, by writing partial results that subsequent tokens attend to. Merrill & Sabharwal (2024) proved CoT strictly expands the complexity class transformers can express. So CoT is not a UX trick — it changes what the architecture can compute.

Common wrong answer to avoid: "CoT works because longer responses are more correct on average" — this is correlation. Length without dependent steps is just verbose, and Anthropic's 2025 Inverse Scaling paper shows it can actively hurt.

Q: Your model with CoT prompting writes a beautiful 12-step chain and still produces the wrong answer. What's happening? A: Three likely causes. First, anchoring: the model wrote a wrong assumption in step 1 and every later step stayed locally consistent with it. Fix: ask the model to enumerate assumptions and verify them against tools or retrieved evidence. Second, unfaithful CoT: Anthropic research from April 2025 found that Claude 3.7 Sonnet's scratchpad mentioned the hints that changed its answer only about 25% of the time, so the visible chain is not always the causal path. Fix: trust outputs and evidence, not eloquence. Third, no verifier: CoT generates options, it does not check them. Add a verifier model, a tool call, or a programmatic schema check on the final answer.

Common wrong answer to avoid: "Add more CoT steps" — if the chain is anchored to a wrong assumption, more steps just deepen the commitment. The fix is verification, not more tokens.
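A programmatic check can be very small. The sketch below assumes chain lines shaped like the worked example (`... = a × b = c`) and recomputes each stated result; `check_chain` is a hypothetical helper. It catches arithmetic slips only: a chain anchored on the wrong quantity (80 instead of 72) with correct arithmetic still needs an assumption-level verifier.

```python
import re
from operator import add, mul, sub, truediv

# Captures "= a <op> b = claimed" anywhere in a line.
STEP_RE = re.compile(
    r"=\s*(\d+(?:\.\d+)?)\s*([×x*+\-/])\s*(\d+(?:\.\d+)?)\s*=\s*(\d+(?:\.\d+)?)"
)
OPS = {"×": mul, "x": mul, "*": mul, "+": add, "-": sub, "/": truediv}


def check_chain(chain: str, tol: float = 1e-6) -> list[str]:
    """Return the lines whose stated result disagrees with the recomputed one."""
    bad = []
    for line in chain.splitlines():
        m = STEP_RE.search(line)
        if not m:
            continue  # not an arithmetic step; nothing to verify here
        a, op, b, claimed = m.groups()
        try:
            actual = OPS[op](float(a), float(b))
        except ZeroDivisionError:
            bad.append(line.strip())
            continue
        if abs(actual - float(claimed)) > tol:
            bad.append(line.strip())
    return bad


chain = "Step 3: Successful sales = 80 - 8 = 72\nStep 4: Net revenue = 72 × 500 = 40000"
print(check_chain(chain))  # flags step 4: 72 × 500 is 36000, not 40000
```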

Q: When should the explanation in the output be different from the chain the model used internally? A: When user trust depends on a clean rationale but raw reasoning would include false branches or sensitive details. The pattern in production: run reasoning internally (hidden CoT or extended thinking) and produce a short post-hoc rationale keyed to evidence (citations, tool outputs, schema). The internal reasoning is for the model, the visible rationale is for the user — and the user-visible version is grounded in facts the verifier saw, not in the model's stream of attempts. This is how Perplexity Deep Research and Harvey AI present their work.

Common wrong answer to avoid: "Always show the full chain for transparency" — raw chains are often unfaithful, can contain unstable intermediate guesses, and overwhelm users. A grounded post-hoc summary plus citations is more honest than dumping the scratchpad.

Q: Why does CoT help on math but hurt on simple classification? A: Math has dependent serial computation — step k+1 needs the result of step k. CoT externalizes that state. Classification ("is this email spam?") usually has a single shallow decision; the extra CoT tokens introduce framing drift and distraction (per Anthropic's Inverse Scaling work). The rule: CoT helps in proportion to the task's serial dependence. Shallow tasks see no lift and sometimes regression.

Common wrong answer to avoid: "CoT always helps for accuracy, classification is just an exception" — there's no general "CoT helps" law. The lift is task-shape-dependent and on saturated easy tasks CoT can introduce noise.


Apply now (5 min)

Pick one task in your domain. Ask: does step k+1 need the output of step k? If yes, CoT will probably help. Run it on 10 examples with and without "Let's reason step by step. Track units. Answer on a new line." Record accuracy and token count. If the accuracy gain is worth the extra tokens at your volume and latency budget, ship it. If not, the task is either shallow or needs a tool, not more prose.
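The with/without comparison can be run through a small harness. This is a sketch under assumptions: `ask` is a placeholder you supply that wraps your own model client and returns the parsed final answer plus tokens used, and `compare_cot` is a hypothetical name. No real API is implied.

```python
from typing import Callable

Ask = Callable[[str], tuple[str, int]]  # prompt -> (final_answer, tokens_used)


def compare_cot(examples: list[tuple[str, str]], ask: Ask, suffix: str) -> dict:
    """Run every example with and without the CoT suffix and summarise both arms."""
    if not examples:
        raise ValueError("need at least one (question, expected_answer) pair")

    def run(make_prompt: Callable[[str], str]) -> dict:
        correct = tokens = 0
        for question, expected in examples:
            answer, used = ask(make_prompt(question))
            correct += int(answer.strip() == expected)
            tokens += used
        n = len(examples)
        return {"accuracy": correct / n, "avg_tokens": tokens / n}

    plain = run(lambda q: q)
    cot = run(lambda q: f"{q}\n\n{suffix}")
    return {
        "plain": plain,
        "cot": cot,
        "accuracy_gain": cot["accuracy"] - plain["accuracy"],
        "extra_tokens": cot["avg_tokens"] - plain["avg_tokens"],
    }
```

With only 10 examples the accuracy numbers are noisy, so treat a one-example difference as a hint rather than a result.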

Sketch from memory: Draw the "L layers in one pass vs L × T positions with CoT" picture. Annotate where state externalisation happens.


Bridge. Prompting wins are real but fragile. The next jump is models actually trained to deliberate — o-series, extended thinking, Gemini thinking — where the pause is native and the API charges for hidden tokens. → 04-reasoning-model-architectures.md