04. Reasoning Model Architectures: From prompt tricks to RL-trained deliberation
~12 min read. Prompting asks for a pause. Architecture makes the pause native, trainable, and billable. Three families, three knobs, three pricing models.
Built on the ELI5 in 00-eli5.md. The time budget (compute spent before the visible answer) becomes a trained system behaviour you control through API parameters, not just prompt phrasing.
The system picture
A basic chat model runs one forward pass and emits tokens. A reasoning model still has a decoder at heart, but it has been RL-trained to spend many tokens before the user-visible answer. The system around the model also changes — there's a reasoning-tokens budget, a planner, sometimes a verifier, sometimes parallel candidates.
    chat model:
      prompt ──→ decoder pass ──→ visible answer

    reasoning model:
      prompt ──→ decoder + RL-trained planner ──→ hidden/visible thinking ──→ verify ──→ answer
                             │                             │                  │
                         reasoning                     reasoning           optional
                          tokens                        tokens             selector
That is the time budget in operation. Prompting says "please think." Architecture says "this system is built to think, and we charge by the reasoning token."
Three families to know cold
OpenAI: o-series and GPT-5 thinking tier. RL-on-chains training. The model emits hidden reasoning tokens you don't see but pay for at the output rate. Knob: reasoning_effort (passed as reasoning={"effort": ...} in the Responses API) with five levels (minimal, low, medium, high, xhigh), plus none on models that allow reasoning to be switched off. Defaults are model-specific (GPT-5.1 = none, GPT-5.5 = medium, GPT-5-pro = high). Hidden reasoning tokens count toward your context window.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
resp = client.responses.create(
    model="o3",
    input="Plan a 3-step migration from MySQL 5.7 to Postgres 16 for a 2TB OLTP db.",
    reasoning={"effort": "high"},
)
# resp.usage.output_tokens_details.reasoning_tokens → hidden reasoning tokens, e.g. 3,847
# resp.usage.output_tokens → reasoning + visible answer tokens, all billed at the output rate
Anthropic: extended thinking. RL-trained with visible thinking blocks. Up to and including Sonnet 4.5 the knob is thinking.budget_tokens (minimum 1024, treated as a target rather than a strict cap). From Opus 4.6 / Sonnet 4.6 onward, the new knob is effort (low, medium, high, max; max is available only on Opus 4.6/4.7 and Mythos Preview). Thinking blocks are returned in the response, and depending on safety policy they may arrive summarized rather than in full.
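from anthropic import Anthropic

client = Anthropic()  # reads ANTHROPIC_API_KEY from the environment
# Sonnet 4.5 legacy: explicit token budget; max_tokens must exceed budget_tokens
msg = client.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=16000,
    thinking={"type": "enabled", "budget_tokens": 8000},
    messages=[{"role": "user", "content": "Refactor this 400-line module..."}],
)
# Sonnet 4.6 / Opus 4.6+ new API: an effort level instead of a token budget.
# The request shape below is as given in this guide; check the current Messages API
# reference for where effort is passed before relying on it.
msg = client.messages.create(
    model="claude-opus-4-7",
    max_tokens=64000,
    thinking={"type": "enabled", "effort": "high"},
    messages=[...],
)
# msg.content is a list of blocks; thinking blocks and text blocks arrive separately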
Google: Gemini thinking. Gemini 2.5 Pro thinking is always on (cannot disable). On Gemini 2.5 the knob is thinkingConfig.thinkingBudget (−1 = dynamic, 0 = off where supported; the allowed range is model-specific, for example 0–24576 on 2.5 Flash and 128–32768 on 2.5 Pro). On Gemini 3 the API changed to thinkingLevel: "LOW" | "HIGH". You cannot combine it with thinkingBudget; a request that sets both returns 400.
from google import genai
client = genai.Client()
# Gemini 3
resp = client.models.generate_content(
    model="gemini-3-pro",
    contents="Audit this access-control policy for principle-of-least-privilege violations.",
    config={"thinking_config": {"thinking_level": "HIGH"}},
)
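The rule that the two Gemini knobs cannot be combined can be enforced client-side before a request goes out. The following is a minimal sketch, not part of the google-genai SDK: `build_thinking_config` is a hypothetical helper, and the 0–24576 range is the one quoted in the text (actual limits vary by model).

```python
from typing import Optional


def build_thinking_config(
    thinking_budget: Optional[int] = None,
    thinking_level: Optional[str] = None,
) -> dict:
    """Build a thinking_config dict, rejecting combinations the API is said to refuse.

    Hypothetical helper for illustration; not part of the google-genai SDK.
    """
    if thinking_budget is not None and thinking_level is not None:
        # Per the text, sending both knobs in one request returns HTTP 400.
        raise ValueError("set thinking_budget (Gemini 2.5) or thinking_level (Gemini 3), not both")
    if thinking_level is not None:
        if thinking_level not in ("LOW", "HIGH"):
            raise ValueError(f"unknown thinking_level: {thinking_level!r}")
        return {"thinking_level": thinking_level}
    if thinking_budget is not None:
        # -1 requests dynamic thinking; other values must sit in the quoted range.
        if thinking_budget != -1 and not 0 <= thinking_budget <= 24576:
            raise ValueError(f"thinking_budget out of range: {thinking_budget}")
        return {"thinking_budget": thinking_budget}
    return {}  # no knob set: leave the model's default behaviour untouched
```

The result would slot into the request as `config={"thinking_config": build_thinking_config(thinking_level="HIGH")}`, so a misconfigured call fails locally instead of costing a round trip.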
xAI: Grok 4. The knob is reasoning, with levels none | low | medium | high (default low). The response surfaces a reasoning_details array, so the chain is visible. Grok 4 Heavy runs parallel sampled paths and selects among them.
That is the 2026 lineup. The knob name differs. The billing pattern is the same: reasoning tokens are billed at the output rate, whether they stay hidden (OpenAI) or come back as visible blocks (Anthropic, xAI, Google).
What trained reasoning actually adds
Prompt-CoT depends on the user asking nicely. Trained reasoning makes deliberation a default behaviour learned through RL. The model has seen a reward signal over whole chains. The DeepSeek-R1 paper (arXiv 2501.12948) does this with Group Relative Policy Optimization (GRPO), a PPO variant that drops the critic and computes each sample's advantage from its reward relative to the other samples in its group.
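The group-relative advantage is the piece that replaces the critic, and it is small enough to show directly. This sketch covers only that normalization step, assuming the common mean/standard-deviation form; the full GRPO objective also has a clipped policy ratio and a KL term, which are omitted here. The function name and the reward values are illustrative.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Score each sampled completion against the other samples for the same prompt.

    No learned value function is needed: the group mean acts as the baseline.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the group (an assumption of this sketch)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled answers to one prompt, scored 1.0 if correct else 0.0 (made-up rewards).
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
# Correct answers get a positive advantage, wrong ones a negative advantage.
```

When every sample in a group earns the same reward, all advantages are zero and that prompt contributes no gradient, which is one reason reward design and prompt difficulty matter so much in this setup.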
The result: the model learns to keep hidden state longer, to prefer checked chains over flashy ones, and to allocate budget in proportion to problem hardness. R1-Zero (no SFT, pure RL) showed emergent "aha moments": AIME 2024 accuracy jumped from 15.6% to 77.9% through RL training alone. That settled a question: frontier-class reasoning does not need proprietary SFT data, just a well-designed reward.
So the change is not one magic layer. It is:
- RL on chains, not next-token loss
- A trained budget-allocation policy
- Sometimes an external or internal verifier
- An API parameter for the engineer to dial effort
Worked example: when does the extra budget pay?
Take one ambiguous coding task. Run it twice on Claude Opus 4.7 with different effort levels. Approximate 2026 numbers:
| Setting | Reasoning tokens | Output tokens | Latency P50 | Cost ($/req) | SWE-bench-class pass |
|---|---|---|---|---|---|
| effort="low" | ~2,000 | ~800 | 8 s | $0.07 | 78% |
| effort="medium" | ~6,000 | ~800 | 18 s | $0.17 | 84% |
| effort="high" | ~16,000 | ~800 | 28 s | $0.42 | 87% |
| effort="max" | ~48,000 | ~800 | 65 s | $1.22 | 87.6% |
Notice the diminishing curve. high → max tripled cost for 0.6 points. So the answer to "more budget?" is task-dependent. For a CI bot reviewing PRs in batch, max may earn its keep. For an interactive editor, medium is the sweet spot. For a refactor agent that runs five minutes in the background, high.
Look. Architecture gives you the knob. Engineering decides where to spend it. The cheapest tier that clears your quality bar wins.
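The diminishing curve is easier to argue about as dollars per extra point. The sketch below reproduces the table's cost column and computes the marginal cost of each step up. One assumption is loud here: the $25 per million output tokens rate is inferred from the table's own numbers (it makes all four rows match), not a quoted price, and input-token cost is ignored.

```python
# (reasoning_tokens, visible_output_tokens, pass_rate_percent) per effort level, from the table.
TABLE = {
    "low": (2_000, 800, 78.0),
    "medium": (6_000, 800, 84.0),
    "high": (16_000, 800, 87.0),
    "max": (48_000, 800, 87.6),
}
# Assumption: inferred from the table's cost column, not an official price.
OUTPUT_RATE_PER_MILLION = 25.0


def request_cost(reasoning_tokens: int, output_tokens: int, rate_per_million: float) -> float:
    """Reasoning and visible tokens are billed together at the output rate."""
    return (reasoning_tokens + output_tokens) * rate_per_million / 1_000_000


def marginal_cost_per_point(table: dict, rate: float) -> dict:
    """Extra dollars per extra percentage point when stepping up one effort level."""
    levels = list(table)
    result = {}
    for prev, cur in zip(levels, levels[1:]):
        extra_cost = request_cost(*table[cur][:2], rate) - request_cost(*table[prev][:2], rate)
        extra_points = table[cur][2] - table[prev][2]
        result[f"{prev}->{cur}"] = extra_cost / extra_points
    return result


steps = marginal_cost_per_point(TABLE, OUTPUT_RATE_PER_MILLION)
```

On these numbers each extra point costs roughly $0.017 going from low to medium, $0.083 from medium to high, and $1.33 from high to max, which is the quantitative form of "the cheapest tier that clears your quality bar wins."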
Design lessons for engineers
Stop asking "which model is smartest?" Start asking "which reasoning behaviour does my task need?"
- Deep internal thought? Use o3/Opus 4.7/Gemini 3 thinking with high effort.
- Candidate generation? Either use Grok 4 Heavy (parallel sampling built in) or implement Best-of-N yourself with a cheaper model.
- Verification? Add a separate verifier call (cheaper model, or a tool, or a unit test). Reasoning models do not automatically verify themselves.
- Dynamic budgets? Look at Anthropic's effort parameter or Gemini's dynamic thinkingBudget = -1. The model decides how hard to think per request.
- Stateful flows? OpenAI exposes encrypted reasoning items via the Responses API so you can pass hidden CoT across turns without paying twice.
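The candidate-generation and verification bullets above combine naturally into a do-it-yourself Best-of-N loop. This is a minimal sketch under stated assumptions: `generate` stands in for a call to a cheaper model, `verify` for an independent check such as a unit test, and both stubs below are hypothetical placeholders rather than real API calls.

```python
from typing import Callable, List, Optional


def best_of_n(
    generate: Callable[[str], str],
    verify: Callable[[str], float],
    prompt: str,
    n: int = 4,
    min_score: float = 0.5,
) -> Optional[str]:
    """Sample n candidates, score each with an independent verifier, return the best.

    Returns None when no candidate clears min_score, so the caller can escalate
    to a higher effort level or a stronger model.
    """
    candidates: List[str] = [generate(prompt) for _ in range(n)]
    scored = [(verify(c), c) for c in candidates]
    best_score, best = max(scored, key=lambda pair: pair[0])
    return best if best_score >= min_score else None


# Stub generator: replays canned answers in place of a cheap model call.
_canned = iter(["def add(a, b): return a - b", "def add(a, b): return a + b"])


def fake_generate(prompt: str) -> str:
    return next(_canned)


def unit_test_verifier(candidate: str) -> float:
    """Run the candidate against one test case; 1.0 on pass, 0.0 on fail."""
    namespace: dict = {}
    exec(candidate, namespace)  # local demo only; sandbox real model output before running it
    return 1.0 if namespace["add"](2, 3) == 5 else 0.0


winner = best_of_n(fake_generate, unit_test_verifier, "write add(a, b)", n=2)
```

The verifier is deliberately separate from the generator: the selection step only adds quality if the check is something the sampled model cannot talk its way past.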
And one warning. A bigger architecture does not save bad routing. If you spend the high budget on trivial tasks you lose money. If you skip verification on fragile tasks you still lose quality. Architecture is the lever, not the answer.
Where this lives in the wild
- GitHub Copilot Auto-mode — task-aware routing; complex multi-step reasoning tasks go to o3 / o1; coding agents on github.com run on Claude Sonnet 4.x; the user never picks the model.
- Cursor Auto mode — Opus 4.7 for architecture and large refactors, GPT-5.5 default for general edits, DeepSeek V4 Pro as cheap reasoning fallback. Routes per task, not per user.
- Perplexity Computer (May 2026) — orchestrator is Claude Opus 4.6; sub-agents (Gemini for deep research, others for code) are dispatched by the orchestrator's reasoning trace.
- Harvey AI (legal) — cascading pipeline: custom case-law retriever → RAG over firm corpus → o1-class reasoning orchestrator → tool-grounded citation verifier. Raised $200M at $11B valuation in March 2026 on the strength of this architecture.
- OpenAI Codex (the cloud agent product, not the old completion API) — uses encrypted reasoning items via Responses API to maintain context across long async runs without re-paying for hidden CoT.
Pause and recall
- Name the four API knobs for reasoning effort across OpenAI, Anthropic, Google, and xAI. Which one is not a budget number?
- What is GRPO and what did DeepSeek-R1-Zero prove about it?
- In the Opus 4.7 budget table, where do diminishing returns kick in hardest?
- Why are reasoning tokens billed at the output rate even when hidden?
Interview Q&A
Q: A teammate proposes "let's standardise on reasoning_effort=high everywhere." What's your push-back?
A: Cost and latency. From the Opus 4.7 table, high is ~6× the cost of low for a 9-point lift; max is ~17× for an extra 0.6 points. Combined with a 5–60× TTFT inflation, blanket high breaks any interactive P95 and triples your inference bill. The right design is to route effort by task: shallow tasks → low or no thinking, multi-step tasks → medium, multi-file or compliance-critical → high, batch overnight → max. Show your cascade on a whiteboard with cost numbers.
Common wrong answer to avoid: "Just use high, quality is everything" — quality without budget discipline kills products. Senior loops expect you to defend choices with $ and ms.
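The cascade in that answer can be written down as a small routing function, which is also a convenient thing to put on the whiteboard. The sketch below is illustrative only: the `Task` fields and the thresholds are assumptions, not tuned values, and a production router would likely use a classifier or cheap model call to fill them in.

```python
from dataclasses import dataclass


@dataclass
class Task:
    steps: int = 1                     # estimated number of reasoning steps
    files_touched: int = 1             # rough proxy for multi-file scope
    compliance_critical: bool = False  # errors carry legal or audit risk
    batch_overnight: bool = False      # nobody is waiting on the response


def route_effort(task: Task) -> str:
    """Map task features to an effort level, most expensive conditions first."""
    if task.batch_overnight:
        return "max"      # latency is free, so buy the last fraction of a point
    if task.compliance_critical or task.files_touched > 1:
        return "high"
    if task.steps > 1:
        return "medium"
    return "low"          # shallow task: minimal or no thinking
```

For example, `route_effort(Task(steps=3))` returns `"medium"`, while a single-step lookup falls through to `"low"`. Pairing each branch with its cost per request turns the routing table into the budget argument the interviewer is looking for.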
Q: Why are reasoning tokens billed at output rate even though I never see them?
A: Because they are output tokens: the model generates them through the same autoregressive process, they consume the same GPU time, and they fill your context window. OpenAI hides them from your response for safety and to prevent prompt-extraction attacks, but you still pay the compute. This is why a single o3 call at high effort can spend 16K reasoning tokens (about $0.13 at o3's $8/M output rate) before producing a single visible character. Pricing reflects compute, not visibility.
Common wrong answer to avoid: "Hidden tokens should be cheaper because the user doesn't see them" — the model spends the same compute. Some providers offer cached input discounts on prior reasoning items (Grok 4 cached = $0.75/M input vs $3 base) but generation pricing stays at output rate.
Q: When would you choose Anthropic extended thinking over OpenAI o-series for a coding agent?
A: Three reasons. First, visible thinking: Anthropic returns the thinking block so you can debug agent behaviour, log it for replay, and use it for tool-call grounding. OpenAI hides the chain. Second, interleaved tool use: Claude can run tools during extended thinking and use results in the same chain; o-series tool integration is improving but the loop pattern is cleaner with Claude. Third, output length: Opus 4.7 supports 128K output tokens, which matters for long refactors. Counter: OpenAI's o3-pro and GPT-5.5 are still ahead on raw SWE-bench Verified and on multi-file repo reasoning when used with cached prior reasoning items.
Common wrong answer to avoid: "Anthropic is better at coding" — the benchmark gap is small and context-dependent. The architecture difference (visible vs hidden CoT, tool interleaving, output length) is the more defensible reason.
Q: How does DeepSeek-R1's training pipeline differ from o1's, and why does that matter for your stack?
A: OpenAI has not published o1's training recipe; R1's is public. R1 used GRPO (Group Relative Policy Optimization), which drops PPO's critic, computes advantages from group-relative rewards, and saves 40–60% memory in training. R1-Zero (pure RL, no SFT) demonstrated that reasoning behaviour can emerge without proprietary instruction data. For your stack this matters because: (a) the weights are open, so you can self-host for privacy or run cheap distilled variants, (b) the training recipe is reproducible, so other open labs (Qwen, Kimi) followed and open-weight reasoning models closed the gap with closed models quickly, and (c) you now have a real fallback if a closed API has an incident or pricing change.
Common wrong answer to avoid: "R1 just matches o1, who cares" — the open-weight aspect changed industry dynamics (closed price cuts, distillation pipelines, on-prem options). The architectural lesson — pure RL on chains works — is also strategically important.
Apply now (5 min)
Pick one production task. Run it at three effort levels on whichever reasoning model is in your stack. Log: reasoning_tokens, output_tokens, latency, cost, accuracy on your golden set. Plot the four curves. The knee of the cost-accuracy curve is your default effort. Anything higher needs a per-request justification.
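The text does not prescribe how to locate the knee, so the sketch below swaps in one common heuristic: normalize both axes and pick the point furthest above the straight line joining the cheapest and most expensive settings. The function name is hypothetical, and the sample points reuse the approximate Opus 4.7 table as placeholder data; your own golden-set measurements go in their place.

```python
from typing import List, Tuple


def knee_index(points: List[Tuple[float, float]]) -> int:
    """Return the index of the knee in (cost, accuracy) points sorted by cost.

    Max-distance-from-chord heuristic: after scaling both axes to [0, 1], the
    knee is the point with the largest accuracy gain over a linear baseline.
    """
    costs = [c for c, _ in points]
    accs = [a for _, a in points]
    cost_span = costs[-1] - costs[0]
    acc_span = accs[-1] - accs[0]
    if cost_span <= 0 or acc_span <= 0:
        return 0  # flat or degenerate curve: the cheapest level is the default
    gaps = [
        (a - accs[0]) / acc_span - (c - costs[0]) / cost_span
        for c, a in points
    ]
    return max(range(len(points)), key=gaps.__getitem__)


levels = ["low", "medium", "high", "max"]
measurements = [(0.07, 78.0), (0.17, 84.0), (0.42, 87.0), (1.22, 87.6)]  # placeholder data
default_effort = levels[knee_index(measurements)]
```

On the placeholder numbers this selects `"high"`. A latency constraint can legitimately pull the default lower than the knee, which is why the interactive case in the worked example settles on medium.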
Sketch from memory: Draw the three-family table (OpenAI / Anthropic / Google) with knob name, default value, and whether tokens are hidden or visible.
Bridge. Now the API hides or shows reasoning depending on the provider. That single design choice has cost, debugging, faithfulness, and safety implications — worth its own chapter. → 05-hidden-chain-of-thought.md