00. Reasoning Models — The Five-Year-Old Version¶
If a normal chatbot is the fast chess player, a reasoning model is the grandmaster who pauses before moving — and bills you for the pause.
Picture two chess players.
The first sees one move and slaps the piece down. Sometimes it wins. Sometimes the queen dies five moves later. That is a plain LLM. It plays the quick guess. Smooth, confident, and often wrong on anything with branches.
Now the grandmaster. She does not trust the first shiny option. She stops. She imagines two or three futures. She throws one bad line away. She returns to a better line. Only then does she move. That deliberate stop is the thinking pause.
A reasoning model is a model trained or wired to take that pause on demand. It spends the time budget — extra inference tokens you pay for — to walk the move tree of possible answers. When a branch turns rotten, it does the backtrack and tries another. Then it returns a clean final answer.
So what changed in 2024–2026? Earlier we tricked the pause through prompting ("think step by step"). Now the pause is native: OpenAI's o-series, Anthropic's extended thinking, Google's Gemini thinking, DeepSeek-R1, Qwen QwQ. The API exposes a knob — reasoning_effort, budget_tokens, thinking_budget — and the model spends private tokens before speaking.
The grandmaster is not always better. For 2 + 2, the pause is waste. For taxes, multi-file refactors, proofs, agent planning, and tool-heavy debugging, the pause is the product. The applied engineer's job is to know which lane is which and price accordingly.
See. Five placeholders carry the whole module. Memorise these names. Every chapter calls back to them.
The placeholders you will see called back¶
| Placeholder | Meaning |
|---|---|
| the quick guess | direct next-token generation with no scratchpad — base chat behavior |
| the thinking pause | visible or hidden chain-of-thought before the user-facing answer |
| the move tree | multiple candidate paths explored, scored, and pruned (search) |
| the backtrack | recovery move when a path turns out weak — verifier-driven retry |
| the time budget | the inference compute you allocate per request (effort/budget knob) |
What an applied engineer should walk into an interview knowing¶
| Axis | The senior-level answer |
|---|---|
| When to use a reasoning model | Multi-step task, high error cost, latency budget permits 5–60 s, not autocomplete |
| When NOT to | Classification, autocomplete, simple summarization, low error cost, tight P95 |
| Cost intuition | Frontier reasoning is 25–60× the price of fast models; pro-tier is 200–500× |
| Latency intuition | TTFT inflates 5–60× when reasoning enabled; P50 jumps from 1–2 s to 10–60 s |
| API knobs | reasoning_effort (OpenAI), thinking.budget_tokens or effort (Anthropic), thinkingLevel (Gemini 3), reasoning (Grok) |
| Production pattern | Cascade route 70/20/10 fast/reasoning/deep; verifier on output; fallback model |
| The faithfulness problem | Visible chain matches actual reasoning only 25–40% of the time (Anthropic 2025) |
| Eval discipline | Multi-column (accuracy, recovery, faithfulness, calibration, cost) on private golden set |
The whole module is one extended walk through this table. Each row gets its own chapter.
Top resources¶
- OpenAI reasoning guide — official
reasoning_effortsemantics, hidden-CoT billing, and prompt patterns for the o-series and GPT-5 thinking tier. - Anthropic extended thinking docs —
thinking.budget_tokensAPI, when thinking is summarized vs streamed, interleaving with tools. - Google Gemini thinking —
thinking_configparameters, deep-think mode, when Gemini thinks automatically. - DeepSeek-R1 paper (arXiv 2501.12948) — GRPO training, R1-Zero pure-RL surprise, open weights and distillation recipes.
- Lilian Weng — LLM reasoning — survey of CoT, self-consistency, ToT, and verifier-based selection.
- ARC-AGI v2 leaderboard — the benchmark where reasoning models still struggle; calibrates honest expectations.
- Scaling laws for test-time compute (Snell et al. 2024) — the canonical "compute traded at inference vs training" paper. Read the curves.
- Anthropic — Reasoning models don't always say what they think — the faithfulness paper everyone cites; required reading before you trust a printed scratchpad.
What's coming¶
- 01-opening-failure.md — Why fluent next-token models still crack the middle of multi-step tasks.
- 02-chain-of-thought-prompting.md — Zero-shot, few-shot, and decomposition prompts: the cheapest reasoning upgrade.
- 03-why-cot-works.md — The serial-computation argument and when extra tokens stop buying anything.
- 04-reasoning-model-architectures.md — From prompt tricks to RL-trained deliberation: o-series, extended thinking, Gemini thinking.
- 05-hidden-chain-of-thought.md — Why OpenAI hides raw CoT, why Anthropic shows it, what you actually get billed for.
- 06-deepseek-r1-open-ecosystem.md — How R1 broke the closed-only spell and what GRPO/distillation mean for your stack.
- 07-test-time-compute-scaling.md — Effort knobs, budget tokens, and the scaling curves that tell you when to stop paying.
- 08-search-verification-move-tree.md — Self-consistency, best-of-N, PRMs, and tree search inside production pipelines.
- 09-task-routing-patterns.md — Cascade routing, classifier gates, and where to put the cheap fast lane.
- 10-cost-quality-latency-tradeoffs.md — Real $/M-tokens math, P95 latencies, and the expected-loss frontier.
- 11-evaluating-reasoning.md — AIME, GPQA, SWE-bench, ARC-AGI: what they measure, where they leak, what to add at home.
- 12-production-reasoning-systems.md — End-to-end pipelines: router, retrieval, reasoner, verifier, tools, fallbacks, observability.
- 13-honest-admission.md — Faithfulness, scheming traces, ARC-AGI v2 failure, and the open scientific questions you should not pretend to have settled.
How to use this module for interview prep¶
Two passes work best.
First pass (3–4 hours), front to back. Read 00 through 13 in order. Take notes on the four-cell failure table (01), the three API knob families (04), the cascade routing pattern (09), and the cost-vs-latency table (10). These four artifacts cover ~70% of likely interview questions.
Second pass (2–3 hours), targeted. Re-read 09 (routing), 10 (cost/latency), 11 (eval), and 12 (production). Memorise the 2026 price table from chapter 10. Run the cascade math from chapter 09 on a workload you know. Run one perturbation/faithfulness test inspired by chapter 13 on a model you've used.
Before the interview itself. Reproduce the Apply Now sketches without looking at the book. If you can draw the cascade router, the cost-quality triangle, and the multi-column eval card from memory, you can walk into the loop.
The first job is to feel the failure mode clearly. If we cannot see why ordinary generation breaks, every later fix looks like marketing. So we open with one painful debugging trace.
Bridge. Reasoning models earn their price only against tasks where the fast model genuinely fails. So we start by watching one fail. → 01-opening-failure.md