03. Week 12 — Reasoning Models
Companion files
- Start with 02_explainer.md for the story and mental model.
- Use this file as the compact reference sheet.
- Apply the ideas in 05_hands_on_lab.md.
- Close the loop with 06_revision.md.
Section 1 — What reasoning models are
A standard LLM usually does direct generation: prompt -> next token -> next token -> answer.
A reasoning model allocates more inference-time compute before finalising the answer. That extra budget may be used for:
- intermediate reasoning
- branch exploration
- self-correction
- verification
Core trade-off:
- better quality on hard multi-step tasks
- higher cost
- higher latency
Section 2 — Chain-of-thought prompting
Chain-of-thought (CoT) = ask the model to reason step by step before answering.
Main variants
- Zero-shot CoT — add an instruction like: "Think step by step."
- Few-shot CoT — provide worked examples with reasoning traces.
- Self-consistency — sample multiple reasoning chains and take a majority vote over their final answers.
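
The variants above can be sketched in a few lines. This is a minimal illustration, not a reference implementation: the prompt suffix, the `Answer:` marker convention, and the `sample` callable (a stand-in for any sampled model call) are all assumptions made for the example.

```python
from collections import Counter
from typing import Callable, List

# Assumed instruction and answer marker; any equivalent wording would do.
COT_SUFFIX = "Think step by step, then give the final answer after 'Answer:'."


def build_zero_shot_cot_prompt(question: str) -> str:
    """Zero-shot CoT: append a reasoning instruction to the question."""
    return f"{question}\n{COT_SUFFIX}"


def extract_answer(completion: str) -> str:
    """Keep only the text after the last 'Answer:' marker."""
    return completion.rsplit("Answer:", 1)[-1].strip()


def self_consistency(question: str, sample: Callable[[str], str], n: int = 5) -> str:
    """Sample n chains and return the most common final answer."""
    prompt = build_zero_shot_cot_prompt(question)
    answers = [extract_answer(sample(prompt)) for _ in range(n)]
    return Counter(answers).most_common(1)[0][0]


def make_fake_sampler(completions: List[str]) -> Callable[[str], str]:
    """Hypothetical stand-in for a model call: replays canned completions."""
    replay = iter(completions)
    return lambda prompt: next(replay)


fake = make_fake_sampler([
    "17 + 25 = 42. Answer: 42",
    "17 + 25 = 41. Answer: 41",  # one chain makes an arithmetic slip
    "10 + 20 = 30, 7 + 5 = 12, total 42. Answer: 42",
])
print(self_consistency("What is 17 + 25?", fake, n=3))  # prints 42
```

The vote is taken over final answers only, so two chains that reason differently but agree on the result still reinforce each other.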
Why it can help
- buys more test-time compute
- keeps intermediate state visible
- encourages decomposition
- makes checking easier
When it helps most
- math and logic
- planning
- code debugging
- SQL generation
- multi-hop QA
When it hurts
- trivial tasks
- strict JSON / schema outputs
- tight latency workflows
- cases where longer reasoning becomes longer nonsense
Section 3 — Reasoning model architectures
High-level picture only; exact recipes are partly proprietary.
Common ingredients
- hidden or private chain-of-thought
- longer test-time compute
- training on hard reasoning tasks
- search across candidate solutions
- verification before the final answer
o1 / o3-style mental model
Think of these as models tuned to use longer reasoning trajectories effectively. The model is not merely talking longer. It is better at using the extra budget on hard tasks.
DeepSeek-R1
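DeepSeek-R1 matters because it brought reasoning-style behaviour into the open ecosystem: its weights were released publicly alongside a technical report on its reinforcement-learning-based training. It showed that reasoning is not only a closed-model story. That widened the production design space for engineers.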
Section 4 — Test-time compute, search, and verification
Test-time compute
Test-time compute = extra inference budget spent on thinking before answering. Examples:
- more internal reasoning tokens
- several candidate solution paths
- a check-revise loop
Search
Search means the system does not trust the first plausible-looking path. It explores alternatives and drops the ones that fail.
start
├─ path A -> contradiction -> drop
├─ path B -> promising -> continue
└─ path C -> unverifiable -> drop
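
The tree above can be mimicked with a small depth-first search that prunes bad branches. This is a toy sketch under stated assumptions: the puzzle (reach 10 from 1 using `+3` and `*2`), the `status` labels, and the function names are hypothetical and chosen only to make pruning visible; real systems score partial reasoning paths with a model or a verifier.

```python
from typing import Callable, Iterable, List, Optional, Tuple

Path = Tuple[str, ...]


def search(
    start: Path,
    expand: Callable[[Path], Iterable[str]],
    status: Callable[[Path], str],
    max_depth: int = 4,
) -> Optional[Path]:
    """Depth-first search; status(path) is 'solved', 'promising', or 'drop'."""
    stack: List[Path] = [start]
    while stack:
        path = stack.pop()
        verdict = status(path)
        if verdict == "solved":
            return path
        if verdict == "drop" or len(path) >= max_depth:
            continue  # prune: do not spend more budget on this branch
        for step in expand(path):
            stack.append(path + (step,))
    return None


TARGET = 10  # toy goal: reach 10 starting from 1


def value_of(path: Path) -> int:
    """Apply each step in order to the starting value 1."""
    value = 1
    for step in path:
        value = value + 3 if step == "+3" else value * 2
    return value


def status(path: Path) -> str:
    value = value_of(path)
    if value == TARGET:
        return "solved"
    if value > TARGET:
        return "drop"  # overshoot: both moves only increase the value
    return "promising"


def expand(path: Path) -> List[str]:
    return ["+3", "*2"]


solution = search((), expand, status)
print(solution, "->", value_of(solution) if solution else None)
```

The depth limit plays the role of the inference budget: a larger `max_depth` explores more paths at higher cost.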
Verification
Verification grounds the reasoning. Examples:
- run tests
- execute SQL
- validate JSON schema
- check arithmetic
- confirm constraints
Reasoning models become much stronger when cheap verifiers exist.
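
As one illustration of a cheap verifier, the sketch below executes candidate SQL against a small in-memory SQLite database and accepts the first draft whose result matches a known-good answer. The table, the candidate queries (standing in for successive model drafts), and the expected rows are all invented for the example; only the standard-library `sqlite3` module is used.

```python
import sqlite3
from typing import Iterable, List, Optional, Tuple


def make_test_db() -> sqlite3.Connection:
    """Build a tiny throwaway fixture database (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (id INTEGER, customer TEXT, amount REAL)")
    conn.executemany(
        "INSERT INTO orders VALUES (?, ?, ?)",
        [(1, "ana", 30.0), (2, "ben", 20.0), (3, "ana", 50.0)],
    )
    return conn


def verify_sql(conn: sqlite3.Connection, sql: str, expected: List[tuple]) -> Tuple[bool, str]:
    """Run the query; fail on execution errors or on a wrong result."""
    try:
        rows = conn.execute(sql).fetchall()
    except sqlite3.Error as exc:
        return False, f"execution error: {exc}"
    if rows != expected:
        return False, f"wrong result: {rows}"
    return True, "ok"


def first_verified(
    conn: sqlite3.Connection, candidates: Iterable[str], expected: List[tuple]
) -> Optional[str]:
    """Check-revise loop: return the first draft that passes verification."""
    for sql in candidates:
        ok, feedback = verify_sql(conn, sql, expected)
        if ok:
            return sql
        # In a real loop, `feedback` would be sent back to the model for revision.
    return None


# Canned drafts standing in for model output (assumption for the example).
CANDIDATES = [
    "SELECT customer, SUM(amount) FROM order GROUP BY customer",  # syntax error
    "SELECT customer, COUNT(*) FROM orders GROUP BY customer ORDER BY customer",
    "SELECT customer, SUM(amount) FROM orders GROUP BY customer ORDER BY customer",
]
EXPECTED = [("ana", 80.0), ("ben", 20.0)]

conn = make_test_db()
print(first_verified(conn, CANDIDATES, EXPECTED))
```

Model-written SQL should only ever run against a disposable or read-only database, as here; the verifier is cheap precisely because the fixture is tiny.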
Section 5 — Routing and cost-quality trade-offs
Lead rule: use the cheapest model that clears the quality bar under the product latency constraint.
Simple routing pattern
request -> fast model -> verifier -> ship
                             \-> if fail -> reasoning model -> verifier -> ship / human
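
The routing pattern above can be written as a short function. This is a sketch with stub models: `fast_model`, `reasoning_model`, and the arithmetic verifier are hypothetical placeholders, and a production router would also handle timeouts, retries, and logging.

```python
import operator
from dataclasses import dataclass
from typing import Callable, Optional

Model = Callable[[str], str]
Verifier = Callable[[str, str], bool]


@dataclass
class RouteResult:
    answer: Optional[str]
    route: str  # "fast", "reasoning", or "human"


def route(request: str, fast_model: Model, reasoning_model: Model,
          verifier: Verifier) -> RouteResult:
    """Try the cheap path first; escalate only when the verifier rejects it."""
    draft = fast_model(request)
    if verifier(request, draft):
        return RouteResult(draft, "fast")
    draft = reasoning_model(request)
    if verifier(request, draft):
        return RouteResult(draft, "reasoning")
    return RouteResult(None, "human")  # neither draft verified: hand off


OPS = {"+": operator.add, "*": operator.mul}


def arithmetic_verifier(request: str, answer: str) -> bool:
    """Cheap verifier for 'a op b' requests: recompute and compare."""
    a, op, b = request.split()
    return answer == str(OPS[op](int(a), int(b)))


# Stub models backed by canned answers (assumption for the example).
FAST_ANSWERS = {"2 + 2": "4", "12 * 12": "124", "9 * 9": "18"}
SLOW_ANSWERS = {"12 * 12": "144", "9 * 9": "80"}


def fast_model(request: str) -> str:
    return FAST_ANSWERS.get(request, "")


def reasoning_model(request: str) -> str:
    return SLOW_ANSWERS.get(request, "")


for req in ["2 + 2", "12 * 12", "9 * 9"]:
    print(req, "->", route(req, fast_model, reasoning_model, arithmetic_verifier))
```

The router never needs to predict difficulty up front; the verifier decides after the fact, which is why this pattern depends on verification being cheap.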
When to keep the fast path
- extraction
- rewriting
- classification with obvious labels
- FAQ lookup
- autocomplete
When to escalate
- multi-step planning
- code debugging
- SQL with many hidden constraints
- policy reasoning
- high-stakes decision support
Scorecard to track
- task success rate
- hard-slice success rate
- average latency
- p95 latency
- cost per request
- cost per solved task
- escalation rate
- verifier failure rate
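
One possible way to compute this scorecard from per-request logs is sketched below. The `RequestLog` fields, the nearest-rank percentile method, and the reading of "verifier failure rate" as the share of requests with at least one rejected draft are assumptions; the sample numbers are made up.

```python
import math
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class RequestLog:  # hypothetical log schema
    solved: bool
    hard: bool             # belongs to the hard slice
    latency_s: float
    cost_usd: float
    escalated: bool        # went to the reasoning model
    verifier_failed: bool  # at least one draft was rejected


def percentile_nearest_rank(values: List[float], pct: float) -> float:
    """Nearest-rank percentile: smallest value with >= pct% of data at or below it."""
    ordered = sorted(values)
    rank = max(math.ceil(pct / 100 * len(ordered)), 1)
    return ordered[rank - 1]


def scorecard(logs: List[RequestLog]) -> Dict[str, float]:
    n = len(logs)
    solved = sum(log.solved for log in logs)
    hard = [log for log in logs if log.hard]
    total_cost = sum(log.cost_usd for log in logs)
    latencies = [log.latency_s for log in logs]
    return {
        "task_success_rate": solved / n,
        "hard_slice_success_rate": (
            sum(log.solved for log in hard) / len(hard) if hard else float("nan")
        ),
        "avg_latency_s": sum(latencies) / n,
        "p95_latency_s": percentile_nearest_rank(latencies, 95),
        "cost_per_request": total_cost / n,
        # Failed requests still cost money, so divide total cost by solved tasks.
        "cost_per_solved_task": total_cost / solved if solved else float("inf"),
        "escalation_rate": sum(log.escalated for log in logs) / n,
        "verifier_failure_rate": sum(log.verifier_failed for log in logs) / n,
    }


logs = [
    RequestLog(True, False, 0.5, 0.001, False, False),
    RequestLog(True, False, 0.7, 0.001, False, False),
    RequestLog(True, True, 6.0, 0.030, True, True),
    RequestLog(False, True, 9.0, 0.040, True, True),
]
for name, value in scorecard(logs).items():
    print(f"{name}: {value:.3f}")
```

Cost per solved task is usually the more honest number: a cheap model that fails often can cost more per solved task than an expensive one that succeeds.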
Section 6 — Limitations and evaluation
Key limitations
- visible CoT may be unfaithful
- hidden CoT is not directly auditable
- benchmark scores can overstate ability (models can game benchmarks, for example through test-set contamination)
- reasoning models can overthink simple tasks
Evaluation layers
- final-answer accuracy
- verifier pass rate
- hard-slice accuracy
- cost and latency
- routing precision / recall
- sampled human review on high-risk cases
- adversarial holdout tasks
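
Routing precision and recall can be made concrete with a small calculation. The definitions here are an assumption for the example: an escalation counts as "needed" when an offline evaluation shows the fast model would have failed that request.

```python
from typing import List, Tuple


def routing_precision_recall(decisions: List[Tuple[bool, bool]]) -> Tuple[float, float]:
    """Each item is (escalated, needed).

    precision = share of escalations that were actually needed
    recall    = share of needed escalations that the router made
    """
    true_pos = sum(1 for escalated, needed in decisions if escalated and needed)
    false_pos = sum(1 for escalated, needed in decisions if escalated and not needed)
    false_neg = sum(1 for escalated, needed in decisions if not escalated and needed)
    precision = true_pos / (true_pos + false_pos) if (true_pos + false_pos) else 0.0
    recall = true_pos / (true_pos + false_neg) if (true_pos + false_neg) else 0.0
    return precision, recall


# Made-up labelled sample: (escalated, needed)
sample = [
    (True, True),    # correct escalation
    (True, False),   # wasted spend on the reasoning model
    (False, True),   # missed escalation: a likely wrong answer was shipped
    (False, False),  # correct fast path
    (True, True),    # correct escalation
]
precision, recall = routing_precision_recall(sample)
print(f"precision={precision:.2f} recall={recall:.2f}")  # precision=0.67 recall=0.67
```

Low precision shows up as cost and latency; low recall shows up as quality problems on the hard slice.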
Two important distinctions
- Correct answer is not the same as faithful explanation.
- Long answer is not the same as good reasoning.
Section 7 — SOTA reference (2026)
Vocabulary check; revalidate quarterly.
See ../../tooling_landscape.md for the broader stack.
| Model family | Positioning | Notes |
|---|---|---|
| OpenAI o1 / o3 | flagship reasoning APIs | strong on math, code, planning; high inference cost |
| Claude Sonnet / Opus with thinking | hybrid reasoning + tools | strong coding and agent workflows |
| Gemini 2.5 Pro / related reasoning modes | long-context reasoning | strong context handling and multimodal bridges |
| DeepSeek-R1 | open reasoning model | important open reference point |
| QwQ / Qwen reasoning variants | open reasoning family | useful for local or cost-sensitive experiments |
Reading list
- Chain-of-Thought Prompting Elicits Reasoning in Large Language Models (Wei et al., 2022)
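- Tree of Thoughts: Deliberate Problem Solving with Large Language Models (Yao et al., 2023)
- DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning (DeepSeek-AI, 2025)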
- OpenAI o1/o3 public reasoning materials
- Anthropic or Google public write-ups on thinking systems
Reference material
YouTube
- Stanford CS25: Jason Wei & Hyung Won Chung of OpenAI - Good context on why prompting and reasoning traces matter.
- Deep Dive into LLMs like ChatGPT - Karpathy covers pretraining, post-training, and why extra compute can unlock harder behaviour.
Blogs
- Language Models Perform Reasoning via Chain of Thought - Accessible explanation of the original CoT result.
- Prompt Engineering (CoT, ToT, ReAct & more) - Strong survey of prompting patterns relevant to reasoning.
Self-check
- CoT prompting vs a native reasoning model — what changes in each case?
- Why can test-time compute improve hard-task accuracy?
- Why is verification so important for reasoning systems?
- When would routing beat always using the reasoning model?
- Why is hidden CoT both useful and uncomfortable?