03. Week 12 — Reasoning Models¶

Companion files¶

Start with 02_explainer.md for the story and mental model.
Use this file as the compact reference sheet.
Apply the ideas in 05_hands_on_lab.md.
Close the loop with 06_revision.md.

Section 1 — What reasoning models are¶

A standard LLM usually does direct generation: prompt -> next token -> next token -> answer.

A reasoning model allocates more inference-time compute before finalising the answer. That extra budget may be used for:

- intermediate reasoning
- branch exploration
- self-correction
- verification

Core trade-off:

- better quality on hard multi-step tasks
- higher cost
- higher latency

Section 2 — Chain-of-thought prompting¶

Chain-of-thought (CoT) = ask the model to reason step by step before answering.

Main variants¶

Zero-shot CoT — add an instruction like: "Think step by step."
Few-shot CoT — provide worked examples with reasoning traces.
Self-consistency — sample multiple chains and aggregate the final answer.
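
Self-consistency is the easiest of the three variants to show in code. The sketch below assumes majority voting over final answers as the aggregation step; `sample_answer` and `make_fake_sampler` are hypothetical stand-ins for a sampled model call that returns only the final answer of one reasoning chain.

```python
from collections import Counter
from typing import Callable, List, Tuple


def self_consistency(
    sample_answer: Callable[[str], str], prompt: str, n: int = 5
) -> Tuple[str, float]:
    """Sample n chains and return the majority final answer with its vote share."""
    answers: List[str] = [sample_answer(prompt) for _ in range(n)]
    winner, votes = Counter(answers).most_common(1)[0]
    return winner, votes / n


# Hypothetical stand-in for a sampled model call: replays canned final answers.
def make_fake_sampler(canned: List[str]) -> Callable[[str], str]:
    it = iter(canned)
    return lambda prompt: next(it)


sampler = make_fake_sampler(["42", "41", "42", "42", "40"])
answer, share = self_consistency(sampler, "What is 6 * 7? Think step by step.", n=5)
print(answer, share)  # 42 0.6
```

The vote share doubles as a rough confidence signal: a low share suggests the chains disagree and the request may deserve a stronger model or a human check.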

Why it can help¶

buys more test-time compute
keeps intermediate state visible
encourages decomposition
makes checking easier

When it helps most¶

math and logic
planning
code debugging
SQL generation
multi-hop QA

When it hurts¶

trivial tasks
strict JSON / schema outputs
tight latency workflows
cases where longer reasoning becomes longer nonsense

Section 3 — Reasoning model architectures¶

High-level picture only; exact recipes are partly proprietary.

Common ingredients¶

hidden or private chain-of-thought
longer test-time compute
training on hard reasoning tasks
search across candidate solutions
verification before the final answer

o1 / o3-style mental model¶

Think of these as models tuned to use longer reasoning trajectories effectively. The model is not merely talking longer. It is better at using the extra budget on hard tasks.

DeepSeek-R1¶

DeepSeek-R1 matters because it brought reasoning-style behaviour into the open ecosystem. It showed that reasoning is not only a closed-model story. That widened the production design space for engineers.

Section 4 — Test-time compute, search, and verification¶

Test-time compute¶

Test-time compute = extra inference budget spent on thinking before answering. Examples:

- more internal reasoning tokens
- several candidate solution paths
- a check-revise loop
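
A check-revise loop is one concrete way to spend extra inference budget. The following is a minimal sketch under stated assumptions: `propose` and `check` are hypothetical callables, the "model" is a toy that nudges an integer guess based on feedback, and `max_rounds` plays the role of the compute budget.

```python
from typing import Callable, Optional, Tuple


def check_revise(
    propose: Callable[[Optional[str]], int],
    check: Callable[[int], Tuple[bool, str]],
    max_rounds: int = 3,
) -> Tuple[Optional[int], int]:
    """Propose, check, and revise until the check passes or the budget runs out."""
    feedback: Optional[str] = None
    for round_no in range(1, max_rounds + 1):
        candidate = propose(feedback)
        ok, feedback = check(candidate)
        if ok:
            return candidate, round_no
    return None, max_rounds  # budget exhausted without a verified answer


# Hypothetical stand-in for a model: adjusts its guess using the feedback string.
def make_proposer(first_guess: int) -> Callable[[Optional[str]], int]:
    state = {"guess": first_guess}

    def propose(feedback: Optional[str]) -> int:
        if feedback == "too low":
            state["guess"] += 1
        elif feedback == "too high":
            state["guess"] -= 1
        return state["guess"]

    return propose


def check_sqrt_144(candidate: int) -> Tuple[bool, str]:
    """Cheap check: does the candidate square to 144?"""
    square = candidate * candidate
    if square == 144:
        return True, "ok"
    return False, "too low" if square < 144 else "too high"


result, rounds = check_revise(make_proposer(10), check_sqrt_144, max_rounds=5)
print(result, rounds)  # 12 3
```

With `max_rounds=2` the same call returns `(None, 2)`: the answer exists, but the budget is too small to reach it.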

Search¶

Search means the system does not commit to the first plausible-looking path. It explores alternatives and drops the branches that fail a check.

start
├─ path A -> contradiction -> drop
├─ path B -> promising -> continue
└─ path C -> unverifiable -> drop
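
The tree above can be sketched as a depth-first search with pruning. This is a toy illustration, not how any particular reasoning model searches: the task (pick numbers that sum to a target) and the function name `search` are assumptions chosen so that "contradiction" has a cheap, exact test.

```python
from typing import List, Optional


def search(
    numbers: List[int],
    target: int,
    path: Optional[List[int]] = None,
    start: int = 0,
) -> Optional[List[int]]:
    """Depth-first search for a subset of positive integers that sums to target."""
    path = path or []
    total = sum(path)
    if total == target:
        return path  # verified solution: stop searching
    if total > target:
        return None  # contradiction: drop this branch
    for i in range(start, len(numbers)):
        found = search(numbers, target, path + [numbers[i]], i + 1)
        if found is not None:
            return found  # a promising branch paid off
    return None  # every continuation failed: backtrack


print(search([8, 5, 3, 2], 10))  # [8, 2]
```

Here the paths `[8, 5]` and `[8, 3]` overshoot the target and are dropped before `[8, 2]` succeeds.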

Verification¶

Verification grounds the reasoning. Examples:

- run tests
- execute SQL
- validate JSON schema
- check arithmetic
- confirm constraints

Reasoning models become much stronger when cheap verifiers exist, because each candidate answer can be checked automatically instead of being taken on trust.
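
Two cheap verifiers, sketched with the standard library only. The function names and the `required` key-to-type mapping are hypothetical; a production system would more likely use a full JSON Schema validator, which this sketch does not attempt to reproduce.

```python
import json
from typing import Dict, List


def verify_json(raw: str, required: Dict[str, type]) -> bool:
    """Return True if raw parses as a JSON object with the required keys and types."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return False
    if not isinstance(data, dict):
        return False
    return all(isinstance(data.get(key), typ) for key, typ in required.items())


def verify_sum(claimed_total: int, items: List[int]) -> bool:
    """Arithmetic check: does the claimed total match the line items?"""
    return claimed_total == sum(items)


schema = {"total": int, "currency": str}  # assumed shape, for illustration only
print(verify_json('{"total": 12, "currency": "EUR"}', schema))  # True
print(verify_json('{"total": "12", "currency": "EUR"}', schema))  # False
print(verify_json("Sure! Here is the JSON you asked for", schema))  # False
print(verify_sum(12, [5, 4, 3]))  # True
```

Both checks cost far less than a model call, which is what makes it practical to run them on every candidate.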

Section 5 — Routing and cost-quality trade-offs¶

Lead rule: use the cheapest model that clears the quality bar under the product latency constraint.

Simple routing pattern¶

request -> fast model -> verifier -> ship
                     \-> if fail -> reasoning model -> verifier -> ship / human
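
The routing pattern above can be sketched as a single function. Everything here is a hypothetical stand-in: `fast_model`, `reasoning_model`, and `verifier` are canned lookups rather than real model calls, and the route labels are illustrative.

```python
from typing import Callable, Tuple


def route(
    request: str,
    fast_model: Callable[[str], str],
    reasoning_model: Callable[[str], str],
    verifier: Callable[[str, str], bool],
) -> Tuple[str, str]:
    """Return (output, route), where route is 'fast', 'reasoning', or 'human'."""
    draft = fast_model(request)
    if verifier(request, draft):
        return draft, "fast"  # cheap path cleared the quality bar
    draft = reasoning_model(request)
    if verifier(request, draft):
        return draft, "reasoning"  # escalation fixed it
    return draft, "human"  # both paths failed verification


# Hypothetical stand-ins with canned behaviour.
FAST = {"easy": "ok", "hard": "bad", "impossible": "bad"}
REASONING = {"easy": "ok", "hard": "ok", "impossible": "bad"}


def fast_model(request: str) -> str:
    return FAST[request]


def reasoning_model(request: str) -> str:
    return REASONING[request]


def verifier(request: str, output: str) -> bool:
    return output == "ok"


for request in ("easy", "hard", "impossible"):
    print(request, route(request, fast_model, reasoning_model, verifier))
```

The reasoning model is only paid for when the verifier rejects the fast draft, so the design depends on the verifier being cheap and trustworthy.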

When to keep the fast path¶

extraction
rewriting
classification with obvious labels
FAQ lookup
autocomplete

When to escalate¶

multi-step planning
code debugging
SQL with many hidden constraints
policy reasoning
high-stakes decision support

Scorecard to track¶

task success rate
hard-slice success rate
average latency
p95 latency
cost per request
cost per solved task
escalation rate
verifier failure rate
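
A sketch of how the scorecard could be computed from request logs. The record fields (`solved`, `hard`, `latency_s`, `cost_usd`, `escalated`, `verifier_calls`, `verifier_failures`) are an assumed log schema, p95 uses the nearest-rank method, and verifier failure rate is taken here to mean failed checks divided by total checks.

```python
import math
from typing import Dict, List


def ratio(numerator: float, denominator: float) -> float:
    """Divide, returning NaN when the denominator is zero."""
    return numerator / denominator if denominator else float("nan")


def scorecard(records: List[dict]) -> Dict[str, float]:
    n = len(records)
    solved = [r for r in records if r["solved"]]
    hard = [r for r in records if r["hard"]]
    latencies = sorted(r["latency_s"] for r in records)
    total_cost = sum(r["cost_usd"] for r in records)
    # Nearest-rank p95: smallest latency with at least 95% of requests at or below it.
    p95 = latencies[math.ceil(0.95 * n) - 1] if n else float("nan")
    return {
        "task_success_rate": ratio(len(solved), n),
        "hard_slice_success_rate": ratio(sum(r["solved"] for r in hard), len(hard)),
        "average_latency_s": ratio(sum(latencies), n),
        "p95_latency_s": p95,
        "cost_per_request_usd": ratio(total_cost, n),
        "cost_per_solved_task_usd": ratio(total_cost, len(solved)),
        "escalation_rate": ratio(sum(r["escalated"] for r in records), n),
        "verifier_failure_rate": ratio(
            sum(r["verifier_failures"] for r in records),
            sum(r["verifier_calls"] for r in records),
        ),
    }


# Made-up log records, for illustration only.
records = [
    {"solved": True, "hard": False, "latency_s": 0.4, "cost_usd": 0.001,
     "escalated": False, "verifier_calls": 1, "verifier_failures": 0},
    {"solved": True, "hard": True, "latency_s": 6.0, "cost_usd": 0.030,
     "escalated": True, "verifier_calls": 2, "verifier_failures": 1},
    {"solved": False, "hard": True, "latency_s": 9.0, "cost_usd": 0.040,
     "escalated": True, "verifier_calls": 2, "verifier_failures": 2},
    {"solved": True, "hard": False, "latency_s": 0.6, "cost_usd": 0.001,
     "escalated": False, "verifier_calls": 1, "verifier_failures": 0},
]

card = scorecard(records)
for name, value in card.items():
    print(f"{name}: {value:.3f}")
```

Cost per solved task is the number to watch when comparing routing policies: it charges failed attempts to the tasks that were actually solved.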

Section 6 — Limitations and evaluation¶

Key limitations¶

visible CoT may be unfaithful: the written steps need not match how the model actually reached its answer
hidden CoT is not directly auditable
models can game benchmarks
reasoning models can overthink simple tasks

Evaluation layers¶

final-answer accuracy
verifier pass rate
hard-slice accuracy
cost and latency
routing precision / recall
sampled human review on high-risk cases
adversarial holdout tasks
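
Routing precision and recall from the list above can be computed as in the sketch below. It assumes a ground-truth label `needed` per request (for example, "the fast model's answer failed offline review"); how that label is produced is a design choice the source leaves open.

```python
from typing import List, Tuple


def routing_precision_recall(
    escalated: List[bool], needed: List[bool]
) -> Tuple[float, float]:
    """Precision: share of escalations that were needed.
    Recall: share of needed escalations that actually happened."""
    pairs = list(zip(escalated, needed))
    tp = sum(1 for e, n in pairs if e and n)
    fp = sum(1 for e, n in pairs if e and not n)
    fn = sum(1 for e, n in pairs if not e and n)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Made-up labels for five requests, for illustration only.
escalated = [True, True, False, False, True]
needed = [True, False, True, False, True]

precision, recall = routing_precision_recall(escalated, needed)
print(f"precision={precision:.2f} recall={recall:.2f}")  # precision=0.67 recall=0.67
```

Low precision means money wasted on unnecessary escalations; low recall means hard requests are shipped from the fast path.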

Two important distinctions¶

Correct answer is not the same as faithful explanation.
Long answer is not the same as good reasoning.

Section 7 — SOTA reference (2026)¶

Vocabulary check; revalidate quarterly. See ../../tooling_landscape.md for the broader stack.

Model family	Positioning	Notes
OpenAI o1 / o3	flagship reasoning APIs	strong on math, code, planning; high inference cost
Claude Sonnet / Opus with thinking	hybrid reasoning + tools	strong coding and agent workflows
Gemini 2.5 Pro / related reasoning modes	long-context reasoning	strong context handling and multimodal bridges
DeepSeek-R1	open reasoning model	important open reference point
QwQ / Qwen reasoning variants	open reasoning family	useful for local or cost-sensitive experiments

Reading list¶

Chain-of-Thought Prompting Elicits Reasoning in Large Language Models (Wei et al., 2022)
Tree of Thoughts (Yao et al., 2023)
DeepSeek-R1 public report / blog
OpenAI o1/o3 public reasoning materials
Anthropic or Google public write-ups on thinking systems

Reference material¶

YouTube¶

Stanford CS25: Jason Wei & Hyung Won Chung of OpenAI - Good context on why prompting and reasoning traces matter.
Deep Dive into LLMs like ChatGPT - Karpathy covers pretraining, post-training, and why extra compute can unlock harder behaviour.

Blogs¶

Language Models Perform Reasoning via Chain of Thought - Accessible explanation of the original CoT result.
Prompt Engineering (CoT, ToT, ReAct & more) - Strong survey of prompting patterns relevant to reasoning.

Self-check¶

CoT prompting vs a native reasoning model — what changes in each case?
Why can test-time compute improve hard-task accuracy?
Why is verification so important for reasoning systems?
When would routing beat always using the reasoning model?
Why is hidden CoT both useful and uncomfortable?