03. Week 12 — Reasoning Models

Section 1 — What reasoning models are

A standard LLM usually does direct generation: prompt -> next token -> next token -> answer.

A reasoning model allocates more inference-time compute before finalising the answer. That extra budget may be used for:

  • intermediate reasoning
  • branch exploration
  • self-correction
  • verification

Core trade-off:

  • better quality on hard multi-step tasks
  • higher cost
  • higher latency

Section 2 — Chain-of-thought prompting

Chain-of-thought (CoT) = ask the model to reason step by step before answering.

Main variants

  • Zero-shot CoT — add an instruction like: "Think step by step."
  • Few-shot CoT — provide worked examples with reasoning traces.
  • Self-consistency — sample multiple chains and aggregate the final answer.
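Self-consistency is the easiest of the three variants to show in code. The sketch below is a minimal illustration, not a reference implementation: it assumes each sampled chain ends with a line of the form "Answer: ...", and `sample_chain` is a hypothetical stand-in for a call to a model with sampling enabled.

```python
import re
from collections import Counter
from typing import Callable, Optional


def extract_final_answer(chain: str) -> Optional[str]:
    """Return the text after the last 'Answer:' marker, or None if absent."""
    matches = re.findall(r"Answer:\s*(.+)", chain)
    return matches[-1].strip() if matches else None


def self_consistency(sample_chain: Callable[[], str], n: int = 5) -> Optional[str]:
    """Sample n reasoning chains and return the most common final answer."""
    answers = [extract_final_answer(sample_chain()) for _ in range(n)]
    votes = Counter(a for a in answers if a is not None)
    return votes.most_common(1)[0][0] if votes else None


# Hypothetical stand-in for a model: replays three canned chains.
canned = iter([
    "17 + 20 = 37, then 37 + 5 = 42. Answer: 42",
    "10 + 20 = 30, then 7 + 5 = 13, total 43. Answer: 43",
    "17 + 25 = 42. Answer: 42",
])
result = self_consistency(lambda: next(canned), n=3)
print(result)
```

Only the final answers are aggregated; the chains themselves can disagree on the route. The cost scales linearly with `n`, which is the same cost-quality trade-off described in Section 1.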

Why it can help

  • buys more test-time compute
  • keeps intermediate state visible
  • encourages decomposition
  • makes checking easier

When it helps most

  • math and logic
  • planning
  • code debugging
  • SQL generation
  • multi-hop QA

When it hurts

  • trivial tasks
  • strict JSON / schema outputs
  • tight latency workflows
  • cases where longer reasoning becomes longer nonsense

Section 3 — Reasoning model architectures

High-level picture only; exact recipes are partly proprietary.

Common ingredients

  • hidden or private chain-of-thought
  • longer test-time compute
  • training on hard reasoning tasks
  • search across candidate solutions
  • verification before the final answer

o1 / o3-style mental model

Think of these as models tuned to use longer reasoning trajectories effectively. The model is not merely talking longer. It is better at using the extra budget on hard tasks.

DeepSeek-R1

DeepSeek-R1 matters because it brought reasoning-style behaviour into the open-weight ecosystem. It showed that reasoning is not only a closed-model story. That widened the production design space: engineers can self-host or fine-tune a reasoning model instead of depending only on closed APIs.

Section 4 — Test-time compute, search, and verification

Test-time compute

Test-time compute = extra inference budget spent on thinking before answering. Examples:

  • more internal reasoning tokens
  • several candidate solution paths
  • a check-revise loop

Search

Search means the system does not commit to the first plausible path. It explores alternatives and drops paths that hit a contradiction or cannot be verified.

start
├─ path A -> contradiction -> drop
├─ path B -> promising -> continue
└─ path C -> unverifiable -> drop
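The tree above can be sketched as a small best-first search. This is a toy under stated assumptions, not how any particular reasoning model searches internally: the subset-sum puzzle, the `search` helper, and the callback names (`expand`, `is_dead`, `is_goal`, `score`) are all hypothetical. A "contradiction" here is simply a partial sum that overshoots the target.

```python
from typing import Callable, Iterable, Optional, TypeVar

S = TypeVar("S")


def search(start: S,
           expand: Callable[[S], Iterable[S]],
           is_dead: Callable[[S], bool],
           is_goal: Callable[[S], bool],
           score: Callable[[S], float],
           max_steps: int = 1000) -> Optional[S]:
    """Best-first search: expand the most promising path, drop dead ones."""
    frontier = [start]
    for _ in range(max_steps):
        if not frontier:
            return None  # every path was dropped
        frontier.sort(key=score, reverse=True)
        state = frontier.pop(0)
        if is_goal(state):
            return state
        # Prune children that already contradict the constraints.
        frontier.extend(c for c in expand(state) if not is_dead(c))
    return None  # budget exhausted


# Toy task: choose numbers, in order, that sum exactly to TARGET.
NUMBERS = [5, 3, 8, 2]
TARGET = 10


def expand(state):
    idx, chosen = state
    if idx >= len(NUMBERS):
        return []
    # Two branches per step: skip the next number or take it.
    return [(idx + 1, chosen), (idx + 1, chosen + (NUMBERS[idx],))]


def is_dead(state):
    return sum(state[1]) > TARGET  # overshoot = contradiction


def is_goal(state):
    return sum(state[1]) == TARGET


def score(state):
    return sum(state[1])  # closer to the target looks more promising


solution = search((0, ()), expand, is_dead, is_goal, score)
print(solution)
```

The `max_steps` cap plays the role of the test-time compute budget: a larger budget lets the search recover from more bad first guesses, at higher cost and latency.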

Verification

Verification grounds the reasoning. Examples:

  • run tests
  • execute SQL
  • validate JSON schema
  • check arithmetic
  • confirm constraints

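Reasoning models become much stronger when cheap verifiers exist, because a failed check gives a concrete signal to revise or retry instead of shipping a confident wrong answer.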
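A check-revise loop around a cheap verifier might look like the sketch below. It uses JSON validation as the verifier because that needs only the standard library. The `generate` callback is a hypothetical stand-in for a model call that receives the previous error message as feedback, and the required keys are illustrative.

```python
import json
from typing import Callable, Optional, Set


def verify_json(text: str, required_keys: Set[str]) -> Optional[str]:
    """Return None if the text passes, otherwise a short error message."""
    try:
        data = json.loads(text)
    except json.JSONDecodeError as exc:
        return f"invalid JSON: {exc.msg}"
    if not isinstance(data, dict):
        return "top-level value must be an object"
    missing = required_keys - set(data)
    if missing:
        return f"missing keys: {sorted(missing)}"
    return None


def generate_with_verification(generate: Callable[[Optional[str]], str],
                               required_keys: Set[str],
                               max_attempts: int = 3) -> Optional[str]:
    """Call generate until the verifier passes or the budget runs out."""
    feedback = None
    for _ in range(max_attempts):
        candidate = generate(feedback)
        feedback = verify_json(candidate, required_keys)
        if feedback is None:
            return candidate
    return None  # caller decides: escalate, fall back, or ask a human


# Hypothetical stand-in for a model: improves on each attempt.
attempts = iter([
    '{"city": "Paris"',                        # truncated output
    '{"city": "Paris"}',                       # valid JSON, missing a key
    '{"city": "Paris", "country": "France"}',  # passes
])
seen_feedback = []


def fake_generate(feedback):
    seen_feedback.append(feedback)
    return next(attempts)


result = generate_with_verification(fake_generate, {"city", "country"})
print(result)
```

The verifier returns a message rather than a bare boolean so the next attempt can be conditioned on what went wrong. That is a design choice for this sketch, not a requirement of the pattern.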

Section 5 — Routing and cost-quality trade-offs

Lead rule: use the cheapest model that clears the quality bar under the product latency constraint.

Simple routing pattern

request -> fast model -> verifier -> ship
                     \-> if fail -> reasoning model -> verifier -> ship / human
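The routing pattern above can be written as a short function. This is a minimal sketch: `fast_model`, `reasoning_model`, and `verifier` are hypothetical callables, and the toy task (sorting a comma-separated list of integers) is chosen only because its verifier is trivially cheap.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class RouteResult:
    answer: str
    route: str  # "fast", "reasoning", or "human"


def route(request: str,
          fast_model: Callable[[str], str],
          reasoning_model: Callable[[str], str],
          verifier: Callable[[str, str], bool]) -> RouteResult:
    """Try the fast path first; escalate only when verification fails."""
    draft = fast_model(request)
    if verifier(request, draft):
        return RouteResult(draft, "fast")
    escalated = reasoning_model(request)
    if verifier(request, escalated):
        return RouteResult(escalated, "reasoning")
    return RouteResult(escalated, "human")  # hand off for review


def parse_ints(text: str) -> List[int]:
    return [int(x) for x in text.split(",")]


def is_sorted_permutation(request: str, answer: str) -> bool:
    """Cheap verifier: same items as the request, in non-decreasing order."""
    try:
        out = parse_ints(answer)
    except ValueError:
        return False
    same_items = sorted(out) == sorted(parse_ints(request))
    in_order = all(a <= b for a, b in zip(out, out[1:]))
    return same_items and in_order


# Hypothetical stand-ins for real models.
def fast(request: str) -> str:
    return request  # only right when the input is already sorted


def reasoning(request: str) -> str:
    return ",".join(str(x) for x in sorted(parse_ints(request)))


def broken(request: str) -> str:
    return "n/a"


print(route("1,2,3", fast, reasoning, is_sorted_permutation))
print(route("3,1,2", fast, reasoning, is_sorted_permutation))
print(route("3,1,2", fast, broken, is_sorted_permutation))
```

The whole pattern depends on the verifier being much cheaper than the reasoning model. Without one, routing has to fall back on a classifier or heuristics over the request itself.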

When to keep the fast path

  • extraction
  • rewriting
  • classification with obvious labels
  • FAQ lookup
  • autocomplete

When to escalate

  • multi-step planning
  • code debugging
  • SQL with many hidden constraints
  • policy reasoning
  • high-stakes decision support

Scorecard to track

  • task success rate
  • hard-slice success rate
  • average latency
  • p95 latency
  • cost per request
  • cost per solved task
  • escalation rate
  • verifier failure rate
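The scorecard can be computed from per-request logs. The sketch below assumes a hypothetical `RequestLog` record. Its field names, the nearest-rank percentile, and the reading of "verifier failure rate" as the share of requests with at least one failed check are illustrative assumptions, and the sample numbers are made up.

```python
import math
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class RequestLog:
    solved: bool           # task judged successful
    hard: bool             # belongs to the hard slice
    latency_s: float
    cost_usd: float
    escalated: bool        # went to the reasoning model
    verifier_failed: bool  # at least one verifier check failed


def percentile(values: List[float], pct: float) -> float:
    """Nearest-rank percentile of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def scorecard(logs: List[RequestLog]) -> Dict[str, Optional[float]]:
    n = len(logs)
    solved = sum(r.solved for r in logs)
    hard = [r for r in logs if r.hard]
    total_cost = sum(r.cost_usd for r in logs)
    latencies = [r.latency_s for r in logs]
    return {
        "task_success_rate": solved / n,
        "hard_slice_success_rate":
            sum(r.solved for r in hard) / len(hard) if hard else None,
        "avg_latency_s": sum(latencies) / n,
        "p95_latency_s": percentile(latencies, 95),
        "cost_per_request": total_cost / n,
        "cost_per_solved_task": total_cost / solved if solved else None,
        "escalation_rate": sum(r.escalated for r in logs) / n,
        "verifier_failure_rate": sum(r.verifier_failed for r in logs) / n,
    }


# Made-up sample: two easy fast-path requests, two hard escalations.
logs = [
    RequestLog(True, False, 0.5, 0.001, False, False),
    RequestLog(True, False, 0.7, 0.001, False, False),
    RequestLog(True, True, 6.0, 0.030, True, True),
    RequestLog(False, True, 9.0, 0.040, True, True),
]
card = scorecard(logs)
for name, value in card.items():
    print(name, value)
```

Cost per solved task is higher than cost per request whenever some requests fail, which is why the two are tracked separately.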

Section 6 — Limitations and evaluation

Key limitations

  • visible CoT may be unfaithful
  • hidden CoT is not directly auditable
  • models can game benchmarks
  • reasoning models can overthink simple tasks

Evaluation layers

  1. final-answer accuracy
  2. verifier pass rate
  3. hard-slice accuracy
  4. cost and latency
  5. routing precision / recall
  6. sampled human review on high-risk cases
  7. adversarial holdout tasks
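Routing precision and recall (layer 5) need a definition of the positive class. The sketch below assumes "escalated" is the positive prediction and "needed escalation" is the ground truth, for example because the fast model is known to fail on that request. Both the definition and the function name are illustrative assumptions.

```python
from typing import List, Optional, Tuple


def routing_precision_recall(
    escalated: List[bool], needed: List[bool]
) -> Tuple[Optional[float], Optional[float]]:
    """Precision: share of escalations that were needed.
    Recall: share of needed escalations that actually happened."""
    pairs = list(zip(escalated, needed))
    tp = sum(e and n for e, n in pairs)
    fp = sum(e and not n for e, n in pairs)
    fn = sum(n and not e for e, n in pairs)
    precision = tp / (tp + fp) if tp + fp else None
    recall = tp / (tp + fn) if tp + fn else None
    return precision, recall


# Made-up labels for six requests.
escalated = [True, True, True, False, False, False]
needed = [True, True, False, True, False, False]
precision, recall = routing_precision_recall(escalated, needed)
print(precision, recall)
```

Low precision means money and latency wasted on easy requests. Low recall means hard requests shipped from the fast path, which usually shows up as a drop in hard-slice accuracy.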

Two important distinctions

  • Correct answer is not the same as faithful explanation.
  • Long answer is not the same as good reasoning.

Section 7 — SOTA reference (2026)

Vocabulary check; revalidate quarterly. See ../../tooling_landscape.md for the broader stack.

Model family                             | Positioning              | Notes
OpenAI o1 / o3                           | flagship reasoning APIs  | strong on math, code, planning; high inference cost
Claude Sonnet / Opus with thinking       | hybrid reasoning + tools | strong coding and agent workflows
Gemini 2.5 Pro / related reasoning modes | long-context reasoning   | strong context handling and multimodal bridges
DeepSeek-R1                              | open reasoning model     | important open reference point
QwQ / Qwen reasoning variants            | open reasoning family    | useful for local or cost-sensitive experiments

Reading list

  1. Chain-of-Thought Prompting Elicits Reasoning in Large Language Models (Wei et al., 2022)
  2. Tree of Thoughts (Yao et al., 2023)
  3. DeepSeek-R1 public report / blog
  4. OpenAI o1/o3 public reasoning materials
  5. Anthropic or Google public write-ups on thinking systems

Self-check

  1. CoT prompting vs a native reasoning model — what changes in each case?
  2. Why can test-time compute improve hard-task accuracy?
  3. Why is verification so important for reasoning systems?
  4. When would routing beat always using the reasoning model?
  5. Why is hidden CoT both useful and uncomfortable?