01. Week 12 — Reasoning Models

Key concepts to master

  • Standard LLM vs reasoning model
  • Zero-shot CoT vs few-shot CoT
  • Test-time compute
  • Search, verification, and backtracking
  • Cost-quality-latency trade-offs
  • Cheap-first escalation routing
  • Faithfulness vs final-answer correctness
  • Reasoning evaluation methodology
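
Cheap-first escalation routing from the list above can be sketched in a few lines. This is a minimal illustration, not a production router: the `Answer` fields, the `0.8` threshold, the stub models, and their costs are all hypothetical, and a real system would need its own confidence signal (for example a verifier score) rather than the keyword check used here.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass
class Answer:
    text: str
    confidence: float  # assumed score in [0, 1], e.g. from a verifier
    cost_usd: float    # assumed per-call cost, illustrative only


Model = Callable[[str], Answer]


def route(prompt: str, fast: Model, reasoning: Model,
          threshold: float = 0.8) -> Tuple[Answer, str]:
    """Try the cheap model first; escalate only when confidence is low."""
    first = fast(prompt)
    if first.confidence >= threshold:
        return first, "fast"
    second = reasoning(prompt)
    # The escalated path pays for both calls, so report the combined cost.
    total = first.cost_usd + second.cost_usd
    return Answer(second.text, second.confidence, total), "escalated"


# Hypothetical stand-ins for real model clients.
def fast_stub(prompt: str) -> Answer:
    hard = "prove" in prompt.lower()
    return Answer("quick guess", 0.4 if hard else 0.95, 0.001)


def reasoning_stub(prompt: str) -> Answer:
    return Answer("careful answer", 0.9, 0.05)


if __name__ == "__main__":
    for p in ["What is 2 + 2?",
              "Prove that the sum of two even numbers is even."]:
        answer, tier = route(p, fast_stub, reasoning_stub)
        print(tier, round(answer.cost_usd, 3))
```

Note that the escalated path is more expensive than calling the reasoning model directly, so the pattern only pays off when most traffic stays on the fast tier.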

🧠 Mental models

  • Chain-of-thought: "scratch paper beside the exam"
  • Tree-of-thought: "explore a maze with branches, not one blind sprint"
  • Self-consistency: "ask several solvers independently and trust the answer where they converge"
  • Process reward models: "a coach grading every move, not just the final score"
  • o1/o3-style reasoning: "buy extra thinking time only for problems worth the bill"
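
The self-consistency mental model above amounts to sampling several answers and taking a majority vote. A minimal sketch, assuming a hypothetical `sample_answer` callable that stands in for one independent model call and returns only the final answer string:

```python
from collections import Counter
from typing import Callable, Tuple


def self_consistency(sample_answer: Callable[[], str],
                     n_samples: int = 5) -> Tuple[str, float]:
    """Sample n final answers and return the majority answer
    together with the fraction of samples that agreed with it."""
    answers = [sample_answer() for _ in range(n_samples)]
    winner, votes = Counter(answers).most_common(1)[0]
    return winner, votes / n_samples


if __name__ == "__main__":
    # Fixed list standing in for five sampled model outputs.
    fake_samples = iter(["42", "41", "42", "43", "42"])
    answer, agreement = self_consistency(lambda: next(fake_samples), 5)
    print(answer, agreement)
```

The agreement fraction measures convergence, not correctness: if every sample shares the same blind spot, the vote is unanimous and still wrong.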

⚠️ Common traps

  • Mistaking a long reasoning trace for proof of correctness; verbose steps can still be wrong.
  • Using reasoning models on trivial tasks and blowing latency or token budgets with no product gain.
  • Comparing a reasoning model to a fast baseline without normalizing for cost and SLA.
  • Assuming self-consistency rescues bad decomposition when every sample repeats the same blind spot.
  • Treating final-answer accuracy as the same thing as faithful reasoning or inspectable process quality.
  • Logging or exposing raw reasoning traces carelessly when they may be noisy, misleading, or sensitive.
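
The cost and SLA normalization trap above is easier to avoid when every model is reported on the same axes. The sketch below is one possible way to do that; the `RunStats` fields, the "cost per correct answer" metric, and all numbers in the usage example are illustrative assumptions, not measurements.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class RunStats:
    name: str
    accuracy: float              # fraction of eval items answered correctly
    cost_per_request_usd: float  # average cost per request
    p95_latency_s: float         # 95th-percentile latency in seconds


def compare(runs: List[RunStats], sla_p95_s: float) -> List[Dict]:
    """Report accuracy alongside cost per correct answer and SLA fit."""
    rows = []
    for r in runs:
        if r.accuracy > 0:
            cost_per_correct = r.cost_per_request_usd / r.accuracy
        else:
            cost_per_correct = float("inf")
        rows.append({
            "name": r.name,
            "accuracy": r.accuracy,
            "cost_per_correct_usd": cost_per_correct,
            "meets_sla": r.p95_latency_s <= sla_p95_s,
        })
    return rows


if __name__ == "__main__":
    # Made-up numbers for illustration only.
    runs = [
        RunStats("fast", accuracy=0.80, cost_per_request_usd=0.002, p95_latency_s=1.5),
        RunStats("reasoning", accuracy=0.90, cost_per_request_usd=0.045, p95_latency_s=20.0),
    ]
    for row in compare(runs, sla_p95_s=5.0):
        print(row)
```

Running both models on the same evaluation set and reading accuracy next to cost and SLA fit keeps a higher raw score from hiding a missed latency target.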

🔗 Prerequisites & connections

Builds on: Module 11 eval discipline, production routing, and cost/latency trade-off thinking.

Feeds into: Module 13 multimodal systems, where some vision tasks need extra verification, multi-step grounding, or deliberate test-time compute.

💬 Interview phrasing

  • When would you route a request to an o1/o3-style reasoning model instead of a standard chat model?
  • Why can chain-of-thought improve math or coding accuracy and still be unfaithful?
  • What does self-consistency buy you, and when is the extra sampling cost justified?
  • How would you evaluate a reasoning model fairly against a cheaper fast model?
  • What problem do process reward models solve that outcome-only rewards miss?

⏱️ Difficulty markers

  • 🟢 zero-shot chain-of-thought prompting
  • 🟡 few-shot chain-of-thought prompting
  • 🟡 self-consistency sampling
  • 🔴 tree-of-thought search and backtracking
  • 🔴 process reward models
  • 🔴 faithfulness vs final-answer correctness
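
Search with backtracking, marked 🔴 above, is easier to reason about on a toy problem first. The sketch below is a plain depth-first search over arithmetic steps, not a tree-of-thought implementation: in a real system the candidate steps would be proposed by a model and the pruning check would be a learned or programmatic evaluator. The function name, the two operations, and the pruning rule (which assumes a positive start value) are all hypothetical.

```python
from typing import List, Optional, Tuple


def tree_search(state: int, target: int, depth: int,
                path: Tuple[str, ...] = ()) -> Optional[List[str]]:
    """Depth-first search for a sequence of steps turning state into target."""
    if state == target:
        return list(path)
    # Toy "verifier": give up on a branch that is out of budget or overshot.
    if depth == 0 or state > target:
        return None
    steps = (("*2", lambda x: x * 2), ("+3", lambda x: x + 3))
    for name, step in steps:
        result = tree_search(step(state), target, depth - 1, path + (name,))
        if result is not None:
            return result
    # Every branch from this state failed: backtrack to the caller.
    return None


if __name__ == "__main__":
    print(tree_search(1, 11, depth=4))
    print(tree_search(1, 3, depth=2))
```

The first call finds a path only after abandoning the branch that overshoots the target, which is the backtracking behavior a single linear chain of thought cannot perform.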

Foundation-gap audit for Module 13

Before leaving Week 12, you should be able to explain:

  1. test-time compute in plain English
  2. the cost-quality trade-off framework
  3. one practical routing pattern
  4. one layered evaluation plan for reasoning systems

If any of these feels shaky, revisit 02_explainer.md Chapter 4 and Chapter 5.

Self-check questions

  1. Why can a standard LLM sound fluent and still fail a hard multi-step task?
  2. Zero-shot CoT vs few-shot CoT — when would you use each?
  3. What does test-time compute scaling actually mean?
  4. Why is a reasoning model not the default choice for every workflow?
  5. What makes routing a Lead-level systems decision?
  6. Why can chain-of-thought be unfaithful?
  7. How would you evaluate a reasoning router fairly?

Exit signal

You are ready for Module 13 when you can explain, without notes, why some tasks deserve more thinking budget and why many tasks should still stay fast.