02. Reasoning Models — Narrative Explainer

Companion to `03_study_material.md`. This file gives you the picture in your head. The study material gives you the compact reference.

Table of contents

ELI5 — chessboard picture
Chapter 1: The opening failure
1.1 The confidently wrong answer
1.2 Why standard LLMs fail on hard chains
1.3 Why this becomes a Lead-level decision
Chapter 2: Chain-of-thought prompting
2.1 The quick guess and the thinking pause
2.2 Zero-shot CoT
2.3 Few-shot CoT
2.4 Why CoT works
2.5 When CoT helps
2.6 When CoT hurts
Chapter 3: Reasoning model architectures
3.1 From prompting tricks to trained reasoning
3.2 Hidden chain-of-thought
3.3 The o1 / o3-style picture
3.4 DeepSeek-R1 and the open ecosystem
3.5 Search, verification, and the move tree
3.6 Test-time compute scaling
Chapter 4: When to use reasoning models
4.1 Task complexity spectrum
4.2 Cost, quality, and latency
4.3 Routing patterns
4.4 Worked production examples
Chapter 5: Limitations and evaluation
5.1 Faithfulness of chain-of-thought
5.2 Benchmark gaming
5.3 Overthinking simple tasks
5.4 How to evaluate reasoning
5.5 Honest admission
Chapter 6: Recap and application
6.1 Failure-fix chain
6.2 Key points to remember
6.3 Interview questions
6.4 Production experience
6.5 Exercises
6.6 Foundation-gap audit
6.7 Bridge to Module 13

ELI5 — the whole thing in kid words

Imagine two chess players. The first player is fast. He sees one nice move. He plays immediately. Sometimes that works. Sometimes it loses the queen five moves later. This is the ordinary LLM pattern. It produces the quick guess. The quick guess means direct generation. Token after token. No deliberate pause. No explicit search.

Now imagine a grandmaster. She does not play instantly. She pauses. She checks three candidate moves. She imagines replies. She discards one line. She returns to another line. Only then does she play. This pause is what a reasoning model tries to buy.

We will use five names all week.

- the quick guess = direct generation
- the thinking pause = chain-of-thought
- the move tree = search or exploration
- the backtrack = self-correction after a bad path
- the time budget = how much compute you allow

The grandmaster is not magically smarter in every situation. If you ask, "What is 2 + 2?", the grandmaster wastes time. If you ask, "Plan a three-city trip under budget, with visa rules, layovers, and dates," the pause matters. Here is the chessboard picture.

question
  |
  |--> quick guess --> answer now
  |
  |--> thinking pause
         |
         |--> move A --> bad future --> backtrack
         |--> move B --> promising
         |--> move C --> violates rule
         |
         ---> best line --> final answer

That extra thinking has a cost. More tokens. More time. More money. So the real question is not, "Are reasoning models better?" The real question is, "When is the extra pause worth paying for?"

A worked mini-example. Question: You have ₹300. You buy 3 pens costing ₹40 each. Then a notebook costing ₹90. The shop gives a 10% discount on the total bill. How much do you pay?

A quick guess often does this badly. It discounts only the notebook. Or it forgets to add the pens first. A reasoning model is more likely to do this.

pens: 3 × 40 = 120
notebook: 90
subtotal: 210
discount: 10% of 210 = 21
pay: 210 - 21 = 189

Same knowledge. Different use of compute. This whole module is about that difference.

Retrieval prompt 1. Without scrolling, define the five named placeholders: the quick guess, the thinking pause, the move tree, the backtrack, and the time budget.

Chapter 1: The opening failure

1.1 The confidently wrong answer

You ask GPT-4 a multi-step math problem. It gets step 3 wrong. The final answer is confidently wrong. You ask o1. It gets the problem right. But it costs roughly 10x more. And it takes 30 seconds. This scene is not rare. The standard model looks fluent. The reasoning model looks slow. The slow one wins on the hard case. That shocks teams the first time. They assume model quality is one number. It is not. Quality depends on task type. And reasoning depth. And latency tolerance. And budget.

Let us make the failure concrete. Question: A bill is ₹3000. First apply a 20% discount. Then apply 18% GST. Then subtract a ₹120 coupon. Finally split the amount equally among 3 friends. How much does each friend pay?

Correct calculation:

3000 × 0.80 = 2400
2400 × 1.18 = 2832
2832 - 120 = 2712
2712 / 3 = 904

Correct answer: ₹904 each. A fast standard output may look like this.

20% of 3000 is 600, so 3000 - 600 = 2400.
Subtract coupon: 2400 - 120 = 2280.
Add 18% GST: 2280 × 1.18 = 2690.4.
Split by 3: 896.8.
Answer: ₹896.80 each.

See the problem. The steps are fluent. The ordering is wrong. The confidence is still high. A reasoning-oriented output is more likely to do this.

The order matters.
Discount first: 3000 × 0.8 = 2400.
GST after discount: 2400 × 1.18 = 2832.
Coupon after tax: 2832 - 120 = 2712.
Split across 3: 2712 / 3 = 904.
Answer: ₹904 each.

The second answer is not better because it is longer. It is better because it kept state. It preserved order. It effectively used a scratchpad.
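
The ordering effect can be checked mechanically. Here is a minimal sketch, not taken from any model's output. The function names `per_person_correct` and `per_person_wrong` are made up for illustration, and `Decimal` is used only so the rupee amounts stay exact.

```python
from decimal import Decimal


def per_person_correct(bill, discount, gst, coupon, people):
    """Discount, then GST, then coupon, then split (the order the question states)."""
    after_discount = bill * (1 - discount)
    after_gst = after_discount * (1 + gst)
    after_coupon = after_gst - coupon
    return after_coupon / people


def per_person_wrong(bill, discount, gst, coupon, people):
    """Same numbers, but the coupon is subtracted before GST (the fluent mistake)."""
    after_discount = bill * (1 - discount)
    after_coupon = after_discount - coupon
    after_gst = after_coupon * (1 + gst)
    return after_gst / people


# Values from the worked example above.
args = (Decimal("3000"), Decimal("0.20"), Decimal("0.18"), Decimal("120"), 3)
correct = per_person_correct(*args)
wrong = per_person_wrong(*args)
print(correct, wrong)
```

Same inputs. Different order. The gap is ₹7.20 per friend.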

1.2 Why standard LLMs fail on hard chains

A standard LLM is excellent at local next-token prediction. That means it is very good at writing the next plausible sentence. But hard reasoning is not only local plausibility. Hard reasoning asks for global consistency. A single wrong step can poison every later step. In math, one arithmetic slip kills the result. In code, one wrong assumption kills the patch. In planning, one missed constraint ruins the itinerary. The model can still sound convincing. That is the dangerous part. Fluency and correctness are correlated. But they are not the same thing. The quick guess works beautifully on easy tasks. It also works on tasks where the answer pattern is common. But when the task needs branching, checking, or revising, direct generation becomes brittle. Think like this. Direct generation is a smooth road. Reasoning is a road with forks. If you drive fast through the forks, you miss the right exit.

1.3 A second worked example — coding, not math

Suppose you ask: "Why does this SQL query duplicate rows after joining orders and order_items? Give the minimal fix." A quick model may say:

Use DISTINCT.
That removes duplicates from the final result.

This sounds helpful. It is often wrong. DISTINCT hides the symptom. It does not fix the join logic. A reasoning model is more likely to ask:

What is the grain of each table?
orders: one row per order.
order_items: many rows per order.
If you join and then aggregate later, row multiplication happens.
Minimal fix:
aggregate order_items first or group by order_id before the join.

Again, the improvement is not poetry. It is structure. The model paused long enough to inspect the grain.
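
The grain problem is easy to reproduce. Below is a small sketch using Python's built-in `sqlite3` module. The schema, the `shipping_fee` column, and the sample rows are invented for illustration. They are not from any real system.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, shipping_fee INTEGER);
    CREATE TABLE order_items (order_id INTEGER, price INTEGER);
    INSERT INTO orders VALUES (1, 50), (2, 30);
    INSERT INTO order_items VALUES (1, 100), (1, 200), (2, 400);
    """
)

# Naive join: order 1 appears once per item, so its fee is summed twice.
naive = conn.execute(
    """
    SELECT SUM(o.shipping_fee)
    FROM orders o
    JOIN order_items i ON i.order_id = o.order_id
    """
).fetchone()[0]

# Fix: collapse order_items to one row per order before joining.
fixed = conn.execute(
    """
    SELECT SUM(o.shipping_fee)
    FROM orders o
    JOIN (
        SELECT order_id, SUM(price) AS items_total
        FROM order_items
        GROUP BY order_id
    ) t ON t.order_id = o.order_id
    """
).fetchone()[0]

print(naive, fixed)
conn.close()
```

The naive query counts order 1's fee once per item. The fixed query restores the one-row-per-order grain first, so each fee is counted once.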

1.4 Why this becomes a Lead-level decision

A junior engineer asks, "Which model is most accurate?" A Lead asks, "Which task deserves which model under this SLA and cost target?" That is a completely different question. If you run a reasoning model on every autocomplete, you burn money for no gain. If you never use a reasoning model, hard tasks fail silently. If you route badly, you get the worst of both worlds. High cost. Slow product. Still wrong on difficult tasks. So this module is not only about model families. It is about judgment. The judgment has four parts.

1. Estimate task difficulty.
2. Estimate business cost of mistakes.
3. Estimate latency tolerance.
4. Route to the cheapest model that meets the bar.

That sentence is a Lead sentence. Memorize it.

1.5 The three business axes

Every reasoning decision sits inside a triangle.

            quality
              /\
             /  \
            /    \
           /      \
          /        \
     cost ---------- latency

You rarely maximize all three. A more reasoning-heavy model usually improves hard-task quality. But it usually hurts cost. And usually hurts latency. Sometimes that is fine. Batch analysis? Use more thinking. Medical summarization with human review? Use more thinking. Live typing suggestions? Do not use more thinking.

1.6 Failure pattern checklist

When a fast model fails on a reasoning task, the failure usually looks like one of these.

- it skips a step
- it applies steps in the wrong order
- it loses a constraint midway
- it never checks its own answer
- it follows one bad branch too long
- it gives a plausible but inconsistent explanation

Those six patterns appear again and again. Keep them in your head. We will keep mapping fixes to failures.

Chapter 2: Chain-of-thought prompting

2.1 The quick guess and the thinking pause

The simplest fix to the quick guess is very old. Do not ask for the answer immediately. Ask the model to think step by step. That request is called chain-of-thought prompting, usually shortened to CoT. The idea is simple. Visible intermediate steps can force decomposition. Instead of jumping from question to answer, the model walks through smaller moves. This is the thinking pause. A very common zero-shot version is:

Let's think step by step.

A few-shot version shows examples. Example format:

Question: ...
Reasoning: ...
Answer: ...

Then the new question follows. So CoT is not a new model architecture. It is a prompting pattern. That distinction matters.

2.2 Zero-shot CoT

Zero-shot CoT means no demonstrations. You only add an instruction. For example:

Solve the problem.
Think step by step.
Then give the final answer clearly.

Why does this sometimes help? Because the instruction changes the trajectory. The model is nudged toward decomposition. It allocates more output tokens to intermediate state. It becomes less likely to compress too early. A worked probability example. Question: A box has 5 red, 4 blue, and 3 green balls. Two balls are drawn without replacement. What is the probability that both balls have the same color? A quick direct answer often does this.

(5/12)^2 + (4/12)^2 + (3/12)^2
= 25/144 + 16/144 + 9/144
= 50/144
= 25/72

The arithmetic is neat. The logic is wrong. Why wrong? Because the draws are without replacement. Zero-shot CoT is more likely to do this.

Without replacement means combinations are cleaner.
Ways to choose 2 red: C(5,2) = 10.
Ways to choose 2 blue: C(4,2) = 6.
Ways to choose 2 green: C(3,2) = 3.
Favourable ways = 10 + 6 + 3 = 19.
Total ways to choose any 2 from 12 = C(12,2) = 66.
Probability = 19/66.

Correct answer: 19/66 ≈ 0.288. The same base model can change its behavior substantially with this one prompt tweak.
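
Both answers can be reproduced exactly. This is a small check using `fractions.Fraction` and `math.comb` from the standard library. The variable names are illustrative.

```python
from fractions import Fraction
from math import comb

counts = {"red": 5, "blue": 4, "green": 3}
total = sum(counts.values())

# Without replacement: count unordered same-color pairs over all unordered pairs.
same_color_pairs = sum(comb(n, 2) for n in counts.values())
p_without_replacement = Fraction(same_color_pairs, comb(total, 2))

# With replacement (the quick-guess mistake): square each color's share.
p_with_replacement = sum(Fraction(n, total) ** 2 for n in counts.values())

print(p_without_replacement, p_with_replacement)
```

The two numbers differ because the second draw depends on the first. The quick guess silently assumed it did not.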

2.3 Few-shot CoT

Few-shot CoT goes one step further. You do not just say, "Think step by step." You demonstrate what good reasoning looks like. That helps in two ways. First, it activates the right pattern. Second, it sets the desired granularity. For example, if your demonstrations explicitly track units, the model is more likely to track units later. If your demonstrations check constraints before answering, the model is more likely to do that too. A tiny template.

Example 1
Question: If 4 machines make 40 parts in 5 minutes, how many parts do 2 machines make in 10 minutes?
Reasoning:
4 machines make 40 parts in 5 minutes.
So 1 machine makes 10 parts in 5 minutes.
In 10 minutes, 1 machine makes 20 parts.
So 2 machines make 40 parts.
Answer: 40
Example 2
Question: ...
Reasoning: ...
Answer: ...
Now solve:
Question: ...
Reasoning:

Few-shot CoT often beats zero-shot CoT on tasks with stable patterns. Math. Symbolic logic. Rate problems. Certain code transforms. But it has a cost. Longer prompt. Higher token usage. Potential overfitting to the demonstration style.
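
Assembling such a prompt is plain string formatting. The sketch below follows the template above. The function name `build_few_shot_cot_prompt` and the dictionary keys are assumptions made for this example, not part of any library.

```python
def build_few_shot_cot_prompt(examples, question):
    """Format worked examples as Question / Reasoning / Answer, then append the new question."""
    parts = []
    for i, ex in enumerate(examples, start=1):
        parts.append(
            f"Example {i}\n"
            f"Question: {ex['question']}\n"
            f"Reasoning:\n{ex['reasoning']}\n"
            f"Answer: {ex['answer']}\n"
        )
    # End on "Reasoning:" so the model continues with steps, not a bare answer.
    parts.append(f"Now solve:\nQuestion: {question}\nReasoning:")
    return "\n".join(parts)


examples = [
    {
        "question": "If 4 machines make 40 parts in 5 minutes, "
        "how many parts do 2 machines make in 10 minutes?",
        "reasoning": "4 machines make 40 parts in 5 minutes.\n"
        "So 1 machine makes 10 parts in 5 minutes.\n"
        "In 10 minutes, 1 machine makes 20 parts.\n"
        "So 2 machines make 40 parts.",
        "answer": "40",
    }
]
prompt = build_few_shot_cot_prompt(examples, "A hypothetical new rate problem goes here.")
print(prompt)
```

Every demonstration adds prompt tokens to every request. That is the cost side of the trade.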

2.4 Why CoT works, at least partially

We do not have a single complete theory. But we have useful hypotheses.

Hypothesis 1: inference-time compute helps

CoT gives the model more room to work. If the model emits ten intermediate lines, it effectively uses more test-time compute than a two-line answer. This matters because some tasks are compute-limited at inference. The model has the knowledge. But it needs time to unpack it.

Hypothesis 2: decomposition reduces search difficulty

A hard problem may be impossible in one jump. It becomes easy when split into five subproblems. CoT encourages that split. Humans do the same thing. We write on rough paper. Not because the formula is unknown. But because working memory is limited.

Hypothesis 3: explicit state reduces drift

Reasoning steps keep temporary facts visible. If the model writes, "Subtotal = 2400," that value becomes part of the immediate context. This is better than asking the hidden activations alone to preserve it.

Hypothesis 4: pattern completion loves formats

LLMs are pattern machines. If the pattern is "question -> reasoning -> answer," the model often completes it well. Especially if the examples are clean.

Hypothesis 5: self-check becomes possible

Once there are intermediate steps, another pass can inspect them. Or the same model can inspect them. That enables consistency checks.

2.5 CoT is not magic

The phrase "let's think step by step" became famous for a reason. But please do not turn it into superstition. It is not a universal spell. If the base model is too weak, CoT may produce longer nonsense. If the task is trivial, CoT only adds delay. If the task is format-sensitive, CoT may break the schema. If the prompt encourages over-elaboration, the model may hallucinate confident steps. So the right mental model is not, "CoT makes the model smarter." The better mental model is, "CoT buys extra structured computation at inference time." Sometimes that extra compute pays off. Sometimes it does not.

2.6 When CoT helps

CoT usually helps most when the task has these properties.

- multiple dependent steps
- hidden intermediate variables
- constraint checking
- arithmetic or symbolic manipulation
- planning with branches
- multi-hop question answering
- code debugging with causal reasoning

Examples:

- word problems
- scheduling
- SQL debugging
- contract clause comparison
- multi-step support triage
- policy reasoning with exceptions

A good rule: if losing one intermediate variable would ruin the answer, CoT may help.

2.7 When CoT hurts

CoT can hurt in at least five ways.

Hurt 1: extra latency

More output means more time. That alone can violate product constraints.

Hurt 2: extra cost

Visible steps use tokens. Tokens cost money. On high-volume systems, that matters immediately.

Hurt 3: wrong but detailed answers

A short wrong answer is bad. A long wrong answer is worse, because humans trust detail.

Hurt 4: formatting failures

If you need strict JSON, CoT can produce prose before the JSON. That breaks downstream systems.

Hurt 5: information leakage and safety complications

In some products, you do not want to reveal all intermediate reasoning. You may want concise answers. Or redacted rationale. Or policy-safe summaries. This is one reason hidden CoT became attractive.

2.8 CoT versus self-consistency

A useful extension is self-consistency. Instead of one chain, sample several chains. Then take the most common final answer.

question
  |
  |--> chain 1 --> answer A
  |--> chain 2 --> answer A
  |--> chain 3 --> answer B
  |--> chain 4 --> answer A
  |
  ---> majority answer = A

This often improves reliability on math and logic. Why? Because one path may stumble. Several independent paths can average out mistakes. But again, cost rises quickly.
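
The voting step is only a few lines. This is a minimal sketch: `sample_answer` stands in for one sampled chain-of-thought call that returns a final answer, and the canned answers mirror the diagram above. Both are assumptions for illustration.

```python
from collections import Counter


def self_consistency(sample_answer, question, n_samples=5):
    """Sample several chains, then return the most common final answer and its vote share."""
    answers = [sample_answer(question) for _ in range(n_samples)]
    winner, votes = Counter(answers).most_common(1)[0]
    return winner, votes / n_samples


# Stub sampler replaying the four chains from the diagram (not a real model call).
canned = iter(["A", "A", "B", "A"])
answer, agreement = self_consistency(lambda q: next(canned), "toy question", n_samples=4)
print(answer, agreement)
```

The vote share doubles as a rough agreement signal. Every extra sample is another full generation, which is where the cost comes from.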

2.9 CoT versus Tree of Thoughts

CoT is usually a single path. Tree of Thoughts, or ToT, allows branching. Instead of, "follow one reasoning line," you do, "explore several candidate lines." This matters when early branching decisions are hard. Example:

- choose an approach
- estimate feasibility
- abandon a dead end
- try another path

That is much closer to deliberate search. We will revisit this as the move tree in Chapter 3.

2.10 CoT versus reasoning models

Here is the core distinction. CoT prompting is an external instruction. Reasoning models are trained to use internal reasoning patterns more reliably. So CoT is a prompt-time intervention. Reasoning models are a model-time intervention. That is why a reasoning model may outperform a standard model, even if both are given the same CoT prompt. The reasoning model has been optimized for longer-horizon problem solving. Not just nudged.

2.11 A tiny comparison table

| Method | What changes | Typical upside | Typical downside |
| --- | --- | --- | --- |
| Direct prompting | nothing | fastest | brittle on hard chains |
| Zero-shot CoT | add instruction | cheap accuracy boost | verbose, inconsistent |
| Few-shot CoT | add demonstrations | stronger patterning | long prompt, brittle demos |
| Self-consistency | sample many chains | more robust final answer | expensive |
| Reasoning model | change the model itself | best on hard tasks | highest cost and latency |

### 2.12 Practical prompting advice
For standard models, use CoT carefully. Ask for steps when the task needs steps. For reasoning models, do not overstuff the prompt. Say the goal. State constraints clearly. Ask for the output format.
Then let the model use its own thinking budget. Many teams make this mistake. They use a reasoning model. Then they add an enormous visible scratchpad template. Then latency explodes. Let the model
think. Do not always micromanage the thinking.
---
> Retrieval prompt 2.
> Explain the difference between zero-shot CoT,
> few-shot CoT,
> self-consistency,
> and a native reasoning model.
> One sentence each.
---
## Chapter 3: Reasoning model architectures
### 3.1 From prompting tricks to trained reasoning
CoT prompting showed an important fact. Inference-time computation can improve results. Once that became clear, labs asked a bigger question. What if we train models to reason more deliberately by
default? That question led to modern reasoning models. The exact internal recipes are partly proprietary. So be careful. Speak in high-level mechanisms, not fake certainty. At a high level, modern
reasoning models seem to combine several ideas.
- internal or hidden scratchpads
- longer test-time compute
- reinforcement learning on hard problems
- search over candidate reasoning paths
- verification before final answer
That list is enough for an interview answer.
### 3.2 Standard LLM pipeline versus reasoning pipeline
A standard pipeline looks like this.

    prompt
      |
      ---> next token ---> next token ---> next token ---> final answer

A reasoning pipeline looks more like this.

    prompt
      |
      ---> internal reasoning state
             |
             |--> candidate path 1
             |--> candidate path 2
             |--> candidate path 3
             |
             ---> check / score / revise
                    |
                    ---> final answer

Notice the difference. The standard pipeline is mostly linear. The reasoning pipeline has internal branching and evaluation. Not always literally like a full search tree. But functionally that is the
shift.
### 3.3 Hidden chain-of-thought
One visible trend is hidden CoT. The model may do internal reasoning, but the user only sees the final answer or a short summary. Why hide it? Several reasons. First, safety. Full raw traces may
reveal unsafe patterns, spurious reasoning, or training artifacts. Second, product design. Users often want the answer, not pages of scratch work. Third, training freedom. If the model knows its
private workspace is not user-facing, it may reason more flexibly. But this creates a major downside. We cannot fully audit hidden CoT. We can audit outcomes. We can audit summaries. We can run
verifiers. But the exact internal path is not visible. That becomes important in Chapter 5.
### 3.4 The o1 / o3-style picture
Again, speak carefully here. Publicly, the high-level story is this. These systems appear to use more deliberate test-time compute, and they are trained to perform better on difficult reasoning
tasks. The common mental model is:
1. spend more compute on hard tasks
2. maintain internal scratch work
3. use reinforcement learning or outcome-based training on hard problems
4. prefer trajectories that lead to correct final answers
5. sometimes search and verify before responding
You do not need the exact secret sauce to understand the product trade-off. The important thing is this. The model is not merely chatting longer. It is optimized to use longer reasoning trajectories
effectively. That is why it can beat a standard model with the same prompt.
### 3.5 Reinforcement learning on hard problems
Why bring reinforcement learning into reasoning? Because next-token imitation is not enough for hard search. Hard reasoning needs credit assignment across long sequences. Suppose a math problem has 40
steps. A model may do 39 steps correctly. Then fail at the last step. Or it may choose a bad approach at step 2, and everything after that looks polished but doomed. Training only on final text
imitation does not strongly teach, "explore better branches." RL-like methods can push the model toward better trajectories. Especially when the task has a verifiable outcome. Math answer matches.
Code passes tests. Constraint-satisfying plan works. That gives a training signal. In plain words: reward the lines of thinking that end correctly, and penalize the lines that do not.
### 3.6 Search, verification, and the move tree
Now we meet the move tree properly. Imagine a model solving a puzzle. There may be three plausible first approaches.
- algebraic manipulation
- case split
- working backward
A direct model often picks one and commits. A reasoning model is more likely to do something like this.

    start
    ├─ path A: algebra
    │  ├─ step 1 looks fine
    │  ├─ step 2 creates contradiction
    │  └─ backtrack
    ├─ path B: case split
    │  ├─ case 1 works
    │  ├─ case 2 works
    │  └─ promising
    └─ path C: guess formula
       ├─ fast
       └─ unverifiable

This is search. Search means we do not trust the first pleasant path. We inspect alternatives. We may score them. We may prune them. We may return to a better branch. That is the backtrack.
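
Pruning and backtracking are easier to see on a toy problem. The sketch below runs a depth-first search for a subset of positive integers that hits a target sum. The puzzle and the function name `search` are chosen here only for illustration. This is classic backtracking, not a description of what any reasoning model does internally.

```python
def search(numbers, target, start=0, path=None):
    """Depth-first search for a subset of positive ints in `numbers` summing to `target`."""
    path = [] if path is None else path
    if target == 0:
        return list(path)  # every constraint is met: a verified line
    for i in range(start, len(numbers)):
        n = numbers[i]
        if n > target:
            continue  # prune: this move already overshoots the target
        path.append(n)  # commit to a move
        found = search(numbers, target - n, i + 1, path)
        if found is not None:
            return found
        path.pop()  # backtrack: undo the move and try the next branch
    return None  # no branch from this state works


solution = search([8, 6, 7, 5, 3], 16)
print(solution)
```

The search first commits to 8 then 6, finds a dead end, backs up, tries 8 then 7, backs up again, and only then finds a line that works.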
### 3.7 Verification changes everything
Search without verification can still wander. Verification is what grounds search. A verifier can be external or internal. External verifier examples:
- run the unit tests
- execute the SQL
- check the arithmetic
- validate a JSON schema
- confirm constraints are satisfied
Internal verifier examples:
- consistency check
- critic model
- self-evaluation head
- scoring rubric
A simple loop looks like this.

    draft answer
      |
      ---> verifier
             |
             |--> pass --> return
             |
             |--> fail --> revise / search again

This loop is very powerful. In coding, it is almost magical. Because code has cheap external verification. Tests either pass or fail. Reasoning models plus tools plus verification create big
capability jumps.
### 3.8 A code example of search plus verification
User asks: "Write a Python function that merges overlapping intervals." A quick model may produce code that works on easy cases. But it may fail on:
- nested intervals
- unsorted input
- equal boundaries
A reasoning model can do better because it can internally do something like this.

    candidate 1: sort and sweep
      check on nested case
      passes
    candidate 2: pairwise merge without sorting
      fails on unsorted input
    candidate 3: recursive version
      works but overcomplicated

Then it picks candidate 1. The user only sees clean code. Behind the scenes, the model effectively searched and filtered.
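
The same filter can be written explicitly. This is a hedged sketch: `merge_sorted_sweep` plays candidate 1, and `merge_without_sorting` is a simplified stand-in for candidate 2 (a sweep that trusts input order, not a literal pairwise merge). Treating touching intervals such as `[1, 2]` and `[2, 3]` as overlapping is an assumption made here.

```python
def merge_sorted_sweep(intervals):
    """Candidate 1: sort by start, then sweep and extend the last merged interval."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return merged


def merge_without_sorting(intervals):
    """Stand-in for candidate 2: same sweep, but it trusts the input order."""
    merged = []
    for start, end in intervals:
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return merged


# The three edge cases listed above act as the external verifier.
TEST_CASES = [
    ([[1, 10], [2, 3]], [[1, 10]]),                  # nested intervals
    ([[8, 10], [1, 3], [2, 6]], [[1, 6], [8, 10]]),  # unsorted input
    ([[1, 2], [2, 3]], [[1, 3]]),                    # equal boundaries
]


def passes_all(candidate):
    return all(candidate(given) == expected for given, expected in TEST_CASES)


candidates = [merge_without_sorting, merge_sorted_sweep]
chosen = next(c for c in candidates if passes_all(c))
print(chosen.__name__)
```

The unsorted case is what rejects the stand-in for candidate 2. The verifier picks the winner, not the order of the candidate list.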
### 3.9 DeepSeek-R1 and the open ecosystem
DeepSeek-R1 matters for one big reason. It made reasoning feel less mystical. Before open reasoning models, people could say, "Only a frontier closed lab can do this." Open releases changed that
conversation. The open ecosystem showed that reasoning-like behaviour can be reproduced, distilled, studied, and improved outside a single provider. DeepSeek-R1 is important as an open reference
point. Not because it solved every question. But because it widened the design space. It also forced engineers to think comparatively. Closed reasoning models versus open reasoning models. Best
accuracy versus deployability. API convenience versus weight control. That is a healthy shift.
### 3.10 Distillation and smaller reasoning-capable models
A common trick is distillation. Let a strong reasoning model generate high-quality solutions. Then train a smaller model on those solutions. This does not perfectly transfer the deep capability. But
it can transfer a surprising amount. So the ecosystem often looks like this.

    very strong slow model
      |
      ---> generate good trajectories / answers
             |
             ---> train smaller cheaper model

This matters in production. You may use the expensive model offline. Then serve the cheaper distilled model online.
### 3.11 Test-time compute scaling
This is the central concept. Training-time scaling asks: what happens if I train a bigger model on more data? Test-time compute scaling asks: what happens if I let the model think longer at inference
time? The reasoning era made test-time compute a first-class knob. You can think of it like this.

    same question
      |
      |--> 1 second budget  --> quick draft
      |--> 5 second budget  --> some checking
      |--> 30 second budget --> branching + verification

More time can improve hard-task accuracy. But only if the model knows how to use the time. That last clause is crucial. A weak model with more time may just ramble longer. A trained reasoning model
uses the extra budget more productively.
### 3.12 Why test-time compute resembles search
Every extra token can do one of several jobs.
- expand an intermediate step
- try a different approach
- verify a claim
- restate constraints
- recover from an earlier mistake
So the extra compute is not just verbosity. It can act like selective search over solution space. That is why analogies from chess are useful. More search depth can beat shallow pattern recall.
### 3.13 Diminishing returns still apply
Do not think longer is always better. There is usually a curve like this.

    accuracy
      ^
      |                 ______
      |              __/
      |           __/
      |        __/
      |     __/
      +------------------------> compute budget

Early extra compute helps a lot. Later extra compute helps less. Sometimes later extra compute hurts, because the model starts overthinking. So budget control matters.
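
One way to see the flattening curve is a toy model. Assume each sampled chain is independently correct with probability 0.6, the answer is simply right or wrong, and the final answer is a majority vote over an odd number of chains. Those are simplifying assumptions made here, not measurements of any real model.

```python
from math import comb


def majority_vote_accuracy(p_correct, n_samples):
    """P(more than half of n independent samples are correct), for odd n."""
    need = n_samples // 2 + 1
    return sum(
        comb(n_samples, k) * p_correct**k * (1 - p_correct) ** (n_samples - k)
        for k in range(need, n_samples + 1)
    )


budgets = [1, 3, 5, 7, 9]  # number of sampled chains, a proxy for compute
accuracies = [majority_vote_accuracy(0.6, n) for n in budgets]
gains = [later - earlier for earlier, later in zip(accuracies, accuracies[1:])]

for n, acc in zip(budgets, accuracies):
    print(f"{n} samples -> {acc:.3f}")
```

Under these assumptions, accuracy rises from 0.600 with one sample to 0.648 with three, and each further pair of samples buys less than the pair before. The toy model only captures the flattening. It does not capture the overthinking case where extra compute hurts.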
### 3.14 Overthinking is a real failure mode
A reasoning model can turn a simple task into a long detour. Ask for a filename extraction. It may discuss edge cases. Ask for a capital city. It may ramble about history. This is not intelligence. It
is misallocated compute. So the time budget must be tied to task value.
### 3.15 A compact architecture comparison

| Idea | Standard model | Reasoning model |
| --- | --- | --- |
| Default response style | direct answer | deliberate reasoning |
| Intermediate state | mostly implicit | often richer and longer |
| Branch exploration | weak | stronger |
| Self-correction | limited | more common |
| Test-time compute | low and fixed | adjustable and higher |
| Best use case | easy or high-volume tasks | hard multi-step tasks |

### 3.16 Do not confuse reasoning with long outputs
This mistake is everywhere. Long answer does not mean deep reasoning. Short answer does not mean shallow reasoning. A good reasoning model may think privately and answer briefly. A bad prompt may make
a standard model output two pages of fake steps. So measure reasoning by improved task performance, not by visible verbosity.
---
> Retrieval prompt 3.
> In your own words,
> explain hidden CoT,
> search plus verification,
> and test-time compute scaling.
> One paragraph total.
---
## Chapter 4: When to use reasoning models
### 4.1 Start with a task complexity spectrum
Not every task deserves the grandmaster. Put tasks on a spectrum.

    easy ------------------------------------------------------------ hard
    spell fix | extract field | classify email | SQL debug | plan trip | prove result

On the left, the quick guess is enough. On the right, the thinking pause pays for itself. A practical table helps.

| Task type | Default choice | Why |
| --- | --- | --- |
| Rewrite sentence | fast standard model | low reasoning depth |
| Summarize transcript | fast or medium model | mostly compression |
| Extract entities to JSON | fast model + schema check | reasoning is small |
| Generate SQL from messy request | standard first, reason if failed | hidden constraints matter |
| Debug failing code | reasoning model | multi-step causal analysis |
| Long-horizon agent planning | reasoning model | branching and backtracking matter |
| Safety or policy adjudication | reasoning model + human | mistakes are costly |

### 4.2 Use the cheapest model that clears the bar
This is the operating principle. Not the smartest model. Not the cheapest model. The cheapest model that clears the quality bar. That sentence keeps budgets sane. To apply it, you need the bar. What
accuracy is acceptable? What failure rate is acceptable? What latency percentile is acceptable? What is the cost of a wrong answer? Without those numbers, model selection becomes vibes.
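
Once the bar is written down, the selection rule is mechanical. The sketch below is illustrative only. The `ModelOption` fields and every number in `OPTIONS` are made-up placeholders, and real values would come from your own evaluation set and billing data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelOption:
    name: str
    accuracy: float          # measured on your own eval set
    p95_latency_s: float     # 95th percentile latency in seconds
    cost_per_request: float  # in whatever currency you budget in


def cheapest_that_clears(options, min_accuracy, max_p95_latency_s):
    """Return the cheapest option meeting both bars, or None if nothing qualifies."""
    eligible = [
        o
        for o in options
        if o.accuracy >= min_accuracy and o.p95_latency_s <= max_p95_latency_s
    ]
    return min(eligible, key=lambda o: o.cost_per_request) if eligible else None


# Hypothetical numbers, for illustration only.
OPTIONS = [
    ModelOption("fast", accuracy=0.82, p95_latency_s=0.5, cost_per_request=0.001),
    ModelOption("medium", accuracy=0.90, p95_latency_s=2.0, cost_per_request=0.005),
    ModelOption("reasoning", accuracy=0.96, p95_latency_s=30.0, cost_per_request=0.05),
]

pick = cheapest_that_clears(OPTIONS, min_accuracy=0.88, max_p95_latency_s=5.0)
print(pick.name if pick else "no model clears the bar")
```

A `None` result is useful too. It says the bar and the latency budget cannot both be met, so one of them has to move.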
### 4.3 Cost-quality trade-off framework
Here is a simple decision frame. Ask four questions.
1. How expensive is a mistake?
2. How frequent is the task?
3. How much latency can the user tolerate?
4. How much quality gain does reasoning actually buy here?
Now map it.
- high-frequency + low-stakes + tight latency -> fast model
- low-frequency + high-stakes + tolerant latency -> reasoning model
- mixed case -> route adaptively
A concrete product example. Autocomplete in an editor. Millions of requests. Sub-second expectation. Mistakes are annoying but cheap. Use the fast model. Now compare contract redlining for legal
review. Lower volume. High stakes. Users accept slower output. Use the reasoning model.
### 4.4 Latency is not just an engineering metric
Latency changes product behaviour. A 500 millisecond assistant feels instant. A 30 second assistant feels like a batch job. So when you choose a reasoning model, you are not only choosing a model. You
are choosing a product mode. Chat mode. Agent mode. Batch mode. Review mode. Those modes have different user expectations.
### 4.5 Routing is the key pattern
Routing means deciding which model handles which request. A simple router might do this.

    incoming task
      |
      ---> fast model
             |
             |--> passes verifier --> ship
             |
             |--> fails verifier --> escalate
                    |
                    ---> reasoning model
                           |
                           |--> passes --> ship
                           |
                           |--> fails --> human review

This pattern is gold. It preserves speed on easy tasks. It preserves quality on hard tasks. It also makes cost more predictable.
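
The cascade above can be sketched in a few lines. Everything model-related here is a stub: `chatty_fast_model` and `strict_reasoning_model` are hypothetical stand-ins for real API calls, and the verifier only checks that the output parses as a JSON object.

```python
import json


def is_json_object(text):
    """External verifier: the whole output must parse as a JSON object."""
    try:
        return isinstance(json.loads(text), dict)
    except json.JSONDecodeError:
        return False


def route(task, fast_model, reasoning_model, verifier):
    """Try the cheap model first and escalate only when the verifier rejects the draft."""
    draft = fast_model(task)
    if verifier(draft):
        return "fast", draft
    draft = reasoning_model(task)
    if verifier(draft):
        return "reasoning", draft
    return "human_review", None


# Stub models with canned outputs (hypothetical, for illustration only).
def chatty_fast_model(task):
    return 'Sure! Here is the result: {"queue": "billing"}'  # prose breaks the parse


def strict_reasoning_model(task):
    return '{"queue": "billing"}'


tier, output = route("classify ticket", chatty_fast_model, strict_reasoning_model, is_json_object)
print(tier, output)
```

In a real system the verifier would be task-specific: unit tests for code, query execution for SQL, a schema check for structured output.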
### 4.6 Cheap first, escalate second
This is the most common production design. Start with the cheap model. Escalate only when needed. How do you decide "needed"? Several signals help.
- low confidence
- long prompt with many constraints
- user asks for planning or debugging
- fast-model output fails a verifier
- previous attempt was unsatisfactory
- retrieved context is large and conflicting
You do not need one perfect hardness score. A few strong heuristics already help.
### 4.7 Routing by task class
One easy router is rule-based. If task class is extraction, use fast model. If task class is math, use reasoning model. If task class is code debugging, use reasoning model. If task class is creative
copy, use standard model. This is simple and often enough for version one. Later, you can learn a router from data.
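
A version-one router of this kind is a lookup table. In the sketch below the task-class labels and the fallback tier are assumptions made for illustration. The four rules themselves are the ones listed above.

```python
# Task class -> model tier, mirroring the four rules above.
ROUTES = {
    "extraction": "fast",
    "math": "reasoning",
    "code_debugging": "reasoning",
    "creative_copy": "standard",
}


def route_by_task_class(task_class, default="standard"):
    """Return the model tier for a task class; unknown classes fall back to `default`."""
    return ROUTES.get(task_class, default)


print(route_by_task_class("math"))
print(route_by_task_class("translation"))  # not in the table, so it uses the default
```

The table is easy to audit and easy to change, which is most of what version one needs.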
### 4.8 Routing by verifier failure
This is often stronger. Instead of guessing difficulty upfront, attempt cheaply. Then verify. If verification fails, escalate. Examples:
- JSON parse failed
- SQL execution failed
- unit tests failed
- arithmetic check failed
- citation support missing
- constraint checklist incomplete
Verification-driven routing is powerful because it is grounded.
### 4.9 Routing by uncertainty
Some systems estimate uncertainty directly. For example:
- model self-rated confidence
- entropy-like signal
- disagreement across samples
- judge model score
- calibrated probability of error
Be careful though. Self-confidence is noisy. Models can be overconfident. So uncertainty signals are useful, but they should not be the only signal.
### 4.10 Worked example — customer support
Task: Classify incoming support tickets into 12 queues. Most tickets are easy. "Reset password." "Update address." "Cancel subscription." A fast model is enough. But some tickets contain multiple
issues. Refund request. Fraud suspicion. Shipping damage. Regulatory complaint. Escalation threat. Those deserve extra care. A router can do this.
- short simple tickets -> fast classifier
- ambiguous or multi-issue tickets -> reasoning classifier
- legal or fraud language -> reasoning + human review
This saves cost without blind risk.
### 4.11 Worked example — analytics assistant
Task: Generate SQL for business users. Easy request: "Show daily signups last 30 days." Fast model can often handle it. Hard request: "Compare conversion before and after campaign launch, excluding
internal users, using first-touch attribution, grouped by week, and flag weeks with incomplete data." That request is full of traps.
- definition of conversion
- attribution rule
- exclusion logic
- date window alignment
- incomplete data handling
A reasoning model is much more justified. Even then, you should execute the SQL and verify.
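"Execute the SQL and verify" can start very small. The sketch below runs a generated query against an empty in-memory SQLite copy of the schema, using Python's standard `sqlite3` module. The schema and queries are invented for illustration. Note the limit: this catches syntax and schema errors only. It cannot tell you whether the attribution rule or the exclusion logic is right.

```python
import sqlite3


def sql_executes(query: str, schema_sql: str) -> tuple[bool, str]:
    """Run the query against an empty in-memory copy of the schema.

    Returns (ok, error_message). Catches syntax and schema errors only,
    not wrong business logic.
    """
    conn = sqlite3.connect(":memory:")
    try:
        conn.executescript(schema_sql)
        conn.execute(query).fetchall()
        return True, ""
    except sqlite3.Error as exc:
        return False, str(exc)
    finally:
        conn.close()


# Hypothetical schema for the "daily signups" example.
SCHEMA = "CREATE TABLE signups (day TEXT, user_id INTEGER);"
ok, err = sql_executes("SELECT day, COUNT(*) FROM signups GROUP BY day", SCHEMA)
```

A failed execution is a natural escalation trigger. The error message can also be fed back to the model on retry.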
### 4.12 Worked example — coding agent
Task: Propose a minimal fix for a failing test suite. If the failure is a one-line import mistake, a fast model can patch it. If the failure spans three modules and hidden invariants, use a reasoning
model. Even better, let the model inspect tests, edit code, and run verification. Reasoning plus tools plus verifier is stronger than raw reasoning alone.
### 4.13 Worked example — research synthesis
Task: Read six policy documents and answer, "What changed in the refund policy between versions?" This is not only retrieval. It is comparison, conflict resolution, and exception tracking. Reasoning
model, absolutely. But still require citations. Reasoning without grounding becomes elegant fiction.
### 4.14 When not to use reasoning models
Do not use them by default for:
- spelling correction
- short paraphrasing
- entity extraction
- routing emails by obvious keywords
- simple FAQ lookup
- strict low-latency autocomplete
- bulk low-stakes transformations
That is like sending a senior architect to rename files. Possible. Not economical.
### 4.15 A budgeting rule of thumb
If the task volume is huge, save reasoning for the hardest slice. Often that hardest slice is only 5 to 20 percent. That small slice creates most of the user pain. So use the expensive model there.
This is how you get leverage.
### 4.16 Cost and value must meet
Suppose the reasoning model costs 8 times more. If it reduces serious failures by 90 percent on a tiny high-value workflow, that is great. If it improves easy summarization by 1 percent on a million
daily requests, that is terrible. Always compare extra spend to error reduction on business-critical tasks.
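The comparison becomes concrete once you compute extra spend per avoided failure. The sketch below uses the 8x multiplier from the text. The per-request base cost (0.01) and the 10 percent baseline failure rate on the high-value workflow are invented numbers, used only to show the arithmetic.

```python
def extra_cost_per_avoided_failure(
    base_cost: float,
    cost_multiplier: float,
    abs_failure_reduction: float,
) -> float:
    """Extra spend per request divided by failures avoided per request."""
    if abs_failure_reduction <= 0:
        raise ValueError("the expensive model must avoid some failures")
    extra_spend_per_request = base_cost * (cost_multiplier - 1)
    return extra_spend_per_request / abs_failure_reduction


# High-value workflow: assumed 10% baseline failures, cut by 90% -> 0.09 absolute.
high_value = extra_cost_per_avoided_failure(0.01, 8, 0.10 * 0.90)

# Easy summarization: 1 percentage point improvement -> 0.01 absolute.
easy_summaries = extra_cost_per_avoided_failure(0.01, 8, 0.01)
```

Under these assumptions, each avoided serious failure costs roughly 0.78 in extra spend, while each avoided summarization miss costs about 7. Request volume cancels out of the ratio, but it multiplies the total bill. The decision then depends on what one failure of each kind is worth to the business.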
### 4.17 Routing is a product and evaluation problem
Teams sometimes think routing is just prompt engineering. It is not. Routing needs:
- task taxonomy
- verifier design
- logging
- cost accounting
- latency monitoring
- fallback behaviour
- human escalation policy
That is why routing feels senior. It touches everything.
### 4.18 A minimal routing scorecard
Track these metrics.
| Metric | Why it matters |
| --- | --- |
| overall task success | end outcome |
| success on hard slice | where reasoning matters |
| average latency | user experience |
| p95 latency | tail pain |
| cost per request | budget control |
| escalation rate | router aggressiveness |
| verifier failure rate | quality signal |
| human escalation rate | operational burden |
Without this scorecard, your routing story is incomplete.
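Every metric in the scorecard can be computed from per-request logs. This is a sketch under assumptions: the `RequestLog` fields are a hypothetical log schema, and p95 uses the simple nearest-rank method.

```python
from dataclasses import dataclass


@dataclass
class RequestLog:
    """Hypothetical per-request log record; field names are illustrative."""
    success: bool
    is_hard: bool
    latency_s: float
    cost: float
    escalated: bool
    verifier_failed: bool
    sent_to_human: bool


def rate(flags: list[bool]) -> float:
    """Share of True values; NaN when the slice is empty."""
    return sum(flags) / len(flags) if flags else float("nan")


def scorecard(logs: list[RequestLog]) -> dict[str, float]:
    if not logs:
        raise ValueError("no logs to score")
    n = len(logs)
    latencies = sorted(r.latency_s for r in logs)
    p95_index = (95 * n + 99) // 100 - 1  # nearest-rank percentile
    return {
        "overall_success": rate([r.success for r in logs]),
        "hard_slice_success": rate([r.success for r in logs if r.is_hard]),
        "avg_latency_s": sum(latencies) / n,
        "p95_latency_s": latencies[p95_index],
        "cost_per_request": sum(r.cost for r in logs) / n,
        "escalation_rate": rate([r.escalated for r in logs]),
        "verifier_failure_rate": rate([r.verifier_failed for r in logs]),
        "human_escalation_rate": rate([r.sent_to_human for r in logs]),
    }
```

The hard-slice number needs an `is_hard` label, which usually comes from a curated eval set rather than live traffic.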
### 4.19 Foundation-gap audit before Module 13
Module 13 will move into image and video models. Before you go there, four concepts from this module must feel natural.
#### Concept 1: test-time compute
Can you explain why letting a model think longer can improve hard-task accuracy? If not, revisit Chapter 3.11 to 3.13.
#### Concept 2: cost-quality trade-off
Can you explain why the best model is not always the best production choice? If not, revisit Chapter 4.2 and 4.3.
#### Concept 3: routing patterns
Can you sketch a cheap-first, escalate-second architecture from memory? If not, revisit Chapter 4.5 to 4.9.
#### Concept 4: evaluation methodology
Can you say how you would prove the routing policy is actually better? If not, Chapter 5 is not optional.

These four ideas transfer directly to multimodal systems. The modality changes. The decision logic does not.
---
> Retrieval prompt 4.
> Sketch a production router from memory.
> Fast model first.
> Reasoning model second.
> Verifier in the loop.
> Human fallback last.
---
## Chapter 5: Limitations and evaluation
### 5.1 Faithfulness of chain-of-thought
Now the uncomfortable part. A reasoning trace can look excellent and still be misleading. Sometimes the model gets the right answer for the wrong reasons. Sometimes it gives a plausible explanation
after arriving at the answer. That is called a faithfulness problem. The visible CoT may not be the true causal process. A toy example. Question: Is 17 prime? Bad reasoning trace:
`17 is odd. It is not divisible by 5. Therefore it is prime.`
Final answer: correct. Reasoning: insufficient. The answer happened to be right. But the justification is not reliable: the same two checks would also call 21 prime. A faithful explanation would rule out every divisor from 2 up to `sqrt(17)`, which means checking 2, 3, and 4. Why does this
matter? Because in production, we often want to trust the process, not just the output. If the explanation is fake, audit becomes hard.
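The gap between the bad trace and a sufficient check is easy to show in code. In this sketch, `bad_trace_says_prime` encodes only the two checks the trace actually made (odd, not divisible by 5). `is_prime_with_trace` does trial division up to the integer square root and records every divisor it tried. Both function names are made up for this example.

```python
from math import isqrt


def bad_trace_says_prime(n: int) -> bool:
    """The logic of the bad trace: odd and not divisible by 5."""
    return n % 2 != 0 and n % 5 != 0


def is_prime_with_trace(n: int) -> tuple[bool, list[int]]:
    """Trial division up to isqrt(n); returns the verdict and divisors checked."""
    if n < 2:
        return False, []
    checked = []
    for d in range(2, isqrt(n) + 1):
        checked.append(d)
        if n % d == 0:
            return False, checked
    return True, checked
```

For 17, both functions say prime, and the full check records `[2, 3, 4]`. For 21, the bad-trace logic still says prime, while the full check stops at divisor 3. Same style of explanation, wrong answer.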
### 5.2 Hidden CoT is even harder to audit
With hidden CoT, the internal reasoning is not visible at all. That has product and safety advantages. But it also means we cannot inspect the exact path. So we must evaluate differently. We lean more
on:
- final answer quality
- external verifiers
- consistency tests
- counterfactual probes
- human review for high-risk tasks
This is why the sentence "hidden CoT is unauditable" is important. Not because the model is useless. Because the evaluation strategy must change.
### 5.3 Benchmark gaming
Benchmarks are helpful. Benchmarks are also dangerous. A model can look brilliant on a benchmark and disappoint users. How does that happen?
#### Problem 1: benchmark contamination
The benchmark or close variants may appear in training data. Then the score overstates reasoning.
#### Problem 2: narrow optimisation
Teams optimise for the published benchmark. The model learns the test style, not the general skill.
#### Problem 3: judge gaming
If an LLM judge scores the output, the candidate model may learn to please the judge. Length. Style. Certain phrases. Not actual correctness.
#### Problem 4: distribution mismatch
Real user tasks are messy. Benchmarks are tidy. Production reasoning often includes missing information, unclear goals, badly written inputs, and mixed constraints. That gap is huge.
### 5.4 Overthinking simple tasks
Reasoning models can absolutely overthink. This failure is underrated. You ask for a two-field JSON extraction. The model gives a mini-essay. You ask for a yes/no answer. The model explores nine edge
cases that do not apply. This hurts in three ways.
- slower response
- higher cost
- more chances to drift off format
So "more reasoning" is not equivalent to "better system." Correct allocation matters more than maximum allocation.
### 5.5 How to evaluate reasoning systems
You need a layered eval plan. One metric is never enough.
#### Layer 1: final outcome accuracy
Did the model solve the task? This is the most important metric. For math, that may be exact answer. For code, tests passing. For SQL, correct result on validation queries. For planning, constraint
satisfaction.
#### Layer 2: verifier pass rate
How often does the answer satisfy cheap automated checks? Examples:
- parses as JSON
- compiles
- cites supporting document
- respects budget limit
- matches schema
Verifiers are crucial because they scale.
#### Layer 3: hard-slice accuracy
Create a curated set of genuinely hard tasks. Not random easy traffic. Hard-slice evals are where reasoning models earn their keep.
#### Layer 4: cost and latency
A model that is 2 points better but 15 times slower may not be a win. So measure:
- average latency
- p95 latency
- token cost
- cost per solved task
Cost per solved task is especially useful. It combines quality with spend.
#### Layer 5: routing metrics
If you use a router, evaluate the router too.
- how often did it escalate?
- did it escalate the right tasks?
- what easy tasks got escalated unnecessarily?
- what hard tasks were mistakenly left on the fast path?
A confusion matrix is helpful.
| Actual difficulty | Routed fast | Routed reasoning |
| --- | --- | --- |
| easy | good if most here | waste if too many here |
| hard | dangerous if many here | desired destination |
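The two off-diagonal cells are the ones to track. Here is a sketch that turns labelled routing records into a wasted-escalation rate and a missed-hard rate. It assumes you have difficulty labels, typically from a hand-labelled sample, and the string labels `"easy"`, `"hard"`, `"fast"`, and `"reasoning"` are illustrative.

```python
from collections import Counter


def routing_confusion(records: list[tuple[str, str]]) -> dict[str, float]:
    """records: (actual_difficulty, route) pairs, e.g. ("easy", "fast")."""
    counts = Counter(records)
    easy_total = counts[("easy", "fast")] + counts[("easy", "reasoning")]
    hard_total = counts[("hard", "fast")] + counts[("hard", "reasoning")]
    return {
        # Easy tasks sent to the expensive path: wasted spend.
        "wasted_escalation_rate": (
            counts[("easy", "reasoning")] / easy_total if easy_total else 0.0
        ),
        # Hard tasks left on the fast path: the dangerous cell.
        "missed_hard_rate": (
            counts[("hard", "fast")] / hard_total if hard_total else 0.0
        ),
    }
```

A router can look fine on average and still have a high missed-hard rate, so report both numbers separately.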
#### Layer 6: human review on high-risk cases
Some tasks cannot be fully automated. Use sampled human review. Especially for:
- legal reasoning
- medical reasoning
- safety-critical decisions
- ambiguous policy applications
#### Layer 7: adversarial and holdout sets
Do not trust only the benchmark you tune on. Keep holdout sets. Create adversarial cases. Mix noisy inputs. Add conflicting instructions. Use edge conditions. That is how you detect brittle reasoning.
### 5.6 Evaluate reasoning models against the right baseline
A common mistake is unfair comparison. Teams compare:
- fast model with bad prompt
against
- reasoning model with perfect prompt
That tells you very little. The right comparison is usually one of these.
- best standard model prompt
versus
- best reasoning model prompt
or
- fast model plus verifier plus retry
versus
- reasoning model plus verifier
Reasoning is part of a system. Not a floating intelligence score.
### 5.7 A worked evaluation example
Suppose you are building an analytics assistant. You compare three setups.
1. fast model direct
2. fast model + zero-shot CoT + SQL execution retry
3. reasoning model + SQL execution retry
Your held-out hard set has 100 tasks. Results might look like this.
| Setup | Success rate | Avg latency | Relative cost |
| --- | --- | --- | --- |
| fast direct | 61% | 1.2s | 1x |
| fast + CoT + retry | 72% | 2.8s | 2.2x |
| reasoning + retry | 86% | 11.5s | 7.8x |
Now the Lead question is not, "Which is highest?" The Lead question is, "Which workflow deserves 11.5 seconds and 7.8x cost?" If this assistant is used by finance analysts for month-end close, maybe
yes. If it is a casual dashboard helper, maybe no.
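The "cost per solved task" metric from Layer 4 makes this table easier to argue about. The sketch below applies it to the illustrative numbers above: relative cost divided by success rate. The dictionary layout is an assumption, and the figures are the hypothetical ones from the table, not measurements.

```python
# Hypothetical results copied from the table in this section.
SETUPS = {
    "fast direct": {"success": 0.61, "latency_s": 1.2, "rel_cost": 1.0},
    "fast + CoT + retry": {"success": 0.72, "latency_s": 2.8, "rel_cost": 2.2},
    "reasoning + retry": {"success": 0.86, "latency_s": 11.5, "rel_cost": 7.8},
}


def cost_per_solved(success_rate: float, rel_cost: float) -> float:
    """Relative cost per attempt divided by the fraction of attempts solved."""
    if success_rate <= 0:
        raise ValueError("success rate must be positive")
    return rel_cost / success_rate


for name, s in SETUPS.items():
    ratio = cost_per_solved(s["success"], s["rel_cost"])
    print(f"{name}: {ratio:.2f}x per solved task")
```

On these numbers the three setups come out at roughly 1.6x, 3.1x, and 9.1x per solved task. The reasoning setup solves the most tasks, and each solved task costs the most. Whether that is acceptable depends on what a failure costs in the workflow.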
### 5.8 Faithfulness eval is different from outcome eval
Sometimes you want to know whether the explanation reflects the actual basis. That is harder. Possible methods include:
- compare explanation against verified supporting evidence
- ask a separate judge whether each step follows
- perturb intermediate facts and see if the explanation changes appropriately
- use tasks with known derivations
- inspect tool traces instead of internal prose
But be humble here. Faithfulness is not solved.
### 5.9 Calibration also matters
A reasoning model should know when not to trust itself. Two answers with the same confidence can have different actual error rates. So calibration matters. Useful signs:
- selective abstention on uncertain cases
- honest uncertainty language
- better alignment between confidence and correctness
Overconfident wrong reasoning is worse than short uncertainty.
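One standard way to measure "alignment between confidence and correctness" is expected calibration error (ECE). The text above does not name a metric, so treat this as one reasonable choice rather than the module's prescribed method. The sketch bins predictions by stated confidence and compares average confidence to actual accuracy in each bin. The bin count of 5 is an assumption.

```python
def expected_calibration_error(
    confidences: list[float],
    correct: list[bool],
    n_bins: int = 5,
) -> float:
    """Weighted average gap between stated confidence and observed accuracy."""
    if not confidences or len(confidences) != len(correct):
        raise ValueError("need equal-length, non-empty inputs")
    bins: list[list[tuple[float, bool]]] = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        index = min(int(conf * n_bins), n_bins - 1)  # confidence 1.0 goes in the top bin
        bins[index].append((conf, ok))
    total = len(confidences)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        avg_conf = sum(c for c, _ in bucket) / len(bucket)
        accuracy = sum(ok for _, ok in bucket) / len(bucket)
        ece += (len(bucket) / total) * abs(avg_conf - accuracy)
    return ece
```

Four answers at 0.9 confidence with three correct give a gap of about 0.15: the model claims 90 percent and delivers 75. A lower value means confidence is a more trustworthy routing signal.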
### 5.10 Common evaluation traps
Avoid these.
- measuring only average accuracy
- ignoring tail latency
- evaluating only easy tasks
- tuning on the same benchmark you report
- trusting self-confidence too much
- equating verbose reasoning with faithful reasoning
- ignoring human review on high-risk cases
### 5.11 Honest admission
We do not fully understand why chain-of-thought works as well as it does. We have good hypotheses. We do not have a complete theory. We also do not have perfect visibility into hidden CoT systems. If
the internal scratchpad is private, we cannot fully audit it. We can measure outcomes. We can build verifiers. We can compare systems. But we cannot pretend the black box disappeared. Also, benchmarks
can overstate progress. A model may improve on GSM8K, MATH, or GPQA, and still fail your real workflow. So be honest in interviews. Say this clearly. "Reasoning models are genuinely useful on hard
tasks. But the mechanism is partly opaque, and evaluation must be outcome-focused plus verifier-driven." That is a trustworthy answer.
### 5.12 What good teams do anyway
Despite the uncertainty, good teams do not freeze. They do practical things.
- curate hard eval sets
- add verifiers
- route intelligently
- cap compute budgets
- log failures
- sample human review
- compare against cheap baselines
- keep holdouts
Engineering is often progress under uncertainty. This is one of those cases.
---
> Retrieval prompt 5.
> Why can visible CoT be unfaithful,
> and why is hidden CoT even harder to audit?
> Answer in four lines.
---
## Chapter 6: Recap and application
### 6.1 The failure-fix chain
Every concept in this module exists because something broke.
| # | Failure | Fix |
| --- | --- | --- |
| 1 | quick guess fails on multi-step tasks | ask for a thinking pause |
| 2 | one chain gets trapped early | explore a move tree |
| 3 | bad branch continues confidently | backtrack and revise |
| 4 | prompting alone is fragile | train native reasoning models |
| 5 | hard tasks need more search | allocate more test-time compute |
| 6 | reasoning everywhere is too expensive | route by difficulty and value |
| 7 | slow answers break UX | respect latency budgets |
| 8 | pretty reasoning can still be fake | evaluate faithfulness separately |
| 9 | benchmarks flatter the system | use holdouts and real-task evals |
| 10 | hidden reasoning cannot be audited directly | rely on outcomes, verifiers, and human review |
| 11 | simple tasks get overthought | keep a cheap fast path |
| 12 | router guesses can be wrong | measure escalation quality and revise |
That table is the spine of the module.
### 6.2 Key points to remember
- reasoning models are not universally better
- they are better on tasks with deeper dependency chains
- CoT is prompt-time compute
- reasoning models are trained to use longer reasoning well, so the capability lives in the model, not the prompt
- search plus verification is often more important than raw verbosity
- test-time compute is a budget knob
- routing is the production pattern that makes reasoning affordable
- evaluation must include quality, cost, latency, and verifier pass rate
- faithfulness is not guaranteed
- hidden CoT improves product design but reduces auditability
### 6.3 Important interview questions
**Q1. What is the difference between CoT prompting and a reasoning model?** CoT prompting asks a standard model to expose intermediate steps. A reasoning model is trained to use longer-horizon reasoning more effectively at inference time. The first is a prompt trick. The second is a model capability.

**Q2. What is test-time compute scaling?** It means allocating more compute during inference, such as longer internal reasoning, branch exploration, or verification, to improve hard-task accuracy. It is the inference analogue of scaling training compute.

**Q3. When should you not use a reasoning model?** Do not use it for low-stakes, high-volume, simple transformations with tight latency budgets. Examples: extraction, paraphrase, basic FAQ lookup, or autocomplete.

**Q4. Why can a chain-of-thought be unfaithful?** Because the visible explanation may be a plausible story after the answer, not the true causal reasoning path. Correct answer does not guarantee correct justification.

**Q5. How would you evaluate a reasoning system in production?** Use a layered approach: hard-set accuracy, verifier pass rate, cost, latency, routing metrics, and sampled human review for high-risk cases.

**Q6. Why is routing such an important design pattern?** Because it lets you preserve fast cheap performance on easy traffic, while reserving expensive reasoning for the slice where it truly pays off.

**Q7. Why is search plus verification so powerful in code tasks?** Because code has cheap external feedback. Tests, compilers, and linters provide objective signals that can guide search and correction.
### 6.4 Production experience, the practical version
In real systems, the fastest model often handles most traffic. The reasoning model handles the painful minority. That minority may still create most of the business risk. So do not obsess over average
task difficulty. Find the expensive failures. That is where reasoning helps. Also, keep the architecture boring where possible. Fast model. Verifier. Escalate. Reasoning model. Verifier again. Human
if needed. This boring loop ships well. Another practical lesson. If the answer can be externally checked, reasoning systems get much better. Code. Math. Schema generation. SQL. Budget-constrained
planning. These are friendly environments for reasoning models. Pure open-ended essay judgement is harder. The verifier is weaker there.
### 6.5 Worked mini playbook for deployment
If you were shipping a reasoning feature next week, I would suggest this order.
1. collect 50 to 100 hard real tasks
2. establish a fast baseline
3. add a verifier wherever possible
4. test CoT on the fast model
5. test a reasoning model on the same set
6. compare quality, latency, and cost per solved task
7. design a cheap-first router
8. monitor escalations and human handoffs after launch
That sequence saves time. It also prevents "reasoning model by default" laziness.
### 6.6 Exercises
Try these without notes.
#### Exercise 1
Take one hard analytics question from your work. Write three versions of the prompt.
- direct answer
- zero-shot CoT
- reasoning model with concise instructions
Compare output quality, latency, and cost.
#### Exercise 2
Find a task where CoT hurts. Use a strict JSON extraction task. Observe whether visible reasoning breaks the schema. Then fix it with a better output contract.
#### Exercise 3
Design a router for a support assistant. State:
- what goes to the fast model
- what escalates
- what the verifier checks
- when humans step in
#### Exercise 4
Create five adversarial reasoning examples. Each should include one hidden trap. Examples:
- order-of-operations trap
- unit conversion trap
- conflicting policy clauses
- off-by-one planning constraint
- SQL grain mismatch
#### Exercise 5
Take a model explanation that sounds convincing. Ask whether every step is actually necessary and sufficient. Separate "correct answer" from "faithful explanation."
#### Exercise 6
Define a time budget policy. For each task class, decide the maximum acceptable latency and why. That is a very interview-friendly exercise.
### 6.7 A compact memory sheet
Memorize these six sentences.
1. A standard model does the quick guess.
2. CoT creates the thinking pause.
3. Search creates the move tree.
4. Self-correction enables the backtrack.
5. Budgeted compute is the time budget.
6. Production success comes from routing plus evaluation.
If those six sentences are internalised, the module is already working.
### 6.8 Foundation-gap audit
Module 13 assumes you can already do four things from memory.
#### Audit question 1
Can you explain test-time compute without using jargon? Target answer: "The model is given more inference budget to explore, check, and revise before finalising the answer."
#### Audit question 2
Can you explain cost-quality trade-offs like a Lead? Target answer: "We should use the cheapest model that clears the business quality bar under the product latency constraint."
#### Audit question 3
Can you draw a routing pattern? Target answer:
`request -> fast model -> verifier -> ship \-> if fail -> reasoning model -> verifier -> ship or human`
#### Audit question 4
Can you describe a minimal eval stack? Target answer: "Hard-set accuracy, verifier pass rate, cost, latency, routing metrics, and human review for risky cases."

If any of those answers feel shaky, re-read this explainer before moving on.
### 6.9 Final honest sentence
Reasoning models are a real step forward. But they are not magic. They are better thought of as systems that spend compute more deliberately on hard tasks. That framing will keep you sane.
### 6.10 Bridge to the next module
Next module — `01_multimodal_vision_systems` — moves from text to vision. The attention and scaling ideas transfer. But images and video add spatial dimensions, temporal coherence, and entirely new failure
modes. Carry forward the four ideas from this week:
- test-time compute
- cost versus quality
- routing patterns
- evaluation discipline
You will need all four again.
---
> Final retrieval prompt.
> Without scrolling,
> explain this whole module to a friend using the chess analogy,
> the routing loop,
> and one sentence on why hidden CoT is hard to audit.