01. Opening Failure — Fluent answers can still lose the plot

~10 min read. Watch a fast model crack on a real production task. Until you feel this break, the rest of the module looks like marketing.

Builds on the ELI5 in 00-eli5.md. The quick guess (direct next-token generation with no scratchpad) is the failure mode every reasoning upgrade tries to fix. We start by watching one fail in slow motion.


The failure shape

A plain LLM is excellent at local continuation. It picks the next token that sounds right. It does not protect the whole chain. So multi-step tasks slip in the middle, not at the end. The output still finishes smoothly because every later sentence is locally plausible. That polish is what fools reviewers.

question ──→ step 1 ──→ step 2 ──→ step 3 ──→ final answer
                  missed constraint
            everything downstream reads clean
            but rides on the broken state

See. The bug is not at the bottom. The bug is upstream. The bottom only inherits the damage. The model writes the wrong final number in a fluent paragraph and a tired engineer ships it.

The quick guess is the friendly name for this pattern: locally fluent, globally fragile.


Why next-token skill is not reasoning

Next-token prediction is local. Reasoning correctness is global. Local fluency asks "what sounds right now?" Global reasoning asks "what still holds at the end?" Related, yes. Identical, no.

A model may know every rule but compose them in the wrong order. It may forget units halfway. It may drop a constraint silently. It may never run a check on its own answer. So what to do? First we stop trusting fluent text as proof. We look for skipped steps, reordered operations, lost constraints, unverified claims. That hunt is where the thinking pause earns its keep.


Worked example: a customer-support refund agent

Picture a refund tool wired into an LLM. The agent sees an order and decides the refund amount. Four real rules apply, in this order:

  1. Eligibility window — refund only if order_date >= today - 30 days.
  2. Subtotal — start from item_total (excludes tax and shipping).
  3. Restocking fee — 15% deduction on opened electronics.
  4. Original payment — refund to the same method, in the original currency; converting to another currency is out of scope for the agent.

The user message:

"Hi, I bought the wireless headphones on April 12, opened them, didn't like them. Paid €189 + €15 shipping + 18% VAT. I'm in Mumbai now and want it back in INR."

A fast model (GPT-4o-mini, Claude Haiku 4.5 without thinking, Gemini Flash without thinking) reads this in one pass and writes a confident reply.

model reply
"You bought the headphones 32 days ago. So eligible.
 Subtotal €189. Restocking fee 15% = €28.35.
 Refund = €189 - €28.35 = €160.65.
 At today's FX rate (₹91/€), refund ≈ ₹14,619."

Sounds tidy. Three things are wrong.

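┌─────────────────────────┬──────────────────────────────────────────────────────┐
│ What the model did      │ What was actually true                               │
├─────────────────────────┼──────────────────────────────────────────────────────┤
│ "32 days ago, eligible" │ 32 days > 30 day window → not eligible. The model    │
│                         │ missed the inequality direction.                     │
├─────────────────────────┼──────────────────────────────────────────────────────┤
│ Refunded subtotal       │ Customer paid VAT; if eligible the tax also refunds. │
│                         │ The model dropped that constraint silently.          │
├─────────────────────────┼──────────────────────────────────────────────────────┤
│ Used "today's rate"     │ Policy is original payment in original currency.     │
│                         │ INR conversion not in scope.                         │
└─────────────────────────┴──────────────────────────────────────────────────────┘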

Every individual sentence reads fluent. The chain is structurally broken in three places. The quick guess got the local pattern right ("compute refund, mention FX") and the global policy wrong.

A reasoning model (o4-mini with reasoning_effort="medium", or Claude Sonnet 4.6 with thinking.budget_tokens=8000) would burn ~3000 hidden tokens enumerating the rules, checking each against the order data, and outputting a single line: "Order is outside the 30-day window. Not eligible for refund. Escalate to manager review." Boring. Correct.
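The rule-by-rule check described above can also be written as plain code. The following is a minimal sketch, not the policy engine of any real system: the `Order` fields, the `decide_refund` function, and the year 2025 are assumptions made for the example. It applies rules 1 to 3 in order and keeps the amount in the original currency. VAT handling and currency conversion are deliberately left out, because they are policy questions this sketch does not try to settle.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from decimal import Decimal

REFUND_WINDOW_DAYS = 30            # rule 1
RESTOCKING_FEE = Decimal("0.15")   # rule 3


@dataclass(frozen=True)
class Order:
    # Hypothetical order record; field names are illustrative only.
    order_date: date
    item_total: Decimal        # excludes tax and shipping (rule 2)
    currency: str              # original payment currency (rule 4)
    opened_electronics: bool


def decide_refund(order: Order, today: date) -> dict:
    """Apply the rules in their stated order and stop at the first failure."""
    # Rule 1: refund only if order_date >= today - 30 days.
    if order.order_date < today - timedelta(days=REFUND_WINDOW_DAYS):
        return {
            "eligible": False,
            "reason": "outside the 30-day window",
            "action": "escalate to manager review",
        }

    # Rule 2: start from the item total.
    amount = order.item_total

    # Rule 3: 15% restocking fee on opened electronics.
    if order.opened_electronics:
        amount -= amount * RESTOCKING_FEE

    # Rule 4: report the amount in the original currency, no conversion here.
    return {
        "eligible": True,
        "amount": amount.quantize(Decimal("0.01")),
        "currency": order.currency,
    }


# The headphones order, assuming "today" is 32 days after April 12.
headphones = Order(date(2025, 4, 12), Decimal("189"), "EUR", True)
print(decide_refund(headphones, today=date(2025, 5, 14)))
```

Running the eligibility gate first is the design point: once rule 1 fails, no later arithmetic runs, so there is no fluent but irrelevant refund figure left to ship.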


The four failure patterns you should label

Every confidently-wrong trace fits one of these. Tag your failures with these labels so you can route the fix.

┌──────────────┬──────────────────────────────────────────────────┐
│ skip         │ skipped a required step entirely                 │
│ reorder      │ applied steps in the wrong order                 │
│ forget       │ dropped a constraint partway through             │
│ never-check  │ wrote the final answer with no self-verification │
└──────────────┴──────────────────────────────────────────────────┘
  • Skip is solved by decomposition prompts — force the model to enumerate.
  • Reorder is solved by template prompts that fix the order — or by structured outputs.
  • Forget is solved by tool calls that re-read the constraint at each step.
  • Never-check is solved by a verifier model, a unit test, or a programmatic schema check.
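The label-to-fix routing above is small enough to write down as data. The sketch below is one possible encoding, with hypothetical names (`Failure`, `FIX_FOR`, `label_step_trace`). It also shows that two of the four labels, skip and reorder, can be detected mechanically when a trace exposes named steps. Forget and never-check need the content of each step, so this sketch does not attempt them.

```python
from __future__ import annotations

from enum import Enum


class Failure(Enum):
    SKIP = "skip"
    REORDER = "reorder"
    FORGET = "forget"
    NEVER_CHECK = "never-check"


# One fix class per label, taken from the list above.
FIX_FOR = {
    Failure.SKIP: "decomposition prompt",
    Failure.REORDER: "template prompt or structured output",
    Failure.FORGET: "tool call that re-reads the constraint",
    Failure.NEVER_CHECK: "verifier model, unit test, or schema check",
}


def label_step_trace(required: list[str], observed: list[str]) -> Failure | None:
    """Label a trace using step names only.

    Returns SKIP if a required step is missing, REORDER if all required
    steps appear but out of order, and None if the names give no evidence.
    """
    if any(step not in observed for step in required):
        return Failure.SKIP
    positions = [observed.index(step) for step in required]
    if positions != sorted(positions):
        return Failure.REORDER
    return None


required = ["eligibility", "subtotal", "restocking_fee", "payment"]
label = label_step_trace(required, ["subtotal", "restocking_fee", "payment"])
if label is not None:
    print(label.value, "->", FIX_FOR[label])
```

A `None` result only means the step names look fine. The trace can still hide a forgotten constraint or a missing check.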

So reasoning is not one fix. It is four fixes for four bug classes. You should know which one bites your workflow.


What a lead engineer actually asks

A junior question is "which model is smartest?" A lead question is "which task class deserves the time budget under this SLA?" That reframe is most of the job.

Four properties of your task tell you the answer:

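┌────────────────────────────────────────────┬────────────────────────────┐
│ Property                                   │ If yes →                   │
├────────────────────────────────────────────┼────────────────────────────┤
│ Dependent steps with hidden middle?        │ Buy the thinking pause     │
│ Multiple valid branches?                   │ Buy the move tree          │
│ Wrong answer reversible?                   │ Stay fast                  │
│ Wrong answer hits compliance/money/safety? │ Add the backtrack verifier │
└────────────────────────────────────────────┴────────────────────────────┘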

That table is the start of every routing decision in Chapter 9. Memorize the four rows.
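As a rough illustration, the four properties can be turned into a checklist function. The names `TaskProfile` and `recommend` are hypothetical, and the sketch simply lists every recommendation whose property is true, in table order. How to resolve a task that is both reversible and high-stakes is a judgment call this code does not make.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskProfile:
    # One boolean per property in the table; field names are illustrative.
    dependent_hidden_steps: bool
    multiple_valid_branches: bool
    reversible: bool
    high_stakes: bool  # compliance, money, or safety


def recommend(task: TaskProfile) -> list[str]:
    """Return the table's recommendation for each property that holds."""
    recommendations = []
    if task.dependent_hidden_steps:
        recommendations.append("thinking pause")
    if task.multiple_valid_branches:
        recommendations.append("move tree")
    if task.reversible:
        recommendations.append("stay fast")
    if task.high_stakes:
        recommendations.append("backtrack verifier")
    return recommendations


# The refund agent: dependent rules, one valid path, money on the line.
refund_agent = TaskProfile(
    dependent_hidden_steps=True,
    multiple_valid_branches=False,
    reversible=False,
    high_stakes=True,
)
print(recommend(refund_agent))
```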


Where this lives in the wild

  • GitHub Copilot Chat — repo Q&A — base GPT-4.1 misses imports across multiple files; the agent mode escalates to o3 or Claude Sonnet 4.6 thinking for multi-file refactors, with measured ~22% lift on internal SWE-bench-style task suites.
  • TurboTax Live Assist — calculation flows — sequential operations (deductions before credits before AMT) are exactly the "reorder" failure class; Intuit reports running cross-checks via a deterministic verifier rather than trusting one LLM pass.
  • Cursor — Composer mode — small edits stay on the fast model (Claude Haiku 4.5 or GPT-4.1 mini); whole-task agent runs route to Claude Sonnet 4.6 with extended thinking on a budget of ~16K reasoning tokens per turn.
  • Perplexity Pro — Deep Research — synthesis over 20+ sources is exactly where a fast model "forgets" a constraint; Perplexity's Deep Research mode spends 3–5 minutes and many hidden reasoning tokens to make room for a verifier pass.
  • Stripe Radar — fraud explanations — the score still comes from a gradient-boosted model, but the explanation generator must reason over feature interactions; mis-ordered explanations cause analyst confusion, so explanation generation is gated through a verifier.

Pause and recall

  1. Why is local fluency not the same as global reasoning correctness?
  2. In the refund example, which rule did the fast model invert, and what was the cost?
  3. Name the four failure-pattern labels and one fix for each.
  4. What four properties of a task should a lead engineer use to decide whether to pay for reasoning?

Interview Q&A

Q: A senior engineer says "we'll just use GPT-5 for everything because it reasons natively." What do you push back on?
A: Cost and latency. GPT-5 with reasoning enabled bills hidden CoT tokens at output rates and adds 5–60 s of latency. For autocomplete, classification, simple summarization, or tool-call argument filling, that spend buys zero quality. The right design is a cascade: cheap model first, escalate on signal (low confidence, long task, multi-file). Blanket escalation triples your bill and breaks your P95.

Common wrong answer to avoid: "We should always use the strongest model for consistency" — consistency without efficiency is bad product design and will fail any SLO conversation.
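The cascade in that answer can be sketched in a few lines. Everything named here is an assumption for illustration: the thresholds, the `should_escalate` and `run_cascade` functions, and the idea that the fast model returns a draft together with some confidence score. The models are passed in as plain callables, so no real API is implied.

```python
from typing import Callable, Tuple

# Hypothetical thresholds; tune them against your own traffic and SLA.
MIN_CONFIDENCE = 0.8
MAX_PROMPT_TOKENS = 4000
MAX_FILES = 1


def should_escalate(confidence: float, prompt_tokens: int, files_touched: int) -> bool:
    """Escalate on any of the three signals: low confidence, long task, multi-file."""
    return (
        confidence < MIN_CONFIDENCE
        or prompt_tokens > MAX_PROMPT_TOKENS
        or files_touched > MAX_FILES
    )


def run_cascade(
    task: str,
    fast_model: Callable[[str], Tuple[str, float]],
    reasoning_model: Callable[[str], str],
    prompt_tokens: int = 0,
    files_touched: int = 1,
) -> Tuple[str, str]:
    """Try the cheap model first; pay for reasoning only when a signal fires."""
    draft, confidence = fast_model(task)
    if should_escalate(confidence, prompt_tokens, files_touched):
        return reasoning_model(task), "reasoning"
    return draft, "fast"


# Stub models standing in for real API calls.
def stub_fast(task: str) -> Tuple[str, float]:
    return f"quick answer to: {task}", 0.95


def stub_reasoning(task: str) -> str:
    return f"checked answer to: {task}"


print(run_cascade("rename a variable", stub_fast, stub_reasoning))
print(run_cascade("refactor the billing module", stub_fast, stub_reasoning, files_touched=4))
```

The confidence score is only one of three signals here on purpose. Task shape (length, number of files) is known before any model runs, so it can route a request without trusting the model's own tone.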

Q: Your reasoning model returned a fluent wrong answer. How do you debug?
A: Three steps. First, inspect the middle of the trace, not the bottom, because the bug is upstream. Second, label the failure: skip, reorder, forget, or never-check. Third, pick the matching fix: decomposition prompt, template, tool-grounding, or verifier. Final-answer-only debugging finds nothing useful in reasoning systems.

Common wrong answer to avoid: "Only the final answer matters in production" — without a middle-of-chain diagnosis you cannot tell whether to swap the prompt, add a tool, or add a verifier. Each fix is for a different bug class.

Q: Why does a model trained on trillions of tokens still drop the unit in step 4?
A: Because next-token loss rewards local plausibility, not global state preservation. The model can know every individual rule and still compose them wrong because the training signal never explicitly penalized losing state across steps. Reasoning models partly fix this by training on RL signals over full chains, not just next-token cross-entropy.

Common wrong answer to avoid: "Bigger pretraining solves reasoning" — Chinchilla-scale pretraining did not produce o-series behavior. The improvement came from RL on reasoning chains, not from more pretraining tokens.

Q: Why is confidence dangerous in a reasoning failure?
A: Confidence is a language signal: the same decoder produces it. A wrong step three sentences ago still produces a confident final paragraph because every later token is locally fluent on top of broken state. Calibration is one of the hardest reasoning-eval axes. Treat tone as zero evidence.

Common wrong answer to avoid: "Higher logprobs on the final answer mean it's correct" — logprobs reflect language likelihood, not factual correctness, especially when the reasoning chain has compounded errors.


Apply now (5 min)

Take one production failure from your team's last sprint. Label it skip / reorder / forget / never-check. Then write one sentence on which fix class it needs — decomposition, template, tool-grounding, or verifier. If you can't classify it, the failure is probably more than one class compounded — that is the signal to add a verifier.

Sketch from memory: Draw the four-cell failure-pattern table from this chapter. Next to each cell, write the production fix.


Bridge. Now we have the failure named. The cheapest first lever is not a new model — it is a prompt that forces the pause. That is chain-of-thought. → 02-chain-of-thought-prompting.md