02. Chain-of-Thought Prompting: Asking for steps is the cheapest reasoning upgrade

~10 min read. One sentence in your system prompt can lift accuracy 10–40 points on multi-step tasks. It can also waste tokens on tasks that did not need it.

Built on the ELI5 in 00-eli5.md. The thinking pause (visible reasoning steps inserted before the user-facing answer) is the cheapest way to buy the grandmaster's stop without paying for a trained reasoning model.

The picture before the prompt

If the quick guess is the bug, asking the model to slow down is the cheapest patch. Do not demand the final number. Ask for steps. Ask for decomposition. Ask the model to hold intermediate state. That request — in one sentence or in worked examples — is chain-of-thought prompting. People shorten it to CoT.

Replace one jump with a small ladder. Sometimes one instruction is enough:

without CoT:   prompt ──────────────────────────────→ answer
with CoT:      prompt ──→ step 1 ──→ step 2 ──→ step 3 ──→ answer

It is not magic. It is a structured delay before commitment. The model writes its scratch state into the output, which then conditions the next tokens. The scratch is literally the trick — the model gets to read its own intermediate work.

Zero-shot vs few-shot CoT

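Zero-shot CoT adds an instruction only. No examples. Wei et al. (2022) showed the original lift using few-shot worked examples. Kojima et al. (2022) then showed that one phrase with no examples is enough: "Let's think step by step." With text-davinci-002 (InstructGPT), that phrase alone lifted MultiArith accuracy from 17.7% to 78.7% and GSM8K accuracy from 10.4% to 40.7%, and the paper reports gains of similar magnitude on PaLM-540B. The sentence cost: ~6 tokens.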

prompt = f"""Question: {q}

Let's think step by step, then give the final answer on a new line
prefixed with 'Answer:'."""
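The 'Answer:' prefix only pays off if something downstream parses it. A minimal sketch, assuming the completion arrives as a plain string; `build_zero_shot_prompt` and `extract_answer` are hypothetical helper names, not part of any SDK, and no model is called here.

```python
import re
from typing import Optional


def build_zero_shot_prompt(question: str) -> str:
    """Wrap a question in the zero-shot CoT template shown above."""
    return (
        f"Question: {question}\n\n"
        "Let's think step by step, then give the final answer on a new line\n"
        "prefixed with 'Answer:'."
    )


def extract_answer(completion: str) -> Optional[str]:
    """Return the text after the last 'Answer:' line, or None if absent."""
    matches = re.findall(r"^Answer:[ \t]*(.+)$", completion, flags=re.MULTILINE)
    return matches[-1].strip() if matches else None


# Hand-written completion used for illustration only.
sample = "Start with 23. Use 20 -> 3. Buy 6 -> 9.\nAnswer: 9"
print(extract_answer(sample))  # 9
```

Taking the last match rather than the first is a design choice: if the model restates or revises an answer mid-chain, the final line wins. A `None` result is a useful signal that the format instruction was ignored.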

Few-shot CoT goes further. You show 2–8 worked examples of the reasoning style you want.

prompt = f"""Q: A canteen has 23 apples. They use 20 to make lunch and buy 6 more. How many do they have?
A: Start with 23. Use 20 → 23 - 20 = 3. Buy 6 → 3 + 6 = 9.
Answer: 9

Q: {your_question}
A:"""

Few-shot controls format, granularity, units, and final-answer discipline. It costs more tokens (the demonstrations sit in every call) but reduces drift on repeated task types. Production rule of thumb: zero-shot for one-off tasks, few-shot for repeated workflows.
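For repeated workflows it helps to keep the demonstrations as data and render them through one template, so every example has the same shape. A sketch under that assumption; `Demo` and `build_few_shot_prompt` are hypothetical names, and the second question is made up for illustration.

```python
from typing import List, NamedTuple


class Demo(NamedTuple):
    question: str
    reasoning: str
    answer: str


def build_few_shot_prompt(demos: List[Demo], question: str) -> str:
    """Render worked examples in one fixed format, then the open question."""
    blocks = [
        f"Q: {d.question}\nA: {d.reasoning}\nAnswer: {d.answer}" for d in demos
    ]
    # Leave the last "A:" open so the model continues in the same style.
    blocks.append(f"Q: {question}\nA:")
    return "\n\n".join(blocks)


demos = [
    Demo(
        question=(
            "A canteen has 23 apples. They use 20 to make lunch and buy 6 more. "
            "How many do they have?"
        ),
        reasoning="Start with 23. Use 20 → 23 - 20 = 3. Buy 6 → 3 + 6 = 9.",
        answer="9",
    )
]
prompt = build_few_shot_prompt(
    demos, "A shelf holds 12 books. 5 are lent out and 2 come back. How many remain?"
)
print(prompt)
```

Because the demonstrations are a list, swapping or A/B testing example sets becomes a data change rather than a prompt rewrite.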

Simple, no? Prompting is the cheapest way to buy the thinking pause. No model swap, no API key change, no retraining.

Worked example: drawing two balls, and why CoT saves it

A box has 5 red, 4 blue, 3 green balls. Draw two without replacement. Probability both same color?

A fast model often computes per-color probabilities by squaring fractions: (5/12)² + (4/12)² + (3/12)² = 25/144 + 16/144 + 9/144 ≈ 0.347. Wrong. It ignored that the box changes after the first draw.

A CoT prompt forces the state change to surface:

Step 1: Total ways to pick 2 from 12 = C(12,2) = 66.
Step 2: Same-color combinations:
  red:   C(5,2) = 10
  blue:  C(4,2) = 6
  green: C(3,2) = 3
Step 3: Favorable = 10 + 6 + 3 = 19.
Step 4: Probability = 19 / 66 ≈ 0.288.
Answer: 0.288

See. The step header "Total ways to pick 2 from 12" forces the model to commit to the with-replacement-or-not decision before doing arithmetic. That is the thinking pause preserving global state. Without it, the model races to the formula it has seen most often and ignores the structural difference.
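Both numbers in this example can be checked mechanically. The sketch below computes the with-replacement shortcut and the without-replacement count using exact fractions, then confirms the second by enumerating every pair of balls. It uses only the Python standard library; the variable names are illustrative.

```python
from fractions import Fraction
from itertools import combinations
from math import comb

counts = {"red": 5, "blue": 4, "green": 3}
total = sum(counts.values())  # 12 balls

# The shortcut: squaring each color's share silently assumes replacement.
with_replacement = sum(Fraction(n, total) ** 2 for n in counts.values())

# The CoT route: same-color pairs divided by all unordered pairs.
favorable = sum(comb(n, 2) for n in counts.values())
without_replacement = Fraction(favorable, comb(total, 2))

# Independent check: enumerate every unordered pair of distinct balls.
balls = [color for color, n in counts.items() for _ in range(n)]
pairs = list(combinations(range(total), 2))
same = sum(1 for i, j in pairs if balls[i] == balls[j])
brute_force = Fraction(same, len(pairs))

print(float(with_replacement))     # about 0.347
print(float(without_replacement))  # about 0.288
print(brute_force == without_replacement)  # True
```

This is also a small instance of the tool-shaped point: once the chain has committed to "without replacement", the arithmetic itself is better delegated to code than to prose.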

When CoT helps and when it hurts

CoT helps when the task has order, units, branches, or hidden assumptions. It hurts in four common cases:

Failure mode	What goes wrong
Trivial tasks	Wasted tokens, slower response, no quality gain
Noisy reasoning	Long rationales amplify a wrong assumption; later steps stay consistent with the early mistake
Tool-shaped tasks	Prose cannot replace running code or hitting an API
Reasoning-trained models	Adding "think step by step" to o3 or Claude with extended thinking can hurt — these models already plan internally and the instruction confuses the trained behavior

That last row matters in 2026. OpenAI's guidance for its o-series and GPT-5 reasoning models explicitly tells you not to add CoT phrasing, because the model has its own planner. Anthropic's extended-thinking docs make a similar point: prefer high-level task instructions over prescriptive step-by-step guidance. So the rule is: prompt-CoT for non-reasoning models, drop it for native reasoners.
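In code, that rule is a single branch at prompt-build time. A minimal sketch: the model labels in `NATIVE_REASONERS` are hypothetical placeholders, not real model IDs, and which of your models count as native reasoners is something to confirm against each provider's documentation.

```python
COT_SUFFIX = (
    "\n\nLet's think step by step, then give the final answer on a new line "
    "prefixed with 'Answer:'."
)
PLAIN_SUFFIX = "\n\nGive the final answer on a new line prefixed with 'Answer:'."

# Hypothetical labels for illustration only; keep the real list in config.
NATIVE_REASONERS = {"reasoning-model-a", "reasoning-model-b"}


def build_prompt(question: str, model_name: str) -> str:
    """Add the CoT trigger only for models that do not plan internally."""
    suffix = PLAIN_SUFFIX if model_name in NATIVE_REASONERS else COT_SUFFIX
    return f"Question: {question}{suffix}"


print(build_prompt("What is 17 * 6?", "chat-model"))
print(build_prompt("What is 17 * 6?", "reasoning-model-a"))
```

Both branches keep the 'Answer:' format instruction, so downstream parsing does not need to know which kind of model produced the output.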

Prompt design rules that actually move the needle

Be explicit about the final-answer format. End the prompt with "Final answer on a single line prefixed with 'Answer:'." This separates scratch from output and makes downstream parsing trivial.
Name what to track. Units. Constraints. Candidate options. "Track units at every step. Reject any step that loses a unit."
Ask for a check. "Before answering, verify that every stated constraint holds." This is a poor-man's verifier in one sentence.
Show examples aligned to your real task. Few-shot examples shape the reasoning style. Mismatched examples teach noise.
Decompose for branching tasks. "First list candidate options. Then evaluate each. Then pick."
Cap the steps. "Reason in at most 5 steps." Bounds the budget for non-reasoning models.
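These rules compose, so one way to keep them consistent across prompts is to generate the instruction block from a few parameters. A sketch only: `build_cot_instructions` and its parameters are hypothetical, and the example-selection rule is left out because it concerns demonstrations rather than instructions.

```python
from typing import List, Optional


def build_cot_instructions(
    track: Optional[List[str]] = None,
    check: Optional[str] = None,
    branching: bool = False,
    max_steps: Optional[int] = None,
) -> str:
    """Compose the design rules above into one instruction block."""
    lines: List[str] = []
    if branching:
        lines.append("First list candidate options. Then evaluate each. Then pick.")
    for item in track or []:
        lines.append(f"Track {item} at every step.")
    if max_steps is not None:
        lines.append(f"Reason in at most {max_steps} steps.")
    if check:
        lines.append(f"Before answering, verify that {check}.")
    # The final-answer rule always comes last so the parser has a fixed anchor.
    lines.append("Final answer on a single line prefixed with 'Answer:'.")
    return "\n".join(lines)


print(
    build_cot_instructions(
        track=["units"], check="the total is not negative", max_steps=5
    )
)
```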

Where this lives in the wild

OpenAI prompt engineering docs — non-o-series models — official recommendation for GPT-4.1 family: zero-shot CoT for math and logic, few-shot CoT for structured extraction. Explicit warning to omit CoT phrasing for o-series and GPT-5 thinking.
Anthropic prompt library — Claude Sonnet 4.6 (base, no thinking) — recommends explicit <scratchpad> tags wrapped around reasoning, parsed away from the final user-facing answer.
Khanmigo (Khan Academy) — tutor flows — uses few-shot CoT to enforce a "ask a guiding question, then solve" reasoning style instead of jumping to the answer; aligns hint delivery with pedagogy.
Harvey AI — legal drafting checks — few-shot CoT with "issue → rule → application → conclusion" template forces the IRAC reasoning structure, reduces hallucinated case law.
GitHub Copilot prompt files (.github/prompts/) — repo-level few-shot examples ship the team's preferred debug sequence so the model follows house style on PR explanations.
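The scratchpad pattern in the list above depends on removing the reasoning before the user sees the reply. A minimal sketch of that parsing step, assuming the model wraps its reasoning in literal `<scratchpad>` tags as instructed; `strip_scratchpad` is a hypothetical helper, and an unclosed tag would be left in place by this version.

```python
import re

SCRATCHPAD = re.compile(r"<scratchpad>.*?</scratchpad>", flags=re.DOTALL)


def strip_scratchpad(completion: str) -> str:
    """Drop every <scratchpad>...</scratchpad> span and trim the result."""
    return SCRATCHPAD.sub("", completion).strip()


# Hand-written completion used for illustration only.
raw = "<scratchpad>\n23 - 20 = 3\n3 + 6 = 9\n</scratchpad>\nThe canteen has 9 apples."
print(strip_scratchpad(raw))  # The canteen has 9 apples.
```

Logging the removed span separately, rather than discarding it, keeps the chain available for debugging.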

Pause and recall

What is the one-line zero-shot CoT trigger phrase, and what was its measured lift on GSM8K?
When should you not add "think step by step" to your prompt?
In the ball example, what wrong shortcut does a fast model usually take, and which step prevents it?
Why does few-shot CoT cost more tokens than zero-shot, and when is the cost worth it?

Interview Q&A

Q: When would you reach for CoT prompting before switching to a reasoning model? A: When the task is moderate complexity but high volume — autocomplete-class latency budgets where you cannot afford 10 s of hidden thinking but you can afford 200 extra output tokens. Also: when you are debugging your own prompts and want to see where the chain breaks, since a reasoning model hides that chain. CoT prompting is your microscope.

Common wrong answer to avoid: "CoT only matters when you can't afford a reasoning model" — CoT is also the standard way to expose reasoning for debugging and for tasks where you must show the work to the user (tutoring, legal IRAC, audit explanations).

Q: Few-shot CoT examples are tokens that sit in every call. How do you decide they earn that cost? A: A/B test against zero-shot CoT on your golden eval set. Compare daily savings (accuracy lift × call volume × cost per wrong answer) with daily added spend (extra prompt tokens per call × call volume × price per token). The break-even depends on error cost. For a $0.10 wrong answer at 10K calls/day, a 2-point lift saves about $20/day, which at $5/M input tokens pays for up to 4M extra prompt tokens per day, or about 400 per call. Prompt caching (Anthropic ephemeral, OpenAI automatic prefix cache) also lets you keep the examples warm; on Anthropic, cached reads bill at ~10% of the base input rate, and OpenAI's cached-input discount varies by model.

Common wrong answer to avoid: "Few-shot is always better than zero-shot" — at scale with prompt caching it usually is, but mis-chosen examples actively poison reasoning style. Always eval before shipping.
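The break-even check for few-shot examples is a two-line calculation worth scripting so it can be rerun when prices or volumes change. A sketch under stated assumptions: `daily_break_even` is a hypothetical helper, the 300 extra tokens per call is an invented figure for illustration, and the price is treated as a flat per-token rate with no caching discount.

```python
def daily_break_even(
    accuracy_lift: float,        # e.g. 0.02 for a 2-point lift
    calls_per_day: int,
    cost_per_error: float,       # dollars lost per wrong answer
    extra_tokens_per_call: int,  # prompt tokens added by the demonstrations
    price_per_million: float,    # dollars per 1M prompt tokens (assumed rate)
) -> dict:
    """Compare daily savings from fewer errors with daily added token spend."""
    savings = accuracy_lift * calls_per_day * cost_per_error
    added_cost = extra_tokens_per_call * calls_per_day * price_per_million / 1_000_000
    return {"savings": savings, "added_cost": added_cost, "ship": savings > added_cost}


# 2-point lift, 10K calls/day, $0.10 per error, 300 extra tokens, $5 per 1M tokens.
result = daily_break_even(0.02, 10_000, 0.10, 300, 5.0)
print(result)
```

With these inputs the savings come to about $20/day against about $15/day of added spend, so the examples earn their place. Raising the demonstrations to 500 tokens per call would flip the decision.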

Q: Why does telling Claude Sonnet 4.6 with extended thinking enabled to "think step by step" sometimes hurt quality? A: The model has been RL-trained to do its own planning inside the <thinking> block. Adding instructions about how to think conflicts with the trained policy and produces shorter, more performative scratchpads. Anthropic's official docs say to give task instructions in plain prose and let the model decide the reasoning shape. Same applies to o-series and GPT-5 thinking tier.

Common wrong answer to avoid: "All models benefit from explicit step-by-step instructions" — this was true through GPT-4 era. From late 2024 onward, native-reasoning models have their own internal CoT policy and respond better to plain task instructions.

Q: A reviewer says your CoT prompt fixed the model's chain. How do you know the chain is actually faithful? A: You don't, from inspection alone. Faithfulness research (Anthropic 2024–2025) shows visible scratchpads often diverge from the model's actual causal reasoning. The defenses are: tool-grounded verification (does the cited equation produce the cited number?), perturbation tests (does the final answer change when I edit a step the chain presents as load-bearing? If it does not, that step was decoration), and process-reward-model audits. Trust outputs and evidence, not eloquence.

Common wrong answer to avoid: "If the reasoning sounds right, the model reasoned correctly" — coherence and faithfulness are separate axes. A model can write a polished chain and reach the right answer through an entirely different internal path, or worse, reach the right answer by guessing and post-rationalize.
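Tool-grounded verification can start very small. The sketch below recomputes every simple integer claim of the form `a op b = c` found in a chain and flags the ones that do not hold. It is an illustration of the idea, not a faithfulness test: `check_arithmetic` is a hypothetical helper, it handles only `+`, `-`, and `*` on integers, and it ignores chained equalities.

```python
import re
from typing import List, Tuple

# Matches simple claims such as "23 - 20 = 3" with integer operands.
CLAIM = re.compile(r"(-?\d+)\s*([+\-*])\s*(-?\d+)\s*=\s*(-?\d+)")
OPS = {
    "+": lambda a, b: a + b,
    "-": lambda a, b: a - b,
    "*": lambda a, b: a * b,
}


def check_arithmetic(chain: str) -> List[Tuple[str, bool]]:
    """Recompute every 'a op b = c' claim and report whether it holds."""
    results = []
    for m in CLAIM.finditer(chain):
        a, op, b, claimed = int(m.group(1)), m.group(2), int(m.group(3)), int(m.group(4))
        results.append((m.group(0), OPS[op](a, b) == claimed))
    return results


# Hand-written chain with one deliberate error in the second step.
chain = "Start with 23. Use 20 → 23 - 20 = 3. Buy 6 → 3 + 6 = 10."
for claim, ok in check_arithmetic(chain):
    print(claim, "OK" if ok else "WRONG")
```

A chain that passes this check can still be unfaithful; the check only rules out steps whose stated numbers contradict themselves.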

Apply now (5 min)

Take one underperforming prompt from your team. Add "Let's reason step by step. Track units at every step. Final answer on a separate line prefixed with 'Answer:'." to the end. Run it on 10 golden examples. Note the accuracy delta and the token cost delta. If accuracy lift × call volume × cost per wrong answer exceeds the added token cost, ship it. If not, the task probably needs a reasoning model or a tool, not more prose.
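The exercise above can be run as a tiny harness. In this sketch the model is passed in as a function that returns an already-parsed final answer plus a token count; that interface, the names `evaluate` and `compare`, and the stand-in `fake_model` are all assumptions for illustration, to be replaced by a real client call.

```python
from typing import Callable, Dict, List, Tuple

Example = Tuple[str, str]                   # (question, expected answer)
ModelFn = Callable[[str], Tuple[str, int]]  # prompt -> (final answer, tokens used)

COT_SUFFIX = (
    " Let's reason step by step. Track units at every step. "
    "Final answer on a separate line prefixed with 'Answer:'."
)


def evaluate(model: ModelFn, golden: List[Example], suffix: str = "") -> Dict[str, float]:
    """Run every golden example once and report accuracy and mean tokens."""
    correct = 0
    tokens = 0
    for question, expected in golden:
        answer, used = model(question + suffix)
        correct += int(answer.strip() == expected)
        tokens += used
    n = len(golden)
    return {"accuracy": correct / n, "mean_tokens": tokens / n}


def compare(model: ModelFn, golden: List[Example], suffix: str) -> Dict[str, float]:
    """Report the change in accuracy and mean tokens when the suffix is added."""
    base = evaluate(model, golden)
    cot = evaluate(model, golden, suffix)
    return {
        "accuracy_delta": cot["accuracy"] - base["accuracy"],
        "token_delta": cot["mean_tokens"] - base["mean_tokens"],
    }


def fake_model(prompt: str) -> Tuple[str, int]:
    """Stand-in for a real call: correct only when asked to reason, at 6x the tokens."""
    if "step by step" in prompt:
        return "9", 60
    return "8", 10


golden = [("How many apples are left?", "9")] * 10
report = compare(fake_model, golden, COT_SUFFIX)
print(report)
```

With only 10 examples, each one moves accuracy by 10 points, so treat the result as a smoke test and rerun on a larger set before committing to a decision.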

Sketch from memory: Draw the prompt → step1 → step2 → step3 → answer ladder, then mark which step the fast model usually skips in your domain.

Bridge. Prompting buys the pause. But why do those steps lift accuracy at all? Is it just longer output, or something structural? The next file explains the mechanism. → 03-why-cot-works.md