07. Test-time Compute Scaling — Spending inference tokens to buy better answers¶
~11 min read. Bigger models are one scaling axis. More inference compute is another. Sometimes the second is cheaper than the first, and sometimes it actively hurts.
Built on the ELI5 in 00-eli5.md. The time budget (compute spent at inference rather than at training) is the lever a reasoning model gives you. The knob has limits and surprises that the marketing curves hide.
Buying thought after training¶
Classic scaling means a bigger model or more pretraining tokens. Reasoning adds a third axis: spend more at runtime. Sample more candidates. Allow longer hidden chains. Run a verifier over outputs. Use tools and compare results. Weights stay fixed; the runtime budget changes.
```
fixed model
│
├── one pass, low effort           $0.01, 2 s
├── one pass, high effort          $0.10, 25 s
├── 4 sampled paths, best-of-N     $0.40, 25 s parallel
└── 4 paths + verifier model       $0.45, 30 s
```
That is the time budget in its cleanest form. Same checkpoint, different policies on how much to think.
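The tree can also be read as a small cost model. The sketch below is illustrative only: the `Policy` class and its field names are assumptions, and the per-call figures are simply the ones from the tree, not measured prices.

```python
from dataclasses import dataclass


@dataclass
class Policy:
    """One way of spending inference budget on a fixed checkpoint."""
    name: str
    calls: int                 # number of sampled paths
    cost_per_call: float       # dollars per call (illustrative)
    latency_per_call: float    # seconds per call (illustrative)
    verifier_cost: float = 0.0
    verifier_latency: float = 0.0
    parallel: bool = True

    def cost(self) -> float:
        # Cost always scales with the number of calls.
        return self.calls * self.cost_per_call + self.verifier_cost

    def latency(self) -> float:
        # Parallel sampling waits for one call; serial sampling waits for all.
        if self.parallel:
            sampling = self.latency_per_call
        else:
            sampling = self.calls * self.latency_per_call
        return sampling + self.verifier_latency


policies = [
    Policy("one pass, low effort", 1, 0.01, 2),
    Policy("one pass, high effort", 1, 0.10, 25),
    Policy("4 paths, best-of-N", 4, 0.10, 25),
    Policy("4 paths + verifier", 4, 0.10, 25, verifier_cost=0.05, verifier_latency=5),
]

for p in policies:
    print(f"{p.name:<24} ${p.cost():.2f}  {p.latency():.0f} s")
```

Setting `parallel=False` on the four-path policy shows the other side of the trade: the dollar cost is unchanged, but latency becomes four times the single-call figure.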
Snell et al. (Scaling LLM Test-Time Compute Optimally, arXiv 2408.03314) showed the canonical curve: on a fixed model, extra inference compute trades against bigger pretrained models. For easy-to-medium tasks, a smaller model with high inference budget can match a much larger model run at low budget. The curve has a knee — beyond which extra compute buys little.
Three mechanisms by which extra compute helps¶
1. Longer single chain. The model writes more intermediate state, gets more serial computation depth (recall chapter 03). Works on math, multi-step logic, code refactors.
2. Multiple paths, then select. Sample N reasoning traces with temperature, then pick the best. Self-consistency (Wang et al. 2022) picks the majority answer; Best-of-N with a verifier picks the highest-scored path.
3. Process supervision. A trained PRM (Process Reward Model) scores each step of the chain. Used during inference, it can prune weak branches mid-chain. Stronger than ORM (Outcome Reward Model) for credit assignment, but ORM resists reward hacking better.
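Mechanisms 2 and 3 differ mainly in when scoring happens: once per finished chain, or at every step. A minimal sketch under stated assumptions: `toy_orm` and `toy_prm` are hand-written stand-ins for trained reward models, and ranking partial chains by their weakest step is one possible aggregation choice, not the only one.

```python
from typing import Callable, List

Chain = List[str]  # a reasoning chain as a list of step strings


def select_by_outcome(chains: List[Chain], orm: Callable[[Chain], float]) -> Chain:
    """Best-of-N: score each finished chain once, keep the highest."""
    return max(chains, key=orm)


def prune_by_process(partial: List[Chain], prm: Callable[[str], float], keep: int) -> List[Chain]:
    """Mid-chain pruning: rank partial chains by their weakest step so far."""
    ranked = sorted(partial, key=lambda c: min(prm(s) for s in c), reverse=True)
    return ranked[:keep]


# Toy scorers, stand-ins for trained reward models (assumptions for illustration).
def toy_prm(step: str) -> float:
    return 0.1 if "guess" in step else 0.9


def toy_orm(chain: Chain) -> float:
    return 1.0 if chain[-1] == "answer: 12" else 0.0


chains = [
    ["3 * 4 = 12", "answer: 12"],
    ["guess the product", "answer: 14"],
    ["3 + 4 = 7", "answer: 7"],
]
print(select_by_outcome(chains, toy_orm))
print(prune_by_process(chains, toy_prm, keep=2))
```

The toy PRM keeps the third chain even though it answers a different question, which hints at why a step scorer that rewards surface plausibility can be gamed.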
Worked example: probability of at-least-one-correct path¶
A single sampled chain solves a hard task with probability 0.40. Assume independent samples (rough approximation).
```
1 sample:   P(correct) = 0.40
4 samples:  P(at least one correct) = 1 - 0.6⁴  = 1 - 0.1296 = 0.870
16 samples: P(at least one correct) = 1 - 0.6¹⁶ ≈ 0.9997
```
But "at least one correct" only helps if you can SELECT the correct one.
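The same arithmetic as a small helper, assuming independent samples as the worked example does. The function name `p_at_least_one` is illustrative.

```python
def p_at_least_one(p_single: float, n: int) -> float:
    """Chance that at least one of n independent samples is correct."""
    return 1.0 - (1.0 - p_single) ** n


for n in (1, 4, 16):
    print(n, round(p_at_least_one(0.40, n), 4))
```

Picking one of the N samples at random leaves expected accuracy at the single-sample rate, which is why the selector, not the sample count, decides how much of this ceiling you actually collect.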
Selection matters. Three options.
| Selector | Lift over single sample | Caveat |
|---|---|---|
| Majority vote (self-consistency) | Good when one stable answer exists | Fails on open-ended tasks |
| Outcome reward model (ORM) | Strong if you can verify outcomes (compile, unit test, math check) | Need ground truth or strong verifier |
| Process reward model (PRM) | Best credit assignment on multi-step chains | Costs to train; can be reward-hacked |
With selection working, the practical lift from 1 → 4 samples is usually 10–25 points on hard reasoning tasks. From 4 → 16, another 2–8. From 16 → 64, often 0.5–2.
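A toy simulation can make the shape of that curve concrete. It is a sketch under loud assumptions, not a reproduction of the figures above: each sample is correct with probability 0.40, and errors scatter uniformly over five distinct wrong answers, so the correct answer tends to win a plurality once enough samples are drawn.

```python
import random
from collections import Counter


def majority_vote_accuracy(n, p_correct=0.40, wrong_answers=5, trials=20000, seed=0):
    """Estimate how often plurality vote over n samples lands on the right answer."""
    rng = random.Random(seed)
    wins = 0
    for _trial in range(trials):
        votes = Counter()
        for _sample in range(n):
            if rng.random() < p_correct:
                votes["correct"] += 1
            else:
                # Assumption: errors scatter uniformly over a few distinct wrong answers.
                votes[f"wrong-{rng.randrange(wrong_answers)}"] += 1
        winner, _count = votes.most_common(1)[0]
        wins += winner == "correct"
    return wins / trials


for n in (1, 4, 16):
    print(n, round(majority_vote_accuracy(n), 3))
```

If the wrong answers were concentrated on one attractive mistake instead of being scattered, majority vote would lift far less, which is one reason measured gains vary so much across tasks.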
Diminishing returns and the cost wall¶
The first extra samples help most. Later samples help less. Cost rises linearly with the number of samples. Latency stays close to that of a single call if you can sample in parallel, and grows roughly linearly if you sample serially. Anthropic's Inverse Scaling in Test-Time Compute (arXiv 2507.14417) showed something stranger: on certain task families, longer reasoning degrades performance.
| Failure family (Anthropic 2025) | What happens |
|---|---|
| Claude distractibility | Longer chains pull in irrelevant context that becomes load-bearing |
| OpenAI o-series framing overfit | Extended thinking overfits to surface framings of the problem |
| Spurious correlation drift | Long chains drift into plausible-but-wrong analogies |
| Deduction tracking collapse | Past a certain chain length, tracking multi-step deductions degrades |
| Self-preservation expressions | Sonnet 4 showed increased self-preservation language under extended CoT |
So "more compute = better" is wrong as a universal rule. The right framing: scale test-time compute selectively, on tasks where you have evidence the curve is still rising.
Confidence-Informed Self-Consistency¶
A 2025 ACL paper introduced CISC (Confidence-Informed Self-Consistency, aclanthology.org/2025.findings-acl.1030) — instead of weighting all sampled paths equally for majority vote, weight by the model's self-reported confidence. Result: ~46% fewer samples needed (18.6 → 10) to match accuracy.
```python
def cisc_select(samples):
    # Each sample is (answer, confidence)
    by_answer = {}
    for ans, conf in samples:
        by_answer[ans] = by_answer.get(ans, 0) + conf
    return max(by_answer, key=by_answer.get)
```
The implication: if you run self-consistency in production, consider switching from majority vote to confidence-weighted vote. When the model's confidence scores are informative, you get the same target accuracy from fewer samples.
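A small usage example shows where the two votes part ways. The `(answer, confidence)` pairs are hypothetical, and `majority_select` is added here only for comparison.

```python
from collections import Counter


def majority_select(samples):
    """Plain self-consistency: one vote per sampled path."""
    return Counter(ans for ans, _conf in samples).most_common(1)[0][0]


def cisc_select(samples):
    """Confidence-weighted vote: each path votes with its confidence."""
    by_answer = {}
    for ans, conf in samples:
        by_answer[ans] = by_answer.get(ans, 0.0) + conf
    return max(by_answer, key=by_answer.get)


# Hypothetical (answer, confidence) pairs from four sampled paths.
samples = [("41", 0.30), ("41", 0.35), ("42", 0.90), ("41", 0.20)]
print(majority_select(samples))  # three of four paths say 41
print(cisc_select(samples))      # 42 carries more total confidence (0.90 vs 0.85)
```

The weighted vote is only as good as the confidence signal. If the model is confidently wrong as often as it is confidently right, the weighting adds noise instead of information.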
API knobs that map to test-time compute¶
| Provider | Knob | Practical default |
|---|---|---|
| OpenAI o-series / GPT-5 thinking | reasoning_effort: minimal\|low\|medium\|high\|xhigh | medium for chat, high for batch |
| Anthropic extended thinking (≤ Sonnet 4.5) | thinking.budget_tokens (min 1024) | 8000 for interactive, 32000+ for batch |
| Anthropic effort (Sonnet 4.6+, Opus 4.6+) | effort: low\|medium\|high\|max | medium default; max only on Opus 4.6+/Mythos |
| Google Gemini 2.5 | thinkingConfig.thinkingBudget (-1 = dynamic) | -1 for adaptive |
| Google Gemini 3 | thinkingLevel: LOW\|HIGH | LOW for chat, HIGH for analysis |
| xAI Grok 4 | reasoning: none\|low\|medium\|high | low default |
Beyond the knob, you can layer your own scaling:
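```python
import asyncio


async def best_of_n(client, prompt, n=4, model="claude-sonnet-4-6"):
    # client is assumed to be an async Anthropic client (e.g. anthropic.AsyncAnthropic()).
    # The thinking/effort fields below are illustrative; confirm their exact shape,
    # and whether temperature may be set alongside thinking, against your
    # provider's current API reference.
    tasks = [
        client.messages.create(
            model=model,
            max_tokens=8000,
            thinking={"type": "enabled", "effort": "medium"},
            temperature=0.7,  # > 0 so the n paths differ
            messages=[{"role": "user", "content": prompt}],
        )
        for _ in range(n)
    ]
    # Fire all n requests concurrently and wait for every candidate.
    candidates = await asyncio.gather(*tasks)
    # verifier_select is your own selector (tests, ORM score, vote), defined elsewhere.
    return verifier_select(candidates)
```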
Note: temperature must be > 0 for path diversity. Best-of-N with temperature=0 is just N copies of the same chain.
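The `verifier_select` call is left undefined in the snippet. One possible shape for code-generation tasks is sketched below, assuming the candidates have already been reduced to source strings that define a function named `solve`. Both `passes_tests` and this `verifier_select` signature are hypothetical, and running model output with `exec` is only acceptable inside a sandbox.

```python
from typing import Callable, List, Optional, Tuple


def passes_tests(source: str, cases: List[Tuple[tuple, object]], func_name: str = "solve") -> bool:
    """Run candidate source and check it against (args, expected) pairs.

    exec() on model output is unsafe outside a sandbox; this is illustration only.
    """
    namespace: dict = {}
    try:
        exec(source, namespace)
        func = namespace[func_name]
        return all(func(*args) == expected for args, expected in cases)
    except Exception:
        # Syntax errors, missing functions, and runtime errors all count as failure.
        return False


def verifier_select(candidates: List[str], check: Callable[[str], bool]) -> Optional[str]:
    """Return the first candidate the verifier accepts, or None if all fail."""
    for candidate in candidates:
        if check(candidate):
            return candidate
    return None


cases = [((2, 3), 5), ((-1, 1), 0)]
candidates = [
    "def solve(a, b): return a * b",   # wrong result
    "def solve(a, b): return a +",     # does not parse
    "def solve(a, b): return a + b",   # passes
]
best = verifier_select(candidates, lambda src: passes_tests(src, cases))
print(best)
```

Returning `None` when every candidate fails is a deliberate choice in this sketch: it lets the caller decide whether to resample, escalate effort, or surface the failure.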
Operational patterns you will actually use¶
| Pattern | When to use |
|---|---|
| Single high-effort call | Default; cheapest reasoning lift |
| Best-of-N with verifier | Code generation with compile/test verification |
| Self-consistency (majority vote) | Math, multiple-choice, stable-answer tasks |
| CISC (confidence-weighted) | Same as self-consistency, ~46% cheaper |
| Tree of Thoughts | Planning, multi-step problems where intermediate validation is cheap |
| Tool-augmented chains | Anytime an external check is cheaper than another sample |
The cardinal rule: scale test-time compute only where the curve is rising. If your golden set shows 4 samples ≈ 16 samples on accuracy, you've hit the knee — stop there.
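The knee check can be automated against a golden set. A minimal sketch: `find_knee` and the `min_gain` threshold are illustrative names, and the accuracies are made-up numbers standing in for your own measurements.

```python
def find_knee(accuracy_by_n: dict, min_gain: float = 1.0) -> int:
    """Return the smallest N after which the next step adds less than min_gain points."""
    ns = sorted(accuracy_by_n)
    for current, nxt in zip(ns, ns[1:]):
        if accuracy_by_n[nxt] - accuracy_by_n[current] < min_gain:
            return current
    return ns[-1]  # still rising at the largest N measured


# Hypothetical golden-set accuracies (percent) at each sample count.
measured = {1: 52.0, 4: 68.0, 16: 68.5, 64: 68.9}
print(find_knee(measured))
```

The threshold should reflect what a point of accuracy is worth to you relative to the cost of the extra samples; one point is only a placeholder.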
Where this lives in the wild¶
- OpenAI reasoning endpoints: reasoning_effort is exactly this knob exposed at the API. The "right" default value depends on your task; OpenAI's docs recommend medium for chat, high for batch analysis.
- Perplexity Deep Research / Computer (May 2026): averages ~3 minutes per task; spends massive inference compute on multi-step retrieval + synthesis. Computer routes sub-agents (Gemini for deep research) under a Claude Opus 4.6 orchestrator.
- Cursor Auto mode — uses different reasoning effort levels by task: fast model + minimal effort for autocomplete, GPT-5.5 with medium effort for general edits, Opus 4.7 with high effort for refactors. Routing maps directly to a test-time compute axis.
- Math Olympiad solvers (GPT-5.5 Pro, Grok 4 Heavy) — Grok 4 Heavy hit 100% AIME 2025 by running parallel sampled paths and selecting; GPT-5.5 Pro reached 52.4% on FrontierMath via long internal chains plus tool use.
- GitHub Copilot Auto — routes to o3 for "complex multi-step reasoning" tasks and stays on cheaper models for autocomplete; routing function is the test-time-compute allocator.
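The routing idea in these examples reduces to a lookup from task class to compute policy. The sketch below is hypothetical throughout: the task classes, tier labels, and `Route` fields are invented for illustration and do not describe how any of the products above are implemented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    model_tier: str   # hypothetical tier label, not a real model ID
    effort: str       # effort level passed to whatever knob the provider exposes
    samples: int = 1  # number of paths to sample


# Hypothetical routing table: task class -> test-time compute policy.
ROUTES = {
    "autocomplete": Route("fast", "minimal"),
    "general_edit": Route("standard", "medium"),
    "refactor": Route("frontier", "high"),
    "codegen_with_tests": Route("standard", "medium", samples=4),
}
DEFAULT_ROUTE = Route("standard", "medium")


def allocate(task_class: str) -> Route:
    """Pick a compute policy for a task; unknown classes get the default."""
    return ROUTES.get(task_class, DEFAULT_ROUTE)


print(allocate("autocomplete"))
print(allocate("refactor"))
print(allocate("unknown"))
```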
Pause and recall¶
- Snell et al. showed that test-time compute can trade against what other scaling axis, and how?
- In the 4-sample example, why does P(at least one correct) ≈ 0.87 not directly translate to 87% accuracy in practice?
- Name three Inverse Scaling failure families from Anthropic's 2025 paper.
- What single change does CISC make to standard self-consistency, and how much does it save?
Interview Q&A¶
Q: Your product PM asks "why don't we just turn the reasoning knob to max on everything?" Give the technical answer.
A: Four reasons. First, diminishing returns — the curve flattens fast; high → max on Opus 4.7 tripled cost for under 1 point of SWE-bench. Second, inverse scaling — Anthropic's 2025 research shows certain task families actively degrade with longer reasoning (distractibility, framing overfit, deduction collapse). Third, latency — TTFT inflates 5–60× when reasoning is enabled; consumer chat at 28 s P50 feels broken. Fourth, cost — at scale, blanket max-effort is often 10–20× the cost of cascade routing for ≤2% quality lift. Show the cost-quality curve and the inverse-scaling table; PM will get it.
Common wrong answer to avoid: "We can afford it" — even when you can, max-effort hurts on some task classes. The answer to "always more compute?" is no, and the evidence is published.
Q: When does best-of-N beat a single high-effort call?
A: When you have a reliable selector and the task has high path diversity. Compile/test-based selection on code generation is the canonical case: sample 8 implementations, run tests, pick the one that passes. Self-consistency majority-vote works well when there's a single stable correct answer (math, multiple-choice). Best-of-N without a strong selector is just expensive sampling. The rule: budget for compute helps less than budget for verification.
Common wrong answer to avoid: "Best-of-N is always better than one chain at high effort" — without a strong selector, you're paying N× for not much lift. PRM-graded or verifier-graded selection is what makes best-of-N work.
Q: A teammate runs Best-of-8 with temperature=0. The accuracy is identical to one sample. Why?
A: Temperature=0 produces (nearly) deterministic decoding, so all 8 samples are the same chain and there is no diversity to vote over. Best-of-N needs temperature > 0 (typically 0.6–1.0) to generate distinct candidates. If you're using OpenAI reasoning models, the visible answer is also affected by reasoning-token determinism, which can suppress diversity even at temperature > 0. Switch to a model that exposes sampling controls cleanly, or adjust the sampling settings (for example top-p, also called nucleus sampling) to restore diversity.
Common wrong answer to avoid: "Best-of-N is broken" — the pattern works fine; the bug is the temperature setting. Diversity is what you're paying for, and temperature=0 kills it.
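A quick way to catch this bug before paying for it is to count distinct outputs in a batch. A minimal sketch, with `distinct_count` and `has_diversity` as illustrative names:

```python
def distinct_count(outputs):
    """Number of distinct outputs after trimming surrounding whitespace."""
    return len({o.strip() for o in outputs})


def has_diversity(outputs, min_distinct=2):
    """True if the batch contains at least min_distinct different outputs."""
    return distinct_count(outputs) >= min_distinct


greedy_batch = ["x = 7"] * 8                       # what temperature=0 tends to produce
sampled_batch = ["x = 7", "x = 7 ", "x = 9", "x = 7"]
print(has_diversity(greedy_batch))
print(distinct_count(sampled_batch))
```

Comparing final answers is a coarse check; two chains can reach the same answer by different routes, so a low count is a warning sign and not proof that sampling is broken.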
Q: Your reasoning task has high variance — sometimes correct, sometimes wildly wrong. Single-chain reasoning isn't reliable enough. What scaling do you reach for first?
A: Cheapest first: self-consistency with majority vote at N=5, temperature=0.7. If you have a programmatic verifier (compile, unit test, schema), upgrade to Best-of-N with verifier selection. If you have budget and the task supports it, layer CISC (confidence-weighted vote) for ~46% sample reduction. Reserve PRM-graded selection for cases where you've trained a domain-specific PRM. The progression is: more samples → smarter selector → smarter scoring. Start at the cheap end of the curve.
Common wrong answer to avoid: "Train a PRM" as the first answer — PRMs are powerful but expensive to train and prone to reward hacking. Try self-consistency and verifier-based selection first; PRM is a later upgrade.
Apply now (5 min)¶
Pick one variable-quality task. Run it 8 times with temperature 0.7. Measure accuracy of: (a) random single sample, (b) majority vote, (c) confidence-weighted vote (CISC), (d) verifier-based best-of-N if you have a programmatic check. Plot the four against cost. The cheapest method that clears your quality bar is your production setting.
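A small harness for the comparison might look like the sketch below. It assumes you have logged each run as a list of `(answer, confidence)` pairs plus a gold answer; the three logged runs are hypothetical and use three samples each for brevity, and verifier-based selection is omitted because it depends on your own check.

```python
import random
from collections import Counter


def single_sample(samples, rng):
    """Baseline: pick one sampled answer at random."""
    return rng.choice(samples)[0]


def majority_vote(samples, rng=None):
    """One vote per sample; the most common answer wins."""
    return Counter(a for a, _c in samples).most_common(1)[0][0]


def cisc_vote(samples, rng=None):
    """Each sample votes with its confidence; highest total wins."""
    totals = {}
    for answer, conf in samples:
        totals[answer] = totals.get(answer, 0.0) + conf
    return max(totals, key=totals.get)


def accuracy(method, runs, seed=0):
    """Share of tasks where the method's pick equals the gold answer."""
    rng = random.Random(seed)
    hits = sum(method(samples, rng) == gold for samples, gold in runs)
    return hits / len(runs)


# Hypothetical logged runs: ([(answer, confidence), ...], gold answer).
runs = [
    ([("A", 0.9), ("B", 0.4), ("B", 0.3)], "A"),
    ([("C", 0.6), ("C", 0.7), ("D", 0.5)], "C"),
    ([("E", 0.2), ("F", 0.8), ("F", 0.9)], "F"),
]
for name, method in [("single", single_sample), ("majority", majority_vote), ("cisc", cisc_vote)]:
    print(name, round(accuracy(method, runs), 2))
```

To plot against cost, record the number of samples each method consumed per task alongside its accuracy; the single-sample baseline uses one, the voting methods use all of them.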
Sketch from memory: Draw the four-curve diagram (samples vs accuracy) for single-sample, majority vote, CISC, and verifier-selected best-of-N. Mark the typical knee at N≈4–8.
Bridge. Once you sample multiple paths, you're already thinking like a search system. Next: explicit search and verification — the move tree from the ELI5 made concrete. → 08-search-verification-move-tree.md