07. Test-time Compute Scaling — Spending inference tokens to buy better answers¶

~11 min read. A bigger model is one scaling axis; more inference compute is another. Sometimes the second is cheaper than the first, and sometimes it actively hurts.

Built on the ELI5 in 00-eli5.md. The time budget (compute spent at inference rather than at training) is the lever a reasoning model gives you. The knob has limits and surprises that the marketing curves hide.

Buying thought after training¶

Classic scaling means a bigger model or more pretraining tokens. Reasoning adds a third axis: spend more at runtime. Sample more candidates. Allow longer hidden chains. Run a verifier over outputs. Use tools and compare results. Weights stay fixed; the runtime budget changes.

fixed model
   │
   ├── one pass, low effort               $0.01, 2 s
   ├── one pass, high effort              $0.10, 25 s
   ├── 4 sampled paths, best-of-N         $0.40, 25 s parallel
   └── 4 paths + verifier model           $0.45, 30 s

That is the time budget in its cleanest form. Same checkpoint, different policies on how much to think.
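As a minimal sketch of what "different policies on one checkpoint" looks like in code, the snippet below picks a policy under a cost and latency budget. The figures are copied from the diagram above; the `Policy` class and `pick_policy` helper are hypothetical names, and the sketch assumes that policies listed later (more compute) are the ones you prefer when they fit, which later sections show is not always true.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Policy:
    name: str
    cost_usd: float
    latency_s: float


# Figures copied from the diagram; ordered from least to most compute.
POLICIES = [
    Policy("one pass, low effort", 0.01, 2),
    Policy("one pass, high effort", 0.10, 25),
    Policy("4 sampled paths, best-of-N", 0.40, 25),
    Policy("4 paths + verifier model", 0.45, 30),
]


def pick_policy(max_cost_usd: float, max_latency_s: float) -> Optional[Policy]:
    """Return the most compute-heavy policy that fits both budgets, or None."""
    affordable = [
        p for p in POLICIES
        if p.cost_usd <= max_cost_usd and p.latency_s <= max_latency_s
    ]
    # Assumption: later entries are preferred whenever they are affordable.
    return affordable[-1] if affordable else None
```

With a budget of $0.50 and 25 s, this selects the best-of-N policy; the verifier variant is excluded only by its 30 s latency.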

Snell et al. (Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters, arXiv 2408.03314) showed the canonical curve: on a fixed model, extra inference compute trades against bigger pretrained models. For easy-to-medium tasks, a smaller model with a high inference budget can match a much larger model run at low budget. The curve has a knee, beyond which extra compute buys little.

Three mechanisms by which extra compute helps¶

1. Longer single chain. The model writes more intermediate state and gets more serial computation depth (recall chapter 03). Works on math, multi-step logic, code refactors.

2. Multiple paths, then select. Sample N reasoning traces with temperature, then pick the best. Self-consistency (Wang et al. 2022) picks the majority answer; Best-of-N with a verifier picks the highest-scored path.

3. Process supervision. A trained PRM (Process Reward Model) scores each step of the chain. Used during inference, it can prune weak branches mid-chain. Stronger than ORM (Outcome Reward Model) for credit assignment, but ORM resists reward hacking better.
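Mechanisms 2 and 3 can be sketched in a few lines. This is an illustration under stated assumptions, not a reference implementation: `majority_vote` assumes final answers have already been extracted and normalized into hashable values, and `prune_chains` assumes a hypothetical `score_step` callable standing in for a trained PRM.

```python
from collections import Counter
from typing import Callable, Hashable, List, Sequence


def majority_vote(answers: Sequence[Hashable]) -> Hashable:
    """Self-consistency: return the most frequent final answer.

    Ties resolve to the answer that was seen first.
    """
    if not answers:
        raise ValueError("need at least one sampled answer")
    return Counter(answers).most_common(1)[0][0]


def prune_chains(
    chains: List[List[str]],
    score_step: Callable[[str], float],
    keep: int = 2,
) -> List[List[str]]:
    """Process supervision at inference: keep the `keep` partial chains
    whose most recent step scores highest under the step scorer."""
    ranked = sorted(chains, key=lambda chain: score_step(chain[-1]), reverse=True)
    return ranked[:keep]
```

In a real system `score_step` would call the PRM on the chain prefix rather than on the last step alone; scoring only the last step keeps the sketch short.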

Worked example: probability of at-least-one-correct path¶

A single sampled chain solves a hard task with probability 0.40. Assume independent samples (rough approximation).

1 sample:     P(correct) = 0.40
4 samples:    P(at least one correct) = 1 - 0.6⁴ = 1 - 0.1296 = 0.870
16 samples:   P(at least one correct) = 1 - 0.6¹⁶ ≈ 0.9997

But "at least one correct" only helps if you can SELECT the correct one.
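The arithmetic above generalizes to any single-sample success rate. The sketch below reproduces it and inverts it to ask how many samples a coverage target needs. It relies on the same independence assumption as the worked example, so treat the results as an upper bound on what real, correlated samples deliver; the function names are illustrative.

```python
import math


def p_at_least_one(p_single: float, n: int) -> float:
    """P(at least one of n independent samples is correct)."""
    return 1.0 - (1.0 - p_single) ** n


def samples_needed(p_single: float, target: float) -> int:
    """Smallest n with P(at least one correct) >= target.

    Requires 0 < p_single < 1 and 0 < target < 1.
    """
    if not (0.0 < p_single < 1.0 and 0.0 < target < 1.0):
        raise ValueError("p_single and target must be strictly between 0 and 1")
    return math.ceil(math.log(1.0 - target) / math.log(1.0 - p_single))
```

For p = 0.40, reaching 99% coverage takes 10 samples under this model. Coverage is still not accuracy: a selector has to find the correct path.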

Selection matters. Three options.

Selector	Strength	Caveat
Majority vote (self-consistency)	Good when one stable answer exists	Fails on open-ended tasks
Outcome reward model (ORM)	Strong if you can verify outcomes (compile, unit test, math check)	Needs ground truth or a strong verifier
Process reward model (PRM)	Best credit assignment on multi-step chains	Costly to train; can be reward-hacked

With selection working, the practical lift from 1 → 4 samples is usually 10–25 points on hard reasoning tasks. From 4 → 16, another 2–8. From 16 → 64, often 0.5–2.
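One way to read those ranges is as a price: how many extra samples you pay per accuracy point at each step. The sketch below uses the midpoint of each quoted range, which is a simplifying assumption; the ranges themselves come from the paragraph above.

```python
# (from_n, to_n, low_gain, high_gain), gains in accuracy points as quoted above.
STEPS = [(1, 4, 10.0, 25.0), (4, 16, 2.0, 8.0), (16, 64, 0.5, 2.0)]


def extra_samples_per_point(from_n: int, to_n: int, low: float, high: float) -> float:
    """Extra samples paid per accuracy point, at the midpoint of the range."""
    midpoint_gain = (low + high) / 2
    return (to_n - from_n) / midpoint_gain


for from_n, to_n, low, high in STEPS:
    price = extra_samples_per_point(from_n, to_n, low, high)
    print(f"{from_n} -> {to_n}: {price:.2f} extra samples per point")
```

At the midpoints this works out to roughly 0.17, 2.4, and 38.4 extra samples per point, so each step up is more than ten times pricier per point than the last.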

Diminishing returns and the cost wall¶

The first extra samples help most. Later samples help less. Cost rises linearly with the number of samples. Latency stays close to that of a single call if you can run the samples in parallel, and rises linearly if you must sample serially. Anthropic's Inverse Scaling in Test-Time Compute (arXiv 2507.14417) showed something stranger: on certain task families, longer reasoning degrades performance.
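The cost and latency behavior can be written as a small model. This is a sketch that assumes every sample costs and takes the same amount, and that a hypothetical `max_parallel` limit caps how many requests run at once.

```python
import math
from typing import Tuple


def sampling_budget(
    n: int,
    cost_per_sample: float,
    latency_per_sample: float,
    max_parallel: int = 1,
) -> Tuple[float, float]:
    """Return (total_cost, wall_clock_latency) for n samples.

    Cost scales with n. Latency scales with the number of serial "waves",
    where each wave runs up to max_parallel samples concurrently.
    """
    if n < 1 or max_parallel < 1:
        raise ValueError("n and max_parallel must be at least 1")
    waves = math.ceil(n / max_parallel)
    return n * cost_per_sample, waves * latency_per_sample
```

With 4 samples at $0.10 and 25 s each, full parallelism gives about $0.40 and 25 s, while serial sampling costs the same but takes 100 s.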

Failure family (Anthropic 2025)	What happens
Claude distractibility	Longer chains pull in irrelevant context that becomes load-bearing
OpenAI o-series framing overfit	Extended thinking overfits to surface framings of the problem
Spurious correlation drift	Long chains drift into plausible-but-wrong analogies
Deduction tracking collapse	Past a certain chain length, tracking multi-step deductions degrades
Self-preservation expressions	Sonnet 4 showed increased self-preservation language under extended CoT

So "more compute = better" is wrong as a universal rule. The right framing: scale test-time compute selectively, on tasks where you have evidence the curve is still rising.

Confidence-Informed Self-Consistency¶

A 2025 ACL paper introduced CISC (Confidence-Informed Self-Consistency, aclanthology.org/2025.findings-acl.1030) — instead of weighting all sampled paths equally for majority vote, weight by the model's self-reported confidence. Result: ~46% fewer samples needed (18.6 → 10) to match accuracy.

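def cisc_select(samples):
    """Return the answer with the largest summed confidence.

    samples: iterable of (answer, confidence) pairs, one per sampled path.
    """
    by_answer = {}
    for ans, conf in samples:
        # Each path votes with its confidence instead of a flat 1.
        by_answer[ans] = by_answer.get(ans, 0.0) + conf
    return max(by_answer, key=by_answer.get)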

The implication: if you run self-consistency in production, try switching from plain majority vote to a confidence-weighted vote. The only extra cost is eliciting a confidence score per path, and you need fewer samples for the same target accuracy.
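To see where the two votes diverge, here is a small sketch with made-up samples. The answers and confidence values are hypothetical and not taken from the paper, and whether the weighted pick is actually the right one depends on how well calibrated the model's confidence is.

```python
from collections import Counter, defaultdict
from typing import Hashable, List, Tuple

Sample = Tuple[Hashable, float]  # (final answer, self-reported confidence)


def majority_vote(samples: List[Sample]) -> Hashable:
    """Plain self-consistency: every path counts as one vote."""
    return Counter(ans for ans, _ in samples).most_common(1)[0][0]


def weighted_vote(samples: List[Sample]) -> Hashable:
    """Confidence-weighted vote: every path counts as its confidence."""
    totals = defaultdict(float)
    for ans, conf in samples:
        totals[ans] += conf
    return max(totals, key=totals.get)


# Three lukewarm paths agree on "A"; two confident paths agree on "B".
samples = [("A", 0.55), ("A", 0.50), ("B", 0.95), ("B", 0.90), ("A", 0.40)]
print(majority_vote(samples))  # "A" wins 3 votes to 2
print(weighted_vote(samples))  # "B" wins 1.85 to 1.45
```

Before adopting the weighted vote, it is worth checking on a labeled set that high-confidence paths really are correct more often.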

API knobs that map to test-time compute¶

Provider	Knob	Practical default
OpenAI o-series / GPT-5 thinking	`reasoning_effort: minimal|low|medium|high|xhigh`	`medium` for chat, `high` for batch
Anthropic extended thinking (≤ Sonnet 4.5)	`thinking.budget_tokens` (min 1024)	8000 for interactive, 32000+ for batch
Anthropic effort (Sonnet 4.6+, Opus 4.6+)	`effort: low|medium|high|max`	`medium` default; `max` only on Opus 4.6+/Mythos
Google Gemini 2.5	`thinkingConfig.thinkingBudget` (-1 = dynamic)	-1 for adaptive
Google Gemini 3	`thinkingLevel: LOW|HIGH`	`LOW` for chat, `HIGH` for analysis
xAI Grok 4	`reasoning: none|low|medium|high`	`low` default

Beyond the knob, you can layer your own scaling:

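import asyncio

async def best_of_n(client, prompt, n=4, model="claude-sonnet-4-6"):
    # client is assumed to be an async client (e.g. anthropic.AsyncAnthropic).
    # Build n independent requests; each call samples its own reasoning path.
    tasks = [
        client.messages.create(
            model=model,
            max_tokens=8000,
            # Thinking payload as given in this chapter; confirm the exact shape
            # for your model version in the provider's API reference.
            thinking={"type": "enabled", "effort": "medium"},
            temperature=0.7,  # some thinking modes reject a custom temperature
            messages=[{"role": "user", "content": prompt}],
        )
        for _ in range(n)
    ]
    candidates = await asyncio.gather(*tasks)  # run the n calls concurrently
    # verifier_select is your own selector (tests, ORM, or vote), not a library call.
    return verifier_select(candidates)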

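Note: temperature must be > 0 for path diversity. Best-of-N with temperature=0 is (nearly) N copies of the same chain. Some providers restrict temperature when thinking is enabled, so confirm the parameter is accepted for your model before relying on it.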
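The `verifier_select` call above is left undefined. One possible shape for it, for code generation with a programmatic check, is sketched below. Everything here is an assumption for illustration: candidates are Python source strings that define a function named `solve`, tests are `(args, expected)` pairs, and `select_by_tests` is a hypothetical name. It also takes the tests as a second argument, unlike the one-argument call above.

```python
from typing import Any, List, Sequence, Tuple

TestCase = Tuple[Tuple[Any, ...], Any]  # (positional args, expected result)


def passed_tests(source: str, tests: Sequence[TestCase], func_name: str = "solve") -> int:
    """Count how many test cases a candidate implementation passes."""
    namespace: dict = {}
    try:
        # WARNING: exec runs untrusted model output. Use a sandbox in real systems.
        exec(source, namespace)
        func = namespace[func_name]
    except Exception:
        return 0  # does not compile, or does not define the function
    passed = 0
    for args, expected in tests:
        try:
            if func(*args) == expected:
                passed += 1
        except Exception:
            pass  # a crashing test case counts as a failure
    return passed


def select_by_tests(candidates: List[str], tests: Sequence[TestCase]) -> str:
    """Best-of-N with an outcome check: return the candidate passing most tests.

    Ties resolve to the earliest candidate.
    """
    return max(candidates, key=lambda src: passed_tests(src, tests))
```

For example, given one candidate that adds, one that subtracts, and one that fails to parse, with tests for addition, the adding candidate is returned.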

Operational patterns you will actually use¶

Pattern	When to use
Single high-effort call	Default; cheapest reasoning lift
Best-of-N with verifier	Code generation with compile/test verification
Self-consistency (majority vote)	Math, multiple-choice, stable-answer tasks
CISC (confidence-weighted)	Same as self-consistency, ~46% cheaper
Tree of Thoughts	Planning, multi-step problems where intermediate validation is cheap
Tool-augmented chains	Anytime an external check is cheaper than another sample

The cardinal rule: scale test-time compute only where the curve is rising. If your golden set shows 4 samples ≈ 16 samples on accuracy, you've hit the knee — stop there.
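The cardinal rule can be turned into a check over golden-set results. In this sketch, `accuracy_by_n` maps a sample count to measured accuracy in points, and `min_gain` is a hypothetical threshold for the smallest improvement worth paying for; both the function name and the default threshold are assumptions.

```python
from typing import Dict


def find_knee(accuracy_by_n: Dict[int, float], min_gain: float = 1.0) -> int:
    """Return the largest N reached before a step gains less than min_gain points.

    accuracy_by_n: measured accuracy (in points) keyed by number of samples.
    """
    if not accuracy_by_n:
        raise ValueError("need at least one measurement")
    ns = sorted(accuracy_by_n)
    knee = ns[0]
    for prev_n, next_n in zip(ns, ns[1:]):
        gain = accuracy_by_n[next_n] - accuracy_by_n[prev_n]
        if gain < min_gain:
            break  # the curve has flattened; stop scaling here
        knee = next_n
    return knee
```

With made-up measurements such as `{1: 52.0, 4: 70.0, 16: 70.5, 64: 70.8}`, the function returns 4, matching the "4 samples ≈ 16 samples" case described above.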

Where this lives in the wild¶

OpenAI reasoning endpoints — reasoning_effort is exactly this knob exposed at the API. The "right" default value depends on your task; OpenAI's docs recommend medium for chat, high for batch analysis.
Perplexity Deep Research / Computer (May 2026) — averages ~3 minutes per task; spends massive inference compute on multi-step retrieval + synthesis. Computer routes sub-agents (Gemini for deep research) under a Claude Opus 4.6 orchestrator.
Cursor Auto mode — uses different reasoning effort levels by task: fast model + minimal effort for autocomplete, GPT-5.5 with medium effort for general edits, Opus 4.7 with high effort for refactors. Routing maps directly to a test-time compute axis.
Math Olympiad solvers (GPT-5.5 Pro, Grok 4 Heavy) — Grok 4 Heavy hit 100% AIME 2025 by running parallel sampled paths and selecting; GPT-5.5 Pro reached 52.4% on FrontierMath via long internal chains plus tool use.
GitHub Copilot Auto — routes to o3 for "complex multi-step reasoning" tasks and stays on cheaper models for autocomplete; routing function is the test-time-compute allocator.

Pause and recall¶

Snell et al. showed test-time compute can trade against what other scaling axis, and how?
In the 4-sample example, why does P(at least one correct) ≈ 0.87 not directly translate to 87% accuracy in practice?
Name three Inverse Scaling failure families from Anthropic's 2025 paper.
What single change does CISC make to standard self-consistency, and how much does it save?

Interview Q&A¶

Q: Your product PM asks "why don't we just turn the reasoning knob to max on everything?" Give the technical answer. A: Four reasons. First, diminishing returns — the curve flattens fast; high → max on Opus 4.7 tripled cost for under 1 point of SWE-bench. Second, inverse scaling — Anthropic's 2025 research shows certain task families actively degrade with longer reasoning (distractibility, framing overfit, deduction collapse). Third, latency — TTFT inflates 5–60× when reasoning is enabled; consumer chat at 28 s P50 feels broken. Fourth, cost — at scale, blanket max-effort is often 10–20× the cost of cascade routing for ≤2% quality lift. Show the cost-quality curve and the inverse-scaling table; PM will get it.

Common wrong answer to avoid: "We can afford it" — even when you can, max-effort hurts on some task classes. The answer to "always more compute?" is no, and the evidence is published.

Q: When does best-of-N beat a single high-effort call? A: When you have a reliable selector and the task has high path diversity. Compile/test-based selection on code generation is the canonical case — sample 8 implementations, run tests, pick the one that passes. Self-consistency majority-vote works well when there's a single stable correct answer (math, multiple-choice). Best-of-N without a strong selector is just expensive sampling. The rule: budget for compute helps less than budget for verification.

Common wrong answer to avoid: "Best-of-N is always better than one chain at high effort" — without a strong selector, you're paying N× for not much lift. PRM-graded or verifier-graded selection is what makes best-of-N work.

Q: A teammate runs Best-of-8 with temperature=0. The accuracy is identical to one sample. Why? A: Temperature=0 produces (nearly) deterministic decoding, so all 8 samples are the same chain and there is no diversity to vote over. Best-of-N needs temperature > 0 (typically 0.6–1.0) to generate distinct candidates. If you're using OpenAI reasoning models, check which sampling parameters the endpoint accepts; several reasoning models do not expose temperature at all, so you may not control diversity directly. Switch to a model that exposes sampling controls cleanly, or adjust top-p (nucleus sampling) where it is exposed.

Common wrong answer to avoid: "Best-of-N is broken" — the pattern works fine; the bug is the temperature setting. Diversity is what you're paying for, and temperature=0 kills it.

Q: Your reasoning task has high variance — sometimes correct, sometimes wildly wrong. Single-chain reasoning isn't reliable enough. What scaling do you reach for first? A: Cheapest first: self-consistency with majority vote at N=5, temperature=0.7. If you have a programmatic verifier (compile, unit test, schema), upgrade to Best-of-N with verifier selection. If you have budget and the task supports it, layer CISC (confidence-weighted vote) for ~46% sample reduction. Reserve PRM-graded selection for cases where you've trained a domain-specific PRM. The progression is: more samples → smarter selector → smarter scoring. Start at the cheap end of the curve.

Common wrong answer to avoid: "Train a PRM" as the first answer — PRMs are powerful but expensive to train and prone to reward hacking. Try self-consistency and verifier-based selection first; PRM is a later upgrade.

Apply now (5 min)¶

Pick one variable-quality task. Run it 8 times with temperature 0.7. Measure accuracy of: (a) random single sample, (b) majority vote, (c) confidence-weighted vote (CISC), (d) verifier-based best-of-N if you have a programmatic check. Plot the four against cost. The cheapest method that clears your quality bar is your production setting.

Sketch from memory: Draw the four-curve diagram (samples vs accuracy) for single-sample, majority vote, CISC, and verifier-selected best-of-N. Mark the typical knee at N≈4–8.

Bridge. Once you sample multiple paths, you're already thinking like a search system. Next: explicit search and verification — the move tree from the ELI5 made concrete. → 08-search-verification-move-tree.md