08. Search, Verification, and the Move Tree — Generate options, score them, prune ruthlessly¶

~11 min read. A hard problem is a tree, not a line. Reasoning systems that win do it by exploring branches and trusting a strong verifier — not by writing longer chains.

Built on the ELI5 in 00-eli5.md. the move tree — multiple candidate paths explored, scored, and pruned — turns extra compute into explicit search. the backtrack is the natural move when a branch collapses.

The tree picture¶

A single chain is a straight road. Hard reasoning is rarely a road; it's a junction, then another junction, then another. At each branch we generate several options. Some are silly. Some violate constraints. Some are promising. Search means we do not marry the first option. Verification means we do not trust a branch just because it is long.

start
 ├── path A
 │    ├── A1 ✗ (verifier rejects)
 │    └── A2 ✓ → expand
 │         ├── A2a ✓ → expand
 │         └── A2b ✗
 ├── path B
 │    ├── B1 ?
 │    └── B2 ✗
 └── path C
      ├── C1 ✓ → expand
      └── C2 ?

That is the chess picture returning in full. the move tree from the ELI5 made concrete. Each ✗ triggers the backtrack; the verifier decides what survives.

Search alone is weak — you need a verifier¶

A tree of unranked options is just expensive noise. You need a score on each branch.

Verifier type	Cost	When to use
Programmatic (compile, unit test, schema, type check)	Cheap	Code generation, structured output, tool args
Domain rule engine	Cheap	Tax rules, policy compliance, accounting
Self-judge (model scores its own outputs)	Medium	Open-ended quality scoring, with calibration
Pairwise model judge	Medium	Comparing two candidate outputs
Trained ORM (Outcome Reward Model)	Medium-high	Math, code, fact-verification
Trained PRM (Process Reward Model)	High to train, fast to use	Multi-step reasoning credit hands_on_lab
Human-in-loop	Very high	High-stakes irreversible actions

Search generates options. Verification kills weak options. The better the verifier, the more useful the search. A weak verifier with a wide search wastes budget — you sample 32 candidates and pick the wrong one. A strong verifier with a narrow search is the production pattern.

Look. Good search is guided exploration, not random overthinking. The verifier is what makes it guided.

Worked example: pruning the explosion¶

A branching factor of 3 and depth 4. Full tree:

level 0: 1
level 1: 3
level 2: 9
level 3: 27
level 4: 81
Total nodes: 121

Now prune to top-2 branches after each level:

level 0: 1
level 1: 3 explored
level 2: 2 × 3 = 6 explored
level 3: 2 × 2 × 3 = 12 explored
level 4: 2 × 2 × 2 × 3 = 24 explored
Total: 46

The verifier cut the search burden from 121 to 46 — about 60% saved. At depth 6 the saving grows to 90%+. That is why even an imperfect verifier earns its keep.

Now imagine each node is one model call costing $0.05. Unpruned: $6.05 per problem. Pruned with a 50× cheaper verifier model: ~$2.30. The verifier pays for itself many times over.

PRM vs ORM in production¶

Outcome Reward Model (ORM) — grades the final answer only. Easy to train (compare to ground truth), resists reward hacking (because the ground truth is fixed), but provides poor credit hands_on_lab over long chains.

Process Reward Model (PRM) — grades each step of the chain. Better credit hands_on_lab, lets you prune mid-chain. But harder to train (needs step-level labels), and more vulnerable to reward hacking — the policy can learn to produce step text that the PRM rewards even when the answer is wrong.

The 2025 survey (arXiv 2510.08049) showed domain-specific PRMs (Fin-PRM for finance, OR-PRM for operations research) lift Best-of-N by 10–13 points over generic PRMs. In production:

Use ORM-style verifiers (compile, test, schema, ground-truth match) when you have them. Strongest signal.
Use PRM for mid-chain pruning when you've trained one on your domain.
Mix them: PRM for pruning at depth, ORM at the leaves.

Where search shows up in real systems¶

System	Search pattern	Verifier
Cursor Composer / GitHub Copilot agent	Edit → run → test → fix loop	Compile + unit tests
Math olympiad solvers (GPT-5.5 Pro, Grok 4 Heavy)	Parallel sampled chains	Math equality check + PRM
Perplexity Deep Research	Multiple retrieval paths, synthesis branches	Citation overlap + LLM judge
AlphaCode-style code synthesis	1000+ candidate programs	Hidden test suite
Robot planning (Boston Dynamics, Figure AI)	Action sequence search	Physics simulator
Chess engines (Stockfish, Leela)	Alpha-beta or MCTS	Neural value head
LangGraph reflection agents	Plan → critique → revise	Self-critique model

The pattern repeats: generate options, evaluate, expand the promising branch, back up when the branch collapses. That is the backtrack working inside the move tree. Same shape, different domains.

Engineering rules for production search¶

Always bound the budget. Max depth, max nodes, max wall-clock. Unbounded search will eat your inference budget on one bad request.
Always have a verifier. Even a cheap heuristic beats no scoring. "Does the answer pass the JSON schema?" filters 30% of failures for free.
Cheap tests early, expensive tests late. Run compile before unit tests, run unit tests before semantic LLM judges.
Cache repeated subproblems. A chain that revisits the same state should reuse the prior verifier score.
Log where branches die. Failure clustering tells you where your search policy is weak.
Measure whether search actually beats a strong single-pass reasoner. Sometimes Opus 4.7 with high effort + tools beats your hand-rolled Tree-of-Thoughts at 1/10 the cost. Always benchmark search vs simpler baselines.

That last point matters. Search is powerful and expensive. The wrong place to use it: easy tasks where a single fast pass suffices. The right place: branching tasks where one early commitment is fatal and a cheap verifier exists.

Code skeleton: Best-of-N with verifier¶

async def best_of_n_with_verifier(client, prompt, n=4, model="claude-sonnet-4-6"):
    # Generate N candidates with thermal diversity
    cands = await asyncio.gather(*[
        client.messages.create(
            model=model,
            max_tokens=4000,
            temperature=0.7,
            thinking={"type": "enabled", "effort": "medium"},
            messages=[{"role": "user", "content": prompt}],
        )
        for _ in range(n)
    ])

    # Score each candidate with verifier (cheaper model or programmatic check)
    scored = [(c, verifier_score(c.content[-1].text)) for c in cands]
    return max(scored, key=lambda s: s[1])[0]

def verifier_score(text: str) -> float:
    # Cheapest first: schema check
    if not passes_schema(text): return 0.0
    # Then programmatic: does the code compile?
    if not compiles(extract_code(text)): return 0.3
    # Then unit tests
    pass_rate = run_tests(extract_code(text))
    if pass_rate < 1.0: return 0.5 + 0.5 * pass_rate
    # All checks passed
    return 1.0

Notice the cascade inside the verifier — cheap checks first. That alone halves verifier cost in most workloads.

Where this lives in the wild¶

Grok 4 Heavy — runs parallel sampled paths and selects with an internal scorer; hit 100% AIME 2025 and 88.4% GPQA Diamond by treating "more parallel paths" as the headline scaling axis.
AlphaCode (DeepMind, code generation) — generates 1000s of candidate programs, filters by passing hidden tests, then clusters and selects. Pure compile-and-test verification on a massive candidate pool.
GitHub Copilot agent mode — edit → compile → test → fix repair loops effectively search patch space; build/test results are the verifier; capped at ~8 iterations to bound cost.
Perplexity Deep Research / Computer — branches across multiple retrieval queries and synthesis paths; uses citation overlap and content alignment as the verifier; reports ~3-min average task time.
Chess and Go engines — the canonical instance of explicit move-tree search with a neural value head as the verifier; the mental model that gave reasoning models their pause-before-moving framing.

Pause and recall¶

Why is search without a verifier "expensive wandering"?
In the depth-4, branching-3 example, what fraction of nodes did pruning save?
Name two reasons ORM resists reward hacking better than PRM.
What does the AlphaCode pattern (1000 candidates → hidden tests) tell you about when verifier-based selection beats one strong chain?

Interview Q&A¶

Q: Your team is considering Tree-of-Thoughts for a planning task. When does it justify the cost over a single high-effort reasoning call? A: Three conditions. First, the task has real branches — distinct valid paths with different consequences, not just one path with multiple phrasings. Second, a cheap verifier exists — programmatic check, schema, simulator, or rule engine. Third, the curve says so — you've benchmarked single-chain high-effort against ToT on your golden set and ToT wins by enough to justify ~5× cost. If any condition fails, the simpler reasoning model with high effort usually wins on cost-quality. ToT in production is rarer than papers suggest.

Common wrong answer to avoid: "ToT is always more thorough than single chains" — without a strong verifier, ToT is N parallel chains that all might be wrong. The verifier is what makes the tree useful.

Q: Why do PRMs help on long math chains but hurt on open-ended tasks? A: PRMs grade individual steps against learned-quality criteria. On math, "is this algebraic step valid?" is well-defined and the PRM can be trained accurately. On open-ended tasks (essay quality, creative writing, ambiguous planning), "is this step good?" is ill-defined; the PRM ends up scoring surface features (length, vocabulary, formatting) rather than true progress, and the policy learns to game those features — classic reward hacking. ORMs on outcome ground truth resist this because the policy can't game a fixed correct answer. The rule: PRMs need well-defined step-quality criteria, which math and code provide and creative writing does not.

Common wrong answer to avoid: "PRMs are always better than ORMs because they give finer feedback" — finer feedback only helps when it points at the right thing. In open-ended domains the finer feedback points at noise.

Q: AlphaCode generated 1M candidates and selected with hidden tests. Why isn't this the production pattern for general code generation? A: Three reasons. Verifier strength — AlphaCode had a curated hidden test suite per problem; production code-gen tasks usually don't have ground-truth tests at the moment of generation. Cost — 1M candidates at $0.05 each is $50K per problem; only competitive-programming research can afford that. Latency — even parallelised, the verifier is the bottleneck; users won't wait. Production code-gen uses 4–8 candidates with compile-and-test verification — same pattern, 5 orders of magnitude smaller. The lesson scales; the budget doesn't.

Common wrong answer to avoid: "AlphaCode's results don't transfer to production" — they transfer at small N. The architectural lesson (generate + verify + select) is exactly what Cursor and Copilot do.

Q: A reviewer says "Best-of-4 is just sampling noise." How do you defend it? A: With three numbers. First, the quality lift on your golden set — typical 5–15 points on hard reasoning tasks with a working verifier. Second, the cost ratio vs accuracy gain — at $0.40 for 4 candidates vs $0.10 for one, the gain has to clear $0.30 of error cost; on a high-stakes task that's often easy. Third, the failure-mode change — Best-of-N reduces tail failures (the catastrophic wrong answer) more than it lifts mean accuracy, which matters for production. Show the histogram, not just the mean.

Common wrong answer to avoid: "Best-of-N always lifts accuracy" — without measurement on your task, that's a guess. The right defense is data: lift, cost, and tail-risk change.

Apply now (5 min)¶

Take one branching task in your stack. Sketch its move tree to depth 3. For each level, write the cheapest verifier you could attach: schema? compile? unit test? rule engine? Calculate the expected cost at branching factor 3, pruning to top-2 each level, with verifier cost = 1/10 of generator cost. Compare against one high-effort single chain. Decide which wins.

Sketch from memory: Draw the tree with ✓ and ✗ annotations. Mark where the verifier kills branches and where the backtrack kicks in.

Bridge. Search is powerful and expensive. So the real production question is not "can we reason?" but "when does this task deserve to pay for reasoning at all?" That is routing. → 09-task-routing-patterns.md