05. Assignment 12: Reasoning Router + Evaluation Harness

Week 12. Build a small system that decides when to escalate from a fast model to a reasoning model.

Goal

Ship a benchmark and a router that together answer one practical question: when is the extra thinking budget worth the cost and latency?

What you are building

A system with three paths:
1. Fast baseline: standard model, direct prompt
2. Fast + helper path: standard model with CoT, retry, or verifier loop
3. Reasoning path: reasoning model for hard cases

Then add a router that decides when to escalate.

Requirements

Dataset — 60-100 tasks split across easy, medium, and hard buckets
Task mix — include at least 3 categories:
    math / logic
    SQL / code / structured reasoning
    planning / policy / multi-constraint text tasks
Baselines — evaluate all three paths on the same dataset
Router — cheap-first, escalate on heuristics, low confidence, or verifier failure
Evaluation — report quality, latency, cost, and routing behaviour
Failure taxonomy — identify 5-7 recurring reasoning failures
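
The dataset requirements above can be checked mechanically before any model is run. The following is a minimal sketch, assuming each record in tasks.jsonl carries `difficulty` and `category` fields; those field names, the bucket labels, and the `validate_tasks` helper are illustrative assumptions, not a schema prescribed by the assignment.

```python
import json
from collections import Counter

# Hypothetical bucket labels and field names; the assignment does not fix a schema.
BUCKETS = {"easy", "medium", "hard"}


def validate_tasks(lines):
    """Return a list of human-readable problems found in an iterable of JSONL lines."""
    tasks = [json.loads(line) for line in lines if line.strip()]
    problems = []

    # Requirement: 60-100 tasks in total.
    if not 60 <= len(tasks) <= 100:
        problems.append(f"expected 60-100 tasks, found {len(tasks)}")

    # Requirement: tasks split across easy, medium, and hard buckets.
    by_bucket = Counter(t.get("difficulty") for t in tasks)
    missing = BUCKETS - set(by_bucket)
    if missing:
        problems.append(f"empty difficulty buckets: {sorted(missing)}")
    unknown = set(by_bucket) - BUCKETS
    if unknown:
        problems.append(f"unknown difficulty labels: {sorted(map(str, unknown))}")

    # Requirement: at least 3 task categories.
    categories = {t.get("category") for t in tasks} - {None}
    if len(categories) < 3:
        problems.append(f"expected at least 3 categories, found {len(categories)}")

    return problems
```

An empty return value means the file satisfies the size, bucket, and category requirements; it says nothing about the quality of the reference answers or rubrics.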

Recommended architecture

request -> fast model -> verifier -> ship
                     \-> if fail or hard -> reasoning model -> verifier -> ship / human
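
One possible reading of the flow above, as a minimal sketch: the models, the verifier, and the difficulty heuristic are passed in as plain callables, and all names (`Route`, `route_request`, `looks_hard`) are hypothetical. This version skips the fast model entirely when the heuristic flags a task as hard, which is only one interpretation of the "if fail or hard" branch.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Route:
    answer: Optional[str]  # None when the task is handed to a human
    path: str              # "fast", "reasoning", or "human"
    escalated: bool


def route_request(
    task: str,
    fast_model: Callable[[str], str],
    reasoning_model: Callable[[str], str],
    verifier: Callable[[str, str], bool],
    looks_hard: Callable[[str], bool] = lambda task: False,
) -> Route:
    """Cheap-first routing: try the fast model, escalate on failure or a hard-task hint."""
    if not looks_hard(task):
        draft = fast_model(task)
        if verifier(task, draft):
            return Route(draft, "fast", escalated=False)

    # Fast path was skipped or failed verification: pay for the reasoning model.
    answer = reasoning_model(task)
    if verifier(task, answer):
        return Route(answer, "reasoning", escalated=True)

    # Both paths failed verification: hand off to a human instead of shipping.
    return Route(None, "human", escalated=True)
```

Keeping the models and verifier as injected callables lets the same routing logic run against stubs in tests and against real model clients in the benchmark.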

Deliverables

tasks.jsonl — labelled task set with difficulty bucket and reference answer / rubric
run_baselines.py — executes the three baseline strategies
router.py — routing logic with escalation criteria
evaluate.py — computes task success, latency, cost, and routing metrics
results/summary.md — headline comparison table and recommendation
results/failures.md — failure taxonomy with examples
README.md — methodology, prompts, limitations, and what you learned

Minimum metrics to report

Metric	Why it matters
Overall success rate	top-line quality
Hard-slice success rate	where reasoning should earn its keep
Average latency	user experience
p95 latency	tail pain
Cost per request	spend control
Cost per solved task	quality-adjusted spend
Escalation rate	router aggressiveness
Verifier failure rate	cheap quality signal
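
The metrics in the table can all be derived from one per-request log. The sketch below assumes a hypothetical `RunRecord` with the fields shown and uses the nearest-rank definition of p95, which is one of several percentile conventions; none of these names come from the assignment itself.

```python
import math
from dataclasses import dataclass


@dataclass
class RunRecord:
    solved: bool
    latency_s: float
    cost_usd: float
    escalated: bool
    verifier_failed: bool  # True if the first verifier check rejected the answer
    difficulty: str        # "easy", "medium", or "hard"


def p95(values):
    """Nearest-rank 95th percentile of a non-empty list."""
    ordered = sorted(values)
    rank = math.ceil(95 * len(ordered) / 100)
    return ordered[rank - 1]


def summarize(records):
    """Compute the minimum metrics from a non-empty list of RunRecord."""
    n = len(records)
    solved = sum(r.solved for r in records)
    hard = [r for r in records if r.difficulty == "hard"]
    total_cost = sum(r.cost_usd for r in records)
    return {
        "success_rate": solved / n,
        "hard_success_rate": sum(r.solved for r in hard) / len(hard) if hard else None,
        "avg_latency_s": sum(r.latency_s for r in records) / n,
        "p95_latency_s": p95([r.latency_s for r in records]),
        "cost_per_request": total_cost / n,
        # Total spend divided by solved tasks, so failed attempts still count as cost.
        "cost_per_solved_task": total_cost / solved if solved else None,
        "escalation_rate": sum(r.escalated for r in records) / n,
        "verifier_failure_rate": sum(r.verifier_failed for r in records) / n,
    }
```

Cost per solved task divides total spend, including spend on failed requests, by the number of solved tasks, which is what makes it a quality-adjusted figure rather than a per-request average.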

Suggested experiments

Experiment	Compare	What to learn
Prompting	direct vs zero-shot CoT vs few-shot CoT	when prompting alone is enough
Model choice	fast model vs reasoning model	hard-task lift
Router policy	rule-based vs verifier-driven	best escalation trigger
Budgeting	short vs long reasoning budget	diminishing returns
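
For the budgeting experiment, one way to make "diminishing returns" concrete is to look at the change in success rate per extra unit of reasoning budget between consecutive settings. This is a minimal sketch: the `marginal_gains` helper and its input format are assumptions, and the measurements must come from your own runs.

```python
def marginal_gains(points):
    """Compute success-rate gain per extra unit of budget between consecutive settings.

    points: iterable of (reasoning_budget, success_rate) pairs, in any order.
    Returns a list of (budget_from, budget_to, gain_per_unit) tuples.
    """
    ordered = sorted(points)
    gains = []
    for (b0, s0), (b1, s1) in zip(ordered, ordered[1:]):
        if b1 == b0:
            raise ValueError("duplicate budget values")
        gains.append((b0, b1, (s1 - s0) / (b1 - b0)))
    return gains
```

If the gain per unit shrinks as the budget grows, the longer budget is buying less quality per token, which is the signal this experiment is meant to surface.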

Success criteria

The reasoning path clearly beats the fast baseline on the hard slice.
The router is cheaper than always using the reasoning model.
The router preserves most of the hard-task quality gain.
You can explain at least 3 cases where CoT helped and 3 where it did not.
Your failure taxonomy contains concrete examples, not vague labels.
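
The first three criteria can be turned into explicit checks. The sketch below is one way to do it, under stated assumptions: "preserves most of the gain" is read as retaining more than half of the hard-slice lift (the `min_retention` default), and "beats" is read as any positive lift. Both thresholds, and the function name, are hypothetical and should be replaced by whatever you justify in your README.

```python
def check_success_criteria(
    fast_hard,        # hard-slice success rate of the fast baseline
    reasoning_hard,   # hard-slice success rate of the always-reasoning path
    router_hard,      # hard-slice success rate of the router
    router_cost,      # cost per request of the router
    reasoning_cost,   # cost per request of the always-reasoning path
    min_retention=0.5,  # assumed meaning of "most" of the gain
):
    """Return the first three success criteria as booleans plus the retention ratio."""
    lift = reasoning_hard - fast_hard
    # Fraction of the reasoning model's hard-slice lift that the router keeps.
    retention = (router_hard - fast_hard) / lift if lift > 0 else None
    return {
        "reasoning_beats_fast_on_hard": lift > 0,
        "router_cheaper_than_always_reasoning": router_cost < reasoning_cost,
        "gain_retention": retention,
        "router_preserves_most_gain": retention is not None and retention > min_retention,
    }
```

A positive lift on a small hard slice may still be noise, so a check like this supplements, rather than replaces, looking at the slice size and the individual failures.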

Optional stretch goals

Add a judge model for qualitative comparison, but keep outcome metrics primary.
Add confidence calibration or abstention.
Add a human-review bucket for high-risk tasks.
Distill the best reasoning outputs into a cheaper prompt or smaller model.

Why this matters

This hands-on lab teaches the production skill that matters most here: not merely using a stronger model, but deciding when to pay for stronger reasoning. That cost-versus-quality decision is the bridge into future multimodal and agent systems.