05. Assignment 12: Reasoning Router + Evaluation Harness
Week 12. Build a small system that decides when to escalate from a fast model to a reasoning model.
Goal
Ship a benchmark and a router that together answer one practical question: when is the extra thinking budget worth the cost and latency?
What you are building
A system with three paths:
1. Fast baseline: standard model, direct prompt
2. Fast + helper path: standard model with CoT, retry, or verifier loop
3. Reasoning path: reasoning model for hard cases
Then add a router that decides when to escalate.
Requirements
- Dataset — 60-100 tasks split across easy, medium, and hard buckets
- Task mix — include at least 3 categories:
  - math / logic
  - SQL / code / structured reasoning
  - planning / policy / multi-constraint text tasks
- Baselines — evaluate all three paths on the same dataset
- Router — cheap-first, escalate on heuristics, low confidence, or verifier failure
- Evaluation — report quality, latency, cost, and routing behaviour
- Failure taxonomy — identify 5-7 recurring reasoning failures
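
The dataset requirements above can be checked mechanically before any baseline runs. The following is a minimal sketch, not a prescribed schema: the field names (`id`, `bucket`, `category`, `prompt`, `reference`) and the helper names `parse_tasks` and `validate_tasks` are assumptions made for illustration.

```python
import json
from collections import Counter

# Assumed labels and fields; the assignment names the buckets but not a schema.
BUCKETS = {"easy", "medium", "hard"}
REQUIRED_FIELDS = {"id", "bucket", "category", "prompt", "reference"}


def parse_tasks(lines):
    """Parse JSONL lines into task dicts, skipping blank lines."""
    return [json.loads(line) for line in lines if line.strip()]


def validate_tasks(tasks, min_tasks=60, max_tasks=100, min_categories=3):
    """Return a list of problems; an empty list means the task set passes."""
    problems = []
    if not min_tasks <= len(tasks) <= max_tasks:
        problems.append(f"expected {min_tasks}-{max_tasks} tasks, found {len(tasks)}")
    for task in tasks:
        missing = REQUIRED_FIELDS - set(task)
        if missing:
            problems.append(f"task {task.get('id', '?')} is missing {sorted(missing)}")
    # Every difficulty bucket must be represented at least once.
    bucket_counts = Counter(task.get("bucket") for task in tasks)
    for bucket in sorted(BUCKETS - set(bucket_counts)):
        problems.append(f"no tasks in bucket '{bucket}'")
    # Count distinct category labels, ignoring tasks that lack one.
    categories = {task.get("category") for task in tasks} - {None}
    if len(categories) < min_categories:
        problems.append(
            f"expected at least {min_categories} categories, found {len(categories)}"
        )
    return problems
```

With a file on disk, `validate_tasks(parse_tasks(open("tasks.jsonl")))` would return the list of problems to fix before evaluating anything.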
Recommended architecture
request -> fast model -> verifier -> ship
\-> if fail or hard -> reasoning model -> verifier -> ship / human
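
The flow above can be sketched as a single function. This is one possible reading of the diagram, not a reference implementation: `fast_model`, `reasoning_model`, `verifier`, and `looks_hard` are hypothetical callables you would supply, and no real model API is assumed.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class RouteResult:
    answer: str
    path: str        # "fast", "reasoning", or "human"
    escalated: bool


def route(
    request: str,
    fast_model: Callable[[str], str],
    reasoning_model: Callable[[str], str],
    verifier: Callable[[str, str], bool],
    looks_hard: Callable[[str], bool] = lambda request: False,
) -> RouteResult:
    """Cheap-first routing: ship the fast answer unless it fails or the task looks hard."""
    answer = fast_model(request)
    if verifier(request, answer) and not looks_hard(request):
        return RouteResult(answer, path="fast", escalated=False)
    # The fast answer failed verification, or a heuristic flagged the request as hard.
    answer = reasoning_model(request)
    if verifier(request, answer):
        return RouteResult(answer, path="reasoning", escalated=True)
    # Both paths failed verification: hand off rather than ship an unverified answer.
    return RouteResult(answer, path="human", escalated=True)
```

A variant worth comparing skips the fast call entirely when `looks_hard` fires, which saves one model call on requests that would escalate anyway.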
Deliverables
- `tasks.jsonl`: labelled task set with difficulty bucket and reference answer / rubric
- `run_baselines.py`: executes the three baseline strategies
- `router.py`: routing logic with escalation criteria
- `evaluate.py`: computes task success, latency, cost, and routing metrics
- `results/summary.md`: headline comparison table and recommendation
- `results/failures.md`: failure taxonomy with examples
- `README.md`: methodology, prompts, limitations, and what you learned
Minimum metrics to report
| Metric | Why it matters |
|---|---|
| Overall success rate | top-line quality |
| Hard-slice success rate | where reasoning should earn its keep |
| Average latency | user experience |
| p95 latency | tail pain |
| Cost per request | spend control |
| Cost per solved task | quality-adjusted spend |
| Escalation rate | router aggressiveness |
| Verifier failure rate | cheap quality signal |
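
The metrics in the table can all be derived from one record per request. The sketch below assumes a hypothetical `RunRecord` shape and uses the nearest-rank definition of p95; other percentile definitions interpolate and give slightly different values on small samples.

```python
import math
from dataclasses import dataclass


@dataclass
class RunRecord:
    bucket: str            # "easy", "medium", or "hard"
    solved: bool
    latency_s: float
    cost_usd: float
    escalated: bool
    verifier_failed: bool  # True if the fast answer failed verification


def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of data at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def summarize(records):
    """Compute the headline metrics for one strategy; assumes records is non-empty."""
    n = len(records)
    solved = sum(r.solved for r in records)
    hard = [r for r in records if r.bucket == "hard"]
    total_cost = sum(r.cost_usd for r in records)
    latencies = [r.latency_s for r in records]
    return {
        "success_rate": solved / n,
        "hard_success_rate": sum(r.solved for r in hard) / len(hard) if hard else None,
        "avg_latency_s": sum(latencies) / n,
        "p95_latency_s": percentile(latencies, 95),
        "cost_per_request": total_cost / n,
        # Undefined when nothing was solved, so report None instead of dividing by zero.
        "cost_per_solved": total_cost / solved if solved else None,
        "escalation_rate": sum(r.escalated for r in records) / n,
        "verifier_failure_rate": sum(r.verifier_failed for r in records) / n,
    }
```

Cost per solved task divides total spend by solved tasks only, so a cheap strategy that rarely succeeds can still come out more expensive than a costly one that usually does.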
Suggested experiments
| Experiment | Compare | What to learn |
|---|---|---|
| Prompting | direct vs zero-shot CoT vs few-shot CoT | when prompting alone is enough |
| Model choice | fast model vs reasoning model | hard-task lift |
| Router policy | rule-based vs verifier-driven | best escalation trigger |
| Budgeting | short vs long reasoning budget | diminishing returns |
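
For the prompting experiment, it helps to generate all three variants from the same question so that only the prompt style changes. The templates below are illustrative assumptions, not required wording; `build_prompt` is a hypothetical helper, and the few-shot examples are whatever worked solutions you write yourself.

```python
def build_prompt(question, style="direct", examples=()):
    """Build one of three prompt variants for the same question.

    examples is a sequence of (question, worked_reasoning, answer) tuples,
    used only by the few-shot CoT variant.
    """
    if style == "direct":
        return f"Question: {question}\nAnswer:"
    if style == "zero_shot_cot":
        # Zero-shot CoT: a generic trigger phrase, no worked examples.
        return f"Question: {question}\nLet's think step by step."
    if style == "few_shot_cot":
        if not examples:
            raise ValueError("few_shot_cot needs at least one worked example")
        shots = "\n\n".join(
            f"Question: {q}\nReasoning: {r}\nAnswer: {a}" for q, r, a in examples
        )
        return f"{shots}\n\nQuestion: {question}\nReasoning:"
    raise ValueError(f"unknown prompt style: {style}")
```

Keeping the question text identical across styles means any difference in success rate can be attributed to the prompting strategy rather than to rewording.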
Success criteria
- The reasoning path clearly beats the fast baseline on the hard slice.
- The router is cheaper than always using the reasoning model.
- The router preserves most of the hard-task quality gain.
- You can explain at least 3 cases where CoT helped and 3 where it did not.
- Your failure taxonomy contains concrete examples, not vague labels.
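
The first three criteria are quantitative and can be checked from per-strategy summaries. This sketch assumes each summary is a dict with `hard_success_rate` and `cost_per_request` keys; the 0.8 threshold for "most" is an arbitrary placeholder, and a positive gain on a small hard slice is not by itself evidence of a clear win.

```python
def check_success_criteria(fast, reasoning, router, min_gain_retained=0.8):
    """Check the three quantitative criteria from per-strategy summary dicts.

    min_gain_retained is an assumed threshold for "most"; choose and justify your own.
    """
    # Lift of the reasoning path over the fast baseline on the hard slice.
    gain = reasoning["hard_success_rate"] - fast["hard_success_rate"]
    # Fraction of that lift the router keeps; undefined if there is no lift.
    retained = (
        (router["hard_success_rate"] - fast["hard_success_rate"]) / gain
        if gain > 0
        else None
    )
    return {
        "reasoning_beats_fast_on_hard": gain > 0,
        "router_cheaper_than_always_reasoning": (
            router["cost_per_request"] < reasoning["cost_per_request"]
        ),
        "hard_gain_retained": retained,
        "router_preserves_most_gain": (
            retained is not None and retained >= min_gain_retained
        ),
    }
```

For example, if the fast baseline solves 40% of hard tasks, the reasoning path 80%, and the router 75%, the router retains 0.35 / 0.40 = 87.5% of the gain.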
Optional stretch goals
- Add a judge model for qualitative comparison, but keep outcome metrics primary.
- Add confidence calibration or abstention.
- Add a human-review bucket for high-risk tasks.
- Distill the best reasoning outputs into a cheaper prompt or smaller model.
Why this matters
This assignment teaches the production skill that matters most here: not merely using a stronger model, but deciding when stronger reasoning is worth paying for. That judgment is the bridge into future multimodal and agent systems.