05. Assignment 12 — Reasoning Router + Evaluation Harness

Week 12. Build a small system that decides when to escalate from a fast model to a reasoning model.

Goal

Ship a benchmark and a router that together answer one practical question: when is the extra thinking budget worth the cost and latency?

What you are building

A system with three paths:

  1. Fast baseline: standard model, direct prompt
  2. Fast + helper path: standard model with CoT, retry, or verifier loop
  3. Reasoning path: reasoning model for hard cases

Then add a router that decides when to escalate.

Requirements

  1. Dataset: 60-100 tasks split across easy, medium, and hard buckets
  2. Task mix: cover at least 3 categories
     • math / logic
     • SQL / code / structured reasoning
     • planning / policy / multi-constraint text tasks
  3. Baselines: evaluate all three paths on the same dataset
  4. Router: cheap-first, escalate on heuristics, low confidence, or verifier failure
  5. Evaluation: report quality, latency, cost, and routing behaviour
  6. Failure taxonomy: identify 5-7 recurring reasoning failures

request -> fast model -> verifier -> ship
                     \-> if fail or hard -> reasoning model -> verifier -> ship / human
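
The flow above can be sketched in a few lines of Python. This is a minimal illustration, not a reference solution: `route`, `RouteResult`, `looks_hard`, and the three callables (`fast_model`, `reasoning_model`, `verifier`) are hypothetical names chosen for this example, and a real router would also record latency and cost for each call.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class RouteResult:
    answer: str
    path: str        # "fast", "reasoning", or "human"
    escalated: bool  # True if the reasoning model was called


def route(
    task: str,
    fast_model: Callable[[str], str],        # hypothetical: prompt -> answer
    reasoning_model: Callable[[str], str],   # hypothetical: prompt -> answer
    verifier: Callable[[str, str], bool],    # hypothetical: (task, answer) -> pass?
    looks_hard: Callable[[str], bool] = lambda task: False,  # optional heuristic
) -> RouteResult:
    # Cheap-first: try the fast model unless a heuristic flags the task up front.
    if not looks_hard(task):
        answer = fast_model(task)
        if verifier(task, answer):
            return RouteResult(answer, "fast", escalated=False)

    # Escalate: the task looked hard or the fast answer failed verification.
    answer = reasoning_model(task)
    if verifier(task, answer):
        return RouteResult(answer, "reasoning", escalated=True)

    # Both paths failed verification: hand off to a human.
    return RouteResult(answer, "human", escalated=True)
```

Passing the models and verifier in as plain callables keeps the routing policy testable with stubs, so escalation behaviour can be checked before any paid API calls are made.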

Deliverables

  1. tasks.jsonl — labelled task set with difficulty bucket and reference answer / rubric
  2. run_baselines.py — executes the three baseline strategies
  3. router.py — routing logic with escalation criteria
  4. evaluate.py — computes task success, latency, cost, and routing metrics
  5. results/summary.md — headline comparison table and recommendation
  6. results/failures.md — failure taxonomy with examples
  7. README.md — methodology, prompts, limitations, and what you learned
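
As one possible shape for tasks.jsonl, the sketch below parses one JSON object per line and checks that each task carries a difficulty bucket and a reference answer. The field names (`id`, `category`, `difficulty`, `prompt`, `reference`) and the helper names are assumptions made for this example; the assignment does not prescribe a schema.

```python
import json
from collections import Counter

# Assumed schema for illustration only; adapt to your own task format.
REQUIRED_FIELDS = {"id", "category", "difficulty", "prompt", "reference"}
DIFFICULTIES = {"easy", "medium", "hard"}


def parse_tasks(lines):
    """Parse JSONL lines into task dicts, failing loudly on malformed records."""
    tasks = []
    for n, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        task = json.loads(line)
        missing = REQUIRED_FIELDS - set(task)
        if missing:
            raise ValueError(f"line {n}: missing fields {sorted(missing)}")
        if task["difficulty"] not in DIFFICULTIES:
            raise ValueError(f"line {n}: unknown difficulty {task['difficulty']!r}")
        tasks.append(task)
    return tasks


def bucket_counts(tasks):
    """Count tasks per difficulty bucket to check the easy/medium/hard split."""
    return Counter(task["difficulty"] for task in tasks)


# Two made-up records standing in for the contents of tasks.jsonl.
SAMPLE_LINES = [
    '{"id": "t1", "category": "math", "difficulty": "easy", '
    '"prompt": "What is 2 + 2?", "reference": "4"}',
    '{"id": "t2", "category": "sql", "difficulty": "hard", '
    '"prompt": "Write a query that ...", "reference": "rubric: ..."}',
]

if __name__ == "__main__":
    print(bucket_counts(parse_tasks(SAMPLE_LINES)))
```

To read a real file, pass the open file handle directly, for example `parse_tasks(open("tasks.jsonl", encoding="utf-8"))`.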

Minimum metrics to report

| Metric | Why it matters |
| --- | --- |
| Overall success rate | top-line quality |
| Hard-slice success rate | where reasoning should earn its keep |
| Average latency | user experience |
| p95 latency | tail pain |
| Cost per request | spend control |
| Cost per solved task | quality-adjusted spend |
| Escalation rate | router aggressiveness |
| Verifier failure rate | cheap quality signal |
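
The metrics above can be computed from one record per request. The sketch below assumes a hypothetical record shape (`solved`, `difficulty`, `latency_s`, `cost_usd`, `escalated`, `verifier_failed`) and uses the nearest-rank method for p95; other percentile definitions give slightly different values on small samples. It also assumes `verifier_failed` marks requests whose first (fast-path) answer failed verification, which is one reasonable reading rather than a fixed definition.

```python
import math
from statistics import mean


def p95(values):
    """95th percentile by the nearest-rank method."""
    ordered = sorted(values)
    rank = math.ceil(0.95 * len(ordered))  # 1-based rank
    return ordered[rank - 1]


def summarize(runs):
    """Compute the minimum metrics from a non-empty list of per-request records."""
    n = len(runs)
    solved = sum(r["solved"] for r in runs)
    hard = [r for r in runs if r["difficulty"] == "hard"]
    total_cost = sum(r["cost_usd"] for r in runs)
    return {
        "success_rate": solved / n,
        "hard_success_rate": sum(r["solved"] for r in hard) / len(hard) if hard else None,
        "avg_latency_s": mean(r["latency_s"] for r in runs),
        "p95_latency_s": p95([r["latency_s"] for r in runs]),
        "cost_per_request": total_cost / n,
        # Total spend divided by solved tasks: failed attempts still cost money.
        "cost_per_solved": total_cost / solved if solved else None,
        "escalation_rate": sum(r["escalated"] for r in runs) / n,
        "verifier_failure_rate": sum(r["verifier_failed"] for r in runs) / n,
    }
```

Cost per solved task divides total spend, including spend on failed attempts, by the number of successes, which is why it can diverge sharply from cost per request when the success rate is low.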

Suggested experiments

| Experiment | Compare | What to learn |
| --- | --- | --- |
| Prompting | direct vs zero-shot CoT vs few-shot CoT | when prompting alone is enough |
| Model choice | fast model vs reasoning model | hard-task lift |
| Router policy | rule-based vs verifier-driven | best escalation trigger |
| Budgeting | short vs long reasoning budget | diminishing returns |

Success criteria

  • The reasoning path clearly beats the fast baseline on the hard slice.
  • The router is cheaper than always using the reasoning model.
  • The router preserves most of the hard-task quality gain.
  • You can explain at least 3 cases where CoT helped and 3 where it did not.
  • Your failure taxonomy contains concrete examples, not vague labels.
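
The first three criteria can be turned into a quick numeric check. The sketch below is one way to do it, under stated assumptions: the function names are hypothetical, and reading "most" as "more than half of the lift" (`min_retained=0.5`) is a choice made for this example, not a threshold set by the assignment.

```python
def gain_retained(fast_hard, router_hard, reasoning_hard):
    """Fraction of the reasoning model's hard-slice lift that the router keeps.

    All arguments are hard-slice success rates in [0, 1]. Returns None when
    the reasoning path does not beat the fast baseline (no lift to retain).
    """
    lift = reasoning_hard - fast_hard
    if lift <= 0:
        return None
    return (router_hard - fast_hard) / lift


def check_criteria(fast_hard, router_hard, reasoning_hard,
                   router_cost, reasoning_cost, min_retained=0.5):
    """Evaluate the three quantitative success criteria as booleans."""
    retained = gain_retained(fast_hard, router_hard, reasoning_hard)
    return {
        "reasoning_beats_fast_on_hard": reasoning_hard > fast_hard,
        "router_cheaper_than_always_reasoning": router_cost < reasoning_cost,
        # min_retained is an assumed interpretation of "most".
        "router_preserves_most_gain": retained is not None and retained >= min_retained,
    }


# Made-up numbers: the router keeps about three quarters of the lift
# at less than half the cost per request.
example = check_criteria(fast_hard=0.40, router_hard=0.70, reasoning_hard=0.80,
                         router_cost=0.03, reasoning_cost=0.08)
```

With these illustrative inputs all three checks come out true; the numbers are invented for the example and say nothing about what a real run will show.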

Optional stretch goals

  • Add a judge model for qualitative comparison, but keep outcome metrics primary.
  • Add confidence calibration or abstention.
  • Add a human-review bucket for high-risk tasks.
  • Distill the best reasoning outputs into a cheaper prompt or smaller model.

Why this matters

This assignment teaches the production skill that matters most here: not merely using a stronger model, but deciding when to pay for stronger reasoning. That is exactly the bridge into future multimodal and agent systems.