Exercise 04 — Eval Harness

Timebox: 20-30 minutes

Goal

Turn evaluate.py into a simple but reusable eval harness for an agent or LLM-backed feature.

Work in

  • evaluate.py

Tasks

  1. Expand the gold set to 10-20 cases.
  2. Improve the scoring rule beyond simple substring matching.
  3. Categorize failures instead of only pass/fail.
  4. Export the results as a markdown table or JSON report.
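One possible shape for tasks 2 to 4 is sketched below. It is a minimal illustration, not the intended solution: the `Case` and `Result` schemas, the token-overlap scoring rule, the 0.8 threshold, the category names, and the `run_model` callable are all assumptions made for this example, and the existing contents of evaluate.py may look quite different.

```python
import json
import re
from dataclasses import dataclass, asdict


@dataclass
class Case:
    """One gold-set entry (hypothetical schema)."""
    case_id: str
    prompt: str
    expected: str


@dataclass
class Result:
    case_id: str
    passed: bool
    score: float
    category: str  # "ok", "empty", "refusal", "partial", or "wrong"


def tokens(text: str) -> set:
    """Lowercase the text and split it into alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def score(expected: str, actual: str) -> float:
    """Fraction of expected tokens that also appear in the actual output."""
    want = tokens(expected)
    if not want:
        return 1.0
    return len(want & tokens(actual)) / len(want)


def categorize(actual: str, s: float, threshold: float) -> str:
    """Assign an illustrative failure category; "ok" means the case passed."""
    if s >= threshold:
        return "ok"
    if not actual.strip():
        return "empty"
    if re.search(r"\b(i can't|i cannot|unable to)\b", actual.lower()):
        return "refusal"
    return "partial" if s > 0 else "wrong"


def evaluate(cases, run_model, threshold=0.8):
    """Run every case through run_model (any callable: prompt -> str)."""
    results = []
    for case in cases:
        actual = run_model(case.prompt)
        s = score(case.expected, actual)
        category = categorize(actual, s, threshold)
        results.append(Result(case.case_id, category == "ok", round(s, 2), category))
    return results


def to_json(results) -> str:
    """Serialize results as a JSON report."""
    return json.dumps([asdict(r) for r in results], indent=2)


def to_markdown(results) -> str:
    """Render results as a markdown table."""
    lines = ["| case | passed | score | category |", "| --- | --- | --- | --- |"]
    for r in results:
        passed = "yes" if r.passed else "no"
        lines.append(f"| {r.case_id} | {passed} | {r.score:.2f} | {r.category} |")
    return "\n".join(lines)
```

Token overlap is only one step up from substring matching: it ignores case, punctuation, and word order, but it still cannot judge paraphrases. Passing `run_model` in as a callable keeps the harness reusable, since a stub function can stand in for the real agent while the gold set is being built.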

Done when

  • You can run the script and get a pass rate
  • Failures are grouped in a way that suggests fixes
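As a rough check against the criteria above, the summary step could look like the following sketch. It assumes each result is a dict with `case_id` and `category` keys, where the category `"ok"` marks a pass; both the schema and the function names are hypothetical.

```python
from collections import defaultdict


def summarize(results):
    """Return (pass_rate, failures), where failures maps category -> case ids."""
    results = list(results)
    failures = defaultdict(list)
    for r in results:
        if r["category"] != "ok":
            failures[r["category"]].append(r["case_id"])
    failed = sum(len(ids) for ids in failures.values())
    pass_rate = (len(results) - failed) / len(results) if results else 0.0
    return pass_rate, dict(failures)


def print_summary(results):
    """Print the pass rate, then failure groups with the largest group first."""
    pass_rate, failures = summarize(results)
    print(f"pass rate: {pass_rate:.0%}")
    for category, ids in sorted(failures.items(), key=lambda kv: -len(kv[1])):
        print(f"{category} ({len(ids)}): {', '.join(ids)}")
```

Listing the largest failure group first makes it easier to see which kind of fix would move the pass rate most.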