Exercise 04 — Eval Harness
Timebox: 20-30 minutes
Goal
Turn evaluate.py into a simple but reusable eval harness for an agent or LLM-backed feature.
Work in
evaluate.py
Tasks
- Expand the gold set to 10-20 cases.
- Improve the scoring rule beyond simple substring matching.
- Categorize failures instead of reporting only pass/fail.
- Export the results as a Markdown table or a JSON report.
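One way to approach the scoring and categorization tasks above is sketched below. It is only an illustration: the function names (`normalize`, `score`, `categorize`), the failure category labels, and the 0.8 threshold are assumptions made for this sketch, not something taken from `evaluate.py`. It swaps substring matching for a normalized comparison with a fuzzy fallback from the standard library's `difflib`.

```python
import re
from difflib import SequenceMatcher


def normalize(text: str) -> str:
    """Lowercase, replace punctuation with spaces, collapse whitespace."""
    text = re.sub(r"[^\w\s]", " ", text.lower())
    return " ".join(text.split())


def score(expected: str, actual: str) -> float:
    """Return a similarity in [0, 1]; 1.0 means equal after normalization."""
    exp, act = normalize(expected), normalize(actual)
    if exp == act:
        return 1.0
    # Fuzzy fallback: character-level similarity ratio.
    return SequenceMatcher(None, exp, act).ratio()


def categorize(expected: str, actual: str, threshold: float = 0.8) -> str:
    """Map one result to 'pass' or a named failure category.

    The category names and the threshold are hypothetical; adjust them
    to the failure modes you actually see in your gold set.
    """
    exp, act = normalize(expected), normalize(actual)
    if not act:
        return "empty_output"
    if score(expected, actual) >= threshold:
        return "pass"
    if exp in act:
        return "extra_content"  # expected answer buried in a longer reply
    if act in exp:
        return "incomplete"     # only part of the expected answer came back
    return "wrong_answer"
```

Each category hints at a different fix: `extra_content` often points at the prompt or output parsing, `incomplete` at truncation or token limits, and `wrong_answer` at the model or retrieval step. A character-level ratio is a crude measure, so for structured outputs a field-by-field comparison may suit better.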
Done when
- You can run the script and get a pass rate.
- Failures are grouped in a way that suggests fixes.
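As a rough picture of what meeting these two criteria could look like, the sketch below computes a pass rate and groups failing cases by category, then renders the summary as a Markdown table and as JSON. The result shape (a list of dicts with `case_id` and `category` keys) and the helper names `summarize` and `to_markdown` are assumptions for illustration; `evaluate.py` may structure its results differently.

```python
import json
from collections import defaultdict


def summarize(results: list[dict]) -> dict:
    """Compute the pass rate and group failing case ids by category.

    Each result is assumed to look like {"case_id": ..., "category": ...},
    where category is "pass" or a failure label (hypothetical shape).
    """
    failures = defaultdict(list)
    passed = 0
    for result in results:
        if result["category"] == "pass":
            passed += 1
        else:
            failures[result["category"]].append(result["case_id"])
    total = len(results)
    return {
        "total": total,
        "passed": passed,
        "pass_rate": passed / total if total else 0.0,
        "failures": dict(failures),
    }


def to_markdown(summary: dict) -> str:
    """Render the summary as a pass-rate line plus a failure table."""
    lines = [
        f"Pass rate: {summary['passed']}/{summary['total']} "
        f"({summary['pass_rate']:.0%})",
        "",
        "| Category | Count | Case ids |",
        "| --- | --- | --- |",
    ]
    # Largest failure groups first, so the most common problem leads.
    ordered = sorted(summary["failures"].items(), key=lambda kv: -len(kv[1]))
    for category, ids in ordered:
        lines.append(f"| {category} | {len(ids)} | {', '.join(ids)} |")
    return "\n".join(lines)


if __name__ == "__main__":
    # Made-up results, only to show the report format.
    demo = [
        {"case_id": "c1", "category": "pass"},
        {"case_id": "c2", "category": "wrong_answer"},
        {"case_id": "c3", "category": "wrong_answer"},
        {"case_id": "c4", "category": "empty_output"},
    ]
    summary = summarize(demo)
    print(to_markdown(summary))
    print(json.dumps(summary, indent=2))
```

Sorting the table by group size puts the most frequent failure mode at the top, which is usually the first one worth fixing.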