AI Engineering Playbook

Exercise 04 — Eval Harness

Timebox: 20-30 minutes

Goal

Turn evaluate.py into a simple but reusable eval harness for an agent or LLM-backed feature.

Work in

evaluate.py

Tasks

Expand the gold set to 10-20 cases.
Improve the scoring rule beyond simple substring matching.
Categorize failures instead of reporting only pass/fail.
Export the results as a markdown table or JSON report.
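
The starting contents of evaluate.py are not shown here, so the following is only a sketch of how the scoring and categorization tasks might look. Every name in it (`GoldCase`, `precision_recall`, `token_f1`, `categorize`), the failure categories, and the 0.6 / 0.8 / 0.3 thresholds are assumptions made for illustration. Token-overlap F1 is one possible step up from substring matching, not the only one.

```python
import re
from collections import Counter
from dataclasses import dataclass


@dataclass
class GoldCase:
    """One gold-set entry (hypothetical shape; adapt to your evaluate.py)."""
    case_id: str
    prompt: str
    expected: str


def normalize(text: str) -> list[str]:
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def precision_recall(expected: str, actual: str) -> tuple[float, float]:
    """Token-level precision and recall of actual against expected."""
    exp, act = normalize(expected), normalize(actual)
    if not exp or not act:
        return 0.0, 0.0
    # Multiset intersection counts each shared token at most as often
    # as it appears on both sides.
    overlap = sum((Counter(exp) & Counter(act)).values())
    return overlap / len(act), overlap / len(exp)


def token_f1(expected: str, actual: str) -> float:
    """Harmonic mean of token precision and recall."""
    precision, recall = precision_recall(expected, actual)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def categorize(expected: str, actual: str, threshold: float = 0.6) -> str:
    """Return 'pass' or an illustrative failure category."""
    if not actual.strip():
        return "empty_output"
    _, recall = precision_recall(expected, actual)
    f1 = token_f1(expected, actual)
    if f1 >= threshold:
        return "pass"
    if recall >= 0.8:
        # The expected tokens are present but buried in extra text.
        return "too_verbose"
    if f1 >= 0.3:
        return "partial_match"
    return "wrong_answer"


if __name__ == "__main__":
    case = GoldCase("capital-fr", "What is the capital of France?", "Paris")
    print(categorize(case.expected, "paris."))
    print(categorize(case.expected, "London"))
```

Each category is meant to hint at a different fix: `too_verbose` points at the prompt or output format, `empty_output` at the call path, and `wrong_answer` at the model or retrieval step. Tune the thresholds against your own gold set before relying on them.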

Done when¶

Running the script reports an overall pass rate.
Failures are grouped by category, so each group suggests a concrete fix.
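
One way to check both conditions is to have the script build a summary and print it. The sketch below assumes each result is a dict with hypothetical `case_id` and `category` keys, where `category` is `"pass"` for a passing case. The function names and report layout are illustrative, not taken from evaluate.py.

```python
import json
from collections import defaultdict


def summarize(results: list[dict]) -> dict:
    """Compute the pass rate and group failing case ids by category."""
    failures = defaultdict(list)
    for result in results:
        if result["category"] != "pass":
            failures[result["category"]].append(result["case_id"])
    total = len(results)
    passed = total - sum(len(ids) for ids in failures.values())
    return {
        "total": total,
        "passed": passed,
        "pass_rate": passed / total if total else 0.0,
        "failures": dict(failures),
    }


def to_markdown(summary: dict) -> str:
    """Render the summary as a pass-rate line plus a markdown table."""
    lines = [
        f"Pass rate: {summary['passed']}/{summary['total']} "
        f"({summary['pass_rate']:.0%})",
        "",
        "| Failure category | Count | Case ids |",
        "| --- | --- | --- |",
    ]
    # Largest failure groups first, so the most common problem leads.
    ordered = sorted(summary["failures"].items(), key=lambda kv: -len(kv[1]))
    for category, ids in ordered:
        lines.append(f"| {category} | {len(ids)} | {', '.join(ids)} |")
    return "\n".join(lines)


if __name__ == "__main__":
    # Made-up results, only to show the report shape.
    demo = [
        {"case_id": "c1", "category": "pass"},
        {"case_id": "c2", "category": "wrong_answer"},
        {"case_id": "c3", "category": "pass"},
        {"case_id": "c4", "category": "too_verbose"},
    ]
    summary = summarize(demo)
    print(to_markdown(summary))
    print(json.dumps(summary, indent=2))
```

Keeping `summarize` separate from the renderers means the same summary dict can feed both the markdown table and the JSON report.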