Eval Harness: Analysis
What this harness does
- 15-case gold set across 6 categories (geo, math, science, common-sense, literature, history).
- Per-case scoring with three matching levels: exact, substring (case + punctuation insensitive), and digit-only for numeric expectations.
- Failure categorisation into pass, refusal_or_unknown, wrong_answer, wrong_number.
- Aggregation by category and by failure type.
- Reports in markdown (human-readable) or JSON (machine-readable).
Run output
Pass rate: 14/15 = 93.3%
Per-category pass rate:
| literature | 50.0% | ← the one weak category
| (others) | 100% |
Failures by type:
| refusal_or_unknown | 1 |
The 50% pass rate in literature comes from a deliberately seeded failure: the mock agent answers "I'm not sure" to the Plato question. The harness catches the refusal and categorises it as refusal_or_unknown.
Why categorise failures, not just pass/fail
A 70% pass rate by itself is uninformative. The same number could mean:
- 30% of answers are subtly wrong (the model is hallucinating).
- 30% of answers are refusals (the model is being too cautious).
- 30% of answers are off by formatting (the answer is correct but the matcher is brittle).
Each requires a different fix:
- Hallucination → improve grounding, RAG, fact-checking.
- Over-refusal → tune the system prompt; loosen safety filters.
- Format mismatches → improve the matcher (which is what score() does via normalisation and digit extraction).
The per-category breakdown shows which kinds of question the system is bad at. If math is 50% and geo is 100%, the gap points to the model's arithmetic, not its knowledge base.
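A minimal sketch of how failure categorisation and per-category aggregation could look. The names categorise, aggregate and REFUSAL_MARKERS are hypothetical, as are the refusal phrases and the rule that a failed case with a digit in its expected answer counts as wrong_number; the harness's actual logic may differ.

```python
from collections import Counter, defaultdict

# Hypothetical refusal phrases; a real harness would tune this list.
REFUSAL_MARKERS = ("i'm not sure", "i don't know", "cannot answer")

def categorise(answer: str, expected: str, matched: bool) -> str:
    """Map one scored case to a failure type (assumed rules, see lead-in)."""
    if matched:
        return "pass"
    lowered = answer.lower()
    if not lowered.strip() or any(m in lowered for m in REFUSAL_MARKERS):
        return "refusal_or_unknown"
    if any(ch.isdigit() for ch in expected):
        return "wrong_number"
    return "wrong_answer"

def aggregate(results):
    """results: iterable of (category, failure_type) pairs."""
    by_category = defaultdict(lambda: [0, 0])  # category -> [passed, total]
    by_failure = Counter()
    for category, failure_type in results:
        by_category[category][1] += 1
        if failure_type == "pass":
            by_category[category][0] += 1
        else:
            by_failure[failure_type] += 1
    pass_rates = {c: p / t for c, (p, t) in by_category.items()}
    return pass_rates, dict(by_failure)
```

Keeping the failure type next to the category in each result is what lets the same run answer both "where does it fail?" and "how does it fail?".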
The scoring layers
score() has three matching steps:
- Normalise both strings. Lowercase, replace punctuation with spaces, collapse whitespace. So "New-Delhi," matches "new delhi".
- Check expected and all accepted alternatives. Some answers have multiple acceptable forms (Delhi vs. New Delhi; Shakespeare vs. William Shakespeare; 7 vs. seven).
- Numeric rescue. For digit-bearing answers, extract numbers and check if any match. So "3e8" matches "299792458" via accepted alternatives, but the rescue path also handles "approximately 3 × 10^8".
A brittle matcher (substring-only, case-sensitive) would inflate false negatives — counting correct answers as wrong because of formatting. A too-loose matcher would inflate false positives — counting wrong answers as correct because some token happened to appear. The three-layer matcher splits the difference: tolerant of formatting, strict on content.
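The three layers can be sketched as follows. This is an illustration under stated assumptions, not the harness's actual score(): the helper names, the return values, the whole-word variant of substring matching, and the restriction of numeric rescue to all-digit expectations are choices made for this example.

```python
from __future__ import annotations

import re
import string

def normalise(text: str) -> str:
    """Lowercase, turn punctuation into spaces, collapse whitespace."""
    table = str.maketrans({ch: " " for ch in string.punctuation})
    return " ".join(text.lower().translate(table).split())

def extract_numbers(text: str) -> set[str]:
    """Pull out digit runs, ignoring thousands separators (299,792,458)."""
    return set(re.findall(r"\d+", text.replace(",", "")))

def score(answer: str, expected: str, accepted: tuple[str, ...] = ()) -> str | None:
    """Return the matching level that fired, or None if nothing matched."""
    raw_candidates = (expected, *accepted)
    norm_answer = normalise(answer)
    candidates = [normalise(c) for c in raw_candidates]
    # Layer 1: exact match after normalisation.
    if norm_answer in candidates:
        return "exact"
    # Layer 2: a candidate appears as a whole-word run inside the answer.
    # (Assumption: word boundaries, so "7" does not match inside "17".)
    padded = f" {norm_answer} "
    if any(c and f" {c} " in padded for c in candidates):
        return "substring"
    # Layer 3: numeric rescue, only for candidates that are purely digits.
    expected_numbers = {
        c.replace(",", "") for c in raw_candidates if c.replace(",", "").isdigit()
    }
    if expected_numbers & extract_numbers(answer):
        return "digit"
    return None
```

For example, score("New-Delhi,", "new delhi") matches at the exact level, while score("About 299,792,458 m/s", "299792458") only matches through the numeric rescue, because normalisation splits the number at its commas.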
What this harness deliberately doesn't do
- Doesn't call a real LLM. The mock is deterministic; the harness is the subject under test. Plug in a real agent function via evaluate(real_agent, GOLD_SET).
- Doesn't measure latency or cost. Those are separate concerns; they could be added as per-case metrics.
- Doesn't report statistical confidence. With 15 cases, the pass rate is noisy: a single case moves it by about 6.7 percentage points. For real evaluations, 100+ cases per category give more stable estimates.
- Doesn't run regression checks. A real harness compares this run to a baseline and flags regressions per category.
- Doesn't sample failures for human review. Production eval harnesses often surface 5-10 random failures per category for engineer eyes.
Each is roughly a 30-line addition; the structure to support it is already there.
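To make the noise in a 15-case pass rate concrete, one option is a Wilson score interval around the observed rate. The sketch below is an illustration only: the function name wilson_interval is hypothetical, and the choice of the Wilson interval with z = 1.96 (about 95% confidence) is an assumption, not something the harness does.

```python
from __future__ import annotations

import math

def wilson_interval(passed: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a pass rate of passed/total."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = passed / total
    denom = 1 + z * z / total
    centre = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return centre - half, centre + half

# The run above: 14 of 15 cases passed.
low, high = wilson_interval(14, 15)
print(f"{low:.1%} to {high:.1%}")
```

For 14/15 the interval spans roughly 70% to 99%, which is why the headline 93.3% should not be read as a precise estimate.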
What this exercise teaches
- Eval design is mostly matching design.
- Categorising failures is more valuable than counting them.
- Per-slice metrics catch what aggregates miss.
- A simple JSON / markdown report is more usable than a pass/fail bit.
Interview probes:
- "How do you handle answers like '7' vs. 'seven'?"
- "What if the gold answer can be expressed multiple correct ways?"
- "How do you compare today's run to yesterday's?"
- "How would you extend this to LLM-as-judge scoring?"
- "What is the difference between regression catching and absolute scoring?"
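For the probes about comparing runs, a per-category regression check could be sketched like this. It assumes each run has been reduced to a mapping from category to pass rate; the function name find_regressions and the tolerance parameter are hypothetical and not part of the harness.

```python
from __future__ import annotations

def find_regressions(
    baseline: dict[str, float],
    current: dict[str, float],
    tolerance: float = 0.0,
) -> dict[str, float]:
    """Return categories whose pass rate dropped by more than `tolerance`."""
    regressions = {}
    for category, old_rate in baseline.items():
        new_rate = current.get(category)
        if new_rate is None:
            continue  # category missing from this run; nothing to compare
        drop = old_rate - new_rate
        if drop > tolerance:
            regressions[category] = drop
    return regressions

yesterday = {"math": 1.0, "literature": 0.5}
today = {"math": 0.75, "literature": 0.5}
print(find_regressions(yesterday, today))
```

Here literature is weak in absolute terms but has not regressed, while math is still the stronger category but is flagged because it dropped. That is the distinction between absolute scoring and regression catching.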