Eval Harness: Analysis

What this harness does

15-case gold set across 6 categories (geo, math, science, common-sense, literature, history).
Per-case scoring with three matching levels: exact, substring (both case- and punctuation-insensitive), and digit-only for numeric expectations.
Failure categorisation into pass, refusal_or_unknown, wrong_answer, wrong_number.
Aggregation by category and by failure type.
Reports in markdown (human-readable) or JSON (machine-readable).
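
To make the moving parts concrete, here is a minimal sketch of how a gold-set entry and the evaluation loop might look. Only `evaluate` and `GOLD_SET` are names that appear in this write-up; the `GoldCase` fields, the two sample cases, `mock_agent`, and the placeholder matcher are assumptions made for illustration, not the harness's actual code.

```python
import json
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class GoldCase:
    """One gold-set entry. Field names are assumed, not the real schema."""
    question: str
    expected: str
    category: str
    accepted: List[str] = field(default_factory=list)


# Two illustrative cases; the real gold set has 15.
GOLD_SET = [
    GoldCase("What is the capital of India?", "New Delhi", "geo", ["Delhi"]),
    GoldCase("How many days are in a week?", "7", "math", ["seven"]),
]


def evaluate(agent: Callable[[str], str], gold_set: List[GoldCase]) -> List[dict]:
    """Run the agent on every case and record one result row per case."""
    rows = []
    for case in gold_set:
        answer = agent(case.question)
        # Placeholder matcher: case-insensitive substring check only.
        candidates = [case.expected, *case.accepted]
        passed = any(c.lower() in answer.lower() for c in candidates)
        rows.append({
            "question": case.question,
            "category": case.category,
            "answer": answer,
            "passed": passed,
        })
    return rows


def mock_agent(question: str) -> str:
    """Deterministic stand-in for a model (hypothetical behaviour)."""
    return "New Delhi" if "India" in question else "I'm not sure"


results = evaluate(mock_agent, GOLD_SET)
print(json.dumps(results, indent=2))  # machine-readable report
```

Keeping the agent as a plain callable is what lets a deterministic mock and a real model share the same loop.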

Run output

Pass rate: 14/15 = 93.3%

Per-category pass rate:
| Category   | Pass rate |
| literature | 50.0% (the one weak category) |
| (others)   | 100% |

Failures by type:
| Failure type       | Count |
| refusal_or_unknown | 1 |

The 50% pass rate in literature comes from a deliberately seeded failure: the mock agent answers "I'm not sure" to the Plato question. The harness catches the refusal and categorises it as refusal_or_unknown rather than as a wrong answer.
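
One way the four failure types could be assigned is sketched below. The function name `categorise`, the marker list, and the rule "a failed case with digits in the expectation is a wrong_number" are assumptions for illustration; the real harness may decide differently.

```python
import re

# Hypothetical refusal markers; a real list would be tuned on observed outputs.
REFUSAL_MARKERS = (
    "i'm not sure",
    "i am not sure",
    "i don't know",
    "i do not know",
    "cannot answer",
)


def categorise(answer: str, expected: str, passed: bool) -> str:
    """Map one scored case to pass / refusal_or_unknown / wrong_number / wrong_answer."""
    if passed:
        return "pass"
    # Normalise curly apostrophes so "I’m" and "I'm" are treated alike.
    lowered = answer.lower().replace("\u2019", "'")
    if not lowered.strip() or any(marker in lowered for marker in REFUSAL_MARKERS):
        return "refusal_or_unknown"
    # Assumed rule: a failed case whose expectation contains digits is a numeric miss.
    if re.search(r"\d", expected):
        return "wrong_number"
    return "wrong_answer"


print(categorise("I'm not sure", "Plato", passed=False))  # refusal_or_unknown
print(categorise("Aristotle", "Plato", passed=False))     # wrong_answer
print(categorise("It is 8", "7", passed=False))           # wrong_number
```

Checking for a refusal before checking for a numeric miss means "I don't know" on a maths question counts as a refusal, not as a wrong number.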

Why categorise failures, not just pass/fail

A 70% pass rate by itself is uninformative. The same number could mean:

30% of answers are subtly wrong (the model is hallucinating).
30% of answers are refusals (the model is being too cautious).
30% of answers are off by formatting (the answer is correct but the matcher is brittle).

Each requires a different fix:

Hallucination → improve grounding, RAG, fact-checking.
Over-refusal → tune the system prompt; loosen safety filters.
Format mismatches → improve the matcher (which is what score() does via normalisation and digit extraction).

The per-category breakdown shows which kinds of question the system handles badly. If math is at 50% and geo is at 100%, the gap most likely lies in the model's arithmetic rather than in its factual knowledge.
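
The two aggregations can be sketched in a few lines. The row shape (dicts with `category` and `failure_type` keys) and the function name `aggregate` are assumptions, not the harness's actual interface.

```python
from collections import Counter, defaultdict


def aggregate(rows):
    """Compute overall pass rate, per-category pass rates and failure-type counts."""
    per_category = defaultdict(lambda: [0, 0])  # category -> [passed, total]
    failure_types = Counter()
    for row in rows:
        tally = per_category[row["category"]]
        tally[1] += 1
        if row["failure_type"] == "pass":
            tally[0] += 1
        else:
            failure_types[row["failure_type"]] += 1
    total = sum(t for _, t in per_category.values())
    passed = sum(p for p, _ in per_category.values())
    return {
        "pass_rate": passed / total if total else 0.0,
        "by_category": {c: p / t for c, (p, t) in per_category.items()},
        "by_failure_type": dict(failure_types),
    }


rows = [
    {"category": "geo", "failure_type": "pass"},
    {"category": "literature", "failure_type": "pass"},
    {"category": "literature", "failure_type": "refusal_or_unknown"},
]
summary = aggregate(rows)
print(summary["by_category"])      # {'geo': 1.0, 'literature': 0.5}
print(summary["by_failure_type"])  # {'refusal_or_unknown': 1}
```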

The scoring layers

score() has three matching steps:

Normalise both strings. Lowercase, replace punctuation with spaces, collapse whitespace. So "New-Delhi," matches "new delhi". (Deleting the hyphen outright would give "newdelhi", which would not match.)
Check expected and all accepted alternatives. Some answers have multiple acceptable forms (Delhi vs. New Delhi; Shakespeare vs. William Shakespeare; 7 vs. seven).
Numeric rescue. For digit-bearing expectations, extract the numbers from the answer and check whether any of them matches. "3e8" matches "299792458" only because it is listed as an accepted alternative; the rescue path is what handles looser phrasings such as "approximately 3 × 10^8".
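
The three steps above can be sketched as follows. This is an illustrative reconstruction, not the harness's actual score(): the helper names `normalise` and `extract_numbers`, the returned level strings, and the whole-word substring rule are assumptions. The numeric rescue here compares digit runs literally and does not interpret scientific notation.

```python
import re
import string
from typing import List, Optional, Sequence

# Map every ASCII punctuation character to a space, so "New-Delhi," -> "new delhi".
_PUNCT_TO_SPACE = str.maketrans({ch: " " for ch in string.punctuation})


def normalise(text: str) -> str:
    """Lowercase, replace punctuation with spaces, collapse whitespace."""
    return " ".join(text.lower().translate(_PUNCT_TO_SPACE).split())


def extract_numbers(text: str) -> List[str]:
    """Pull out digit runs, naively dropping thousands separators (299,792,458)."""
    return re.findall(r"\d+(?:\.\d+)?", text.replace(",", ""))


def score(answer: str, expected: str, accepted: Sequence[str] = ()) -> Optional[str]:
    """Return the level that matched ("exact", "substring", "numeric") or None."""
    norm_answer = normalise(answer)
    candidates = [normalise(c) for c in (expected, *accepted)]
    # Step 1 and 2: exact match against the expected answer or any alternative.
    if norm_answer in candidates:
        return "exact"
    # Whole-word substring match, so "7" does not match inside "17".
    padded = f" {norm_answer} "
    if any(f" {c} " in padded for c in candidates if c):
        return "substring"
    # Step 3: numeric rescue, only when the expectation itself contains digits.
    expected_numbers = set(extract_numbers(expected))
    if expected_numbers and expected_numbers & set(extract_numbers(answer)):
        return "numeric"
    return None


print(score("New-Delhi,", "New Delhi", ["Delhi"]))             # exact
print(score("The capital is Delhi.", "New Delhi", ["Delhi"]))  # substring
print(score("About 299,792,458 m/s", "299792458"))             # numeric
print(score("17 days", "7", ["seven"]))                        # None
```

Returning the matched level rather than a bare boolean keeps the report able to show how often each layer is doing the work.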

A brittle matcher (substring-only, case-sensitive) would inflate false negatives by counting correct answers as wrong because of formatting. A too-loose matcher would inflate false positives by counting wrong answers as correct because some token happened to appear. The three-layer matcher aims for the middle: tolerant of formatting, strict on content.

What this harness deliberately doesn't do

Doesn't call a real LLM. The mock is deterministic because the harness itself is what is under test. To evaluate a real model, pass its agent function instead: evaluate(real_agent, GOLD_SET).
Doesn't measure latency or cost. Those are separate concerns; could be added as per-case metrics.
Doesn't compute statistical confidence. With 15 cases the pass rate is noisy: a 95% Wilson interval around 14/15 spans roughly 70% to 99%. For real evaluations, 100+ cases per category give much more stable estimates.
Doesn't run regression checks. A real harness compares this run to a baseline and flags regressions per category.
Doesn't sample failures for human review. Production eval harnesses often surface 5-10 random failures per category for an engineer to inspect.

Each is roughly a 30-line addition; the existing structure already has a place for it.
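
As an example of how small such an addition can be, here is a sketch of a confidence interval for the pass rate. The Wilson score interval is one common choice for small samples; the function name `wilson_interval` and its use here are assumptions, not part of the harness.

```python
import math
from typing import Tuple


def wilson_interval(passed: int, total: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is about 95%)."""
    if total == 0:
        return (0.0, 1.0)  # no data: the rate could be anything
    p = passed / total
    denom = 1 + z * z / total
    centre = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return (max(0.0, centre - half), min(1.0, centre + half))


low, high = wilson_interval(14, 15)
print(f"14/15 passes: {low:.1%} to {high:.1%}")  # roughly 70% to 99%
```

An interval this wide is the quantitative version of "the pass rate is noisy": two runs that differ by one case are statistically indistinguishable at this sample size.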

What this exercise teaches

Eval design is mostly matching design.
Categorising failures is more valuable than counting them.
Per-slice metrics catch what aggregates miss.
A simple JSON / markdown report is more usable than a pass/fail bit.

Interview probes:

"How do you handle answers like '7' vs. 'seven'?"
"What if the gold answer can be expressed multiple correct ways?"
"How do you compare today's run to yesterday's?"
"How would you extend this to LLM-as-judge scoring?"
"What is the difference between regression catching and absolute scoring?"
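
For the probe about comparing today's run to yesterday's, one possible answer is a per-category diff against a stored baseline. The sketch below assumes each run is summarised as a dict of category to pass rate; `find_regressions` and the 5-point tolerance are hypothetical choices.

```python
from typing import Dict


def find_regressions(baseline: Dict[str, float], current: Dict[str, float],
                     tolerance: float = 0.05) -> Dict[str, float]:
    """Return categories whose pass rate dropped by more than `tolerance`."""
    regressions = {}
    for category, old_rate in baseline.items():
        new_rate = current.get(category)
        if new_rate is None:
            continue  # category absent from the current run; report separately
        drop = old_rate - new_rate
        if drop > tolerance:
            regressions[category] = round(drop, 4)
    return regressions


yesterday = {"geo": 1.0, "math": 1.0, "literature": 0.5}
today = {"geo": 1.0, "math": 0.75, "literature": 0.5}
print(find_regressions(yesterday, today))  # {'math': 0.25}
```

Note that literature stays at 50% in both runs and is not flagged: regression catching asks whether anything got worse, while absolute scoring asks whether the level is acceptable.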