
Eval Harness — Analysis

What this harness does

  1. 15-case gold set across 6 categories (geo, math, science, common-sense, literature, history).
  2. Per-case scoring with three matching levels: exact, substring (case- and punctuation-insensitive), and digit-only for numeric expectations.
  3. Failure categorisation into pass, refusal_or_unknown, wrong_answer, wrong_number.
  4. Aggregation by category and by failure type.
  5. Reports in markdown (human-readable) or JSON (machine-readable).

Run output

Pass rate: 14/15 = 93.3%

Per-category pass rate:
| literature | 50.0% |   ← the one weak category
| (others)   | 100% |

Failures by type:
| refusal_or_unknown | 1 |

The 50% pass rate in literature comes from a deliberately seeded failure: the mock agent says "I'm not sure" about Plato. The harness catches the refusal and categorises it correctly as refusal_or_unknown.

Why categorise failures, not just pass/fail

A 70% pass rate by itself is uninformative. The same number could mean:

  • 30% of answers are subtly wrong (the model is hallucinating).
  • 30% of answers are refusals (the model is being too cautious).
  • 30% of answers are off by formatting (the answer is correct but the matcher is brittle).

Each requires a different fix:

  • Hallucination → improve grounding, RAG, fact-checking.
  • Over-refusal → tune the system prompt; loosen safety filters.
  • Format mismatches → improve the matcher (which is what score() does via normalisation and digit extraction).

Per-category breakdown surfaces which kind of question the system is bad at. If math is 50% and geo is 100%, the gap is the model's arithmetic, not its knowledge base.
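The categorisation and aggregation described above can be sketched roughly as follows. This is a minimal illustration, not the harness's actual code: the names categorise, aggregate, and REFUSAL_CUES are hypothetical, the refusal phrases are a guess, and treating "expected answer contains a digit" as the test for wrong_number is an assumption.

```python
import re
from collections import Counter, defaultdict

# Hypothetical refusal cues; a real harness would tune this list.
REFUSAL_CUES = ("i'm not sure", "i am not sure", "i don't know", "i cannot", "unknown")


def categorise(answer: str, expected: str, passed: bool) -> str:
    """Map one scored case to pass, refusal_or_unknown, wrong_number or wrong_answer."""
    if passed:
        return "pass"
    # Normalise curly apostrophes so "I’m not sure" is caught too.
    lowered = answer.lower().replace("\u2019", "'")
    if not lowered.strip() or any(cue in lowered for cue in REFUSAL_CUES):
        return "refusal_or_unknown"
    # Assumption: a failed case with a numeric expectation counts as wrong_number.
    if re.search(r"\d", expected):
        return "wrong_number"
    return "wrong_answer"


def aggregate(results):
    """Aggregate (category, failure_type) pairs into pass rates and failure counts."""
    by_type = Counter()
    per_category = defaultdict(lambda: [0, 0])  # category -> [passed, total]
    for category, failure_type in results:
        by_type[failure_type] += 1
        per_category[category][1] += 1
        if failure_type == "pass":
            per_category[category][0] += 1
    pass_rates = {cat: p / total for cat, (p, total) in per_category.items()}
    return pass_rates, by_type
```

Checking for a refusal before checking for a wrong number means an "I'm not sure" on a math question is counted as over-caution rather than as bad arithmetic, which keeps the two fixes separate.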

The scoring layers

score() has three matching steps:

  1. Normalise both strings. Lowercase, strip punctuation, collapse whitespace. So "New-Delhi," matches "new delhi".
  2. Check expected and all accepted alternatives. Some answers have multiple acceptable forms (Delhi vs. New Delhi; Shakespeare vs. William Shakespeare; 7 vs. seven).
  3. Numeric rescue. For digit-bearing answers, extract the numbers from both sides and check whether any match. "3e8" matches "299792458" because it is listed as an accepted alternative; the rescue path additionally handles answers where the digits sit inside prose, such as "approximately 3 × 10^8".

A brittle matcher (substring-only, case-sensitive) would inflate false negatives — counting correct answers as wrong because of formatting. A too-loose matcher would inflate false positives — counting wrong answers as correct because some token happened to appear. The three-layer matcher splits the difference: tolerant of formatting, strict on content.
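A layered matcher along these lines might look like the sketch below. It is an illustration under assumptions, not the harness's real score(): the helper names normalise and extract_numbers, the return values, and the whole-word substring rule are all hypothetical choices.

```python
import re
import string
from typing import Optional, Set, Tuple

# Map every ASCII punctuation character to a space so "New-Delhi," splits cleanly.
_PUNCT_TO_SPACE = str.maketrans({ch: " " for ch in string.punctuation})


def normalise(text: str) -> str:
    """Lowercase, replace punctuation with spaces, collapse whitespace."""
    return " ".join(text.lower().translate(_PUNCT_TO_SPACE).split())


def extract_numbers(text: str) -> Set[str]:
    """Return the digit runs in text, ignoring thousands separators."""
    return set(re.findall(r"\d+(?:\.\d+)?", text.replace(",", "")))


def score(answer: str, expected: str, accepted: Tuple[str, ...] = ()) -> Optional[str]:
    """Return the matching level that fired, or None if nothing matched."""
    norm_answer = normalise(answer)
    candidates = [normalise(c) for c in (expected, *accepted)]
    if norm_answer in candidates:
        return "exact"
    # Pad with spaces so a candidate only matches on whole-word boundaries.
    if any(c and f" {c} " in f" {norm_answer} " for c in candidates):
        return "substring"
    # Numeric rescue: only applies when the expectation itself contains digits.
    expected_numbers = extract_numbers(expected)
    if expected_numbers and expected_numbers & extract_numbers(answer):
        return "numeric"
    return None
```

In this sketch score("New-Delhi,", "New Delhi") returns "exact" and score("17", "7") returns None: formatting differences are forgiven, but a different number is not rescued just because it shares a digit.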

What this harness deliberately doesn't do

  • Doesn't call a real LLM. The mock is deterministic because the harness itself is what is under test. To evaluate a real model, pass an agent function: evaluate(real_agent, GOLD_SET).
  • Doesn't measure latency or cost. Those are separate concerns; could be added as per-case metrics.
  • Doesn't compute confidence intervals. With 15 cases the pass rate is noisy: a single flipped case moves it by 6.7 percentage points. For real evaluations, 100+ cases per category give more stable estimates.
  • Doesn't run regression checks. A real harness compares this run to a baseline and flags regressions per category.
  • Doesn't sample failures for human review. Production eval harnesses often surface 5-10 random failures per category for engineer eyes.

Each is roughly a 30-line addition; the existing structure leaves room for it.
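As an illustration of how small two of these additions could be, here is a sketch of a confidence interval and a per-category regression check. Both function names are hypothetical, the Wilson score interval is one reasonable choice among several, and the dictionaries of per-category pass rates are an assumed report shape rather than the harness's actual format.

```python
import math
from typing import Dict, Tuple


def wilson_interval(passed: int, total: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a pass rate (z = 1.96 gives roughly 95%)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = passed / total
    denom = 1 + z * z / total
    centre = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)


def find_regressions(
    baseline: Dict[str, float],
    current: Dict[str, float],
    tolerance: float = 0.0,
) -> Dict[str, Tuple[float, float]]:
    """Return categories whose pass rate dropped by more than tolerance."""
    return {
        cat: (baseline[cat], rate)
        for cat, rate in current.items()
        if cat in baseline and rate < baseline[cat] - tolerance
    }
```

For the 14/15 run above, wilson_interval(14, 15) comes out at roughly 0.70 to 0.99, which makes the noise in a 15-case pass rate concrete. Categories that appear only in the current run are ignored by find_regressions, since there is no baseline to compare against.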

What this exercise teaches

  • Eval design is mostly matching design.
  • Categorising failures is more valuable than counting them.
  • Per-slice metrics catch what aggregates miss.
  • A simple JSON / markdown report is more usable than a pass/fail bit.

Interview probes:

  • "How do you handle answers like '7' vs. 'seven'?"
  • "What if the gold answer can be expressed multiple correct ways?"
  • "How do you compare today's run to yesterday's?"
  • "How would you extend this to LLM-as-judge scoring?"
  • "What is the difference between regression catching and absolute scoring?"