12. Evaluation benchmarks: measuring translators by receipts, not vibes

~14 min read. Benchmarks matter because demos can look brilliant while silent failure modes stay hidden.

Built on the ELI5 in 00-eli5.md. The only honest way to judge the translator is through the receipt. For structured data and code, execution-based receipts are far better than stylistic impressions.

Benchmarks exist to resist demo theatre

A single successful demo proves almost nothing. Maybe the schema was easy. Maybe the code task matched training data. Maybe the happy path was selected carefully. Maybe the hidden edge cases were skipped. Benchmarks force repeated, comparable evaluation. That is why they matter.

For this module, think in three buckets. SQL benchmarks. Small code benchmarks. Repository-scale software benchmarks. Each bucket tests a different layer of capability. Simple, no?

benchmarks
 ├── SQL tasks ──→ schema linking + execution accuracy
 ├── code tasks ──→ function generation + unit tests
 └── repo tasks ──→ issue fixing + multi-file edits

The metric should match the vendor. If the vendor is a database, execution accuracy matters. If the vendor is a compiler and unit tests, pass rate matters. If the vendor is a full repository task, issue resolution matters. Stylish answers alone are not enough.

SQL benchmarks and code benchmarks test different things

For text-to-SQL, benchmarks like Spider and BIRD ask whether the model can map natural language to correct SQL across varied schemas. Exact string match is one metric. Execution accuracy is often better. Two different SQL queries may produce the same correct result. So the receipt matters more than the wording.

For code, HumanEval and MBPP focus on small function-level tasks. The model gets a prompt and must generate code that passes hidden tests. This is much better than eyeballing the code. A pretty wrong function still fails. That is the whole point.

For repository tasks, SWE-bench is the best-known example. Now the model must read a real issue, inspect a real repo, patch code, and satisfy the repository's tests. This is much closer to actual engineering work. The difficulty jumps because multi-file understanding and environment setup enter the picture. Yes?

Worked numerical example: exact match versus execution

Suppose one benchmark item asks for total April revenue.

Gold SQL:
 SELECT SUM(amount) FROM orders WHERE month = '2025-04';

Model output:
 SELECT SUM(amount) AS total FROM orders WHERE month = '2025-04';

String match says these differ. Execution says they return the same number. Imagine the April rows are 120, 180, and 90. Then both queries compute 120 + 180 + 90 = 390. The receipt is identical. So execution accuracy should count this as correct. That is why exact match alone is too rigid.
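The comparison above can be sketched with Python's built-in sqlite3 module. This is a minimal illustration, not how any particular benchmark harness works: the `orders` table layout, the extra May row, and the helper names `exact_match` and `execution_match` are assumptions made for the example.

```python
import sqlite3

# Hypothetical table; the three April amounts come from the worked example.
# The May row is an added assumption so the WHERE clause has something to filter.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (amount INTEGER, month TEXT)")
conn.executemany(
    "INSERT INTO orders (amount, month) VALUES (?, ?)",
    [(120, "2025-04"), (180, "2025-04"), (90, "2025-04"), (500, "2025-05")],
)

gold_sql = "SELECT SUM(amount) FROM orders WHERE month = '2025-04';"
pred_sql = "SELECT SUM(amount) AS total FROM orders WHERE month = '2025-04';"


def exact_match(gold: str, pred: str) -> bool:
    """Literal comparison after trimming surrounding whitespace."""
    return gold.strip() == pred.strip()


def execution_match(db: sqlite3.Connection, gold: str, pred: str) -> bool:
    """Run both queries and compare result rows; column names are ignored."""
    gold_rows = sorted(db.execute(gold).fetchall())
    pred_rows = sorted(db.execute(pred).fetchall())
    return gold_rows == pred_rows


print(exact_match(gold_sql, pred_sql))            # False
print(execution_match(conn, gold_sql, pred_sql))  # True
print(conn.execute(gold_sql).fetchall())          # [(390,)]
```

Sorting the rows makes the comparison order-insensitive, which is a simplification: a question that demands a specific ordering would need a stricter check.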

Now take a code example. Suppose a model's solution passes 3 out of 5 hidden unit tests for one task. That task is still a failure under pass/fail evaluation. Partial credit may help diagnosis. But the program is not yet acceptable. For production engineering, the vendor usually wants all required checks passed. That is the difference between analysis metrics and shipping metrics.
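A small sketch of the 3-out-of-5 case, assuming a made-up task ("return the absolute value of x") and made-up hidden tests. The names `HIDDEN_TESTS`, `candidate_abs`, and `run_tests` are hypothetical and only serve the illustration.

```python
from typing import Callable, List, Tuple

# Hypothetical hidden tests as (input, expected_output) pairs.
HIDDEN_TESTS: List[Tuple[int, int]] = [(3, 3), (0, 0), (7, 7), (-2, 2), (-5, 5)]


def candidate_abs(x: int) -> int:
    """A deliberately buggy candidate: it forgets to flip negatives."""
    return x


def run_tests(fn: Callable[[int], int], tests: List[Tuple[int, int]]) -> List[bool]:
    """Run every test; an exception counts as a failed test."""
    results = []
    for arg, expected in tests:
        try:
            results.append(fn(arg) == expected)
        except Exception:
            results.append(False)
    return results


results = run_tests(candidate_abs, HIDDEN_TESTS)
fraction_passed = sum(results) / len(results)  # analysis metric: useful for diagnosis
task_solved = all(results)                     # shipping metric: all or nothing

print(f"tests passed: {sum(results)}/{len(results)}")  # tests passed: 3/5
print(f"fraction: {fraction_passed:.2f}")              # fraction: 0.60
print(f"task solved: {task_solved}")                   # task solved: False
```

The fraction tells you the candidate is close and points at negative inputs. The boolean is what a pass/fail benchmark records.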

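One more useful number is candidate sampling. If a model produces 5 candidates and 2 pass all tests, the task counts as solved under pass@5, because at least one candidate passed, while the estimated pass@1 for that task is only 2/5 = 0.4. The gap between the two captures search power. Search matters in code. Simple, no?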
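The sampling numbers can be worked out with the pass@k estimator that is commonly reported alongside HumanEval: with n samples of which c pass, pass@k = 1 - C(n-c, k) / C(n, k). The sketch below assumes that estimator; the function name `pass_at_k` is just a label for this example.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate the chance that at least one of k samples passes,
    given n generated samples of which c passed all tests."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k samples failed,
    # so the estimate becomes 1.0 in that case.
    return 1.0 - comb(n - c, k) / comb(n, k)


# The example from the text: 5 candidates, 2 of them pass all tests.
print(f"pass@1 = {pass_at_k(5, 2, 1):.2f}")  # pass@1 = 0.40
print(f"pass@3 = {pass_at_k(5, 2, 3):.2f}")  # pass@3 = 0.90
print(f"pass@5 = {pass_at_k(5, 2, 5):.2f}")  # pass@5 = 1.00
```

A benchmark score is the average of this per-task value over all tasks. In practice n is usually much larger than k so the estimate is less noisy.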

Execution-based eval beats stylistic eval for this domain

A human judge may think one SQL query looks cleaner. A human judge may think one generated function looks elegant. That is sometimes useful. It is not the core metric. For structured tasks, execution is king. Did the SQL run and return the right result? Did the code pass tests? Did the patch resolve the issue? That is the main line.

Of course, execution is not everything. A passing solution may be slow, insecure, or unmaintainable. A SQL query may be correct but overbroad. A code patch may pass visible tests and still break hidden behavior. So strong eval stacks include secondary checks too. Latency. Cost. Security scans. Human review. But execution remains the backbone.
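One way to picture "execution as backbone, secondary checks on top" is a verdict function where execution decides first and the other checks can only downgrade a pass. This is a sketch under stated assumptions: the `EvalResult` fields, the `verdict` labels, and the default thresholds are all hypothetical, not taken from any real eval framework.

```python
from dataclasses import dataclass


@dataclass
class EvalResult:
    """Hypothetical per-candidate record; field names are illustrative."""
    passed_execution: bool   # right rows returned, tests green, or issue resolved
    latency_ms: float
    cost_usd: float
    security_findings: int   # count reported by whatever scanner you use


def verdict(
    r: EvalResult,
    max_latency_ms: float = 2000.0,  # assumed budget, tune for your system
    max_cost_usd: float = 0.05,      # assumed budget, tune for your system
) -> str:
    """Execution decides first; secondary checks can only downgrade a pass."""
    if not r.passed_execution:
        return "fail"
    if r.security_findings > 0:
        return "needs-review"
    if r.latency_ms > max_latency_ms or r.cost_usd > max_cost_usd:
        return "needs-review"
    return "pass"


print(verdict(EvalResult(True, 300.0, 0.01, 0)))    # pass
print(verdict(EvalResult(True, 9000.0, 0.01, 0)))   # needs-review
print(verdict(EvalResult(False, 100.0, 0.001, 0)))  # fail
```

Note the ordering: a fast, cheap, clean candidate that returns the wrong receipt still fails.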

Private evals and benchmark limits

Public benchmarks are useful because everyone can compare scores. Private evals are useful because they reflect your own business language and failures. So what to do? Harvest failed SQL questions from support logs. Harvest bad code patches from incident reviews. Turn each one into a regression item. Store the shopping list, the expected receipt, and why the failure mattered. That gives the translator a phrasebook of your real pain points.
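A regression item of the kind described here might be stored as one small record per failure. The sketch below is one possible shape, not a standard format: the `RegressionItem` class, its field names, and the sample values are hypothetical.

```python
import json
from dataclasses import dataclass, asdict
from typing import List


@dataclass
class RegressionItem:
    """One private eval item harvested from a real failure (hypothetical schema)."""
    question: str              # the shopping list: what the user asked
    expected_rows: List[list]  # the expected receipt
    why_it_mattered: str       # impact of the original failure
    source: str                # e.g. "support-log" or "incident-review"


item = RegressionItem(
    question="What was total revenue in April 2025?",
    expected_rows=[[390]],
    why_it_mattered="Hypothetical: a wrong total reached a finance report.",
    source="support-log",
)

# One JSON object per line keeps the suite easy to diff and append to.
line = json.dumps(asdict(item))
restored = RegressionItem(**json.loads(line))
print(restored == item)  # True
```

Keeping the "why it mattered" field next to the expected receipt helps later when someone asks whether an old item can be retired.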

public benchmark ──→ comparability
private benchmark ──→ realism
regression suite  ──→ do not repeat old mistakes

Look. Teams that ship reliably usually track all three. One tells you where you stand. One tells you what your customers actually need. One tells you whether you are getting better or worse.

Benchmarks are useful and still incomplete.

Benchmarks can be gamed. Models may memorize benchmark patterns. Environments may differ from real production setups. Toy tasks may not capture messy business ambiguity. So what to do? Use public benchmarks for comparability. Use private evals for realism. Track regression suites from your own failures. That is the mature path.

Look at the market analogy. A benchmark is a basket of shopping lists. The vendor receipts tell you which bargains actually worked. Without those receipts, you are ranking translators by confidence and accent. Bad idea.

Where this lives in the wild

GitHub Copilot model team — ML engineer: watches HumanEval-style execution metrics before shipping a new coding model.
Snowflake text-to-SQL team — applied scientist: cares about execution accuracy, not just literal SQL overlap, on enterprise query tasks.
Databricks Genie team — product analyst: needs eval sets that reflect business language and governance constraints from real warehouses.
SWE-bench researchers — software engineering evaluator: measure whether systems can actually resolve repository issues end to end.
Code generation startup founders — platform lead: blend public benchmarks with private regression tasks from customer bugs.

Pause and recall

Why are benchmarks necessary even when demos look strong?
In the SQL example, why should execution accuracy count the model output as correct?
What does repository-scale evaluation test that HumanEval does not?
Why are private regressions still necessary after public benchmark wins?

Interview Q&A

Q: Why is execution accuracy often better than exact string match for text-to-SQL?
A: Because different SQL forms can be semantically equivalent, and the database receipt captures that equivalence better than literal text comparison.
Common wrong answer to avoid: "Because exact match is useless for SQL."

Q: Why does passing small coding benchmarks not guarantee success on SWE-bench-style tasks?
A: Small benchmarks test local synthesis, while repository tasks additionally require environment setup, retrieval, multi-file editing, and issue understanding.
Common wrong answer to avoid: "Because SWE-bench is just a harder version of the same thing."

Q: Why are public benchmarks alone insufficient for production evaluation?
A: They offer comparability but miss your domain's schemas, APIs, failure costs, and historical bugs, so private regressions are needed for realism.
Common wrong answer to avoid: "A top public benchmark score means the system is production-ready."

Q: Why keep non-execution metrics like latency or security in the eval stack?
A: A solution can execute correctly and still be too slow, too expensive, unsafe, or impossible to maintain in production.
Common wrong answer to avoid: "If it passes tests, the rest is irrelevant."

Apply now (5 min)

Exercise. Pick one SQL task and one code task from your own world. Write what the real receipt would be for each: returned rows, tests passed, or issue resolved. Then work one example by hand, like 120 + 180 + 90 = 390, and ask whether string match really matters.

Sketch from memory. Draw the three benchmark buckets: SQL, code, and repo tasks. Under each one, write the main receipt metric. See. That diagram is your eval map.

Bridge. Benchmarks show where systems succeed, but they also reveal where the field is still brittle and unfinished. Next we end honestly. → 13-honest-admission.md