08. Cost and throughput¶

Mechanism fit makes scores meaningful. Operating the set at scale is the next discipline. Every eval run costs model calls and judge calls; multiply by set size and by CI runs per day, and the cost is real. Throughput, meaning how fast the eval completes, gates how often the team can iterate.

A platform engineer at a Bengaluru SaaS company calculates the cost of running her eval set in CI. The set has 250 cases; each case calls the system (one model call) and then calls the judge (one model call, sometimes three for variance reduction). Total per CI run: 500–1,000 model calls. At ~$0.005 per call average, that is $2.50–$5 per CI run. The team runs CI 80 times a day across the platform. That is $200–$400 per day, $6,000–$12,000 per month, just for evals. The cost is meaningful; the team begins to optimise — cache evals on unchanged prompts; sample subsets for CI and run the full set nightly; run the heavy distribution set weekly. After two weeks of optimisation, the eval cost is $1,500/month for the same coverage.
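The arithmetic in this scenario can be reproduced in a few lines. This is a minimal sketch using the figures from the paragraph above; the 30-day month and the function name `run_cost` are assumptions made for illustration.

```python
CASES = 250
COST_PER_CALL = 0.005    # USD, the rough average quoted above
CI_RUNS_PER_DAY = 80
DAYS_PER_MONTH = 30      # assumption, used only for the monthly figure


def run_cost(judge_calls_per_case: int) -> float:
    """Cost of one CI run: one system call plus N judge calls per case."""
    calls = CASES * (1 + judge_calls_per_case)
    return calls * COST_PER_CALL


for judges in (1, 3):
    per_run = run_cost(judges)
    per_day = per_run * CI_RUNS_PER_DAY
    per_month = per_day * DAYS_PER_MONTH
    print(f"{judges} judge call(s) per case: "
          f"${per_run:.2f}/run, ${per_day:.0f}/day, ${per_month:,.0f}/month")
```

Swapping in the team's own per-call price and run frequency gives a first estimate before any dashboard exists.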

This chapter covers the cost-and-throughput discipline. Not every team needs the full set on every CI run; not every change needs the deepest judging. The economics are a knob the platform tunes.

Where the cost is¶

Per eval run, costs come from:

Component	Per-call cost	Per run cost (250 cases)
System inference (the call being evaluated)	$0.001–$0.05 depending on model	$0.25–$12.50
Judge inference (LLM-as-judge per case)	$0.001–$0.05	$0.25–$12.50
Multi-judge variance reduction (3× judges)	3× above	$0.75–$37.50
Storage and infrastructure	Small per run	Negligible
Engineering attention (labelling, review)	Significant	Bounded by team time

The dominant costs are inference; storage and infra are usually small.

A regression set with rubric judging on every change can cost $5–$50 per run; running it 80 times a day produces a meaningful monthly bill.

Throughput¶

Throughput is how long the eval takes to complete. Three constraints:

System inference latency. Each system call takes 0.5–10 seconds.
Judge latency. Each judge call takes 0.5–5 seconds.
Parallelism. Cases can run in parallel; the ceiling is set by the underlying providers' rate limits (the gateway's quota plane from 02_ai_infrastructure/01 chapter 05).

For a 250-case set with full judging, a sequential run takes 5–60 minutes; with 20× parallelism, roughly 15 seconds–3 minutes. CI typically tolerates seconds-to-low-minutes; longer runs do not fit pre-merge gates.
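A back-of-the-envelope estimate of these durations, as a sketch. It assumes cases are processed in equal waves of `parallelism` workers and ignores rate-limit stalls, retries, and scheduling overhead, so real runs will be slower than the idealised figures it prints. The function name is hypothetical.

```python
import math


def wall_clock_seconds(cases: int, per_case_seconds: float, parallelism: int) -> float:
    """Idealised wall-clock time: cases run in waves of `parallelism`."""
    waves = math.ceil(cases / parallelism)
    return waves * per_case_seconds


CASES = 250
FAST_CASE = 0.5 + 0.5   # fastest system call + fastest judge call, seconds
SLOW_CASE = 10 + 5      # slowest system call + slowest judge call, seconds

for workers in (1, 20):
    low = wall_clock_seconds(CASES, FAST_CASE, workers)
    high = wall_clock_seconds(CASES, SLOW_CASE, workers)
    print(f"{workers:>2} worker(s): {low:.0f} s to {high / 60:.1f} min")
```

The useful output is the ratio, not the absolute number: if the idealised estimate already exceeds what a pre-merge gate tolerates, no amount of tuning will make the full set fit the inner loop.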

Optimisation patterns¶

Six patterns that reduce cost or improve throughput.

1. Tiered set runs¶

Run a small fast subset on every change; run the full set nightly; run the distribution set weekly.

ci:
  on_every_pr:
    set: smoke           # 30 cases, ~30s, blocking
  nightly:
    set: regression      # 250 cases, ~10 min, alerting on regression
  weekly:
    set: distribution    # 2,500 cases, ~1h, reports trends

Most PR-level changes need fast feedback; most regressions on the full set are caught by the nightly run. The smoke set focuses on the cases most likely to be affected by code changes (chapter 04's stratification helps pick).

2. Change-aware sampling¶

If a PR touches one feature, run only the cases for that feature pre-merge. The full regression suite runs on the main branch nightly. Saves cost on most PRs.
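One way to sketch the selection step, assuming each case carries a `feature` tag and the repository layout maps cleanly onto features. The `FEATURE_PATHS` mapping and the path globs are hypothetical. The sketch falls back to the whole set when a changed file maps to no feature, a conservative choice that is an assumption here rather than something the chapter prescribes.

```python
from fnmatch import fnmatch

# Hypothetical mapping from path globs to the feature that owns them.
FEATURE_PATHS = {
    "search": ["src/search/*", "prompts/search/*"],
    "billing": ["src/billing/*", "prompts/billing/*"],
}


def touched_features(changed_files):
    """Return (features touched, files that map to no feature)."""
    features, unmapped = set(), []
    for path in changed_files:
        owners = {
            name
            for name, globs in FEATURE_PATHS.items()
            if any(fnmatch(path, pattern) for pattern in globs)
        }
        if owners:
            features |= owners
        else:
            unmapped.append(path)
    return features, unmapped


def select_cases(cases, changed_files):
    """Pick the cases to run pre-merge for a given set of changed files."""
    features, unmapped = touched_features(changed_files)
    if unmapped:
        # Shared or unknown code changed: run everything to be safe.
        return list(cases)
    return [case for case in cases if case["feature"] in features]


cases = [
    {"id": "s1", "feature": "search"},
    {"id": "s2", "feature": "search"},
    {"id": "b1", "feature": "billing"},
]
print([c["id"] for c in select_cases(cases, ["src/search/rank.py"])])
```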

3. Cache eval results¶

If the case, the model, and the prompts have not changed, the earlier eval result can be reused. Cache on (set_version, model_version, prompt_version, case_id). A change to one tool's prompt does not re-run cases that depend on other tools.

For most platforms, cache hit rate on incremental changes is 60–90%. Real cost reduction.
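A minimal sketch of such a cache, using the four-part key from the text. The in-memory dict stands in for whatever store the platform uses, and the `EvalCache` class and its method names are hypothetical. It also assumes `prompt_version` is the version of the prompts a given case actually depends on, which is what lets unrelated prompt changes leave the entry valid.

```python
import hashlib
import json


def cache_key(set_version, model_version, prompt_version, case_id):
    """Stable digest of the components that determine an eval result."""
    payload = json.dumps([set_version, model_version, prompt_version, case_id])
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class EvalCache:
    """In-memory stand-in for the real store; tracks hit rate as it goes."""

    def __init__(self):
        self._store = {}
        self.hits = 0
        self.misses = 0

    def get_or_compute(self, key, compute):
        """Return the cached score, or run `compute()` once and remember it."""
        if key in self._store:
            self.hits += 1
            return self._store[key]
        self.misses += 1
        result = compute()
        self._store[key] = result
        return result


cache = EvalCache()
key = cache_key("set-v3", "model-2024-06", "prompt-v7", "case-001")
cache.get_or_compute(key, lambda: 0.92)   # miss: system and judge would run
cache.get_or_compute(key, lambda: 0.92)   # hit: nothing is re-run
print(f"hits={cache.hits} misses={cache.misses}")
```

Tracking hits and misses alongside the store makes the hit rate a number the team can watch rather than a guess.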

4. Cheaper judge for cheaper cases¶

A high-volume distribution set may use a cheaper judge (a smaller model or rules-based comparison) than the regression set. The cheaper judge has higher noise; aggregate signal is still useful for trends.

5. Single-judge by default, multi-judge for borderline¶

A case that scores 0.95 from one judge is unlikely to flip; do not bother running multi-judge. Cases scoring 0.5–0.7 from one judge are borderline; the second and third judge runs are valuable there. Targeted multi-judge.
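The escalation rule can be sketched as follows. The 0.5–0.7 band comes from the text; averaging the three scores is one possible way to combine them and is an assumption, as are the function name and the `judge` callable signature.

```python
from statistics import mean

BORDERLINE = (0.5, 0.7)   # band the text treats as borderline
EXTRA_JUDGES = 2          # second and third judge runs


def score_case(case, judge):
    """Run one judge; escalate to extra runs only for borderline scores.

    Returns (final_score, number_of_judge_calls).
    """
    first = judge(case)
    low, high = BORDERLINE
    if not (low <= first <= high):
        return first, 1
    scores = [first] + [judge(case) for _ in range(EXTRA_JUDGES)]
    return mean(scores), len(scores)


# Stand-in judge that replays fixed scores, for illustration only.
replay = iter([0.95, 0.6, 0.5, 0.7])
fake_judge = lambda case: next(replay)

print(score_case("clear-pass", fake_judge))   # one judge call
print(score_case("borderline", fake_judge))   # three judge calls
```

The second return value is worth logging: the fraction of cases that escalate is what determines how much of the 3× multi-judge cost is actually saved.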

6. Provider-side cache¶

The model gateway's caching (02_ai_infrastructure/01 chapter 08) can cache eval calls. Many eval cases are stable; cached responses cost a fraction of fresh calls. Wire the gateway's exact-match cache for evals.

Cost budget¶

Set an explicit eval budget per month. Track against it. Common allocations:

Pre-merge eval cost: $X per PR; budget per team
Nightly eval cost: $Y per day; budget per platform
Production-traffic eval cost: $Z per day; budget per feature

If the budget is exceeded, apply the cost optimisation patterns above. If they cannot bring spend within budget, reduce the eval coverage explicitly, with the trade-off discussed with stakeholders.
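Tracking against a budget can start very simply. The sketch below compares recorded spend with a straight-line pace through the month; the `Budget` class, the status labels, and the 30-day default are all hypothetical choices, and the $1,500 limit is borrowed from the opening scenario purely as an example figure.

```python
from dataclasses import dataclass


@dataclass
class Budget:
    name: str
    monthly_limit_usd: float
    spent_usd: float = 0.0

    def record(self, run_cost_usd: float) -> None:
        """Add the cost of one eval run to the month's spend."""
        self.spent_usd += run_cost_usd

    def status(self, day_of_month: int, days_in_month: int = 30) -> str:
        """Compare spend so far with a straight-line pace for the month."""
        expected = self.monthly_limit_usd * day_of_month / days_in_month
        if self.spent_usd > self.monthly_limit_usd:
            return "over_budget"
        if self.spent_usd > expected:
            return "ahead_of_pace"
        return "on_track"


nightly = Budget(name="nightly-regression", monthly_limit_usd=1500.0)
nightly.record(620.0)
print(nightly.name, nightly.status(day_of_month=10))
```

An "ahead_of_pace" status is the useful early signal: it arrives while there is still time to apply the optimisation patterns rather than cut coverage.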

Throughput as a developer-experience problem¶

A CI gate that takes 30 minutes is, in practice, ignored. Engineers wait, switch context, lose the connection between the change and the result. A CI gate that takes 30 seconds is consulted.

The platform that wants engineers to use the eval gates designs for fast feedback. The fast smoke set is for the inner loop; the deeper checks are nightly. This is the same discipline as software CI: fast tests in the inner loop, slow tests in the outer.

When to invest more in evals¶

The cost is real; the trade-off with quality is also real. Spend more on evals when:

The system is in active development; iteration speed depends on fast, reliable signal
A new high-stakes feature is launching; the cost of a regression is high
A model migration is in flight; per-canary-step evaluation is needed
A regulatory or compliance gate requires evidence

Spend less on evals when:

The system is stable; cost reduction is rational
A specific eval set is no longer producing useful signal (cases all pass; refresh it)
The team is in a budget squeeze; explicit reductions with stakeholder buy-in

The cost is a budget the team manages, not a fixed overhead.

Common mistakes¶

No cost visibility. The bill arrives; the cost is a surprise. Wire cost dashboards per eval run.

Full set on every CI run. Slow feedback; high cost. Tier the runs.

No caching. Every CI run recomputes everything. Cache aggressively.

Multi-judge for everything. Cost multiplied; the precision was only needed for borderline cases.

Cheaper judge without calibrating against the expensive judge. The cheap judge produces a different score distribution; comparisons across tiers are not directly meaningful.

Interview Q&A¶

Q1. The eval cost is $400/day in CI. The team wants to reduce it. What do you do? Inventory: per-run cost, runs per day, hot paths. Optimisation patterns: tier the runs (smoke for PRs, regression for nightly); cache results on unchanged prompts; cheaper judge for the distribution set; targeted multi-judge only for borderline cases; wire the model gateway's cache for eval calls. Most platforms get a 50–80% reduction with these. The reduction does not lose coverage; it changes when the coverage runs. Wrong-answer notes: "cut the set" loses coverage; the right move is to tier and cache.

Q2. The CI eval gate takes 25 minutes. Engineers are bypassing it. What is the problem? The gate is too slow for the inner loop. Engineers will not wait 25 minutes for feedback on every commit. The fix is a tiered run: a fast smoke set (~30 seconds) gates pre-merge; the full set runs on the main branch nightly; regressions caught nightly trigger investigation. The gate the engineer feels is fast and informative; the deeper check still happens, just not in the inner loop. Wrong-answer notes: "engineers should be patient" misses that bypass is the predictable outcome of slow gates.

Q3. Walk through how you would cache eval results. Cache key: (set_version, case_id, system_inputs_hash, model_version, prompt_version, judging_mechanism, rubric_version). If all are unchanged, the cached result is reusable. A cache hit returns the score without re-running the system or the judge. Invalidate when any key component changes. Hit rate on incremental PRs is typically 60–90%; a full recompute happens on model bumps, prompt bumps, or set version bumps. The cache itself lives in a small store (Redis or similar). Wrong-answer notes: "we cache the system call" is partial; the judge call is also cacheable.

Q4. The cheap judge produces scores 0.05 lower on average than the expensive judge. How do you handle this in dashboards? Calibrate. Run the cheap and expensive judges on the same subset; quantify the systematic bias. Either correct the cheap judge's scores by the measured offset when reporting, or report cheap-judge scores separately with a noted scale. The two scores are not directly comparable; the dashboard makes the difference visible. The dashboard never shows the cheap and expensive scores as the same line; they are different metrics. Wrong-answer notes: "treat them as the same" silently produces misleading trend lines.
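The calibration step in Q4 can be sketched as below. It assumes scores lie in [0, 1] and that the bias is a constant additive offset, which is the simplest model and may not hold if the cheap judge is skewed differently at the extremes. The function names and the sample scores are hypothetical.

```python
from statistics import mean


def estimate_offset(cheap_scores, expensive_scores):
    """Mean (expensive - cheap) over the same calibration cases."""
    if not cheap_scores or len(cheap_scores) != len(expensive_scores):
        raise ValueError("need paired, non-empty score lists")
    return mean(e - c for c, e in zip(cheap_scores, expensive_scores))


def corrected(cheap_score, offset):
    """Shift a cheap-judge score by the offset, clamped to [0, 1]."""
    return min(1.0, max(0.0, cheap_score + offset))


# Illustrative calibration subset: both judges scored the same three cases.
cheap = [0.80, 0.70, 0.90]
expensive = [0.85, 0.75, 0.95]

offset = estimate_offset(cheap, expensive)
print(f"offset={offset:.3f}, corrected 0.60 -> {corrected(0.60, offset):.3f}")
```

Whether the corrected score or the raw cheap-judge score is reported, the series should carry a label naming the judge so the two are never plotted as one line.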

What to do differently after reading this¶

Track eval cost per run, per day, per team. Make it visible.
Tier the eval runs: smoke for PRs, regression for nightly, distribution for weekly.
Cache aggressively. The model gateway's cache plus an eval-specific cache layer.
Targeted multi-judge: borderline cases only.
Calibrate cheap judges against expensive judges before using cheap judges for cost reduction.

Bridge. Cost is one operational concern. Privacy is another — the set itself carries personal data, has retention obligations, and is subject to right-to-be-forgotten. The next chapter is the privacy discipline applied to the set. → 09-privacy-in-the-golden-set.md
