14. Latency and cost regressions — when the answer is right but the bill is wrong¶

~12 min read. Same correct output. Triple the cost. This is the silent bug.

Built on the ELI5 in 00-eli5.md. The crime statistics — rolled-up numbers across many requests — now hide a different kind of crime. The output looks right. But the bill, or the clock, doubled. No single case file screams. The crime lives only in the aggregate.

The picture before the details¶

Most bugs are "wrong output." This chapter is about "right output, wrong cost."

No user files a complaint slip here. Finance opens the monthly bill. Or on-call sees p95 latency double on the case board. Output is correct. The path to it became expensive.

See the cost composition first. Every agent turn has four parts:

                cost of one agent turn
                ┌──────────────────────────────────┐
                │                                  │
   ┌────────────┴───────┐         ┌────────────────┴──────┐
   │   prompt tokens     │         │   completion tokens   │
   │  (system + few-shot │         │  (model's reply +     │
   │   + tools + history)│         │   tool-call args)     │
   └────────────┬───────┘         └────────────────┬──────┘
                │                                  │
                ├──────────── + ───────────────────┤
                │                                  │
   ┌────────────┴───────┐         ┌────────────────┴──────┐
   │  tool-call overhead │         │   retry / fallback    │
   │  (extra round trips │         │   cost (failed tool   │
   │   per multi-step)   │         │    → second LLM call) │
   └────────────────────┘         └───────────────────────┘

When the bill rises, one of these four boxes grew. Your job is to find which. Cost is not a fog. It is a budget you can decompose.
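One way to make the decomposition concrete is to store the four boxes per turn and compare two snapshots. The sketch below is illustrative only: the `TurnCost` class, its field names, and the sample dollar figures are assumptions, not part of any tracing library.

```python
from dataclasses import dataclass, asdict


@dataclass
class TurnCost:
    """Dollar cost of one agent turn, split into the four boxes (hypothetical names)."""
    prompt: float
    completion: float
    tool_overhead: float
    retries: float

    def total(self) -> float:
        return sum(asdict(self).values())

    def shares(self) -> dict:
        """Fraction of the turn's cost that sits in each box."""
        total = self.total()
        return {box: (usd / total if total else 0.0) for box, usd in asdict(self).items()}


def growing_box(before: TurnCost, after: TurnCost) -> str:
    """Name the box with the largest absolute cost increase."""
    after_usd = asdict(after)
    deltas = {box: after_usd[box] - usd for box, usd in asdict(before).items()}
    return max(deltas, key=deltas.get)


# Made-up sample figures: only the prompt box grew between the two snapshots.
before = TurnCost(prompt=0.010, completion=0.006, tool_overhead=0.001, retries=0.001)
after = TurnCost(prompt=0.018, completion=0.006, tool_overhead=0.001, retries=0.001)
print(growing_box(before, after))  # prompt
```

Comparing absolute deltas rather than shares is a deliberate choice here: a box can lose share while still growing in dollars.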

The KPI that catches this — cost-per-conversation¶

Track cost-per-conversation as a first-class metric. Not cost-per-token. Not cost-per-request. Cost per full user task.

case board panel:
  cost_per_conversation_p50:  $0.08  ← alarm if > $0.12
  cost_per_conversation_p95:  $0.31  ← alarm if > $0.45
  tool_calls_per_conv_p95:        8  ← inflation alarm
  retry_rate:                  2.1%  ← retry storm alarm

The alarm bell rings on p50 or p95. p50 catches steady inflation. p95 catches tail-cost monsters — one user whose conversation cost $4.20 because the agent retried 14 times. Both page on-call.
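The panel's two cost alarms can be sketched in a few lines. This is a minimal sketch, assuming you already have one dollar figure per conversation; the thresholds come from the panel above, while the function names and the nearest-rank percentile method are choices made for illustration.

```python
import math

# Alarm thresholds copied from the panel above (USD per conversation).
THRESHOLDS = {"p50": 0.12, "p95": 0.45}


def percentile(values, pct):
    """Nearest-rank percentile; pct is an integer from 1 to 100."""
    if not values:
        raise ValueError("no values")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def cost_alarms(conversation_costs):
    """Return the names of the percentiles that crossed their threshold."""
    observed = {
        "p50": percentile(conversation_costs, 50),
        "p95": percentile(conversation_costs, 95),
    }
    return [name for name, limit in THRESHOLDS.items() if observed[name] > limit]


# 10% of conversations hit a retry loop: the median is fine, the tail is not.
costs = [0.08] * 18 + [4.20] * 2
print(cost_alarms(costs))  # ['p95']
```

Note that a single $4.20 conversation out of twenty would not move p95 under this method; the tail alarm needs the expensive slice to exceed 5% of traffic, which is why a max-cost or p99 panel is a reasonable companion.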

Now we walk the seven patterns. Each is a real regression mode.

1. Token-bloat regression — the prompt grew¶

Three weeks pass. Someone added a few-shot. Someone added "always be concise." Someone appended a tool description. Prompt grew 1200 → 1900 tokens. Prompt-token cost rose 58%. Nobody noticed.

Signal: tokens_in_p50 climbs week over week.
Elimination test: Diff system prompt against 30 days ago. The evidence tag for prompt-version lets you bisect.
Fix: Trim. Set a token budget. CI-fail any PR that exceeds it.
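The budget check in the fix could look roughly like this. It is a sketch under assumptions: `approx_token_count` is a whitespace stand-in, not a real tokenizer, and you would swap in whatever tokenizer matches your model before trusting the numbers.

```python
def approx_token_count(text: str) -> int:
    """Stand-in tokenizer: whitespace split. Replace with your model's tokenizer."""
    return len(text.split())


def check_prompt_budget(prompt: str, budget: int, count_tokens=approx_token_count):
    """Return (within_budget, tokens_used) for one prompt."""
    used = count_tokens(prompt)
    return used <= budget, used


# Hypothetical prompt text and budget, for illustration only.
ok, used = check_prompt_budget("You are a refund agent. Always be concise.", budget=6)
print(ok, used)  # False 8
```

In CI, a wrapper script would read the prompt file, call `check_prompt_budget`, and exit non-zero when the first element is `False`.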

2. Tool-call inflation — the over-cautious agent¶

Someone added "verify before acting" to the prompt. The model now calls search twice, read_record once, and confirm once, where it used to call search once and act. Output identical. Cost 3x.

Signal: tool_calls_per_conv_p50 rose. Completion tokens rose faster than prompt tokens.
Elimination test: Pull two case files — before and after. Same user task. Compare tool sequences.

   before                       after
   ──────                       ─────
   search("invoice 42")         search("invoice 42")
   refund(id=42)                search("invoice details")  ← new
                                read_record(id=42)         ← new
                                refund(id=42)
                                confirm(action="refund")   ← new

Fix: Roll back the verification phrase. Or gate it — only verify when amount > $500. The suspects lineup points at prompt layer.
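The before/after comparison can be automated by counting calls per tool. A minimal sketch, assuming each case file can be reduced to an ordered list of tool names; the function name is made up for illustration.

```python
from collections import Counter


def added_tool_calls(before, after):
    """Count the tool calls that appear in `after` but not in `before`."""
    extra = Counter(after) - Counter(before)  # Counter subtraction drops non-positive counts
    return dict(extra)


# Tool sequences from the two case files above.
before = ["search", "refund"]
after = ["search", "search", "read_record", "refund", "confirm"]
print(added_tool_calls(before, after))
# {'search': 1, 'read_record': 1, 'confirm': 1}
```

Three extra calls on an unchanged task point at the prompt layer rather than at the tools themselves.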

3. Retry storms — one flaky tool, p95 explodes¶

Downstream API hit 4% error rate, up from 0.5%. Your agent retries up to 3 times. Each retry costs another LLM round trip to re-plan. p50 fine. p95 tripled.

Signal: retry_rate ticks up. p95 latency and cost diverge from p50.
Elimination test: Filter traces where retry_count > 0. Group by tool. One tool dominates? That is the source.
Fix: Jittered backoff. Cap retries at 2. Surface errors gracefully. Ticket the tool owners.
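A capped retry with jittered backoff might be sketched as follows. The delays, the "full jitter" formula, and the function name are assumptions for illustration; real code would also catch only the transient error types of its own client library instead of a bare `Exception`.

```python
import random
import time


def call_with_retries(fn, max_retries=2, base_delay=0.5, max_delay=8.0,
                      sleep=time.sleep, rng=random.random):
    """Call fn(); on failure retry up to max_retries times with jittered backoff."""
    attempt = 0
    while True:
        try:
            return fn()
        except Exception:  # assumption: narrow this to transient errors in real code
            if attempt >= max_retries:
                raise  # surface the error instead of looping
            # Full jitter: a random delay between 0 and the capped exponential step.
            delay = min(max_delay, base_delay * 2 ** attempt) * rng()
            sleep(delay)
            attempt += 1
```

With `max_retries=2` a call is attempted at most three times, so one flaky tool can no longer multiply the LLM round trips without bound.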

4. Tail latency — p95 doubled, p50 untouched¶

No retries. But one tool's response slipped at p99 — maybe a larger table scan. Agents that hit it wait 8s instead of 2s. Most users do not hit it.

Signal: p50 latency flat. p95 doubled.
Elimination test: Plot per-tool latency percentiles weekly. Shifted tool stands out.
Fix: Tool owner adds index/cache. On agent side — per-tool timeout and fallback path.
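The weekly per-tool comparison can be reduced to a small function. This is a sketch, assuming latency samples are available as `(tool_name, seconds)` pairs; the 1.5x ratio and all names are illustrative choices.

```python
import math
from collections import defaultdict


def p95(values):
    """Nearest-rank 95th percentile."""
    ordered = sorted(values)
    return ordered[max(1, math.ceil(95 * len(ordered) / 100)) - 1]


def shifted_tools(last_week, this_week, ratio=1.5):
    """Return {tool: (old_p95, new_p95)} for tools whose p95 grew by `ratio` or more."""
    def by_tool(samples):
        grouped = defaultdict(list)
        for tool, seconds in samples:
            grouped[tool].append(seconds)
        return {tool: p95(vals) for tool, vals in grouped.items()}

    old, new = by_tool(last_week), by_tool(this_week)
    return {tool: (old[tool], new[tool])
            for tool in new
            if tool in old and new[tool] >= ratio * old[tool]}


# Made-up samples: only the "db" tool slipped, and only in its tail.
last_week = [("search", 1.0)] * 20 + [("db", 2.0)] * 20
this_week = [("search", 1.0)] * 20 + [("db", 2.0)] * 18 + [("db", 8.0)] * 2
print(shifted_tools(last_week, this_week))  # {'db': (2.0, 8.0)}
```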

5. Model upgrade tax — smarter, slower, pricier¶

You moved gpt-4o-mini → gpt-4o. Quality up 5%. Cost up more than 10x. For most queries, mini was fine. The upgrade was a tax.

Signal: Cost-per-conversation step-changed on the day of the upgrade.
Elimination test: Route 10% of traffic back to the old model. Compare quality and cost.
Fix: Tiered routing — cheap model for easy 80%, big model only for hard. The crime statistics split by tier.
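Tiered routing can be sketched with a routing rule and a blended-cost formula. Everything here is an assumption for illustration: the difficulty heuristic is a placeholder, and the cost units (cheap model = 1, big model = 12) are not real prices.

```python
CHEAP_MODEL = "gpt-4o-mini"
BIG_MODEL = "gpt-4o"


def pick_model(query: str, needs_multi_step: bool = False,
               long_query_chars: int = 2000) -> str:
    """Hypothetical heuristic: long or multi-step queries go to the big model."""
    if needs_multi_step or len(query) > long_query_chars:
        return BIG_MODEL
    return CHEAP_MODEL


def blended_cost(easy_share: float, cheap_cost: float, big_cost: float) -> float:
    """Average cost per query when `easy_share` of traffic uses the cheap model."""
    return easy_share * cheap_cost + (1 - easy_share) * big_cost


# Illustrative units: 80% of traffic is easy, big model costs 12x the cheap one.
print(round(blended_cost(0.8, 1.0, 12.0) / 12.0, 2))  # 0.27
```

Under those assumed numbers the routed system costs about 27% of sending everything to the big model, which is why the split by tier belongs on the dashboard.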

6. Cache-miss regression — prompt structure changed¶

OpenAI, Anthropic, Bedrock — all charge less for cached prompt tokens. Caching hits when the prefix of your prompt is unchanged. Move one variable to the top, cache hit rate drops 60% → 10% overnight.

   cached (good)                    not cached (bad)
   ────────────────────────         ────────────────────────
   [static system prompt]           [user_id: 12345]        ← changes each call
   [static tool definitions]        [static system prompt]  ← never cached now
   [static few-shots]               [static tool definitions]
   [user message: ...]              [user message: ...]

Signal: Provider's cache-hit metric drops. Prompt-cost-per-call rises while token count is flat.
Elimination test: Inspect the template diff. Anything dynamic above the static block?
Fix: Static first, dynamic last. Track cache-hit rate on the case board.
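The "static first, dynamic last" rule can be checked offline by measuring how much of the prompt two consecutive calls share. A rough sketch: providers match on tokens, so the character count here is only a proxy, and the helper names are made up.

```python
import os


def build_prompt(static_blocks, dynamic_blocks):
    """Assemble a prompt with static blocks first and dynamic blocks last."""
    return "\n".join([*static_blocks, *dynamic_blocks])


def shared_prefix_chars(prompt_a: str, prompt_b: str) -> int:
    """Length of the common leading substring of two prompts."""
    return len(os.path.commonprefix([prompt_a, prompt_b]))


STATIC = ["[static system prompt]", "[static tool definitions]", "[static few-shots]"]

good_a = build_prompt(STATIC, ["[user_id: 12345]", "[user message: hi]"])
good_b = build_prompt(STATIC, ["[user_id: 67890]", "[user message: hi]"])
# Anti-pattern: the per-call value sits above the static block.
bad_a = "\n".join(["[user_id: 12345]", *STATIC, "[user message: hi]"])
bad_b = "\n".join(["[user_id: 67890]", *STATIC, "[user message: hi]"])

print(shared_prefix_chars(good_a, good_b) > shared_prefix_chars(bad_a, bad_b))  # True
```

A test like this on the prompt template catches the regression at review time, before the provider's cache-hit metric has a chance to drop.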

7. Streaming vs non-streaming — perceived slowness¶

You disabled streaming because "JSON parse breaks on partials." The model's first-token latency stayed the same, but the user now sees nothing until the full response arrives. Perceived latency went "instant" → "8-second wait." Bill unchanged. User files a complaint slip about feeling slow.

Signal: user-visible time_to_first_token jumped until it matched time_to_full_response. Satisfaction drops.
Elimination test: Check the streaming code's change log. Was it disabled for an outdated reason?
Fix: Re-enable. If JSON is the constraint, parse incrementally or stream only the user-facing prose.
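Streaming only the user-facing prose while buffering the JSON might look like the sketch below. It assumes the stream already arrives as `(kind, text)` chunks tagged `"prose"` or `"json"`; that tagging is hypothetical and depends on how your provider and prompt separate the two.

```python
import json


def handle_stream(chunks, emit):
    """Emit prose chunks immediately; buffer JSON chunks and parse once complete."""
    json_parts = []
    for kind, text in chunks:
        if kind == "prose":
            emit(text)  # the user sees this right away
        else:
            json_parts.append(text)  # partial JSON is not parseable yet
    return json.loads("".join(json_parts)) if json_parts else None


# Made-up stream: prose and JSON fragments interleaved.
shown = []
result = handle_stream(
    [("prose", "Refund "), ("json", '{"id": '), ("prose", "issued."), ("json", "42}")],
    shown.append,
)
print(shown, result)  # ['Refund ', 'issued.'] {'id': 42}
```

The user-visible first token comes back while the parser still only ever sees complete JSON.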

Worked example — the $0.18 → $0.62 mystery¶

Coding agent. Cost-per-PR-review climbed steadily.

   week 1   week 2   week 3   week 4
   $0.18    $0.24    $0.41    $0.62    ← 3.4x in 3 weeks

Step 1. Case board check. p50 and p95 rose proportionally. Not a tail spike. Steady inflation. Token-bloat is prime suspect.

Step 2. Decompose tokens-in versus tokens-out:

          week 1   week 2   week 3   week 4
   in:     2,100    2,400    3,100    3,800
   out:      620      650      670      690

Tokens-in grew 1.8x. Tokens-out rose about 11%. Crime is on the prompt side.

Step 3. Bisect against prompt-version timeline. Each version carries its evidence tag.

   v2.3 → v2.4: +50 tokens   (1 few-shot)
   v2.4 → v2.5: +600 tokens  ("review checklist" addition)  ← culprit
   v2.5 → v2.6: +100 tokens  (1 tool description)

Step 4. Verify the confession. Diff v2.4 vs v2.5. Run quality eval on both. Quality lift: 1.2%. Cost lift: 70%. Bad trade.

Step 5. Fix. Trim checklist to 5 items, 80 tokens. Quality holds. Cost drops to $0.22. The lock is a CI check — every prompt PR must report token-delta and cost-per-conversation eval.
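The arithmetic behind Steps 2 to 4 can be replayed in a few lines. The figures are copied from the tables above; the helper names are made up, and "quality points per cost point" is only one possible way to score the trade.

```python
# Weekly token counts and per-version deltas, copied from the worked example.
tokens_in = [2100, 2400, 3100, 3800]
tokens_out = [620, 650, 670, 690]
version_deltas = {
    ("v2.3", "v2.4"): 50,
    ("v2.4", "v2.5"): 600,
    ("v2.5", "v2.6"): 100,
}


def growth(series):
    """Ratio of the last value to the first."""
    return series[-1] / series[0]


def largest_bump(deltas):
    """Version step that added the most prompt tokens."""
    return max(deltas, key=deltas.get)


def lift_ratio(quality_lift_pct, cost_lift_pct):
    """Quality points gained per cost point spent."""
    return quality_lift_pct / cost_lift_pct


print(round(growth(tokens_in), 2), round(growth(tokens_out), 2))  # 1.81 1.11
print(largest_bump(version_deltas))                               # ('v2.4', 'v2.5')
print(round(lift_ratio(1.2, 70), 3))                              # 0.017
```

The same three numbers are what a prompt PR check could report: token growth, the version that caused it, and the quality bought per unit of cost.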

Latency and cost regressions across deployed agents¶

Helicone cost-anomaly alerts — auto-fires when a model+prompt-version combo's average cost-per-request spikes >20% over the 7-day baseline.
OpenAI usage dashboards — per-API-key token and cached-token breakdown lets teams catch a cache-hit-rate drop within hours of a prompt deploy.
Anthropic console usage — per-request token + cost + cache-hit rate visible in workbench; the role is making prompt-cache regressions debuggable per request.
Cursor's token-usage tracking — per-user cost panel exposes which features drove monthly spend; users self-throttle expensive flows.
Vercel AI SDK observability — experimental_telemetry emits per-call token counts and tool-call counts; pipes into Datadog or Langfuse for p95 panels.
Custom Datadog LLM-cost monitors — emit llm.cost.usd tagged by prompt_version, model, tenant; alerts wire to PagerDuty on p50 budget crossings.
LangSmith cost metrics — per-trace token and cost rollup; the role is exposing which span in the trace ate the budget.
Langfuse usage analytics — open-source LLM cost tracking with model and user dimensions; the role is making per-tenant cost auditable without vendor lock-in.
Comet Opik usage dashboards — eval-correlated cost metrics; the role is showing "did the cost rise because quality went up?" rather than treating cost in isolation.
AWS Cost Explorer for Bedrock — invocation-level cost with model dimension; the role is letting an SRE who already lives in AWS Cost Explorer inherit LLM cost views.
Azure cost management for OpenAI — service-level cost with deployment dimension; the role is enforcing chargeback for shared Azure OpenAI deployments.
GCP Vertex AI cost reporting — model-version-level breakdown; the role is making capability-cliff cost trade-offs visible to budget owners.
Honeycomb's p95/p99 alerts on LLM spans — latency outlier detection; the role is exposing tail-monster traces by structure, not just aggregate latency.
Datadog APM for LLM apps — span-level p99 with model dimension; the role is letting an existing SRE workflow inherit LLM latency triage.
Grafana LLM-cost dashboards — Prometheus-based panel templates; the role is fitting LLM cost monitoring into the team's existing dashboard culture.
BAML token budgeting — typed-DSL annotations that fail builds when a prompt exceeds a budget; the role is shifting the regression check left to compile time.
AgentOps cost-per-agent panels — per-agent breakdown in multi-agent runs; the role is exposing which agent ate the budget in a CrewAI-style system.
PromptHub cache-hit metrics — per-prompt cache statistics with version dimension; the role is making cache regressions a first-class metric.
LiteLLM proxy cost-aware routing — fallback to cheaper models with per-route cost ceilings; the role is enforcing budget at the proxy layer.
OpenRouter cost-per-call routing — multi-vendor route selection by cost; the role is exposing how vendor cost differences compound.
AWS Bedrock provisioned-throughput pricing — predictable cost per token; the role is exposing how the elastic-vs-provisioned trade-off changes regression sensitivity.
Anthropic prompt-cache discipline (5-min TTL) — static prefix cached for 5 minutes; the role is making cache-aware prompt design a documented best practice with measurable cost savings.
OpenAI batch API — async batched inference at 50% discount; the role is exposing that some "cost regressions" are workload shape mismatches, not bugs.

Recall — cost composition and the four growth patterns¶

Cost-per-conversation beats cost-per-token — name two reasons.
Tokens-in tripled but tokens-out flat — which two patterns fit?
p50 latency fine, p95 doubled — what two patterns explain this shape?
Cache-hit rate dropped 60% → 10% — what structural change likely caused it?

Interview Q&A¶

Q: Why is cost-per-token a misleading KPI compared to cost-per-conversation? A: Cost-per-token treats all tokens equal. A 2-token reply that triggered five tool calls costs more than a 50-token reply with no tools. Cost-per-conversation captures the whole task — prompt, completion, tool overhead, retries — in one number. That is what finance and product care about.

Common wrong answer to avoid: "Both are equivalent if you average correctly" — they are not. Cost-per-token hides retry storms and tool-call inflation. Two agents with identical cost-per-token can differ 3x in cost-per-conversation.

Q: Your p95 cost-per-conversation spiked but p50 is unchanged. Where do you look first? A: Tail-cost monsters. Filter traces where total cost is above the 95th percentile. Check retry rate and tool-call count on those traces. Almost always one of two things — a flaky tool causing retry storms, or a small class of complex queries that legitimately need more steps. The fix differs.

Common wrong answer to avoid: "Just upgrade infra" — the bug is in the agent's behavior on specific inputs, not raw throughput. Adding capacity masks the symptom while cost keeps climbing.

Q: A "smarter" model doubled your cost. Quality gained 3%. Defend or roll back? A: Depends on the value of 3%. For a high-stakes legal-review agent, 3% accuracy is worth 2x cost. For casual chat, no. The right play is tiered routing: cheap model for easy queries, expensive model for hard ones. That typically captures 90% of the quality at 30% of the cost.

Common wrong answer to avoid: "Smarter is always worth it" — model upgrades are not pure wins. If the lift is below your product's tolerance, the upgrade is a tax. Always run a holdout comparison.

Q: Why does prompt caching require a static prefix, and what breaks it? A: Providers cache the longest matching prefix. Match is token-by-token exact. If any dynamic content — user ID, timestamp, session variable — appears above the static system prompt, the prefix differs every call. Cache miss. The fix is structural: static first, dynamic last.

Common wrong answer to avoid: "Caching is automatic, the provider handles it" — providers cache only if your prompt has a stable prefix. Bad structure silently disables it. The crime statistics for cache-hit rate must be monitored.

Apply now (10 min)¶

Step 1 — model the exercise. Here is the cost-composition breakdown I would build for one expensive refund-bot trace:

Box                 Tokens / cost      % of trace cost   Likely cause if growing
Prompt tokens       4,200 / $0.012     38%               token-bloat in retrieved chunks
Completion tokens   1,400 / $0.018     56%               verbose answer style (style drift)
Tool-call overhead  3 calls / $0.001   3%                normal
Retries             1 retry / $0.001   3%                flaky tool → retry storm if growing

Verdict: completion-token cost is the dominant box; the regression is in answer style or system prompt, not retrieval. Tail-cost monsters are the 5% of traces with > 6 retries; check the flaky tool first.

Step 2 — your turn. Pull your agent's last 7 days of usage. Compute cost-per-conversation at p50 and p95. Compare against the previous 7 days. Any drift over 15%? Bisect by prompt version, model version, and tool error rate. Then decompose one expensive trace — of the total cost, what percent was prompt tokens, completion tokens, tool-call overhead, retries? Which box is biggest, and is that intentional?

Step 3 — reproduce from memory. Draw the four-box cost composition diagram. Annotate each box with which pattern grows it — token-bloat → prompt; tool inflation → tool overhead; retry storms → retry box; model upgrade → all four.
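For Step 2, the week-over-week drift check and the bisect by dimension can share one helper. A minimal sketch, assuming you have a cost-per-conversation figure per group (prompt version here) for each week; the sample numbers and function names are invented.

```python
def drift(previous: float, current: float) -> float:
    """Relative change from the previous period to the current one."""
    return (current - previous) / previous


def drifted_groups(previous: dict, current: dict, limit: float = 0.15) -> dict:
    """Return {group: drift} for groups whose cost rose by more than `limit`."""
    return {
        key: round(drift(previous[key], current[key]), 3)
        for key in current
        if key in previous and drift(previous[key], current[key]) > limit
    }


# Invented p50 cost-per-conversation, grouped by prompt version.
prev_week = {"v2.4": 0.20, "v2.5": 0.20}
this_week = {"v2.4": 0.21, "v2.5": 0.34}
print(drifted_groups(prev_week, this_week))  # {'v2.5': 0.7}
```

Running the same function grouped by model version and by tool is one way to do the three-way bisect the step asks for.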

What you should remember¶

This chapter explained why a green correctness eval and a red invoice can both be true at the same time. Latency and cost regressions are bugs in their own right, and they hide behind aggregate metrics that look fine. p50 cost can be unchanged while p95 doubles because a 5% slice of traces fell into retry storms. Cost-per-token can look stable while cost-per-conversation triples because a new tool call now happens on every turn. The diagnostic moves from aggregate to composition: break cost into four boxes (prompt, completion, tool overhead, retries) and the growing box names the pattern.

You also learned that cost is a quality signal. A prompt-cache hit rate dropping from 60% to 10% is not just an expensive bug; it is structural evidence that something put dynamic content above the static prefix. The fix is structural too: static first, dynamic last, and a cache-hit metric on the case board alongside the quality metric.

Carry this diagnostic forward: when an invoice doubles with no shipped change, suspect cache-hit-rate collapse before suspecting traffic growth. Cache regressions often happen when a downstream service starts injecting per-request headers into the prompt assembly path nobody thought of as cache-relevant.

Remember:

Cost-per-conversation beats cost-per-token. The former captures retries and tool overhead.
Tail-cost monsters live above p95. Filter the top 5% of traces by cost and inspect retry rate and tool-call count.
Prompt caching needs a static prefix. Any dynamic content above the system block kills the cache.
A model upgrade is rarely a pure win. Tiered routing usually captures 90% of the lift at 30% of the cost.
Cost regressions need their own alert on the case board, separate from quality alerts. The two metrics can disagree.

Design-side prevention — cross link¶

This chapter catches regressions after ship. The upstream prevention — set budgets before ship — is 16_agents_tool_calling chapter 11 — cost-latency-budgets. Every agent gets a token budget, tool-call budget, latency budget. Two chapters pair — design budgets in 16, detect breaches here.

Bridge. We have named every failure mode — wrong output, hallucinated tools, runaway loops, drift, slow, expensive. Eight chapters of patterns and elimination tests. But which tools actually let you do this work? Next — LangSmith, Phoenix, Braintrust — concrete debugging loops in each. → 15-debugging-tools-workflow.md