01. Cost anatomy — count the whole workflow before optimizing the token price

~22 min read. A lead AI engineer does not optimize a single number; they make the tradeoff visible, measured, and reversible.

Builds on 00-eli5.md. The meter ticks are not only prompt tokens; they include output, cached input, retries, tool calls, guardrails, and shadow traffic. The fuel ledger is the receipt that joins those ticks to an outcome.

What previous chapters solved before this pressure appears

The ELI5 taxi fleet gave us the operating picture: every ride has a meter, a dispatcher, an ETA, shared lanes, luggage, and a receipt. That picture solved the first confusion — cost is not a property of one model call. What still breaks is the spreadsheet that reports only the main generation and declares the product cheap while retries, tools, evals, and abandoned work accumulate off to the side. This chapter turns the taxi meter into an auditable fuel ledger.

The accumulated lesson is already visible in the taxi fleet. Meter ticks expose money, the ETA call exposes perceived wait, the dispatch board exposes route choice, and the fuel ledger keeps those choices honest. This file adds the next constraint without forgetting the earlier ones: every optimization relieves one pressure and creates another that some subsystem must absorb.

What this file solves

A support bot looks affordable at $0.008 per generation, then the monthly invoice lands 70% above forecast. This file shows how to build the receipt for one successful user outcome: fresh input, cached input, output, retries, tool-result tokens, verifier calls, and shadow traffic tied to the same request ID.

The opening failure shows up in a concrete artifact

The failure is not abstract: the monthly invoice came in above what the launch cost model predicted, even though traffic matched forecast. Here is the early artifact a reviewer can inspect.

Cost ledger excerpt for request sup-4481:

Stage	Model	Fresh input	Cached input	Output	Reason	Cost
classify	small	420	0	12	route	$0.0002
retrieve+answer	strong	1,850	900	460	first answer	$0.0083
schema repair	strong	2,760	0	180	invalid JSON	$0.0069
verifier	small	980	0	20	safety check	$0.0005
shadow judge	judge	1,600	0	60	rollout eval	$0.0020
workflow					resolved ticket	$0.0179

A smart team might try to fix the most visible line in that artifact. That is tempting, and it is incomplete. The root cause is boundary mismatch: finance pays for every token the workflow burns, while the dashboard reports only the happy-path answer call. So how do we count cost at the same boundary where the user receives value? This is the root-cause pivot: not a local metric problem, a boundary-and-pressure problem.
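The boundary mismatch can be checked directly against the excerpt. The sketch below copies the five stage costs from the sup-4481 ledger and compares the workflow total with what a dashboard would show if it reported only the answer call. The function names and the choice of `retrieve+answer` as the single visible stage are assumptions made for illustration.

```python
from decimal import Decimal

# Rows copied from the sup-4481 excerpt: (stage, cost in USD).
LEDGER_SUP_4481 = [
    ("classify", Decimal("0.0002")),
    ("retrieve+answer", Decimal("0.0083")),
    ("schema repair", Decimal("0.0069")),
    ("verifier", Decimal("0.0005")),
    ("shadow judge", Decimal("0.0020")),
]


def workflow_cost(rows):
    """Sum every stage that ran for the request, not just the answer call."""
    return sum((cost for _, cost in rows), Decimal("0"))


def dashboard_cost(rows, visible_stage="retrieve+answer"):
    """What a happy-path-only dashboard would report for the same request."""
    return sum((cost for stage, cost in rows if stage == visible_stage), Decimal("0"))


total = workflow_cost(LEDGER_SUP_4481)
visible = dashboard_cost(LEDGER_SUP_4481)
hidden_share = (total - visible) / total

print(f"workflow:  ${total}")            # $0.0179
print(f"dashboard: ${visible}")          # $0.0083
print(f"hidden share: {hidden_share:.1%}")  # 53.6%
```

In this one request, more than half of the spend never appears on the happy-path dashboard, which is the gap finance sees on the invoice.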

A tiny version exposes the whole mechanism

One answer with 2,400 input tokens and 450 output tokens costs $0.0084 at sample prices. One repair that resends the prompt adds $0.0064. A single hidden retry turns an eight-tenths-cent feature into a one-and-a-half-cent workflow.
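The arithmetic can be sketched as a small pricing function. The per-million-token prices below are assumptions, chosen so that the 2,400-in, 450-out answer reproduces the quoted $0.0084. The repair's 200 output tokens are also an assumption, picked so that resending the same prompt reproduces the quoted $0.0064. Replace both with your provider's real numbers.

```python
from decimal import Decimal

# Assumed sample prices in USD per million tokens (illustrative only).
PRICE_PER_MTOK = {
    "fresh_input": Decimal("2.00"),
    "cached_input": Decimal("0.50"),
    "output": Decimal("8.00"),
}


def call_cost(fresh_input=0, cached_input=0, output=0, prices=PRICE_PER_MTOK):
    """Price one model call from its token buckets."""
    tokens = {
        "fresh_input": fresh_input,
        "cached_input": cached_input,
        "output": output,
    }
    # tokens x (USD per million tokens) yields millionths of a dollar.
    micro_usd = sum(
        (Decimal(count) * prices[bucket] for bucket, count in tokens.items()),
        Decimal("0"),
    )
    return micro_usd / Decimal(1_000_000)


answer = call_cost(fresh_input=2400, output=450)
# Assumption: the repair resends the full 2,400-token prompt and emits 200 tokens.
repair = call_cost(fresh_input=2400, output=200)
workflow = answer + repair

print(f"answer:   ${answer:.4f}")    # $0.0084
print(f"repair:   ${repair:.4f}")    # $0.0064
print(f"workflow: ${workflow:.4f}")  # $0.0148
print(f"inflation: {workflow / answer:.2f}x")  # 1.76x
```

Most of the repair's cost is the resent prompt, not the corrected output, which is why a retry that looks small in the logs is large on the bill.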

Rule: The cost unit is the successful user outcome, not the cheapest API call inside it.

Why this rule exists. Tokens are the primitive, but user value arrives after a workflow succeeds. The constraint is that retries, tools, fallbacks, and evals can multiply token use without creating a second user-visible outcome. Counting only the main model call makes the naive optimization target the wrong object. The fuel ledger matters because it shows whether the new pressure landed in cost, latency, memory, quality, or operator attention.
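A minimal sketch of the rule, assuming each workflow records the cost of every call it made and whether the user outcome succeeded. The `Workflow` type and the three sample workflows are hypothetical; the per-call costs reuse the $0.0084 answer and $0.0064 repair from the tiny version.

```python
from dataclasses import dataclass


@dataclass
class Workflow:
    request_id: str
    call_costs: list[float]  # one entry per model call inside the workflow
    succeeded: bool


def cost_per_call(workflows):
    """The naive metric: average over every call, ignoring outcomes."""
    calls = [cost for w in workflows for cost in w.call_costs]
    return sum(calls) / len(calls)


def cost_per_successful_outcome(workflows):
    """The rule: all spend, divided by outcomes the user actually received."""
    total = sum(sum(w.call_costs) for w in workflows)
    successes = sum(1 for w in workflows if w.succeeded)
    if successes == 0:
        raise ValueError("no successful outcomes; cost per outcome is undefined")
    return total / successes


# Hypothetical sample: one clean success, one repaired success, one failure.
workflows = [
    Workflow("req-1", [0.0084], succeeded=True),
    Workflow("req-2", [0.0084, 0.0064], succeeded=True),
    Workflow("req-3", [0.0084, 0.0064], succeeded=False),
]

print(f"per call:    ${cost_per_call(workflows):.4f}")                # $0.0076
print(f"per outcome: ${cost_per_successful_outcome(workflows):.4f}")  # $0.0190
```

The per-call average falls as retries add cheaper calls to the denominator, while the per-outcome figure rises because the failed workflow's spend still has to be paid for by the successes.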

1) Build the receipt at the request boundary¶

Start with the workflow, not the vendor feature. In Maya's review, the team takes one request and follows it from API ingress to model call, tool call, runtime behavior, and resource consequence. That cross-layer trace is the shortest path from symptom to lever. If the symptom is cost, the trace follows meter ticks. If the symptom is silence, it follows the ETA call. If the symptom is serving pressure, it follows the carpool lane and boot space.

user request
    │
    ▼
API gateway ── route/version ──► model/runtime
    │                              │
    │                              ├─ tokens / queue / KV / output
    │                              ▼
    └──────── outcome ◄──────── fuel ledger row

The counterintuitive part is that the most obvious metric can improve while the product gets worse. A smaller bill can hide more failed outcomes. Higher tokens/sec can hide longer queueing. A shorter prompt can hide missing evidence. The mechanism in this chapter is useful only when the trace keeps the relieved pressure and the newly created pressure in the same picture.
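One way to make the trace concrete is a ledger row type that every stage writes, joined on the request ID. This is a sketch, not a prescribed schema: the field names, the `receipts` helper, and the `v1` prompt version are assumptions. The two sample rows reuse the token counts and costs of the answer and schema-repair stages from the sup-4481 excerpt.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class LedgerRow:
    request_id: str
    stage: str
    route: str
    prompt_version: str
    fresh_input: int
    cached_input: int
    output: int
    cost_usd: float
    retry_reason: Optional[str] = None  # None means a first attempt
    outcome: Optional[str] = None       # set on the row that closes the workflow


def receipts(rows):
    """Group rows by request ID into one receipt per workflow."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row.request_id].append(row)
    return {
        rid: {
            "cost_usd": round(sum(r.cost_usd for r in rs), 6),
            "retry_cost_usd": round(sum(r.cost_usd for r in rs if r.retry_reason), 6),
            "outcome": next((r.outcome for r in rs if r.outcome), "unknown"),
        }
        for rid, rs in grouped.items()
    }


rows = [
    LedgerRow("sup-4481", "retrieve+answer", "strong", "v1", 1850, 900, 460, 0.0083),
    LedgerRow("sup-4481", "schema repair", "strong", "v1", 2760, 0, 180, 0.0069,
              retry_reason="invalid JSON", outcome="resolved ticket"),
]

print(receipts(rows)["sup-4481"])
```

Keeping the retry reason and the outcome on the same rows as the tokens is what lets one query show both the relieved pressure and the newly created one.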

2) The workflow receipt before the unit price¶

Picture the chapter as a pressure transfer, not a free lunch.

Before optimization                    After optimization
┌──────────────────────┐              ┌──────────────────────┐
│ visible pain          │              │ relieved pain         │
│ cost / wait / memory  │──change──►   │ lower local metric    │
└──────────┬───────────┘              └──────────┬───────────┘
           │                                      │
           ▼                                      ▼
 hidden cause not named                 new pressure appears
 retries, route mix, context,           quality, queueing, cache,
 output, provider limits                memory, fallback, ops

The diagram is the reason this module keeps returning to the fuel ledger. The ledger is where the second box becomes visible instead of surfacing as an invoice surprise, p99 incident, or quality complaint weeks later.

3) Maya reviews the support bot forecast¶

Maya threads one workload through the design review: a production assistant with real traffic, route versions, prompt versions, and outcome labels.

Attempt A — optimize the visible line

The first attempt changes the local knob that seems responsible for the opening failure: a monthly invoice higher than the launch model predicted, even though traffic matched forecast. The local dashboard improves. The team celebrates too early because the request boundary is still broken: retries, quality loss, queueing, cache misses, or memory pressure move elsewhere.

Attempt B — optimize with the pressure chain

The second attempt keeps the artifact, the rule, and the guardrail together. Maya writes the expected improvement, the pressure that may worsen, the owner of that pressure, and the rollback trigger. The dispatch board may change a route, the memorized route may change a prompt prefix, or the carpool lane may change scheduling, but the same request ID proves whether the user outcome survived.
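What Maya writes down can be captured as a small record that refuses review until every field is filled in. The sketch below is one possible shape, not a standard: the `ChangeProposal` type, its field names, the example values, and the 0.15 threshold are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ChangeProposal:
    name: str
    expected_improvement: str  # e.g. "cost per resolved ticket falls"
    pressure_at_risk: str      # the pressure that may worsen
    pressure_owner: str        # who absorbs it
    guardrail_metric: str      # the paired metric to watch
    rollback_threshold: float  # roll back when the guardrail exceeds this

    def is_reviewable(self) -> bool:
        """A proposal is reviewable only when every pressure field is named."""
        fields = (
            self.expected_improvement,
            self.pressure_at_risk,
            self.pressure_owner,
            self.guardrail_metric,
        )
        return all(f.strip() for f in fields)

    def should_roll_back(self, observed: float) -> bool:
        """Compare the observed guardrail value with the agreed trigger."""
        return observed > self.rollback_threshold


proposal = ChangeProposal(
    name="shorter schema prompt",
    expected_improvement="fresh input tokens per answer fall",
    pressure_at_risk="schema repair rate",
    pressure_owner="support-bot team",
    guardrail_metric="retry_token_share",
    rollback_threshold=0.15,
)

print(proposal.is_reviewable())         # True
print(proposal.should_roll_back(0.18))  # True
```

Writing the trigger as a number before the rollout is the point: the rollback decision is then a comparison, not a debate.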

4) Why a per-call average loses to a workflow ledger¶

The plausible alternative is attractive because it is simpler to explain in a status update: change one knob, quote one percentage, and move on. That works for demos. It fails for lead-level ownership because it cannot answer which workload benefits and which workload pays.

Use this chapter's mechanism when the workload has the shape named in the opening artifact: several billable stages behind one user outcome. Use the per-call average when the product is small enough, stable enough, or low-risk enough that the extra machinery would cost more than it saves. The decision is not about elegance; it is about whether the extra signal is worth the operator time it takes to maintain.

5) Retry probability is the multiplier that changes the bill¶

Concrete numbers make the tradeoff review honest. The sample prices and token counts below are illustrative; replace them with the provider and workload numbers in your own stack.

Scenario	Fresh input	Cached input	Output	Repair rate or extra traffic	Cost
Happy path	2,400	0	450	0%	$0.0084
Cached system prompt	1,200	1,200	450	0%	$0.0066
One schema repair	5,160	0	630	100%	$0.0148
Small model plus 20% repair	1,800	700	430	20%	$0.0079
Shadow-eval rollout	2,400	0	450	+25% judge	$0.0105

The table teaches the design habit: read every row as a pair, the cost that improved and the pressure that might have worsened. If a proposal cannot name both, it is not ready for production review.
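The multiplier can be written as an expected-value formula. The sketch below assumes at most one repair attempt per workflow and treats eval traffic as a fractional overhead on the base call. The function name and parameters are illustrative. The base and repair costs are the $0.0084 and $0.0064 figures used earlier in the chapter.

```python
def expected_workflow_cost(base_cost, repair_cost, repair_probability, eval_overhead=0.0):
    """Expected cost of one workflow with at most one repair attempt.

    eval_overhead is extra spend as a fraction of the base call (0.25 = +25%).
    """
    if not 0.0 <= repair_probability <= 1.0:
        raise ValueError("repair_probability must be between 0 and 1")
    return base_cost * (1.0 + eval_overhead) + repair_probability * repair_cost


BASE, REPAIR = 0.0084, 0.0064  # happy-path answer and one repair, from the text

for p in (0.0, 0.05, 0.20, 1.0):
    print(f"repair rate {p:>4.0%}: ${expected_workflow_cost(BASE, REPAIR, p):.4f}")
# repair rate   0%: $0.0084
# repair rate   5%: $0.0087
# repair rate  20%: $0.0097
# repair rate 100%: $0.0148

shadow = expected_workflow_cost(BASE, REPAIR, 0.0, eval_overhead=0.25)
print(f"shadow-eval rollout: ${shadow:.4f}")  # $0.0105
```

The 0% and 100% cases reproduce the happy-path and schema-repair rows, and the overhead case reproduces the shadow-eval row. The intermediate rates show how a repair probability that looks small still moves the unit cost.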

6) The cheap schema prompt that doubled spend¶

Walk the failure from top to bottom. The user action enters the API. The application builds a prompt or route. The runtime spends tokens, queue time, cache memory, or output steps. The dashboard records a local improvement. Then the user-visible metric moves the wrong way.

That failure is not bad luck. It is what happens when the optimization changes one layer and the observation stops one layer too early. In a review, Maya asks for the missing link: where did the pressure go after the local metric improved? If nobody can answer, the change ships behind a small canary or does not ship.

7) Signals that reveal whether cost anatomy is healthy¶

Healthy behavior: cost per successful ticket flat or falling while resolution rate is stable.
First degrading metric: retry-token share rises before the invoice looks scary.
Misleading beginner metric: average cost per call, because route mix can hide workflow inflation.
Expert graph: cost per outcome sliced by route, prompt version, retry reason, and tenant.
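The first degrading metric and the expert graph can both be computed from ledger rows. This sketch assumes rows are plain dictionaries with hypothetical keys (`tokens`, `cost_usd`, `retry_reason`, `outcome`), and that the slice key, such as prompt version, is the same on every row of a workflow. The sample rows are made up for illustration.

```python
from collections import defaultdict


def retry_token_share(rows):
    """Fraction of all tokens spent on rows that carry a retry reason."""
    total = sum(r["tokens"] for r in rows)
    retry = sum(r["tokens"] for r in rows if r.get("retry_reason"))
    return retry / total if total else 0.0


def cost_per_outcome_by(rows, key, success_outcome="resolved"):
    """Cost per successful outcome, sliced by one workflow-level key."""
    spend = defaultdict(float)
    successes = defaultdict(set)
    for r in rows:
        spend[r[key]] += r["cost_usd"]
        if r.get("outcome") == success_outcome:
            successes[r[key]].add(r["request_id"])
    # A slice with spend but no successes has no finite cost per outcome.
    return {
        k: spend[k] / len(successes[k]) if successes[k] else float("inf")
        for k in spend
    }


rows = [
    {"request_id": "a", "prompt_version": "v1", "tokens": 2850,
     "cost_usd": 0.0084, "outcome": "resolved"},
    {"request_id": "b", "prompt_version": "v2", "tokens": 2850,
     "cost_usd": 0.0084},
    {"request_id": "b", "prompt_version": "v2", "tokens": 2600,
     "cost_usd": 0.0064, "retry_reason": "invalid JSON", "outcome": "resolved"},
]

print(f"retry-token share: {retry_token_share(rows):.1%}")  # 31.3%
for version, cost in cost_per_outcome_by(rows, "prompt_version").items():
    print(f"{version}: ${cost:.4f} per resolved ticket")
# v1: $0.0084 per resolved ticket
# v2: $0.0148 per resolved ticket
```

Both prompt versions show the same cost for the first answer call, so only the per-outcome slice reveals that v2 is the expensive one.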

Mini-FAQ. "Why not watch the simplest metric?" Because the simplest metric is often the one the optimization directly manipulates. You need a paired guardrail that shows whether the system merely moved pain into another layer.

8) Boundaries where the chapter's lever works and where it turns pathological¶

Strong fit: high-volume workflows with clear success outcomes and repeated routes.
Pathology: one-off research work where value is ambiguous and manual review dominates.
Scale or workload limit: when offline eval traffic, human review time, or vendor minimums exceed token spend.

This boundary is not a disclaimer. It is a routing rule for engineering attention. The best optimization in one endpoint can be the wrong default for another endpoint with different latency tolerance, risk, context length, or outcome value.

9) Wrong mental model to replace¶

The seductive mistake is believing the cheapest call is the cheapest product. The replacement model is receipt thinking: every extra attempt, verifier, tool summary, and shadow route belongs to the same outcome until the user receives value.

The replacement model should change how you speak in design review. Do not say, "this reduces cost" or "this improves latency" without naming the request slice, expected magnitude, guardrail, and rollback trigger. Say which meter ticks, ETA call, carpool lane, or boot space pressure changed.

10) Other failure shapes you will recognize¶

Tool failure. Tool outputs pasted back into prompts without truncation.
Shadow failure. Shadow eval traffic excluded from product margin.
Fallback failure. Fallback routes counted as new requests instead of repair attempts.
Tenant-specific failure. Tenant-specific prompts preventing cache attribution.
Human failure. Human retry loops missing from the cost row.
Success failure. Success metric absent, making cost cuts indistinguishable from quality cuts.
Model-price failure. Model-price updates not versioned in historical dashboards.
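The model-price failure has a simple structural fix: store prices with an effective date and price each historical call with the rate in force on that day. The sketch below shows one way to do that lookup. The price history, its dates, and the function names are hypothetical.

```python
from bisect import bisect_right
from datetime import date

# Hypothetical history for one model: (effective_from, USD per million input tokens).
PRICE_HISTORY = [
    (date(2024, 1, 1), 3.00),
    (date(2024, 7, 1), 2.00),
]


def price_on(day, history=PRICE_HISTORY):
    """Return the price in effect on `day`; history must be sorted by date."""
    dates = [effective_from for effective_from, _ in history]
    i = bisect_right(dates, day)
    if i == 0:
        raise ValueError(f"no price recorded on or before {day}")
    return history[i - 1][1]


def historical_cost(tokens, day):
    """Price a past call with the rate that applied when it ran."""
    return tokens * price_on(day) / 1_000_000


print(f"${historical_cost(2400, date(2024, 3, 15)):.4f}")  # $0.0072
print(f"${historical_cost(2400, date(2024, 8, 1)):.4f}")   # $0.0048
```

Without the effective date, a price cut silently rewrites every historical dashboard, and a real regression in token use can look like a saving.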

11) Cross-topic reinforcement — the same pressure shape returns¶

Latency anatomy uses the same request boundary but spends time rather than money.
Prompt caching only proves value when cached-token share appears in the ledger.
Model routing turns the ledger into a weighted average that can be invalidated by repair rate.
Cost dashboards later automate this receipt so finance does not become the monitoring system.
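The routing point can be made concrete with a break-even calculation. This sketch assumes a failed small-model answer is always repaired by one strong-model call, and it ignores latency and quality differences. The function names and the two per-call costs in the example are hypothetical.

```python
def small_route_cost(small_cost, strong_cost, repair_rate):
    """Expected cost when failed small-model answers are repaired by the strong model."""
    return small_cost + repair_rate * strong_cost


def break_even_repair_rate(small_cost, strong_cost):
    """Repair rate above which the small route costs more than going strong first."""
    return 1.0 - small_cost / strong_cost


def blended_cost(small_share, small_cost, strong_cost, repair_rate):
    """Weighted average cost across a route mix."""
    small = small_route_cost(small_cost, strong_cost, repair_rate)
    return small_share * small + (1.0 - small_share) * strong_cost


SMALL, STRONG = 0.0021, 0.0084  # hypothetical per-call costs in USD

print(f"break-even repair rate: {break_even_repair_rate(SMALL, STRONG):.0%}")  # 75%
for rate in (0.05, 0.20, 0.80):
    cost = blended_cost(0.7, SMALL, STRONG, rate)
    print(f"repair rate {rate:.0%}: blended ${cost:.4f}")
```

The weighted average only holds while the repair rate stays where the forecast assumed it; the ledger's retry rows are what tell you when it has moved.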

12) Design-review questions that catch shallow plans¶

Can one request ID join every model call, tool call, retry, and outcome?
Can the dashboard split fresh input, cached input, output, and retry tokens?
Can you compute cost per resolved ticket, accepted edit, or completed report?
Can you explain which subsystem pays when an optimization saves one bucket?

Where this shows up in production

Enterprise support bot — turns route, token, cache, retry, and outcome rows into cost per resolved ticket rather than model spend per message.
Coding assistant — separates inline completions from agentic edits because typing flow, repo context, and repair loops have different budgets.
Search answer product — pays for rewriting, retrieval, reranking, synthesis, citations, and judge calls as one user-visible answer.
Voice assistant — treats dead air, cancellation, and local fallback as product features because users notice 100 ms gaps.
Back-office summarizer — uses larger queues and batches because humans care about daily throughput more than first-token immediacy.
Commerce assistant — protects purchase-changing actions with stronger routes while letting read-only advice run cheaper.
Internal data copilot — attributes spend by tenant, dataset, prompt version, and tool path so one team cannot hide another team's budget.
Education tutor — spends tokens on safety and pedagogy rules, then watches whether shorter answers still teach well.
Legal review workflow — keeps evidence and citation context even when compression pressure is high because unsupported claims are worse than cost.
Healthcare intake helper — uses conservative routing and buffered streaming because safety checks are part of the latency path.
Marketing content tool — controls output length and variant count because creative generation can silently explode spend.
Incident-response copilot — prefers predictable latency and logs over clever savings during high-severity operations.

Recall — rebuild cost anatomy from memory

What concrete failure opened this chapter, and which artifact made it inspectable?
What root cause made the naive fix insufficient?
State the rule in one sentence without using vendor language.
Which pressure does the mechanism relieve, and which new pressure can it create?
Which operational signal degrades first when the mechanism is misapplied?
Where is the boundary where this lever becomes pathological?
How does this chapter reuse the fuel ledger or dispatch board from earlier chapters?
What would you put in the rollback trigger for this optimization?

Interview Q&A

Q: Why measure cost per successful workflow instead of cost per API call?

A: Because user value and hidden spend live at the workflow boundary. Retries, verifier calls, fallbacks, tool loops, and shadow evals may all occur before one visible answer, so per-call averages understate unit economics.

Common wrong answer to avoid: Per-call cost is enough if the average is low.

Q: Why split fresh input, cached input, output, and retry tokens?

A: Each bucket has a different price and lever. Fresh input points to prompt and retrieval size, cached input to stable prefixes, output to answer contracts, and retries to reliability defects.

Common wrong answer to avoid: All tokens are basically the same once counted.

Q: Why can a cheaper model increase total cost?

A: If it fails often, emits longer answers, or needs repair by a stronger model, the workflow can cost more than one correct call on the stronger route.

Common wrong answer to avoid: Lower per-token price always means lower product cost.

Q: Why include shadow and eval traffic in the ledger?

A: Those calls consume real tokens and often spike during launches. Excluding them hides rollout cost and makes product margin look healthier than it is.

Common wrong answer to avoid: Eval traffic is not user traffic, so it does not matter.

Q: What metadata belongs on a cost row?

A: Request ID, feature, tenant, route, model, prompt version, token buckets, retry reason, cache hit, latency, and outcome. Without joins, attribution collapses.

Common wrong answer to avoid: A model name and token count are enough.

Q: How do you defend a more expensive route?

A: Show that it improves the outcome metric enough to reduce retries, escalations, refunds, or churn. The comparison is cost per successful outcome, not sticker price.

Common wrong answer to avoid: The expensive model is always a luxury.

Q: What is the first cost regression you investigate?

A: A rise in retry-token share or output length by prompt version, because both can grow spend while traffic appears normal.

Common wrong answer to avoid: Vendor prices must have changed.

Apply now (10 min)

Step 1 — model the exercise. Draw one real workflow from user request to outcome. Use the table above as the modeled receipt, then fill fresh input, cached input, output, retry probability, verifier calls, and cost for each stage. Reproduce the final cost from memory and write the one VP-level metric you would report: cost per resolved ticket, accepted edit, completed report, or safe action.

Step 2 — your turn. Pick a real LLM feature and write the same artifact with your own rough numbers. Name the pressure relieved, the pressure created, the owner, and the metric that would prove the change unsafe.

Step 3 — reproduce from memory. Close the file and redraw the two diagrams: request trace and pressure transfer. Then restate the rule and the first degrading metric without looking.

What you should remember

This chapter explained why the opening failure is not solved by changing one local knob. The useful move is to make the request boundary inspectable, apply the topic rule, and watch the paired guardrail so the optimization cannot hide its cost in another subsystem.

You learned to describe the lever as pressure movement: what it relieves, what it creates, and which team or resource absorbs the new cost. That is the difference between a trick and an operating practice.

Carry the diagnostic forward: if the dashboard cannot show the artifact, the route or version, the user outcome, and the first degrading signal in one place, the optimization is not yet reviewable.

Remember:

The user outcome is the accounting boundary.
Retries and output length can dominate before prompt tokens do.
Every cost row needs route, version, bucket, reason, and outcome metadata.
A cheaper model is cheaper only if the workflow still succeeds.
The fuel ledger is a production artifact, not a finance afterthought.

Bridge. Money is now counted at the same boundary as user value, but users also spend time while the workflow runs. The next chapter separates queueing, prefill, first-token silence, decode speed, and total completion time so the fuel ledger can explain slowness as well as spend.

→ ./02-latency-anatomy.md