12. Pricing anatomy: reading the invoice the supplier sends

~17 min read. Per-token pricing looks simple. It is not. Input and output are priced differently. Cached and uncached are priced differently. Batch and realtime are priced differently. Vision tokens, audio tokens, video tokens, reasoning tokens — each has its own multiplier. Reading the invoice properly is how applied AI leads find the line item that pays for itself in a week of optimisation.

Builds on 11-on-prem-vs-managed-economics.md. Whether you rent or own the kitchen, the bill still arrives with structure. This chapter is the structure.

1) Hook: the agent loop where 70% of the cost lived in the planner

A planning agent runs against a multi-tool environment. Each user turn triggers an average of eight LLM calls — one planner call, then six specialised sub-calls, then a final synthesiser call. The team prices the workload by computing per-call cost and multiplying by call count. Monthly projection — $18,000. Actual first month — $42,000.

The applied AI lead pulls the invoice. The breakdown stuns the team.

LINE ITEM                                MONTHLY COST    SHARE
─────────                                ────────────    ─────
Planner call (Opus 4.7 reasoning)        $29,400         70%
Sub-tool calls (Haiku 4.5)                $4,300         10%
Synthesiser (Sonnet 4.6)                  $5,100         12%
Vision tokens on document tool             $2,800          7%
Other                                        $400          1%
Total                                    $42,000        100%

The planner is burning the budget because every call runs in full reasoning mode on Opus 4.7, and the planner's prompt (the full toolbelt schema, the system instructions, the conversation history, and the task description) is largely the same across consecutive calls within a session.

The code change takes three days. Prompt caching gets turned on for the planner's static prefix: toolbelt schema, system instructions, and the parts of conversation history older than the current turn. The cached portion qualifies for the 90% discount on cache reads. Per-planner-call cost on the prefix drops by an order of magnitude. The reasoning portion (the new turn's specific input plus the reasoning tokens) is not cached, but the prefix is most of the input.

New monthly projection — $14,200. The optimisation took a week including testing. Annual savings — $334,000. The cost was always there. The team just had not read the invoice carefully enough to see where it lived.
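The arithmetic behind the hook can be checked in a few lines. This is a minimal sketch that uses only the figures from the invoice table above; the function names are illustrative, not part of any billing tool.

```python
# Line items copied from the invoice table above (USD per month).
invoice = {
    "planner": 29_400,
    "sub_tool_calls": 4_300,
    "synthesiser": 5_100,
    "vision_tokens": 2_800,
    "other": 400,
}


def line_shares(items):
    """Return each line item's share of the total, as a fraction."""
    total = sum(items.values())
    return {name: cost / total for name, cost in items.items()}


def annual_savings(before_monthly, after_monthly):
    """Twelve months of the monthly difference."""
    return (before_monthly - after_monthly) * 12


shares = line_shares(invoice)
largest = max(shares, key=shares.get)
print(largest, f"{shares[largest]:.0%}")
print(annual_savings(42_000, 14_200))
```

The largest line is the first place to look. Here it is the planner, and the annual figure is the $334,000 quoted above before rounding.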

That story is this chapter. The invoice has structure. The structure is where the optimisations live.

2) The metaphor: reading the supplier's bill line by line

The kitchen manager who reads each line on the supplier's invoice runs a different kitchen than the manager who just looks at the total. Onion cost is up 18% — switch suppliers for onions only. Olive oil cost has a new "premium grade" tier added — the supplier shifted us up without asking. Bulk discount kicked in this month — re-negotiate the contract to lock the threshold lower.

None of these are total-cost questions. They are line-item questions. You only see them if you read the invoice.

The same applies to LLM bills. The total is usually one number. The structure underneath has eight to twelve distinct dimensions. Each dimension hides a different optimisation.

3) Input vs output: why output is more expensive

The first dimension. Input tokens are cheaper than output tokens. Typically 3x-5x cheaper. Sometimes more.

2026 INPUT/OUTPUT MULTIPLES (rough)
────────────────────────────────────
Anthropic Sonnet 4.6     :  $3 in    / $15 out   →  5x
OpenAI GPT-5             :  $5 in    / $20 out   →  4x
Google Gemini 2.5 Pro    :  $2 in    / $10 out   →  5x
Anthropic Haiku 4.5      :  $0.80 in / $4 out    →  5x
OpenAI GPT-4o-mini       :  $0.15 in / $0.60 out →  4x
Google Gemini 2.5 Flash  :  $0.30 in / $1.20 out →  4x

(Numbers illustrative for 2026; exact prices change. The shape is durable.)

Why the multiple. Input tokens are processed in parallel: one forward pass through the transformer handles the entire input at once, with attention computed across all positions simultaneously. Output tokens are autoregressive: each token requires its own sequential forward pass that attends to all previous tokens, so the hardware is used far less efficiently per token. The serving cost per output token is genuinely higher.

The implication. Workloads with long inputs and short outputs (RAG synthesis, classification, extraction) are cheap. Workloads with short inputs and long outputs (creative writing, long reports, verbose agents) are expensive. The arithmetic gets dominated by output volume.

The optimisation. Push for terser outputs when the use case allows. The same task can often produce 30-50% fewer output tokens with a slightly tighter prompt, and the cost saving lands on every call.
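A minimal sketch of the input/output arithmetic, assuming the illustrative $3 in / $15 out row from the table. The token counts are hypothetical workloads chosen to show the asymmetry.

```python
def call_cost(input_tokens, output_tokens, in_per_m, out_per_m):
    """Dollar cost of one call, given prices per million tokens."""
    return (input_tokens * in_per_m + output_tokens * out_per_m) / 1_000_000


IN_PER_M, OUT_PER_M = 3.0, 15.0  # illustrative prices, not a quote

rag_synthesis = call_cost(8_000, 300, IN_PER_M, OUT_PER_M)   # long in, short out
long_report = call_cost(300, 8_000, IN_PER_M, OUT_PER_M)     # short in, long out
terser_report = call_cost(300, 4_800, IN_PER_M, OUT_PER_M)   # 40% fewer output tokens

print(f"RAG synthesis  ${rag_synthesis:.4f}")
print(f"Long report    ${long_report:.4f}")
print(f"Terser report  ${terser_report:.4f}")
```

Both workloads move 8,300 tokens. The one that puts them on the output side costs roughly four times as much, and trimming output is where the saving lands.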

4) Prompt caching: the 90% discount most teams underuse

Prompt caching is the single largest optimisation lever available to most production workloads in 2026. The mechanism — the supplier recognises a prefix it has seen before, reuses the KV cache from the prior call, and bills the cached portion at a steep discount.

Typical 2026 economics.

PROVIDER         CACHE WRITE   CACHE READ      TTL
────────         ───────────   ──────────      ───
Anthropic        1.25x input   10% of input    5 minutes default,
                                               1 hour at premium
OpenAI           Same as input 50% of input    Implicit, ~10 min
Google Gemini    Premium       25% of input    Configurable

(Numbers illustrative; verify per-vendor at deployment time.)

What gets cached. Anthropic's cache is explicit — the request marks specific blocks (system prompt, tool definitions, long context documents) as cache breakpoints. OpenAI's cache is implicit — prefixes of 1024+ tokens that recur are cached automatically. Both work; the mechanism differs.

The TTL matters. Anthropic's 5-minute default TTL is short enough that serial calls within a user session usually hit the cache, but infrequent batch workloads or low-volume tools may not. The 1-hour TTL bills cache writes at roughly 2x the base input rate but extends the window for less-frequent workloads.

When caching pays. Any prompt with a stable prefix (system instructions, toolbelt schema, document context, conversation history) that is called repeatedly. Agent loops are the canonical winners. RAG systems with a stable instruction header are also major beneficiaries. Caching does not help one-shot calls with fully unique prompts.
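To see how much a stable prefix is worth, a rough model helps. The sketch below assumes one cache write on the first call and cache reads on every later call inside the TTL. The multipliers, prices and token counts are placeholders to swap for the vendor's current rates.

```python
def prefix_input_cost(prefix_tokens, dynamic_tokens, calls, price_per_m,
                      write_multiplier, read_multiplier):
    """Input cost for `calls` calls that share one prefix inside the cache TTL.

    The first call writes the prefix to the cache; later calls read it.
    Multipliers are relative to the base input price.
    """
    per_token = price_per_m / 1_000_000
    write = prefix_tokens * per_token * write_multiplier
    reads = prefix_tokens * per_token * read_multiplier * (calls - 1)
    dynamic = dynamic_tokens * per_token * calls
    return write + reads + dynamic


# Hypothetical session: 6,000-token prefix, 500 new tokens per call, 8 calls, $3/M input.
uncached = prefix_input_cost(6_000, 500, 8, 3.0,
                             write_multiplier=1.0, read_multiplier=1.0)
cached = prefix_input_cost(6_000, 500, 8, 3.0,
                           write_multiplier=1.25,  # assumed write premium
                           read_multiplier=0.10)   # assumed 90% read discount

print(f"uncached ${uncached:.4f}  cached ${cached:.4f}  saving {1 - cached / uncached:.0%}")
```

With `calls=1` the cached path costs more than the uncached one whenever the write multiplier is above 1. That is the one-shot case where caching does not help.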

The architectural rule that follows — put static content at the front of the prompt, dynamic content at the back. The supplier caches prefixes, not suffixes. A prompt that puts the user's question first and the system instructions last cannot be cached. Reverse the order and the same content qualifies for the discount.
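The ordering rule can be demonstrated without calling any supplier. This sketch compares how much of two prompts is shared from the start under each ordering. Characters stand in for tokens, and the prompt text is made up for illustration.

```python
import os

STATIC = "SYSTEM: You are a planning agent.\nTOOLS: search, calculator, calendar\n"
QUESTIONS = [
    "USER: What is on my calendar today?",
    "USER: Convert 40 miles to km.",
]


def shared_prefix_chars(prompts):
    """Length of the longest prefix common to all prompts."""
    return len(os.path.commonprefix(prompts))


static_first = [STATIC + q for q in QUESTIONS]
question_first = [q + "\n" + STATIC for q in QUESTIONS]

print("static first  :", shared_prefix_chars(static_first), "shared characters")
print("question first:", shared_prefix_chars(question_first), "shared characters")
```

With static content first, the whole static block is shared and can be cached. With the question first, the prompts diverge almost immediately and nothing useful is reusable.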

5) Batch API: 50% off for 24-hour patience

The batch API is the second large lever. The bargain — submit jobs that do not need realtime results; receive results within 24 hours; pay 50% less.

PROVIDER         BATCH DISCOUNT   COMMITMENT
────────         ──────────────   ──────────
Anthropic        50%              24h
OpenAI           50%              24h
Google Gemini    50%              24h

When batch wins. Embedding generation for a corpus, periodic eval runs, large-scale data labelling, content moderation backfill, periodic summarisation jobs, training data generation. Any workload that is offline by nature gets the discount for free.

When batch loses. Anything user-facing. Anything where 24-hour latency breaks the product. Anything where the cost of waiting (downstream pipeline blocked, decisions delayed) exceeds the 50% saving.

A common pattern that mixes both. Realtime user queries run on the synchronous API at full price. Embedding refreshes, eval runs, and analytical workloads run on the batch API at half price. The same account, the same model, two different cost lines. Many teams have 30-40% of their token volume eligible for batch and only run 5% there.
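The gap between eligible and actual batch share translates directly into money. A minimal sketch, assuming a hypothetical $40,000 of monthly volume priced at synchronous rates and the 50% batch discount from the table:

```python
def blended_monthly_cost(sync_priced_spend, batch_share, batch_discount=0.50):
    """Monthly cost when `batch_share` of token volume runs on the batch API.

    `sync_priced_spend` is what the whole volume would cost at synchronous rates.
    """
    return sync_priced_spend * (1 - batch_share * batch_discount)


today = blended_monthly_cost(40_000, batch_share=0.05)    # 5% on batch
target = blended_monthly_cost(40_000, batch_share=0.35)   # 35% on batch

print(f"today ${today:,.0f}  target ${target:,.0f}  saving ${today - target:,.0f}/month")
```

The share is measured by cost at synchronous prices here. If the batch-eligible work uses a cheaper model than the realtime work, weight the share accordingly.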

6) Vision, audio, video, reasoning: the multipliers nobody warns you about

Multimodal tokens have their own multipliers, and reasoning tokens have their own arithmetic. None of these show on the top-line price page.

Vision tokens. An image is converted to a token sequence. Typical counts in 2026 — 1,000 to 6,000 tokens per image depending on resolution and provider. A high-resolution image at 6,000 vision tokens costs the same as 6,000 text tokens. Heavy image workloads burn through input budgets faster than the price page suggests.

Audio tokens. Audio input on multimodal models is priced per second or per token, depending on provider. A 10-minute meeting recording can produce 30,000+ audio tokens of input. Workloads that process call recordings or voice messages need to model this explicitly.

Video tokens. Video is the most expensive multimodal modality. Gemini's video input is priced per frame extracted, and a 60-second video at 1 fps can land at 100,000+ tokens of input. Video summarisation workloads can cost an order of magnitude more than the equivalent text summarisation.

Reasoning tokens. These go by different names: extended thinking on Opus 4.7, reasoning_effort: high on GPT-5, the thinking budget on Gemini 2.5 Pro. In every case the tokens are generated internally by the model before the user-visible output, and they are billed at output rates. A reasoning call may produce 200 user-visible output tokens preceded by 4,000 reasoning tokens, and you pay the output rate for all 4,200.

CALL TYPE             TOKEN COUNT BILLED
─────────             ──────────────────
Standard text         Input + output
Vision input          Input (with image multiplier) + output
Audio input           Input (with audio multiplier) + output
Video input           Input (with video multiplier) + output
Reasoning text        Input + reasoning + output (all billed)

The reasoning multiplier is the one that surprises teams most. A call that bills 200 output tokens on a non-reasoning model can bill 5,000 on a reasoning model once the reasoning tokens are counted. Cost projections built on the non-reasoning baseline miss this entirely.
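The reasoning arithmetic from this section, as a small sketch. The output price is a hypothetical $15 per million tokens; the token counts are the 200 visible and 4,000 reasoning tokens used in the text.

```python
def output_bill(visible_tokens, reasoning_tokens, out_per_m):
    """Return (billed output tokens, dollar cost); reasoning bills at the output rate."""
    billed = visible_tokens + reasoning_tokens
    return billed, billed * out_per_m / 1_000_000


OUT_PER_M = 15.0  # hypothetical output price per million tokens

plain_tokens, plain_cost = output_bill(200, 0, OUT_PER_M)
thinking_tokens, thinking_cost = output_bill(200, 4_000, OUT_PER_M)

print(f"non-reasoning: {plain_tokens} tokens, ${plain_cost:.4f}")
print(f"reasoning    : {thinking_tokens} tokens, ${thinking_cost:.4f}")
print(f"multiplier   : {thinking_cost / plain_cost:.0f}x on the output line")
```

The user sees the same 200 tokens in both cases. Only the invoice shows the difference.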

Mid-content recall

Why are output tokens priced 3-5x higher than input tokens?
What is the architectural rule that follows from how prompt caching works?
When does the batch API win and when does it lose?

7) Tiered pricing and provisioned throughput

Two more dimensions hide on most vendor price pages.

Tiered pricing. Spend more, pay less per token. Typical 2026 tiers on the major suppliers — discounts begin around $10K monthly spend, with deeper tiers at $50K, $250K, and bespoke at $1M+. Discounts are typically 5-20%, sometimes higher for negotiated enterprise contracts.

The tier mechanic that matters — discounts apply to future tokens at that tier, not retroactively to the month's volume. A workload that spikes one month to $60K does not retroactively get the $50K-tier discount on the whole month; it only gets the tier rate on tokens beyond the threshold. Some vendors smooth this; many do not. Read the contract.
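The difference between prospective and retroactive tiers is easy to get wrong in a spreadsheet. The sketch below computes both for the $60K month in the example. The tier thresholds and discount percentages are hypothetical; a real contract will define its own.

```python
# (list-price spend threshold, discount on spend above that threshold): hypothetical tiers
TIERS = [(0, 0.00), (10_000, 0.05), (50_000, 0.10)]


def marginal_bill(list_spend, tiers=TIERS):
    """Each tier's discount applies only to the spend that falls inside that tier."""
    total = 0.0
    for i, (start, discount) in enumerate(tiers):
        end = tiers[i + 1][0] if i + 1 < len(tiers) else float("inf")
        in_band = max(0.0, min(list_spend, end) - start)
        total += in_band * (1 - discount)
    return total


def retroactive_bill(list_spend, tiers=TIERS):
    """The highest tier reached applies to the whole month (the optimistic spreadsheet)."""
    discount = max(d for start, d in tiers if list_spend >= start)
    return list_spend * (1 - discount)


spend = 60_000
print(f"prospective ${marginal_bill(spend):,.0f}")
print(f"retroactive ${retroactive_bill(spend):,.0f}")
```

The gap between the two numbers is the size of the projection error if the spreadsheet assumes retroactive tiers and the contract does not.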

Provisioned throughput. Some suppliers offer a different billing mode entirely — pay for committed capacity, not per token. AWS Bedrock calls it Provisioned Throughput. Azure OpenAI offers Provisioned Throughput Units (PTUs). Anthropic and OpenAI offer enterprise capacity reservations.

The math. Per-token billing is variable cost — usage drives cost. PTU is fixed cost — capacity drives cost. PTU wins when usage is high and stable. PTU loses when usage is bursty (you pay for peak capacity even at trough).

WORKLOAD SHAPE              BILLING MODE THAT WINS
──────────────              ──────────────────────
Steady, high-volume         PTU / provisioned
Bursty, peaky               Per-token (with maybe smaller PTU floor)
Low-volume, irregular       Per-token
Predictable, contractual    PTU often wins after negotiation

The crossover threshold for PTU vs per-token depends on the model and the vendor, but rough 2026 numbers — PTU starts to win above roughly 60-70% sustained utilisation of a single PTU. Below that, per-token wins. Many enterprise teams run a small PTU base plus per-token burst.
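The crossover can be estimated from three numbers. A minimal sketch, with every figure hypothetical: a PTU fee of $6,500 per month, a capacity of 2,000 million tokens per month, and a blended per-token price of $5 per million.

```python
def breakeven_utilisation(ptu_monthly_cost, ptu_capacity_m_tokens, per_token_price_per_m):
    """Utilisation at which per-token spend equals the fixed PTU fee."""
    cost_at_full_capacity = ptu_capacity_m_tokens * per_token_price_per_m
    return ptu_monthly_cost / cost_at_full_capacity


def cheaper_mode(utilisation, breakeven):
    """Pick the billing mode that costs less at a sustained utilisation level."""
    return "provisioned" if utilisation > breakeven else "per-token"


breakeven = breakeven_utilisation(6_500, 2_000, 5.0)
print(f"break-even utilisation: {breakeven:.0%}")
for utilisation in (0.40, 0.80):
    print(f"{utilisation:.0%} sustained -> {cheaper_mode(utilisation, breakeven)}")
```

Sustained utilisation is the input that matters. A workload that peaks at 90% but averages 40% sits on the per-token side of the line.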

8) The multi-call pattern: why per-turn cost matters more than per-call

Agentic workloads do not make one call per user turn. They make many.

USER TURN
   │
   ├─ Planner call           (frontier, reasoning enabled)
   │   $0.08 per call
   ├─ Tool call 1            (mid)
   │   $0.012 per call
   ├─ Tool call 2            (mid)
   │   $0.012 per call
   ├─ Tool call 3            (small)
   │   $0.002 per call
   ├─ Tool call 4            (small)
   │   $0.002 per call
   ├─ Tool call 5            (small)
   │   $0.002 per call
   ├─ Tool call 6            (mid)
   │   $0.012 per call
   └─ Synthesiser            (mid)
       $0.018 per call
       ────────────
       Per-turn cost: $0.140

The per-call costs look small. The per-turn cost is what the user generates. A workload at 100,000 turns per day burns $14,000 per day or $420,000 per month — regardless of how cheap any individual call looks.
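The per-turn sum is simple enough to script. This sketch reuses the per-call figures from the diagram above; the call names are labels for illustration.

```python
# (call name, dollars per call), taken from the diagram above
TURN = [
    ("planner", 0.08),
    ("tool_1", 0.012), ("tool_2", 0.012), ("tool_3", 0.002),
    ("tool_4", 0.002), ("tool_5", 0.002), ("tool_6", 0.012),
    ("synthesiser", 0.018),
]


def per_turn_cost(calls):
    """Sum of every LLM call made for one user turn."""
    return sum(cost for _, cost in calls)


def monthly_cost(calls, turns_per_day, days=30):
    """Per-turn cost scaled to a month of traffic."""
    return per_turn_cost(calls) * turns_per_day * days


def biggest_line(calls):
    """The single call that contributes most to the per-turn cost."""
    return max(calls, key=lambda call: call[1])


name, cost = biggest_line(TURN)
print(f"per turn ${per_turn_cost(TURN):.3f}")
print(f"monthly  ${monthly_cost(TURN, 100_000):,.0f}")
print(f"biggest  {name} at {cost / per_turn_cost(TURN):.0%} of the turn")
```

The biggest line in the sum is the first optimisation target.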

The optimisations that matter at the per-turn level. Caching the planner's static prefix (the section 1 example, where the planner was 70% of the bill). Replacing tool calls whose output is fully deterministic with plain code (some "tool calls" do not need an LLM at all). Reducing the number of sub-calls by giving the planner more decisive prompts. Routing easier turns to a smaller cook (per chapter 04's routing matrix).

The principle — per-call cost is the metric for benchmarks. Per-turn cost is the metric for production.

9) Reading the invoice: where the surprises live

The mature monthly invoice review looks for these patterns.

Cached vs uncached share. If your cached share is below 50% on a workload with stable prefixes, the caching configuration has a bug worth investigating.

Input vs output share. If output is more than 30-40% of input by token count on workloads that should produce terse output, the prompt is allowing verbose output.

Tier mismatch. If the small-cook line is small and the frontier-cook line is large, the routing layer may be over-promoting tickets to the frontier. Pull the traces and check.

Reasoning token share. If you are using a reasoning model and the reasoning-token share is above 70-80% of total output cost, ask whether the reasoning is earning its price on every call or whether you should turn down reasoning_effort on routine turns.

Vision / multimodal lines. If you handle images and the image-token line is larger than expected, the resolution being sent may be higher than the use case needs. Downsample.

Spike weeks. If one week is 30%+ higher than the trailing average on the same workload, investigate. Causes — silent change in user behaviour, a regression that increased prompt length, a bug that re-runs calls.

Most experienced teams do this review monthly. It is one of the highest-leverage hours an applied AI lead spends.
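Four of the six checks above have numeric thresholds and can be scripted. The sketch below is one possible shape, not a standard tool. The `MonthlyUsage` fields are hypothetical and would need mapping to whatever the supplier's usage export provides. Tier mismatch and multimodal lines are left to trace review because the text gives no numeric threshold for them.

```python
from dataclasses import dataclass


@dataclass
class MonthlyUsage:
    cached_input_tokens: int
    uncached_input_tokens: int
    output_tokens: int        # user-visible output only
    reasoning_tokens: int     # billed at the output rate
    weekly_costs: list        # oldest first, latest week last


def review(u, has_stable_prefix=True, expects_terse_output=True):
    """Return a list of flags using the thresholds quoted in this section."""
    flags = []
    total_input = u.cached_input_tokens + u.uncached_input_tokens
    if has_stable_prefix and total_input and u.cached_input_tokens / total_input < 0.50:
        flags.append("cached share below 50%")
    if expects_terse_output and total_input and u.output_tokens / total_input > 0.40:
        flags.append("output above 40% of input")
    billed_output = u.output_tokens + u.reasoning_tokens
    if billed_output and u.reasoning_tokens / billed_output > 0.80:
        flags.append("reasoning above 80% of billed output")
    if len(u.weekly_costs) >= 2:
        *history, latest = u.weekly_costs
        if latest > 1.30 * (sum(history) / len(history)):
            flags.append("latest week 30%+ above trailing average")
    return flags


month = MonthlyUsage(20_000_000, 80_000_000, 10_000_000, 60_000_000,
                     [9_000, 9_500, 9_200, 13_000])
for flag in review(month):
    print("FLAG:", flag)
```

A flag is a prompt to pull traces, not a verdict. The thresholds are the rules of thumb from this section and should be tuned per workload.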

10) Failure modes: where pricing assumptions hurt

LEAK                                     FIX
────────────────────────────────────     ──────────────────────────────────
Treating input == output in projections  Apply the 3-5x multiplier; rebuild
                                         the projection

Static prefix at the end of the prompt   Static at front, dynamic at back;
                                         prefix caching activates

Running batch-eligible work on sync API  Identify offline workloads; route
                                         them to the batch API

Ignoring reasoning-token cost            Project on reasoning-enabled cost,
                                         not non-reasoning baseline

Per-call cost as the only metric         Per-turn cost is what production
                                         pays; aggregate up before deciding

Tiered pricing applied retroactively in  Read the contract; tiers usually
the spreadsheet                          apply prospectively at the threshold

PTU for bursty workload                  PTU wins for steady; per-token
                                         wins for bursty

Vision tokens projected at text rates    Use the image-token multiplier;
                                         downsample where possible

Eight common mistakes. The thread — the price page is the start of the arithmetic, not the end. Every other dimension changes the bill.

Where this lives in the wild

Pricing structure is provider-specific but the dimensions are universal.

Anthropic API — explicit prompt caching with cache_control breakpoints; 90% discount on cached tokens at 5-minute default TTL; batch API at 50%; extended thinking burns reasoning tokens billed at output rates.
OpenAI API — implicit prompt caching on prefixes 1024+ tokens; ~50% discount on cached portions; batch API at 50% on 24h; reasoning models (o-series, GPT-5) bill reasoning tokens explicitly.
Google Gemini API — context caching with configurable TTL and per-token cache pricing; batch API at 50%; thinking budget on Gemini 2.5 Pro bills the thinking tokens.
Mistral La Plateforme — direct per-token pricing; competitive on open-weight cooks served by Mistral itself.
DeepSeek API — aggressive per-token pricing on V3 and R1; off-peak discount tiers (DeepSeek's distinctive pricing innovation).
AWS Bedrock — on-demand per-token, batch at 50%, Provisioned Throughput for capacity reservation; multi-vendor billing under one invoice.
Azure OpenAI Service — Provisioned Throughput Units (PTUs) as the enterprise capacity model; per-token Standard tier for variable workloads.
Vertex AI — per-token pricing on Gemini and partner models; Provisioned Throughput available for committed capacity.
Together AI — per-token pricing on open-weight cooks; typically significantly cheaper than closed-weight frontier per token.
Fireworks AI — per-token pricing; speculative-decoding-optimised inference often lowers effective output cost.
Replicate, Modal — per-second or per-request pricing on open-weight cooks, with cold-start considerations.
Groq, SambaNova, Cerebras — per-token at distinctive latency profiles; throughput-optimised.
OpenRouter — aggregated pricing across many suppliers with unified billing; passes through underlying pricing structure.
LiteLLM — router that exposes consistent pricing telemetry across providers; useful for cost normalisation.
Helicone — usage analytics and cost attribution per request, user, or feature; common home for invoice review at scale.
Langfuse — open-source observability with cost-per-trace computation; integrates with the major suppliers' pricing.
LangSmith — cost tracking per run within LangChain workflows; useful for per-feature cost attribution.
Vellum, Braintrust, PromptLayer, Pezzo — prompt and cost analytics platforms with per-prompt cost rollups.
Vercel AI SDK — client-side aggregation of token counts; useful for client-side cost estimation before sending.
Stripe usage-based billing, Metronome, Orb — usage-based billing platforms where many AI products in turn bill their users; the LLM cost flow-through pattern.

Pause and recall

Why is output token pricing typically 3-5x input token pricing?
What architectural rule about prompt structure follows from how prompt caching works?
When does the batch API win, and what is the typical discount?
What four token types beyond standard text have distinct multipliers, and which one most often surprises teams?
Why is per-turn cost a better metric for production than per-call cost?
When does provisioned throughput beat per-token billing?
What six patterns does a mature invoice review look for monthly?

Interview Q&A

Q1. Your invoice shows 70% of cost on the planner in an agent loop. What do you do first? A. Turn on prompt caching for the planner's stable prefix: toolbelt schema, system instructions, conversation history older than the current turn. The savings are typically 50-90% on the cached portion, depending on the provider's cache-read rate, and the cached portion is usually most of the planner's input. Cost of the optimisation is a few days of testing. ROI is immediate. Then look at whether the reasoning effort is justified on every turn or whether routine turns can use a smaller cook. Trap: Jumping to model downgrade before checking caching. The caching win is bigger and lower-risk.

Q2. Explain why input tokens and output tokens are priced differently. A. Input tokens are processed in parallel through one forward pass — the entire input attends to itself simultaneously. Output tokens are autoregressive — each new token requires a forward pass attending to all previous tokens. The compute cost per output token is higher, and the price reflects it. Typical 2026 multiple is 3-5x. The implication for workload economics — long inputs with short outputs (RAG, classification) are cheap; short inputs with long outputs (creative, verbose agents) are expensive. Trap: "Output costs more because it is more important." It is not about importance; it is about compute.

Q3. When does prompt caching not help? A. When the prompt is fully unique on every call. One-shot generation without a stable system prompt, ad-hoc analysis on unique documents with no recurring instruction header, throwaway tooling. Most production workloads do not look like this — most have a stable prefix. But when caching does not help, alternative levers (smaller model, terser prompts, batch API) become the primary saves. Trap: Treating caching as universal. It is the biggest lever for workloads it fits; useless for those it does not.

Q4. Should you use the batch API? A. Yes for any workload where 24-hour latency is acceptable — embedding refreshes, eval runs, data labelling, content moderation backfill, periodic summarisation. The 50% discount applies. No for user-facing realtime workloads. Many teams find 30-40% of their volume is batch-eligible but they run almost everything synchronous. Trap: "Batch is for big jobs." Batch is for offline jobs; the qualifier is patience, not size.

Q5. What is the difference between per-token pricing and provisioned throughput, and when does each win? A. Per-token is variable cost; you pay only for what you use. Provisioned throughput is fixed cost; you pay for capacity whether you use it or not. PTU wins for steady, high-volume workloads at ~60-70%+ sustained utilisation. Per-token wins for bursty or low workloads. Many enterprise teams run a small PTU base plus per-token burst — the PTU covers the steady demand at a lower per-token effective rate, and per-token absorbs the spikes without paying for oversized capacity. Trap: "PTU is cheaper at scale." It depends on utilisation profile, not just volume.

Q6. What is the reasoning-token gotcha and how do you handle it? A. Reasoning models (Opus 4.7 extended thinking, GPT-5 with reasoning_effort, Gemini 2.5 Pro with thinking budget) generate internal reasoning tokens before the user-visible output. These are billed at output rates and can be 10-30x larger than the visible output. A workload that costs $X on a non-reasoning model can cost $5X-$15X on a reasoning model. Handle by — projecting cost on reasoning-enabled rates, tuning reasoning_effort per turn (high for hard turns, low for routine), and including reasoning-token share in the monthly invoice review. Trap: Projecting reasoning-model cost on non-reasoning rates.

Q7. What goes wrong with putting the user's question first and the system prompt second? A. Prompt caching keys on stable prefixes. If the user's question (unique per call) comes first and the system prompt (stable across calls) comes second, the cache never matches because the prefix is not stable. The fix is purely architectural — put static content (system prompt, toolbelt schema, document context) at the front, dynamic content (the current user input) at the back. Same content, different order, dramatically different bill. Trap: Designing for readability of the prompt rather than for the cache's prefix-matching behaviour.

Q8. You see a 30% cost spike compared to the trailing average on the same workload. Walk me through your investigation. A. Pull the invoice breakdown by line item. Compare each line to its trailing average. The line that grew tells you what changed. Cached share dropped? Caching configuration regression. Output share grew? Prompt change increasing verbosity, or a downstream loop calling more often than expected. Reasoning share grew? reasoning_effort change or a routing change that promoted more turns to the reasoning model. Then pull traces from the spike days to confirm root cause. Then fix the underlying change. Trap: "Just bump the budget." Cost spikes have causes; finding the cause is the work.

Apply now (5 min)

Step 1 — find your cached share. Pull last month's invoice from your primary supplier. Find the cached vs uncached split. If cached share is below 50% and you have stable prefixes, the next week's project is caching configuration.

Step 2 — identify batch-eligible workloads. List your distinct LLM call patterns. For each, mark whether 24-hour latency would break the use case. The ones that pass become candidates for the batch API.

Step 3 — compute per-turn cost. For your main agent workload, list every LLM call that happens per user turn. Multiply each by its per-call cost. Sum. The total is the per-turn cost. The biggest line in that sum is your highest-leverage optimisation target.

Bridge. Twelve chapters of model selection and vendor strategy have given you the matching habit, the bake-off, the routing matrix, the dual-sourcing chain, the upgrade playbook, the TCO math, and the invoice anatomy. The final chapter covers what model selection still cannot answer cleanly: the honest list of open problems that mature applied AI leads sound calm about because they have internalised the limits.

→ 13-honest-admission.md