07. Cost attribution and budgets

Credentials authorise the call. Cost decides whether the call should be made at all. The gateway turns provider invoices, which arrive monthly and in the provider's own dimensions, into per-tenant, per-feature, per-agent visibility in close to real time. Budgets are the enforcement layer.


A finance partner at a Mumbai consumer-tech startup walks into engineering's weekly review with a question: "Why did our Anthropic bill jump 40% from last month?" The engineering team has no immediate answer. The provider's bill shows total token consumption broken down by model and account. There is no per-product breakdown. There is no per-tenant breakdown. The team spends three days reconstructing usage by cross-referencing application logs, deploy timelines, and incident reports. The conclusion is that a single product team rolled out an experimental feature that issued an unbounded chain of tool calls per user interaction. The investigation could have taken three minutes if the gateway's audit log had carried a feature_id and a cost per call. The team adds both fields and a daily dashboard. The next bill jump, two months later, is investigated and root-caused in twenty minutes.

This chapter lays out that discipline: per-call cost, per-call attribution, and per-period budgets enforced in real time. The provider's invoice is a reconciliation artefact; the gateway is the source of truth.


The price book

The price book is the gateway's table of unit costs. Every cost calculation reads from it.

price_book:
  version: "2026-05-25"
  prices:
    "anthropic:claude-sonnet-4-6":
      input_per_1m_tokens_usd:  3.00
      output_per_1m_tokens_usd: 15.00
      cache_write_per_1m_tokens_usd: 3.75
      cache_read_per_1m_tokens_usd: 0.30
    "anthropic:claude-haiku-4-5":
      input_per_1m_tokens_usd:  0.80
      output_per_1m_tokens_usd: 4.00
    "openai:gpt-4o":
      input_per_1m_tokens_usd:  2.50
      output_per_1m_tokens_usd: 10.00
    "anthropic:claude-opus-4-7":
      input_per_1m_tokens_usd:  15.00
      output_per_1m_tokens_usd: 75.00

Three rules:

  • Versioned. Every change is a new version with an effective date. Historical cost computations use the price book valid at the call's timestamp.
  • Includes cache pricing. Cache reads and writes are billed at different rates from fresh input tokens; the price book carries a separate unit price for each.
  • Single source. Engineering and finance both read from this; reconciliation against the provider invoice happens at the price-book level.

The price book is updated as providers update their prices. Out-of-date prices produce attribution that the finance team cannot reconcile.
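The "versioned" rule can be sketched as a lookup by call timestamp. This is a minimal illustration, not the gateway's actual storage: PRICE_BOOK_HISTORY and price_book_at are hypothetical names, and the prices in the older "2026-01-10" version are invented purely to show two versions side by side.

```python
from bisect import bisect_right
from datetime import datetime, timezone

# Hypothetical history, sorted by effective date: (effective_from, version, prices).
# The "2026-01-10" prices are invented for illustration only.
PRICE_BOOK_HISTORY = [
    (datetime(2026, 1, 10, tzinfo=timezone.utc), "2026-01-10",
     {"anthropic:claude-sonnet-4-6": {"input_per_1m_tokens_usd": 3.50,
                                      "output_per_1m_tokens_usd": 16.00}}),
    (datetime(2026, 5, 25, tzinfo=timezone.utc), "2026-05-25",
     {"anthropic:claude-sonnet-4-6": {"input_per_1m_tokens_usd": 3.00,
                                      "output_per_1m_tokens_usd": 15.00}}),
]


def price_book_at(ts: datetime):
    """Return (version, prices) for the price book in effect at ts."""
    effective_dates = [entry[0] for entry in PRICE_BOOK_HISTORY]
    # Index of the last version whose effective date is <= ts.
    idx = bisect_right(effective_dates, ts) - 1
    if idx < 0:
        raise LookupError(f"no price book in effect at {ts.isoformat()}")
    _, version, prices = PRICE_BOOK_HISTORY[idx]
    return version, prices


version, _ = price_book_at(datetime(2026, 3, 1, tzinfo=timezone.utc))
print(version)  # 2026-01-10
```

A call made in March is priced with the January book even after the May version ships, which is what keeps historical cost figures stable.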


Per-call cost computation

On every call, after the provider response is received and transformed (chapter 02), the gateway computes:

cost_usd = (
    (input_tokens  / 1_000_000) * price_book.input_per_1m_tokens_usd  +
    (output_tokens / 1_000_000) * price_book.output_per_1m_tokens_usd +
    (cache_write_tokens / 1_000_000) * price_book.cache_write_per_1m_tokens_usd +
    (cache_read_tokens  / 1_000_000) * price_book.cache_read_per_1m_tokens_usd
)

The cost is stamped on the audit record alongside the call. The audit also carries the price-book version used, so changes do not retroactively alter historical numbers.
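As a runnable sketch of the formula above, the function below applies the four terms to one call. compute_cost_usd is a hypothetical name; the prices are the claude-sonnet-4-6 entry from the price book, the token counts are invented, and the sketch assumes cache tokens are reported separately from input_tokens, as the formula implies.

```python
# Prices copied from the chapter's price book entry for claude-sonnet-4-6.
SONNET_PRICES = {
    "input_per_1m_tokens_usd": 3.00,
    "output_per_1m_tokens_usd": 15.00,
    "cache_write_per_1m_tokens_usd": 3.75,
    "cache_read_per_1m_tokens_usd": 0.30,
}

# (usage field, price-book field) pairs, one per term of the formula.
COST_TERMS = [
    ("input_tokens", "input_per_1m_tokens_usd"),
    ("output_tokens", "output_per_1m_tokens_usd"),
    ("cache_write_tokens", "cache_write_per_1m_tokens_usd"),
    ("cache_read_tokens", "cache_read_per_1m_tokens_usd"),
]


def compute_cost_usd(usage: dict, prices: dict) -> float:
    """Apply the four-term formula; token counts absent from usage count as 0."""
    total = 0.0
    for token_field, price_field in COST_TERMS:
        tokens = usage.get(token_field, 0)
        if tokens:
            # A KeyError here means the call has usage the price book cannot price.
            total += tokens / 1_000_000 * prices[price_field]
    return round(total, 8)


usage = {"input_tokens": 2000, "output_tokens": 500,
         "cache_read_tokens": 1000, "cache_write_tokens": 0}
print(compute_cost_usd(usage, SONNET_PRICES))  # 0.0138
```

The arithmetic for this call is 0.006 (input) + 0.0075 (output) + 0.0003 (cache read) = 0.0138 USD.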


What the audit carries for attribution

The per-call record (chapter 11 details) includes the attribution dimensions:

{
  "audit_id": "aud_...",
  "ts": "2026-05-25T11:14:02Z",
  "tenant_id": "acme-corp",
  "feature_id": "summary-card",
  "caller_identity": "service.support-agent.production",
  "model_alias": "smart-reasoner",
  "model_used": {
    "provider": "anthropic",
    "model_version": "claude-sonnet-4-6"
  },
  "usage": {
    "input_tokens": 1200,
    "output_tokens": 312,
    "cache_read_tokens": 800,
    "cache_write_tokens": 0
  },
  "cost_usd": 0.00852,
  "price_book_version": "2026-05-25"
}

Five attribution dimensions are first-class:

Dimension         What it tells you
tenant_id         Which customer paid (or is paying) for this call
feature_id        Which product feature consumed it
caller_identity   Which service or agent made the call
model_alias       What capability was requested
model_used        What concrete model served it

Every cost dashboard groups by one or more of these.
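Grouping by these dimensions can be sketched as a small in-memory roll-up. A production pipeline would run this in a stream processor or warehouse; spend_by and the sample records are hypothetical, and the costs are invented round figures.

```python
from collections import defaultdict


def _hashable(value):
    # model_used is a nested object; flatten it to "provider:model_version".
    if isinstance(value, dict):
        return f'{value["provider"]}:{value["model_version"]}'
    return value


def spend_by(records, *dimensions):
    """Sum cost_usd over audit records, grouped by the given attribution fields."""
    totals = defaultdict(float)
    for record in records:
        key = tuple(_hashable(record[d]) for d in dimensions)
        totals[key] += record["cost_usd"]
    return dict(totals)


# Invented sample records carrying only the fields the roll-up needs.
SONNET = {"provider": "anthropic", "model_version": "claude-sonnet-4-6"}
records = [
    {"tenant_id": "acme-corp", "feature_id": "summary-card", "model_used": SONNET, "cost_usd": 0.25},
    {"tenant_id": "acme-corp", "feature_id": "chat-agent", "model_used": SONNET, "cost_usd": 0.5},
    {"tenant_id": "acme-corp", "feature_id": "summary-card", "model_used": SONNET, "cost_usd": 0.25},
]

print(spend_by(records, "tenant_id", "feature_id"))
```

The same function answers "spend per tenant", "spend per feature per model", and so on, just by changing the dimension list.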


Reporting cadences

The audit feeds three cadences:

Real-time (under a minute). A streaming aggregation pipeline rolls calls up by tenant and feature into current-period spending. Used for budget enforcement and on-call dashboards.

Daily. A batch job summarises the prior 24 hours by every dimension. Powers finance dashboards and tenant invoicing.

Monthly. A reconciliation against the provider's invoice. Per-provider, per-model token totals from the audit are compared to the invoice's figures. Discrepancies above a small tolerance (say, 1%) trigger investigation — usually a price-book lag or a missed audit emission.

The monthly reconciliation is the integrity check on the rest of the system. If audit-derived costs match provider invoices within tolerance every month, the rest of the cost dashboards can be trusted.
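The monthly comparison can be sketched as below. reconcile is a hypothetical helper, the token totals are invented, and measuring the gap against the larger of the two figures is an assumption of this sketch (it avoids dividing by zero when one side has no entry for a model).

```python
def reconcile(audit_totals, invoice_totals, tolerance=0.01):
    """Return {(provider, model): relative_gap} for totals that differ by more than tolerance."""
    discrepancies = {}
    for key in sorted(set(audit_totals) | set(invoice_totals)):
        audited = audit_totals.get(key, 0)
        invoiced = invoice_totals.get(key, 0)
        baseline = max(audited, invoiced)
        if baseline == 0:
            continue  # nothing recorded on either side
        relative_gap = abs(audited - invoiced) / baseline
        if relative_gap > tolerance:
            discrepancies[key] = round(relative_gap, 4)
    return discrepancies


# Invented monthly token totals per (provider, model).
audit = {("anthropic", "claude-sonnet-4-6"): 995_000, ("openai", "gpt-4o"): 900_000}
invoice = {("anthropic", "claude-sonnet-4-6"): 1_000_000, ("openai", "gpt-4o"): 1_000_000}

print(reconcile(audit, invoice))  # {('openai', 'gpt-4o'): 0.1}
```

Here the 0.5% gap on one model passes, while the 10% gap on the other is flagged for investigation.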


Budgets

Budgets are caps. They are enforced before a call is made.

A budget hierarchy mirrors the attribution dimensions:

budgets:
  tenants:
    acme-corp:
      monthly_usd: 50000
      hard_cap: true
      features:
        summary-card:
          monthly_usd: 8000
          on_breach: degrade_to_fast_summariser
        chat-agent:
          monthly_usd: 30000
          on_breach: refuse
        indexing:
          monthly_usd: 10000
          on_breach: queue_for_overnight

On each call:

  1. Compute the call's estimated cost from the price book and the request size.
  2. Check the tenant's remaining monthly budget; if insufficient, apply on_breach policy.
  3. Check the feature's remaining budget; if insufficient, apply the feature's on_breach policy.
  4. Proceed.

The on_breach policy decides what happens when a budget is exhausted:

  • refuse — return BUDGET_EXCEEDED with retry_after = <start of next period>
  • degrade_to_<alias> — re-route to a cheaper alias for the rest of the period
  • queue_for_overnight — buffer the call for batch processing when budget refreshes (only for batch-tolerant workloads)
  • notify_only — emit an alarm but proceed; for soft budgets used for visibility

Hard caps (hard_cap: true) cannot be exceeded; finance has audited the limit. Soft caps are warnings; ops dashboards surface them but calls proceed.
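The per-call check can be sketched as follows. Budget and check_budgets are hypothetical names, the spent figures are invented, and treating the tenant's hard cap as a refuse policy is an assumption of this sketch, since the configuration above gives the tenant level no explicit on_breach.

```python
from dataclasses import dataclass


@dataclass
class Budget:
    monthly_usd: float
    spent_usd: float = 0.0
    on_breach: str = "refuse"  # assumption: a hard cap with no explicit policy refuses

    def remaining(self) -> float:
        return self.monthly_usd - self.spent_usd


def check_budgets(estimated_cost_usd: float, tenant: Budget, feature: Budget) -> str:
    """Return 'proceed', or the on_breach policy of the first budget the call would exceed."""
    # Tenant is checked before feature, matching the order of the steps above.
    for budget in (tenant, feature):
        if estimated_cost_usd > budget.remaining():
            return budget.on_breach
    return "proceed"


tenant = Budget(monthly_usd=50_000, spent_usd=10_000)
summary_card = Budget(monthly_usd=8_000, spent_usd=7_999.99,
                      on_breach="degrade_to_fast_summariser")

print(check_budgets(0.05, tenant, summary_card))   # degrade_to_fast_summariser
print(check_budgets(0.005, tenant, summary_card))  # proceed
```

The caller then dispatches on the returned string: refuse, re-route, queue, or notify and continue.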


Estimation before the call

Budget enforcement requires estimating cost before the call returns. The price book gives the unit prices; the call's input tokens are knowable (by counting); the output tokens are the unknown.

Practical estimation: assume max_output_tokens worth of output, since that is the worst case the caller has authorised. The estimate is an upper bound — calls that complete with fewer output tokens "refund" the difference back to the budget.

Some platforms use a learned estimator (average output for this alias on this feature is N tokens) for tighter estimates. For most platforms the worst-case estimate is sufficient and avoids the failure mode where a low estimate slips a call past the budget.
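The worst-case estimate and the "refund" can be sketched as a reserve-then-settle ledger. BudgetLedger is a hypothetical class; the sketch prices all input at the fresh-input rate and ignores cache pricing, which a real estimator would need to handle.

```python
class BudgetLedger:
    """Tracks committed spend plus reservations for in-flight calls."""

    def __init__(self, limit_usd: float):
        self.limit_usd = limit_usd
        self.committed_usd = 0.0
        self.reserved_usd = 0.0

    def reserve(self, input_tokens: int, max_output_tokens: int, prices: dict):
        """Reserve the worst-case cost; return it, or None if it does not fit."""
        estimate = (
            input_tokens / 1_000_000 * prices["input_per_1m_tokens_usd"]
            + max_output_tokens / 1_000_000 * prices["output_per_1m_tokens_usd"]
        )
        if self.committed_usd + self.reserved_usd + estimate > self.limit_usd:
            return None
        self.reserved_usd += estimate
        return estimate

    def settle(self, estimate: float, actual_cost_usd: float) -> None:
        """Release the reservation and commit only what the call actually cost."""
        self.reserved_usd -= estimate
        self.committed_usd += actual_cost_usd


prices = {"input_per_1m_tokens_usd": 3.00, "output_per_1m_tokens_usd": 15.00}
ledger = BudgetLedger(limit_usd=1.00)
estimate = ledger.reserve(input_tokens=1200, max_output_tokens=4096, prices=prices)
# The call finishes with 312 output tokens: 0.0036 + 0.00468 = 0.00828 USD.
ledger.settle(estimate, actual_cost_usd=0.00828)
print(round(estimate, 5), round(ledger.committed_usd, 5))  # 0.06504 0.00828
```

Between reserve and settle the full 0.06504 USD counts against the budget; afterwards only the 0.00828 USD actually spent remains, and the difference is available to later calls.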


What to do when budgets are exhausted

Three responses, selected by on_breach (the fourth policy, notify_only, lets the call proceed):

Refuse cleanly. Return BUDGET_EXCEEDED with retry_after_ms. Caller informs the user that the feature is unavailable until the budget refreshes. Most production policies use this for hard caps.

Degrade. Re-route to a cheaper alias for the remainder of the period. The product surface continues to work but with reduced capability. Useful for tenant-facing features where availability matters more than capability.

Queue for the next period. Hold the call until budget refreshes. Only sensible for batch workloads where latency in days is acceptable.

The BUDGET_EXCEEDED error follows the structured-error pattern of module 19 chapter 05:

{
  "ok": false,
  "error": {
    "code": "BUDGET_EXCEEDED",
    "retriable": true,
    "retry_after_ms": 86400000,
    "human_hint": "Your AI assistant has reached its monthly usage limit. Contact your admin to increase the limit.",
    "model_action": "Surface the message; do not retry until retry_after.",
    "fields": {
      "budget_scope": "tenant=acme-corp,feature=summary-card",
      "period_start": "2026-05-01",
      "period_end": "2026-06-01"
    }
  }
}
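One way to fill retry_after_ms is to compute the time remaining until the budget period ends. The sketch below is an assumption about how the field could be derived, not a documented gateway behaviour; budget_exceeded_error is a hypothetical helper, and human_hint and model_action are omitted for brevity.

```python
from datetime import datetime, timedelta, timezone


def budget_exceeded_error(scope: str, period_start: datetime,
                          period_end: datetime, now: datetime) -> dict:
    """Build a BUDGET_EXCEEDED payload whose retry_after_ms runs to the period end."""
    # Integer milliseconds until the next period starts, never negative.
    retry_after_ms = max(0, (period_end - now) // timedelta(milliseconds=1))
    return {
        "ok": False,
        "error": {
            "code": "BUDGET_EXCEEDED",
            "retriable": True,
            "retry_after_ms": retry_after_ms,
            "fields": {
                "budget_scope": scope,
                "period_start": period_start.date().isoformat(),
                "period_end": period_end.date().isoformat(),
            },
        },
    }


error = budget_exceeded_error(
    scope="tenant=acme-corp,feature=summary-card",
    period_start=datetime(2026, 5, 1, tzinfo=timezone.utc),
    period_end=datetime(2026, 6, 1, tzinfo=timezone.utc),
    now=datetime(2026, 5, 31, tzinfo=timezone.utc),
)
print(error["error"]["retry_after_ms"])  # 86400000
```

With one day left in the period, the helper yields the same 86400000 ms shown in the example payload.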

Anomaly detection

Budgets enforce known limits. Anomalies catch the unknown — a sudden spending spike that is technically within budget but not normal.

Useful signals:

  • Spend per hour for a tenant deviates by >3σ from the trailing 30-day baseline → page
  • Average cost per call for a feature doubles overnight → investigate
  • Token count per call for an alias jumps → likely prompt change or context drift
  • Spending share of a tenant suddenly dominates → investigate

The alarms are not just for finance. They catch product regressions (a prompt update that inflates context length, a runaway agent loop that fails to terminate, a misconfigured retry that hits the cost ceiling repeatedly).
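The first signal, a deviation of more than 3σ from a trailing baseline, can be sketched with the standard library. is_spend_anomaly is a hypothetical name and the hourly figures are invented; a real detector would also account for daily and weekly seasonality.

```python
from statistics import mean, stdev


def is_spend_anomaly(baseline_hourly_usd, current_hour_usd, sigmas=3.0):
    """True when the current hour is more than `sigmas` standard deviations from the baseline mean."""
    if len(baseline_hourly_usd) < 2:
        return False  # not enough history to estimate a deviation
    mu = mean(baseline_hourly_usd)
    sigma = stdev(baseline_hourly_usd)
    if sigma == 0:
        # A perfectly flat baseline: any change at all is a deviation.
        return current_hour_usd != mu
    return abs(current_hour_usd - mu) > sigmas * sigma


# Invented hourly spend for one tenant (mean 10.5 USD, sample stdev about 1.2 USD).
baseline = [10.0, 12.0, 11.0, 9.0, 10.0, 11.0, 12.0, 9.0]

print(is_spend_anomaly(baseline, 40.0))  # True
print(is_spend_anomaly(baseline, 12.5))  # False
```

A 40 USD hour against this baseline would page; a 12.5 USD hour sits inside the normal band and would not.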


How cost interacts with the other surfaces

  • Routing (chapter 03): cost_ceiling_usd is a routing filter; the routing plane refuses candidates exceeding the ceiling.
  • Fallback (chapter 04): each fallback step checks against the call's remaining budget; cheap fallbacks are preferred when budget is tight.
  • Quota (chapter 05): budget exhaustion and quota exhaustion are distinct refusals (BUDGET_EXCEEDED vs RATE_LIMIT_EXCEEDED).
  • Cache (chapter 08): cache hits cost a small fraction of fresh calls; the price book captures this and the cost dashboards reflect savings.
  • Audit (chapter 11): cost is a first-class audit field.

How to recognise broken cost discipline in the wild

  • "Why did the bill jump?" requires more than ten minutes to answer
  • Per-tenant cost cannot be reported
  • Per-feature cost is approximate, derived from heuristics
  • Budgets exist on paper but are not enforced
  • A runaway feature can spend unbounded amounts in a day
  • The price book is out of date and reconciliation is loose

Interview Q&A

Q1. Why is per-call cost stamped at call time, not derived from the provider invoice?
Because the invoice arrives monthly and lacks the dimensions you need: tenant, feature, agent, caller. Real-time stamping using the price book gives every audit record an attributable cost in the same dimensions as the rest of the audit. The invoice becomes the integrity check (monthly reconciliation) rather than the source of truth. Without per-call cost, the chapter-opening question ("why did the bill jump?") takes days of log reconstruction to answer instead of minutes.
Wrong-answer notes: "we reconcile from the invoice" misses real-time enforcement and dimensional attribution.

Q2. The estimated cost (using max_output_tokens) overestimates; tenants see budgets exhausted earlier than they should. What do you do?
Two options, picked by policy. One: live with the conservative estimate; refunds happen at call completion when actual usage is known, restoring budget headroom for subsequent calls. Two: use a learned estimator that predicts output tokens per (alias, feature) from history, giving tighter estimates. The first is simpler and pessimistic; the second is more aggressive and risks slipping calls past hard caps. Most platforms start with the first and adopt the second only when the cost of over-estimation becomes material.
Wrong-answer notes: "always use exact cost" is impossible without completing the call; the estimation is fundamentally about pre-call enforcement.

Q3. A tenant's summary-card feature is over budget; the on_breach policy is degrade_to_fast_summariser. The tenant complains that quality dropped. How do you respond?
The product team selected that policy, and the trade is documented: when the budget is reached, the feature switches to a cheaper model rather than going dark. The tenant has two options: pay more (raise the budget) or accept degraded service. The degrade is visible in the response (degraded: true, chapter 04), and the product can choose to surface a notice ("you have reached your usage limit; we are using a lighter model for the rest of the month"). The policy is the contract; the cost discipline pre-empted the surprise.
Wrong-answer notes: "we'll raise the budget silently" undermines the discipline.

Q4. How does monthly reconciliation work, and what does a failed reconciliation tell you?
At month end, the gateway's audit aggregates input and output tokens per provider and per model. The aggregate is compared to the provider's invoice. A match within ~1% is normal (provider rounding, in-flight call boundary effects). A larger discrepancy indicates one of: a price book lagging an actual price change; missed audit emissions (a code path bypassed the gateway); cache token accounting drift; provider re-classification of usage. Each is investigated. The reconciliation is the integrity check on every other cost claim: if it passes monthly, the dashboards are trustworthy.
Wrong-answer notes: skipping reconciliation produces dashboards that drift silently.


What to do differently after reading this

  • Stand up the price book as a versioned, signed artefact. Engineering and finance both read from it.
  • Stamp cost_usd and price_book_version on every audit record.
  • Build the real-time aggregation pipeline. Tenant and feature dashboards should be near-real-time.
  • Enforce budgets with explicit on_breach policies. No "soft" hard caps; the policy is the contract.
  • Schedule monthly reconciliation. Investigate any failure beyond tolerance.
  • Wire anomaly alarms for per-tenant and per-feature spend deviations.

Bridge. Cost is one optimisation surface. Caching is another — and the two are tightly linked, because every cache hit is a cost not paid. The next chapter builds the caching layer: exact-match cache, semantic cache, eligibility rules, invalidation, and the economics of hit rates. → 08-prompt-and-response-caching.md