09. Cost and Latency Management — Token Budgets, Caching, Model Routing¶

~11 min read. A system that works but costs ten times too much or responds too slowly will not survive in production.

Builds on the ELI5 in 00-eli5.md. The blueprint set the cost and latency constraints. The foundation is now under financial pressure. This file teaches you to engineer within those constraints, not hope your way through them.

Picture the cost first, then optimise¶

See. Most people think about cost after the system is built. That is too late. Draw your cost model before you write code.

Cost Components in a RAG Pipeline:
───────────────────────────────────────────────────────
Component           Unit cost          Volume/day
──────────────────────────────────────────────────────
Embedding (query)   $0.00002 / query   5 000 queries
Embedding (ingest)  $0.0001 / chunk    600 chunks/day
LLM input tokens    $0.15 / 1M         5 000 × 1 250 = 6.25M tokens
LLM output tokens   $0.60 / 1M         5 000 × 500 = 2.5M tokens
Vector DB           $70 / month flat

Daily LLM cost:    (6.25 × 0.15) + (2.5 × 0.60) = $0.9375 + $1.50 = $2.44
Monthly LLM cost:  $2.44 × 30 = $73.20
Monthly embedding: ($0.10 query + $0.06 ingest) × 30 = $4.80
Monthly total:     $73.20 + $70 (DB) + ~$5 (embedding) = ~$148 / month

Simple, no? This cost model takes 20 minutes to build. Build it before you start. Update it as your design changes. The blueprint said cost ≤ $0.002 per call. Check: $148 / month ÷ 150 000 calls/month ≈ $0.001 per call. Under budget. Good.
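
The same cost model can be kept as a few lines of code so it is easy to update when the design changes. This is a minimal sketch: the class name `CostModel` and its field names are illustrative assumptions, and the default values are the figures from the table above. It does not round intermediate values, so its monthly total can differ from the hand calculation by a dollar or so.

```python
from dataclasses import dataclass


@dataclass
class CostModel:
    """Hypothetical cost model; defaults mirror the table above."""
    queries_per_day: int = 5_000
    input_tokens_per_query: int = 1_250
    output_tokens_per_query: int = 500
    input_price_per_million: float = 0.15    # USD per 1M input tokens
    output_price_per_million: float = 0.60   # USD per 1M output tokens
    query_embed_price: float = 0.00002       # USD per query embedding
    chunks_ingested_per_day: int = 600
    ingest_embed_price: float = 0.0001       # USD per ingested chunk
    vector_db_monthly: float = 70.0          # USD flat fee
    days_per_month: int = 30

    def daily_llm_cost(self) -> float:
        input_millions = self.queries_per_day * self.input_tokens_per_query / 1e6
        output_millions = self.queries_per_day * self.output_tokens_per_query / 1e6
        return (input_millions * self.input_price_per_million
                + output_millions * self.output_price_per_million)

    def daily_embedding_cost(self) -> float:
        return (self.queries_per_day * self.query_embed_price
                + self.chunks_ingested_per_day * self.ingest_embed_price)

    def monthly_total(self) -> float:
        daily = self.daily_llm_cost() + self.daily_embedding_cost()
        return daily * self.days_per_month + self.vector_db_monthly

    def cost_per_call(self) -> float:
        return self.monthly_total() / (self.queries_per_day * self.days_per_month)


if __name__ == "__main__":
    model = CostModel()
    print(f"Monthly total: ${model.monthly_total():.2f}")
    print(f"Cost per call: ${model.cost_per_call():.5f}")
    print("Under budget" if model.cost_per_call() <= 0.002 else "Over budget")
```

Change one field (say, double `input_tokens_per_query`) and rerun to see how a design decision moves the per-call cost against the blueprint budget.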

Token budget management¶

Token budget = the total token limit per request. This is the most important cost control lever.

Budget allocation (4 096-token model):
───────────────────────────────────────────
System prompt:         300 tokens  (fixed)
Context block:         800 tokens  (controlled)
User query:             50 tokens  (variable, capped)
Response:              500 tokens  (controlled with max_tokens)
──────────────────────────────────────────
Input total:         1 150 tokens
Output max:            500 tokens
Grand total:         1 650 tokens  ← well under 4 096 limit

Why leave headroom? Because queries can be longer than expected. Because retrieved chunks can have metadata that adds tokens. Because you will add few-shot examples later.

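Set max_tokens explicitly in every API call. Never let the model decide its own output length. A verbose model can overrun your response budget, and because output tokens cost four times as much as input tokens in the cost model above ($0.60 vs $0.15 per 1M), long answers inflate the bill quickly.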
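
The budget table can be enforced in code rather than by convention. The sketch below is illustrative only: `rough_token_count`, `fit_context` and `check_budget` are hypothetical helper names, and the word-based token count is a crude assumption that should be swapped for the model's real tokenizer. The `response` entry is the value you would pass as max_tokens.

```python
from typing import Callable, Dict, List

# Budget from the allocation table above (tokens).
BUDGET: Dict[str, int] = {"system": 300, "context": 800, "query": 50, "response": 500}
MODEL_LIMIT = 4096


def rough_token_count(text: str) -> int:
    """Assumption: one token per whitespace-separated word. Not a real tokenizer."""
    return len(text.split())


def check_budget(budget: Dict[str, int], model_limit: int) -> int:
    """Return the remaining headroom, or raise if the budget exceeds the limit."""
    total = sum(budget.values())
    if total > model_limit:
        raise ValueError(f"budget {total} exceeds model limit {model_limit}")
    return model_limit - total


def fit_context(
    chunks: List[str],
    context_budget: int,
    count: Callable[[str], int] = rough_token_count,
) -> List[str]:
    """Keep chunks (assumed sorted best-first) until the context budget is used up."""
    kept: List[str] = []
    used = 0
    for chunk in chunks:
        cost = count(chunk)
        if used + cost > context_budget:
            break
        kept.append(chunk)
        used += cost
    return kept


if __name__ == "__main__":
    print("Headroom:", check_budget(BUDGET, MODEL_LIMIT), "tokens")
    print(fit_context(["a b c", "d e", "f g h i"], context_budget=5))
```

Stopping at the first chunk that does not fit keeps the ranking order intact; skipping it and trying smaller chunks further down is an alternative design if recall matters more than order.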

Caching: the most underused cost control¶

Caching is the closest thing to free cost reduction. If the same query is asked twice, you pay for the LLM call once.

Without cache:           10 000 queries (1 000 distinct, each asked ~10 times)
                         10 000 × $0.00049 = $4.90

With exact-match cache:  same 10 000 queries
                         1 000 first-time queries miss → pay 1 000 × $0.00049 = $0.49
                         Remaining 9 000 are cache hits → pay $0.00
                         Total: $0.49  (90% savings)
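
An exact-match cache is small enough to sketch in full. The version below is an in-memory stand-in, not a Redis client: `ExactMatchCache`, `get_or_call` and the `llm` callable are hypothetical names, and the key is a SHA-256 hash of the prompt and query. In production the dictionary would typically be replaced by a shared store with a TTL.

```python
import hashlib
import time
from typing import Callable, Dict, Tuple


class ExactMatchCache:
    """In-memory exact-match cache with a TTL (illustrative stand-in for a shared store)."""

    def __init__(
        self,
        ttl_seconds: float = 24 * 3600,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._ttl = ttl_seconds
        self._clock = clock
        self._store: Dict[str, Tuple[float, str]] = {}  # key -> (stored_at, response)
        self.hits = 0
        self.misses = 0

    @staticmethod
    def _key(prompt: str, query: str) -> str:
        # A NUL separator avoids collisions such as ("ab", "c") vs ("a", "bc").
        return hashlib.sha256(f"{prompt}\x00{query}".encode("utf-8")).hexdigest()

    def get_or_call(self, prompt: str, query: str, llm: Callable[[str, str], str]) -> str:
        """Return a cached response if fresh; otherwise call the LLM and cache the result."""
        key = self._key(prompt, query)
        now = self._clock()
        entry = self._store.get(key)
        if entry is not None and now - entry[0] < self._ttl:
            self.hits += 1
            return entry[1]
        self.misses += 1
        response = llm(prompt, query)
        self._store[key] = (now, response)
        return response


if __name__ == "__main__":
    cache = ExactMatchCache()
    fake_llm = lambda prompt, query: f"answer to {query}"  # placeholder for a real call
    for q in ["reset password", "pricing", "reset password"]:
        cache.get_or_call("You are a helpful assistant.", q, fake_llm)
    print(f"hits={cache.hits} misses={cache.misses}")
```

Tracking `hits` and `misses` from day one gives you the measured hit rate that the savings calculation depends on.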

Types of caching:

Exact-match cache: Cache (prompt + query) → response. Best for high-frequency identical queries (FAQs, fixed prompts). Implementation: Redis with TTL of 24 hours.

Semantic cache: Cache query embeddings → response. If the new query embedding is within cosine distance 0.05 of a cached query, return cached response. Best for paraphrased versions of the same question. Implementation: GPTCache, Momento Vector Index.

Partial cache: Cache retrieved chunks for frequent topics. Embed the query, retrieve chunks, then cache the (embedding, chunk_list) pair keyed on the normalised query text. A repeat query then skips both the embedding call and the vector search; only the LLM call remains.
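
The semantic cache described above reduces to a nearest-neighbour check against a distance threshold. Here is a minimal sketch under stated assumptions: `SemanticCache` is a hypothetical class (it does not use GPTCache or any vendor API), embeddings are supplied by the caller, and lookup is a linear scan that would need a vector index at real scale.

```python
import math
from typing import List, Optional, Sequence, Tuple


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """1 - cosine similarity; returns 1.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    if norm == 0.0:
        return 1.0
    return 1.0 - dot / norm


class SemanticCache:
    """Returns a cached response when a stored query embedding is close enough."""

    def __init__(self, max_distance: float = 0.05) -> None:
        self._max_distance = max_distance
        self._entries: List[Tuple[List[float], str]] = []

    def store(self, embedding: Sequence[float], response: str) -> None:
        self._entries.append((list(embedding), response))

    def lookup(self, embedding: Sequence[float]) -> Optional[str]:
        """Return the closest cached response within the threshold, else None."""
        best_response: Optional[str] = None
        best_distance = self._max_distance
        for cached_embedding, response in self._entries:
            distance = cosine_distance(embedding, cached_embedding)
            if distance <= best_distance:
                best_response, best_distance = response, distance
        return best_response


if __name__ == "__main__":
    cache = SemanticCache(max_distance=0.05)
    cache.store([1.0, 0.0], "Go to Settings > Security to reset your password.")
    print(cache.lookup([0.99, 0.05]))  # near-duplicate direction: expected to hit
    print(cache.lookup([0.0, 1.0]))    # orthogonal direction: expected to miss
```

The threshold is the knob that trades hit rate against false positives: loosen it and more paraphrases hit, but so do similar-looking questions that need different answers.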

Look. Even a 20% cache hit rate on LLM calls saves real money. Build caching into the foundation early.

Model routing: match cost to task complexity¶

Not every query needs GPT-4. Not every query needs a full RAG pipeline.

Query classification  →  Route to
──────────────────────────────────────────────────────────
Simple FAQ lookup      →  gpt-4o-mini + cache  ($0.00005)
Complex multi-step     →  gpt-4o               ($0.00049)
Restatement of cache   →  exact cache hit      ($0.000)
Off-topic (refuse)     →  classifier only      ($0.000)

Model routing reduces cost by using expensive models only when needed. A classifier (cheap, fast) routes each query to the right pipeline branch.
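
To make the routing table concrete, here is a deliberately simple router. Everything in it is an assumption for illustration: the keyword sets, the 20-word length cutoff and the function name `route` are hypothetical, and a real system would more likely use a small trained classifier than keyword rules. The branch names match the table above.

```python
from typing import Set

# Hypothetical vocabularies for a billing-support assistant.
ON_TOPIC_TERMS: Set[str] = {"invoice", "refund", "billing", "account"}
COMPLEX_MARKERS: Set[str] = {"compare", "why", "explain", "steps"}


def route(query: str, cached_queries: Set[str]) -> str:
    """Return the pipeline branch for a query: cache, refuse, gpt-4o or gpt-4o-mini."""
    normalised = " ".join(query.lower().split())
    if normalised in cached_queries:
        return "cache"                      # exact cache hit, no model call
    words = set(normalised.replace("?", "").split())
    if not words & ON_TOPIC_TERMS:
        return "refuse"                     # off-topic, classifier only
    if words & COMPLEX_MARKERS or len(words) > 20:
        return "gpt-4o"                     # complex multi-step
    return "gpt-4o-mini"                    # simple FAQ lookup


if __name__ == "__main__":
    cached = {"how do i get a refund?"}
    for q in [
        "How do I get a refund?",
        "Where is my invoice",
        "Explain why my invoice changed",
        "What is the weather",
    ]:
        print(f"{q!r} -> {route(q, cached)}")
```

The order of checks matters: the cheapest exits (cache hit, refusal) come first so that the expensive branch is only reached when nothing cheaper applies.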

Worked routing calculation:

Query distribution:
  40% simple FAQ    →  $0.00005 × 40% = $0.000020
  30% complex       →  $0.00049 × 30% = $0.000147
  20% cache hit     →  $0.000   × 20% = $0.000000
  10% off-topic     →  $0.000   × 10% = $0.000000

Blended cost per query: $0.000020 + $0.000147 = $0.000167

Without routing (all gpt-4o): $0.00049 per query
Savings: ($0.00049 - $0.000167) / $0.00049 = 66% cost reduction

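See. Model routing gives you a 66% cost reduction on the same traffic. Note that 30% of the traffic in this mix never reaches a model at all (cache hits and refusals), so the saving comes from routing and caching working together. This is one of the highest-leverage cost optimisations available.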
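
The routing calculation is worth keeping as code so you can replay it with your own traffic mix. A small sketch, assuming the shares and per-query costs from the worked example; `ROUTES`, `blended_cost` and `savings_vs_baseline` are illustrative names.

```python
from typing import Dict, Tuple

# route name -> (share of traffic, cost per query in USD), from the worked example
ROUTES: Dict[str, Tuple[float, float]] = {
    "simple_faq": (0.40, 0.00005),
    "complex": (0.30, 0.00049),
    "cache_hit": (0.20, 0.0),
    "off_topic": (0.10, 0.0),
}


def blended_cost(routes: Dict[str, Tuple[float, float]]) -> float:
    """Traffic-weighted average cost per query."""
    total_share = sum(share for share, _ in routes.values())
    if abs(total_share - 1.0) > 1e-9:
        raise ValueError(f"traffic shares sum to {total_share}, expected 1.0")
    return sum(share * cost for share, cost in routes.values())


def savings_vs_baseline(routes: Dict[str, Tuple[float, float]], baseline_cost: float) -> float:
    """Fractional saving compared with sending every query to the baseline model."""
    return 1.0 - blended_cost(routes) / baseline_cost


if __name__ == "__main__":
    print(f"Blended cost per query: ${blended_cost(ROUTES):.6f}")
    print(f"Savings vs all gpt-4o: {savings_vs_baseline(ROUTES, 0.00049):.0%}")
```

The share check guards against the most common mistake in this kind of model: a traffic mix that quietly adds up to more or less than 100%.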

Latency management¶

Cost and latency usually move together, and both are in tension with quality. Smaller models → cheaper and faster. Larger models → more accurate but slower and costlier.

Latency breakdown in a RAG pipeline:
────────────────────────────────────────────────────────
Query embedding:      50 ms   (local model faster than API)
Vector retrieval:     80 ms   (index size matters)
Context assembly:     10 ms   (pure CPU)
LLM call p50:        450 ms   (gpt-4o-mini)
LLM call p95:        720 ms   (gpt-4o-mini)
Total p95:           860 ms   ← over 800 ms SLA!
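
The same breakdown can be checked automatically against the SLA. This sketch assumes the per-stage figures from the table and, like the table, simply adds them up, which is a rough approximation rather than a true end-to-end p95. `STAGES_P95_MS` and `latency_report` are hypothetical names.

```python
from typing import Dict

# Per-stage latency in milliseconds, taken from the breakdown above.
STAGES_P95_MS: Dict[str, int] = {
    "query_embedding": 50,
    "vector_retrieval": 80,
    "context_assembly": 10,
    "llm_call": 720,
}


def latency_report(stages: Dict[str, int], sla_ms: int) -> Dict[str, object]:
    """Sum the stages, compare with the SLA, and name the slowest stage."""
    total = sum(stages.values())
    bottleneck = max(stages, key=lambda name: stages[name])
    return {
        "total_ms": total,
        "over_sla_ms": max(0, total - sla_ms),
        "bottleneck": bottleneck,
    }


if __name__ == "__main__":
    report = latency_report(STAGES_P95_MS, sla_ms=800)
    print(report)
```

Running this in CI against measured stage timings turns the SLA from a slide into a failing check.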

The LLM call is the bottleneck. Three strategies for latency:

1. Streaming responses. Return tokens as they arrive. User sees first token in ~200 ms even if full response takes 720 ms. Perceived latency drops dramatically.

2. Prefetching. If you can predict the next query (e.g., from a conversation thread), start the retrieval before the user submits.

3. Model downgrade for latency. gpt-4o-mini p95: 720 ms. Open-source model (llama-3-8b) on-premise: p95 450 ms. Trade some accuracy for latency compliance.
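
Of the three strategies above, streaming is the easiest to demonstrate. The sketch below simulates a streaming response with a generator and measures time to first token separately from total time. It is a simulation only: `fake_token_stream` is a hypothetical stand-in for a real streaming API, and the delays are chosen so that three tokens add up to the 720 ms figure used above.

```python
import time
from typing import Callable, Iterable, Iterator, Optional, Tuple


def fake_token_stream(
    tokens: Iterable[str],
    first_token_delay_s: float,
    per_token_delay_s: float,
    sleep: Callable[[float], None] = time.sleep,
) -> Iterator[str]:
    """Simulated streaming response: waits, then yields one token at a time."""
    for index, token in enumerate(tokens):
        sleep(first_token_delay_s if index == 0 else per_token_delay_s)
        yield token


def consume_stream(
    stream: Iterable[str],
    clock: Callable[[], float] = time.perf_counter,
    on_token: Optional[Callable[[str], None]] = None,
) -> Tuple[str, float, float]:
    """Return (full_text, time_to_first_token_s, total_time_s)."""
    start = clock()
    first_token_at: Optional[float] = None
    parts = []
    for token in stream:
        if first_token_at is None:
            first_token_at = clock() - start
        parts.append(token)
        if on_token is not None:
            on_token(token)  # e.g. push the token to the UI immediately
    total = clock() - start
    return "".join(parts), (total if first_token_at is None else first_token_at), total


if __name__ == "__main__":
    stream = fake_token_stream(["Hel", "lo", "!"], 0.20, 0.26)
    text, ttft, total = consume_stream(stream, on_token=lambda t: print(t, end="", flush=True))
    print(f"\nfirst token after {ttft * 1000:.0f} ms, full response after {total * 1000:.0f} ms")
```

Total generation time does not change; what changes is the number the user feels, which is why time to first token deserves its own metric alongside p95.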

Where this lives in the wild¶

Klarna AI — semantic cache on customer FAQs; 60%+ cache hit rate; saves millions of API calls monthly.
Perplexity.ai — model routing: simple queries use smaller model; complex synthesis uses larger; routed by query classifier.
Notion AI — streaming responses; user sees words appearing; perceived latency is much lower than actual generation time.
Intercom Fin — partial retrieval cache; frequently accessed KB chunks are pre-cached in memory.
Cohere Coral — reranker model (smaller, faster) used instead of full LLM for retrieval scoring; saves 40% latency.

Pause and recall¶

In the cost model example, what was the total monthly cost? Was it under the per-call budget?
What is the difference between exact-match cache and semantic cache?
In the model routing example, what was the percentage cost reduction?
Name three strategies for reducing LLM call latency.

Interview Q&A¶

Q: "Our LLM costs are $5 000/month and growing. How would you reduce them?"

A: Three steps. First, audit the token budget: are we using more tokens than needed? Reduce context block size. Set max_tokens explicitly. Second, add semantic caching for frequent queries. Even a 20% cache hit rate saves roughly 20% of LLM cost. Third, implement model routing: use a cheaper model for simple queries and the expensive model only for complex ones.

Common wrong answer to avoid: "Use a cheaper model for everything." Blanket model downgrade hurts quality. Routing preserves quality where it matters.

Q: "How do you design a system to meet a strict 500 ms latency SLA for an AI feature?"

A: I measure every component's latency first. The LLM call is usually the bottleneck. I enable streaming so users see tokens immediately. I use a smaller, faster model for the majority of queries. I cache frequent queries. I move embedding to a local model to cut the embedding API round-trip.

Common wrong answer to avoid: "I increase infrastructure resources." When you call a hosted model API, adding application servers does not make the LLM faster. The bottleneck is model inference time, not your own infrastructure.

Q: "What is model routing and why does it matter for cost management?"

A: Model routing uses a cheap classifier to direct each query to the appropriate model or pipeline. Simple queries go to cheap, fast models. Complex queries go to the capable expensive model. Cached responses skip the model entirely. This can reduce blended cost per query by 50–70% without sacrificing quality on complex tasks.

Common wrong answer to avoid: "Use the biggest model you can afford for everything." Quality uniformity does not require cost uniformity. Most queries do not need the most capable model.

Q: "How do you decide between exact-match caching and semantic caching?"

A: Exact-match cache is best for high-frequency identical queries — FAQ-style systems where the same query is asked repeatedly. Semantic cache is best when users paraphrase the same intent in different words. I start with exact-match (simpler, zero false-positive risk) and add semantic caching once I have data showing paraphrase frequency.

Common wrong answer to avoid: "Semantic cache is always better because it handles more cases." Semantic cache has a false-positive risk: similar-looking queries may need different answers. Exact-match has zero false-positive risk.

Apply now (5 min)¶

Build the cost model for your capstone project. Estimate daily query volume, system prompt size, context size, and response size. Calculate daily and monthly LLM cost. Identify which cost lever (caching, model routing, token budget) would have the largest impact.

Sketch from memory: Draw the token budget table with four rows (system prompt, context, query, response). Fill in the numbers from the worked example without looking back at it.

Bridge. Cost and latency are managed. The system is production-ready internally. Now we make it real for users. That is the move-in day. → 10-deployment-strategy.md