00. Cost & Latency Optimization for LLM Applications: The Five-Year-Old Version
You already know how to make LLM apps useful. This module teaches you how to make them affordable, fast, and explainable in a design review.
Imagine you run a busy taxi fleet. Some riders need a quick trip to the metro. Some need an airport run with luggage. Some repeat the same pickup instructions every morning. Some panic if they do not receive an ETA within a minute, even when the final arrival time is unchanged. A production LLM system behaves the same way. Every request consumes distance on the meter, time on the road, capacity in the car, and attention from the dispatcher.
A weak team treats every ride the same. It sends a luxury cab to every request, lets the passenger describe the route from scratch, waits until the full trip is complete before sending an update, and only reads the monthly fuel bill after finance complains. The product may work for a demo. It will not survive real traffic, long prompts, retries, tool loops, and impatient users.
A strong AI engineering team manages the fleet deliberately. It counts input, output, cached, retry, and shadow-eval tokens. It separates time-to-first-token from total completion time. It caches stable prompt prefixes. It routes easy work to smaller models while protecting hard or risky work. It streams useful progress early. It batches where the product can tolerate queueing. It watches KV cache memory when long sessions fill the GPU. It compresses prompts and controls outputs without deleting the safety or evidence that make the answer trustworthy.
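The split between time-to-first-token and total completion time can be measured directly from any streamed response. The following is a minimal sketch, not a provider integration: `measure_stream` and `fake_stream` are hypothetical names, and `fake_stream` only imitates a slow prefill followed by steady decoding so the example runs without a network call.

```python
import time
from typing import Callable, Iterable, Iterator


def measure_stream(chunks: Iterable[str],
                   clock: Callable[[], float] = time.perf_counter) -> dict:
    """Consume a stream and report TTFT, total time, and chunk count."""
    start = clock()
    first_at = None
    count = 0
    for _ in chunks:
        if first_at is None:
            first_at = clock()  # moment the first chunk became visible
        count += 1
    end = clock()
    return {
        "ttft_s": None if first_at is None else first_at - start,
        "total_s": end - start,
        "chunks": count,
    }


def fake_stream(prefill_s: float, per_chunk_s: float, n: int) -> Iterator[str]:
    """Hypothetical stand-in for a provider stream (assumed timings)."""
    time.sleep(prefill_s)          # imitates queueing plus prefill
    for i in range(n):
        if i:
            time.sleep(per_chunk_s)  # imitates per-token decode time
        yield f"tok{i}"


if __name__ == "__main__":
    print(measure_stream(fake_stream(0.05, 0.01, 5)))
```

Reporting the two numbers separately matters because a change can improve one while leaving the other untouched; a single "latency" figure would hide that.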
This module is a cost-and-latency design guide, not a bag of tricks. Each chapter asks the same lead-level question: which pressure are we relieving, which pressure moves somewhere else, and what production signal proves the tradeoff is still worth it?
One picture for the whole module
        user request
              │
              ▼
    ┌────────────────────┐
    │ dispatch board     │  choose model, route, region, and fallback
    └─────────┬──────────┘
              │
              ▼
    ┌────────────────────┐
    │ prompt enters cab  │  input, cached input, retrieval, history
    └─────────┬──────────┘
              │
              ▼
    ┌────────────────────┐
    │ prefill + cache    │  TTFT, prefix reuse, KV memory
    └─────────┬──────────┘
              │
              ▼
    ┌────────────────────┐
    │ streamed answer    │  useful first token, output budget, cancel
    └─────────┬──────────┘
              │
              ▼
    ┌────────────────────┐
    │ fuel ledger        │  cost, latency, success, route, version
    └────────────────────┘
The picture matters because these optimizations are coupled. A shorter prompt can reduce both cost and TTFT. A cheaper model can increase retries. A larger batch can lower infrastructure cost but raise P95 latency. A local model can remove network delay but lose task quality. Senior engineers do not optimize one line item in isolation; they optimize the whole workflow under constraints.
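The "cheaper model can increase retries" coupling can be put into rough numbers. The sketch below assumes a simple cascade that the source does not prescribe: a small model is tried once, and every failure escalates once to a large model that is assumed to succeed. The function names and the cost units are hypothetical.

```python
def cascade_cost(small_cost: float, large_cost: float, small_success: float) -> float:
    """Expected cost per request: always pay for the small attempt,
    and pay for the large model only on the share of requests that fail."""
    if not 0.0 <= small_success <= 1.0:
        raise ValueError("small_success must be a probability")
    return small_cost + (1.0 - small_success) * large_cost


def break_even_success(small_cost: float, large_cost: float) -> float:
    """Small-model success rate above which the cascade beats always-large."""
    if large_cost <= 0:
        raise ValueError("large_cost must be positive")
    return small_cost / large_cost


if __name__ == "__main__":
    # Placeholder unit costs: small = 1, large = 10.
    print(cascade_cost(1.0, 10.0, 0.75))   # 3.5
    print(cascade_cost(1.0, 10.0, 0.5))    # 6.0
    print(break_even_success(1.0, 10.0))   # 0.1
```

Under these assumed costs the cascade wins whenever the small model succeeds more than 10% of the time, but every escalation also adds a second round trip, so the savings have to be weighed against the latency of the requests that fall through.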
The placeholders you will see called back
| Placeholder | Meaning |
|---|---|
| meter ticks | Token charges: fresh input, cached input, output, retries, tool calls, eval traffic. |
| dispatch board | Routing logic that chooses model, region, local/cloud path, fallback, and queue policy. |
| memorized route | Prompt caching and stable prefix reuse. |
| ETA call | Streaming and progressive rendering that reduce perceived dead air. |
| carpool lane | Batching and scheduling that improve throughput by grouping work. |
| boot space | KV cache and memory pressure carried by active sessions. |
| fuel ledger | Dashboards, budgets, alerts, and versioned attribution for spend and latency. |
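The "meter ticks" row can be made concrete with a small cost function. This is a minimal sketch under stated assumptions: the `Prices` and `Attempt` types are hypothetical, prices are expressed per one million tokens, and the numbers in the usage example are placeholders rather than any vendor's rates.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Prices:
    """USD per one million tokens. Callers supply current vendor prices."""
    fresh_input: float
    cached_input: float
    output: float


@dataclass(frozen=True)
class Attempt:
    """Token counts for one model call (first try, retry, or tool-loop step)."""
    fresh_input_tokens: int
    cached_input_tokens: int
    output_tokens: int


def attempt_cost(a: Attempt, p: Prices) -> float:
    """Cost of a single call, with each token class priced separately."""
    total = (
        a.fresh_input_tokens * p.fresh_input
        + a.cached_input_tokens * p.cached_input
        + a.output_tokens * p.output
    )
    return total / 1_000_000


def workflow_cost(attempts: Iterable[Attempt], p: Prices) -> float:
    """Sum every call made for one user request, not just the final one."""
    return sum(attempt_cost(a, p) for a in attempts)


if __name__ == "__main__":
    placeholder_prices = Prices(fresh_input=2.0, cached_input=0.5, output=8.0)
    calls = [
        Attempt(fresh_input_tokens=1000, cached_input_tokens=0, output_tokens=500),
        # A retry that reuses most of the prompt from cache.
        Attempt(fresh_input_tokens=200, cached_input_tokens=800, output_tokens=500),
    ]
    print(f"{workflow_cost(calls, placeholder_prices):.6f} USD")
```

Keeping retries as separate `Attempt` entries, rather than folding them into one total, is what lets the bill show how much of the spend came from work the user never saw.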
Top resources
- Provider pricing pages for the models you actually run. Always replace sample numbers in this module with current vendor prices.
- Inference-serving docs for vLLM, TensorRT-LLM, Ray Serve, or your chosen managed provider when you need batching, KV cache, and throughput details.
- Your own traces. Cost and latency optimization becomes real only when token counts, route decisions, retries, cache hits, and user outcomes share one request ID.
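The last point, one request ID shared across token counts, routes, retries, cache hits, and outcomes, can be sketched as a small aggregation. The `CallRecord` fields and the `summarize_by_request` helper are illustrative assumptions, not a standard tracing schema; a real system would read these records from its own trace store.

```python
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass(frozen=True)
class CallRecord:
    """One model call. Field names are illustrative, not a standard schema."""
    request_id: str
    route: str                  # hypothetical labels such as "small" or "large"
    fresh_input_tokens: int
    cached_input_tokens: int
    output_tokens: int
    is_retry: bool = False


def summarize_by_request(calls: Iterable[CallRecord],
                         outcomes: Dict[str, str]) -> Dict[str, dict]:
    """Fold per-call records into one row per request ID and attach the outcome."""
    rows: Dict[str, dict] = {}
    for call in calls:
        row = rows.setdefault(call.request_id, {
            "routes": [],
            "fresh_input_tokens": 0,
            "cached_input_tokens": 0,
            "output_tokens": 0,
            "retries": 0,
            "outcome": outcomes.get(call.request_id, "unknown"),
        })
        row["routes"].append(call.route)
        row["fresh_input_tokens"] += call.fresh_input_tokens
        row["cached_input_tokens"] += call.cached_input_tokens
        row["output_tokens"] += call.output_tokens
        row["retries"] += int(call.is_retry)
    return rows


if __name__ == "__main__":
    calls = [
        CallRecord("r1", "small", 900, 0, 300),
        CallRecord("r1", "large", 100, 800, 400, is_retry=True),
        CallRecord("r2", "small", 500, 0, 120),
    ]
    for request_id, row in summarize_by_request(calls, {"r1": "resolved"}).items():
        print(request_id, row)
```

A request with no recorded outcome is kept and marked "unknown" rather than dropped, since missing outcomes are themselves a signal that the trace is incomplete.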
What's coming
- 01-cost-anatomy.md — count input, output, cached, retry, and workflow costs.
- 02-latency-anatomy.md — separate TTFT, tokens/sec, queueing, and total latency.
- 03-prompt-caching.md — design stable prefixes that providers can reuse.
- 04-model-routing.md — route by difficulty, risk, confidence, and business value.
- 05-streaming-first-token.md — make the first visible token useful and cancellable.
- 06-batching-strategies.md — trade small queue waits for higher throughput where product budgets allow it.
- 07-kv-cache-optimization.md — understand why memory, not FLOPs, often limits long sessions.
- 08-prompt-compression.md — shrink prompts while preserving behavior, evidence, and safety.
- 09-output-length-control.md — bound answers with schemas, stop rules, staged detail, and cancellation.
- 10-cost-dashboards.md — build the ledger that catches regressions before invoices arrive.
- 11-edge-deployment.md — decide when local or edge models beat remote API calls.
- 12-capacity-planning.md — forecast tokens, peaks, limits, route mix, and headroom.
- 13-honest-admission.md — name the limits that still make optimization empirical.
Memory map
| Concept | Prerequisite | Pressure family | Recurs later as | Layer touched |
|---|---|---|---|---|
| Token bill | tokenization, prompts | cost | FinOps, routing, output control | product → API → finance |
| TTFT / TPS / total latency | autoregressive decoding | latency | streaming, batching, capacity | UX → serving → hardware |
| Prompt caching | stable templates | cost + latency | prefix caching in serving engines | prompt → provider cache |
| Model routing | evals, confidence, risk | cost + quality | fallbacks, incident response | product → model gateway |
| KV cache | transformer attention | memory | inference serving engines | runtime → GPU memory |
| Dashboards | tracing, metadata | operator attention | MLOps and FinOps | traces → budgets → alerts |
Bridge. Start with the bill. Before choosing a clever optimization, count every token and every retry in one real workflow. → 01-cost-anatomy.md