12. Cost Optimization in Serving — GPU money burns while you sleep¶

~15 min read. Cheap AI serving starts with system design, not heroic quantization alone.

Built on the ELI5 in 00-eli5.md. The assembly line — the flow that moves work through machines — now becomes a cost-shaping serving design.

1) First picture: cost comes from the route, not only the chip¶

See.

request
  │
  ▼
┌──────────────┐   easy?    yes ──→ small model
│ route gate   ├──────────────────────────────┐
└──────┬───────┘                              │
       │ no                                   ▼
       ▼                              ┌───────────────┐
┌──────────────┐                      │ cache / batch │
│ large model  │                      └───────┬───────┘
└──────┬───────┘                              ▼
       ▼                              ┌───────────────┐
┌──────────────┐                      │ right GPU     │
│ output limit │                      └───────┬───────┘
└──────┬───────┘                              ▼
       ▼                              ┌───────────────┐
┌──────────────┐                      │ cost + SLA    │
│ business win │                      └───────────────┘
└──────────────┘

Many teams start with one question. Which GPU is cheapest? That question comes too late.

The earlier question is better. Why is this request reaching that GPU at all? What route, batch, cache, and latency promise forced that decision?

So what to do? Start with architecture. Then optimize the hardware choice inside that design. Simple, no?

The assembly line matters because every stage changes spend. Routing changes which model runs. Caching changes how much repeated work is skipped. Batching changes utilization. Output limits change token cost.

The production monitor helps here too. It shows whether the cost-saving idea really preserved latency and quality. The quality gate checks that a change is safe to promote. But unit economics become visible only in live traffic.

2) The biggest levers usually come before quantization¶

Quantization is useful. But teams often grab it too early. There are bigger levers sitting in plain sight.

Route easy tasks to cheaper models¶

Not every request needs the smartest model. Password reset questions do not need the same route as complex tax analysis. Use a router, a classifier, or confidence thresholds.
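
A minimal sketch of such a route gate. It assumes some upstream classifier already produces a difficulty score and a confidence, both in the range 0 to 1. The names `choose_route`, `EASY_THRESHOLD`, `MIN_CONFIDENCE`, and the route labels are hypothetical and only illustrate the idea.

```python
from dataclasses import dataclass

# Hypothetical thresholds; tune them on your own labelled traffic.
EASY_THRESHOLD = 0.4
MIN_CONFIDENCE = 0.8


@dataclass(frozen=True)
class Route:
    model: str       # illustrative label, not a real model name
    max_tokens: int  # output cap that travels with the route


SMALL = Route(model="small-model", max_tokens=150)
LARGE = Route(model="large-model", max_tokens=700)


def choose_route(difficulty: float, confidence: float) -> Route:
    """Use the small model only when the request looks easy AND the
    classifier is confident; otherwise escalate to the large model."""
    if difficulty < EASY_THRESHOLD and confidence >= MIN_CONFIDENCE:
        return SMALL
    return LARGE


print(choose_route(0.1, 0.95).model)  # small-model (password-reset style)
print(choose_route(0.9, 0.95).model)  # large-model (tax-analysis style)
```

Escalating whenever confidence is low is a deliberate bias. A wrong "easy" guess costs quality, while a wrong "hard" guess only costs money.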

Reduce prompt length¶

Long prompts are a silent tax. Cut repeated instructions. Move stable prefixes into cached sections. Remove decorative examples that add little value.
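
To make the tax visible, here is a rough sketch for token-billed models. The traffic volume, prompt sizes, and the price of $0.50 per million input tokens are all made-up numbers, and `monthly_input_cost` is a hypothetical helper. On self-hosted GPUs the same waste shows up as prefill time instead of a line on the bill.

```python
def monthly_input_cost(
    requests_per_day: int,
    prompt_tokens: int,
    usd_per_million_tokens: float,
    days: int = 30,
) -> float:
    """Input-token spend for one endpoint over a month."""
    total_tokens = requests_per_day * days * prompt_tokens
    return total_tokens * usd_per_million_tokens / 1_000_000


# Hypothetical endpoint: 100k requests/day, prompt trimmed from 2,500 to 1,500 tokens.
before = monthly_input_cost(100_000, 2_500, 0.50)
after = monthly_input_cost(100_000, 1_500, 0.50)
print(f"before: ${before:.0f}, after: ${after:.0f}, saved: ${before - after:.0f}")
# before: $3750, after: $2250, saved: $1500
```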

Cache prefixes and repeated results¶

If the first 2,000 tokens repeat across requests, cache them. If identical queries repeat, cache full responses when safe. The warehouse stores approved versions. Caching makes those approved assets cheaper to use.

Tune output token limits¶

Many apps overspend on unnecessary verbosity. Set maximum tokens based on task shape. A classification task does not need a 700-token essay.
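
One way to express "task shape" is a small lookup table. The task names and caps below are hypothetical starting points, best tuned from the output lengths you actually observe.

```python
# Hypothetical per-task output caps, in tokens.
MAX_OUTPUT_TOKENS = {
    "classification": 8,
    "extraction": 128,
    "summary": 256,
    "chat": 512,
}
DEFAULT_MAX_OUTPUT_TOKENS = 256


def output_cap(task_type: str) -> int:
    """Look up the cap for a task type; unknown types get a moderate default."""
    return MAX_OUTPUT_TOKENS.get(task_type, DEFAULT_MAX_OUTPUT_TOKENS)


print(output_cap("classification"))  # 8
print(output_cap("unknown-task"))    # 256
```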

Batch smarter¶

Batching improves throughput when latency budgets allow it. But blind batching can hurt the interactive experience. Separate latency-critical traffic from offline or queueable traffic. Yes?
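
A minimal sketch of that separation: the same batcher, configured with different delay budgets per traffic class. `MicroBatcher` and its parameters are hypothetical. Time is passed in explicitly so the logic is easy to test, and a real server would drive `poll` from a timer.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class MicroBatcher:
    max_batch_size: int
    max_wait_s: float
    _items: List[str] = field(default_factory=list)
    _oldest_arrival: Optional[float] = None

    def add(self, item: str, now: float) -> Optional[List[str]]:
        """Queue an item; return a batch if one is ready to flush."""
        if not self._items:
            self._oldest_arrival = now
        self._items.append(item)
        return self.poll(now)

    def poll(self, now: float) -> Optional[List[str]]:
        """Flush when the batch is full or the oldest item has waited too long."""
        if not self._items:
            return None
        full = len(self._items) >= self.max_batch_size
        waited = now - self._oldest_arrival >= self.max_wait_s
        if full or waited:
            batch, self._items = self._items, []
            self._oldest_arrival = None
            return batch
        return None


# Interactive traffic: tiny wait budget. Offline traffic: large batches, long wait.
interactive = MicroBatcher(max_batch_size=4, max_wait_s=0.02)
offline = MicroBatcher(max_batch_size=64, max_wait_s=30.0)

print(interactive.add("req-1", now=0.000))  # None: still within the wait budget
print(interactive.poll(now=0.025))          # ['req-1']: flushed by the deadline
```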

Right-size the GPU¶

Some models are tiny relative to the machine they run on, so most of an expensive GPU sits idle. Some workloads choke a cheap GPU and create queueing waste. Hardware price per hour is only one number. Useful work per hour is the real number.

3) Ballpark prices mean little without utilization¶

Rough on-demand serving prices often look like this. These are ballparks, not purchase orders.

L4: about $0.7 to $1.2 per hour
A10G: about $1.0 to $1.8 per hour
A100: about $3 to $5 per hour
H100: about $8 to $12+ per hour

Now the trap. A cheaper GPU is not always cheaper per useful request. If the cheap card runs at poor throughput, your true cost per request can be worse.

Worked example. Suppose an L4 costs $1.00 per hour. Suppose it serves 360 requests per hour at your SLA. Cost per request = $1.00 / 360 ≈ $0.0028.

Now suppose an A10G costs $1.50 per hour. Suppose it serves 900 requests per hour at the same SLA. Cost per request = $1.50 / 900 ≈ $0.0017.

The higher hourly price still wins. See. Utilization and throughput matter more than sticker price.
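
The same arithmetic as a tiny script, using the illustrative prices and throughputs from the worked example. `cost_per_request` is a hypothetical helper name.

```python
def cost_per_request(price_per_hour: float, requests_per_hour: float) -> float:
    """Hourly price divided by requests served per hour at the SLA."""
    if requests_per_hour <= 0:
        raise ValueError("requests_per_hour must be positive")
    return price_per_hour / requests_per_hour


l4 = cost_per_request(1.00, 360)
a10g = cost_per_request(1.50, 900)
print(f"L4:   ${l4:.4f} per request")    # L4:   $0.0028 per request
print(f"A10G: ${a10g:.4f} per request")  # A10G: $0.0017 per request
```

Plug in throughput you measured at your own SLA. Benchmark numbers from a different latency target do not transfer.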

Now another example. Suppose an H100 costs $10 per hour. If you use only 12% of its capacity, you are paying mostly for empty air. That is why utilization is the real metric.
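
A quick sketch of what that 12% means in money. It makes the simplifying assumption that utilization is the fraction of each paid hour doing useful work. The helper names are hypothetical.

```python
def effective_hourly_cost(price_per_hour: float, utilization: float) -> float:
    """Price paid per hour of *useful* GPU work."""
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    return price_per_hour / utilization


def idle_spend_per_hour(price_per_hour: float, utilization: float) -> float:
    """Money per hour that buys no useful work."""
    return price_per_hour * (1 - utilization)


print(f"${effective_hourly_cost(10.0, 0.12):.2f}")  # $83.33 per useful hour
print(f"${idle_spend_per_hour(10.0, 0.12):.2f}")    # $8.80 of each $10 is idle
```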

So what should you watch? Watch tokens per second, batch occupancy, queue time, GPU memory headroom, and cost per successful task. The production monitor should show all of them together.

4) Three practical stack shapes and their cost curves¶

There is no universal best stack. The right stack depends on latency, traffic pattern, and task difficulty. Look at three common shapes.

Stack A: real-time chat assistant¶

Typical design:

router sends easy tasks to a small hosted model,
hard tasks escalate to a larger model,
prefix caching is enabled,
strict output limits protect cost,
interactive traffic gets low-latency dedicated capacity.

Cost shape: Low to medium per request. Sensitive to prompt bloat and verbose outputs. Biggest wins come from routing and prefix caching.

Stack B: internal batch document processing¶

Typical design:

work is queued,
large batches are allowed,
latency is relaxed,
spot capacity may be acceptable,
retries can be scheduled intelligently.

Cost shape: Low cost per item when batching is healthy. Biggest wins come from batch size, cheaper windows, and small model routing. The upgrade without downtime matters less here than utilization discipline.

Stack C: high-stakes agent workflow¶

Typical design:

retrieval plus tool calls,
multiple model hops,
verification or self-check step,
strict business SLAs,
fallback to premium routes when confidence drops.

Cost shape: High and spiky. Biggest wins come from cutting failed branches, shrinking context, and separating premium tasks from ordinary ones.
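
A rough model of why failed branches hurt so much. It assumes each hop has a cost and a success probability, that a failed hop aborts the attempt, and that the whole workflow is then retried from the start. The hop costs and probabilities are made-up numbers, and `expected_cost_per_success` is a hypothetical helper.

```python
import math
from typing import List, Tuple


def expected_cost_per_success(hops: List[Tuple[float, float]]) -> float:
    """hops is a list of (cost_usd, success_probability) in execution order.

    Expected spend until one full run succeeds, under the retry-from-start
    assumption: expected cost of one attempt / probability the attempt succeeds.
    """
    reach_prob = 1.0    # probability that execution reaches the current hop
    attempt_cost = 0.0  # expected cost of a single attempt
    for cost, success_prob in hops:
        attempt_cost += reach_prob * cost
        reach_prob *= success_prob
    if reach_prob == 0:
        return math.inf
    return attempt_cost / reach_prob


# Hypothetical agent run: retrieval, large-model reasoning, verification.
workflow = [(0.002, 0.95), (0.010, 0.90), (0.004, 0.98)]
happy_path = sum(cost for cost, _ in workflow)
print(f"happy path:           ${happy_path:.4f}")
print(f"expected per success: ${expected_cost_per_success(workflow):.4f}")
```

The gap between the two numbers is the price of failed branches. Raising the success rate of an early hop, or making it cheaper to fail, shrinks that gap.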

Simple, no? Each stack has a different enemy. Chat suffers from token waste. Batch suffers from idle windows and poor grouping. Agent systems suffer from compounding steps.

5) A serving cost playbook that actually works¶

Start with one question. What does the business need from this endpoint? Fast answer, cheap answer, or best possible answer? Sometimes you cannot maximize all three.

Then follow this order.

classify traffic by difficulty and latency sensitivity,
split interactive paths from batch paths,
cap output length by task type,
cache stable prefixes and repeated results,
batch only where delay budgets allow,
right-size hardware after measuring throughput,
keep the upgrade without downtime ready for safe experiments.

Notice what is missing. "Quantize everything" is not step one. It may help later. But architecture usually dominates early savings.

Look. A good assembly line reduces wasted work before expensive work starts. That is the heart of serving economics. The warehouse, quality gate, and production monitor all support it, but the route design creates the biggest curve change.

Where this lives in the wild¶

Perplexity inference team — an ML systems engineer balances answer quality, latency, and token costs with routing and caching.
GitHub Copilot serving platform — an infrastructure engineer separates latency-critical completions from background model work.
OpenAI API platform — a capacity engineer watches throughput, output-token inflation, and utilization across premium and standard routes.
Swiggy support automation — an applied AI engineer pushes simple intents to cheaper models and keeps complex escalations on premium paths.
NVIDIA enterprise inference stack — a performance engineer right-sizes GPU classes based on occupancy, memory pressure, and SLA targets.

Pause and recall¶

Why does cost optimization start with architecture before hardware choice?
Name five higher-leverage cost controls before deep quantization work.
Why can a more expensive GPU still be cheaper per request?
How do the cost shapes differ for chat, batch, and agentic stacks?

Interview Q&A¶

Q: Why is utilization a better metric than hourly GPU price alone? A: Hourly price ignores how much useful work the machine completes at the required latency. Throughput, occupancy, and queueing decide the real cost per successful task.

Common wrong answer to avoid: "The lowest hourly price is always the cheapest option." Underpowered hardware often loses after throughput is measured.

Q: Why should routing come before aggressive model compression in many systems? A: Because the cheapest request is the one that never touches the expensive path. Routing removes unnecessary premium inference entirely, while compression only shrinks the cost of requests already on that path.

Common wrong answer to avoid: "Quantize first because it is the most technical lever." Technical does not mean highest impact.

Q: Why separate latency-critical traffic from batch traffic? A: Their optimal serving strategies differ. Interactive traffic wants predictable response time. Batch traffic wants high occupancy and relaxed scheduling.

Common wrong answer to avoid: "A single shared queue is simplest, so it is best." Simple queues often mix incompatible goals and waste capacity.

Q: Why are output limits such a strong cost control? A: Generation cost and latency both scale directly with the number of output tokens. Many applications lose money because answers are longer than the task truly needs.

Common wrong answer to avoid: "Output length is mostly a UX issue." It is also a direct cost and throughput issue.

Apply now (5 min)¶

Take one AI endpoint you know. List its easy path, hard path, batch path, and cacheable prefix. Then estimate whether it really needs premium hardware.

Now sketch from memory:

the route-to-GPU diagram,
the seven-step cost playbook,
and the three stack shapes with their main enemy.

Say aloud why utilization beats sticker price, and why architecture beats premature quantization.

Bridge. Good infrastructure helps a lot, but some MLOps pain remains stubborn. Next we end the module honestly: where tooling is fragmented, where observability is immature, and where small teams must resist building a moon mission for a bicycle. → 13-honest-admission.md