08. Serving infrastructure: where latency, quality, and cost argue politely

~16 min read. Serving is where compute, product, and finance negotiate every minute.

Builds on the ELI5 in 00-eli5.md. The production monitor (the person watching the factory floor for slowdowns and trouble) depends on serving infrastructure exposing the right signals.

Serving is a system, not one lonely model call

See.

People say, "We deployed the model," as if one process did all the work.

In production, a request usually crosses several layers before a token reaches the user.

A gateway authenticates and rate-limits. A router chooses the right model path.

A model server batches, caches, schedules, and uses GPU workers. Then the response returns.

So serving is where product latency, model quality, and hardware cost meet.

Picture first.

request
  │
  ▼
┌─────────┐   ┌────────┐   ┌──────────────┐   ┌──────────┐
│ gateway │──→│ router │──→│ model server │──→│ response │
└─────────┘   └────────┘   │ batching     │   └──────────┘
                           │ caching      │
                           │ scheduler    │
                           │ GPU workers  │
                           └──────────────┘

Simple, no?

The gateway protects the service and handles basic traffic policy.

The router decides whether a request goes to a small model, a large model, or a fallback path.

The model server does the heavy lifting. This is where throughput and latency are mostly decided.
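A minimal sketch of the router's decision, under loud assumptions: the `Request` fields, the 2,000-token threshold, and the route names `small`, `large`, and `fallback` are all hypothetical. A real router would likely use richer signals.

```python
from dataclasses import dataclass


@dataclass
class Request:
    prompt_tokens: int
    max_output_tokens: int
    needs_reasoning: bool = False  # hypothetical flag set by the product layer


def choose_route(req: Request, large_model_healthy: bool = True) -> str:
    """Return a hypothetical route name: 'small', 'large', or 'fallback'."""
    # Assumed rule: long prompts or reasoning-heavy requests count as heavy.
    heavy = req.needs_reasoning or req.prompt_tokens > 2000
    if not heavy:
        return "small"
    if large_model_healthy:
        return "large"
    # The large path is unavailable, so degrade instead of failing.
    return "fallback"


print(choose_route(Request(prompt_tokens=100, max_output_tokens=50)))
print(choose_route(Request(prompt_tokens=5000, max_output_tokens=50)))
print(choose_route(Request(100, 50, needs_reasoning=True), large_model_healthy=False))
```

The point is only that routing is an explicit policy you can test, not something hidden inside the model server.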

The production monitor needs visibility into every stage, not only final HTTP status.

A 200 response can still hide bad batching, queue buildup, or GPU memory waste.
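One way to get that visibility is to time each stage separately. The sketch below is illustrative only: `StageTimer` and `handle_request` are hypothetical names, and the `sleep` stands in for a request waiting in the model server queue.

```python
import time
from contextlib import contextmanager


class StageTimer:
    """Accumulates wall-clock seconds per named stage of one request."""

    def __init__(self):
        self.durations = {}

    @contextmanager
    def stage(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            elapsed = time.perf_counter() - start
            self.durations[name] = self.durations.get(name, 0.0) + elapsed


def handle_request(timer):
    with timer.stage("gateway"):
        pass  # authentication and rate limiting would happen here
    with timer.stage("router"):
        pass  # model path selection would happen here
    with timer.stage("queue"):
        time.sleep(0.01)  # simulated wait for a batch slot
    with timer.stage("model"):
        pass  # prefill and decode would happen here
    return 200


timer = StageTimer()
status = handle_request(timer)
slowest = max(timer.durations, key=timer.durations.get)
print(status, slowest)
```

The status is a healthy 200, yet the per-stage numbers show the time went to queueing. That is the signal the final HTTP status hides.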

Batching is powerful, but the type of batching matters

Look.

Serving efficiency often depends on batching because GPUs like parallel work.

But not all batching works the same way.

Static batching waits for a fixed number of requests or a fixed window, then runs the whole batch until its longest sequence finishes. It is simple, but it can add latency.

Dynamic batching groups requests opportunistically as they arrive. It is more flexible and often better for uneven traffic.

Continuous batching is especially important for LLM serving. It lets new requests join while old ones are still generating tokens.

That improves GPU utilization without waiting for every sequence to finish together.

Here is the picture.

static batch       dynamic batch        continuous batch
A B C D run        A B run              A B run
next batch waits   C joins next         C joins while A B continue
                   D joins next         D joins on later decode step

Yes?
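A toy simulation makes the difference between static and continuous batching concrete. It is a sketch under simplifying assumptions: one decode step produces one token per running sequence, prefill cost is ignored, every request wants at least one token, and the function names are hypothetical.

```python
from collections import deque


def static_batch_steps(lengths, slots):
    """Decode steps when each batch runs until its longest sequence ends."""
    total = 0
    for i in range(0, len(lengths), slots):
        total += max(lengths[i:i + slots])
    return total


def continuous_batch_steps(lengths, slots):
    """Decode steps when a waiting request takes a slot as soon as one frees."""
    waiting = deque(lengths)
    active = []  # remaining tokens for each running sequence
    steps = 0
    while waiting or active:
        # Fill any free slots before the next decode step.
        while waiting and len(active) < slots:
            active.append(waiting.popleft())
        steps += 1
        # Each running sequence emits one token; finished ones free their slot.
        active = [r - 1 for r in active if r - 1 > 0]
    return steps


lengths = [2, 8, 2, 8]  # output tokens requested, in arrival order
print(static_batch_steps(lengths, slots=2))      # 16
print(continuous_batch_steps(lengths, slots=2))  # 12
```

With two slots, the short requests stop holding a slot hostage. The same work finishes in 12 steps instead of 16.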

For LLMs, QPS alone is a weak measure of load.

Ten requests asking for ten tokens each are cheap. Ten requests asking for three thousand tokens each are not.

So what to do?

Track token-level work, active sequences, queue time, and GPU memory pressure.

The production monitor should surface prefill load, decode load, tokens per second, and queue delay.

A tiny example makes this clear.

Suppose Service A gets 20 requests per second with 50 output tokens each.

That is roughly 1,000 output tokens per second.

Service B also gets 20 requests per second, but each request asks for 700 output tokens.

That is roughly 14,000 output tokens per second.

Same QPS. Very different GPU reality.

If autoscaling watches only QPS, it can react too late or scale the wrong tier.
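The Service A and Service B arithmetic fits in a few lines. The function name is hypothetical, and the numbers are the ones from the example above.

```python
def output_tokens_per_second(requests_per_second, avg_output_tokens):
    """Rough decode load: requests per second times output tokens per request."""
    return requests_per_second * avg_output_tokens


service_a = output_tokens_per_second(20, 50)
service_b = output_tokens_per_second(20, 700)

print(service_a)              # 1000
print(service_b)              # 14000
print(service_b / service_a)  # 14.0
```

Same QPS, fourteen times the decode work. This counts output tokens only; prompt length adds prefill load on top.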

vLLM, TGI, Triton, and caching choices

Now which serving engines matter?

vLLM is popular for LLM serving because continuous batching and PagedAttention improve memory efficiency and throughput.

TGI, or Text Generation Inference, is widely used when teams want a strong Hugging Face-oriented serving path.

Triton (NVIDIA Triton Inference Server) is broader. It handles many model types and inference backends, not only text generation.

So the choice depends on workload shape, ecosystem fit, and platform team comfort.

Simple, no?

If you serve mostly open LLMs with token-heavy traffic, vLLM is often attractive.

If your stack already leans heavily into Hugging Face workflows, TGI may feel natural.

If you need one inference platform across NLP, CV, and ensemble pipelines, Triton can be compelling.

No engine is magic. The wrong workload or weak observability can make any choice look bad.

Caching also changes economics a lot.

Prompt cache stores the computed state for reused prompt prefixes, so the system skips repeated prefill work.

Response cache returns the full answer for exact or near-exact repeats when safe.

Embedding or retrieval cache avoids recomputing upstream context repeatedly.

KV cache keeps the attention keys and values of earlier tokens during generation, so each decode step avoids recomputing them.
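The KV cache is also why GPU memory pressure grows with active tokens. Here is a back-of-envelope estimate, assuming standard attention where every layer stores one key and one value vector per KV head per token. The model shape below is a hypothetical example, not a specific product's configuration.

```python
def kv_cache_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_value=2):
    """Bytes of KV cache added per token; bytes_per_value=2 assumes fp16/bf16."""
    # Factor 2: one key vector and one value vector per layer and KV head.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value


# Hypothetical model shape: 32 layers, 32 KV heads, head dimension 128.
per_token = kv_cache_bytes_per_token(num_layers=32, num_kv_heads=32, head_dim=128)
print(per_token)  # 524288 bytes, i.e. 0.5 MiB per token

# Eight active sequences holding 4,096 tokens each.
active_tokens = 8 * 4096
total_gib = per_token * active_tokens / 2**30
print(total_gib)  # 16.0
```

Under these assumptions, eight long sequences already need 16 GiB for cache alone, before model weights.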

Look at the map.

request
  ├── response cache hit ──→ return answer fast
  ├── prompt prefix hit  ──→ reuse part of compute
  ├── retrieval hit      ──→ reuse context objects
  └── no hit             ──→ full model execution

Yes?

Caching must respect correctness. Personalized or time-sensitive requests may need weaker caching or none.

That is why serving is not only systems work. It is product semantics too.
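The lookup map can be sketched as a small cascade. Everything here is hypothetical: in-memory dicts stand in for real caches, `fake_model` stands in for the model server, the retrieval cache is left out for brevity, and the `cacheable` flag represents the product decision about personalized or time-sensitive requests.

```python
import hashlib

response_cache = {}  # hash of full prompt -> stored answer
prefix_cache = {}    # hash of prompt prefix -> marker for reusable prefix state


def _key(text):
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def fake_model(prefix, user_text):
    return user_text.upper()  # stand-in for real generation


def serve(system_prefix, user_text, run_model, cacheable=True):
    """Return (answer, path), where path names the branch taken in the map."""
    full_key = _key(system_prefix + "\n" + user_text)
    if cacheable and full_key in response_cache:
        return response_cache[full_key], "response cache hit"

    prefix_key = _key(system_prefix)
    path = "prompt prefix hit" if prefix_key in prefix_cache else "no hit"
    prefix_cache[prefix_key] = True  # a real server would keep prefix state here

    answer = run_model(system_prefix, user_text)
    if cacheable:
        response_cache[full_key] = answer
    return answer, path


print(serve("You are helpful.", "hi", fake_model))   # ('HI', 'no hit')
print(serve("You are helpful.", "hi", fake_model))   # ('HI', 'response cache hit')
print(serve("You are helpful.", "bye", fake_model))  # ('BYE', 'prompt prefix hit')
print(serve("You are helpful.", "hi", fake_model, cacheable=False))
```

The last call skips the response cache on purpose. It still reuses the prefix, which is safe even when the full answer must be fresh.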

Autoscaling and GPU orchestration need richer signals

Autoscaling a normal web API can rely heavily on CPU or QPS.

Autoscaling LLM serving needs richer signals because work per request varies wildly.

Useful signals include tokens per second, queue length, queue age, active sequences, GPU memory usage, and time to first token.

The production monitor should show these before incidents, not only during them.
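A small sketch of turning those signals into an early warning. The field names and all three thresholds are assumptions chosen for illustration. Real limits depend on your latency targets.

```python
from dataclasses import dataclass


@dataclass
class ServingSignals:
    tokens_per_second: float
    queue_length: int
    oldest_queue_age_s: float
    active_sequences: int
    gpu_memory_used_fraction: float
    time_to_first_token_s: float


def scale_up_reasons(s, max_queue_age_s=2.0, max_ttft_s=1.0, max_gpu_mem=0.90):
    """Return the list of signals that crossed their (assumed) thresholds."""
    reasons = []
    if s.oldest_queue_age_s > max_queue_age_s:
        reasons.append("queue age")
    if s.time_to_first_token_s > max_ttft_s:
        reasons.append("time to first token")
    if s.gpu_memory_used_fraction > max_gpu_mem:
        reasons.append("GPU memory")
    return reasons


calm = ServingSignals(4000, 3, 0.2, 40, 0.70, 0.3)
strained = ServingSignals(9000, 60, 4.5, 120, 0.95, 1.8)
print(scale_up_reasons(calm))      # []
print(scale_up_reasons(strained))  # ['queue age', 'time to first token', 'GPU memory']
```

Returning reasons instead of a bare yes or no gives the production monitor something readable to show before an incident.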

GPU orchestration basics matter too.

Which model fits on which GPU memory tier? Which models need tensor parallelism across several GPUs? Which pods should stay warm?

Can the scheduler avoid moving huge models around too often?

Can low-priority batch jobs stay away from latency-sensitive interactive traffic?

That is orchestration discipline.
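One simple way to keep batch jobs from crowding out interactive traffic is a priority queue. This is a minimal sketch with hypothetical names. It uses strict priority, which can starve batch jobs under constant interactive load, so a real scheduler would likely add aging or reserved capacity.

```python
import heapq
import itertools

INTERACTIVE, BATCH = 0, 1  # lower number is served first


class PriorityScheduler:
    def __init__(self):
        self._heap = []
        self._order = itertools.count()  # tie-breaker keeps FIFO within a tier

    def submit(self, job_id, tier):
        heapq.heappush(self._heap, (tier, next(self._order), job_id))

    def next_job(self):
        """Pop the highest-priority job, or return None when idle."""
        if not self._heap:
            return None
        return heapq.heappop(self._heap)[2]


sched = PriorityScheduler()
sched.submit("nightly-eval", BATCH)
sched.submit("chat-1", INTERACTIVE)
sched.submit("chat-2", INTERACTIVE)

served = [sched.next_job() for _ in range(3)]
print(served)  # ['chat-1', 'chat-2', 'nightly-eval']
```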

Look at one compact control loop.

traffic arrives
     │
     ▼
observe queue + tokens + GPU memory
     │
     ▼
scale replicas / adjust routing / protect priority tier
     │
     ▼
serve request with stable latency

A small example helps.

Suppose one GPU replica can sustain 3,500 generated tokens per second at your target latency.

Traffic rises from 6,000 to 10,500 generated tokens per second.

Two replicas were enough before. Now you need at least three, and maybe four for headroom.

If you watched only request count, you might miss the jump until queues spike.

So what to do?

Scale from token work and queue pain, not from request count alone.

That keeps the service cheaper and calmer.
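The replica arithmetic from the example can be written down directly. The function name and the 20% headroom figure are assumptions. The capacity and traffic numbers come from the example above.

```python
import math


def replicas_needed(tokens_per_second, per_replica_capacity, headroom=0.0):
    """Smallest replica count that covers the load plus a fractional headroom."""
    target = tokens_per_second * (1.0 + headroom)
    return math.ceil(target / per_replica_capacity)


print(replicas_needed(6000, 3500))                 # 2
print(replicas_needed(10500, 3500))                # 3
print(replicas_needed(10500, 3500, headroom=0.2))  # 4
```

Three replicas cover 10,500 tokens per second with nothing to spare. That is why the fourth replica is worth considering.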

Where this lives in the wild

ChatGPT-style assistants — inference platform engineer: tunes batching, routing, and GPU utilization so interactive chats stay responsive.
Perplexity answer engine — serving engineer: balances retrieval-heavy requests, long generations, and cache policy across mixed workloads.
GitHub Copilot — platform engineer: routes coding prompts through serving stacks where latency, model choice, and cost must stay aligned.
Midjourney or image APIs — inference engineer: uses broader serving systems and GPU scheduling rules because generation jobs have uneven resource shapes.
Stripe internal support copilots — ML infrastructure engineer: tracks token-level load and queue delay instead of trusting QPS alone.

Pause and recall

Why is QPS alone a weak autoscaling signal for LLM serving?
What jobs do gateway, router, and model server each handle?
How do static, dynamic, and continuous batching differ?
When might vLLM, TGI, or Triton each be the better fit?

Interview Q&A

Q: Why is serving infrastructure for LLMs more than just an HTTP wrapper around a model?
A: Real serving needs routing, batching, caching, scheduling, GPU management, and observability across the whole request path. Those layers decide latency, cost, and reliability as much as the model weights do.
Common wrong answer to avoid: "Because models need HTTPS." Security matters, but it is far from the full serving problem.

Q: Why is continuous batching especially valuable for LLM workloads?
A: It lets new sequences join ongoing decode work, which improves GPU utilization and throughput compared with waiting for fixed batches to complete together.
Common wrong answer to avoid: "Because it makes every request faster automatically." It improves efficiency, but tradeoffs still exist.

Q: Why should autoscaling watch tokens and queue signals, not only request count?
A: LLM work varies dramatically by prompt and generation length, so equal QPS can hide massively different compute demand. Token and queue metrics better reflect actual load.
Common wrong answer to avoid: "Because QPS is an old metric." QPS still helps; it is just incomplete.

Q: How do cache choices affect serving economics?
A: Different caches cut different kinds of repeated work, which can reduce latency and GPU cost when request patterns repeat. The right cache depends on product semantics and correctness constraints.
Common wrong answer to avoid: "Turn on every cache everywhere." Wrong cache policy can serve stale or incorrect results.

Apply now (5 min)

Exercise. Sketch one serving path for your product with a gateway, router, and model server. Write which metrics you would watch at each stage.

Then pick one batching strategy and explain why it fits your traffic shape.

Sketch from memory. Draw the stack from request to response and label one cache, one autoscaling signal, and one thing the production monitor must alert on.

Bridge. Once the server can run reliably, the next challenge is changing versions without breaking trust. So now we study deployment strategies for safe upgrades. → 09-deployment-strategies.md