04. TensorRT-LLM compilation — pay at build time so you don't pay every token¶

~22 min read. Roofline said batch the weight-reads. Fusion said stop round-tripping intermediates. NCCL said keep the collective on NVLink. Each was a separate fix you'd apply by hand. TensorRT-LLM is the compiler that applies all of them at once — and adds the two serving moves that finally fill the batch: in-flight batching and paged KV cache. The price is a build step and a loss of flexibility.

Built on roofline, fusion, and NCCL collectives. The invariant is still feed the beast. This file is where the module's separate levers — arithmetic intensity via batching, fused kernels, efficient tensor-parallel collectives — get packaged into one compiled engine, and where two new mechanisms appear: in-flight (continuous) batching and paged KV cache.

What the per-kernel fixes left on the table¶

The last three files each moved one wall by hand. Roofline named the memory wall and told you to batch. Fusion deleted intermediate round-trips inside attention and the feed-forward block. NCCL put the per-layer all-reduce on the fast wire. Apply all three perfectly and the endpoint is still slow, because two things are unaddressed.

First, the batch is small. Roofline's central prescription — share each weight-read across many requests — only pays if many requests are actually in flight together. A naive server batches requests that arrive in the same window, runs them to completion, then starts the next batch. But LLM requests finish at wildly different times (one wants 5 tokens, another wants 800), so the batch empties out and the GPU re-reads weights for a shrinking handful of survivors. Second, the KV cache — the per-request attention memory that grows with every generated token — is allocated as one big contiguous block per request, sized for the worst case. That wastes memory, and wasted KV memory directly caps how many requests fit in the batch. The two problems compound: bad batching and fragmented KV cache each cripple the other.

This file shows how TensorRT-LLM solves both — in-flight batching keeps the batch full by swapping finished requests for waiting ones every step, and paged KV cache stops fragmentation from capping the batch — and what you pay for it: a build step that compiles and autotunes an engine specialized to your model, shapes, and GPU, which is rigid in exchange for speed.

What this file solves¶

Our 70B endpoint, even with fused kernels on NVLink, sits well below target tokens/sec because the decode batch is small and the KV cache wastes memory. The naive read is "we need more GPUs." The real cause is two serving inefficiencies: requests finish at different times so the batch drains, and contiguous KV allocation fragments memory so a big batch never fits. This file teaches you to compile the model into a TensorRT-LLM engine that fuses kernels and tensor-parallelism, batches requests continuously (adding new ones mid-flight), and pages the KV cache so the batch can grow until the GPU is genuinely full — and to weigh that against the build-time and rigidity costs you take on.

Why a compiler, and not just a faster loop¶

A naive serving loop in eager PyTorch does the obvious thing: take a batch, run the forward pass op by op, sample tokens, repeat. It pays every cost the earlier files warned about, every step, forever: it launches each kernel separately (launch overhead), it doesn't fuse the element-wise epilogues (HBM round-trips), it picks generic kernel implementations that aren't tuned for your exact shapes, and it batches naively so the GPU drains to a trickle.

A compiler attacks this by doing expensive work once, ahead of time, so runtime is cheap. TensorRT-LLM takes your model definition, the target GPU, the dtypes (BF16, FP8), the tensor-parallel layout, and the max shapes you'll serve, and it builds an engine: a serialized, optimized execution plan. During the build it fuses operations, picks the fastest kernel for each shape by actually benchmarking candidates (autotuning), and wires in the tensor-parallel collectives; the fixed plan also lets the runtime replay decode steps as CUDA Graphs to kill launch overhead. The result runs the same math far faster, because the optimization happened at build time instead of on every forward pass.

Teacher voice. This is the same trade as a compiled language versus an interpreter. The interpreter (eager PyTorch) is flexible — change a line, run it immediately — but it re-decides everything every time. The compiler (TensorRT-LLM) spends minutes deciding the best plan once, then executes that plan at full speed. You give up the freedom to change the model on a whim; you get a runtime that doesn't re-pay the optimization cost on every token.

The naive batching that drains the GPU¶

A team builds a server that does static batching: collect up to 32 requests, run them together until all finish, then take the next 32. It should give the roofline's batching win. It doesn't, and the profile shows why.

The requests have wildly different output lengths. At step 1, all 32 are decoding — great, one weight-read serves 32 tokens. By step 50, twenty requests have hit their stop token and finished, but the batch can't take new work until all 32 are done, so it runs 12 active requests, then 5, then 1. The long-tail request decodes alone for hundreds of steps, re-reading the full 140 GB of weights to produce one token at a time. Average batch size over the request's life is maybe 6, not 32. The roofline win evaporated into the tail.

The visible break: throughput is a fraction of what batch-32 promised, and a histogram of active batch size per step shows the GPU mostly running tiny effective batches. Adding GPUs doesn't help, because each one still drains the same way.

So the real problem is not the batch size you start with; it is that a static batch can't admit new requests until the slowest one finishes, so it spends most of its life nearly empty. The fix isn't a bigger starting batch — it's a batch that refills continuously.
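The drain can be reproduced with a few lines of arithmetic. The sketch below is a toy model, not a profiler: the output lengths are hypothetical values picked to mirror the story above (twenty short requests, a handful of medium ones, one very long one), and `static_batch_occupancy` is an illustrative helper name.

```python
def static_batch_occupancy(output_lengths):
    """Active-request count at each decode step of one static batch.

    The batch runs until its longest request finishes, and no new
    request is admitted in the meantime.
    """
    total_steps = max(output_lengths)
    return [sum(1 for n in output_lengths if n > step)
            for step in range(total_steps)]


# Hypothetical output lengths for a batch of 32 (illustrative, not measured).
lengths = [50] * 20 + [150] * 7 + [400] * 4 + [800]
occupancy = static_batch_occupancy(lengths)

peak_batch = max(occupancy)
average_batch = sum(occupancy) / len(occupancy)
print(f"peak batch = {peak_batch}, time-averaged batch = {average_batch:.2f}")
```

With these lengths the peak is 32 but the time-averaged batch is about 5.6, which is the number a memory-bound decode actually earns.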

So how do we keep the batch full when requests are constantly finishing and arriving at different times?

When one slow request holds 31 seats empty¶

Take the smallest case. Two requests batch together. Request A wants 10 tokens; request B wants 500. Static batching runs both for 500 steps — but after step 10, A is done and its slot is dead weight, so for 490 steps the GPU re-reads the weights to serve one request (B) while pretending to serve two. A third request C arrives at step 11, ready to go, but it waits in a queue until step 500 because the batch is locked.

In-flight batching fixes exactly this: at step 11, A's finished slot is freed and C is slotted in immediately. The batch stays at two active requests instead of decaying to one, and C doesn't wait 490 steps to start. The weight-read keeps serving a full batch the whole time.

Rule: refill the batch every step, and never let memory layout cap it¶

The serving rule. Keep the decode batch as full as the GPU's memory allows, every single step, by (1) admitting waiting requests and evicting finished ones at every generation step — in-flight (continuous) batching — and (2) allocating KV cache in small fixed-size blocks so memory doesn't fragment and the batch can grow until HBM is genuinely full — paged KV cache. Throughput is set by the average batch over time, not the peak you started with.

Why the rule exists. The primitive is the roofline's: a memory-bound weight-read is wasted unless many tokens share it. The constraint is that requests have unpredictable, wildly different lengths, and contiguous KV allocation forces you to reserve worst-case memory per request. Static batching breaks because the slowest request holds the batch hostage, draining the average batch toward one; contiguous KV breaks because reserved-but-unused memory caps how many requests fit. In-flight batching keeps the average batch high; paged KV frees the memory that was capping it. They are two halves of one fix.

1) In-flight batching — refilling the seats every step¶

In-flight batching (NVIDIA's term; the broader community calls it continuous batching) treats the batch as a living set of slots, reconsidered at every generation step rather than once per batch lifetime.

Each step, the scheduler does three things: run a decode step for every active request, retire any request that just emitted its stop token (free its slot and its KV cache), and admit waiting requests into the freed slots — including running their prefill (prompt processing) interleaved with others' decode. A request that finishes at step 12 doesn't leave a dead slot until step 500; its slot is immediately reused. New arrivals don't wait for the current batch to drain; they join at the next step.
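The three actions per step can be sketched as a toy loop. This is a simulation under simplifying assumptions, not TensorRT-LLM's scheduler: prefill is treated as free, KV memory is ignored, and `Request`, `run_in_flight`, and the request lengths are hypothetical names and values.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Request:
    name: str
    arrival_step: int       # earliest step at which it can be admitted
    tokens_wanted: int      # decode steps until its stop token
    tokens_done: int = 0
    start_step: int = -1    # filled in when the scheduler admits it


def run_in_flight(requests, capacity):
    """Return the active-batch size at every step of a step-level scheduler."""
    waiting = deque(sorted(requests, key=lambda r: r.arrival_step))
    active, occupancy, step = [], [], 0

    def admit(next_step):
        # Fill free slots with requests that have arrived by next_step.
        while (waiting and len(active) < capacity
               and waiting[0].arrival_step <= next_step):
            req = waiting.popleft()
            req.start_step = next_step
            active.append(req)

    admit(1)
    while waiting or active:
        step += 1
        for req in active:                      # 1) decode one token each
            req.tokens_done += 1
        occupancy.append(len(active))
        active[:] = [r for r in active          # 2) retire finished requests
                     if r.tokens_done < r.tokens_wanted]
        admit(step + 1)                         # 3) refill the freed slots
    return occupancy


a = Request("A", arrival_step=1, tokens_wanted=10)
b = Request("B", arrival_step=1, tokens_wanted=500)
c = Request("C", arrival_step=11, tokens_wanted=490)
occupancy = run_in_flight([a, b, c], capacity=2)
print(c.start_step, sum(occupancy) / len(occupancy))
```

In this run C starts at step 11, the step after A retires, and the batch holds two active requests for all 500 steps. A static batch would have made C wait until step 501.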

STATIC BATCHING                               IN-FLIGHT (CONTINUOUS) BATCHING
───────────────                               ───────────────────────────────
step  active requests                         step  active requests
  1   [A B C D]  ← full                         1   [A B C D]  ← full
 12   [_ B _ D]  ← A,C done, slots dead        12   [E B F D]  ← A,C retired, E,F admitted
 60   [_ B _ _]  ← only B left                 60   [G B H I]  ← always refilled
500   [_ B _ _]  ← B alone, 3 dead slots      500   [. . . .]  ← batch stays full throughout
      avg batch ≈ small                             avg batch ≈ near capacity

The payoff is the average batch size over time climbs toward capacity, because empty seats get refilled instead of sitting dead through the long tail. For a memory-bound decode workload, average batch size is throughput — it's the number of tokens each weight-read serves. In-flight batching is the single largest serving-layer throughput lever, and it's exactly the roofline's prescription made operational: maximize how many tokens share each weight-read, continuously.

Mini-FAQ. "Doesn't interleaving prefill and decode hurt latency?" It can, which is why the scheduler balances them — chunked prefill splits a long prompt's prefill across steps so it doesn't block decode of in-flight requests for too long. The point is that the slot is never idle: if there's prefill to do, do it; if not, decode. The GPU stays fed.
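As a rough illustration of chunked prefill, the sketch below only splits a prompt length into per-step chunk sizes. The chunk size of 2048 tokens and the helper name `chunk_prefill` are hypothetical; a real scheduler derives chunk sizes from its per-step token budget.

```python
def chunk_prefill(prompt_len, chunk_tokens):
    """Split a prompt of prompt_len tokens into prefill chunks of at most chunk_tokens."""
    if chunk_tokens <= 0:
        raise ValueError("chunk_tokens must be positive")
    return [min(chunk_tokens, prompt_len - start)
            for start in range(0, prompt_len, chunk_tokens)]


chunks = chunk_prefill(prompt_len=6000, chunk_tokens=2048)
print(chunks)  # [2048, 2048, 1904]
```

Each chunk shares a step with the decode tokens of in-flight requests, so a 6000-token prompt delays them over three moderately heavier steps instead of one very long one.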

2) The picture — paged KV cache vs contiguous¶

The KV cache is the per-request attention memory: for every token in the sequence, prompt and generated alike, each layer stores that token's key and value vectors so future tokens can attend to them. During decode it grows by one token's worth every step, and you don't know the final length in advance.
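To put a size on "one token's worth", here is a back-of-envelope calculation. The shape values are assumptions taken from the published Llama-3-70B configuration (80 layers, 8 KV heads under grouped-query attention, head dimension 128) with a 16-bit cache; check them against your own model config before relying on the result.

```python
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_value):
    # Factor 2: one key vector and one value vector per layer and per KV head.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value


# Assumed Llama-3-70B shape with a 16-bit (2-byte) KV cache.
per_token = kv_bytes_per_token(num_layers=80, num_kv_heads=8,
                               head_dim=128, bytes_per_value=2)
per_request_4k = per_token * 4096

print(f"{per_token / 1024:.0f} KiB per token")
print(f"{per_request_4k / 2**30:.2f} GiB for one 4096-token request")
```

Under these assumptions each token costs 320 KiB across the whole model, so a single 4096-token request holds 1.25 GiB of cache. That is why reserving worst-case length for every request is so expensive.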

CONTIGUOUS KV CACHE                          PAGED KV CACHE (PagedAttention)
───────────────────                          ──────────────────────────────
reserve max-length block per request          allocate small fixed-size blocks on demand
                                               (e.g. 16 tokens/block), like OS pages

 req A: [████░░░░░░░░░░░░]  ← uses 4, reserves 16    req A: [B0][B1][B2][B3]   ← 4 blocks, exact
 req B: [██░░░░░░░░░░░░░░]  ← uses 2, reserves 16    req B: [B4][B5]            ← 2 blocks, exact
 req C: cannot fit — no contiguous 16-block hole     req C: [B6][B7][B8]...     ← fills the gaps
                                                     shared prefix: A,B point to same [B0]
 wasted = reserved − used  (huge; fragments)    wasted ≈ < one block per request (near zero)
 batch capped by fragmentation                  batch grows until blocks genuinely run out

Contiguous allocation reserves the maximum possible length for every request, because the cache must be one unbroken block and you can't know the real length up front. Most requests use a fraction of that, so most of the reserved memory is wasted — and worse, it fragments, so even when total free memory is large, there's no contiguous hole for a new request. The batch gets capped not by real memory pressure but by fragmentation.

Paged KV cache (the PagedAttention idea, born in vLLM, now standard in TensorRT-LLM) borrows the operating-system trick: chop the cache into small fixed-size blocks and allocate them on demand, like virtual-memory pages. A request grows by adding blocks as it generates tokens, so the only waste is internal: at most one partially filled block per request. There is no external fragmentation, because any free block fits any request. And shared prefixes (the same system prompt across many requests) can point at the same physical blocks, so identical prefixes are stored once. Near-zero waste means the batch grows until the GPU is genuinely full.
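The bookkeeping behind paging fits in a small class. The sketch below is a toy allocator written for illustration, not TensorRT-LLM's or vLLM's implementation: `PagedKVAllocator` and its methods are hypothetical names, it tracks block ids only (no tensors), and it shares prefixes only at full-block granularity.

```python
class PagedKVAllocator:
    """Toy block allocator for a paged KV cache (bookkeeping only)."""

    def __init__(self, num_blocks, tokens_per_block=16):
        self.tokens_per_block = tokens_per_block
        self.free_blocks = list(range(num_blocks))
        self.ref_count = {}      # physical block id -> number of requests using it
        self.block_table = {}    # request id -> ordered list of physical block ids
        self.num_tokens = {}     # request id -> tokens stored so far

    def add_request(self, req_id, shared_prefix_blocks=()):
        # A shared prefix is reused by reference; only full blocks are shared here.
        for block in shared_prefix_blocks:
            self.ref_count[block] += 1
        self.block_table[req_id] = list(shared_prefix_blocks)
        self.num_tokens[req_id] = len(shared_prefix_blocks) * self.tokens_per_block

    def append_token(self, req_id):
        table = self.block_table[req_id]
        if self.num_tokens[req_id] == len(table) * self.tokens_per_block:
            # Last block is full (or there is none yet): any free block will do.
            if not self.free_blocks:
                raise MemoryError("KV pool exhausted")
            block = self.free_blocks.pop()
            self.ref_count[block] = 1
            table.append(block)
        self.num_tokens[req_id] += 1

    def free_request(self, req_id):
        for block in self.block_table.pop(req_id):
            self.ref_count[block] -= 1
            if self.ref_count[block] == 0:   # last user gone: return it to the pool
                del self.ref_count[block]
                self.free_blocks.append(block)
        del self.num_tokens[req_id]

    def wasted_tokens(self, req_id):
        # Unused slots in the request's final, partially filled block.
        capacity = len(self.block_table[req_id]) * self.tokens_per_block
        return capacity - self.num_tokens[req_id]


pool = PagedKVAllocator(num_blocks=8, tokens_per_block=16)

pool.add_request("A")
for _ in range(32):                  # a 32-token system prompt fills exactly 2 blocks
    pool.append_token("A")
prefix = list(pool.block_table["A"])

pool.add_request("B", shared_prefix_blocks=prefix)   # B points at A's prefix blocks
for _ in range(5):
    pool.append_token("B")

print(len(pool.free_blocks), pool.wasted_tokens("B"))   # 5 11
```

Two requests with a shared 32-token prefix consume three physical blocks instead of five, and B's waste is bounded by its one partial block. Freeing A later returns nothing to the pool until B also releases the shared blocks, which is what the reference counts are for.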

3) The 70B endpoint compiled — the running example climbs¶

Our endpoint serves Llama-3-70B, tensor-parallel-4 on NVLink, with fused attention. We compile it into a TensorRT-LLM engine: trtllm-build with the tensor-parallel layout, FP8 weights, and max-batch / max-sequence-length set to our serving envelope. The build takes minutes and autotunes kernels for the H100 and our exact shapes.
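For concreteness, the build invocation can be assembled like this. It is a sketch that only prints a command line and runs nothing. The flag names are `trtllm-build` options, but the paths and every numeric value are hypothetical placeholders, and it assumes the checkpoint directory already carries the tensor-parallel split and FP8 quantization from an earlier conversion step. Check the flags against the TensorRT-LLM version you pin.

```python
import shlex

# Hypothetical serving envelope; size these to your own traffic.
envelope = {
    "max_batch_size": 64,
    "max_input_len": 2048,
    "max_seq_len": 4096,
    "max_num_tokens": 8192,
}

argv = [
    "trtllm-build",
    "--checkpoint_dir", "./llama3-70b-fp8-tp4-ckpt",   # assumed: already converted for TP-4 and FP8
    "--output_dir", "./engines/llama3-70b-fp8-tp4",    # hypothetical output location
]
for flag, value in envelope.items():
    argv += [f"--{flag}", str(value)]

print(shlex.join(argv))
```

Keeping the envelope in one dictionary makes the rigidity visible: these four numbers are the commitments the engine is frozen to.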

At runtime, three things change the numbers. In-flight batching keeps the average decode batch near capacity instead of draining into the long tail, so the weight-read that produced ~1 token per step now produces dozens. FP8 weights halve the bytes in that read (about 70 GB instead of the 140 GB a 70B model occupies in BF16), a memory-roofline win from file 01. Paged KV cache frees the memory that contiguous allocation was wasting, letting the batch grow further before HBM fills. The combination is where the endpoint makes its biggest jump toward 2000 tokens/sec: the roofline's batching lever, finally pulled all the way, on a compiled engine that isn't re-paying fusion and launch costs every step. NVIDIA's published numbers put a single H100 with FP8 well into the thousands of output tokens/sec for models in this class, the regime our endpoint targets.
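A rough roofline bound shows why the average batch and the weight dtype dominate. The sketch assumes each decode step streams every weight once, that tensor-parallel GPUs read their shards in parallel, and a peak HBM bandwidth of 3.35 TB/s per H100 SXM. It ignores KV cache reads, collectives, and compute, so real throughput lands below these numbers.

```python
def decode_tokens_per_sec_bound(weight_bytes, hbm_bytes_per_sec, num_gpus, avg_batch):
    """Upper bound on decode throughput when weight reads dominate each step."""
    # Tensor parallelism splits the weights, so GPUs stream their shards in parallel.
    step_seconds = weight_bytes / (hbm_bytes_per_sec * num_gpus)
    # Every active request receives one token per step.
    return avg_batch / step_seconds


PARAMS = 70e9
H100_HBM_BYTES_PER_SEC = 3.35e12   # assumption: peak bandwidth, not achieved bandwidth

for label, bytes_per_param in (("BF16", 2), ("FP8", 1)):
    for avg_batch in (1, 6, 32):
        bound = decode_tokens_per_sec_bound(
            PARAMS * bytes_per_param, H100_HBM_BYTES_PER_SEC, 4, avg_batch)
        print(f"{label} avg batch {avg_batch:>2}: <= {bound:,.0f} tokens/sec")
```

The bound is linear in the average batch and inversely proportional to bytes per parameter, which is the whole argument of this section in two multipliers.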

Mini-FAQ. "Is the engine just the model in a different file format?" No. It's a plan specialized to a fixed envelope: this GPU, this tensor-parallel degree, these dtypes, these max shapes. Change the GPU generation or exceed the max sequence length you built for, and you must rebuild. The engine bakes in decisions that eager PyTorch re-makes every run — that's the source of both the speed and the rigidity.

4) Why a compiled engine and not just vLLM or eager PyTorch?¶

The plausible alternatives are eager PyTorch (flexible, slow) and a Python serving framework like vLLM (which pioneered paged KV and continuous batching). Why reach for TensorRT-LLM?

Because under our workload (a fixed model, a known GPU, a stable serving envelope, and a hard throughput target) the build-time specialization pays for itself many times over. TensorRT-LLM's autotuning picks the fastest kernel for your exact shapes and dtype, fuses aggressively, captures CUDA Graphs, and integrates FP8 deeply; the result is typically the top end of single-GPU throughput for that hardware. vLLM is excellent and far more flexible: it originated paged KV cache and is easier to iterate with. So the honest comparison is: TensorRT-LLM when you've frozen the model, want maximum throughput per GPU on NVIDIA hardware, and can absorb the build/ops complexity; vLLM (or eager) when you're iterating, switching models often, or want simpler ops. This is a genuine buy-vs-build-vs-tune decision, not a default; file 06 (NIM) and file 09 revisit it.

Why this instead of eager, under our workload? Our endpoint serves one frozen 70B at a fixed tensor-parallel layout to a known traffic shape, with a six-figure GPU bill riding on tokens/sec. That's exactly the case where paying minutes of build time to shave every per-token cost is the right trade. If we were swapping models weekly or prototyping, the rigidity would cost more than the throughput is worth.

5) The property that decides the win: how stable your serving envelope is¶

The one dimension that decides whether compilation is worth it is how fixed your deployment is — model, GPU, dtype, and max shapes. Compilation trades flexibility for speed, so the more stable the envelope, the better the trade.

Serving envelope	Build cost amortized over	Compilation verdict
One frozen model, fixed GPU, stable max-seq-len, high QPS	billions of tokens	strong — build once, win every token
Frequent model swaps / fine-tune iterations	a few hours before rebuild	weak — rebuild churn eats the gains; prefer vLLM/eager
Many small models, low traffic each	little traffic per engine	weak — per-engine build cost dominates; consider Triton multi-model
Variable, very long contexts exceeding build max	must rebuild or over-provision shapes	mixed — set generous max shapes, accept the memory cost

The asymmetry to remember: build time is a fixed cost paid once; the per-token speedup is a recurring gain. At high traffic the fixed cost is rounding error and compilation is obviously right. At low traffic or high churn, the fixed cost dominates and a more flexible runtime wins. The kernels and the per-token speedup are the same in every row; what differs is whether your traffic amortizes the build.
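The amortization argument reduces to one division. The sketch below uses made-up dollar figures purely to show the shape of the trade; `break_even_tokens` and all three inputs are hypothetical, and a real estimate should also count the engineering time spent on builds.

```python
def break_even_tokens(build_cost, cost_per_mtok_flexible, cost_per_mtok_compiled):
    """Tokens that must be served before the one-time build cost is repaid."""
    saving_per_token = (cost_per_mtok_flexible - cost_per_mtok_compiled) / 1e6
    if saving_per_token <= 0:
        return float("inf")   # compilation never pays back
    return build_cost / saving_per_token


# Hypothetical: a $10 build, $0.60 vs $0.40 of GPU cost per million tokens.
tokens = break_even_tokens(build_cost=10.0,
                           cost_per_mtok_flexible=0.60,
                           cost_per_mtok_compiled=0.40)
print(f"break-even after about {tokens:,.0f} tokens per build")
```

If the engine is rebuilt before serving that many tokens, as in the rebuild-churn row, each build costs more than it returns.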

6) The failure walked through: the engine that broke on a long prompt¶

A team builds a TensorRT-LLM engine with max_input_len set to 2048, serves happily, then a user sends a 6000-token document to summarize. The request errors out or gets truncated. Worse, when they raise max_input_len to 8192 and rebuild, the engine reserves far more KV cache headroom, the max batch size the GPU can hold drops, and throughput on normal short requests falls.

Trace it. The engine's max shapes are baked in at build time — they size the memory plan, including KV cache reservations and CUDA Graph captures. Setting them too low rejects long requests; setting them too high reserves memory for a worst case that rarely happens, shrinking the batch for the common case. The fix is to size the envelope to the real traffic distribution (e.g., max-seq-len at the 99th percentile, not the absolute max), use paged KV so unused length isn't reserved per request, and route the rare giant request to a separate engine built for long context. The lesson: the build-time envelope is a real design decision with a throughput cost, not a formality — the rigidity the compiler buys you has teeth.
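Sizing the envelope to a percentile and routing the outliers might look like the sketch below. The prompt-length sample is synthetic, and `percentile_nearest_rank`, `route`, and the engine labels are hypothetical; in practice you would round the chosen limit up to a convenient size and leave room for output tokens.

```python
def percentile_nearest_rank(values, pct):
    """Nearest-rank percentile for an integer pct in 1..100."""
    ordered = sorted(values)
    rank = (pct * len(ordered) + 99) // 100     # ceil(pct * n / 100) in integer math
    return ordered[rank - 1]


def route(prompt_len, main_max_input_len):
    # Requests that fit the main envelope stay on the fast engine;
    # the rare giant goes to a separately built long-context engine.
    return "main" if prompt_len <= main_max_input_len else "long_context"


# Synthetic traffic sample: mostly short prompts, a few huge documents.
prompt_lens = [200] * 900 + [1500] * 90 + [6000] * 10

main_max_input_len = percentile_nearest_rank(prompt_lens, 99)
print(main_max_input_len, route(6000, main_max_input_len), route(200, main_max_input_len))
```

Here the 99th percentile is 1500 tokens, so the main engine is built around that and the 6000-token summaries are routed elsewhere instead of dictating the memory plan for everyone.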

7) Cost movement: what compilation buys and what it costs¶

What it fixes: packages fusion (file 02), efficient tensor-parallel collectives (file 03), CUDA Graphs, and autotuned kernels into one engine, and adds in-flight batching + paged KV to keep the decode batch near capacity. Raises arithmetic intensity (more tokens per weight-read) and frees KV memory — the roofline's two levers, fully pulled.
What it costs: build time (minutes per engine), rigidity (the engine is specialized to model/GPU/dtype/shapes and must be rebuilt when those change), and ops complexity (engine artifacts to manage, version, and deploy). FP8 also costs a small accuracy validation step.
Which subsystem pays: the build pipeline and the ops/MLOps team — they own engine builds, versioning, and the model-to-engine lifecycle. The reward lands at runtime as a large throughput and latency win, and in the budget as fewer GPUs for the same traffic. This is file 02's "pay at build time, win at runtime" trade, scaled up to the whole model.

For the running example: compiling the 70B costs minutes of build and a rebuild discipline whenever the model changes, and in return the endpoint serves multiples more tokens/sec on the same four GPUs — the single biggest jump in the module's throughput climb.

8) Signals: healthy, first to degrade, and the liar¶

Healthy: average in-flight batch size stays near capacity through traffic swings; KV cache occupancy is high with little reserved-but-unused memory; tokens/sec/GPU near the engine's benchmarked peak; build succeeds and the engine's max shapes cover the real traffic distribution.
First metric to degrade: average batch size drops (often because a long-tail request or a too-small KV pool starves admission), or KV-cache-free-fraction approaches zero and new requests start queuing. Either shows up as falling tokens/sec before latency visibly spikes.
The misleading metric: peak/instantaneous batch size — it can look healthy at the start of each window while the average over the long tail is poor. Watch the time-averaged active batch, not the max.
The graph an expert opens first: the active-batch-size-over-time plot and the KV cache utilization gauge. A batch that decays toward 1 means static-batching behavior leaked in (or admission is starved); a KV pool pinned at 100% with requests queuing means the cache, not compute, is capping the batch — raise kv_cache_free_gpu_mem_fraction or shorten max-seq-len.
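When the KV gauge is the suspect, a quick capacity estimate tells you what the pool can hold. All inputs below are hypothetical per-GPU numbers: 50 GB free after weights and 80 KiB of KV per token (roughly a quarter of a 320 KiB/token model under tensor-parallel-4). The helper names are illustrative, not part of any API.

```python
def kv_pool_tokens(free_gpu_bytes, kv_fraction, kv_bytes_per_token):
    """Tokens of KV cache that fit in the fraction of free memory given to the pool."""
    return int(free_gpu_bytes * kv_fraction) // kv_bytes_per_token


def max_requests(pool_tokens, tokens_held_per_request):
    return pool_tokens // tokens_held_per_request


pool_tokens = kv_pool_tokens(free_gpu_bytes=50_000_000_000,
                             kv_fraction=0.9,
                             kv_bytes_per_token=81_920)

worst_case = max_requests(pool_tokens, 8192)   # every request reserves max-seq-len
typical = max_requests(pool_tokens, 1000)      # requests hold only what they use
print(pool_tokens, worst_case, typical)
```

Under these assumptions the same pool fits 67 requests if each reserves an 8192-token worst case, and 549 if each holds a typical 1000 tokens. Lowering the fraction or raising the reserved length shrinks the batch in direct proportion.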

9) Boundary: where compilation shines and where it hurts¶

TensorRT-LLM shines for a frozen model on a fixed NVIDIA GPU at high traffic with a stable serving envelope — the production endpoint that will serve billions of tokens. There, build time is rounding error, in-flight batching and paged KV pull the roofline's levers fully, and FP8 + autotuning push single-GPU throughput to the top of what the hardware allows.

It becomes pathological when the envelope is unstable. Frequent model changes mean constant rebuilds whose cost outpaces the per-token gains. Many low-traffic models each pay a build cost they never amortize. Workloads with unpredictable extreme-length contexts force either rejection or memory-wasting over-provisioning of build-time max shapes. The scale limit that invalidates naive intuition: "compile everything for max speed" is wrong when iteration speed or model diversity matters more than per-GPU throughput — there, a flexible runtime (vLLM, eager) that needs no rebuild is faster in wall-clock-to-results, even if it's slower per token.

10) Wrong model: "compiling just makes the same thing run faster, free of charge"¶

The seductive wrong idea is that compilation is a pure speedup — flip it on, get faster tokens, no downside. The downside is real and structural: the engine is frozen to a specific model, GPU, dtype, and shape envelope, and any change to those means a rebuild.

Replace it with: compilation moves cost from runtime to build time and trades flexibility for speed. You're not getting free speed; you're pre-paying optimization once and accepting that the result is rigid. That trade is excellent for a stable, high-traffic endpoint and poor for a fast-iterating one. The engine's max shapes, dtype, and tensor-parallel degree are design commitments, not toggles.

11) Other failure shapes to recognize¶

Static-batching behavior in disguise. A misconfigured scheduler or a backend that doesn't truly support in-flight batching drains to small effective batches. Fix: confirm continuous batching is active; watch average batch size.
KV cache starvation. kv_cache_free_gpu_mem_fraction set too low (or max-seq-len too high) so the KV pool can't hold a full batch; requests queue. Fix: tune the KV fraction and max-seq-len to the real traffic.
Build-envelope mismatch. Real requests exceed max_input_len/max_seq_len; rejections or truncation. Fix: size shapes to the traffic distribution; route giant requests to a long-context engine.
Rebuild churn. A team fine-tunes weekly and rebuilds the engine each time, spending more on builds than they save. Fix: reconsider whether a flexible runtime fits this iteration pace.
FP8 accuracy regression. FP8 quantization shifts outputs enough to fail an eval the team didn't run. Fix: validate quality on FP8 before shipping; calibrate properly.
Version skew. Engine built with one TensorRT-LLM version, served with another; subtle incompatibilities. Fix: pin and co-version the build and serving stack.

12) Pattern transfer¶

Same trade as file 02's fusion, scaled up. Fusion paid at build/compile time for runtime speed at kernel granularity; TensorRT-LLM does it for the whole model — "expensive once, cheap every token." The compiler-vs-interpreter shape recurs at every layer of the stack.
Same amortization as the roofline's batching. In-flight batching is the operational form of "share each weight-read across many requests" (file 01), kept full continuously. The amortize-a-fixed-cost pattern appears as weight-reads here, kernel launches in file 02, and collective bandwidth in file 03.
Same idea as OS virtual memory. Paged KV cache is literally paging — fixed-size blocks allocated on demand, shared via pointers, no fragmentation. The decades-old memory-management pattern solves the exact KV fragmentation problem one layer up.

13) Design test — five questions before you compile¶

Is your model frozen and your serving envelope (GPU, dtype, max shapes) stable enough that build-time cost amortizes over the traffic?
Does your runtime actually do in-flight (continuous) batching, and can you watch the average active batch size, not just the peak?
Is your KV cache paged, and is the KV pool sized so the batch fills the GPU rather than being capped by fragmentation or a low free-fraction?
Are your build-time max shapes sized to the real traffic distribution, with a plan for the rare giant request?
If you're iterating on the model frequently, have you compared the rebuild cost against a flexible runtime like vLLM before committing to compilation?

Where this appears in production¶

The engine and its mechanisms

NVIDIA TensorRT-LLM — compiles an LLM into an optimized engine with fused kernels, autotuned shapes, CUDA Graphs, FP8, in-flight batching, and paged KV; the throughput ceiling for LLMs on NVIDIA GPUs.
trtllm-build — the build tool that bakes tensor-parallel degree, dtype, and max shapes into the engine; its flags (max_batch_size, max_num_tokens, max_input_len, max_seq_len) are the rigidity made concrete.
In-flight batching — TensorRT-LLM's continuous batching; admits/retires requests every generation step to keep the average batch near capacity.
Paged KV cache — TensorRT-LLM's PagedAttention-style block allocation (8/16/32/64/128 tokens per block, default on); near-zero fragmentation and prefix sharing.
vLLM — pioneered PagedAttention and continuous batching; the flexible counterpart and the standard buy-vs-tune comparison to TensorRT-LLM.
SGLang — another high-throughput serving engine with continuous batching and prefix caching; one of the engines NIM can wrap.

Where the trade shows up

NVIDIA NIM — ships TensorRT-LLM engines prebuilt inside containers so you skip the build step (file 06); the "buy" side of this build decision.
Triton Inference Server — hosts a TensorRT-LLM engine as a backend and adds serving-layer dynamic batching and multi-model hosting (file 05).
FP8 on Hopper/Blackwell — TensorRT-LLM's FP8 path halves weight bytes (a memory-roofline win) and uses FP8 Tensor Cores; the headline throughput numbers ride on it.
Chunked prefill — splits long-prompt prefill across steps so it doesn't block in-flight decode; balances the prefill/decode interleave.
Speculative decoding (draft models, Medusa, EAGLE) — TensorRT-LLM supports drafting multiple tokens and verifying them in one pass, raising tokens-per-weight-read further.
Prefix caching / KV reuse — shared system prompts stored once via paged blocks; cuts prefill cost for repeated prefixes across requests.
Production LLM platforms (Fireworks, Baseten, Together, Anyscale) — run TensorRT-LLM or vLLM with continuous batching and paged KV as the core of their per-GPU economics.

Pause and recall¶

Why does static batching drain to a small average batch even when it starts full?
What does in-flight (continuous) batching do at every generation step?
Why does contiguous KV cache allocation waste memory and cap the batch?
How does paged KV cache eliminate fragmentation, and how does it share a system prompt across requests?
What does a TensorRT-LLM engine bake in at build time, and why does that make it rigid?
State the trade compilation makes in one sentence.
When would you pick vLLM or eager PyTorch over a compiled TensorRT-LLM engine?
Why does raising max_seq_len to a huge value hurt throughput on normal short requests?

Interview Q&A¶

Q1. Your endpoint uses batch-32 static batching but throughput is far below what batch-32 should give. Why, and what's the fix? A. Requests finish at different times, but static batching can't admit new work until the slowest finishes, so the batch decays through the long tail — average batch size is far below 32, and a memory-bound decode's throughput tracks the average batch, not the peak. Switch to in-flight (continuous) batching: retire finished requests and admit waiting ones every generation step, keeping the average batch near capacity. Common wrong answer to avoid: "Increase the batch size to 64." A bigger starting batch still drains the same way; the problem is the draining, not the start size.

Q2. Why does paged KV cache let you fit a larger batch than contiguous allocation on the same GPU? A. Contiguous allocation reserves the max possible sequence length per request and fragments memory, so most reserved memory is unused and there's often no contiguous hole for a new request — the batch is capped by fragmentation, not real usage. Paged KV allocates small fixed-size blocks on demand, wasting at most one partial block per request and never fragmenting, so the batch grows until HBM is genuinely full. Shared prefixes also dedupe via shared blocks. Common wrong answer to avoid: "Paged KV compresses the cache." It doesn't compress; it eliminates reservation waste and fragmentation through on-demand block allocation.

Q3. What exactly do you give up by compiling with TensorRT-LLM, and when is that a bad trade? A. You give up flexibility: the engine is frozen to a specific model, GPU, dtype, and max-shape envelope, and any change requires a rebuild (minutes plus ops). That's a bad trade when you iterate on the model frequently, serve many low-traffic models, or face unpredictable extreme-length contexts — the rebuild/over-provisioning cost can outpace the per-token gains, and a flexible runtime like vLLM gets you to results faster. Common wrong answer to avoid: "Compiling is strictly better, just turn it on." It moves cost to build time and bakes in rigidity; for fast-iterating or diverse workloads that's a net loss.

Q4. How do in-flight batching and paged KV cache relate to the roofline from file 01? A. The roofline said decode is memory-bound and the fix is to share each weight-read across many tokens — i.e., maximize batch size. In-flight batching keeps the average batch near capacity continuously (more tokens per weight-read), and paged KV frees the memory that was capping the batch. Together they pull the roofline's batching lever as far as the GPU's memory allows; they're the serving-layer realization of "raise arithmetic intensity." Common wrong answer to avoid: "Batching is a serving detail unrelated to the roofline." It's the direct operational mechanism for the roofline's central prescription.

Q5. A user's 6000-token prompt errors on your engine built with max_input_len=2048. You raise it to 8192 and rebuild — now short requests are slower. Diagnose. A. Max shapes are baked into the engine's memory plan; too small rejects long inputs, too large reserves more KV headroom and shrinks the batch the GPU can hold for the common short case, lowering throughput. Size the envelope to the real distribution (e.g., a high percentile, not the absolute max), rely on paged KV so unused length isn't reserved per request, and route rare giant requests to a separate long-context engine. Common wrong answer to avoid: "Just set max_input_len to the largest possible value to be safe." That over-reserves memory and degrades the common case.

Q6. (Cumulative.) For the 70B endpoint, order the contributions of NCCL placement (file 03), fusion (file 02), and in-flight batching + paged KV (this file) to reaching 2000 tokens/sec. A. All three are necessary, but in-flight batching + paged KV is the biggest single jump because it pulls the roofline's batching lever fully, turning a draining static batch into a near-full continuous one, which is the dominant lever for memory-bound decode. Fusion cuts per-step bytes (especially attention's O(N²) intermediate score matrix), and NCCL placement keeps the per-layer all-reduce on NVLink so multi-GPU doesn't add latency. Without correct NCCL placement the batch gains would be eaten by collective stalls; without fusion each request costs more bytes; but the batching is what most directly raises tokens-per-weight-read. Common wrong answer to avoid: "Just buy more GPUs to hit 2000." The same GPUs reach the target once the batch is kept full and the collective is on the fast wire; more GPUs without these fixes still drain.

Design/debug exercise (10 min)¶

Step 1 — Model it. Two requests batch together: A wants 10 tokens, B wants 500. Under static batching, compute the average active batch size over B's 500 steps (full for 10 steps, then 1 for 490) — roughly (2×10 + 1×490)/500 ≈ 1.02. Under in-flight batching, if a steady queue refills A's freed slot, the average stays near 2. Write both averages and note that decode throughput scales with this average, not the peak.
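To check the Step 1 arithmetic, here is a minimal simulation of both policies. It is a sketch under simplifying assumptions: one token per active request per step, a freed slot refilled at the very next step, and a hypothetical queue of 10-token requests standing in for the "steady queue". The function names are made up for this example.

```python
from collections import deque

def static_avg_batch(lengths):
    """Average active requests per step when the batch runs until its
    longest request finishes and no new request is admitted."""
    return sum(lengths) / max(lengths)

def inflight_avg_batch(first_batch, refill_lengths, steps):
    """Average active requests per step when an empty slot is refilled
    from the queue at the start of the next step."""
    queue = deque(refill_lengths)
    remaining = list(first_batch)          # tokens left to generate per slot
    active_total = 0
    for _ in range(steps):
        for i, left in enumerate(remaining):
            if left == 0 and queue:        # slot is free: admit a new request
                remaining[i] = queue.popleft()
        active_total += sum(1 for left in remaining if left > 0)
        remaining = [max(left - 1, 0) for left in remaining]
    return active_total / steps

print(static_avg_batch([10, 500]))                    # 1.02
print(inflight_avg_batch([10, 500], [10] * 49, 500))  # 2.0
```

With an empty refill queue, `inflight_avg_batch([10, 500], [], 500)` falls back to 1.02. The gain comes entirely from having work ready to fill the freed slot.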

Step 2 — Your turn. For the 70B endpoint with a realistic mix (median 80 output tokens, p99 800), estimate how much in-flight batching raises the average batch versus static batching at the same capacity, and argue why paged KV is required to realize that higher average (the freed slots need KV memory that contiguous allocation would have reserved away). Tie the result to the roofline: more average batch = more tokens per 140 GB weight-read = higher tokens/sec on the same four GPUs.
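For the roofline tie-in, a back-of-envelope ceiling can help: if every decode step streams the full weights once and yields one token per active request, tokens/sec is roughly the average batch times (aggregate HBM bandwidth / weight bytes). The sketch below assumes exactly that and ignores KV cache reads, collectives, and kernel overheads. The 3.35 TB/s per-GPU bandwidth is a placeholder assumption rather than a figure from this module, so substitute the number for your hardware.

```python
def decode_tokens_per_sec(avg_batch, weight_bytes, hbm_bytes_per_sec):
    """Bandwidth-bound ceiling: one full weight read per step,
    one token per active request per step."""
    steps_per_sec = hbm_bytes_per_sec / weight_bytes
    return avg_batch * steps_per_sec

def batch_needed(target_tokens_per_sec, weight_bytes, hbm_bytes_per_sec):
    """Average batch required to reach a target under the same model."""
    return target_tokens_per_sec * weight_bytes / hbm_bytes_per_sec

WEIGHT_BYTES = 140e9            # the 140 GB weight-read from the text
AGG_BANDWIDTH = 4 * 3.35e12     # ASSUMPTION: four GPUs at 3.35 TB/s each

for avg_batch in (1.02, 2.0, 32.0):
    ceiling = decode_tokens_per_sec(avg_batch, WEIGHT_BYTES, AGG_BANDWIDTH)
    print(f"avg batch {avg_batch:>5}: ceiling ~{ceiling:.0f} tokens/sec")

needed = batch_needed(2000, WEIGHT_BYTES, AGG_BANDWIDTH)
print(f"avg batch needed for 2000 tokens/sec: ~{needed:.0f}")
```

Under these assumptions the 2000 tokens/sec target needs an average batch of roughly 21, which is why a batch that decays toward 1 cannot get there on any number of the same GPUs.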

Step 3 — Reproduce from memory. Redraw both diagrams from sections 1 and 2: the static-vs-in-flight batch-occupancy timeline, and the contiguous-vs-paged KV layout. Then state in one sentence how this file connects to file 01 (it operationalizes the roofline's batching lever) and to file 02 (it's the same build-time-for-runtime trade, scaled from a kernel to the whole model).

Operational memory¶

This chapter explained why an endpoint with fused kernels on NVLink still serves far below target: the decode batch drains to a trickle because static batching can't admit new requests until the slowest finishes, and contiguous KV cache fragments memory so a large batch never fits. The important idea is that throughput for memory-bound decode tracks the average batch over time, and two mechanisms keep that average high — in-flight batching refills the seats every step, and paged KV cache frees the memory that was capping the batch.
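The memory effect can be put in numbers with a minimal capacity comparison. The sketch assumes a KV pool measured in token slots, contiguous allocation that reserves the worst-case length per request, and paged allocation that reserves whole blocks only for tokens actually held. The pool size, block size, and request length are illustrative values, not figures from this module.

```python
import math

def contiguous_capacity(pool_tokens, max_seq_len):
    """Requests that fit when each one reserves the worst-case length up front."""
    return pool_tokens // max_seq_len

def paged_capacity(pool_tokens, actual_len, block_tokens):
    """Requests that fit when each one holds only the blocks it has filled."""
    blocks_per_request = math.ceil(actual_len / block_tokens)
    total_blocks = pool_tokens // block_tokens
    return total_blocks // blocks_per_request

POOL_TOKENS = 100_000   # assumed KV pool size, in token slots
print(contiguous_capacity(POOL_TOKENS, max_seq_len=2048))              # 48
print(paged_capacity(POOL_TOKENS, actual_len=100, block_tokens=64))    # 781
```

The only waste left in the paged case is the partly filled last block of each request, which is why the batch can grow until the pool itself is exhausted.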

You learned to compile the model into a TensorRT-LLM engine that packages fusion, tensor-parallel collectives, CUDA Graphs, and FP8, and adds in-flight batching + paged KV — pulling the roofline's batching lever as far as the GPU's memory allows. That solves the opening failure because the 140 GB weight-read now serves a near-full batch continuously instead of decaying into a single-request tail. The price is build time and rigidity: the engine is frozen to a model, GPU, dtype, and shape envelope, and changing any of those means a rebuild.

Carry this diagnostic forward: when throughput is low, watch the average active batch size and KV cache occupancy before adding GPUs. If the batch decays toward one, suspect static-batching behavior or KV starvation; if the KV pool is pinned at 100% with requests queuing, the cache is capping the batch, not compute.
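The diagnostic above can be written down as a small triage function. This is a hypothetical sketch: the metric names, the 95% and 50% thresholds, and the extra "underloaded" case (a small batch with nothing queued) are assumptions added for illustration, and you would feed it whatever your serving stack actually reports.

```python
def diagnose(avg_active_batch, batch_capacity, kv_occupancy, queued_requests):
    """Triage low throughput from batch and KV metrics.

    Thresholds are illustrative assumptions, not tuned values.
    kv_occupancy is a fraction in [0, 1].
    """
    if kv_occupancy >= 0.95 and queued_requests > 0:
        # KV pool is pinned while work waits: the cache caps the batch.
        return "kv-capped"
    if avg_active_batch < 0.5 * batch_capacity:
        if queued_requests > 0:
            # Work is waiting but slots sit empty: static-batching behavior.
            return "batch-draining"
        # Nothing waiting: there is simply not enough traffic.
        return "underloaded"
    return "batch-healthy"

print(diagnose(1.2, 32, 0.40, 50))   # batch-draining
print(diagnose(30, 32, 1.00, 12))    # kv-capped
```

Only when none of these cases fire is adding GPUs the next thing to consider.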

Remember:

Memory-bound decode throughput tracks the average batch over time, not the peak you start with.
In-flight (continuous) batching admits/retires requests every generation step so the batch never drains into the long tail.
Paged KV cache allocates fixed-size blocks on demand — near-zero fragmentation, prefix sharing, batch grows until HBM is full.
Compilation moves cost from runtime to build time and trades flexibility for speed; the engine is frozen to model/GPU/dtype/shapes.
Build-time max shapes are a real throughput decision — too small rejects long inputs, too large shrinks the batch for the common case.
Next pressure: one compiled engine serves one model well, but production runs many models, mixes frameworks, and needs request routing, versioning, and preprocessing — a serving layer above the engine.

Bridge. TensorRT-LLM gives us one blazing-fast engine for one model. But production is never one model on a clean socket: you have a 70B chat model, an embedding model, a safety classifier, and a tokenizer step, all needing to be served together, versioned, routed, and batched at the request boundary — possibly across frameworks. The engine doesn't do that; it just runs a forward pass fast. The next file is the serving layer that wraps engines like this — Triton (now Dynamo-Triton) — handling dynamic batching at the request edge, multi-framework backends, model ensembles that chain preprocessing→model→postprocessing server-side, and running many models concurrently on the same GPUs. → 05-triton-inference-server.md