01. GPU execution and the roofline — which ceiling are you actually hitting?¶

~20 min read. Eight H100s, a 70B model, and 38% utilization. Before you touch a single tool, you need the one number that tells you whether the GPU is starving for compute or starving for memory. Get that wrong and every optimization after it is aimed at the wrong wall.

Builds on the first-principles overview. The module's invariant is "feed the beast": a GPU delivers peak only when its compute units never stall. This file gives you the instrument that reads the stall: the roofline, built from arithmetic intensity, separating memory-bound from compute-bound work. Every later file moves one of these two walls.

What you can already assume and what is still dark¶

You have used a GPU. You know it has thousands of cores, that you ship it big matrices, and that it returns answers faster than a CPU. You may know the H100 datasheet number — roughly 989 TFLOPS in BF16. So the natural expectation is simple: feed it matrix multiplies, watch it run near that number.

It does not. Our Llama-3-70B endpoint runs at 38% utilization and a fraction of peak throughput, and the model is doing exactly the matrix multiplies the benchmark does. The datasheet number is real but it is one of two ceilings, and for the work an LLM does most of the time, it is the wrong one. This file builds the second ceiling, shows you how to tell which one binds you, and explains the result that reshapes the whole module: token-by-token decode is limited by memory bandwidth, not by compute. After this you can look at any kernel and predict its wall before you profile it.

What this file solves¶

The endpoint is slow and the GPU looks idle, and the team's instinct is "buy faster cards" or "the model is too big." Both are usually wrong. This file gives you the roofline diagnostic: compute arithmetic intensity (FLOPs per byte moved), compare it against the hardware's compute-to-bandwidth ratio, and read off whether you are compute-bound or memory-bound. That one move tells you which optimizations can possibly help and which are wasted effort — and it explains why a 70B model serving one request at a time uses a sliver of the silicon you paid for.

The chip underneath: SMs, warps, and the memory cliff¶

An H100 is not "thousands of cores" in any flat sense. It is around 132 streaming multiprocessors (SMs), each a small independent processor with its own scheduler, register file, fast on-chip memory, and Tensor Cores for matrix math. Work arrives as a grid of threads; the hardware groups threads into warps of 32 that execute in lockstep, and each SM juggles many warps at once. When one warp stalls waiting on memory, the SM swaps in another warp that has its data ready. That swap is the GPU's whole trick for hiding latency: keep enough warps in flight and the memory delay of any single one disappears behind the work of the others.

That trick only works if there is data to compute on. And data lives in a hierarchy with brutal cliffs between levels:

        ┌──────────────────────────────────────────────────────────────┐
        │  REGISTERS        ~hundreds of KB/SM    ~free, per-thread      │
        │  ─────────────────────────────────────────────────────────    │
        │  L1 / SHARED MEM  ~256 KB/SM            ~20–30 cycles          │
        │  ─────────────────────────────────────────────────────────    │
        │  L2 CACHE         ~50 MB shared         ~200 cycles            │
        │  ════════════════ THE CLIFF ═══════════════════════════════   │
        │  HBM3 (VRAM)      80 GB    3.35 TB/s    ~400–600 cycles        │
        │  ─────────────────────────────────────────────────────────    │
        │  NVLink to peer GPU       900 GB/s      ~microseconds          │
        │  PCIe / host memory       ~64 GB/s      ~slow, avoid           │
        └──────────────────────────────────────────────────────────────┘
              Up = small, fast, scarce.  Down = large, slow, plentiful.

The line marked THE CLIFF is the whole story. On-chip memory (registers, shared memory, L2) is fast enough to keep Tensor Cores busy. HBM — the 80 GB of VRAM on the card — is enormous but an order of magnitude slower per byte. The instant a kernel's working set spills past L2 and has to stream from HBM, the SMs spend their time waiting for bytes instead of multiplying them. The Tensor Cores can consume data far faster than HBM can deliver it. That mismatch is the source of nearly every "why is my GPU idle" question in this module.

Teacher voice. A GPU has two completely different speed limits, and they live in different units. Compute speed is in FLOPs per second — how fast it multiplies. Memory speed is in bytes per second — how fast it reads. A kernel is only as fast as whichever limit it hits first. The mistake is assuming you always hit the FLOPs limit. Most LLM serving hits the bytes limit.

The naive fix that aims at the wrong wall¶

A smart engineer sees 38% utilization and reaches for compute. They quantize the weights to make the matmuls cheaper. They look at a bigger card with more TFLOPS. They consider an H200 or a B200 "because it computes faster." Throughput moves far less than the extra TFLOPS promised, and whatever gain does appear tracks bytes saved or bandwidth added, not faster math.

Why doesn't more compute help? Because during decode the GPU was never short on compute. Watch one decode step. To generate a single next token for one request, the model reads all 70 billion weights out of HBM, multiplies the current token's small activation vector through them, and produces one token. The arithmetic per weight is tiny — a couple of FLOPs per parameter. The data movement per weight is the whole weight. The SMs finish their multiply and sit idle waiting for the next slab of weights to arrive from HBM.

So the real problem is not that the chip computes too slowly; it is that the chip reads memory too slowly relative to how little arithmetic each byte earns. Buying more FLOPs is buying more of the resource that was already sitting idle. The wall is bandwidth.

So how do we know — before profiling, before buying anything — which wall any given kernel will hit?

When a matrix multiply tells you its own wall¶

Take the smallest concrete case. A matrix multiply of an M×K activation by a K×N weight does about 2·M·K·N floating-point operations and must move about (M·K + K·N + M·N) elements through memory: read both inputs, write the output. Multiply that element count by the bytes per element (2 for BF16, 1 for FP8), divide the FLOPs by the result, and you get arithmetic intensity: FLOPs performed per byte moved.

Prefill / training, big batch: M is large (many tokens at once). The weight slab gets reused across all M tokens, so FLOPs grow with M while the weight bytes read stay fixed. Arithmetic intensity is high: hundreds of FLOPs per byte. This is compute-bound: the Tensor Cores are the limit, exactly as the datasheet promises.
Decode, one request: M = 1. You read the entire weight matrix to multiply a single token vector through it. Each weight element earns just 2 FLOPs, which is about 1 FLOP per byte in BF16 and 2 in FP8; this file rounds that to AI ≈ 2. This is deeply memory-bound: you spend essentially all your time reading weights and almost none multiplying.

Same operation, same hardware. The only thing that changed is how many tokens share each weight-read. That ratio decides the wall.
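The counting above can be sketched in a few lines. This is a back-of-envelope model, not a profiler: it assumes each operand and the output cross HBM exactly once, and the 8192×8192 layer shape is a hypothetical example, not a figure from this file.

```python
def matmul_arithmetic_intensity(m: int, k: int, n: int, bytes_per_element: int = 2) -> float:
    """FLOPs per byte for an (m x k) @ (k x n) matmul.

    Assumes ideal reuse: inputs are read once and the output is written once.
    Real kernels can move more bytes than this, never fewer.
    """
    flops = 2 * m * k * n                    # one multiply and one add per (m, k, n) triple
    elements_moved = m * k + k * n + m * n   # activations + weights + output
    return flops / (elements_moved * bytes_per_element)


# Hypothetical layer shape: an 8192 x 8192 weight matrix stored in BF16 (2 bytes each).
K = N = 8192
decode_ai = matmul_arithmetic_intensity(1, K, N)     # one token reads the whole weight matrix
prefill_ai = matmul_arithmetic_intensity(512, K, N)  # 512 tokens share the same weight read

print(f"decode  (M=1):   {decode_ai:8.2f} FLOPs/byte")
print(f"prefill (M=512): {prefill_ai:8.2f} FLOPs/byte")
```

Under these assumptions the M = 1 case lands near 1 FLOP/byte in BF16 (near 2 with one-byte weights), and the M = 512 case lands in the hundreds. The gap between the two is the point, not the exact digits.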

Rule: the lower number wins, and decode's number is tiny¶

The roofline rule. Achievable throughput = min(peak compute, arithmetic intensity × peak bandwidth). A kernel is compute-bound only when its arithmetic intensity exceeds the hardware's compute-to-bandwidth ratio. Below that ridge point, it is memory-bound, and more FLOPs buy nothing.

Why the rule exists. The primitive is that every FLOP needs operands, and operands must travel the memory hierarchy. The constraint is that HBM delivers a fixed bytes/sec. If your kernel earns too few FLOPs per byte, the FLOP units starve no matter how many you own. The naive "add compute" fix breaks because it scales the resource that wasn't the bottleneck.

The H100's ridge point is sharp. With ~989 TFLOPS BF16 and ~3.35 TB/s bandwidth, the compute-to-bandwidth ratio is roughly 989e12 / 3.35e12 ≈ 295 FLOPs per byte (and near 590 for FP8, which doubles compute). Any kernel below ~295 FLOPs/byte is memory-bound on this card. Decode sits near 2. It is not a little below the ridge — it is two orders of magnitude below it. The GPU is starving for bytes by a factor of ~100.
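As a minimal sketch, the rule and the ridge point fit in three small functions. The peak figures are the datasheet numbers quoted in this file; the function names are illustrative, not from any library.

```python
H100_PEAK_FLOPS = 989e12  # BF16 peak, FLOP/s (figure quoted in the text)
H100_PEAK_BW = 3.35e12    # HBM3 bandwidth, bytes/s (figure quoted in the text)


def ridge_point(peak_flops: float, peak_bw: float) -> float:
    """Arithmetic intensity (FLOPs/byte) at which the two ceilings meet."""
    return peak_flops / peak_bw


def attainable_flops(ai: float, peak_flops: float = H100_PEAK_FLOPS,
                     peak_bw: float = H100_PEAK_BW) -> float:
    """Roofline rule: the lower of the compute ceiling and the memory ceiling."""
    return min(peak_flops, ai * peak_bw)


def binding_wall(ai: float, peak_flops: float = H100_PEAK_FLOPS,
                 peak_bw: float = H100_PEAK_BW) -> str:
    """Name the ceiling that limits a kernel with this arithmetic intensity."""
    return "compute" if ai >= ridge_point(peak_flops, peak_bw) else "memory"


print(f"ridge point: {ridge_point(H100_PEAK_FLOPS, H100_PEAK_BW):.0f} FLOPs/byte")
for ai in (2, 64, 1000):
    tflops = attainable_flops(ai) / 1e12
    print(f"AI={ai:>5}: {tflops:7.1f} TFLOP/s attainable, {binding_wall(ai)}-bound")
```

At AI = 2 the attainable figure is under 1% of the datasheet peak, which is the "starving for bytes" claim expressed as a number.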

1) The roofline picture — read the wall off one graph¶

Plot achievable FLOPs/sec (vertical) against arithmetic intensity (horizontal), both log scale. You get two line segments forming a roof.

 FLOP/s
 (log)
  989T ┤                      ┌────────────────────────  ← compute ceiling (flat roof)
       │                     ╱       COMPUTE-BOUND
       │                    ╱        (prefill, big-batch matmul)
       │                   ╱
       │   MEMORY-       ╱
       │   BOUND        ╱  slope = peak bandwidth (3.35 TB/s)
       │              ╱
       │   decode    ╱
       │   (AI≈2)   ╱
       │     ●─────╱
       │    ╱      ▲
       └───┴───────┴──────────────────────────────────  arithmetic intensity
          2       ~295 (ridge)                          (FLOPs / byte, log)

The slanted part is the memory roof: throughput rises with arithmetic intensity because each byte you read does more work. The flat part is the compute roof: past the ridge, you are limited by raw FLOPs no matter how much more reuse you get. Your kernel is a dot. Its horizontal position — its arithmetic intensity — tells you which roof it sits under. Decode's dot (AI ≈ 2) sits far down the slanted memory roof. To move it up you either raise arithmetic intensity (read each weight fewer times per token produced) or raise the slope (more bandwidth). You cannot move it by raising the flat ceiling, because it is nowhere near the flat ceiling.

This single graph is the diagnostic the rest of the module uses. Kernel fusion (file 02) pushes the dot rightward by removing redundant HBM reads. Batching (files 04–05) pushes it rightward by sharing each weight-read across many requests. Faster interconnect (file 03) raises a different roofline for multi-GPU collectives. Every layer is "move the dot up the roof or change which roof you're on."

2) The 70B endpoint on the roofline — the running example begins¶

Our endpoint serves Llama-3-70B on H100s. Let's put concrete numbers on the starving.

In BF16, 70B parameters occupy ~140 GB of weights, already more than one 80 GB H100, so the model is sharded across at least two cards (that sharding is file 03's problem). Focus on one decode step for one request. The GPU must read ~140 GB of weights from HBM to produce one token. At 3.35 TB/s, and pretending for simplicity that all 140 GB sit behind a single card's HBM, the read takes about 140 / 3.35 ≈ 42 ms of pure memory traffic. The arithmetic on those weights for a single token is about 140 GFLOPs (2 FLOPs per parameter), roughly 0.14 ms of Tensor Core time at the 989 TFLOPS peak. So ~42 ms of reading wraps a sliver of computing. That is a per-token decode rate around 24 tokens/sec for a single request, and the compute units are idle more than 99% of that window.
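The same arithmetic as a script, so you can swap in your own model size or card. It is a lower-bound sketch under this section's simplifications: batch 1, no sharding, weights read exactly once per token, and 2 FLOPs per parameter per token (an assumption that ignores attention and KV-cache work).

```python
PARAMS = 70e9          # Llama-3-70B parameter count
BYTES_PER_PARAM = 2    # BF16
HBM_BW = 3.35e12       # bytes/s, H100 figure quoted in the text
PEAK_FLOPS = 989e12    # FLOP/s, BF16 figure quoted in the text

weight_bytes = PARAMS * BYTES_PER_PARAM         # bytes streamed per decode step
read_time_s = weight_bytes / HBM_BW             # time spent just moving weights
flops_per_token = 2 * PARAMS                    # assumed: one multiply-add per parameter
compute_time_s = flops_per_token / PEAK_FLOPS   # time the math would take at peak

tokens_per_s = 1.0 / read_time_s                # memory traffic sets the pace
busy_fraction = compute_time_s / read_time_s    # share of the step the math units need

print(f"weight read : {read_time_s * 1e3:6.1f} ms per token")
print(f"math at peak: {compute_time_s * 1e3:6.3f} ms per token")
print(f"decode rate : {tokens_per_s:6.1f} tokens/s (batch 1)")
print(f"math units needed for {busy_fraction:.2%} of the step")
```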

Mini-FAQ. "If it's reading 140 GB per token, how does anyone serve 70B fast?" You stop reading the weights once per token and start reading them once per batch of tokens. If 30 requests decode together, one 140 GB weight-read produces 30 tokens instead of 1. The bytes are amortized 30 ways; arithmetic intensity climbs ~30×; the dot slides right up the memory roof. That is the entire reason continuous batching exists — and why our 280 tokens/sec endpoint is leaving most of the GPU on the table by under-batching.

This is the bottleneck file 04 and file 05 will move. We are flagging it here because the roofline is what explains why batching is the dominant lever: it is the cheapest way to raise arithmetic intensity for a memory-bound workload.

3) Why roofline, and not just "look at GPU utilization %"¶

The tempting alternative diagnostic is the utilization number every dashboard shows: nvidia-smi says 38%. Why not just optimize until that number is high?

Because GPU utilization as reported is a liar for this question. It measures whether any kernel was running on the SMs during a sampling window — not whether the SMs were doing useful math. A memory-bound decode kernel can report 100% "utilization" while the Tensor Cores sit idle, because a kernel is resident and is issuing memory loads; it just isn't computing. You can be "100% utilized" and 1% efficient. The roofline asks a sharper question — what fraction of peak FLOPs or peak bandwidth are you achieving — and it tells you which of the two to chase.

Why this instead of raw utilization under our workload? For LLM serving, the failure mode is high reported utilization with low effective throughput. Roofline catches it because it forces you to name the ceiling and measure against it. Utilization % hides it. Use utilization to spot idle GPUs (file 08); use roofline to spot starving ones.

The honest alternative is a Tensor Core utilization metric: the fraction of cycles the Tensor Core pipe was active. Nsight Compute reports it as tensor pipe utilization, and DCGM exposes it as DCGM_FI_PROF_PIPE_TENSOR_ACTIVE. (DCGM's SM activity metric is a different thing: it says an SM had work resident, not that the Tensor Cores were computing.) That is the right number to watch. Plain "GPU util" is the wrong one.

4) Batch size: the dimension that changes everything¶

For a memory-bound matmul, the one property that changes the design is how many tokens share each weight-read — effectively the batch size during decode. Sweep it and watch the wall move:

Decode batch (concurrent requests)  Weight-reads per token produced  Arithmetic intensity  Wall                                   Effective regime
──────────────────────────────────  ───────────────────────────────  ────────────────────  ─────────────────────────────────────  ─────────────────────────────────
1                                   1 full read per token            ~2                    hard memory-bound                      ~24 tok/s, Tensor Cores idle
8                                   1 read serves 8 tokens           ~16                   memory-bound                           ~8× the single-request throughput
32                                  1 read serves 32 tokens          ~64                   still memory-bound, approaching ridge  near-linear gains continue
256                                 1 read serves 256 tokens         ~300+                 crossing into compute-bound            gains flatten; now FLOPs limit

The surprise is the asymmetry. Going from batch 1 to batch 32 is nearly free throughput — you were paying for the weight-read anyway, and now it serves 32× the tokens. But going from 32 to 256 hits diminishing returns because you cross the ridge into compute-bound territory, where each extra request now costs real FLOPs. The sweet spot for decode is "the largest batch that still fits in memory before you hit the ridge or run out of KV cache room." That tension — batch bigger for throughput vs. memory and latency limits — is the recurring decision in files 04 and 05.
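A rough way to see the asymmetry is to model one decode step as "whichever takes longer: reading the weights once, or doing the batch's math at peak." The sketch below does that for the 70B example. It deliberately ignores KV-cache traffic, sharding, and kernel inefficiency, so treat the numbers as upper bounds on throughput, not predictions.

```python
PARAMS = 70e9          # Llama-3-70B parameter count
BYTES_PER_PARAM = 2    # BF16
HBM_BW = 3.35e12       # bytes/s, H100 figure quoted in the text
PEAK_FLOPS = 989e12    # FLOP/s, BF16 figure quoted in the text


def decode_step(batch: int) -> tuple[float, float, str]:
    """Return (step_time_s, aggregate_tokens_per_s, binding_wall) for one decode step."""
    read_time = PARAMS * BYTES_PER_PARAM / HBM_BW   # weights are read once per step
    math_time = batch * 2 * PARAMS / PEAK_FLOPS     # assumed 2 FLOPs per parameter per token
    step_time = max(read_time, math_time)           # the slower resource sets the pace
    wall = "memory" if read_time >= math_time else "compute"
    return step_time, batch / step_time, wall


for batch in (1, 8, 32, 256, 1024):
    step_time, tok_s, wall = decode_step(batch)
    print(f"batch {batch:>4}: {step_time * 1e3:6.1f} ms/step, "
          f"{tok_s:8.0f} tok/s aggregate, {wall}-bound")
```

In this simplified model the step time stays flat while the batch grows, until the math time overtakes the weight read; for BF16 on these figures that crossover sits at a batch in the high 200s. Past it, aggregate throughput stops growing with batch size.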

5) The failure walked through: the H200 that didn't help¶

A team upgrades from H100 to H200 to fix slow decode. H200 has the same compute as H100 but more bandwidth (~4.8 TB/s vs 3.35 TB/s) and more memory. Decode throughput rises ~40%. The team is confused: they expected the "faster" card to be much faster.

Trace it on the roofline. Decode is on the memory roof, whose slope is bandwidth. H200 raised the slope by ~43%, so memory-bound throughput rose ~43% — exactly tracking bandwidth, not the headline "it's a newer GPU" expectation. Had the workload been compute-bound (large-batch prefill), the H200 would have helped far less, since its peak FLOPs barely changed. The roofline predicted the precise gain before the purchase. A team without it sees "newer card, modest gain" and concludes hardware is disappointing; a team with it knew to expect a bandwidth-proportional gain and to also pursue batching, which is free.
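The pre-purchase prediction can be sketched directly from the roofline rule. The card figures are the ones quoted in this section (same compute, ~4.8 vs 3.35 TB/s); the dictionary layout and function names are illustrative, and the model is idealized, so real gains will differ somewhat.

```python
H100 = {"peak_flops": 989e12, "peak_bw": 3.35e12}
H200 = {"peak_flops": 989e12, "peak_bw": 4.8e12}   # same compute, more bandwidth


def attainable_flops(ai: float, card: dict) -> float:
    """Roofline rule for one card: min(compute ceiling, AI x bandwidth)."""
    return min(card["peak_flops"], ai * card["peak_bw"])


def predicted_speedup(ai: float, old: dict, new: dict) -> float:
    """Idealized speedup of a kernel with this arithmetic intensity when swapping cards."""
    return attainable_flops(ai, new) / attainable_flops(ai, old)


decode_gain = predicted_speedup(2, H100, H200)     # far left of the ridge
prefill_gain = predicted_speedup(500, H100, H200)  # right of the ridge on both cards

print(f"decode  (AI=2):   {decode_gain:.2f}x")
print(f"prefill (AI=500): {prefill_gain:.2f}x")
```

The decode gain equals the bandwidth ratio and the compute-bound gain is 1.0 in this model, which is the team's "modest gain" explained in advance.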

6) The cost movement: what roofline thinking actually buys¶

Roofline is a diagnostic, not an optimization, so its "cost" is analysis time, and its payoff is not wasting money on the wrong wall.

What it fixes: it stops you from spending on compute (cards chosen for their TFLOPS, quantization justified by faster math rather than fewer bytes) when bandwidth is the wall, and it points you at batching and fusion, which are far cheaper.
What it costs: you must instrument arithmetic intensity and effective throughput (Nsight Compute, or back-of-envelope FLOP/byte counts). That is engineer time and a learning curve.
Which subsystem pays: the profiling and the mental discipline. The reward lands in the budget — a correctly-diagnosed memory-bound endpoint reaches target throughput by batching on the same hardware instead of buying 2× the cards.

For the running example: diagnosing decode as memory-bound is what justifies investing in continuous batching (files 04–05) rather than a hardware upgrade. The diagnosis is worth six figures a year in cards not bought.

7) Signals: healthy, first to degrade, and the liar¶

Healthy: for decode, high bandwidth utilization (close to 3.35 TB/s effective on H100) with a large in-flight batch; for prefill, high Tensor Core utilization. Each regime maxes a different ceiling.
First metric to degrade: effective tokens/sec/GPU drops as concurrent requests fall — because small batches re-read weights per token. You see throughput sag during low-traffic windows even though latency looks fine.
The misleading metric: nvidia-smi "GPU-Util %." It can read 90%+ while Tensor Cores idle. Teams celebrate it and stay slow.
The graph an expert opens first: Nsight Compute's memory-vs-compute roofline chart for the hot kernel, or DCGM's DCGM_FI_PROF_PIPE_TENSOR_ACTIVE (Tensor Core active fraction) alongside DCGM_FI_PROF_DRAM_ACTIVE (memory active fraction). When DRAM-active is high and Tensor-active is low, you are confirmed memory-bound.
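Once the two active fractions are sampled, the reading rule can be written down as a small classifier. This sketch does not call DCGM or Nsight; it takes two fractions in [0, 1] that you have already collected, and the 0.6 / 0.2 thresholds are illustrative assumptions to tune for your fleet, not vendor guidance.

```python
def classify_wall(tensor_active: float, dram_active: float,
                  high: float = 0.6, low: float = 0.2) -> str:
    """Classify a workload from Tensor-Core-active and DRAM-active fractions.

    Both inputs are fractions in [0, 1]. The thresholds are hypothetical defaults.
    """
    for name, value in (("tensor_active", tensor_active), ("dram_active", dram_active)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be a fraction in [0, 1], got {value}")

    if dram_active >= high and tensor_active <= low:
        return "memory-bound"          # bytes are flowing, math units are waiting
    if tensor_active >= high:
        return "compute-bound"         # Tensor Cores are the busy resource
    if dram_active <= low and tensor_active <= low:
        return "neither ceiling: look between kernels"
    return "mixed"


print(classify_wall(tensor_active=0.05, dram_active=0.85))
print(classify_wall(tensor_active=0.80, dram_active=0.30))
print(classify_wall(tensor_active=0.05, dram_active=0.10))
```

The third case matters as much as the first two: when neither fraction is high, the hot kernel is not the problem and the roofline is the wrong instrument.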

8) Boundary: where roofline is sharp and where it lies¶

Roofline is sharpest for single dense kernels — a matmul, an attention block — where you can count FLOPs and bytes cleanly. It is the right first instrument for any GPU performance question.

It becomes pathological when the bottleneck is not inside a kernel at all. If your time is going to kernel launch gaps (file 02), to all-reduce over a slow interconnect (file 03), to the scheduler holding a tiny batch (file 04), or to a whole idle GPU (file 08), the roofline of the hot kernel looks fine while the system is slow. Roofline measures a kernel; it does not measure the spaces between kernels or the traffic between GPUs. At cluster scale, the binding constraint is often utilization and interconnect, which roofline cannot see. The naive intuition it invalidates: "the kernel is efficient, so the system is fast." A perfectly roofline-optimal kernel launched 10,000 times with launch gaps between each is still slow — which is exactly the next file's problem.
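A toy calculation makes the launch-gap blind spot concrete. The per-kernel time and the gap below are hypothetical round numbers chosen only for illustration, and the model assumes one idle gap per launch with no overlap.

```python
def launch_gap_cost(kernel_time_s: float, gap_s: float, launches: int) -> tuple[float, float]:
    """Return (total_wall_time_s, fraction_of_time_inside_kernels).

    Assumes every launch is followed by one idle gap and nothing overlaps.
    """
    busy = launches * kernel_time_s
    total = launches * (kernel_time_s + gap_s)
    return total, busy / total


# Hypothetical figures: a 2 microsecond kernel, a 5 microsecond gap, 10,000 launches.
total_s, in_kernel_fraction = launch_gap_cost(kernel_time_s=2e-6, gap_s=5e-6, launches=10_000)

print(f"wall time: {total_s * 1e3:.1f} ms")
print(f"time spent inside kernels: {in_kernel_fraction:.0%}")
```

Every kernel in this toy could sit exactly on its roof and the system would still spend most of its wall time in the gaps, which no per-kernel roofline chart displays.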

9) Wrong model: "more FLOPs = more speed"¶

The seductive wrong idea is that GPU performance is one number — TFLOPS — and that buying more of it makes everything faster. It is the mental model the datasheet encourages and the one that wastes the most money in LLM serving.

Replace it with: a GPU has two ceilings, and the binding one depends on arithmetic intensity. Training and prefill are usually compute-bound (TFLOPS matter). Single-stream decode is brutally memory-bound (bandwidth and batching matter, TFLOPS barely). The same card is "fast" or "slow" depending entirely on which kind of work you hand it. Performance is a property of the workload-on-hardware pair, not of the hardware alone.

10) Other failure shapes to recognize¶

Quantizing weights for "speed" and getting only a memory win. FP8 weights halve bytes read, so memory-bound decode speeds up ~2× — but the gain is from bandwidth, not from the extra FLOPs. People misattribute it to compute.
Profiling prefill, optimizing decode. Benchmarking with one long prompt (compute-bound prefill) and concluding the system is compute-bound, then under-investing in batching that only helps decode.
Believing a small model is "easy." A 7B model still reads 14 GB/token at batch 1 — memory-bound, idle Tensor Cores. Small ≠ compute-bound.
Latency-tuning by shrinking batch. Smaller batches cut queueing latency but crater throughput by re-reading weights; the roofline shows the trade is steep.
Ignoring KV-cache reads. Long contexts read a large KV cache per token, adding memory traffic that can dominate even at good batch sizes.
Assuming NVLink fixes a memory-bound kernel. Interconnect helps multi-GPU collectives, not the per-GPU HBM wall; mixing the two ceilings is common.
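The KV-cache item is easy to underestimate, so here is a rough sizing sketch. The layer count, KV-head count, and head dimension are assumed values for Llama-3-70B (80 layers, 8 KV heads, head dimension 128, BF16 cache) and should be checked against the model config you actually serve; sharding is ignored, as elsewhere in this file.

```python
LAYERS = 80             # assumed Llama-3-70B config
KV_HEADS = 8            # assumed (grouped-query attention)
HEAD_DIM = 128          # assumed
BYTES_PER_ELEMENT = 2   # BF16 cache
WEIGHT_BYTES = 70e9 * 2 # ~140 GB of BF16 weights
HBM_BW = 3.35e12        # bytes/s, H100 figure quoted in the text

# One key vector and one value vector per layer per KV head, for each cached token.
KV_BYTES_PER_TOKEN = 2 * LAYERS * KV_HEADS * HEAD_DIM * BYTES_PER_ELEMENT


def decode_step_traffic(batch: int, context_len: int) -> tuple[float, float]:
    """Return (weight_bytes, kv_cache_bytes) read from HBM in one decode step."""
    kv_bytes = batch * context_len * KV_BYTES_PER_TOKEN  # each request reads its own cache
    return WEIGHT_BYTES, kv_bytes


for batch, context_len in ((1, 8192), (32, 8192), (64, 8192)):
    weights, kv = decode_step_traffic(batch, context_len)
    step_ms = (weights + kv) / HBM_BW * 1e3
    print(f"batch {batch:>2}, context {context_len}: "
          f"weights {weights / 1e9:5.0f} GB, KV {kv / 1e9:6.1f} GB, "
          f"step at least {step_ms:5.1f} ms")
```

Weight reads are shared across the batch but KV reads are not, so under these assumptions the cache traffic overtakes the weight traffic somewhere between batch 32 and 64 at an 8K context. That is why batching alone stops paying off at long contexts.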

11) Pattern transfer¶

Same shape as the CPU memory wall. Decades before GPUs, CPUs hit the same cliff: arithmetic outran DRAM, and caches existed to raise effective arithmetic intensity. GPU shared memory and L2 play the identical role. The constraint — bandwidth, not FLOPs — is the same physics one layer down.
Same pressure as database working-set vs. cache. A query that fits in buffer cache flies; one that spills to disk crawls. "Does the working set fit in the fast tier?" is the roofline question wearing a different hat.
Optimized by the same lever as request batching in serving. Amortizing a fixed cost (weight-read here, RPC overhead there) across many units of work is the recurring move; you will see it again as dynamic batching in file 05.

12) Design test — five questions before you optimize¶

For your hottest kernel, can you state its arithmetic intensity (FLOPs/byte) within an order of magnitude? If not, you cannot name your wall.
Is your serving workload dominated by prefill (compute-bound) or decode (memory-bound)? Your optimization budget should follow the answer.
Are you measuring effective bandwidth/Tensor-Core utilization, or just nvidia-smi GPU-Util? If the latter, you are flying blind.
Before buying a faster card, can you predict the gain from its bandwidth-vs-compute delta relative to your wall?
Is your decode batch large enough to climb the memory roof, or are you re-reading weights per token at batch 1?

Where this appears in production¶

vLLM — its PagedAttention and continuous batching exist precisely to raise decode arithmetic intensity; the whole design is a roofline answer.
NVIDIA TensorRT-LLM — in-flight batching and FP8 both target the memory roof for decode; the docs frame gains in bandwidth terms.
FlashAttention — keeps attention's intermediate matrices in SRAM to avoid HBM round-trips, raising arithmetic intensity; a roofline optimization by construction.
NVIDIA Nsight Compute — ships a built-in roofline chart for exactly this diagnosis.
NVIDIA DCGM — exposes DCGM_FI_PROF_PIPE_TENSOR_ACTIVE and DCGM_FI_PROF_DRAM_ACTIVE so you can read the two ceilings in production.
NVIDIA H200 / B200 launches — marketed heavily on bandwidth (HBM3e), because the buyers running decode are bandwidth-bound and know it.
Meta Llama serving guides — recommend large decode batches specifically to escape the memory wall.
AWS Inferentia / Trainium — designed around high memory bandwidth per chip because they target memory-bound inference.
Google TPU — its large on-chip memory and systolic array are an explicit bet on raising arithmetic intensity.
Character.AI / inference-heavy products — public engineering notes describe aggressive batching and KV-cache compression, both memory-roof moves.
MLPerf Inference — separates offline (throughput, compute-bound friendly) from server (latency-bound) scenarios because the binding ceiling differs.
Roofline model origin (Williams, Waterman, Patterson) — the model is cited because it is the standard tool, not for history; every HPC perf team uses it.
PyTorch profiler / torch.compile — surfaces memory-bound vs compute-bound kernels so you fuse the right ones.
NVIDIA cuBLAS / CUTLASS — autotune matmuls differently for tall-skinny (memory-bound) vs square (compute-bound) shapes.
Cohere / Anthropic / OpenAI serving — batch many users per forward pass; the economics only work because decode amortizes weight-reads across the batch.
Modal / Baseten / Fireworks — inference platforms whose pricing assumes high batch utilization, i.e., bet on staying off the memory floor.

Pause and recall¶

Name the two ceilings a GPU has, and the units each is measured in.
What is arithmetic intensity, and what does it predict?
Why is single-request LLM decode memory-bound while large-batch prefill is compute-bound?
State the roofline rule in one sentence.
Why can nvidia-smi show 90% utilization while the Tensor Cores are idle?
Roughly where is the H100's ridge point, and where does decode (batch 1) sit relative to it?
Why does going from decode batch 1 to 32 give near-linear gains, but 32 to 256 does not?
A team upgrades to a higher-bandwidth, same-compute card and decode speeds up ~40%. Explain why using the roofline.

Interview Q&A¶

Q1. Your 70B endpoint shows 95% GPU utilization but low tokens/sec. Where do you look? A. High reported GPU-Util with low throughput is the classic memory-bound-decode signature: a kernel is resident (so util reads high) but the Tensor Cores are idle waiting on HBM. Check Tensor Core active fraction vs DRAM active fraction in DCGM/Nsight. If DRAM-active is high and Tensor-active is low, raise the decode batch to amortize weight-reads before touching anything else. Common wrong answer to avoid: "Utilization is high, so the GPU is maxed out — we need more GPUs." That buys more of an already-idle resource.

Q2. Why won't quantizing weights from BF16 to FP8 give you a 2× FLOPs speedup on decode? A. Decode is memory-bound, so its speed tracks bytes read, not FLOPs available. FP8 weights halve the bytes read per token, so you get roughly a 2× bandwidth win — but the doubled FP8 compute mostly goes unused because you were never compute-bound. The win is real but it comes from the memory roof, not the compute roof. Common wrong answer to avoid: "FP8 doubles the Tensor Core throughput, so decode doubles." It conflates the compute ceiling with the binding memory ceiling.

Q3. How do you decide between buying H200s and investing engineering time in batching? A. Both raise memory-bound decode throughput, but batching is nearly free (same hardware, raises arithmetic intensity by sharing weight-reads) and bounded by memory/KV room, while H200 raises the bandwidth slope ~40% at hardware cost. Do batching first because it's cheap; buy bandwidth only when batching is maxed and you still need more. Common wrong answer to avoid: "Buy the faster card — it's simpler." It spends capital before exhausting the free lever and may not even target the binding ceiling.

Q4. When is the roofline the wrong tool? A. When the bottleneck lives between kernels or between GPUs: launch overhead, all-reduce on slow interconnect, scheduler starving the batch, or a fully idle GPU. Roofline measures a single kernel's efficiency; it cannot see launch gaps, collective traffic, or fleet utilization. A roofline-perfect kernel can sit inside a slow system. Common wrong answer to avoid: "Roofline always tells you the system bottleneck." It only tells you the kernel's wall.

Q5. Why is a 7B model still memory-bound at batch 1, despite being "small"? A. Batch 1 reads all ~14 GB of weights to produce one token — arithmetic intensity ≈ 2, far below the ridge. Smallness changes how much memory you read, not the FLOPs-per-byte ratio. Small models are memory-bound for the same structural reason large ones are. Common wrong answer to avoid: "Small models are compute-bound because they fit in memory." Fitting in memory has nothing to do with the per-token reuse ratio.

Q6. (Cumulative.) Given this file's diagnosis, predict which later layer moves the decode bottleneck and which does not. A. Continuous batching (file 04) and dynamic batching (file 05) move it directly by raising arithmetic intensity. Kernel fusion (file 02) helps by cutting redundant HBM reads inside attention. NCCL/interconnect (file 03) helps multi-GPU collective time but not the single-GPU HBM wall. MIG (file 08) does not help a 70B model that already needs multiple full GPUs — it's for packing small models. The roofline tells you which is which. Common wrong answer to avoid: "MIG will improve our 70B decode by sharing the GPU." MIG subdivides a card; a 70B needs more than one whole card, so MIG is irrelevant to it.

Design/debug exercise (10 min)¶

Step 1 — Model it. For Llama-3-70B in BF16 (140 GB weights) on one H100 (3.35 TB/s), compute the lower bound on per-token decode latency at batch 1, ignoring sharding: 140 GB / 3.35 TB/s ≈ 42 ms, so ≈ 24 tokens/sec, with Tensor Cores idle ~99% of the window. Note that this is set by bandwidth, not compute.

Step 2 — Your turn. Redo the estimate at decode batch 32. The weight-read is amortized across 32 tokens, so per-token effective bandwidth cost drops ~32×; estimate the new aggregate tokens/sec and check where batch 32 sits relative to the ridge (AI ≈ 64, still below ~295). Conclusion: you are still on the memory roof, so larger batches keep helping — until KV-cache memory runs out.

Step 3 — Reproduce from memory. Redraw the roofline graph: the slanted memory roof (slope = bandwidth), the flat compute roof, the ridge near AI ≈ 295 on H100, and decode's dot near AI ≈ 2. Then state in one sentence how this connects to the running example's path to 2000 tokens/sec: every later layer either slides decode's dot rightward (fusion, batching) or fixes a cost the roofline can't see (interconnect, idle GPUs).

Operational memory¶

This chapter explained why a 70B endpoint can sit at low throughput while the GPU "looks busy," and why faster compute usually does nothing about it. The important idea is that a GPU has two ceilings — compute (FLOPs/sec) and memory bandwidth (bytes/sec) — and the binding one is decided by arithmetic intensity, the FLOPs you do per byte you read. LLM decode does almost no arithmetic per byte, so it is pinned against the memory roof, far below the compute ceiling the datasheet advertises.

You learned to diagnose the wall by counting FLOPs and bytes, computing arithmetic intensity, and comparing it to the H100's ridge point near 295 FLOPs/byte. That solves the opening failure because it tells you the fix for low throughput is to raise arithmetic intensity — chiefly by batching, so one weight-read serves many tokens — rather than buying compute you weren't using.

Carry this diagnostic forward: when an endpoint is slow, ask "which ceiling am I against?" before optimizing. If you see high GPU-Util with low tokens/sec, inspect Tensor Core active vs DRAM active before blaming the model or buying cards.

Remember:

A GPU has two ceilings; the binding one depends on arithmetic intensity, not on the datasheet TFLOPS.
Decode is memory-bound (AI ≈ 2); prefill/training is compute-bound (AI in the hundreds). Same card, opposite walls.
nvidia-smi GPU-Util can read high while Tensor Cores idle — measure DRAM-active vs Tensor-active instead.
Batching is the cheapest lever for memory-bound decode: it raises arithmetic intensity by sharing each weight-read.
The next pressure is the gaps between kernels and the redundant reads inside them — costs the roofline can't see, which kernel fusion attacks.

Bridge. Roofline tells you a kernel is memory-bound and that you must read fewer bytes per result. But it assumes the kernel runs as one clean block. Real models run thousands of tiny kernels back to back, each launched separately, each writing its output to HBM only for the next to read it right back — paying the memory tax we just learned to fear, over and over. The next file shows what a kernel actually is, why launching many small ones wastes both launch time and bandwidth, and how fusing them into one — as FlashAttention does — slides decode's dot rightward on the roofline we just drew. → 02-cuda-kernels-and-fusion.md