02. CUDA kernels and fusion — the tax you pay between operations¶

~20 min read. The roofline said decode is memory-bound. Now watch a model that should be one big computation shatter into ten thousand tiny ones, each launched separately, each round-tripping its result through HBM. The waste isn't in the math. It's in the seams between the math.

Built on the first-principles overview and roofline. The invariant is still feed the beast. This file attacks two sources of stall the roofline of a single kernel can't see: launch overhead (the fixed cost of starting a kernel) and the HBM round-trip between unfused operations. The round-trip burns the memory bandwidth roofline told us was scarce; launch overhead leaves that bandwidth idle while the GPU waits for its next kernel.

What roofline left unsolved¶

Roofline gave you a verdict on one kernel: count its FLOPs and bytes, find its wall. But it quietly assumed the model is one kernel. It is not. A forward pass through a transformer layer is a long chain of small operations — a matmul, then add a bias, then apply a nonlinearity, then scale, then another matmul, then a softmax, then a dropout mask. Each of those, in a naive implementation, is a separate kernel: a separate program the CPU asks the GPU to run, with its own start cost, that reads its inputs from HBM and writes its output back to HBM before the next one begins.

So even when every individual kernel is roofline-efficient, the system bleeds time in two places roofline cannot see: the microseconds spent launching each tiny kernel, and the bandwidth spent writing an intermediate result to HBM only to read it straight back into the next kernel. This file shows where that tax comes from, why it dominates for the small operations LLM decode is full of, and how fusing the chain into one kernel — keeping intermediates in fast on-chip memory — removes both costs at once. FlashAttention is the canonical example, and we'll trace it.

What this file solves¶

Our 70B endpoint's profile shows the GPU "busy" but with long stretches where it is neither computing nor reading useful data — it's between kernels. The naive read is "the model is just big." The real cause is granularity: too many tiny kernels, each paying a fixed launch cost and an HBM round-trip. This file teaches you to recognize that pattern in a profile and to fix it by fusion — collapsing a chain of memory-bound element-wise ops, or a whole attention block, into a single kernel that touches HBM once.

What a kernel actually is¶

A kernel is a single function compiled to run across many GPU threads at once. You launch it from the CPU: "run this function on this grid of threads, with these input pointers." The GPU schedules the thread blocks onto its SMs, they all execute the same code on different data, and when the grid finishes, the kernel is done and its output sits in HBM.

Two facts about that lifecycle matter for performance. First, launching a kernel is not free — the driver has to set up the grid, push the launch onto a stream, and the GPU has to pick it up. That overhead is on the order of a few microseconds per launch. Second, a kernel's inputs and outputs live in HBM by default. Whatever a kernel produces, it writes to HBM; whatever the next kernel needs, it reads from HBM. On-chip memory (registers, shared memory) is scratch — it does not persist across kernel boundaries.

   CPU                         GPU
    │  launch K1  ───────────▶  read A from HBM ─▶ compute ─▶ write B to HBM
    │  launch K2  ───────────▶  read B from HBM ─▶ compute ─▶ write C to HBM
    │  launch K3  ───────────▶  read C from HBM ─▶ compute ─▶ write D to HBM
    │             ▲                        ▲
    │       launch overhead          B and C cross the HBM cliff twice each
    │       (µs per kernel)          even though no one else needs them

Look at B and C in that picture. They are intermediates — produced by one kernel, consumed only by the next. Yet each crosses the HBM cliff twice: once written, once read. For element-wise operations (add a bias, apply GELU, scale), the arithmetic is trivial and the bytes are everything, so this round-trip is the cost. Roofline already told us bytes are the scarce resource for these ops. Unfused chains spend that scarce resource on intermediates nobody outside the chain will ever look at.

The naive implementation that bleeds bandwidth¶

Write a transformer feed-forward block the obvious way and you get something like: y = matmul(x, W1); y = y + b1; y = gelu(y); z = matmul(y, W2); z = z + b2. Five operations, five kernels. The two matmuls are real compute. The bias-adds and the GELU are element-wise — pure memory traffic. Each reads a large activation tensor from HBM, does one cheap operation per element, and writes the same-sized tensor back.

The visible break: profile this and you find the element-wise kernels, which do almost no math, take a surprising share of wall-clock time, and the SMs report "busy" the whole time because they're streaming bytes. You move to a GPU with more FLOPs; the element-wise kernels barely speed up because they were bandwidth-bound, and the launch gaps don't speed up at all because launch cost is fixed per kernel and does not depend on the GPU's compute throughput.

So the real problem is not that GELU or bias-add are slow operations; each is among the cheapest things on the chip. It is that materializing their inputs and outputs in HBM, and launching a separate kernel for each, pays both taxes (the launch and the round-trip) once per operation. The arithmetic is free; the seams are expensive.

So how do we do the same math without crossing the HBM cliff between every step?

When two ops should have been one¶

Take the smallest case: y = gelu(x + b). Two kernels naively. Kernel one reads x and b from HBM, writes x + b to HBM. Kernel two reads x + b from HBM, writes gelu(x + b) to HBM. The tensor x + b made a full round-trip to HBM and back for nothing — no one but kernel two ever wanted it.

Fuse them: one kernel reads x and b, computes x + b into a register, immediately applies GELU to that register, and writes only the final result. Same math. The intermediate never leaves the chip. You read the input once, write the output once. The HBM traffic for the intermediate — half the memory traffic of the whole operation — is gone. And you launched one kernel instead of two.
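The "half the traffic" claim can be checked with a small byte count. This is a minimal sketch, assuming 2-byte elements and a hypothetical activation shape; `chain_traffic_bytes` is an illustrative helper, not a library function. It also shows how the saving changes if `b` is a full-size tensor rather than a small broadcast bias.

```python
def chain_traffic_bytes(n_elems, bias_elems, bytes_per_elem=2):
    """Return (unfused, fused) HBM bytes moved for y = gelu(x + b)."""
    x = n_elems * bytes_per_elem       # size of x, of t = x + b, and of y
    b = bias_elems * bytes_per_elem    # size of the bias operand
    # Unfused: kernel 1 reads x and b, writes t; kernel 2 reads t, writes y.
    unfused = (x + b + x) + (x + x)
    # Fused: read x and b once, write y once; t only ever lives in a register.
    fused = x + b + x
    return unfused, fused

T = 8192 * 8192                        # hypothetical activation: 8192 x 8192
# Case 1: b is a broadcast bias vector (one value per column).
unfused_vec, fused_vec = chain_traffic_bytes(T, bias_elems=8192)
saved_vec = 1 - fused_vec / unfused_vec
# Case 2: b is a full-size tensor, same shape as x.
unfused_full, fused_full = chain_traffic_bytes(T, bias_elems=T)
saved_full = 1 - fused_full / unfused_full

print(f"bias vector: saved {saved_vec:.1%} of HBM traffic")
print(f"full tensor: saved {saved_full:.1%} of HBM traffic")
```

With a bias vector the intermediate accounts for essentially half of the unfused traffic; with a full-size `b` the saving is smaller because the extra read of `b` is paid in both versions.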

Rule: fuse to make each byte cross the HBM cliff once¶

The fusion rule. Chain memory-bound operations into a single kernel so intermediates stay in registers/shared memory and never round-trip to HBM; this raises arithmetic intensity (more work per byte read) and cuts launch count. Fuse where ops are element-wise or where a small tile can be reused; don't fuse where the intermediate is genuinely needed elsewhere or won't fit on-chip.

Why the rule exists. The primitive is that HBM bandwidth is the scarce resource for these ops and launch cost is fixed per kernel. The constraint is that on-chip memory is tiny and doesn't persist across kernel boundaries. Fusion sidesteps both: it keeps the intermediate inside the lifetime of a single kernel, where registers and shared memory remain valid, so the cliff is crossed once and the launch is paid once. The naive per-op implementation breaks because it treats kernel boundaries as free when each boundary is an HBM round-trip plus a launch.

1) Launch overhead — when tiny kernels drown in their own setup¶

The second tax is launch overhead, and it dominates a different regime: many small kernels. Each launch costs a few microseconds regardless of how much work the kernel does. If a kernel runs for 500 µs, a 3 µs launch is noise. If a kernel runs for 4 µs — common in decode, where batch is small and tensors are tiny — the launch is comparable to the work itself. You can spend nearly half your time starting kernels.

Decode is the worst case. At small batch, every operation handles a one-token-wide slice, so each kernel is tiny and fast, and a single decode step fires dozens of them. Multiply dozens of kernels × thousands of decode steps × a few µs launch each and you get tens to hundreds of milliseconds of pure launch overhead per request (for example, 36 kernels × 1,000 steps × 3 µs ≈ 108 ms), with the GPU idle in the gaps.

The standard defense, beyond fusion, is CUDA Graphs: capture the whole sequence of kernel launches once, then replay the entire graph with a single submission. The CPU stops issuing launches one-by-one; it tells the GPU "run this whole recorded sequence." Launch overhead drops from per-kernel to per-graph.

Without CUDA Graph:  launch─gap─launch─gap─launch─gap─...   (CPU issues each, GPU waits)
With CUDA Graph:     [────────── replay recorded graph ──────────]   (one submission)
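The two timelines above can be turned into a rough cost model. This is a sketch under a deliberately pessimistic assumption: launches are not hidden behind GPU work, so every launch shows up as a gap. The function name and all numbers (40 kernels, 4 µs each, 3 µs per launch, 10 µs per graph submission) are hypothetical, chosen only to match the magnitudes discussed in this section.

```python
def step_time_us(n_kernels, kernel_us, launch_us, graph_submit_us=None):
    """Rough wall-clock for one decode step in the launch-bound case.

    Assumes launch overhead is fully exposed (not overlapped with GPU work).
    """
    work = n_kernels * kernel_us
    if graph_submit_us is None:
        overhead = n_kernels * launch_us   # one launch per kernel
    else:
        overhead = graph_submit_us         # one submission per graph replay
    return work + overhead

eager = step_time_us(n_kernels=40, kernel_us=4, launch_us=3)
graphed = step_time_us(n_kernels=40, kernel_us=4, launch_us=3, graph_submit_us=10)
launch_share = (40 * 3) / eager

print(f"eager step:   {eager} us ({launch_share:.0%} spent launching)")
print(f"graph replay: {graphed} us")
```

The work term is identical in both cases; only the overhead term changes from per-kernel to per-graph, which is the whole effect of replaying a captured graph.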

For the running example, decode at small batch is exactly where launch overhead bites; TensorRT-LLM (file 04) uses CUDA Graphs internally for precisely this reason. We flag it here so that when you see it later, you know what problem it solves.

Teacher voice. There are two different wastes hiding in "the GPU is busy but slow." One is bandwidth on intermediates — fixed by fusion. The other is launch overhead on tiny kernels — fixed by fusion and by CUDA Graphs. They feel the same in a dashboard. They are different walls. Fusion happens to help both, which is why it's the highest-leverage kernel optimization.

2) The picture — unfused vs fused attention¶

The mental model that lands fusion is attention, because attention's naive form materializes a giant intermediate that doesn't need to exist.

UNFUSED ATTENTION (naive)                FUSED ATTENTION (FlashAttention)
─────────────────────────                ──────────────────────────────
 Q,K in HBM                               Q,K,V in HBM
   │ read                                   │ read a TILE of Q,K,V
   ▼                                         ▼  (into SRAM / shared memory)
 S = Q·Kᵀ   ──write──▶ HBM   ← N×N!        compute partial scores for the tile
   │ read                                    │ running softmax in SRAM
   ▼                                         │ accumulate output for the tile
 P = softmax(S) ─write─▶ HBM ← N×N!          ▼  (intermediate S, P never hit HBM)
   │ read                                  next tile ... repeat
   ▼                                         ▼
 O = P·V    ──write──▶ HBM                 O = result ──write──▶ HBM (once)

 HBM traffic ∝ N²  (the score matrix)     HBM traffic ∝ N  (tiles only)

The naive version computes the full N×N attention score matrix S, writes it to HBM, reads it back to softmax it, writes that to HBM, reads it back to multiply by V. For a 4096-token context, N×N is about 16.8 million entries per head per layer (32 MiB at 2 bytes each), written and read multiple times. That matrix is the single biggest memory hog in attention, and no one outside the attention block ever needs it.

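FlashAttention never materializes it. It walks the sequence in tiles small enough to fit in SRAM, computes scores for a tile, runs an online (streaming) softmax that updates a running max and a running normalizer, accumulates the output, and moves to the next tile. The full score matrix exists only as a sequence of small tiles in on-chip memory. HBM traffic for S and P drops from O(N²) to zero; what remains is the linear-size Q, K, V, and O, plus re-streaming K and V tiles once per Q tile. The O(N) shorthand used in this file is exact for the extra memory and a first-order picture for traffic: the FlashAttention paper bounds HBM accesses at O(N²d²/M) for head dimension d and SRAM size M, versus Θ(Nd + N²) for the naive path, which is many times fewer at typical sizes. Same attention output, dramatically fewer bytes across the cliff: the dot slides right on the roofline.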
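The running-max and running-sum update is easiest to see for a single query row. The following is an illustration in plain Python, not FlashAttention's actual kernel: the function names, the tile size, and the toy inputs are hypothetical, and the usual 1/√d scaling is omitted for brevity. The tiled version only ever holds `tile` scores at a time, yet matches the version that materializes all N scores.

```python
import math

def naive_attention_row(q, K, V):
    """Exact attention for one query; materializes all N scores at once."""
    scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in K]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    return [sum(w * v[j] for w, v in zip(weights, V)) / total
            for j in range(len(V[0]))]

def tiled_attention_row(q, K, V, tile=2):
    """Same result, but only `tile` scores exist at any moment."""
    running_max = float("-inf")
    running_sum = 0.0
    acc = [0.0] * len(V[0])            # unnormalized output accumulator
    for start in range(0, len(K), tile):
        k_tile = K[start:start + tile]
        v_tile = V[start:start + tile]
        scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in k_tile]
        new_max = max(running_max, max(scores))
        # Rescale everything accumulated under the old max to the new max.
        scale = math.exp(running_max - new_max)
        running_sum *= scale
        acc = [a * scale for a in acc]
        for s, v in zip(scores, v_tile):
            w = math.exp(s - new_max)
            running_sum += w
            acc = [a + w * vj for a, vj in zip(acc, v)]
        running_max = new_max
    return [a / running_sum for a in acc]

q = [0.5, -1.0]
K = [[1.0, 0.0], [0.0, 1.0], [2.0, -1.0], [-1.0, 0.5], [0.3, 0.3]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0], [9.0, 10.0]]
ref = naive_attention_row(q, K, V)
out = tiled_attention_row(q, K, V, tile=2)
print(ref, out)
```

The two results agree up to floating-point rounding, because the rescaling step is an algebraic identity, not an approximation.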

3) FlashAttention through the running example¶

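Our 70B endpoint serves chat, and chat means growing contexts: a long conversation can run thousands of tokens. With unfused attention, processing an N-token context in one pass (prefill, or any re-encoding of the conversation) costs memory traffic that scales with the square of context length, because of that N×N score matrix round-tripping through HBM. A 4k-token conversation is paying ~16× the score-matrix traffic of a 1k one. A decode step that reuses a KV cache computes only a single 1×N row of scores, so the quadratic penalty lands mainly on the prefill side of the endpoint.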
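The ~16× figure follows directly from the size of the score matrix. A minimal sketch, assuming 2-byte (FP16/BF16) scores and counting a single copy of S for one head in one layer; `score_matrix_bytes` is an illustrative helper.

```python
def score_matrix_bytes(n_tokens, bytes_per_elem=2):
    """Bytes needed to hold one N x N score matrix (one head, one layer)."""
    return n_tokens * n_tokens * bytes_per_elem

small = score_matrix_bytes(1024)
large = score_matrix_bytes(4096)
ratio = large / small

print(f"1k context: {small / 2**20:.0f} MiB")
print(f"4k context: {large / 2**20:.0f} MiB")
print(f"ratio: {ratio:.0f}x")
```

Each write or read of S and P in the naive path moves this many bytes again, so the real unfused traffic is a small multiple of these numbers.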

FlashAttention changes the scaling. By tiling and never materializing the score matrix, attention's HBM traffic grows linearly with context, not quadratically. For the endpoint, that means long conversations stop being bandwidth catastrophes, and the attention portion of each decode step gets both faster and far more memory-frugal. Combined with the batching from file 04, this is a large part of how the endpoint climbs toward target throughput: fusion cuts the per-step bytes, batching shares them across requests.

Mini-FAQ. "Is FlashAttention an approximation? Does it change the math?" No. It computes exact attention. The online-softmax trick is an algebraic reformulation that lets softmax be computed incrementally over tiles with a running max and running sum, so the final result matches standard attention up to floating-point rounding (the summation order differs, so the last bits can differ). You lose nothing in quality; you only avoid storing the intermediate.

4) Why fusion and not "just buy bandwidth"¶

The alternative to fusion is to accept the HBM round-trips and pay for enough bandwidth to make them cheap — i.e., a higher-bandwidth card. Why fuse instead?

Because fusion removes the traffic; bandwidth only makes the wasted traffic faster. Under our decode workload, the bytes spent on intermediates are pure waste — they exist only because of kernel granularity. Buying bandwidth speeds up both useful and wasted reads proportionally; fusion deletes the wasted ones, so it improves effective throughput more than the same money spent on hardware, and it stacks with whatever card you have. The one case where you'd lean on bandwidth instead is when fusion isn't possible — the intermediate is genuinely consumed by multiple downstream ops, or it's too large to keep on-chip.

Why this instead of the alternative, under our workload. Decode fires many tiny, memory-bound kernels with throwaway intermediates. Fusion targets exactly that structure — it's the highest-leverage move because it attacks both taxes (bandwidth on intermediates, launch count) at once. Bandwidth upgrades attack only one and don't reduce launch count at all.

5) The property that decides whether to fuse: data reuse and on-chip fit¶

Fusion pays off in proportion to how much HBM traffic the intermediate would have caused and whether the working tile fits on-chip. Three regimes:

Op chain	Intermediate size	Fits on-chip?	Fusion payoff
bias + GELU (element-wise)	same as activation	yes, streamed	high — deletes a full round-trip, trivial to fuse
attention scores (Q·Kᵀ → softmax → ·V)	N×N, huge	only in tiles	very high — `O(N²)` → `O(N)` traffic
two large independent matmuls	full output tensors	no	low — outputs are real, reused downstream; don't force it

The skill is recognizing the first two patterns and leaving the third alone. Element-wise epilogues (add, activation, scale, dropout) almost always fuse into the preceding matmul — modern libraries call these "fused epilogues." Attention fuses into FlashAttention. Two genuinely separate heavy matmuls whose outputs feed many consumers should not be jammed together; their intermediates aren't waste.

6) The failure walked through: a custom kernel that got slower¶

A team writes a custom fused kernel for an attention variant to "skip the library overhead." It runs correctly but slower than the unfused PyTorch version. Confusing — fusion should be faster.

Trace it. Their kernel tried to keep an entire score-matrix tile on-chip, but they sized the tile too large. The per-thread working set no longer fit in registers, so the compiler spilled it to local memory, which despite its name lives in HBM. (Shared memory itself cannot overflow this way: an H100 SM has 256 KB of combined L1/shared storage, of which at most 228 KB can be configured as shared memory, and a kernel that requests more than the limit fails to launch.) So their "fused" kernel was secretly round-tripping through HBM anyway, and now with worse memory access patterns than the well-tuned library kernels. The fusion was nominal; the intermediate didn't actually stay on-chip. The fix was to shrink the tile until it fit in shared memory and registers, restoring the O(N) traffic profile. The lesson: fusion only helps if the intermediate actually stays on-chip. A fused kernel that spills is an unfused kernel wearing a costume.
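A back-of-envelope budget check catches an oversized tile before any kernel is written. This is a sketch under stated assumptions: the layout (one Q tile plus one K tile and one V tile resident at once), the helper names, and the tile shapes are hypothetical, and the 227 KB per-block budget is an assumed value to be replaced with your GPU's documented limit. It does not model registers, which need their own check in the compiler output.

```python
def tile_smem_bytes(block_m, block_n, head_dim, bytes_per_elem=2):
    """Shared-memory bytes for one Q tile plus one K tile and one V tile."""
    q_tile = block_m * head_dim
    k_tile = block_n * head_dim
    v_tile = block_n * head_dim
    return (q_tile + k_tile + v_tile) * bytes_per_elem

def fits(block_m, block_n, head_dim, smem_budget_bytes):
    """True if the assumed tile layout stays within the shared-memory budget."""
    return tile_smem_bytes(block_m, block_n, head_dim) <= smem_budget_bytes

SMEM_BUDGET = 227 * 1024   # assumed per-block limit; check your GPU's documentation

modest = tile_smem_bytes(128, 128, 128)
oversized = tile_smem_bytes(512, 512, 128)
print(f"128x128 tiles: {modest // 1024} KB, fits={fits(128, 128, 128, SMEM_BUDGET)}")
print(f"512x512 tiles: {oversized // 1024} KB, fits={fits(512, 512, 128, SMEM_BUDGET)}")
```

A tile that passes this check can still hurt occupancy, so treat it as a necessary condition for staying on-chip, not a sufficient one.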

7) Cost movement: what fusion buys and what it costs¶

What it fixes: deletes HBM round-trips for intermediates (raising effective arithmetic intensity) and cuts kernel launch count (cutting launch overhead). For attention specifically, turns O(N²) memory traffic into O(N).
What it costs: engineering effort and rigidity. A fused kernel is a hand-tuned (or compiler-generated) artifact specialized to shapes, dtypes, and hardware. It is harder to write, harder to debug, and must be re-tuned for new GPU generations.
Which subsystem pays: the kernel author and the build/compile step. The reward lands at runtime as lower latency and higher throughput. This is the same trade you'll see amplified in file 04: TensorRT-LLM spends build time autotuning and fusing so that runtime is fast.

Concretely, replacing naive attention with FlashAttention on a long-context decode can cut the attention block's memory traffic by an order of magnitude and noticeably raise tokens/sec — without changing the model's outputs at all.

8) Signals: healthy, first to degrade, and the liar¶

Healthy: few, large kernels per forward step; high fraction of HBM traffic going to weights/inputs rather than intermediates; small gap time between kernels in the timeline.
First metric to degrade: kernel count per step climbs and average kernel duration falls (the model "fragmented"), often after adding a custom layer that didn't fuse. Launch-bound symptoms appear: GPU idle gaps between kernels in the Nsight Systems timeline.
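The misleading metric: total GPU-Util again. It reports the fraction of time during which at least one kernel was executing, not how much of the GPU that kernel used, so fragmented, launch-bound execution can still report high util because some kernel is usually resident. It hides the gaps.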
The graph an expert opens first: the Nsight Systems timeline (the per-kernel Gantt view). Tiny kernels separated by visible gaps mean launch overhead; large intermediate writes to HBM mean missing fusion. Both jump out visually in a way a single number never does.

9) Boundary: where fusion helps and where it can't¶

Fusion shines for memory-bound, element-wise, or tile-reusable operations — exactly the chains LLM decode is full of. It is close to mandatory for attention and for activation epilogues.

It becomes pathological or pointless in three places. When the operation is already compute-bound (a large square matmul at high batch), there is no idle bandwidth to reclaim — fusion adds complexity for no gain. When the intermediate is genuinely needed by several downstream consumers, "fusing" it just recomputes it, trading bandwidth for redundant FLOPs. And when the working set can't fit on-chip, the fused kernel spills to HBM and you're back where you started, often worse. The scale limit that invalidates naive intuition: people assume "more fusion = always faster," but past the point where tiles fit in shared memory, additional fusion slows down via register/shared-memory pressure and occupancy loss.
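The occupancy loss from over-fusion can be estimated from register use alone. This is a simplified sketch: the defaults (65,536 registers per SM, at most 64 resident warps per SM, 32 threads per warp) are assumed values for recent NVIDIA data-center GPUs, and the model ignores register allocation granularity and the separate limits imposed by shared memory and block count. `resident_warps` is an illustrative helper.

```python
def resident_warps(regs_per_thread, regs_per_sm=65536,
                   max_warps_per_sm=64, warp_size=32):
    """Upper bound on warps resident on one SM, limited by registers only."""
    by_registers = regs_per_sm // (regs_per_thread * warp_size)
    return min(max_warps_per_sm, by_registers)

light = resident_warps(32)     # small kernel, few live values per thread
heavy = resident_warps(128)    # heavily fused kernel, many live values
maxed = resident_warps(255)    # at the per-thread register ceiling

for regs, warps in ((32, light), (128, heavy), (255, maxed)):
    print(f"{regs:>3} regs/thread -> at most {warps} resident warps")
```

Fewer resident warps means fewer candidates to switch to while one warp waits on memory, which is how an over-fused kernel loses its latency hiding.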

10) Wrong model: "the math is what's slow"¶

The seductive wrong idea is that a slow model is doing too much arithmetic, so the fix is fewer or cheaper FLOPs. For memory-bound LLM ops, the arithmetic is the cheap part. The time goes to moving bytes across the HBM cliff and to starting kernels.

Replace it with: in memory-bound regimes, data movement and kernel granularity are the cost, not the math. The fastest way to speed up a chain of cheap operations is usually not to make each cheaper but to stop moving their intermediates through HBM and stop launching them separately. Fusion is "do the same math with fewer trips across the cliff."

11) Other failure shapes to recognize¶

Launch-bound decode. Dozens of tiny kernels per step, each a few µs, with the GPU idle between them. Fix: CUDA Graphs + fusion.
The phantom score matrix. Long-context attention quietly allocating and round-tripping N×N per head per layer. Fix: FlashAttention-style tiling.
Spilled "fused" kernel. A custom kernel whose tile is too large for registers and shared memory, so its working set spills into HBM-backed local memory, negating the fusion. Fix: shrink tiles to fit on-chip.
Over-fusion occupancy collapse. Jamming too much into one kernel so register pressure cuts the number of resident warps, killing the latency-hiding that depends on many warps. Fix: back off fusion to restore occupancy.
Framework eager mode. Running ops one-at-a-time in eager PyTorch with no graph capture, paying full launch and round-trip cost. Fix: torch.compile, CUDA Graphs, or a compiled engine.
dtype thrash. Casting between FP16/FP32 between ops, materializing converted intermediates in HBM. Fix: fuse the cast into the consuming kernel.

12) Pattern transfer¶

Same shape as the roofline lesson. Fusion is literally "slide the dot rightward" — it raises arithmetic intensity by cutting byte traffic. It is the roofline's prescription made concrete at kernel granularity.
Same pressure as amortizing fixed cost. Launch overhead is a fixed per-operation cost; CUDA Graphs amortize it the way batching amortizes weight-reads (file 04) and the way dynamic batching amortizes serving overhead (file 05). One pattern, three layers.
Same shape as I/O coalescing in storage systems. A database that batches many small writes into one sequential flush avoids per-write overhead exactly as fusion avoids per-kernel overhead. The constraint — fixed cost per operation, cheap cost per byte once started — recurs.

13) Design test — five questions before you hand-write a kernel¶

Is the chain you want to fuse memory-bound (element-wise / tile-reusable), or already compute-bound (no idle bandwidth to reclaim)?
Will the fused intermediate actually fit in shared memory and registers, or will it spill to HBM?
Is the intermediate truly throwaway, or do other consumers need it (in which case fusion recomputes it)?
Are your symptoms bandwidth on intermediates (fix with fusion) or launch overhead on tiny kernels (fix with CUDA Graphs + fusion)?
Before writing a custom kernel, have you tried the library's fused path (torch.compile, fused epilogues, FlashAttention) — which is already tuned for shape and hardware?

Where this appears in production¶

FlashAttention (Dao et al.) — the canonical fused attention kernel; keeps the score matrix in SRAM, turning O(N²) HBM traffic into O(N). Standard in essentially every modern LLM stack.
NVIDIA TensorRT-LLM — fuses attention, layernorm, and activation epilogues at build time and uses CUDA Graphs to kill decode launch overhead.
PyTorch torch.compile / TorchInductor — automatically fuses element-wise chains and generates Triton kernels for them.
OpenAI Triton (the kernel language) — lets engineers write fused kernels in Python-like syntax; widely used for custom attention and MoE kernels. (Distinct from NVIDIA's Triton Inference Server in file 05 — same word, different thing.)
NVIDIA CUTLASS — provides fused-epilogue GEMM templates so a matmul can apply bias/activation without a second kernel.
NVIDIA cuDNN — fuses convolution + bias + activation for CNNs; the same idea predates transformers.
CUDA Graphs — used across training and inference to replay recorded launch sequences in one submission, eliminating per-kernel launch cost.
vLLM — integrates FlashAttention/PagedAttention kernels; its decode path depends on fused attention to stay on the memory roof.
xFormers (Meta) — memory-efficient fused attention used in diffusion and LLM serving.
JAX / XLA — fuses operation chains at the compiler level (operator fusion is a core XLA pass).
Apple MLX / Metal Performance Shaders — fuse element-wise epilogues on-device for the same bandwidth reasons.
DeepSpeed / Megatron-LM — ship fused LayerNorm, fused Adam, and fused softmax kernels for training throughput.
TensorRT (vision) — its layer-fusion pass merges conv+bn+relu into single kernels; the inference-graph ancestor of TensorRT-LLM's fusion.
Triton Inference Server ensembles — at a higher layer, fuse preprocessing+model+postprocessing into one server-side pipeline (file 05) — the same "don't round-trip between stages" idea one level up.
FlashAttention-3 on Hopper — exploits H100-specific async copy and Tensor Core features to push fused attention closer to the compute roof.

Pause and recall¶

What two costs does an unfused chain of kernels pay that a fused kernel avoids?
Why do element-wise ops (bias, GELU) waste bandwidth when run as separate kernels?
What does FlashAttention avoid materializing, and how does that change attention's memory traffic scaling?
Is FlashAttention an approximation? Why or why not?
What is launch overhead, and in which regime (prefill or decode) does it bite hardest?
What do CUDA Graphs fix, and how?
Give one case where you should not fuse.
A "fused" custom kernel runs slower than the unfused version. Name the most likely cause.

Interview Q&A¶

Q1. Walk me through why FlashAttention is faster than naive attention if it computes the same result. A. Naive attention materializes the N×N score matrix in HBM, writing and reading it across softmax — O(N²) memory traffic that dominates for long contexts. FlashAttention tiles Q/K/V into SRAM, computes a streaming (online) softmax, and accumulates the output without ever storing the full score matrix, cutting traffic to O(N). Same math, far fewer bytes across the HBM cliff, so a memory-bound op gets dramatically faster. Common wrong answer to avoid: "It approximates attention to save compute." It's exact; the win is memory traffic, not FLOPs.

Q2. Your decode path shows the GPU idle in small gaps between many tiny kernels. What's happening and how do you fix it? A. That's launch-bound execution: at small decode batch each kernel is a few µs, comparable to the ~µs launch cost, and dozens fire per step. Fix with CUDA Graphs (replay the whole launch sequence in one submission) and fusion (fewer, larger kernels). Buying a faster GPU won't help — launch cost is fixed. Common wrong answer to avoid: "GPU-Util is high, the GPU is maxed out." Launch-bound code reports high util while idling in the gaps.

Q3. When is fusion the wrong move? A. When the chain is already compute-bound (no idle bandwidth to reclaim), when the intermediate is consumed by multiple downstream ops (fusing recomputes it, trading bandwidth for FLOPs), or when the working set won't fit on-chip and the fused kernel spills to HBM. Over-fusion can also collapse occupancy by raising register/shared-memory pressure. Common wrong answer to avoid: "Always fuse more — fewer kernels is always faster." Past on-chip capacity, more fusion slows down.

Q4. How does fusion connect to the roofline from the previous file? A. Fusion raises arithmetic intensity by deleting HBM round-trips for throwaway intermediates — it slides a memory-bound kernel's dot rightward on the roofline, toward the ridge. It's the kernel-level mechanism for the roofline's prescription "do more work per byte read." For attention it's the difference between O(N²) and O(N) bytes. Common wrong answer to avoid: "Fusion is about reducing FLOPs." It reduces bytes moved; FLOPs are usually unchanged.

Q5. A teammate's hand-written fused kernel is slower than torch.compile's. What do you check first? A. Whether the intermediate actually stays on-chip. If the tile is sized too large, the per-thread working set overflows the register file and spills into HBM-backed local memory, negating the fusion and adding bad access patterns. Shared memory is capped too: at most 228 KB of an H100 SM's 256 KB of L1/shared storage. Check register spills, shared-memory usage, and occupancy; shrink the tile until it fits. Also verify they're not trying to beat an already-tuned library path. Common wrong answer to avoid: "Custom CUDA is always faster than a framework." Untuned custom kernels routinely lose to autotuned library kernels.

Q6. (Cumulative.) For the 70B chat endpoint, where do fusion and batching each contribute, and do they overlap? A. They attack the same memory wall from different sides. Fusion (FlashAttention) cuts the per-step bytes — especially attention's O(N²) score matrix on long conversations. Batching (file 04) amortizes the weight-read bytes across many requests. They stack: fused attention makes each request cheaper in bytes; batching makes the expensive weight-read serve many requests. Both slide decode's roofline dot rightward; neither alone reaches target throughput. Common wrong answer to avoid: "Fusion and batching do the same thing, so pick one." They reduce different byte sources and compose multiplicatively.

Design/debug exercise (10 min)¶

Step 1. Model it. Take y = gelu(x + b) on a tensor of T elements, treating b as a full-size tensor. Unfused: kernel 1 reads x and b, writes x+b (≈2T reads + T writes); kernel 2 reads x+b, writes y (≈T reads + T writes). Total ≈ 5T element-movements and 2 launches. Fused: read x,b once, write y once ≈ 3T movements and 1 launch. You cut ~40% of the traffic and half the launches. If b is instead a small broadcast bias vector, its read is negligible and the counts become ≈4T vs ≈2T, which is the "half" quoted in the two-ops example. Write out the byte counts.

Step 2. Your turn. For the 70B endpoint at a 4096-token conversation, estimate the relative attention memory traffic of naive (O(N²)) vs FlashAttention (O(N)) for one head, one layer. With N = 4096, the score matrix is ~16.8M entries; the tensors FlashAttention keeps in HBM (Q, K, V, O) are each N×d, about 0.5M entries if you assume a head dimension of d = 128. Argue why this matters more as conversations grow, and tie it to the roofline dot moving right.

Step 3. Reproduce from memory. Redraw the unfused-vs-fused attention diagram from section 2, labeling where S and P hit HBM in the naive path and why they never do in the fused path. Then state in one sentence how this connects to file 01: fusion is the kernel-level way to raise arithmetic intensity, the exact lever the roofline said decode needs.

Operational memory¶

This chapter explained why a model that "looks busy" still wastes time: it shatters into many small kernels, each paying a fixed launch cost and writing throwaway intermediates across the HBM cliff. The important idea is that for memory-bound LLM ops, the seams between operations — launches and intermediate round-trips — cost more than the arithmetic inside them.

You learned to fix this by fusion: collapse a chain of memory-bound ops into one kernel that keeps intermediates in registers/shared memory and crosses HBM once, and use CUDA Graphs to replay launch sequences in a single submission. FlashAttention is the showcase — it never materializes the N×N score matrix, turning attention's traffic from O(N²) to O(N) with identical output. That solves the opening failure because it deletes the bytes and launches that were stalling the SMs between operations.

Carry this diagnostic forward: when a model is slow but the math looks cheap, open the kernel timeline, not the FLOP counter. If you see many tiny kernels with gaps, or large intermediate writes to HBM, suspect missing fusion or launch overhead before blaming the model size.

Remember:

Each kernel pays two taxes: a fixed launch cost and an HBM round-trip for its output. Fusion removes both for throwaway intermediates.
FlashAttention is exact; its win is never materializing the N×N score matrix — O(N²) → O(N) memory traffic.
Launch overhead dominates small-batch decode (tiny kernels ≈ launch cost); CUDA Graphs amortize it.
Fusion only helps if the intermediate actually stays on-chip — a spilled "fused" kernel is unfused in disguise.
Don't fuse compute-bound ops, multi-consumer intermediates, or working sets that won't fit on-chip; over-fusion collapses occupancy.
Next pressure: a 70B model doesn't fit on one card, so the work spans GPUs — and now the cost is the traffic between GPUs, which fusion can't touch.

Bridge. Fusion and CUDA Graphs make a single GPU's kernels efficient. But a 70B model in BF16 needs ~140 GB of weights — more than one 80 GB H100 holds. The model must be split across GPUs, and now every layer's computation requires the cards to exchange partial results. Suddenly the bottleneck is not inside any kernel; it is the wire between GPUs, and how fast that wire is depends on whether it's NVLink, PCIe, or InfiniBand. The next file builds the collective operations (all-reduce, all-gather) that move that data and shows how interconnect topology decides their speed. → 03-nccl-collectives-and-interconnect.md