00. GPU acceleration & inference-serving stack: First-principles overview

A GPU is only fast if its compute units stay fed. Every layer in this module exists to remove one bottleneck between the model's math and the hardware's peak.

A team buys eight H100s to serve a Llama-3-70B chat endpoint. On paper each card delivers roughly 989 trillion dense BF16 floating-point operations per second. The launch dashboard shows something else: the GPUs sit at 38% utilization, the endpoint pushes 280 tokens/sec aggregate, and P50 time to first token is 1.4 seconds. The bill says they are paying for eight cards. The utilization says they are using three. Nobody changed the model. Nobody changed the prompt. The math the model does is identical to the benchmark. So where did the other 62% of the silicon go?

It went to waiting. The decode step of an LLM reads every weight in the model from memory to produce one token, does very little arithmetic on each byte it reads, then discards those bytes and reads them again for the next token. The compute units are not the bottleneck; the wire from HBM to the compute units is. Add to that: kernels launched one at a time with gaps between them, batches that hold one request when they could hold thirty, KV cache that fragments memory until half of it is unusable, all-reduce traffic crawling over PCIe instead of NVLink, and a scheduler that pins one model to a whole card it barely touches. Each of those is a separate place where the model's math finishes and then waits for the hardware to catch up.
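To put a rough number on "waiting on the wire", here is a minimal back-of-envelope sketch. It assumes BF16 weights, weights sharded evenly across eight GPUs, and a per-GPU HBM bandwidth of 3.35 TB/s (a nominal spec-sheet figure assumed here, not something stated above). It ignores KV-cache reads, activations, and inter-GPU communication, so it gives an optimistic ceiling rather than a prediction.

```python
# Back-of-envelope ceiling for memory-bound decode with a single request.
# All hardware figures below are assumptions for illustration, not measurements.

PARAMS = 70e9                 # approximate Llama-3-70B parameter count
BYTES_PER_PARAM = 2           # BF16 weights
HBM_BYTES_PER_SEC = 3.35e12   # assumed per-GPU HBM bandwidth (nominal spec value)
NUM_GPUS = 8                  # assumed: weights split evenly across all cards


def decode_step_seconds(params, bytes_per_param, bandwidth, num_gpus):
    """Minimum time for one decode step if every weight byte is read exactly once."""
    weight_bytes = params * bytes_per_param
    return weight_bytes / (bandwidth * num_gpus)


step = decode_step_seconds(PARAMS, BYTES_PER_PARAM, HBM_BYTES_PER_SEC, NUM_GPUS)
tokens_per_sec_single_request = 1.0 / step

print(f"weight bytes:            {PARAMS * BYTES_PER_PARAM / 1e9:.0f} GB")
print(f"min decode step:         {step * 1e3:.2f} ms")
print(f"ceiling for one request: {tokens_per_sec_single_request:.0f} tokens/sec")
```

Under these assumptions a single request cannot exceed a couple of hundred tokens per second no matter how fast the compute units are, because each token costs one full pass over the weights. Serving more requests per pass is what raises the aggregate number.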

This module is the stack that closes those gaps, one bottleneck at a time. It is grounded in NVIDIA's concrete tooling — TensorRT-LLM, Triton (now Dynamo Triton), NIM, NeMo, NCCL, MIG — but the organizing idea is not the vendor. It is a single physical pressure: keep the compute units fed. Every mechanism here is a different answer to "what is the GPU waiting for, and how do we stop the wait." Roofline tells you whether you are waiting on compute or on memory. Kernel fusion stops you from waiting on round-trips to HBM. NCCL stops you from waiting on slow interconnect during multi-GPU work. TensorRT-LLM and Triton stop you from waiting on tiny batches and idle scheduler gaps. MIG and the cluster scheduler stop whole GPUs from waiting on nothing at all.

The reason this is hard, and the reason it fills a module, is that the bottleneck moves. Fix the batch size and now the interconnect is the wall. Fix the interconnect and now kernel launch overhead dominates. Fix that and now half your fleet is idle because one model can't fill a card. Performance work on GPUs is whack-a-mole with a profiler: you relieve one pressure and a new one surfaces in the subsystem that was previously hidden behind it. The skill this module builds is reading which wall you are against right now and reaching for the layer that moves it.

We thread one example through every file. Serve a Llama-3-70B chat endpoint at 2000 tokens/sec aggregate, under a P50 first-token budget, on H100s. We start at 280 tokens/sec and 38% utilization. Each file moves a concrete bottleneck and the number climbs.

The recurring pressures and concepts

Pressure / concept	Meaning
feed the beast	The core invariant: a GPU delivers peak only when compute units never stall waiting for data. Every mechanism removes one source of stall.
memory-bound vs compute-bound	Whether a kernel waits on the HBM wire (bytes/sec) or the compute units (FLOPs/sec). LLM decode is memory-bound; this flips almost every intuition.
arithmetic intensity	FLOPs done per byte read from memory. The single number that predicts which wall you hit, via the roofline model.
launch overhead	The fixed cost of starting a kernel (microseconds) that dominates when kernels are small and numerous. Fusion amortizes it.
collective cost	Time spent in all-reduce / all-gather moving data between GPUs. Decided by interconnect topology (NVLink vs PCIe vs InfiniBand), not by the model.
batching	Combining requests so one weight-read serves many tokens. The biggest single lever for memory-bound decode. Continuous (in-flight) batching is the modern form.
idle-GPU cost	A reserved GPU bills whether or not it computes. Utilization is the dominant cost pressure once correctness is solved. MIG and scheduling attack it.
buy-vs-build boundary	Whether to self-optimize (TensorRT-LLM + Triton) or take a prepackaged container (NIM). A judgment call, not a default.
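The arithmetic intensity, memory-bound vs compute-bound, and batching rows can be tied together with a small roofline sketch. It assumes the 989 TFLOP/s dense BF16 peak, a 3.35 TB/s HBM bandwidth (an assumed nominal figure), and a simplified decode model in which each BF16 weight is read once and used for one multiply-add (2 FLOPs) per sequence in the batch. Activation and KV-cache traffic are ignored, and the function names are illustrative only.

```python
# Roofline sketch: which wall does a decode matmul hit at a given batch size?
# Hardware numbers are assumed nominal values, not measurements.

PEAK_FLOPS = 989e12           # assumed dense BF16 peak per GPU (FLOPs/sec)
HBM_BYTES_PER_SEC = 3.35e12   # assumed HBM bandwidth per GPU (bytes/sec)
BYTES_PER_PARAM = 2           # BF16 weights


def ridge_point(peak_flops=PEAK_FLOPS, bandwidth=HBM_BYTES_PER_SEC):
    """Arithmetic intensity (FLOPs/byte) at which the two ceilings meet."""
    return peak_flops / bandwidth


def decode_intensity(batch_size, bytes_per_param=BYTES_PER_PARAM):
    """FLOPs per weight byte read: 2 FLOPs per weight per sequence in the batch."""
    return 2.0 * batch_size / bytes_per_param


def attainable_flops(intensity, peak_flops=PEAK_FLOPS, bandwidth=HBM_BYTES_PER_SEC):
    """Roofline: performance is capped by the lower of the two ceilings."""
    return min(peak_flops, intensity * bandwidth)


def which_wall(intensity, peak_flops=PEAK_FLOPS, bandwidth=HBM_BYTES_PER_SEC):
    """Name the ceiling that limits a kernel with this arithmetic intensity."""
    if intensity >= ridge_point(peak_flops, bandwidth):
        return "compute-bound"
    return "memory-bound"


print(f"ridge point: {ridge_point():.0f} FLOPs/byte")
for batch in (1, 32, 256, 512):
    ai = decode_intensity(batch)
    print(f"batch={batch:4d}  intensity={ai:6.1f} FLOPs/byte  "
          f"attainable={attainable_flops(ai) / 1e12:7.1f} TFLOP/s  {which_wall(ai)}")
```

In this simplified model the intensity of decode grows linearly with batch size, which is why batching is the lever that moves a memory-bound kernel toward the compute ceiling. Real kernels cross over at different batch sizes once KV-cache reads and activations are counted.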

Top resources

NVIDIA — Mastering LLM Techniques: Inference Optimization — https://developer.nvidia.com/blog/mastering-llm-techniques-inference-optimization/
NVIDIA Hopper Architecture In-Depth — https://developer.nvidia.com/blog/nvidia-hopper-architecture-in-depth/
TensorRT-LLM documentation — https://nvidia.github.io/TensorRT-LLM/
NVIDIA Triton / Dynamo Triton docs — https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/index.html
NVIDIA NIM microservices — https://www.nvidia.com/en-us/ai-data-science/products/nim-microservices/
NVIDIA NeMo Framework User Guide — https://docs.nvidia.com/nemo-framework/user-guide/latest/overview.html
NCCL developer page — https://developer.nvidia.com/nccl
Multi-Instance GPU (MIG) User Guide — https://docs.nvidia.com/datacenter/tesla/mig-user-guide/
FlashAttention paper (fused attention kernel) — https://arxiv.org/abs/2205.14135
PagedAttention / vLLM paper (paged KV cache) — https://arxiv.org/abs/2309.06180

What's coming

01-gpu-execution-and-roofline.md — SMs, warps, the memory hierarchy, arithmetic intensity, and the roofline model. Why LLM decode is memory-bandwidth-bound, not compute-bound. The diagnostic that names every later wall.
02-cuda-kernels-and-fusion.md — What a kernel is, launch overhead, and kernel fusion (FlashAttention as the canonical fused kernel). Why unfused ops burn bandwidth round-tripping to HBM.
03-nccl-collectives-and-interconnect.md — All-reduce / all-gather / reduce-scatter, ring vs tree, and how NVLink/NVSwitch vs PCIe vs InfiniBand topology decides collective speed.
04-tensorrt-llm-compilation.md — Graph compilation, kernel autotuning, in-flight (continuous) batching, paged KV cache. What you trade — build time, rigidity — for throughput.
05-triton-inference-server.md — Model serving, dynamic batching, ensembles/pipelines, multi-framework backends, concurrent model execution. The serving layer above the engine.
06-nim-inference-microservices.md — NIM as prepackaged optimized inference containers. Buy-vs-build for inference: when NIM helps and when you self-optimize.
07-nemo-customization.md — NeMo for data curation, pretraining, SFT/PEFT, alignment. Where it sits relative to HuggingFace and raw PyTorch.
08-gpu-cluster-scheduling-and-mig.md — Sharing GPUs: MIG partitioning, the K8s GPU operator and device plugin, SLURM, bin-packing, and the cost pressure of idle GPUs.
09-boundary-tradeoff-review.md — Open problems, contested practices, and an honest accounting of vendor lock-in.
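As a preview of why file 03 treats topology as decisive, the sketch below estimates the bandwidth term of a ring all-reduce, in which each GPU sends and receives 2(N-1)/N times the message size. The link speeds (450 GB/s per direction for NVLink, 64 GB/s per direction for PCIe Gen5 x16) and the 64 MiB message are assumed nominal values chosen for illustration. Per-hop latency is ignored, and this is a cost model, not a description of how NCCL selects its algorithms.

```python
# Bandwidth-only cost model for a ring all-reduce.
# Link speeds and message size are assumptions for illustration.


def ring_allreduce_seconds(message_bytes, num_gpus, link_bytes_per_sec):
    """Time for the bandwidth term of a ring all-reduce (latency ignored)."""
    if num_gpus < 2:
        return 0.0  # nothing to reduce across a single GPU
    # Reduce-scatter plus all-gather: each GPU moves 2*(N-1)/N of the message.
    traffic_factor = 2 * (num_gpus - 1) / num_gpus
    return traffic_factor * message_bytes / link_bytes_per_sec


MESSAGE_BYTES = 64 * 1024**2   # hypothetical 64 MiB all-reduce payload
NUM_GPUS = 8
LINKS = {
    "NVLink (assumed 450 GB/s per direction)": 450e9,
    "PCIe Gen5 x16 (assumed 64 GB/s per direction)": 64e9,
}

for name, bandwidth in LINKS.items():
    t = ring_allreduce_seconds(MESSAGE_BYTES, NUM_GPUS, bandwidth)
    print(f"{name}: {t * 1e6:.0f} microseconds per all-reduce")
```

The model has no term for the network's weights or layers: the same payload costs several times more over the slower link, which is the sense in which collective cost is decided by the interconnect.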

Memory map

Concept	Prerequisite	Pressure family	Recurs later as	Layer touched
Roofline / arithmetic intensity	basic GPU model	bandwidth vs compute	the wall every later layer moves	algorithm → hardware
Kernel fusion	roofline, what a kernel is	bandwidth, launch overhead	FlashAttention inside TensorRT-LLM	runtime → hardware
NCCL collectives	multi-GPU model parallelism	interconnect bandwidth	tensor-parallel 70B serving	runtime → hardware
In-flight batching + paged KV	roofline (decode is memory-bound)	throughput, memory waste	Triton dynamic batching above it	algorithm → runtime
Triton serving	a compiled engine exists	latency vs throughput, multi-model	NIM wraps it	API → runtime
NIM packaging	Triton + TensorRT-LLM	buy-vs-build, ops cost	the lock-in debate in 09	API → ops
NeMo customization	SFT/PEFT basics	data quality, training cost	the model NIM serves	data → training
MIG + cluster scheduling	utilization economics	idle-GPU cost, isolation	the bill that justifies all of the above	resource → operator

Three traversal paths.
Prerequisite path: read 01→09 in order; each file assumes the prior.
Failure path: utilization low? Start at 01 (which wall?), then 04/05 (batching). Multi-GPU slow? Jump to 03. Fleet expensive but idle? Jump to 08.
Synthesis path: combine layers. A tensor-parallel 70B engine (04) needs NCCL (03) to be fast, Triton (05) to batch, and MIG (08) only if the model is small enough to share a card, which a 70B is not; file 08 makes that tension concrete.
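The claim that a 70B model is too large to share a card can be checked with simple memory arithmetic. The sketch below assumes BF16 weights, 80 GB of memory per H100, and a hypothetical 30% of each card reserved for KV cache and activations. The 8B comparison model and the helper names are illustrative, not taken from the files that follow.

```python
import math

# Memory-fit check: can a model share a card, or does it need several?
# Card size, reserve fraction, and the 8B comparison are assumptions.

GPU_MEMORY_BYTES = 80e9   # assumed H100 memory capacity
BYTES_PER_PARAM = 2       # BF16 weights
RESERVE_FRACTION = 0.3    # hypothetical share kept for KV cache and activations


def weight_bytes(params, bytes_per_param=BYTES_PER_PARAM):
    """Bytes needed to hold the model weights alone."""
    return params * bytes_per_param


def min_gpus(params, gpu_memory=GPU_MEMORY_BYTES, reserve=RESERVE_FRACTION):
    """Fewest cards whose usable memory holds the weights."""
    usable = gpu_memory * (1.0 - reserve)
    return math.ceil(weight_bytes(params) / usable)


def fits_on_one_card(params, gpu_memory=GPU_MEMORY_BYTES):
    """True if the weights alone fit in a single card's memory."""
    return weight_bytes(params) <= gpu_memory


for name, params in (("70B", 70e9), ("8B (hypothetical smaller model)", 8e9)):
    print(f"{name}: weights={weight_bytes(params) / 1e9:.0f} GB, "
          f"fits on one card={fits_on_one_card(params)}, "
          f"min GPUs={min_gpus(params)}")
```

Under these assumptions the 70B weights alone exceed one card, so partitioning a card with MIG cannot help it. A model of a few billion parameters leaves most of a card unused, which is the case where sharing pays off.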

Bridge. Before any tool, we need the diagnostic that tells us which wall we are against. A GPU has two ceilings — how fast it computes and how fast it reads memory — and almost every wrong optimization comes from assuming you are under the compute ceiling when you are pinned against the memory one. The next file builds the roofline model and shows why LLM decode lives on the memory wall. → 01-gpu-execution-and-roofline.md