GPU Acceleration Stack

The chapters in this module, in reading order.

#	Chapter
00	GPU acceleration & inference-serving stack — First-principles overview
01	GPU execution and the roofline — which ceiling are you actually hitting?
02	CUDA kernels and fusion — the tax you pay between operations
03	NCCL collectives and interconnect — the wire between GPUs is now the wall
04	TensorRT-LLM compilation — pay at build time so you don't pay every token
05	Triton Inference Server — the serving layer above the engine
06	NVIDIA NIM — when to take the prebuilt engine instead of building your own
07	NeMo customization — the training-side framework for the model NIM serves
08	GPU cluster scheduling and MIG — the idle GPU bills the same as the busy one
09	Boundaries and tradeoffs — what's physics, what's a vendor convention, and what it cost to forget the difference