00. Distributed Training Systems — First-Principles Overview¶

One GPU has 80 GB. A 70B model in training needs 1.1 TB before it touches a single activation. Everything in this module is a way to spread that 1.1 TB across machines without the machines spending all their time talking to each other.

A team sets out to train a 70-billion-parameter language model. They have a cluster: eight nodes, each with eight H100 SXM GPUs, 80 GB of HBM3 each. Sixty-four GPUs, 5.1 TB of aggregate memory. Plenty, on paper. The engineer writes the obvious first line — load the model onto a GPU — and the process dies before the first forward pass. torch.cuda.OutOfMemoryError: tried to allocate 2.00 GiB. Not at step 100. At step zero, during optimizer construction.

The arithmetic is brutal and it is fixed. Weights in bf16 cost two bytes per parameter: 140 GB. Gradients, another 140 GB. The Adam optimizer keeps three fp32 numbers per parameter — a master copy of the weight, a first moment, a second moment — at four bytes each: 840 GB. Add it up and the static state of training, before any input flows through, is 1,120 GB. That is fourteen H100s' worth of memory just to hold the model still. A single 80 GB GPU was never in the running. The memory wall is not a tuning problem you optimize your way around inside one device. It is a physical ceiling, and the only way through it is to put the model on many GPUs at once.

So you split the work. The instant you do, a second cost appears that did not exist on one GPU: coordination. Sixty-four GPUs computing pieces of the same training step must exchange data — gradients, activations, parameter shards — over wires. Those wires have a speed. Inside one node, NVLink 4.0 moves 900 GB/s between any two of the eight GPUs. Between nodes, InfiniBand moves a fraction of that. Every byte a GPU sends over the network is a byte it is not spending on matrix multiplies. The whole discipline of distributed training is the management of one tradeoff: how do you cut the model and the batch into pieces small enough to fit in 80 GB, while making the GPUs talk as little as possible — and when they must talk, over the fastest wire available?

Memory pressure and coordination cost pull in opposite directions. Shard the model more aggressively and each GPU holds less, but they must gather the missing pieces from each other more often. Shard less and they barely communicate, but nothing fits. Every mechanism in this module — data parallelism, ZeRO, tensor and pipeline parallelism, mixed precision, activation recompute, checkpointing — is a different point on that tradeoff curve, chosen for a different ratio of model size to interconnect speed. Get the choice wrong and your 64 GPUs run at the throughput of 20, or the run crashes at step zero, or a single dying GPU 40 hours in takes the whole job down with it.

This module builds the answer from the memory wall outward. First we measure exactly what eats GPU memory and why 70B will never fit on one card. Then we add GPUs the cheap way (replicate and sync), discover the sync itself becomes the bottleneck, and start shredding the model state into shards. Then we cut the model along two more axes — within a layer and across layers — and learn why one cut must stay inside a node while another can cross the network. We trade compute for memory with lower precision and recomputed activations. Finally we confront the ugliest truth of 1,000-GPU runs: at that scale something is always broken, and survival depends on checkpointing and fault tolerance, not on raw speed.

The recurring pressures and concepts¶

Pressure / concept	Meaning
The memory wall	A single GPU's HBM (80 GB on H100) cannot hold params + gradients + optimizer state + activations for a large model. The forcing constraint behind every mechanism here.
The 16-bytes-per-param rule	Mixed-precision Adam stores 16 bytes per parameter (2 bf16 weight + 2 bf16 grad + 4 fp32 master + 4 fp32 m + 4 fp32 v). For 70B that is 1,120 GB of static state.
Coordination cost	The price GPUs pay to exchange gradients, activations, or shards over the interconnect. Splitting work creates it; it never existed on one GPU.
The communication-vs-memory tradeoff	Shard more → less memory per GPU but more cross-GPU traffic. Shard less → more memory but less traffic. Every mechanism picks a point on this curve.
Interconnect hierarchy	NVLink/NVSwitch inside a node (900 GB/s) is roughly an order of magnitude faster than InfiniBand between nodes. Where you place a parallelism axis is decided by how chatty it is.
The all-reduce collective	The synchronization primitive that sums one GPU's tensor with every other GPU's copy and gives everyone the result. The heartbeat of data parallelism.
The pipeline bubble	Idle GPU time when a layer-split pipeline is filling or draining. Wasted compute that micro-batching and clever scheduling shrink but never erase.
Compute-for-memory trades	Recomputing activations in the backward pass, or using bf16/fp8 instead of fp32, buys memory headroom by spending extra FLOPs or precision.
MTBF at scale	Mean time between failures collapses as GPU count rises. At thousands of GPUs something fails every hour or faster; the run's design must assume continuous partial failure.

Top resources¶

ZeRO: Memory Optimizations Toward Training Trillion Parameter Models — https://arxiv.org/abs/1910.02054
Megatron-LM: Efficient Large-Scale Language Model Training on GPU Clusters — https://arxiv.org/abs/2104.04473
PyTorch FSDP paper (Zhao et al.) — https://arxiv.org/abs/2304.11277
PyTorch FSDP2 / fully_shard docs — https://docs.pytorch.org/docs/stable/distributed.fsdp.fully_shard.html
DeepSpeed ZeRO tutorial — https://www.deepspeed.ai/tutorials/zero/
NVIDIA Transformer Engine FP8 primer — https://docs.nvidia.com/deeplearning/transformer-engine/user-guide/examples/fp8_primer.html
Mixed Precision Training (Micikevicius et al.) — https://arxiv.org/abs/1710.03740
The Llama 3 Herd of Models (training infrastructure §3) — https://arxiv.org/abs/2407.21783
NCCL collective communication docs — https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/index.html

What's coming¶

01-single-gpu-memory-wall.md — What actually eats GPU memory, the 16-bytes-per-param arithmetic, and why the OOM hits at step zero.
02-data-parallelism-and-allreduce.md — Replicate the model, split the batch, sum gradients with all-reduce — and watch the sync become the bottleneck.
03-zero-sharding-and-fsdp.md — ZeRO stages 1/2/3 and FSDP: shred the optimizer state, then gradients, then parameters, trading memory for communication.
04-tensor-and-pipeline-parallelism.md — Cut inside a layer (tensor) and across layers (pipeline); the pipeline bubble and the 1F1B schedule.
05-3d-parallelism-and-interconnect.md — Compose DP × TP × PP and map each axis onto NVLink vs InfiniBand topology.
06-mixed-precision-and-activation-recompute.md — bf16/fp8, loss scaling, and recomputing activations to trade compute for memory.
07-fault-tolerance-and-checkpointing-at-scale.md — Checkpoint frequency, restart, stragglers, elastic training, and silent corruption on thousand-GPU runs.
08-boundary-tradeoff-review.md — Open problems, contested practice, and where intuition breaks.

Memory map¶

Concept	Prerequisite	Pressure family	Recurs later as	Layer touched
Memory wall / 16-bytes-per-param	basic training loop	memory	the budget every parallelism axis must satisfy	hardware → runtime
Data parallelism + all-reduce	the memory wall	coordination + bandwidth	the DP axis inside 3D parallelism	algorithm → interconnect
ZeRO / FSDP sharding	data parallelism	memory ↔ bandwidth tradeoff	how DP frees memory before TP/PP are needed	runtime → interconnect
Tensor + pipeline parallelism	ZeRO can't fit one layer	memory + coordination	the TP/PP axes of 3D parallelism	algorithm → interconnect
3D parallelism + topology mapping	DP, TP, PP all understood	coordination + bandwidth	the full cluster layout	algorithm → hardware
Mixed precision + recompute	memory wall arithmetic	memory ↔ compute tradeoff	what shrinks the per-GPU budget under any topology	algorithm → hardware
Fault tolerance + checkpointing	a working multi-GPU run	reliability + operator attention	what keeps a week-long run alive	runtime → operator

Three traversal paths use this map. Prerequisite path — read top to bottom; each file removes the constraint the next one assumes solved. Failure path — an OOM points at files 01/03/06, a throughput collapse at 02/04/05, a dead run at 07. Synthesis path — pick a memory file and a coordination file and ask how they fight: e.g., ZeRO-3 (03) frees memory but adds all-gather traffic that competes with the all-reduce (02) on the same NVLink wire.

Bridge. Before we can split anything intelligently, we must know exactly what we are splitting and why it doesn't fit. The next file opens one H100, adds up every byte a 70B training step demands — weights, gradients, optimizer moments, activations — and shows why the failure arrives at step zero, not step one thousand. Once you can do that arithmetic in your head, every later mechanism becomes a way to push one of those terms off the card. → 01-single-gpu-memory-wall.md