Distributed Training Systems

The chapters in this module, in reading order.

#	Chapter
00	Distributed Training Systems — First-Principles Overview
01	The single-GPU memory wall — why 70B dies at step zero
02	Data parallelism and all-reduce — when the sync becomes the bottleneck
03	ZeRO sharding and FSDP — stop storing 64 copies of the same state
04	Tensor and pipeline parallelism — when one layer won't fit and the GPUs take turns
05	3D parallelism and interconnect — putting every cut on the right wire
06	Mixed precision and activation recompute — buying memory with compute and bits
07	Fault tolerance and checkpointing at scale — surviving the week when something is always broken
08	Boundary and tradeoff review — where distributed-training intuition quietly lies