Distributed Training Systems

The chapters in this module, in reading order.

| # | Chapter |
|---|---------|
| 00 | Distributed Training Systems — First-Principles Overview |
| 01 | The single-GPU memory wall — why 70B dies at step zero |
| 02 | Data parallelism and all-reduce — when the sync becomes the bottleneck |
| 03 | ZeRO sharding and FSDP — stop storing 64 copies of the same state |
| 04 | Tensor and pipeline parallelism — when one layer won't fit and the GPUs take turns |
| 05 | 3D parallelism and interconnect — putting every cut on the right wire |
| 06 | Mixed precision and activation recompute — buying memory with compute and bits |
| 07 | Fault tolerance and checkpointing at scale — surviving the week when something is always broken |
| 08 | Boundary and tradeoff review — where distributed-training intuition quietly lies |