02. Data parallelism and all-reduce — when the sync becomes the bottleneck¶

~18 min read. You have 64 GPUs and a model that fits on one. The obvious move: give every GPU a copy, feed each a different slice of the batch, and average the results. It works — until you measure where the time goes and find the GPUs spending more of each step talking than computing.

Built on the memory wall. That file proved a 70B model's 1,120 GB of static state needs many GPUs. This file takes the simplest way to use many GPUs — replicate the model, split the batch — and runs straight into the module's second pressure: coordination cost. The synchronization primitive that pays it is the all-reduce collective, and watching it saturate the interconnect is what motivates everything after.

What the memory wall left unsolved¶

The previous file established two things. First, the static state of training a 70B model is 1,120 GB — fourteen H100s before activations — so a single GPU is out of the question. Second, the way out is to spread that state across GPUs. But "spread the state" is two different problems wearing one phrase. You can keep a full copy of the model on every GPU and split the work, or you can split the model itself into pieces. The first is far simpler and is where almost every distributed run begins. This file builds it, then shows the wall it hits — not a memory wall this time, but a bandwidth one — which is what forces the model-splitting of the next files.

The catch: data parallelism, by itself, does nothing for the memory wall. Every GPU still holds the full 1,120 GB. So why start here at all? Because data parallelism is how GPUs learn to cooperate on one optimizer step, and that cooperation — the all-reduce — is the primitive every other parallelism strategy reuses. Get this right and the rest is recombination.

What this file solves¶

When you replicate a model across 64 GPUs and feed each a different micro-batch, each GPU computes a different gradient — and the optimizer can only take one step, so those 64 gradients must be combined into one before the update. This file shows how the all-reduce collective sums every GPU's 140 GB of gradients and hands everyone the average, why a naive central-server sum is O(N) in the worst link, why ring all-reduce makes the cost independent of GPU count, and how to recognize when gradient synchronization — not compute — is throttling your throughput.

Why replicate-and-split is the natural first move¶

Picture the 70B model fitting (suppose, for a moment, on a fat enough card). You want to process a batch of 512 sequences. One GPU does it in some time T. Hand each of 64 GPUs a copy of the model and 8 sequences each, and in roughly T/64 wall-clock time per GPU you have processed all 512 — if only you could combine what they learned.

That "if only" is the whole problem. Each GPU ran the forward and backward pass on its own 8 sequences, so each produced a gradient that reflects only its slice of the data. GPU 0's gradient says "move the weights this way for sequences 1–8." GPU 1's says something different for 9–16. The optimizer must apply one update that reflects all 512 sequences — which is the average of the 64 per-GPU gradients (by linearity of the gradient of a sum-of-losses). So before any GPU can call optimizer.step(), all 64 must agree on the averaged gradient.

   GPU 0  ── batch slice 0 ──→ grad₀  ┐
   GPU 1  ── batch slice 1 ──→ grad₁  │
   GPU 2  ── batch slice 2 ──→ grad₂  ├──→ average → ḡ → every GPU does step(ḡ)
   ...                                │
   GPU 63 ── batch slice 63 ─→ grad₆₃ ┘

Every GPU must end the step holding the same averaged gradient ḡ, then apply the same update, so all 64 copies stay bit-identical. The moment they drift apart, you no longer have one model — you have 64 diverging models, and the run is meaningless.

The load-bearing rule. Data-parallel replicas must hold identical weights after every step. That requires every replica to apply the same averaged gradient. The synchronization that enforces this — summing 64 gradient tensors and broadcasting the result — is the all-reduce, and it runs every single step.

The naive sync and where it breaks¶

The obvious way to average 64 gradients: pick one GPU as the boss. Everyone sends their gradient to GPU 0, GPU 0 sums and divides, then sends the result back to everyone. This is the parameter-server pattern, and it breaks in a way you can see in the numbers.

Each gradient is 140 GB (the full bf16 gradient of a 70B model). In the receive phase, GPU 0 must pull 140 GB from each of the other 63 GPUs: 63 × 140 GB = 8.8 TB arriving at one card. In the send-back phase, another 8.8 TB leaving it. Every other GPU sits idle waiting for the boss to finish its 17.6 TB of traffic on its single set of links. The cost scales with the number of GPUs, and one link — the boss's — is the chokepoint.

   PARAMETER SERVER (naive)              link load on the boss
   ─────────────────────────             ─────────────────────
        GPU1 ─┐                          receive: 63 × 140 GB = 8.8 TB
        GPU2 ─┤                          send:    63 × 140 GB = 8.8 TB
        GPU3 ─┼──→ GPU0 (boss) ──→ back   total on ONE GPU's links: 17.6 TB
        ...  ─┤                          → every other GPU idles, waiting
        GPU63─┘                          → cost grows with N. Does not scale.

So the real problem is not "averaging is expensive"; averaging is cheap arithmetic. The real problem is that a central collector forces all traffic through one GPU's links, so the cost grows with the number of GPUs and one wire becomes the bottleneck. How can we average the same data while spreading the traffic evenly across every link, so adding GPUs does not add load to any single one?

Ring all-reduce — the same answer, no central boss¶

The answer the field settled on is the ring all-reduce, the algorithm inside NCCL that PyTorch DDP calls under the hood. Arrange the GPUs in a logical ring. Chop each GPU's gradient into N chunks (one per GPU). Then run two passes around the ring.

Reduce-scatter phase. In each of N−1 steps, every GPU sends one chunk to its right neighbor and receives one from its left, adding the received chunk into its local copy. After N−1 steps, each GPU holds the fully summed version of exactly one chunk.

All-gather phase. Another N−1 steps, now passing the completed chunks around so everyone ends up with all N summed chunks — the full averaged gradient.

   RING ALL-REDUCE (N=4, gradient split into 4 chunks a,b,c,d)

   step:   GPU0 ──→ GPU1 ──→ GPU2 ──→ GPU3 ──┐
            ▲                                  │
            └──────────────────────────────────┘   (logical ring)

   reduce-scatter (N-1 steps): each GPU accumulates one chunk's full sum
   all-gather    (N-1 steps): each GPU collects every completed chunk
   ─────────────────────────────────────────────────────────────────
   data each GPU sends total ≈ 2 × (gradient size)   — INDEPENDENT of N

The arithmetic that changes everything: each GPU sends, in total, about 2 × (gradient_size × (N−1)/N) bytes — which for large N is just twice the gradient size, independent of how many GPUs there are. Every link carries the same load. Add GPUs and no single wire gets busier. The parameter server's O(N)-on-one-link cost becomes O(1)-per-link. This is why ring all-reduce, not parameter servers, is how large models actually sync.

Teacher voice. The trick is not better hardware; it is refusing to let any one GPU be the collector. Spread the same total work evenly and the per-link cost stops caring about cluster size. This "no central node, everyone carries an equal share" shape recurs across distributed systems — it is the same instinct behind consistent hashing and gossip protocols, applied to a numerical sum.

DDP — overlapping the sync with the backward pass¶

Knowing the all-reduce is cheap per link is not enough, because while the all-reduce runs the GPUs could be idle. PyTorch's DistributedDataParallel (DDP) hides the cost with a second idea: start syncing gradients before the backward pass finishes.

The backward pass produces gradients layer by layer, last layer first. The instant the last layer's gradient is ready, DDP fires its all-reduce — while the backward pass is still computing the earlier layers' gradients. Gradients are bucketed (grouped into ~25 MB blocks) so the all-reduce launches on whole buckets rather than per-tensor. By the time the backward pass reaches the first layer, most of the gradient traffic is already done.

   WITHOUT overlap                 WITH DDP overlap (backward + comm interleaved)
   ───────────────                 ──────────────────────────────────────────────
   [ backward all layers ]         [ backward L80 ][ backward L79 ]...[ backward L1 ]
   then [ all-reduce all ]              └─ all-reduce L80 ─┐
   ───────────────────────              └─ all-reduce L79 ─┤ (overlapped)
   total = compute + comm                                  └─ ... ─┘
                                   total ≈ max(compute, comm)  ← the win

The payoff: total step time drops from compute + communication toward max(compute, communication). When compute dominates, the all-reduce hides almost entirely and your 64 GPUs run near-linear scaling. When communication dominates, it doesn't hide — and that is the regime where data parallelism starts to fail, which is the whole point of the next files.

For the 70B run: even with perfect overlap, every step moves ~280 GB per GPU of gradient traffic (twice the 140 GB gradient). On intra-node NVLink at 900 GB/s that is feasible; the moment the ring crosses to inter-node InfiniBand, the slower wire sets the pace — a foreshadow of why topology placement (file 05) matters.

The 70B run through data parallelism¶

Thread the running example through. Suppose, for this file only, that the 70B model's static state somehow fit on each GPU (it doesn't — that is what file 03 fixes). With pure data parallelism across 64 H100s:

global batch        = 64 × micro-batch   (each GPU gets one micro-batch)
per-step gradient sync = all-reduce of 140 GB gradient across 64 GPUs
bytes per GPU moved  ≈ 2 × 140 GB = 280 GB per step
ideal scaling        = 64×   (if comm fully hidden behind compute)
real scaling         = 64× × efficiency, where efficiency = compute / (compute + exposed comm)

Two facts to hold:

Data parallelism scales throughput (more sequences per second) linearly until communication stops hiding behind compute. It does nothing for the memory wall — every GPU still holds the full 1,120 GB. That is the gap file 03 closes.
The communication cost per step is fixed by the gradient size (140 GB), not by the batch size. Larger micro-batches mean more compute per step to hide the same comm behind — which is why large-batch training scales better: it raises the compute-to-communication ratio.

Why this instead of a parameter server, under this workload¶

The parameter-server pattern is not dead — it is alive and well in workloads with sparse gradients (recommendation models with giant embedding tables, where each step touches a tiny fraction of parameters). There, sending only the touched rows to a central shard is genuinely cheaper than an all-reduce over the whole table.

But for a dense transformer, every parameter gets a gradient every step. There is no sparsity to exploit. The all-reduce moves the same data either way, and the ring spreads it across all links while the parameter server jams it through one. For our 70B dense model, ring all-reduce wins decisively. Match the sync algorithm to gradient density: dense → ring all-reduce; sparse → parameter server or sharded embeddings.

Mini-FAQ. "Why average gradients and not the weights?" Averaging weights after independent updates (as in some federated-learning schemes) lets the replicas drift apart between syncs, which changes the optimization trajectory and hurts convergence at these scales. Averaging the gradient every step keeps all replicas bit-identical and makes 64-GPU training mathematically equivalent to one giant batch on one GPU. Equivalence is what lets you reason about it.

Operational signals — is the sync eating your throughput?¶

Healthy behavior. Scaling efficiency above ~90%: 64 GPUs deliver close to 64× the single-GPU throughput. Per-step time roughly flat as you add GPUs (because ring all-reduce per-link cost is N-independent). GPU compute utilization (SM occupancy) high and steady.

First metric to degrade. Scaling efficiency, as you add nodes. Inside one node (NVLink) efficiency stays high; the first node-to-node hop introduces InfiniBand into the ring and efficiency steps down. If 8 GPUs give 7.6× but 64 give only 40×, communication has stopped hiding.

The misleading metric. Per-GPU utilization can look high (95%) while real throughput is poor — because the GPU is "busy" running the all-reduce kernels, not doing useful matmuls. Watch tokens/second/GPU or MFU (model FLOPs utilization), not raw SM utilization. A common trap: celebrating 95% GPU util while MFU is 35% because a third of every step is communication.

The graph an expert opens first. A profiler timeline (PyTorch profiler, Nsight Systems) showing compute kernels and NCCL kernels on parallel streams. The diagnostic is the gap: if there is white space where the GPU waits on an all-reduce that didn't overlap, communication is exposed. The fix is bigger buckets, more compute per step, or — if the wire itself is the limit — a different parallelism axis.

Boundary of applicability — where data parallelism stops being enough¶

Where it shines. Models that fit on one GPU (or fit after the sharding of file 03), where you want more throughput. Compute-bound steps with large micro-batches. Intra-node groups on NVLink, where 280 GB/step of traffic is comfortable. This is the workhorse: most training runs are data-parallel at the outermost layer.

Where it becomes pathological. When the model does not fit on one GPU at all — data parallelism replicates the problem 64 times instead of solving it. When communication dominates compute (tiny micro-batches, slow inter-node links), efficiency collapses and you are paying for 64 GPUs to do the work of 30. When you scale past the point where the global batch grows so large that convergence suffers (the large-batch generalization wall).

The scale limit on intuition. "More GPUs = proportionally faster" holds only while comm hides behind compute. Past the crossover — set by interconnect bandwidth versus model FLOPs — adding GPUs adds communication faster than it adds useful work, and throughput plateaus or drops. The crossover moves with your interconnect: pure-NVLink clusters scale further before hitting it than InfiniBand-connected ones.

The wrong model to carry, and the right one¶

The seductive-but-wrong intuition: "data parallelism lets me train bigger models by adding GPUs." It does not. Data parallelism replicates the entire model on every GPU; adding GPUs adds throughput, never capacity. If 70B doesn't fit on one GPU, it doesn't fit on 64 data-parallel GPUs either — each still needs the full 1,120 GB.

The right model: data parallelism scales throughput, not model size. To fit a bigger model you must split the state (file 03) or the model itself (file 04). Confusing throughput-scaling with capacity-scaling is why teams add GPUs to an OOMing run and watch it OOM identically on every one of them.

Other ways the all-reduce shows up as a problem¶

Scaling efficiency cliff at the node boundary — fine within 8 GPUs, drops at 16, because the ring now crosses InfiniBand. Topology-aware grouping (file 05) is the fix.
One straggler GPU stalls the whole all-reduce — the ring is only as fast as its slowest hop; a thermally throttled or failing GPU drags every step (revisited in file 07).
Gradient bucketing too small — many tiny all-reduces, launch overhead dominates, comm doesn't overlap; tune bucket size up.
Gradient bucketing too large — the last bucket can't start until late in the backward pass, reducing overlap; tune down.
Mismatched replicas after a non-deterministic op — replicas drift, the model silently diverges; usually a dropout/seed or a non-deterministic kernel bug.
NCCL hang on a dropped link — the collective blocks forever waiting on a peer that died; without a timeout the whole job freezes (file 07's territory).
Inflated micro-batch to hide comm, then activation OOM — raising the batch to improve the compute/comm ratio overflows activation memory; the two pressures collide.

Where this fits the larger systems map¶

Same shape, different layer — quorum writes. Requiring all replicas to converge on one averaged gradient before stepping is the same convergence requirement as a quorum write in a distributed database: everyone must agree before the system advances. Both pay a coordination cost per operation.
Failure geometry — head-of-line blocking. A straggler GPU stalling the ring is head-of-line blocking: the slowest participant gates the whole collective, exactly as a slow request gates a pipelined connection.
No-central-node pattern. Ring all-reduce spreads load evenly with no boss — the same structural choice as consistent hashing (03_vector_retrieval_infrastructure) and gossip dissemination: avoid a single hotspot by giving everyone an equal share.
Overlap compute and I/O. DDP hiding the all-reduce behind the backward pass is the same amortization as prefetching, double-buffering, or async I/O: keep the expensive resource busy by overlapping the wait with useful work.

Where this appears in production¶

PyTorch DistributedDataParallel (DDP) — the canonical implementation; bucketed gradients, all-reduce overlapped with backward, used by most multi-GPU training jobs.
NVIDIA NCCL — the library that implements ring (and tree) all-reduce on NVLink/InfiniBand; the actual code path under DDP, FSDP, DeepSpeed, and Megatron.
Horovod (Uber) — popularized ring all-reduce for deep learning, replacing parameter servers as the default for dense models.
Meta Llama 3 training — uses data parallelism as the outermost axis over thousands of GPUs, with all-reduce traffic carefully placed on the fastest links.
DeepSpeed (Microsoft) — builds its ZeRO sharding on top of the data-parallel group, reusing the all-reduce/all-gather collectives.
TensorFlow MirroredStrategy / MultiWorkerMirroredStrategy — Google's data-parallel APIs, all-reduce based for dense models.
JAX pmap / shard_map — data parallelism on TPU pods, with cross-replica sum as the collective.
Recommendation training (Meta DLRM, Google) — the counter-example: sparse embedding gradients still use parameter-server-style sharding because all-reduce over a huge sparse table is wasteful.
NVIDIA Nsight Systems — the profiler teams open to see whether NCCL kernels overlap compute or leave exposed gaps.
MLPerf training benchmarks — scaling-efficiency numbers there are essentially measuring how well all-reduce hides at each GPU count.
Mosaic/Databricks Composer — wraps DDP/FSDP with bucket-size and gradient-sync tuning exposed as config.
Lightning Fabric / Hugging Face Accelerate — abstract DDP setup so users get overlapped all-reduce without writing NCCL calls.
Stable Diffusion training pipelines — data-parallel across nodes, with all-reduce of the UNet gradients each step.
InfiniBand SHARP (NVIDIA) — offloads the reduction into the switch to cut all-reduce traffic further, an in-network optimization of this exact collective.

Pause and recall¶

Why must data-parallel replicas hold identical weights after every step?
Why are the per-GPU gradients different, and why must they be averaged before the optimizer steps?
Describe the parameter-server sync and the specific way it fails to scale.
What are the two phases of ring all-reduce, and why is its per-link cost independent of GPU count?
How does DDP hide the all-reduce cost, and what does step time approach when it works?
Does data parallelism help the memory wall? Explain in one sentence.
Name the metric that misleads people into thinking a data-parallel run is healthy when it isn't.
For dense transformers, why is ring all-reduce preferred over a parameter server?

Interview Q&A¶

Q1. Why does ring all-reduce scale better than a parameter server for dense gradients? A. A parameter server forces all N−1 gradient transfers through one GPU's links — cost grows with N and one wire is the bottleneck. Ring all-reduce spreads the traffic evenly: each GPU sends about twice the gradient size total, independent of N, so per-link cost stays flat as the cluster grows. For dense gradients (every param updated every step) there is no sparsity for a parameter server to exploit, so the ring wins. Common wrong answer to avoid: "Parameter servers are always slower" — for sparse embedding gradients they can be faster; the answer depends on gradient density.

Q2. Your run scales 7.6× on 8 GPUs but only 40× on 64. What's happening and where do you look? A. Communication has stopped hiding behind compute, almost certainly because the all-reduce ring now crosses node boundaries onto slower InfiniBand. Open a profiler timeline and look for exposed NCCL kernels — gaps where compute waits on a sync. Fixes: topology-aware process grouping so intra-node NVLink does most of the reduction, larger micro-batches to raise compute/comm ratio, or moving to a parallelism axis that communicates less across nodes. Common wrong answer to avoid: "The GPUs are underutilized, increase the batch" — utilization may already look high because the GPUs are busy with NCCL kernels, not useful matmuls.

Q3. A teammate adds 32 more GPUs to an OOMing 70B data-parallel run and is surprised it still OOMs. Explain. A. Data parallelism replicates the full model on every GPU. Each replica still needs the full 1,120 GB of static state, so adding data-parallel GPUs adds throughput, not capacity — every new GPU OOMs identically. To fit the model you must shard the state (ZeRO/FSDP) or split the model (tensor/pipeline parallelism). The teammate confused throughput-scaling with capacity-scaling. Common wrong answer to avoid: "They needed even more GPUs" — no number of data-parallel replicas helps; each holds the whole model.

Q4. Why does DDP start the all-reduce during the backward pass instead of after it? A. To overlap communication with computation. The backward pass produces gradients last-layer-first; DDP fires the all-reduce on each gradient bucket as soon as it's ready, while earlier layers are still computing. This drives step time from compute + comm toward max(compute, comm), hiding the sync when compute dominates. Common wrong answer to avoid: "To save memory" — overlap is about latency hiding, not memory; the gradients exist either way.

Q5. When is a parameter server still the right choice? A. Sparse-gradient workloads — large recommendation models with embedding tables where each step touches a tiny fraction of parameters. Sending only the touched rows to sharded servers is cheaper than an all-reduce over the entire table. The decision is gradient density: dense → ring all-reduce, sparse → parameter server / sharded embeddings. Common wrong answer to avoid: "Parameter servers are obsolete" — they remain standard for sparse recsys gradients.

Q6. (Cumulative.) A 70B run crashes at step zero on every one of 64 data-parallel GPUs. Is this a memory-wall problem (file 01) or a sync problem (file 02)? A. A memory-wall problem. Data parallelism doesn't touch the per-GPU static state — each GPU still allocates the full 1,120 GB and crashes at optimizer construction (t1 in file 01's timeline), before any all-reduce ever runs. A sync problem would manifest as poor scaling or a NCCL hang during steps, not an OOM at construction. The fix is sharding the state (file 03), not tuning the all-reduce. Common wrong answer to avoid: "It's a communication deadlock" — the crash is local memory allocation, before any collective is issued.

Design/debug exercise (10 min)¶

Step 1 — modeled example. Compute all-reduce traffic for a 7B model, bf16 gradients (14 GB), across 16 GPUs:

gradient size            = 7e9 × 2 = 14 GB
ring all-reduce per GPU  ≈ 2 × 14 GB × (16-1)/16 ≈ 26 GB sent + 26 GB received
parameter-server boss    = 15 × 14 = 210 GB received + 210 GB sent  ← 8× worse, on one card

Note the ring's per-GPU cost barely changes if you go to 64 GPUs ((N-1)/N → 1), while the parameter server's boss load grows linearly.

Step 2 — your turn. For the 70B running example (140 GB gradient) across 64 GPUs: compute the per-GPU ring all-reduce traffic per step, then state how much total data crosses the cluster per step. Then answer: if compute per step takes 600 ms and the exposed (un-hidden) all-reduce takes 250 ms, what is your scaling efficiency relative to ideal, and what is the first thing you'd try to raise it?

Step 3 — reproduce from memory. Without looking, draw the ring all-reduce diagram (ring of GPUs, reduce-scatter then all-gather), write the one-line cost result (≈ 2× gradient size per GPU, independent of N), and connect it back to file 01 in one sentence: data parallelism scales throughput but not capacity, so the 1,120 GB wall is still unsolved — which is why the next file shards the state.

Operational memory¶

This chapter explained how many GPUs cooperate on a single optimizer step: each computes a gradient on its own data slice, then an all-reduce averages all the gradients so every replica applies the identical update and stays bit-identical. The important idea is that the synchronization, not the averaging arithmetic, is the cost — and the ring all-reduce makes that cost independent of GPU count by refusing to route everything through a central collector.

You learned to compute all-reduce traffic (≈ 2× gradient size per GPU), to recognize that DDP hides it behind the backward pass, and to read scaling efficiency and profiler timelines to tell whether the sync is hiding or exposed. That solves the throughput-scaling problem — but only for models that already fit, because data parallelism replicates the full state on every GPU and does nothing for the memory wall.

Carry this diagnostic forward: when scaling efficiency drops as you add GPUs, suspect the all-reduce ring crossing onto slower inter-node links, and check the profiler for exposed NCCL gaps before blaming the model. When a run OOMs identically on every GPU, remember data parallelism multiplied the memory problem rather than solving it.

Remember:

Data-parallel replicas must apply the same averaged gradient every step to stay identical — that's the all-reduce's job.
Ring all-reduce costs ≈ 2× gradient size per GPU, independent of N; the parameter server costs O(N) on one card's links.
DDP overlaps the all-reduce with the backward pass, driving step time toward max(compute, comm).
Data parallelism scales throughput, not capacity — every GPU still holds the full 1,120 GB.
Watch MFU / tokens-per-second, not raw GPU utilization; a busy GPU may be busy with NCCL, not matmuls.
Scaling efficiency falls when the ring crosses node boundaries onto InfiniBand — a foreshadow of topology placement.

Bridge. We made 64 GPUs cooperate on one step, and the all-reduce keeps them in sync cheaply. But every one of those GPUs still holds the entire 1,120 GB of static state — data parallelism scaled our throughput and did nothing for the memory wall. The waste is staggering: 64 GPUs each storing identical copies of the same 840 GB of optimizer state. The next file asks the obvious question — why store 64 copies of the optimizer state when we could store one copy split across 64 GPUs? — and turns the all-reduce we just built into the sharding of ZeRO and FSDP. → 03-zero-sharding-and-fsdp.md