01. The single-GPU memory wall — why 70B dies at step zero

~18 min read. You requested eight DGX nodes, sixty-four H100s, five terabytes of GPU memory. You wrote model.cuda(). The process crashed before the first batch. This file shows you exactly which bytes killed it, and teaches you the arithmetic that governs every decision in the rest of the module.

Built on the first-principles overview. The dominant pressure of this module is the memory wall: a single 80 GB H100 cannot hold params + gradients + optimizer state + activations for a large model. This file derives that wall from first principles using the 16-bytes-per-param rule, so every later mechanism reads as a way to push one term off the card.


What we still cannot do after writing a correct training loop

You can write a training loop. Forward pass, loss, loss.backward(), optimizer.step(), optimizer.zero_grad(). On a 1B model it runs on one GPU and trains fine. Nothing in that loop is wrong. The code that works at 1B is byte-for-byte the code that explodes at 70B. So the gap this file closes is not how to train — you know that — it is how to predict, before you launch, whether the model will physically fit, and which of the four memory consumers will be the one that overflows. Without that prediction you are flying blind: you size a cluster, queue a job, wait an hour for the scheduler, and learn at step zero that you were off by a factor of fourteen.

What this file solves

A 70B model OOMs on an 80 GB H100 not because of a leak, a fragmentation bug, or a batch size you can lower — it OOMs because the static state of mixed-precision Adam training is 16 bytes per parameter, which is 1,120 GB, fourteen times what one card holds. This file teaches you to decompose GPU memory into its four consumers (parameters, gradients, optimizer state, activations), compute each one in your head for any model, and read the resulting number as a hard physical ceiling rather than a tuning knob.

The four things that live in GPU memory during training

Open nvidia-smi during a healthy training step and watch the memory number. It is the sum of four distinct populations, and they have wildly different sizes and lifetimes. Confusing them is the single most common reason engineers mis-size a run.

Parameters. The weights themselves. For a model with N parameters stored in bf16, this is 2N bytes. It is constant for the whole run — the count never changes, only the values.

Gradients. One number per parameter, produced by the backward pass, holding ∂loss/∂weight. Same shape as the parameters, so in bf16 another 2N bytes. Allocated once the backward pass starts touching a layer, and (naively) kept until the optimizer consumes it.

Optimizer state. This is the one that surprises people. Plain SGD keeps nothing extra. But large transformers are almost never trained with plain SGD; they use Adam (or AdamW), which keeps a running first moment (the smoothed gradient) and a running second moment (the smoothed squared gradient) for every parameter. And here is the subtlety that inflates the bill: for numerical stability, mixed-precision training keeps these moments, plus a master copy of the weights, in fp32, at four bytes each. Three fp32 numbers per parameter: 12N bytes.

Activations. The intermediate outputs of every layer during the forward pass, saved because the backward pass needs them to compute gradients. Unlike the first three, activation memory does not scale with parameter count alone — it scales with batch size, sequence length, and the number of layers. This is the one term you can actually move with a knob, and we will return to it again and again.

Teacher voice. The first three populations are fixed by the model and the optimizer. You cannot lower them by changing batch size. Engineers who treat an OOM as "reduce the batch" are reaching for the only knob that touches the fourth population — and on a large model the OOM is almost always in the first three, where that knob does nothing.

The naive repair and the visible break

The model OOMs. The instinct, drilled in from years of single-GPU work, is to lower the batch size. Batch of 8 crashes. Try 4. Crashes. Try 1. Still crashes. Try a batch of one single token? Still crashes — and now you should be suspicious, because if a batch of one token cannot fit, the problem was never the batch.

Here is what the trace tells you when you read it instead of reflexively reaching for the batch knob:

RuntimeError: CUDA out of memory. Tried to allocate 4.00 GiB.
GPU 0 has a total capacity of 79.21 GiB
  of which 0 bytes is free.
Process has 79.21 GiB memory in use.
  Allocated: 78.9 GiB  |  Reserved: 79.2 GiB
Traceback (most recent call last):
  ...
  File "torch/optim/adamw.py", line 184, in step
    exp_avg = torch.zeros_like(p, dtype=torch.float32)   # <-- here

The crash is inside optimizer.step, allocating exp_avg — Adam's first-moment buffer. Not in the forward pass. Not in the data loader. The model loaded, the forward and backward even ran on a tiny batch, and then the optimizer tried to allocate its fp32 moments and the card was already full. So the real problem is not the batch size and not a leak; it is that the optimizer's own bookkeeping needs more memory than the weights and gradients combined, and that bookkeeping is allocated regardless of how small you make the batch.

So the natural question: if the batch knob is the wrong knob, what is the actual size of the thing we are trying to stuff into 80 GB, and where exactly does it cross the line?

The 16-bytes-per-param rule

Add up the per-parameter cost of mixed-precision Adam training and you get one number you should memorize, because it sizes every distributed-training decision you will ever make:

  bf16 parameter copy      2 bytes
  bf16 gradient            2 bytes
  fp32 master weight       4 bytes
  fp32 Adam first moment   4 bytes
  fp32 Adam second moment  4 bytes
  ─────────────────────────────────
  TOTAL                   16 bytes  per parameter

The load-bearing rule. Mixed-precision Adam costs 16 bytes per parameter of static state: memory that exists before a single activation, independent of batch size. Multiply your parameter count by 16 and you have the floor. If that floor exceeds one GPU's memory, no batch size, no precision trick on activations, and no gradient accumulation will save you: you must split the model state across GPUs. If it exceeds your aggregate GPU memory, splitting alone is not enough and you need more GPUs.

This is the chapter's invariant. Everything downstream — every all-reduce, every shard, every pipeline stage — exists to get that 16N number to fit, and to do it without the GPUs spending all their time talking.
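A minimal sketch of the rule as arithmetic, so you can check any model size in a few seconds. The names BYTES_PER_PARAM and static_state_bytes are illustrative, not part of any library, and GB here means 10^9 bytes, matching the figures used throughout this file.

```python
# Per-parameter static state for mixed-precision Adam, in bytes.
# Names are illustrative; the values are the five rows of the table above.
BYTES_PER_PARAM = {
    "bf16 parameter copy": 2,
    "bf16 gradient": 2,
    "fp32 master weight": 4,
    "fp32 Adam first moment": 4,
    "fp32 Adam second moment": 4,
}


def static_state_bytes(n_params: int) -> int:
    """Static floor: memory that exists before any activation is saved."""
    return n_params * sum(BYTES_PER_PARAM.values())


for n in (1_000_000_000, 13_000_000_000, 70_000_000_000):
    gb = static_state_bytes(n) / 1e9
    print(f"{n / 1e9:>4.0f}B params -> {gb:,.0f} GB static state")
```

For the three sizes shown this prints 16 GB, 208 GB, and 1,120 GB. Activations are deliberately absent: they are not a per-parameter cost.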

The 70B model on one H100 — the worked arithmetic

Thread this number through and watch the card overflow. Our running example, carried through the entire module: a 70-billion-parameter transformer, trained with AdamW in bf16 mixed precision, on a cluster of eight nodes × eight H100 SXM (80 GB HBM3) = 64 GPUs.

Static state on a single GPU, if we tried to put the whole model there:

N = 70 × 10^9 parameters

Parameters (bf16)         70e9 × 2  =  140 GB
Gradients (bf16)          70e9 × 2  =  140 GB
fp32 master weights       70e9 × 4  =  280 GB
fp32 Adam m               70e9 × 4  =  280 GB
fp32 Adam v               70e9 × 4  =  280 GB
──────────────────────────────────────────────
Static total              70e9 × 16 = 1,120 GB

One H100 holds 80 GB. The static state alone is fourteen H100s before activations. This is why the crash is at step zero and not at step one thousand: the wall is hit the moment those fp32 buffers are allocated, whether your stack does that at optimizer construction or on the first optimizer.step. The batch never mattered.

Now the fourth consumer, activations, which do depend on the batch. A rough but useful estimate for a transformer's activation memory, per the standard accounting, is governed by sequence length s, batch size b, hidden dimension h, and layer count L. For our 70B model — roughly L = 80 layers, h = 8192, sequence length s = 8192, and even a single sequence b = 1 — the saved activations across all layers, in bf16 without any recompute, run to roughly 30–45 GB per sequence. Scale the batch to 8 and you are adding hundreds of gigabytes of activations on top of the 1,120 GB you already cannot fit.

                         One H100 = 80 GB
   ┌──────────────────────────────────────────────────┐
   │ ████ params 140  → already 1.75× the card          │
   └──────────────────────────────────────────────────┘
        + gradients 140
        + optimizer 840
        + activations 30–350 (batch-dependent)
        ────────────────────────────────────
        ≈ 1,150 – 1,470 GB   needed
        ÷ 80 GB              available
        = 15 – 19 GPUs       just to hold one copy

The asymmetry to burn into memory: the optimizer state (840 GB) is six times the parameter memory (140 GB). People picture "the model" as the weights and reason about memory from there. But the weights, tied with the gradients, are the smallest of the three fixed terms. The thing that actually fills the card is the optimizer's fp32 bookkeeping, which is exactly why the first sharding mechanism in this module (file 03) goes after optimizer state first.
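The asymmetry can be checked directly from the 70B figures above. This is a small sketch using only those three numbers; the dictionary name STATIC_GB is illustrative.

```python
# Fixed terms for the 70B example, in GB, as derived in the table above.
STATIC_GB = {"parameters": 140, "gradients": 140, "optimizer state": 840}

total = sum(STATIC_GB.values())
for name, gb in STATIC_GB.items():
    # Share of the static floor taken by each fixed term.
    print(f"{name:<16} {gb:>5} GB  {gb / total:6.1%}")

ratio = STATIC_GB["optimizer state"] / STATIC_GB["parameters"]
print(f"optimizer state is {ratio:.0f}x the parameter memory")
```

Three quarters of the static floor is optimizer state; weights and gradients split the remaining quarter evenly.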

A picture of the memory budget

Keep this image as the canonical mental model for the whole module. The four consumers stacked against the 80 GB ceiling, with their byte-per-param weights labeled:

   PER-PARAMETER MEMORY (mixed-precision Adam)            ONE H100: 80 GB ceiling
   ════════════════════════════════════════              ═══════════════════════

   parameters  ▓▓                2 B/p   ──┐
   gradients   ▓▓                2 B/p     │  these five rows are FIXED.
   fp32 master ▓▓▓▓              4 B/p     │  batch size cannot move them.
   adam   m    ▓▓▓▓              4 B/p     │  total = 16 B/param of "static state"
   adam   v    ▓▓▓▓              4 B/p   ──┘
   ─────────────────────────────────────
   activations ▓▓▓ … ▓▓▓   variable   ←── the ONLY term batch size moves
                                            (and what recompute will shrink later)

   For 70B:   static = 16 × 70e9 = 1,120 GB
              one GPU =                80 GB
              ──────────────────────────────
              shortfall: must spread across ≥ 14 GPUs before a single activation

The line that matters: of the six rows, only activations responds to the batch knob. The other five are nailed down by parameter count and optimizer choice. An engineer who has internalized this picture never debugs a 70B OOM by lowering the batch; they immediately ask which of the fixed rows to push off the card.

Why this instead of "just use a smaller optimizer or fp32 everywhere"

Two plausible alternatives present themselves, and seeing why each is wrong sharpens why the 16-byte figure is the real constraint under this workload.

Alternative A: drop Adam, use plain SGD (0 bytes of optimizer state). If you also drop the fp32 master copy, that collapses the bill from 16 to 4 bytes per parameter: 280 GB instead of 1,120 GB (keep the master copy and it is 8 bytes, 560 GB). Tempting. But large transformers do not converge well under plain SGD at these scales; the adaptive per-parameter learning rates of Adam are doing real work on the loss landscape. You would trade a memory problem for a convergence problem, which is far more expensive: a failed run wastes the entire compute budget, not just memory. Memory you can buy with more GPUs; convergence you cannot.

Alternative B: train entirely in fp32 (no master copy needed, but everything is 4 bytes). Params 280, grads 280, optimizer m and v 280 each, no separate master copy = 1,120 GB, the same number, and now every matmul runs at roughly half the throughput at best (TF32 tensor cores), and far lower in strict fp32, because you have abandoned the H100's bf16 tensor cores. Worse compute, identical memory.

The 16-byte mixed-precision-Adam configuration is not an accident; it is the point that keeps convergence quality (Adam, fp32 moments) and compute speed (bf16 matmuls on tensor cores) while accepting the memory cost. The memory cost is the bill you then pay by distributing — which is what the rest of the module does.
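To compare the three configurations side by side, here is a short sketch. The CONFIGS table and the static_gb helper are illustrative names; the byte counts are the ones stated in the two alternatives above, with Alternative A taken in its cheapest form (no fp32 master copy).

```python
N_PARAMS = 70_000_000_000

# Bytes per parameter: (weights, gradients, master copy, Adam m, Adam v).
CONFIGS = {
    "mixed-precision Adam (bf16 + fp32 state)": (2, 2, 4, 4, 4),
    "Alternative A: plain SGD, bf16 only": (2, 2, 0, 0, 0),
    "Alternative B: Adam, fp32 everywhere": (4, 4, 0, 4, 4),
}


def static_gb(bytes_per_param):
    """Static floor in GB (10^9 bytes) for the 70B example."""
    return N_PARAMS * sum(bytes_per_param) / 1e9


results = {name: static_gb(row) for name, row in CONFIGS.items()}
for name, gb in results.items():
    print(f"{name:<42} {sum(CONFIGS[name]):>2} B/param  {gb:>7,.0f} GB")
```

The table makes the trade visible: Alternative B buys nothing in memory, and Alternative A buys memory only by giving up the optimizer.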

Mini-FAQ. "Can't I just use a GPU with more memory?" You can — an H200 has 141 GB, a B200 more. But 141 GB still cannot hold 1,120 GB of static state, let alone activations. Bigger cards raise the ceiling; they do not remove the wall for frontier-scale models. The wall is relative to model size, and models grow faster than single-card memory.

The failure walked through: an OOM at step zero on the 70B run

Trace the exact sequence of allocations when the engineer launches the 70B model naively on one GPU, to see why step zero and not step one:

t0   load checkpoint → params materialize in bf16      +140 GB   (already 1.75× an 80 GB card; the trace continues as if it had loaded)
t1   build AdamW optimizer
       allocate fp32 master copy                       +280 GB
       allocate exp_avg (m), zeros                     +280 GB   ← crash usually lands here
       allocate exp_avg_sq (v), zeros                  +280 GB
t2   forward pass (never reached)
t3   backward pass: allocate gradients                 +140 GB   (never reached)
t4   optimizer.step (never reached)

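In this trace the fp32 buffers are allocated at t1, before any data moves. Exactly when that happens depends on your stack: mixed-precision optimizer wrappers typically build the fp32 master copy at construction, while stock torch.optim.AdamW creates exp_avg and exp_avg_sq lazily on the first step() call, which is the case the earlier traceback shows. Either way the allocation is independent of batch size, which is why lowering the batch (which only affects activation memory at t2/t3) never helps. The fix is not a smaller batch; the fix is that these buffers must live on different GPUs, which is precisely the optimizer-state sharding introduced in file 03.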
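A small sketch that replays the allocation order from the trace above against different card sizes. The TIMELINE list and first_overflow function are illustrative names, and the 640 GB and 2 TB cards are hypothetical, used only to show where the overflow point moves.

```python
# Allocation order from the trace above, sizes in GB for the 70B model.
# None of these sizes depends on batch size.
TIMELINE = [
    ("t0 load bf16 params", 140),
    ("t1 fp32 master copy", 280),
    ("t1 exp_avg (m)", 280),
    ("t1 exp_avg_sq (v)", 280),
    ("t3 bf16 gradients", 140),
]


def first_overflow(timeline, capacity_gb):
    """Return (label, running_total_gb) for the first allocation that
    does not fit, or None if every allocation fits."""
    used = 0
    for label, size_gb in timeline:
        if used + size_gb > capacity_gb:
            return label, used + size_gb
        used += size_gb
    return None


print(first_overflow(TIMELINE, 80))     # one real H100
print(first_overflow(TIMELINE, 640))    # hypothetical 640 GB card
print(first_overflow(TIMELINE, 2_000))  # hypothetical 2 TB card
```

On a real 80 GB card the very first allocation fails; on the hypothetical 640 GB card the overflow lands on exp_avg; only the 2 TB card gets through. The batch size appears nowhere in the inputs.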

Operational signals — reading the memory wall before it reads you

Healthy behavior. Steady nvidia-smi memory that plateaus a few gigabytes below the ceiling and stays flat step-to-step. A small rise over the first dozen steps as the allocator caches blocks is normal; a continuous rise is not.

First metric to degrade. Reserved memory creeping toward total capacity as you scale batch or sequence length — the activation term growing into whatever headroom the fixed terms left. When reserved hits ~95% you are one long sequence away from a crash.

The misleading metric. "Allocated" memory looks like it has headroom while "Reserved" is pinned at the ceiling. People see the gap between the two and raise the batch, then crash, because that gap is blocks the caching allocator already holds, and they may be too fragmented to serve a larger request. Watch reserved, not allocated. The other classic misread: blaming a leak for a flat-but-high curve that is simply the fixed 16N state doing exactly what arithmetic predicted.

The graph an expert opens first. Memory-over-time per GPU, with allocation call sites tagged (PyTorch's torch.cuda.memory._record_memory_history() snapshot, viewed in the memory visualizer). It shows the fp32 optimizer buffers appearing as three big steps at construction — instantly distinguishing "the static state doesn't fit" from "activations grew unbounded."
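The plateau-versus-climb distinction above can be checked mechanically on a logged series of per-step reserved memory. The following is a rough sketch under stated assumptions: classify_memory_curve is a hypothetical helper, the warmup of twelve steps mirrors the "first dozen steps" mentioned above, the 0.5 GB tolerance is an arbitrary choice, and both input series are synthetic rather than measured.

```python
def classify_memory_curve(reserved_gb, warmup_steps=12, tolerance_gb=0.5):
    """Label a per-step reserved-memory series as 'plateau' or 'climb'.

    Assumed heuristic: skip the warmup samples (allocator caching), then
    compare the mean of the first and last quarter of what remains.
    """
    steady = reserved_gb[warmup_steps:]
    if len(steady) < 8:
        raise ValueError("need at least 8 post-warmup samples")
    quarter = len(steady) // 4
    head = sum(steady[:quarter]) / quarter
    tail = sum(steady[-quarter:]) / quarter
    return "climb" if tail - head > tolerance_gb else "plateau"


# Synthetic series, not measurements: a flat-high run and a slow leak.
flat_high = [70 + min(step, 12) * 0.25 for step in range(200)]
slow_leak = [60 + step * 0.05 for step in range(200)]
print(classify_memory_curve(flat_high))
print(classify_memory_curve(slow_leak))
```

A heuristic like this only flags which curves deserve a closer look; the memory-history snapshot is still what tells you which allocation is responsible.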

Boundary of applicability — when the 16-byte rule bends

Where it holds tightly. Dense transformers trained with AdamW in bf16 mixed precision — the overwhelmingly common case, including our 70B example. Memorize 16 bytes/param and you will predict the static floor within a few percent.

Where it bends. 8-bit optimizers (bitsandbytes Adam) quantize each moment to roughly 1 byte, so the pair costs about 2 bytes instead of 8, dropping the bill toward ~10 bytes/param, at some risk to numerical stability on the hardest tasks. Mixture-of-Experts models have far more parameters than are active per token, so the optimizer-state bill tracks total params while compute tracks active params; the memory wall is even more dominant relative to FLOPs. Full-fp32 training (rare now) lands at 16 bytes too but for different reasons. And fine-tuning with LoRA freezes the base weights, so the 12N optimizer term applies only to the tiny adapter, which is the entire memory argument for parameter-efficient fine-tuning.
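Two of these bends reduce to simple arithmetic. The sketch below assumes a frozen bf16 weight costs 2 bytes (no gradient, no optimizer state) and a trainable adapter parameter costs the full 16; the function names and the model sizes are hypothetical, chosen only for illustration.

```python
FULL_BYTES = 16   # bf16 weight + bf16 gradient + fp32 master, m, v
FROZEN_BYTES = 2  # a frozen bf16 weight: no gradient, no optimizer state


def lora_static_gb(base_params, adapter_params):
    """Frozen base at 2 B/param plus trainable adapter at 16 B/param."""
    return (base_params * FROZEN_BYTES + adapter_params * FULL_BYTES) / 1e9


def moe_static_gb(total_params):
    """Static state tracks total parameters, not those active per token."""
    return total_params * FULL_BYTES / 1e9


# Hypothetical sizes: a 70B base with a 0.1B adapter, and a 140B-total MoE.
print(lora_static_gb(70_000_000_000, 100_000_000))
print(moe_static_gb(140_000_000_000))
```

With these assumed sizes the LoRA floor is about 141.6 GB, almost entirely the frozen weights, while the MoE floor is 2,240 GB no matter how few experts fire per token.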

The scale limit on intuition. The rule says nothing about activations, which on long sequences can rival or exceed the static state. At s = 128k context the activation term, not 16N, can be the thing that overflows — and that flips which mechanism (recompute, file 06) you reach for first.

The wrong model to carry, and the right one

The seductive-but-wrong intuition: "GPU memory is dominated by the model weights, so a model fits if the weights fit." A 70B model in bf16 is 140 GB of weights — fits across two H100s, surely? No. Weights are 140 of the 1,120 GB. The optimizer state alone is six times the weights. Inference, where only weights and a little activation memory exist, does fit a 70B model on two cards — which is exactly why people are surprised that training needs fourteen. Training and inference live in different memory universes.

The right model: training memory is dominated by the optimizer, not the weights. When you size a training run, multiply parameters by 16, not by 2. The factor-of-8 difference between "weights fit" and "training fits" is the single most common sizing error in the field.
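The two sizing rules can be put next to each other in a few lines. This is a sketch that counts whole 80 GB cards and ignores activations, KV cache, and framework overhead; min_cards is an illustrative name.

```python
import math


def min_cards(n_params, bytes_per_param, card_gb=80):
    """Cards needed to hold n_params at the given per-parameter cost,
    ignoring activations, KV cache, and framework overhead."""
    return math.ceil(n_params * bytes_per_param / (card_gb * 1e9))


N = 70_000_000_000
inference = min_cards(N, 2)   # bf16 weights only
training = min_cards(N, 16)   # full mixed-precision Adam static state
print(inference, training)
```

For the 70B example this gives 2 cards for inference and 14 for training, the same numbers the text arrives at by hand.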

Other ways the memory wall shows up

  • OOM during optimizer construction, batch=1 — the static 16N state exceeds the card; no batch knob helps. Shard the optimizer (file 03).
  • OOM only on long sequences — activations, not static state, overflowed; reach for recompute (file 06), not sharding.
  • OOM only at the first optimizer.step after several clean forward/backwards — gradients (+2N) plus the just-now-touched optimizer buffers tipped it over; the backward pass was fine, the update was not.
  • Slow steps but no OOM — you fit, but the allocator is thrashing or you are paging through unified memory; you are over the practical limit even if under the hard one.
  • OOM that appears only at higher GPU count — communication buffers (NCCL) and per-GPU framework overhead grew; the fixed state per GPU shrank but the runtime's own footprint did not.
  • Fragmentation OOM — total free exceeds the request but no contiguous block does; allocated < reserved < capacity but the allocation still fails.
  • Gradual climb to OOM over hours — a genuine leak (retained graph, accumulated Python references), distinct from the flat-high curve of the fixed state.

Where this fits the larger systems map

  • Same constraint, different layer — the KV cache at inference. In serving, the analogous wall is the KV cache, which grows with batch × sequence and overflows the card exactly as activations do here. Same memory-pressure geometry, one layer over in 02_inference_serving_systems.
  • Compute-for-memory trade — checkpointing in databases. Recomputing activations to save memory (file 06) is the same shape as recomputing a value rather than caching it; the system spends CPU/FLOPs to relieve a memory ceiling, a tradeoff that recurs everywhere from query planners to JIT compilers.
  • Fixed cost vs variable cost. The 16N static state is a fixed cost paid once; activations are a variable cost paid per batch. This fixed-vs-variable split is the same reasoning you apply to connection pools, thread stacks, or any system where a baseline footprint dwarfs per-request cost.
  • Amortization. Spreading the fixed 16N across many GPUs (files 02–05) is amortization of an indivisible cost — the same move as sharding a hot table across nodes so no single machine holds it all.

Where this appears in production

  • Meta Llama 3 (405B) — the published training setup sized the run from exactly this arithmetic; the optimizer-state term is why 405B needed thousands of GPUs, not the FLOPs alone.
  • DeepSpeed (Microsoft) — its entire ZeRO line exists to attack the 12N optimizer term that this file shows dominates the budget.
  • PyTorch FSDP — the fully_shard API treats the 16-byte static state as the thing to shard across the data-parallel group.
  • NVIDIA Megatron-LM — its memory accounting docs reproduce this per-parameter breakdown to let teams predict fit before launch.
  • Hugging Face Accelerate — its model-memory estimator multiplies parameters by a per-optimizer byte factor (4 for SGD, 8 for Adam half-precision moments, more for fp32) — productizing exactly this rule.
  • bitsandbytes 8-bit Adam — sells itself on shrinking the 8N moment term, the second-largest line in the budget.
  • EleutherAI GPT-NeoX — its config docs warn users to compute 16 × params before requesting nodes, to avoid step-zero OOMs on shared clusters.
  • Mosaic/Databricks MPT training — capacity planning for these runs starts from the static-state floor, then adds activation estimates by sequence length.
  • Character.AI / inference vendors — the inverse case: they fit large models on few cards precisely because inference drops the 12N + 2N training terms, leaving only weights.
  • LoRA / QLoRA fine-tuning — the technique exists because freezing base weights removes the 12N optimizer term for all but the tiny adapter, collapsing the wall.
  • Stability AI — diffusion-model training hits the activation term harder than the static term, the inverse balance from LLMs, which changes which mechanism they reach for first.
  • NVIDIA Nsight / PyTorch memory visualizer — the tooling teams open to see the three fp32 optimizer steps appear at construction time.
  • Google TPU pods — the same arithmetic in a different memory architecture; HBM per chip is the ceiling, optimizer state is still the dominant term.
  • Cohere / AI21 model training — capacity requests to cloud providers are derived from 16N plus an activation budget, not from weight size.
  • Cluster schedulers (Slurm, Kubernetes + Kueue) — operators set per-job memory guards from this floor so a mis-sized job fails fast instead of after an hour in queue.

Pause and recall

Close the file. Answer from memory, then check.

  1. Name the four populations that occupy GPU memory during training. Which one depends on batch size?
  2. State the 16-bytes-per-parameter breakdown. Where does each byte come from?
  3. For a 70B model, how much static state? How many 80 GB H100s does that alone require?
  4. The optimizer state is how many times larger than the parameter memory? Why does that decide what we shard first?
  5. Why does lowering the batch size to 1 fail to fix a 70B OOM?
  6. Where in the allocation timeline does the crash usually land, and why is it "step zero"?
  7. What is the difference between "allocated" and "reserved" memory, and which misleads people?
  8. Why does the same 70B model fit on two GPUs for inference but need fourteen for training?

Interview Q&A

Q1. A 70B model OOMs on an 80 GB GPU at batch size 1. Your junior suggests gradient accumulation. Right call? A. No. Gradient accumulation keeps the effective batch the same but splits it into smaller micro-batches, so it touches activations only. The crash is in the static 16N state (1,120 GB), which accumulation does not touch at all. The fix is to distribute the model state across GPUs (optimizer-state sharding first). Accumulation helps an activation-bound OOM, not a static-state-bound one. Common wrong answer to avoid: "Yes, accumulation reduces memory." It reduces activation memory, which is not where a 70B model overflows.

Q2. Why is the optimizer state, not the weights, the dominant memory term in transformer training? A. Mixed-precision Adam keeps three fp32 numbers per parameter (master weight, first moment, second moment) at 4 bytes each = 12 bytes, versus 2 bytes for the bf16 weight copy. The optimizer term is 6× the weight term. People underestimate it because they picture "the model" as its weights and reason from inference, where the optimizer state does not exist. Common wrong answer to avoid: "The weights dominate because that's the model" — weights are the smallest of the three fixed terms.

Q3. How do you predict, before launching, whether a model will fit on a given cluster? A. Static floor = 16 × params (bf16 + mixed-precision Adam). Add an activation estimate driven by batch, sequence length, hidden size, and layer count. Compare the sum against aggregate GPU memory minus framework/NCCL overhead (budget ~10–15% for that). If the static floor alone exceeds one GPU's memory, no batch tuning helps: you need model sharding or model/pipeline parallelism. If it exceeds the aggregate, you need more GPUs. Common wrong answer to avoid: "Multiply params by 2 for bf16." That sizes inference, not training; you'll be off by 8×.

Q4. Same model, same hardware: it fits for inference but OOMs for training. Explain to a PM in one sentence. A. Inference needs only the weights (2 bytes/param) plus a little activation memory; training adds gradients and the optimizer's fp32 bookkeeping, taking the per-parameter cost from 2 bytes to 16 — eight times more. Common wrong answer to avoid: "Training uses bigger batches" — even at batch 1 the static state is 8× inference.

Q5. You see a flat, high memory curve near the ceiling and suspect a leak. How do you tell a leak from the fixed state? A. A leak shows a monotonic climb across steps; the fixed 16N state shows a flat plateau established at optimizer construction and never moving. Capture a memory-history snapshot: if the big allocations are the three fp32 optimizer buffers at construction, it's the static state, not a leak. A leak's growth correlates with step count or with retained autograd graphs. Common wrong answer to avoid: "High memory means a leak" — a correctly sized 70B run sits high by design.

Q6. (Cumulative, looks ahead.) Given the 70B static floor of 1,120 GB, which memory term would you shard first across 64 GPUs, and why that one before the others? A. The optimizer state (840 GB, the 12N term), because it is the largest term and it is needed only during the update step, not during forward/backward — so sharding it costs the least communication relative to memory freed. That is exactly ZeRO stage 1. Sharding gradients (stage 2) and parameters (stage 3) frees more but adds proportionally more communication. Common wrong answer to avoid: "Shard the weights first" — weights are the smallest term and are needed in every forward/backward, so sharding them costs the most communication per byte saved.
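The pre-launch check described in Q3 can be sketched as a single function. Everything here is an assumption for illustration: predict_fit is a hypothetical helper, it idealizes the static state as sharding perfectly evenly across all GPUs, and the 40 GB per-GPU activation figure is a placeholder you would replace with your own estimate.

```python
def predict_fit(n_params, activation_gb_per_gpu, n_gpus,
                card_gb=80, overhead_frac=0.15):
    """Rough pre-launch check, assuming static state shards evenly.

    overhead_frac reserves part of each card for framework and NCCL
    buffers (Q3 suggests budgeting roughly 10-15%).
    """
    usable_per_gpu = card_gb * (1 - overhead_frac)
    static_gb = n_params * 16 / 1e9
    needed_per_gpu = static_gb / n_gpus + activation_gb_per_gpu
    return {
        "static_gb": static_gb,
        "needed_per_gpu_gb": needed_per_gpu,
        "usable_per_gpu_gb": usable_per_gpu,
        "fits": needed_per_gpu <= usable_per_gpu,
    }


# The running example: 70B on the full 64-GPU cluster, then on one node.
print(predict_fit(70_000_000_000, activation_gb_per_gpu=40, n_gpus=64))
print(predict_fit(70_000_000_000, activation_gb_per_gpu=40, n_gpus=8))
```

Under these assumptions the 64-GPU case fits and the single-node case does not. Treat a "fits" result as a necessary condition only, since real sharding schemes rarely achieve a perfectly even split.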

Design/debug exercise (10 min)

Step 1 — modeled example. Compute the static floor for a 13B model in bf16 + AdamW:

params       13e9 × 2  =  26 GB
gradients    13e9 × 2  =  26 GB
fp32 master  13e9 × 4  =  52 GB
adam m       13e9 × 4  =  52 GB
adam v       13e9 × 4  =  52 GB
─────────────────────────────────
static       13e9 × 16 = 208 GB   → needs 3 × 80 GB H100 before activations
Note it does not fit on two 80 GB cards even though the bf16 weights (26 GB) fit on one.

Step 2 — your turn. Take the running 70B example and answer: you have one node of 8 H100s (640 GB aggregate). Does the static state (1,120 GB) fit on one node? By how much are you short, and how many nodes minimum just for static state (ignore activations and communication overhead)? Then state which single term you would shard first and why.

Step 3 — reproduce from memory. Without looking, redraw the per-parameter memory diagram (the five rows with their byte counts and the activation row), label which rows the batch size can move, and write the one-line connection to the next chapter: the largest fixed row (optimizer state, 840 GB) is the first thing ZeRO shards because it is biggest and only needed at update time.

Operational memory

This chapter explained why a 70B model crashes at step zero on an 80 GB GPU. The important idea is that training memory is dominated by the optimizer's fp32 bookkeeping — 16 bytes per parameter of static state — not by the weights, and that this state is fixed regardless of batch size.

You learned to decompose GPU memory into four consumers (parameters, gradients, optimizer state, activations), compute the static floor as 16 × params, and recognize that only the activation term responds to the batch knob. That solves the opening failure because it tells you instantly that the 70B OOM is a 1,120 GB static-state problem demanding distribution, not a batch problem demanding a smaller batch.

Carry this diagnostic forward: when a large model OOMs, first ask which population overflowed. If it crashed at optimizer construction or step, it is the fixed 16N state — distribute the model. If it crashed only on long sequences, it is activations — recompute them. Never debug a static-state OOM by lowering the batch.

Remember:

  • Mixed-precision Adam costs 16 bytes/param: 2 + 2 + 4 + 4 + 4. Memorize it.
  • For 70B that is 1,120 GB of static state — fourteen 80 GB H100s before a single activation.
  • The optimizer state (12N) is 6× the weights (2N); it is the biggest term and the first thing to shard.
  • Only activations respond to batch size; the other five rows are nailed down by params and optimizer choice.
  • A 70B model fits on 2 GPUs for inference, needs 14 for training: 2 bytes per parameter versus 16, an 8× gap that is the field's most common sizing error.
  • Watch reserved, not allocated; a flat-high curve is the fixed state working as designed, not a leak.

Bridge. We can now compute the wall: 1,120 GB of static state against 80 GB cards. The obvious move is to add GPUs — but how do many GPUs cooperate on one training step? The cheapest answer is to give every GPU a full copy of the model and split the data between them, then reconcile their gradients afterward. That buys throughput immediately. But the reconciliation — summing 140 GB of gradients across 64 GPUs every step — becomes a bandwidth problem that can swamp the compute it was meant to accelerate. The next file builds data parallelism and meets the all-reduce. → 02-data-parallelism-and-allreduce.md