04. Tensor and pipeline parallelism — when one layer won't fit and the GPUs take turns¶
~20 min read. ZeRO got the 70B model to fit by sharding the optimizer's bookkeeping. But it still made every GPU compute a whole layer on gathered weights. Push the model wider — a 175B with a single MLP that overflows 80 GB on its own — and you hit a wall ZeRO cannot move: the layer itself is too big. This file makes two new cuts. One splits the math inside a layer across GPUs. The other splits the sequence of layers across GPUs — and pays for it with idle GPUs you will learn to call the bubble.
Built on ZeRO sharding. ZeRO solved the memory wall by sharding state and gathering it on demand — but it never splits a layer, and its bridge named the hard boundary: when one layer won't fit, you've left ZeRO's domain. This file crosses that boundary with two model-splitting cuts, meets the coordination cost at a new granularity (an all-reduce inside every layer), and introduces a fresh waste — the pipeline bubble — that micro-batching and the 1F1B schedule shrink but never erase.
What ZeRO solved and what one giant layer still breaks¶
By the end of the last file the 70B model trained: ZeRO-3 sharded parameters, gradients, and optimizer state down to ~17.5 GB per GPU, gathered each layer's weights just-in-time, and discarded them after. The trick worked because one layer's weights fit on one GPU — gather them, compute, reshard. That assumption is doing quiet load-bearing work. ZeRO never splits a layer; it only avoids storing the layers it isn't currently computing.
So picture the assumption breaking. Train a wider model — 175B, or a Mixture-of-Experts whose single feed-forward block is a 50,000-wide matrix — and one transformer layer's weights, plus the activations to run it, exceed 80 GB on their own. Now there is no shard-and-gather that helps: even a single GPU holding only that layer's params with zero optimizer state still can't materialize the layer to compute it. ZeRO has run out of road. The cut has to go through the layer, not around it.
This file teaches the two cuts that go through the model itself. Tensor parallelism splits the matrix multiply inside a layer across GPUs. Pipeline parallelism assigns different layers to different GPUs and streams data through them like a factory line. The first creates a fine-grained, chatty sync; the second creates idle time. Both are necessary, and the next file composes them.
What this file solves¶
When a single transformer layer's weights and activations exceed one GPU, no amount of ZeRO sharding helps — the layer must be cut. This file shows how Megatron-style tensor parallelism splits each layer's two big matrix multiplies (the MLP and the attention projections) column-wise then row-wise so only one all-reduce per block per pass is needed, and why that all-reduce is so frequent it must stay on the fastest wire. It then shows how pipeline parallelism stacks layers across GPUs, why a naive pipeline leaves most GPUs idle (the bubble), and how splitting a batch into micro-batches with the 1F1B schedule shrinks the bubble from catastrophic to tolerable.
Cut one: split the math inside a layer¶
Start with the cheaper-to-explain cut. A transformer layer is dominated by two blocks of big matrix multiplies: the MLP (two linear layers with a nonlinearity between them) and attention (the Q, K, V, and output projections). Each one is just a Y = X · W operation where W is a giant weight matrix. The question tensor parallelism answers: can we split W across GPUs so each holds a slice, and still get the right Y?
Yes — and the direction of the split is everything. Megatron-LM splits the first MLP matrix column-wise: GPU 0 holds the left half of the columns, GPU 1 the right half. Each computes a slice of the output Y from the full input X, with no communication needed yet. The nonlinearity (GeLU) applies element-wise, so each GPU runs it on its own slice independently. Then the second MLP matrix is split row-wise, which exactly matches the column-split output: each GPU multiplies its slice and produces a partial sum of the final output. One all-reduce sums those partials into the correct result.
MLP under tensor parallelism (2 GPUs)        X = input, full copy on both GPUs
───────────────────────────────────────
W1 split by COLUMN                W2 split by ROW
GPU0: X·W1[:, left ] → Z0         GPU0: GeLU(Z0)·W2[top] ─┐
GPU1: X·W1[:, right] → Z1         GPU1: GeLU(Z1)·W2[bot] ─┤→ all-reduce → Y (full)
                                                          ┘
GeLU(Z) applies per-slice, no comm. ONE all-reduce at the end of the block.
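The column-then-row pairing can be checked numerically. The sketch below is a toy, not Megatron's implementation: plain Python lists stand in for tensors, the "GPUs" are loop iterations, and the all-reduce is an ordinary sum. The sizes (`tokens`, `d_model`, `d_ff`, `tp`) and helper names are assumptions chosen for illustration.

```python
import math
import random

def matmul(a, b):
    # plain-Python matrix multiply: a is (n x k), b is (k x m)
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def gelu(m):
    # tanh approximation of GeLU; element-wise, so each shard can apply it alone
    c = math.sqrt(2.0 / math.pi)
    return [[0.5 * v * (1.0 + math.tanh(c * (v + 0.044715 * v ** 3))) for v in row]
            for row in m]

rng = random.Random(0)

def rand_matrix(rows, cols):
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]

tokens, d_model, d_ff, tp = 4, 8, 32, 2   # toy sizes; tp = tensor-parallel degree
X = rand_matrix(tokens, d_model)           # full input, replicated on every "GPU"
W1 = rand_matrix(d_model, d_ff)
W2 = rand_matrix(d_ff, d_model)

# Reference: the unsplit MLP on one device.
Y_ref = matmul(gelu(matmul(X, W1)), W2)

# Each "GPU" takes a column slice of W1 and the matching row slice of W2.
shard = d_ff // tp
partials = []
for rank in range(tp):
    lo, hi = rank * shard, (rank + 1) * shard
    W1_cols = [row[lo:hi] for row in W1]   # column split
    W2_rows = W2[lo:hi]                    # row split
    Z = gelu(matmul(X, W1_cols))           # complete slice of the intermediate, no comm
    partials.append(matmul(Z, W2_rows))    # partial sum of the final output

# The single all-reduce: sum the partials element-wise.
Y_tp = [[sum(p[i][j] for p in partials) for j in range(d_model)] for i in range(tokens)]

max_err = max(abs(a - b) for ra, rb in zip(Y_ref, Y_tp) for a, b in zip(ra, rb))
assert max_err < 1e-9
```

The only place the shards exchange data is the final sum, which is the one all-reduce the text describes.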
The same dance works for attention: split Q, K, V column-wise so each GPU owns a subset of the attention heads (heads are independent, so this is clean), compute attention locally per head, then split the output projection row-wise and all-reduce. The result: one all-reduce in the forward pass and one in the backward pass, per attention block and per MLP block — Megatron's "f and g" conjugate operators handle the symmetric pattern. That is remarkably few collectives for splitting a whole layer's compute.
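The head-wise split can be verified the same way. This is a minimal sketch under stated assumptions: unmasked self-attention, weight columns laid out head by head, toy sizes, and hypothetical helper names. It only shows that giving each "GPU" a subset of heads plus the matching rows of the output projection, then summing, reproduces the unsplit result.

```python
import math
import random

def matmul(a, b):
    # plain-Python matrix multiply: a is (n x k), b is (k x m)
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    top = max(row)
    exps = [math.exp(v - top) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def col_slice(w, lo, hi):
    return [row[lo:hi] for row in w]

def attention(x, wq, wk, wv, wo, n_heads):
    """Unmasked multi-head self-attention; head h owns columns [h*d_head, (h+1)*d_head)."""
    d_head = len(wq[0]) // n_heads
    q, k, v = matmul(x, wq), matmul(x, wk), matmul(x, wv)
    ctx = [[] for _ in x]                      # per-token concatenation of head outputs
    for h in range(n_heads):
        lo, hi = h * d_head, (h + 1) * d_head
        qh, kh, vh = col_slice(q, lo, hi), col_slice(k, lo, hi), col_slice(v, lo, hi)
        kh_t = [list(c) for c in zip(*kh)]     # transpose K for the Q·K^T product
        scores = [[s / math.sqrt(d_head) for s in row] for row in matmul(qh, kh_t)]
        head_out = matmul([softmax(row) for row in scores], vh)
        for token, row in zip(ctx, head_out):
            token.extend(row)
    return matmul(ctx, wo)

rng = random.Random(0)

def rand_matrix(rows, cols):
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]

seq, d_model, n_heads, d_head, tp = 5, 8, 4, 2, 2   # toy sizes
x = rand_matrix(seq, d_model)
wq, wk, wv = (rand_matrix(d_model, n_heads * d_head) for _ in range(3))
wo = rand_matrix(n_heads * d_head, d_model)

y_ref = attention(x, wq, wk, wv, wo, n_heads)

# Each "GPU" gets n_heads/tp heads (column split of Q, K, V) and the
# matching rows of the output projection (row split).
width = n_heads * d_head // tp
partials = []
for rank in range(tp):
    lo, hi = rank * width, (rank + 1) * width
    partials.append(attention(x, col_slice(wq, lo, hi), col_slice(wk, lo, hi),
                              col_slice(wv, lo, hi), wo[lo:hi], n_heads // tp))

# The single all-reduce: sum the partial outputs.
y_tp = [[sum(p[i][j] for p in partials) for j in range(d_model)] for i in range(seq)]
max_err = max(abs(a - b) for ra, rb in zip(y_ref, y_tp) for a, b in zip(ra, rb))
assert max_err < 1e-9
```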
Teacher voice. The genius is the column-then-row pairing. Split column-wise and each GPU produces a complete slice of the intermediate — no sync. Split the next matrix row-wise and each GPU produces a partial of the final answer — sync once. Pair them and a whole MLP costs exactly one all-reduce instead of one per matrix. Get the split direction wrong and you'd pay a sync between the two matrices too, doubling the traffic.
For our running example: one layer of the 70B model under 8-way tensor parallelism, spread across the 8 GPUs of one node, means each GPU holds 1/8 of every weight matrix and computes 1/8 of the attention heads. The layer that ZeRO had to gather whole onto one GPU is now genuinely distributed: no single GPU ever holds the full layer.
Cut two: split the sequence of layers¶
Tensor parallelism splits within a layer. Pipeline parallelism splits across layers. Our 70B model has ~80 transformer layers. Assign layers 1–20 to GPU 0, 21–40 to GPU 1, 41–60 to GPU 2, 61–80 to GPU 3. A batch enters GPU 0, flows through its 20 layers, the activations cross to GPU 1, and so on. Each GPU holds only its 20 layers — a quarter of the parameters and a quarter of the optimizer state.
The naive version of this is a disaster, and seeing the disaster is the point. Feed one batch in. GPU 0 computes layers 1–20 while GPUs 1, 2, 3 sit idle. Then GPU 1 computes while 0, 2, 3 idle. The forward pass walks down the pipeline lighting up one GPU at a time, then the backward pass walks back up — again one GPU at a time. Three out of every four GPUs are idle at any instant.
NAIVE PIPELINE (4 stages, 1 batch), time flows right →
──────────────────────────────────────────────────────
GPU0  [F0]                        [B0]
GPU1      [F1]                [B1]
GPU2          [F2]        [B2]
GPU3              [F3][B3]
      └── only ONE GPU busy at any column ──┘
utilization ≈ 1/4. The other 3/4 is the BUBBLE.
This idle time is the pipeline bubble — GPUs waiting for work to reach them while the pipeline fills, and waiting again as it drains. With p stages and one batch, you waste roughly (p−1)/p of your compute. Four stages: 75% wasted. The whole point of pipeline parallelism — fitting a model too tall for one GPU — is undone if 75% of your hardware idles.
So the real problem is not that layers depend on each other; they do, and no schedule can change that. The real problem is that one batch only ever occupies one stage at a time, so the other stages starve. How do we keep every stage fed?
The minimal fix: more batches in flight¶
The answer is the same instinct as an assembly line. Don't push one car through and watch every station but one idle. Push many cars, staggered, so every station always has a car. Split the batch into many micro-batches and pipe them in one after another. While GPU 0 starts micro-batch 2, GPU 1 is working micro-batch 1. Once the pipeline is full, every stage is busy on a different micro-batch.
PIPELINED with 4 micro-batches (GPipe-style fill/drain), time →
────────────────────────────────────────────────────────────────
GPU0  [F1][F2][F3][F4]                        ...backward...
GPU1      [F1][F2][F3][F4]
GPU2          [F1][F2][F3][F4]
GPU3              [F1][F2][F3][F4][B4][B3][B2][B1]
      └fill┘ └─── all 4 GPUs busy ───┘ └drain┘
bubble shrinks from (p-1)/p toward (p-1)/(m+p-1), m = micro-batches
The bubble fraction drops from (p−1)/p to (p−1)/(m+p−1), where m is the number of micro-batches. With p=4 stages and m=1, that's 3/4 idle. With m=16 micro-batches, it's 3/19 ≈ 16%. With m=64, under 5%. The fill and drain are fixed costs; the more micro-batches you push between them, the more that fixed cost amortizes. This is amortization in its purest form — pay the pipeline-fill overhead once, spread it over many micro-batches.
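The formula can be cross-checked against a small event simulation. The sketch below assumes uniform stages, instantaneous communication, and made-up forward/backward costs (`t_fwd`, `t_bwd`); the function names are hypothetical. It replays the all-forward-then-all-backward schedule and measures the idle fraction directly.

```python
def gpipe_makespan(p, m, t_fwd=1.0, t_bwd=2.0):
    """Simulate all-forward-then-all-backward on p uniform stages, m micro-batches."""
    free = [0.0] * p                              # time at which each stage is next idle
    fwd_done = [[0.0] * m for _ in range(p)]
    for j in range(m):                            # forwards, micro-batch by micro-batch
        for s in range(p):
            ready = fwd_done[s - 1][j] if s > 0 else 0.0
            start = max(free[s], ready)
            fwd_done[s][j] = free[s] = start + t_fwd
    bwd_done = [[0.0] * m for _ in range(p)]
    for j in reversed(range(m)):                  # backwards, last micro-batch first
        for s in reversed(range(p)):
            ready = bwd_done[s + 1][j] if s < p - 1 else fwd_done[s][j]
            start = max(free[s], ready)
            bwd_done[s][j] = free[s] = start + t_bwd
    return max(free)

def simulated_bubble(p, m, t_fwd=1.0, t_bwd=2.0):
    busy = m * (t_fwd + t_bwd)                    # useful work per stage
    return 1.0 - busy / gpipe_makespan(p, m, t_fwd, t_bwd)

def bubble_fraction(p, m):
    return (p - 1) / (m + p - 1)

for m in (1, 4, 16, 64):
    print(f"m={m:>2}  formula={bubble_fraction(4, m):.3f}  "
          f"simulated={simulated_bubble(4, m):.3f}")
```

Under these assumptions the simulated idle fraction matches (p−1)/(m+p−1) for every m; real runs deviate once stages are uneven or transfers take time.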
Mini-FAQ. "Why not just use thousands of micro-batches and drive the bubble to zero?" Two limits. First, tiny micro-batches mean tiny matrix multiplies that underutilize each GPU's tensor cores — you trade bubble for low arithmetic intensity. Second, the GPipe schedule keeps every micro-batch's activations alive until its backward pass runs, so more micro-batches means more activation memory. The bubble and activation memory pull against each other, which is exactly what 1F1B fixes next.
The 1F1B schedule — same bubble, far less memory¶
GPipe runs all forwards, then all backwards. That keeps m micro-batches' worth of activations in memory simultaneously — on the last stage, every micro-batch's activations pile up before any backward frees them. For large m that activation pile becomes its own memory wall.
The fix is 1F1B (one-forward-one-backward), the schedule Megatron-LM uses. Once the pipeline is full, each stage alternates: do one forward for a new micro-batch, then immediately one backward for the oldest in-flight micro-batch. The backward frees that micro-batch's activations right away, so the number of activations held at once is bounded by the pipeline depth p, not the micro-batch count m.
1F1B steady state on one stage — time →
───────────────────────────────────────
[F1][F2][F3][F4][B1][F5][B2][F6][B3]...
└ warm-up ┘ └── steady: F then B, F then B ──┘
activations alive at once ≈ p (pipeline depth), NOT m (micro-batch count)
bubble fraction: same (p-1)/(m+p-1) as GPipe — but bounded memory
The win to internalize: 1F1B has the same bubble as GPipe but bounds activation memory to O(p) instead of O(m). It decouples the two pressures GPipe coupled. You can now push m high to shrink the bubble without the activation pile exploding. Megatron's interleaved 1F1B goes further: it gives each GPU v non-contiguous chunks of layers (e.g. with v=2, GPU 0 holds layers 1–10 and 41–50), so a micro-batch visits every GPU v times. Each visit runs a chunk only 1/v as long, so the fill and drain ramps take about 1/v of the time and the bubble fraction falls to roughly (p−1)/(v·m+p−1). The price is about v times as many cross-stage transfers and a bit more activation memory.
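The layer-to-GPU mapping for both the plain and the interleaved layout can be sketched in a few lines. This is an illustrative helper, not Megatron's code: the function name `stage_layers` and the parameter `v` (chunks per GPU) are assumptions, and it only handles layer counts that divide evenly.

```python
def stage_layers(n_layers, p, v=1):
    """Map each of p GPUs to its layer ranges (1-indexed, inclusive).

    v is the number of chunks per GPU; v=1 is the plain contiguous split.
    """
    n_chunks = p * v
    if n_layers % n_chunks:
        raise ValueError("this sketch assumes layers divide evenly into chunks")
    size = n_layers // n_chunks
    chunks = [(c * size + 1, (c + 1) * size) for c in range(n_chunks)]
    # GPU g takes chunks g, g+p, g+2p, ... so consecutive chunks sit on consecutive GPUs.
    return {g: [chunks[g + k * p] for k in range(v)] for g in range(p)}

print(stage_layers(80, 4))        # contiguous: GPU 0 holds layers 1-20
print(stage_layers(80, 4, v=2))   # interleaved: GPU 0 holds layers 1-10 and 41-50
```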
For our 70B run: 4 pipeline stages of 20 layers each, with m=32 micro-batches under 1F1B, gives a bubble of 3/35 ≈ 8.6% while keeping at most ~4 micro-batches' activations live per stage. That is a usable pipeline.
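The memory claim can be checked by writing out the order of operations on one stage and counting how many micro-batches have run forward but not yet backward. The sketch below is a simplified model with hypothetical function names: it tracks ordering only (no timing) and treats each micro-batch's activations as one unit.

```python
def one_f_one_b_order(stage, p, m):
    """Operation order on one stage (0-indexed) under non-interleaved 1F1B."""
    warmup = min(p - stage - 1, m)              # forwards issued before the first backward
    ops = [("F", j) for j in range(warmup)]
    for k in range(m - warmup):                 # steady state: one forward, one backward
        ops.append(("F", warmup + k))
        ops.append(("B", k))
    ops += [("B", j) for j in range(m - warmup, m)]   # cool-down: drain what is left
    return ops

def gpipe_order(m):
    """All forwards, then all backwards (last micro-batch first)."""
    return [("F", j) for j in range(m)] + [("B", j) for j in reversed(range(m))]

def peak_live(ops):
    """Largest number of micro-batches whose activations are held at once."""
    live = peak = 0
    for kind, _ in ops:
        live += 1 if kind == "F" else -1
        peak = max(peak, live)
    return peak

p, m = 4, 32
for stage in range(p):
    print(f"stage {stage}: 1F1B peak = {peak_live(one_f_one_b_order(stage, p, m))}, "
          f"GPipe peak = {peak_live(gpipe_order(m))}")
```

In this model the first stage peaks at p live micro-batches and later stages hold fewer, while GPipe holds all m on every stage.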
The picture: two cuts, two costs¶
The canonical mental model for this file — the same layer stack, cut two different ways, each paying a different tax:
TENSOR PARALLELISM (cut inside a layer)       PIPELINE PARALLELISM (cut across layers)
════════════════════════════════════════      ══════════════════════════════════════════
one layer, split across GPUs                  layers split across GPUs, batch streamed
┌─────────layer L─────────┐                   GPU0: layers  1–20 ─┐
│ GPU0  GPU1  GPU2  GPU3  │ ← every GPU       GPU1: layers 21–40  │ micro-batches flow
│  ¼W    ¼W    ¼W    ¼W   │   holds ¼ of      GPU2: layers 41–60  │ down, activations
│   └───all-reduce────┘   │   THIS layer      GPU3: layers 61–80 ─┘ cross stage→stage
└─────────────────────────┘
TAX: all-reduce INSIDE every                  TAX: the BUBBLE (idle fill/drain time),
block, fwd + bwd. Very                        shrunk by micro-batching to
chatty → must stay in-node.                   (p-1)/(m+p-1), never zero.
Read the asymmetry directly. Tensor parallelism communicates a lot (an all-reduce inside every block, every pass) but the activations never leave the layer — its cost is bandwidth, paid continuously. Pipeline parallelism communicates little (just the activations crossing a stage boundary, point-to-point) but wastes GPU time at the edges — its cost is idle compute, paid at fill/drain. They fail differently, which is exactly why a real run uses both, placed where each tax is cheapest. That placement is the next file.
Why tensor parallelism, not just more pipeline stages¶
When one layer won't fit, you might reach only for pipeline parallelism — more, thinner stages. But pipeline parallelism splits across layers; it cannot make a single layer smaller. If one layer overflows one GPU, no number of pipeline stages helps, because the smallest pipeline unit is a whole layer. You must split inside the layer, which is tensor parallelism's job.
And why not use tensor parallelism for everything, skipping pipeline entirely? Because tensor parallelism's all-reduce fires inside every block, in both passes — its communication volume is enormous and constant. On the 900 GB/s NVLink inside one node, an 8-way tensor-parallel layer syncs comfortably. Cross a node boundary onto InfiniBand and that per-block all-reduce, now riding a wire ~10× slower, throttles every layer. Tensor parallelism is bandwidth-hungry and must stay inside a node; pipeline parallelism crosses node boundaries cheaply because it only ships activations point-to-point at stage edges. Match the cut to the wire: TP on the fast in-node fabric, PP across the slow inter-node fabric.
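A back-of-envelope estimate makes the wire sensitivity concrete. Everything numeric here is an assumption for illustration: the activation shape (4096 tokens × 8192 hidden in 2-byte values), the bandwidth-only ring all-reduce model that ignores latency and protocol overhead, and the inter-node figure taken as one tenth of the 900 GB/s quoted above.

```python
def ring_allreduce_seconds(payload_bytes, n_gpus, bandwidth_bytes_per_s):
    # Bandwidth term of a ring all-reduce: each GPU sends 2*(n-1)/n of the payload.
    return 2 * (n_gpus - 1) / n_gpus * payload_bytes / bandwidth_bytes_per_s

# Assumed shapes, not measured ones.
tokens, hidden, bytes_per_value = 4096, 8192, 2
payload = tokens * hidden * bytes_per_value        # one block's output activations

nvlink = 900e9                 # bytes/s, the in-node figure used in the text
inter_node = nvlink / 10       # the "~10x slower" wire from the text

# 80 layers x 2 blocks (attention, MLP) x 2 passes (forward, backward)
allreduces_per_micro_batch = 80 * 2 * 2

t_nvlink = ring_allreduce_seconds(payload, 8, nvlink)
t_inter = ring_allreduce_seconds(payload, 8, inter_node)

print(f"per all-reduce:  in-node {t_nvlink * 1e3:.3f} ms, inter-node {t_inter * 1e3:.3f} ms")
print(f"per micro-batch: in-node {t_nvlink * allreduces_per_micro_batch * 1e3:.1f} ms, "
      f"inter-node {t_inter * allreduces_per_micro_batch * 1e3:.1f} ms")
```

Because none of this time can hide behind compute, the factor of ten on the wire lands directly on step time in this model.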
Teacher voice. This is the seed of the whole next file. Tensor parallelism's tax is bandwidth, so it lives where bandwidth is cheapest — inside the node. Pipeline parallelism's tax is latency at the edges, which it pays rarely, so it tolerates the slow inter-node wire. You don't choose one cut; you assign each cut to the layer of the network where its tax hurts least.
A worked comparison on the 70B model¶
Put concrete numbers on the three model-splitting strategies for the 70B run, holding the cluster at 8 nodes × 8 H100s = 64 GPUs:
| Strategy | What's split | Per-GPU param share | Dominant comm per step | First thing that breaks at scale |
|---|---|---|---|---|
| ZeRO-3 / FSDP | state, gathered whole layers | 1/64 (gathered to full layer) | per-layer all-gather, fwd+bwd | one layer exceeds a GPU |
| Tensor parallel (8-way) | the matrices inside each layer | 1/8 of every matrix | all-reduce inside every block | crossing a node boundary |
| Pipeline parallel (4-way) | the stack of layers | 1/4 of the layers | activations across stage edges | the bubble at low micro-batch count |
The reading: ZeRO-3 frees the most memory per GPU but breaks when a layer is too big. Tensor parallelism splits the layer but is too chatty to leave a node. Pipeline parallelism crosses nodes cheaply but wastes compute in the bubble. No single cut is sufficient at frontier scale — which is the synthesis the next file builds.
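The per-GPU share column can be turned into rough memory figures. The sketch below assumes the 16 bytes-per-parameter accounting for weights, gradients, and optimizer state (the 16N figure from the ZeRO file), ignores activations entirely, and treats each strategy as used alone with the remaining GPUs holding plain replicas. The dictionary labels and `state_gb` are hypothetical names.

```python
PARAMS = 70e9
BYTES_PER_PARAM = 16     # assumed: weights + gradients + optimizer state
GPU_GB = 80

def state_gb(shard_factor):
    """Model-state gigabytes per GPU when state is divided shard_factor ways."""
    return PARAMS * BYTES_PER_PARAM / shard_factor / 1e9

plans = {
    "ZeRO-3 over 64 GPUs": 64,
    "tensor parallel 8-way, alone": 8,
    "pipeline parallel 4-way, alone": 4,
}

for name, factor in plans.items():
    gb = state_gb(factor)
    verdict = "fits" if gb <= GPU_GB else "does not fit"
    print(f"{name:<32} {gb:7.1f} GB per GPU  ({verdict} in {GPU_GB} GB)")
```

Under these assumptions only the ZeRO-3 row fits an 80 GB card on model state alone, which is one more way to read the table: each cut needs the others to cover what it leaves unsharded.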
Operational signals — bubbles, syncs, and starved stages¶
Healthy behavior. Under tensor parallelism: all-reduce kernels inside each block overlap nothing (they're on the critical path) but complete fast on NVLink, and per-GPU MFU stays high. Under pipeline parallelism with 1F1B: a profiler timeline shows a short fill ramp, a long dense steady-state where every stage is busy, and a short drain — the bubble visible only at the edges.
First metric to degrade. For pipeline: the bubble fraction, which you compute as (p−1)/(m+p−1). If a profiler shows long idle gaps in the middle of the steady state (not just edges), micro-batches are too few or stages are load-imbalanced — one stage's 20 layers are heavier than another's. For tensor parallelism: step time when the TP group accidentally spans two nodes, dragging every block's all-reduce onto InfiniBand.
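The misleading metric. Average GPU utilization across the pipeline looks "fine" at, say, 80%, but that average hides the structure. With balanced stages every stage idles for about the same total time, only at different moments (the last stage waits at the very start and end of the step, the first stage waits in between). With imbalanced stages the light ones idle all through steady state while the heavy one runs flat out, and the average smears both pictures into one number. Watch the per-stage timeline, not the cluster average. The other trap: blaming low throughput on the bubble when the real culprit is uneven stage assignment (an embedding layer or LM head bolted onto one stage makes it the slow link).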
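A rough model shows how an average can hide a slow stage. The sketch below uses a common approximation, stated here as an assumption: one micro-batch's forward plus backward on a stage counts as a single unit of work, the fill/drain costs one pass through every stage, and after that the slowest stage sets the pace. The stage times are invented for illustration.

```python
def pipeline_makespan(stage_times, m):
    # One pass through every stage to fill/drain, then the slowest stage paces the rest.
    return sum(stage_times) + (m - 1) * max(stage_times)

def per_stage_utilization(stage_times, m):
    total = pipeline_makespan(stage_times, m)
    return [m * t / total for t in stage_times]

m = 32
balanced = [1.0, 1.0, 1.0, 1.0]
skewed = [0.9, 0.9, 0.9, 1.3]     # same total work; last stage also carries the LM head

for name, times in (("balanced", balanced), ("skewed", skewed)):
    util = per_stage_utilization(times, m)
    mean = sum(util) / len(util)
    per_stage = ", ".join(f"{u:.2f}" for u in util)
    print(f"{name:<9} step time {pipeline_makespan(times, m):5.1f}  "
          f"mean util {mean:.2f}  per-stage [{per_stage}]")
```

With balanced stages this model reduces to the (m+p−1) step count behind the bubble formula; with the skewed assignment the mean still looks moderate while three stages sit well below the fourth.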
The graph an expert opens first. A per-stage Gantt chart of forward/backward blocks over time (Megatron and the PyTorch profiler both emit this). The bubble is visible as the triangular white space at fill and drain; imbalance is visible as one stage's blocks being wider than the rest. You diagnose pipeline health by looking at the shape, not a number.
Boundary of applicability — where each cut stops working¶
Tensor parallelism — strong fit and pathology. Shines inside a single node on NVLink/NVSwitch, where 8-way TP syncs a per-block all-reduce in microseconds. Pathological the instant the TP group crosses a node boundary: every block now pays an inter-node all-reduce, and throughput collapses. The hard limit: TP degree is effectively capped at the number of GPUs sharing fast intra-node fabric — 8 on a DGX H100 node, more on an NVL72 rack.
Pipeline parallelism — strong fit and pathology. Shines when the model is tall (many layers) and you have enough micro-batches to amortize the bubble. Pathological when the batch is too small to split into many micro-batches (large bubble) or when stages are imbalanced (one slow stage gates the whole pipeline — head-of-line blocking). The scale limit: bubble fraction (p−1)/(m+p−1) means deep pipelines (p large) need proportionally more micro-batches to stay efficient, and micro-batches can't shrink below the point where matmuls starve the tensor cores.
The scale limit on intuition. "Add more pipeline stages to fit a bigger model" works for memory but silently raises the bubble — past some p, you've added stages faster than you can feed them, and throughput drops even though the model fits. The cure isn't more stages; it's more micro-batches, interleaving, or shifting some of the split to tensor parallelism.
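Inverting the bubble formula gives the micro-batch count a given depth needs. The sketch below solves (p−1)/(m+p−1) ≤ target for m using exact rational arithmetic; the function names and the 5% target are assumptions for illustration.

```python
import math
from fractions import Fraction

def bubble_fraction(p, m):
    return Fraction(p - 1, m + p - 1)

def min_micro_batches(p, target):
    """Smallest m with bubble_fraction(p, m) <= target."""
    target = Fraction(target)
    # (p-1)/(m+p-1) <= b  rearranges to  m >= (p-1)(1-b)/b
    return max(1, math.ceil((p - 1) * (1 - target) / target))

target = Fraction(1, 20)          # a 5% bubble budget
for p in (4, 8, 16):
    m = min_micro_batches(p, target)
    print(f"p={p:>2}: need m >= {m} micro-batches "
          f"(bubble = {float(bubble_fraction(p, m)):.3f})")
```

The required m grows linearly with p−1, which is the "add stages faster than you can feed them" trap in numbers.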
The wrong model to carry, and the right one¶
The seductive-but-wrong intuition: "tensor and pipeline parallelism are alternatives — pick the one that fits your model." They are not alternatives; they cut orthogonal dimensions and they pay different taxes on different wires. Treating them as either/or leads to placing tensor parallelism across nodes (and watching every block's all-reduce crawl on InfiniBand) or running a deep pipeline with too few micro-batches (and idling in the bubble).
The right model: tensor parallelism cuts inside a layer and must stay on the fast in-node wire; pipeline parallelism cuts across layers and rides the slow inter-node wire cheaply. They compose. The skill is not choosing between them but assigning each to the network layer where its tax is smallest — which is precisely 3D parallelism.
Other ways these cuts show up as a problem¶
- TP group spans two nodes: every block's all-reduce hits InfiniBand; step time doubles or worse. Fix: keep TP degree ≤ GPUs-per-node.
- Pipeline bubble dominates: too few micro-batches (small global batch); raise m or reduce p.
- Pipeline stage imbalance: the stage holding the embedding + LM head is heavier; rebalance layer assignment so per-stage compute is even.
- Activation memory blows up under GPipe: all-forward-then-all-backward holds m activations; switch to 1F1B to bound it to p.
- Interleaved 1F1B raises comm: more, smaller chunks cross stage boundaries more often; worth it only when the bubble reduction pays for the extra traffic.
- TP all-reduce can't overlap: unlike DDP's gradient all-reduce, the TP all-reduce is on the forward/backward critical path and can't hide behind compute; it directly adds to step time.
- Micro-batches too small: matmuls shrink below the tensor cores' efficient size; you trade bubble for low arithmetic intensity and net-lose.
Where this fits the larger systems map¶
- Failure geometry: head-of-line blocking. A slow pipeline stage gates the whole pipeline, the same shape as the straggler stalling the all-reduce ring (file 02) and a slow request blocking a pipelined connection. The slowest participant sets the pace.
- Amortization. Micro-batching amortizes the fixed pipeline-fill cost over many micro-batches, the same move as batching small writes, coalescing I/O, or spreading a fixed 16N state across GPUs (file 03). Pay a fixed cost once, spread it wide.
- Same constraint, different layer: instruction pipelines. The CPU pipeline bubble (a branch stall idling later stages) is structurally identical to the training pipeline bubble; both fill and drain, both hide it by keeping more work in flight. Same shape, hardware layer below.
- Bandwidth vs latency cost split. TP pays bandwidth (volume of all-reduce); PP pays latency (idle edges). This bandwidth-vs-latency split is the same axis you weigh choosing between chatty fine-grained RPCs and coarse batched ones.
Where this appears in production¶
Tensor parallelism in practice:
- NVIDIA Megatron-LM: the origin of the column-then-row split and the f/g conjugate all-reduce operators; the reference implementation everyone copies.
- Meta Llama 3 training: uses 8-way tensor parallelism inside each node, deliberately capped at the NVLink domain, with pipeline and data parallelism outside.
- NVIDIA NeMo / Transformer Engine: productizes Megatron TP with fused kernels and FP8, exposing TP degree as a config knob.
- DeepSpeed-Megatron (Megatron-DeepSpeed): combines ZeRO data parallelism with Megatron TP for the largest open runs (BLOOM 176B).
- Hugging Face transformers tensor-parallel API: exposes column/row-parallel linear layers so users get Megatron-style TP without hand-writing collectives.
- vLLM / TensorRT-LLM (inference): reuse the same TP split at serving time to fit a 70B model across the 8 GPUs of one node, capped at NVLink exactly as in training.
Pipeline parallelism in practice:
- GPipe (Google): introduced micro-batching to shrink the bubble; the all-forward-then-all-backward schedule.
- PipeDream / 1F1B (Microsoft): introduced the one-forward-one-backward schedule that bounds activation memory.
- Megatron-LM interleaved 1F1B: the schedule used for the largest dense runs, trading extra comm for a smaller bubble.
- Meta Llama 3 training: pipeline parallelism as the axis that crosses node groups, where its cheap point-to-point activation transfers tolerate InfiniBand.
- PyTorch torch.distributed.pipelining: native pipeline parallelism with GPipe and 1F1B schedules built in.
- DeepSpeed pipeline engine: pipeline parallelism integrated with ZeRO for combined memory and layer splitting.
- Colossal-AI: exposes both TP and PP with auto-balancing of pipeline stages.
- Zero Bubble / controllable-memory schedules (research): split the backward pass into input-grad and weight-grad halves to fill the bubble further, pushing toward near-zero idle time.
Pause and recall¶
- Why can't ZeRO-3 help when a single layer exceeds one GPU's memory?
- In Megatron tensor parallelism, why is the first MLP matrix split column-wise and the second row-wise? How many all-reduces does that cost per block per pass?
- What is the pipeline bubble, and what is the bubble fraction with p stages and m micro-batches?
- Why does pipeline parallelism with one batch waste (p−1)/p of compute?
- What does 1F1B change relative to GPipe, both to the bubble and to activation memory?
- Why must tensor parallelism stay inside a node, while pipeline parallelism can cross node boundaries?
- For the 70B run with p=4 and m=32, what is the bubble fraction?
- What's the difference between the tax tensor parallelism pays and the tax pipeline parallelism pays?
Interview Q&A¶
Q1. A model trains fine under ZeRO-3, then you widen it and it OOMs even at batch 1 on every GPU. Diagnose.
A. A single layer's parameters plus activations now exceed one GPU, so ZeRO-3, which gathers whole layers to compute them, can't help; it still has to materialize a full layer. The fix is tensor parallelism, which splits the layer's matrices across GPUs so no GPU ever holds the whole layer. Confirm by checking whether even one layer's weights alone exceed the card.
Common wrong answer to avoid: "Raise the ZeRO stage". You're already at stage 3, the highest; the problem is a layer that won't fit, which no ZeRO stage splits.
Q2. Why does Megatron split the MLP column-then-row instead of, say, both column-wise?
A. Column-splitting the first matrix lets each GPU produce a complete slice of the intermediate from the full input: no sync, and the element-wise GeLU applies per-slice cleanly. Row-splitting the second matrix makes each GPU's output a partial sum of the final result, summed by one all-reduce. The pairing costs exactly one all-reduce per block per pass. Splitting both column-wise would force an extra sync between the two matmuls to redistribute the intermediate.
Common wrong answer to avoid: "The direction doesn't matter, you all-reduce either way". The wrong split doubles the syncs; the column-then-row pairing is what makes it one.
Q3. Your 4-stage pipeline shows 80% average GPU utilization but throughput is far below 4× a single GPU. What's wrong and what do you check?
A. The average hides structure. Open the per-stage Gantt chart. Likely causes: too few micro-batches (large bubble at fill/drain — compute (p−1)/(m+p−1)), or stage imbalance where one stage (often the one holding the embedding + LM head) is heavier and gates the rest via head-of-line blocking. Fix: raise micro-batch count, or rebalance layer assignment so per-stage compute is even.
Common wrong answer to avoid: "80% is fine, the GPUs are busy" — the average smears the bubble and imbalance; you must look per-stage.
Q4. Why can't the tensor-parallel all-reduce hide behind compute like DDP's gradient all-reduce does?
A. DDP's all-reduce syncs gradients after a layer's backward, so it can overlap with computing earlier layers. The TP all-reduce sits inside a block on the forward/backward critical path: the next operation needs its result before it can run. There's nothing to overlap it with, so it adds directly to step time. That's why TP must run on the fastest wire (NVLink), where the all-reduce is cheapest.
Common wrong answer to avoid: "Use bigger buckets to overlap it". Bucketing helps DDP's off-critical-path sync, not TP's on-critical-path one.
Q5. You want a deeper pipeline (more stages) to fit a bigger model. What happens to efficiency, and how do you keep it?
A. Deeper pipeline (p up) raises the bubble fraction (p−1)/(m+p−1), so efficiency drops unless you raise m (more micro-batches) proportionally — but m can't grow without bound because tiny micro-batches starve the tensor cores, and GPipe's activation memory grows with m. Use 1F1B to bound activation memory to O(p), and consider interleaved 1F1B to shrink the bubble at fixed p.
Common wrong answer to avoid: "More stages always means more throughput" — past a point you add stages faster than you can feed them, and the bubble eats the gain.
Q6. (Cumulative.) A 70B run is slow. How do you tell whether it's an exposed all-gather (file 03), an exposed all-reduce in the ring (file 02), a TP all-reduce on the wrong wire, or a pipeline bubble (this file)?
A. Open the profiler timeline. Exposed FSDP all-gather: gaps before each layer's compute, memory comfortable. Exposed DDP all-reduce: gaps at the end of the backward pass, fixable with bucketing/topology. TP all-reduce on InfiniBand: uniform slowdown inside every block, and the TP group spans two nodes. Pipeline bubble: triangular idle at the fill/drain edges of a per-stage Gantt chart. The shape and where the gap sits tells you which mechanism, and therefore which file's fix applies.
Common wrong answer to avoid: "It's all just communication, increase bandwidth". The four causes need four different fixes (prefetch, bucketing, regrouping, more micro-batches); diagnosing the shape is the skill.
Design/debug exercise (10 min)¶
Step 1 — modeled example. Compute the pipeline bubble for a 4-stage pipeline at several micro-batch counts:
bubble fraction = (p-1)/(m+p-1), p = 4
m = 1 → 3/4 = 75% (naive, one batch)
m = 4 → 3/7 ≈ 43%
m = 16 → 3/19 ≈ 16%
m = 64 → 3/67 ≈ 4.5%
Going from m = 16 to m = 32 only drops the bubble from about 16% (3/19) to about 8.6% (3/35), while the activation pressure (under GPipe) doubles, which is why 1F1B's O(p) memory bound matters.
Step 2 — your turn. For the 70B running example, you split it as 4 pipeline stages of 20 layers each, with 8-way tensor parallelism inside each stage (so each stage is one 8-GPU node). That's 4 × 8 = 32 GPUs. (a) With m=32 micro-batches under 1F1B, what is the bubble fraction? (b) Which axis (TP or PP) crosses the node boundary in this layout, and why is that the right choice? (c) Roughly how many activations are live per stage under 1F1B versus GPipe?
Step 3 — reproduce from memory. Without looking, redraw the two-cuts diagram (tensor parallelism splitting inside a layer with an all-reduce; pipeline parallelism splitting across layers with the bubble), label the tax each cut pays (TP = bandwidth/in-node; PP = idle compute/edges), and write the one-line connection forward: because TP is chatty and PP is sparse-but-bursty, the next file places TP on NVLink inside a node and PP across InfiniBand between nodes.
Operational memory¶
This chapter explained what to do when ZeRO runs out of road — when a single layer is too big to fit on one GPU even with all state sharded. The important idea is that there are two orthogonal cuts through the model itself: tensor parallelism splits the matrix math inside a layer (column-then-row, one all-reduce per block), and pipeline parallelism splits the stack of layers across GPUs (streaming micro-batches to fill the bubble). Neither is a tuning of ZeRO; both cut the model where ZeRO would only gather it.
You learned to split a layer's MLP and attention across GPUs with a single all-reduce, to compute the pipeline bubble (p−1)/(m+p−1) and shrink it with micro-batches, and to use 1F1B to bound activation memory to O(p) instead of GPipe's O(m). That solves the one-giant-layer failure — the layer that overflowed one card is now genuinely distributed, and the layer stack that idled three of four GPUs now keeps them all fed.
Carry this diagnostic forward: when a layer won't fit, split inside it (tensor parallelism), never around it (pipeline). When throughput lags on a pipeline, look at the per-stage Gantt chart for the bubble shape and stage imbalance, not the cluster-average utilization. And remember the tax each cut pays — TP's bandwidth, PP's idle edges — because that tax decides which wire each cut belongs on.
Remember:
- Tensor parallelism splits the matrices inside a layer (column-then-row), costing one all-reduce per block per pass; it is bandwidth-hungry and must stay on NVLink inside a node.
- Pipeline parallelism splits the stack of layers across GPUs; the idle fill/drain is the bubble, fraction (p−1)/(m+p−1).
- Micro-batching shrinks the bubble by amortizing fill/drain; more micro-batches → smaller bubble, with diminishing returns.
- 1F1B keeps GPipe's bubble but bounds activation memory to O(p) not O(m); it decouples bubble from memory.
- TP's tax is bandwidth (paid continuously); PP's tax is idle compute (paid at edges). Different wires suit different cuts.
- When one layer won't fit, only TP helps; PP's smallest unit is a whole layer.
Bridge. We now have three cuts: data parallelism (split the batch), tensor parallelism (split inside a layer), pipeline parallelism (split across layers). Each pays a different tax — and each tax hurts more or less depending on which wire it rides. Tensor parallelism's per-block all-reduce is unbearable on InfiniBand but fine on NVLink. Pipeline parallelism's point-to-point activation transfers cross nodes cheaply. Data parallelism's gradient all-reduce sits at the outermost layer. The next file stops treating these as separate choices and composes all three into one 3D layout, then maps each axis onto the cluster's physical topology — NVLink inside the node, InfiniBand between — so every tax rides the wire where it is cheapest. → 05-3d-parallelism-and-interconnect.md