3D parallelism and interconnect — putting every cut on the right wire

~20 min read. You now own three cuts — data, tensor, pipeline — each with a different tax. The mistake that wastes half a cluster isn't choosing the wrong cut. It's choosing the right cuts and then placing them on the wrong wires: tensor parallelism stretched across InfiniBand, data parallelism's all-reduce jammed inside one node. This file composes all three into a single 3D layout and maps each axis onto the physical network so every tax rides the wire where it is cheapest.

Built on tensor and pipeline parallelism. That file ended with a promise: TP's per-block all-reduce is bandwidth-hungry and belongs on NVLink; PP's point-to-point activation transfers cross nodes cheaply. This file cashes that promise. It composes data parallelism, tensor parallelism, and pipeline parallelism into one layout and maps each onto the interconnect hierarchy — NVLink/NVSwitch inside a node, InfiniBand between — so the coordination cost of each axis lands on the fastest wire it can.

What three separate cuts left unresolved

The last three files built three ways to split a training job, and each was taught in isolation. Data parallelism (file 02) split the batch and synced gradients with an all-reduce. ZeRO/FSDP (file 03) sharded the data-parallel state. Tensor parallelism (file 04) split the math inside a layer with a per-block all-reduce. Pipeline parallelism (file 04) split the stack of layers and streamed micro-batches through it.

Each was presented as if you'd pick one. But the 70B run — and certainly the 405B runs the field actually does — needs them together. One layer is too big, so you need tensor parallelism. The stack is too tall for the GPUs left over, so you need pipeline parallelism. And you still want throughput, so you replicate the whole pipelined-and-tensor-split arrangement and add data parallelism on top. Three cuts, composed, is 3D parallelism.

The composition raises a question none of the earlier files had to answer: with 64 GPUs and three orthogonal axes, which GPUs get grouped into which axis? Get the grouping wrong and a perfectly correct 3D layout runs at half speed — because each axis has a different communication tax, and the cluster has wires of wildly different speeds. This file is about matching tax to wire.

What this file solves

A 70B run on 64 GPUs can be split three ways at once — data × tensor × pipeline — but the layout of those axes onto physical GPUs decides whether it runs near peak or at half speed. This file shows how to compose the three axes into one grid where total_GPUs = DP × TP × PP, why the per-block tensor-parallel all-reduce must stay inside a single NVLink node while pipeline and data parallelism can cross the slower InfiniBand fabric, and how to read a topology so you place the chattiest axis on the fastest wire. The concrete move: map TP to the 8 GPUs of one node, PP across nodes, DP across pipeline replicas.

The interconnect is not one wire — it's a hierarchy

Every decision in this file rests on one hardware fact: the wires between GPUs are not uniform. Inside a single DGX H100 node, the eight GPUs are connected by NVLink 4.0 through NVSwitch, giving each GPU ~900 GB/s of aggregate bidirectional bandwidth, non-blocking, to any other GPU in the node. Step outside the node and GPUs talk over InfiniBand (NDR, ~400 Gb/s ≈ 50 GB/s per link, per direction), roughly 18× slower on these headline figures (closer to 9× if you compare per-direction rates), and routed through switches that can contend under load.

   THE INTERCONNECT HIERARCHY (one cluster)
   ═════════════════════════════════════════

   ┌─ NODE 0 ──────────────────┐      ┌─ NODE 1 ──────────────────┐
   │  G0 G1 G2 G3 G4 G5 G6 G7   │      │  G0 G1 G2 G3 G4 G5 G6 G7   │
   │   └──── NVSwitch ────┘      │      │   └──── NVSwitch ────┘      │
   │   ~900 GB/s any-to-any      │      │   ~900 GB/s any-to-any      │
   └──────────┬─────────────────┘      └──────────┬─────────────────┘
              │                                    │
              └────────── InfiniBand fabric ───────┘
                          ~50 GB/s per GPU, switched
                          (≈ 18× slower than NVLink)

That ~18× gap is the whole game. A collective that's free on NVLink can be the bottleneck on InfiniBand. So the placement rule writes itself: the axis that communicates the most must live inside the node, on NVLink; the axis that communicates the least can cross nodes, on InfiniBand. Now rank the three axes by how chatty they are.
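To make the gap concrete, here is a minimal sketch that times the same payload on each wire using only the headline bandwidths quoted above. It ignores latency, protocol overhead, and switch contention, and the 1 GB payload is an arbitrary illustrative size, not a figure from any real run.

```python
# Headline bandwidths from the text, in GB/s. Sustained throughput on real
# hardware is lower; these numbers are illustrative, not measured.
NVLINK_GBPS = 900.0
INFINIBAND_GBPS = 50.0


def transfer_ms(payload_gb: float, bandwidth_gbps: float) -> float:
    """Bandwidth-only transfer time in milliseconds (no latency term)."""
    return payload_gb / bandwidth_gbps * 1000.0


payload_gb = 1.0  # hypothetical payload size
nvlink_ms = transfer_ms(payload_gb, NVLINK_GBPS)
ib_ms = transfer_ms(payload_gb, INFINIBAND_GBPS)
gap = ib_ms / nvlink_ms  # how many times slower the cross-node wire is

print(f"NVLink: {nvlink_ms:.2f} ms  InfiniBand: {ib_ms:.2f} ms  gap: {gap:.0f}x")
```

The ratio does not depend on the payload size, which is why the same multiplier applies to every collective that lands on the wrong wire.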

Ranking the three axes by communication appetite

Each axis pays its tax at a different frequency and volume. Lay them side by side:

Axis	What it sends	When	Volume per step	Wants which wire
Tensor (TP)	all-reduce of activations	inside every block, fwd + bwd, on the critical path	high, and constant	NVLink (in-node)
Pipeline (PP)	activations point-to-point	only at stage boundaries	low, just boundary activations	InfiniBand (cross-node) tolerable
Data (DP)	all-reduce / reduce-scatter of gradients	once per step, off critical path, overlappable	high but hidden behind backward	InfiniBand tolerable (overlaps)

Tensor parallelism is the glutton. Its all-reduce fires inside every block, in both passes, and sits on the critical path with nothing to overlap it (the next op needs the result). That traffic must ride the 900 GB/s wire. Pipeline parallelism is the ascetic — it only ships a stage's boundary activations point-to-point, a small volume that tolerates the slow inter-node wire. Data parallelism moves a lot (the whole gradient) but only once per step and overlapped behind the backward pass (file 02's DDP trick), so its exposed cost is small enough to cross nodes.

Teacher voice. The placement order falls straight out of the ranking. TP is innermost — pin it to the 8 GPUs sharing one NVSwitch. PP is the middle ring — it crosses nodes but only ships boundary activations, so InfiniBand is fine. DP is outermost — its gradient all-reduce is heavy but hidden, so it too tolerates the cross-node fabric. Innermost = chattiest = fastest wire. That single sentence is most of 3D parallelism.
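The ranking can be put in rough numbers. The sketch below is a back-of-envelope estimate, not a measurement: the hidden size, sequence length, and micro-batch size are assumed values chosen for illustration, and it assumes Megatron-style tensor parallelism with two all-reduces per layer in each pass. It uses the DP=2 × PP=4 × TP=8 shape for the 70B example.

```python
# Illustrative assumptions (not taken from the text): model shape and batch.
HIDDEN = 8192
SEQ_LEN = 4096
MICRO_BATCH = 1
BYTES_BF16 = 2

LAYERS = 80      # roughly 80 layers for the 70B example
TP, PP = 8, 4
PARAMS = 70e9

# One activation tensor of shape [micro_batch, seq, hidden] in bf16.
activation_bytes = MICRO_BATCH * SEQ_LEN * HIDDEN * BYTES_BF16

# TP: 2 all-reduces per layer forward + 2 backward, for every layer this
# pipeline stage owns, repeated for every micro-batch.
layers_per_stage = LAYERS // PP
tp_bytes_per_microbatch = layers_per_stage * 4 * activation_bytes

# PP: one activation sent forward and one gradient sent back per boundary.
pp_bytes_per_microbatch = 2 * activation_bytes

# DP: each GPU's gradient share (1/(TP*PP) of the model), reduced once per step.
dp_bytes_per_step = PARAMS / (TP * PP) * BYTES_BF16

GIB = 2**30
print(f"TP per micro-batch: {tp_bytes_per_microbatch / GIB:.2f} GiB (critical path)")
print(f"PP per micro-batch: {pp_bytes_per_microbatch / GIB:.3f} GiB (boundary only)")
print(f"DP per step:        {dp_bytes_per_step / GIB:.2f} GiB (overlappable)")
```

Under these assumptions the TP payload for a single micro-batch is already larger than the DP gradient payload for the whole step, and a step runs many micro-batches. That is the sense in which TP is the glutton.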

The minimal scenario: one wrong placement halves the cluster

Make the failure concrete. Take the 70B run and split it 8-way tensor-parallel, but place the TP group across two nodes (4 GPUs in node 0, 4 in node 1) instead of within one. Now every block's all-reduce, which fires on the critical path twice per layer per pass, must cross InfiniBand. With ~80 layers, that is about 80 × 2 × 2 = 320 all-reduces per micro-batch. Each one would have completed in microseconds on NVLink; now it plods across a wire 18× slower, on the critical path, with nothing to hide behind.

   TP group placement — same 8-way TP, two layouts
   ────────────────────────────────────────────────
   WRONG: TP spans nodes          RIGHT: TP within a node
   node0: G0 G1 G2 G3 ┐           node0: G0..G7  ← all 8 TP ranks here
   node1: G4 G5 G6 G7 ┘ TP        node1: G0..G7  ← a different TP group

   every block's all-reduce       every block's all-reduce
   crosses InfiniBand (50 GB/s)   stays on NVLink (900 GB/s)
   → step time 2–4× worse         → near-peak

So the real problem is not "tensor parallelism is slow" — on the right wire it's nearly free. The real problem is that the chattiest axis was placed on the slowest wire. How do we guarantee the innermost, chattiest axis always lands inside a single NVLink domain? By choosing the 3D grid so TP degree never exceeds the GPUs-per-node and TP ranks are always co-located on one node.
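The co-location rule is easy to check before a job starts. Here is a minimal sketch of such a check. The function name and the rank-to-node dictionaries are hypothetical, and it assumes consecutive global ranks form a TP group, which depends on how your framework orders its axes.

```python
def tp_groups_spanning_nodes(rank_to_node, tp_size):
    """Return the TP groups (lists of ranks) whose members sit on >1 node.

    Assumes consecutive global ranks form one TP group; adjust the grouping
    if your framework orders ranks differently.
    """
    ranks = sorted(rank_to_node)
    spanning = []
    for start in range(0, len(ranks), tp_size):
        group = ranks[start:start + tp_size]
        if len({rank_to_node[r] for r in group}) > 1:
            spanning.append(group)
    return spanning


# RIGHT: 8 ranks per node, so each TP group of 8 shares one node.
right = {rank: rank // 8 for rank in range(16)}
# WRONG: only 4 ranks landed on each node, so every TP group straddles two.
wrong = {rank: rank // 4 for rank in range(16)}

print("right layout, spanning groups:", tp_groups_spanning_nodes(right, tp_size=8))
print("wrong layout, spanning groups:", tp_groups_spanning_nodes(wrong, tp_size=8))
```

An empty result means every TP group lives inside one node. Anything else is the misplacement described above.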

The 3D grid — how the axes compose

The composition is a product. With G total GPUs, you pick three factors:

   G = DP × TP × PP

   for the 70B run on 64 GPUs, a typical choice:
     TP = 8   (one full node — the 8 NVLink-connected GPUs)
     PP = 4   (4 pipeline stages, each stage = one TP node)
     DP = 2   (two replicas of the whole 8×4 = 32-GPU arrangement)
   ──────────────────────────────────────────────────────────
     8 × 4 × 2 = 64 GPUs  ✓

Read it as nested rings. Innermost: 8 GPUs in one node form a TP group — they split each layer's matrices and all-reduce on NVLink. Next: 4 of those TP-nodes form a pipeline — stage 0 is node A, stage 1 node B, and so on, with activations crossing InfiniBand between stages. Outermost: the whole 4-node pipeline is replicated into 2 data-parallel copies, whose gradients all-reduce across nodes, overlapped behind the backward pass.
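The nested rings can be sketched as a mapping from a global rank to its three coordinates. This is one possible ordering, with TP varying fastest so that consecutive ranks share a node. Real frameworks may order the outer axes differently, and the names `Coord`, `rank_to_coord`, and `node_of` are hypothetical helpers for illustration.

```python
from typing import NamedTuple


class Coord(NamedTuple):
    dp: int  # which data-parallel replica
    pp: int  # which pipeline stage
    tp: int  # which tensor-parallel shard inside the stage


def rank_to_coord(rank: int, dp: int = 2, pp: int = 4, tp: int = 8) -> Coord:
    """Map a global rank to (dp, pp, tp), with TP as the fastest-varying axis."""
    if not 0 <= rank < dp * pp * tp:
        raise ValueError(f"rank {rank} outside a {dp}x{pp}x{tp} grid")
    return Coord(dp=rank // (tp * pp), pp=(rank // tp) % pp, tp=rank % tp)


def node_of(rank: int, gpus_per_node: int = 8) -> int:
    """Physical node index, assuming ranks are packed onto nodes in order."""
    return rank // gpus_per_node


for rank in (0, 7, 8, 32, 63):
    print(rank, rank_to_coord(rank), "node", node_of(rank))
```

With TP equal to the node size, every rank that shares a (dp, pp) pair also shares a node, so the TP all-reduce never leaves NVLink.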

   3D PARALLELISM LAYOUT — 64 GPUs as DP(2) × PP(4) × TP(8)
   ════════════════════════════════════════════════════════

   DP replica 0                          DP replica 1
   ┌──────────────────────────────┐      ┌──────────────────────────────┐
   │ PP0  [node: 8-GPU TP group]   │      │ PP0  [node: 8-GPU TP group]   │
   │  │   layers 1–20 (TP: NVLink) │      │  │   layers 1–20              │
   │  ▼   (InfiniBand activations) │      │  ▼                            │
   │ PP1  [node]  layers 21–40     │      │ PP1  [node]  layers 21–40     │
   │  │                            │      │  │                            │
   │  ▼   (InfiniBand activations) │      │  ▼                            │
   │ PP2  [node]  layers 41–60     │      │ PP2  [node]  layers 41–60     │
   │  │                            │      │  │                            │
   │  ▼                            │      │  ▼                            │
   │ PP3  [node]  layers 61–80     │      │ PP3  [node]  layers 61–80     │
   └──────────────────────────────┘      └──────────────────────────────┘
         └──── DP gradient all-reduce across the two replicas (InfiniBand, overlapped) ────┘

   innermost = TP = NVLink (chattiest)
   middle    = PP = InfiniBand point-to-point (sparse)
   outermost = DP = InfiniBand all-reduce (heavy but hidden)

This is the canonical mental model for the file. Three nested rings, each on the wire its tax can afford. Hold this picture and the rest is bookkeeping.

The 70B run, fully laid out

Thread the running example to its full multi-GPU resolution. The 70B model that crashed at step zero on one GPU (file 01) now runs across 64 H100s as DP=2 × PP=4 × TP=8:

   per-GPU parameter share:
     TP splits each layer 8 ways    → 1/8 of each layer's weights
     PP gives each stage 20 layers  → 1/4 of the layers
     so each GPU holds 1/8 × 1/4 = 1/32 of the model's parameters
     (DP=2 replicates, doesn't reduce per-GPU params)

   70B params → 70e9 / 32 ≈ 2.2B params per GPU
   in bf16: 2.2B params × 2 bytes ≈ 4.4 GB of params, + grads + sharded optimizer state
   comfortably inside 80 GB, with room for activations
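The same arithmetic as a minimal sketch, extended with the gradient and optimizer terms. The 12 bytes per parameter for optimizer state (fp32 master weights plus Adam momentum and variance) and the choice to shard that state across the DP replicas are assumptions made for illustration. Activations are left out because they depend on sequence length and micro-batch size.

```python
PARAMS = 70e9
DP, PP, TP = 2, 4, 8
GPU_MEMORY_GB = 80

# TP and PP divide the parameters; DP only replicates them.
params_per_gpu = PARAMS / (TP * PP)

param_gb = params_per_gpu * 2 / 1e9   # bf16 weights, 2 bytes each
grad_gb = params_per_gpu * 2 / 1e9    # bf16 gradients, 2 bytes each

# Assumption: mixed-precision Adam keeps 12 bytes/param of fp32 state
# (master weights + momentum + variance), sharded across the DP replicas.
optimizer_gb = params_per_gpu * 12 / DP / 1e9

static_gb = param_gb + grad_gb + optimizer_gb
headroom_gb = GPU_MEMORY_GB - static_gb  # left for activations and buffers

print(f"params per GPU: {params_per_gpu / 1e9:.2f}B")
print(f"weights {param_gb:.2f} GB + grads {grad_gb:.2f} GB "
      f"+ optimizer {optimizer_gb:.2f} GB = {static_gb:.2f} GB")
print(f"headroom on an {GPU_MEMORY_GB} GB GPU: {headroom_gb:.2f} GB")
```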

The per-step communication, now placed correctly:

TP all-reduce: inside each node, on NVLink — fast, but on the critical path inside every block.
PP activations: 3 inter-node hops (between the 4 stages), point-to-point, small volume, overlapped with the next micro-batch's compute under 1F1B.
DP gradient all-reduce: across the 2 replicas, over InfiniBand, but hidden behind the backward pass.

The result: a 70B model trains at high MFU because every heavy collective is on a wire it can afford. Move the TP group across nodes and that same layout collapses to half throughput. Same math, same FLOPs, same model — only the placement changed.
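A rough sense of what the move costs can be had from a bandwidth-only ring all-reduce model, in which each rank transfers 2(N−1)/N of the payload. The sketch below is an estimate under stated assumptions, not a benchmark: the activation shape (micro-batch 1, sequence 4096, hidden 8192, bf16) is hypothetical, latency is ignored, and the cross-node case simply assumes the whole ring runs at InfiniBand speed.

```python
def ring_allreduce_ms(payload_bytes: float, group_size: int, bandwidth_gbps: float) -> float:
    """Bandwidth-only ring all-reduce time in ms: 2*(N-1)/N of the payload per rank."""
    traffic_bytes = 2 * (group_size - 1) / group_size * payload_bytes
    return traffic_bytes / (bandwidth_gbps * 1e9) * 1e3


# Hypothetical activation tensor: [1, 4096, 8192] in bf16.
ACTIVATION_BYTES = 1 * 4096 * 8192 * 2
TP = 8
# 20 layers per stage x (2 forward + 2 backward) all-reduces per layer.
CALLS_PER_MICROBATCH = 20 * 4

per_call_ms = {
    wire: ring_allreduce_ms(ACTIVATION_BYTES, TP, gbps)
    for wire, gbps in (("NVLink", 900.0), ("InfiniBand", 50.0))
}

for wire, ms in per_call_ms.items():
    total = ms * CALLS_PER_MICROBATCH
    print(f"{wire:10s} {ms:.3f} ms per all-reduce, {total:.1f} ms per micro-batch")
```

Because TP sits on the critical path, the per-micro-batch total on the slow wire is added directly to step time rather than hidden behind compute.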

Why not just one big data-parallel-with-FSDP run?

A reasonable alternative: skip TP and PP, use FSDP (ZeRO-3) across all 64 GPUs, which file 03 showed fits the 70B model. Why bother with 3D? Two reasons, both about the wire.

First, FSDP's per-layer all-gather, like TP's all-reduce, is heavy, and across 64 GPUs spanning 8 nodes those all-gathers cross InfiniBand. FSDP works when the all-gathers hide behind compute, but as the data-parallel group grows past a node, the cross-node all-gathers start to be exposed, meaning they take longer than the compute meant to hide them. 3D parallelism keeps the heaviest per-layer traffic (TP) inside a node, where FSDP would have spread it across all nodes.

Second, for the truly large models (175B, 405B), even a single layer doesn't fit one GPU — and FSDP can't split a layer (file 04). You need TP. Once you have TP inside the node, PP across nodes is the natural way to add the remaining layers without inflating the chatty TP group. So 3D isn't a competitor to FSDP; the modern recipe often combines them — TP inside the node, PP across some nodes, and FSDP/ZeRO as the data-parallel axis on top. The decision is set by model shape and node size: small enough to fit a layer and a short pipeline → FSDP alone; layer too big or model too tall → add TP and PP.

Mini-FAQ. "How do I pick the actual numbers for DP, TP, PP?" Work inside-out. Set TP = GPUs per node (8 on H100 nodes) so the chatty all-reduce stays on NVLink — but no higher, because crossing the node kills it. Set PP to whatever depth makes each stage's layers fit a node's memory, keeping enough micro-batches that the bubble (p−1)/(m+p−1) stays small. DP = remaining GPUs / (TP × PP) for throughput. TP is capped by node size, PP by the bubble, DP by the large-batch convergence wall.
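The inside-out recipe can be sketched in a few lines. The helper names are hypothetical, the pipeline depth is taken as an input rather than derived from memory, and the micro-batch counts are illustrative.

```python
def bubble_fraction(p: int, m: int) -> float:
    """Pipeline bubble (p - 1) / (m + p - 1) for p stages and m micro-batches."""
    return (p - 1) / (m + p - 1)


def size_grid(total_gpus: int, gpus_per_node: int, pp: int) -> tuple:
    """Inside-out sizing: TP fills the node, PP is given, DP takes the rest."""
    tp = gpus_per_node
    if total_gpus % (tp * pp) != 0:
        raise ValueError(f"{total_gpus} GPUs do not divide into TP={tp} x PP={pp}")
    dp = total_gpus // (tp * pp)
    return dp, tp, pp


dp, tp, pp = size_grid(total_gpus=64, gpus_per_node=8, pp=4)
print(f"DP={dp} x TP={tp} x PP={pp} = {dp * tp * pp} GPUs")

# The bubble shrinks as the number of micro-batches grows.
for m in (4, 16, 32):
    print(f"p={pp}, m={m}: bubble = {bubble_fraction(pp, m):.1%}")
```

With 4 stages, going from 4 to 32 micro-batches takes the bubble from 3/7 of the step down to 3/35, which is why PP depth and micro-batch count have to be chosen together.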

Operational signals — is every axis on the right wire?

Healthy behavior. A profiler timeline shows TP all-reduce kernels completing in microseconds (NVLink), PP activation sends overlapping the next micro-batch's compute (a dense 1F1B steady state), and the DP gradient all-reduce hidden behind the backward pass. MFU is high and stable; per-GPU memory is flat at ~model/32 (for the DP=2 × PP=4 × TP=8 layout) plus activations.

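First metric to degrade. Step time when the TP group accidentally straddles a node, the single most common 3D misconfiguration. The symptom: every block slows uniformly, and the rank-to-host mapping shows the ranks of one TP group split across two nodes. Note that nvidia-smi topo -m only describes the links inside the node it runs on, so use it to confirm which local GPUs share NVLink, and use the launcher's rank assignment or NCCL's debug log (NCCL_DEBUG=INFO) to see which ranks landed on which host. The second: the DP all-reduce stops hiding when the data-parallel group spans many nodes and the gradient is large.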

The misleading metric. Aggregate cluster utilization looks high while MFU is mediocre, because GPUs are "busy" running cross-node TP all-reduce kernels instead of matmuls — the same trap as file 02, now at the TP layer. Watch MFU and the per-axis collective timeline, not raw utilization. Another trap: blaming the model or the optimizer for a slowdown that is purely a topology placement error.

The graph an expert opens first. The NCCL/profiler view annotated with which physical link each collective rode (Nsight Systems shows NVLink vs network transfers separately). The diagnostic question: is any TP all-reduce on a network link? If yes, the TP group spans nodes — regroup. Is any PP activation transfer unexpectedly large? Then stage boundaries cut a fat tensor — rebalance.

Boundary of applicability — where the layout rules bend

Where 3D shines. Frontier-scale dense transformers on hierarchical clusters (DGX nodes + InfiniBand) — the 70B-to-405B regime, where one layer needs TP, the stack needs PP, and throughput needs DP. The placement rule (chattiest axis innermost) holds tightly here.

Where it becomes pathological. On flat topologies where every GPU has equal bandwidth to every other (some TPU pod configurations, or a single NVL72 rack where 72 GPUs share NVLink), the hierarchy assumption weakens — TP can extend beyond 8 because the "node boundary" is much larger or absent. The rule "TP ≤ 8" is really "TP ≤ the NVLink domain size," which is 8 on a DGX H100 but 72 on a GB200 NVL72. Hard-coding 8 leaves performance on the table on newer racks.

The scale limit on intuition. "More data parallelism is free throughput" breaks twice: the global batch grows with DP and eventually hits the large-batch convergence wall, and the DP all-reduce stops hiding when the group spans too many nodes. "Deeper pipelines fit bigger models for free" breaks on the bubble. The composition has three independent ceilings — TP capped by node size, PP capped by the bubble, DP capped by batch convergence — and a frontier run sits near all three at once.

The wrong model to carry, and the right one

The seductive-but-wrong intuition: "pick the parallelism strategy that fits your model." At frontier scale there is no single strategy — you compose all three, and the grouping onto hardware matters as much as the split itself. Teams that treat 3D as "choose one axis" either can't fit the model (DP/FSDP alone when a layer is too big) or run at half speed (TP placed across nodes).

The right model: 3D parallelism is one layout, G = DP × TP × PP, and the layout's job is to put each axis's communication tax on the wire it can afford — chattiest (TP) innermost on NVLink, sparsest (PP) and hidden (DP) crossing nodes. The skill is not choosing an axis; it's placing the grid so no heavy collective ever rides the slow wire.

Other ways the layout shows up as a problem

TP group spans a node boundary — uniform per-block slowdown; the #1 misconfiguration. Regroup so TP ranks share one NVSwitch.
DP all-reduce exposed — the data-parallel group spans too many nodes; raise micro-batch/compute per step or shrink DP relative to PP.
Pipeline stage straddles nodes mid-stage — a single stage's layers split across nodes so its internal activations cross InfiniBand; align stage boundaries to node boundaries.
TP degree under node size — using TP=4 on an 8-GPU node wastes half the NVLink domain; bump TP to fill the node before adding PP.
Global batch too large — DP × micro-batch × gradient-accumulation pushed the effective batch past the convergence wall; loss plateaus or diverges despite a correct layout.
NCCL picks a bad ring — without topology hints, the collective routes across slow links; set NCCL_TOPO_FILE / process placement so NCCL knows the hierarchy.
Uneven node health — one node's NVLink degraded (a bad cable), so its TP group is slower and gates the pipeline stage it hosts (head-of-line blocking, revisited in file 07).
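Several of the items above can be caught by a static check before launch. The sketch below is a hypothetical lint function, not part of any framework. It assumes consecutive ranks form a TP group, and `max_global_batch` stands in for a convergence limit you would have to determine empirically for your model.

```python
def lint_layout(dp, tp, pp, gpus_per_node, micro_batch, grad_accum, max_global_batch):
    """Return a list of warnings for a DP x TP x PP layout (empty if none)."""
    warnings = []

    if tp > gpus_per_node:
        warnings.append("TP spans nodes: per-block all-reduce will cross InfiniBand")
    elif gpus_per_node % tp != 0:
        warnings.append("TP does not divide node size: some TP group straddles a node")
    elif tp < gpus_per_node:
        warnings.append("TP under node size: part of the NVLink domain is unused by TP")

    # Effective batch in sequences per optimizer step.
    global_batch = dp * micro_batch * grad_accum
    if global_batch > max_global_batch:
        warnings.append(
            f"global batch {global_batch} exceeds assumed limit {max_global_batch}"
        )

    return warnings


# The 64-GPU layout from the running example, with illustrative batch settings.
print(lint_layout(dp=2, tp=8, pp=4, gpus_per_node=8,
                  micro_batch=1, grad_accum=32, max_global_batch=1024))
# A TP=16 proposal on 8-GPU nodes.
print(lint_layout(dp=1, tp=16, pp=4, gpus_per_node=8,
                  micro_batch=1, grad_accum=32, max_global_batch=1024))
```

Runtime problems such as a degraded NVLink or a poorly chosen NCCL ring cannot be detected this way and still need the profiler and topology tools.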

Where this fits the larger systems map

Locality optimization — same shape as cache hierarchy. Placing the chattiest axis on the fastest wire is the same instinct as keeping hot data in L1 and cold data in DRAM: match access frequency to access cost. The interconnect hierarchy is a memory hierarchy for communication.
Constraint echo — bandwidth. Every axis fights the same constraint files 02–04 fought: finite link bandwidth. 3D parallelism is the resolution — not a new mechanism but a placement of old ones so the bandwidth constraint binds on the cheapest wire.
Failure geometry — head-of-line blocking, recurring. A degraded node gating its pipeline stage is the same straggler shape as the slow GPU stalling the all-reduce ring (file 02) and the slow pipeline stage (file 04). At every layer of the hierarchy, the slowest participant sets the pace.
Same problem, different layer — NUMA placement. Pinning a TP group to one NVSwitch domain is structurally identical to pinning a thread to the NUMA node holding its memory: co-locate the chatty parties to avoid the slow interconnect. Same reasoning, OS layer below.

Where this appears in production

NVIDIA Megatron-LM — the canonical 3D implementation; its trillion-parameter runs use exactly DP × TP × PP with TP pinned inside the NVLink domain.
Meta Llama 3 (405B) — published training used 4D parallelism (TP=8 inside the node, plus pipeline, context, and data parallelism), with TP deliberately capped at the node's 8 GPUs.
Microsoft + NVIDIA Megatron-Turing NLG 530B — composed TP, PP, and DP across thousands of A100s, the reference for large 3D layouts.
BLOOM 176B (BigScience) — Megatron-DeepSpeed with TP inside nodes and pipeline + ZeRO across them.
NVIDIA NeMo — exposes tensor_model_parallel_size, pipeline_model_parallel_size, and data parallelism as config, computing the grid for you.
DeepSpeed — combines ZeRO data parallelism with Megatron TP/PP; the "3D parallelism" tutorial is the standard reference.
PyTorch DeviceMesh / torch.distributed — represents the 3D grid as a mesh so collectives are issued per-axis with topology awareness.
Google TPU pods (JAX mesh / GSPMD) — the same composition on a flatter topology, where the "node boundary" is the ICI mesh rather than NVLink.
NVIDIA NCCL topology detection — auto-discovers NVLink vs network links so collectives prefer the fast wire; NCCL_TOPO_FILE overrides it.
nvidia-smi topo -m — the command engineers run first to verify which GPUs share NVLink before assigning the TP group.
GB200 NVL72 — a 72-GPU NVLink domain that lets TP extend far past 8, changing the inside-out sizing rule.
Slurm / Kubernetes gang scheduling — places all ranks of a TP group on co-located GPUs so the layout the framework assumes matches the hardware.

Pause and recall

What is the ~18× fact about NVLink versus InfiniBand, and why does it drive every placement decision?
Rank the three axes (DP, TP, PP) by communication appetite and say which wire each wants.
Why must the tensor-parallel group stay inside a single node?
Write the 3D grid equation and a valid DP × TP × PP for 64 GPUs.
For the 70B run as DP=2 × PP=4 × TP=8, what fraction of the model's parameters does each GPU hold?
What happens to step time if the TP group is placed across two nodes, and why?
Why can pipeline and data parallelism tolerate the slow inter-node wire while tensor parallelism cannot?
How do you pick TP, PP, and DP — and what caps each one?

Interview Q&A

Q1. You have 64 H100s in 8 nodes. Lay out a 70B model in 3D and justify each axis's placement. A. TP=8 pinned to one node (the chatty per-block all-reduce stays on 900 GB/s NVLink), PP=4 across nodes (point-to-point activations tolerate InfiniBand), DP=2 replicas (gradient all-reduce is heavy but hidden behind the backward pass, so it crosses nodes fine). 8×4×2=64. The principle: chattiest axis innermost on the fastest wire, sparsest and hidden axes outermost on the slow wire. Common wrong answer to avoid: "Use TP=16 across two nodes for more layer splitting" — that drags every block's critical-path all-reduce onto InfiniBand and halves throughput.

Q2. A 3D run is at 40% MFU. Memory is fine, the math is correct. Where do you look first? A. Topology placement. Run nvidia-smi topo -m and check the NCCL link annotations in a profiler. The most likely cause is the TP group straddling a node boundary, so every block's all-reduce rides InfiniBand on the critical path — a uniform per-block slowdown. Fix by regrouping ranks so each TP group of 8 shares one NVSwitch. Second suspect: an exposed DP all-reduce because the DP group spans too many nodes. Common wrong answer to avoid: "The model is too big, add GPUs" — adding GPUs without fixing placement keeps the chatty collective on the slow wire; it's a layout bug, not a capacity bug.

Q3. Why not just use FSDP across all 64 GPUs instead of 3D parallelism? A. Two reasons. FSDP can't split a single layer, so if one layer exceeds a GPU (175B+) you need TP regardless. And FSDP's per-layer all-gather, spread across all 8 nodes, crosses InfiniBand and starts to expose as the group grows; 3D keeps the heaviest per-layer traffic (TP) inside a node. The modern recipe often combines them — TP inside the node, PP across some, FSDP as the data axis. Choose by model shape: layer fits and pipeline is short → FSDP alone; layer too big or model too tall → add TP and PP. Common wrong answer to avoid: "FSDP and 3D parallelism are mutually exclusive" — FSDP is frequently the DP axis within a 3D layout.

Q4. How do you choose the values of DP, TP, and PP, and what limits each? A. Inside-out. TP = GPUs per node (fill the NVLink domain, never exceed it). PP = the depth needed so each stage's layers fit a node's memory, kept shallow enough that the bubble (p−1)/(m+p−1) stays small given your micro-batch count. DP = remaining GPUs / (TP × PP) for throughput. Caps: TP by node/NVLink-domain size, PP by the pipeline bubble, DP by the large-batch convergence wall. Common wrong answer to avoid: "Maximize TP since it splits the layer finest" — TP above node size puts the critical-path all-reduce on the slow wire; it's capped by the NVLink domain, not by preference.

Q5. The same model and layout that ran at 90% MFU last week now runs at 50%. Nothing in the code changed. What's a likely hardware-side cause? A. A degraded link or sick GPU inside one node — a bad NVLink cable drops a TP group's all-reduce bandwidth, and since TP is on the critical path, that node's whole pipeline stage slows, gating the pipeline (head-of-line blocking). Check nvidia-smi nvlink -s for downgraded lanes and per-rank step times for a slow outlier. The scheduler may also have placed the job on a worse topology this run. Common wrong answer to avoid: "It must be a data or convergence issue" — identical code and layout with a sudden throughput drop points at hardware/topology, not the model.

Q6. (Cumulative.) Trace which file's mechanism each communication in a 3D step comes from, and which wire it rides. A. The TP all-reduce inside each block is file 04's tensor parallelism, on NVLink. The PP activation transfer at stage boundaries is file 04's pipeline parallelism, point-to-point on InfiniBand. The gradient all-reduce across DP replicas is file 02's data parallelism (or file 03's reduce-scatter + all-gather if the DP axis is FSDP), hidden behind the backward pass on InfiniBand. 3D parallelism is the placement of these earlier mechanisms, each on the wire its tax can afford. Common wrong answer to avoid: "3D parallelism is a new communication pattern" — it introduces no new collective; it composes and places the ones from files 02–04.

Design/debug exercise (10 min)

Step 1 — modeled example. Lay out a 32B model on 32 H100s (4 nodes × 8 GPUs):

   constraint: TP = GPUs per node = 8   (keep all-reduce on NVLink)
   one layer fits a node under 8-way TP? yes (32B is modest) → could even use TP=8, PP=1
   choose PP = 2 (two stages across nodes), DP = remaining
   DP = 32 / (TP=8 × PP=2) = 2
   → DP=2 × PP=2 × TP=8 = 32 ✓
   per-GPU param share = 1/(TP × PP) = 1/16 → 32B/16 = 2B params/GPU

Step 2 — your turn. For the 70B running example, you're given 128 H100s (16 nodes). (a) Propose a DP × TP × PP layout, keeping TP inside a node. (b) State which axis crosses node boundaries and which stays in-node. (c) If a teammate proposes TP=16 to split layers finer, explain in one sentence what breaks. (d) What fraction of the model's parameters does each GPU hold in your layout?

Step 3 — reproduce from memory. Without looking, redraw the nested-rings 3D layout (TP innermost on NVLink, PP middle crossing nodes, DP outermost), annotate each ring with its collective and its wire, and write the one-line placement rule: chattiest axis (TP) innermost on the fastest wire; sparsest (PP) and hidden (DP) cross nodes. Connect to file 04 in one sentence: 3D parallelism places file 04's two cuts and file 02's data parallelism, each on the wire its tax can afford.

Operational memory

This file explained why composing the three cuts is not enough: the layout onto physical hardware decides whether a correct 3D split runs near peak or at half speed. The important idea is that the interconnect is a hierarchy (NVLink ~18× faster than InfiniBand), and each parallelism axis has a different communication appetite, so you place the chattiest axis on the fastest wire: tensor parallelism innermost on NVLink, pipeline and data parallelism crossing nodes on InfiniBand.

You learned to compose G = DP × TP × PP, to rank the axes by communication appetite, and to size the grid inside-out — TP capped by node size, PP by the bubble, DP by the convergence wall. That solves the half-speed-cluster failure: the same 70B model and FLOPs run at high MFU when TP stays on NVLink, and collapse when it spans nodes.

Carry this diagnostic forward: when a correct 3D run runs slow with comfortable memory, suspect topology placement before the model — run nvidia-smi topo -m, check whether any TP all-reduce rode a network link, and regroup before touching anything else. The grouping onto hardware is as load-bearing as the split itself.

Remember:

The interconnect is a hierarchy: NVLink (~900 GB/s in-node) is ~18× faster than InfiniBand (~50 GB/s cross-node).
G = DP × TP × PP — three orthogonal axes composed into one grid.
Place the chattiest axis innermost on the fastest wire: TP on NVLink (in-node), PP and DP across nodes.
TP must not span a node — its per-block critical-path all-reduce on InfiniBand halves throughput.
Size inside-out: TP = GPUs per node, PP by the bubble, DP by throughput and the convergence wall.
A slow run with fine memory and correct math is almost always a placement bug — check nvidia-smi topo -m first.

Bridge. We can now place a 70B (or 405B) model across a cluster so every collective rides the wire it can afford, and the run hums at high MFU. But every GPU's budget is still tight — params, gradients, sharded optimizer state, and activations all competing for 80 GB, with activations the one term that explodes on long sequences. We've spread the model as far as the topology allows; the remaining headroom has to come from inside each GPU. The next file buys that headroom two ways: dropping precision (bf16/fp8) to halve or quarter the bytes per number, and recomputing activations in the backward pass instead of storing them — trading compute and precision for the memory that lets a longer sequence or a bigger micro-batch fit. → 06-mixed-precision-and-activation-recompute.md