03. NCCL collectives and interconnect — the wire between GPUs is now the wall¶

~22 min read. The 70B model doesn't fit on one card, so it lives across several. Now every transformer layer needs the cards to stop and exchange partial results before they can continue. The math inside each GPU is fast. The conversation between them is where the time goes — and how fast that conversation runs is decided by a wire you may not have looked at.

Built on the first-principles overview, roofline, and kernel fusion. The invariant is still feed the beast, but the starving unit changes scale: it is no longer a stalled SM waiting on HBM, it is a whole GPU waiting on its peers. This file introduces collective cost — time spent in all-reduce / all-gather / reduce-scatter — and the rule that interconnect topology, not the model, decides how fast collectives run.

What fusion fixed and what splitting the model now breaks¶

Roofline gave you the diagnostic: count FLOPs and bytes, name the wall. Fusion acted on it inside one GPU — keep intermediates on-chip, cross the HBM cliff once, replay launches with CUDA Graphs. Both files lived entirely inside a single card's memory hierarchy. They quietly assumed the model fits on one GPU.

It does not. Llama-3-70B in BF16 is ~140 GB of weights; an H100 holds 80 GB. The model has to be cut across at least two cards, and in practice four or eight for headroom and throughput. The moment you cut it, a new cost appears that neither roofline nor fusion can see: every layer's computation now produces a partial result on each GPU, and the GPUs must combine those partials before the next layer can run. That combination travels over a wire. This file shows what that wire is, why the all-reduce that combines partials is the dominant new cost, and why the same all-reduce can be fast or catastrophically slow depending only on whether the wire is NVLink, PCIe, or InfiniBand.

What this file solves¶

Our 70B endpoint runs tensor-parallel across GPUs, and the profile shows each GPU's kernels are efficient yet the per-token latency is far worse than a single GPU's math would predict. The gap is collective communication — all-reduce traffic between the cards. The naive read is "multi-GPU is just slower." The real cause is that the collectives are running over the wrong interconnect, or in the wrong topology, for the message sizes involved. This file teaches you to recognize collective time in a profile, to know which collective each parallelism strategy fires, and to place that traffic on the fastest wire available so the cards stop waiting on each other.

Why splitting a model forces the cards to talk¶

There are two reasons a model spans GPUs, and they create different communication.

It doesn't fit. 140 GB of weights can't live on an 80 GB card. You shard the weights. The most common intra-layer shard is tensor parallelism: each matmul's weight matrix is cut into column or row slices, each GPU holds one slice, each computes a partial output for the same tokens. To get the true layer output you must sum the partials across GPUs — an all-reduce.

It would be too slow as one stream. Even when a model fits, you may split work across GPUs for throughput. Pipeline parallelism puts different layers on different GPUs and passes activations down the line. Data parallelism replicates the model and splits the batch, then all-reduces gradients during training.

For our serving case, tensor parallelism is the one that bites, because it fires a collective inside every layer. An 80-layer model run tensor-parallel does roughly two all-reduces per layer in the forward pass — on the order of 160 all-reduces to produce activations for a single forward step. Each all-reduce is a synchronization point: every GPU stops, exchanges and sums data, and only then continues. If that exchange is slow, the fast kernels on either side of it wait.

Teacher voice. Tensor parallelism turns one big matmul into several smaller matmuls plus an all-reduce. The matmuls got cheaper — you split the work. But you bought that with a synchronization tax on every layer. Whether tensor parallelism is a win depends entirely on whether the all-reduce is cheaper than the FLOPs it saved. On NVLink it usually is. On PCIe it often is not.

The naive setup that strangles itself on PCIe¶

A team puts the 70B model on four GPUs in a server, sets tensor-parallel size to 4, and serves. Throughput is worse than they expected — barely better than two GPUs, sometimes worse. They add GPUs five through eight; it gets worse still. Multi-GPU is supposed to help. It's hurting.

The visible break: a profile shows each GPU spending a large share of every layer inside NCCL all-reduce, with the compute kernels idle on both sides of it. Adding GPUs increased the number of participants in each all-reduce, so each collective got slower, and the per-layer sync tax grew faster than the per-GPU compute savings.

Then someone checks the topology. The four GPUs are connected to the CPU over PCIe, not to each other over NVLink. Every all-reduce is crawling over a ~64 GB/s PCIe path — through the CPU root complex — instead of a 900 GB/s NVLink fabric. The collective is moving the same bytes either way; it's the wire that's 14× slower.

So the real problem is not that the model is on multiple GPUs; it is that the all-reduce traffic is on the wrong interconnect. The cards aren't slow at talking — they're whispering through a straw.

So how do we make the all-reduce as cheap as the FLOPs it saved — and how do we even know which wire our traffic is taking?

When two GPUs must agree on one sum¶

Take the smallest case. Two GPUs, tensor-parallel. Each computes a partial vector for the same tokens: GPU 0 has a, GPU 1 has b, and the layer needs a + b on both cards before the next layer runs. That is an all-reduce of two vectors.

Naively: GPU 0 sends a to GPU 1, GPU 1 computes a + b, sends it back. Two transfers of the full vector. Over NVLink at 900 GB/s a 100 MB partial moves in ~0.1 ms each way. Over PCIe at ~64 GB/s the same partial takes ~1.5 ms each way — and it routes through the CPU, adding latency. Same operation, same bytes, 14× the time, fired ~160 times per forward step. That is the difference between a multi-GPU endpoint that flies and one that grinds.

Rule: collective time is bytes ÷ the slowest link on the path¶

The collective rule. A collective's time is set by the volume of data each GPU must move and the bandwidth of the slowest link on the path it travels. The model decides the bytes; the topology decides the bandwidth. Keep tensor-parallel groups inside the highest-bandwidth domain (NVLink within a node); never let per-layer collectives cross a slow link (PCIe, or inter-node InfiniBand) if you can avoid it.

Why the rule exists. The primitive is that a partial result computed on one GPU has no value to the layer until it's combined with the partials on the others — combining requires moving bytes between cards. The constraint is that GPU-to-GPU links differ in bandwidth by more than an order of magnitude (NVLink 900 GB/s vs PCIe ~64 GB/s vs InfiniBand 50 GB/s per direction). The naive "just add GPUs" approach breaks because each added participant adds collective traffic, and if that traffic lands on a slow link, the sync tax outgrows the compute savings. Collectives are bandwidth-bound the same way decode is — but the bandwidth in question is the interconnect, not HBM.

1) The three collectives you must recognize¶

Three collective patterns cover almost everything in LLM training and serving. Each moves data between all GPUs in a group; they differ in what each GPU ends up holding.

ALL-REDUCE        every GPU contributes a value; every GPU ends with the SUM of all values
                  GPU0:a  GPU1:b  GPU2:c  GPU3:d   →   all GPUs hold (a+b+c+d)
                  (tensor parallelism's per-layer op; gradient sync in data parallelism)

ALL-GATHER        every GPU has a shard; every GPU ends with ALL shards concatenated
                  GPU0:[w0] GPU1:[w1] ... → all GPUs hold [w0|w1|w2|w3]
                  (gather sharded weights/activations; FSDP unshards before a layer)

REDUCE-SCATTER    every GPU contributes a full vector; each GPU ends with one SUMMED shard
                  inverse of all-gather; the two compose into a ring all-reduce
                  (FSDP gradient reduction; half of the ring all-reduce algorithm)

The key structural fact: all-reduce = reduce-scatter + all-gather. The standard ring algorithm runs a reduce-scatter (each GPU accumulates one shard of the sum) followed by an all-gather (each GPU collects the other shards). This is why all three appear together in real systems and why understanding the ring explains all of them.

Mini-FAQ. "Why not just have one GPU collect everything and broadcast the answer?" Because that GPU's single link becomes the bottleneck — every other GPU's data funnels through it, so total time scales with the number of participants. The ring algorithm spreads the traffic so each GPU only ever sends to its one neighbor, and total bandwidth used stays roughly constant as you add GPUs. That's the whole reason NCCL uses rings and trees instead of a naive gather-broadcast.

2) The picture — same collective, three wires¶

The mental model that lands this file is one all-reduce drawn over three interconnects, because the bytes are identical and only the wire changes.

                         ALL-REDUCE of a partial, four GPUs
                         (model fixes the bytes; topology fixes the speed)

  NVLINK / NVSWITCH (intra-node)          PCIe (intra-node, no NVLink)        INFINIBAND (inter-node)
  ──────────────────────────────         ────────────────────────────        ───────────────────────
   G0 ◀═══════▶ G1                         G0 ──┐                               Node A            Node B
   ║  ╲       ╱  ║   every pair             G1 ──┤                              [G0 G1 G2 G3] ⇄ [G4 G5 G6 G7]
   ║   ╲     ╱   ║   ~900 GB/s              G2 ──┼── CPU root ── G2,G3              ▲   400 Gb/s    ▲
   ║    ╲   ╱    ║   via NVSwitch           G3 ──┘   complex                       │  (~50 GB/s)   │
   G3 ◀═══════▶ G2                          (~64 GB/s, through CPU)             NVLink inside    NVLink inside
                                                                                each node        each node
   all pairs talk at full NVLink BW        all traffic funnels through CPU     fast inside node, 18× drop
   collective ≈ cheap                      collective ≈ expensive              the instant it crosses nodes

Three numbers anchor the whole file. NVLink (4th gen, H100): 900 GB/s per GPU, and an NVSwitch fabric lets all 8 GPUs in a DGX H100 talk to each other at that rate simultaneously. PCIe Gen5: ~64 GB/s, and it routes through the CPU root complex, so cards contend for it. InfiniBand (the inter-node fabric): 400 Gb/s ≈ 50 GB/s per link — an ~18× drop the instant a collective has to leave the node. The bytes the model moves are the same in all three. The time is not.

Teacher voice. There is a hard cliff at the edge of a node, just like the HBM cliff inside a GPU. Inside a DGX H100, eight GPUs talk at 900 GB/s over NVSwitch. The ninth GPU is in another box, and reaching it means dropping to ~50 GB/s over InfiniBand. The roofline had one cliff between L2 and HBM. Multi-GPU has another between NVLink and the network. The whole game of placement is keeping hot collectives on the fast side of every cliff.

3) Placing the 70B endpoint's collectives — the running example¶

Our endpoint serves Llama-3-70B tensor-parallel. The design question is now concrete: how many GPUs, and on which wire?

Tensor parallelism fires an all-reduce inside every layer — ~160 per forward step for an 80-layer model. That traffic is the hottest, most latency-sensitive communication in the system, and it sits on the critical path of every single token. The rule says: keep it on the fastest wire. So tensor-parallel size must stay inside one NVLink/NVSwitch domain — within the 8 GPUs of a single DGX H100 node, talking at 900 GB/s. The moment tensor parallelism spans two nodes, those 160 per-layer all-reduces cross InfiniBand at ~50 GB/s, and the endpoint's per-token latency collapses.

So the placement is: tensor-parallel the 70B across 4 (or 8) GPUs within one node, over NVLink. If we later need more replicas for throughput, we add whole replicas on other nodes and route requests across them — the cross-node traffic is then just load-balancing, not per-layer all-reduce. We never let the per-layer collective leave the NVLink domain.

This is the concrete answer to the naive PCIe failure above: the four GPUs must be NVLink-connected, not PCIe-attached, and NCCL must be told (or must detect) to use the NVLink fabric. Get that right and the per-layer all-reduce drops from ~1.5 ms to ~0.1 ms, and the endpoint climbs toward target throughput because the cards stop waiting on each other every layer.

4) Ring vs tree — why NCCL picks a different algorithm by size and topology¶

NCCL doesn't run one fixed algorithm. It detects the topology at init and selects per collective, and the two big families trade off differently.

Ring all-reduce¶

Each GPU sends only to its next neighbor, in 2(N-1) steps for N GPUs. The bandwidth used per GPU is independent of N — it's bandwidth-optimal. But latency grows linearly with N, because data has to walk all the way around the ring. Great for large messages where bandwidth dominates; poor for tiny messages where the per-step latency dominates.

Tree all-reduce¶

Data reduces up a tree and broadcasts back down in O(log N) steps. Latency-optimal for small messages at large scale, because the number of hops grows logarithmically, not linearly. NCCL added tree algorithms precisely because ring's linear latency hurt at thousand-GPU scale for small collectives.

On NVSwitch systems, NCCL can use NVLS (NVLink SHARP) algorithms that offload the reduction into the switch itself — the switch sums the data as it passes through, so GPUs send once and the fabric does the arithmetic. Recent NCCL versions extend SHARP across both NVLink and InfiniBand fabrics, and added new algorithms (PAT) for all-gather/reduce-scatter that use a logarithmic number of network transfers at scale where ring would stall.

Why ring over tree under our workload? For our single-node tensor-parallel 70B, messages are large (full activation partials) and N is small (4–8 GPUs on NVSwitch). Large messages + few GPUs is exactly where ring (or NVLS on the switch) wins: bandwidth dominates, ring is bandwidth-optimal, and NVSwitch gives every pair full bandwidth so the ring's linear-latency weakness barely shows. Tree's logarithmic latency only pays off at hundreds-to-thousands of GPUs with small messages — a training-at-scale regime, not our serving node. NCCL picks this for you; the judgment is knowing why, so you can verify it chose right.

5) The property that decides everything: where the collective's traffic lands¶

The one dimension that changes the multi-GPU design is which interconnect domain the hot collective traffic stays inside. Cross a domain boundary and the math inverts.

Tensor-parallel placement	Hot collective path	Effective BW	Per-layer all-reduce (100 MB partial)	Verdict
4 GPUs, NVLink/NVSwitch, 1 node	NVLink 900 GB/s	~900 GB/s	~0.1 ms	strong — sync tax hidden under compute
4 GPUs, PCIe-attached, 1 node	PCIe via CPU ~64 GB/s	~64 GB/s	~1.5 ms	weak — sync dominates, adding GPUs hurts
8 GPUs spanning 2 nodes	per-layer all-reduce over InfiniBand	~50 GB/s	~2 ms+	pathological — 160× per step over the network
4 GPUs NVLink + replicas on other nodes	per-layer on NVLink; cross-node only load-balances	~900 GB/s hot path	~0.1 ms	strong — scale-out without crossing the cliff

The surprise is the inversion in row three: more GPUs makes it worse, not better, because the added GPUs pushed the per-layer collective across the node boundary onto InfiniBand. The lesson NVIDIA's own guidance encodes: confine tensor-parallel size to the NVLink domain; scale beyond it with pipeline or data parallelism whose collectives are far less frequent and less latency-sensitive.

6) The failure walked through: the eight-GPU job slower than four¶

A team scales their tensor-parallel serving job from 4 GPUs to 8 to "double throughput." Throughput drops. They're baffled — twice the hardware, less output.

Trace it. The server had two groups of 4 GPUs, each group internally NVLink-connected, but the two groups connected to each other only over PCIe / a slower inter-group link (or the 8 GPUs spanned two nodes over InfiniBand). At tensor-parallel-4, every all-reduce stayed inside one NVLink group — fast. At tensor-parallel-8, every per-layer all-reduce now spanned both groups, crossing the slow link ~160 times per forward step. The collective time exploded and swamped the compute savings from splitting the matmuls further.

The fix wasn't more hardware — it was placement. Keep tensor-parallel at 4 inside one NVLink group, and run the second group of 4 as a separate replica serving its own requests. Same 8 GPUs, but now the hot per-layer collective never crosses the slow link; the cross-group traffic is only request routing. Throughput roughly doubled, as the hardware count promised. The lesson: in multi-GPU, topology-aware placement beats raw GPU count, and the wrong placement makes more GPUs actively harmful.

7) Cost movement: what tensor parallelism buys and what the collective costs¶

What it fixes: lets a model that doesn't fit on one GPU run at all, and splits each matmul's FLOPs across cards so per-GPU compute drops.
What it costs: a per-layer synchronization tax — an all-reduce on the critical path of every token, whose price is set by the interconnect. On NVLink it's cheap enough to hide; on PCIe or across nodes it dominates and can erase the compute savings.
Which subsystem pays: the interconnect fabric and the synchronization barrier. Every all-reduce is a point where all GPUs must arrive before any proceeds, so the slowest GPU and the slowest link gate the whole group. The reward (fitting the model, more aggregate compute) lands only if the collective stays on a fast wire.

For the running example: tensor-parallel-4 on NVLink hides the all-reduce under compute, so splitting the 70B is nearly free and the model fits. The same split on PCIe would make the sync tax the dominant cost — the difference between an endpoint that scales and one that doesn't, decided entirely by which wire the bytes take.

8) Signals: healthy, first to degrade, and the liar¶

Healthy: NCCL collective time is a small, stable fraction of each layer's wall-clock; the chosen algorithm matches topology (ring/NVLS on NVSwitch for large messages); nccl-tests all_reduce_perf reports busbw close to the link's peak (near 900 GB/s on NVLink, near 50 GB/s if you're correctly inter-node).
First metric to degrade: collective time per layer climbs as you add GPUs or as a placement change pushes traffic onto a slower link; per-token latency rises while per-GPU compute looks unchanged. A widening gap between "GPU active" and "useful compute" in the Nsight Systems timeline, with NCCL kernels filling the gap.
The misleading metric: GPU-Util again — NCCL kernels are kernels, so a GPU stuck in all-reduce reports high utilization while doing no useful math. And aggregate FLOPs look fine per GPU; the loss is between them.
The graph an expert opens first: the Nsight Systems timeline showing NCCL kernels and the gaps around them, plus nccl-tests busbw vs the theoretical link bandwidth. If measured busbw is a fraction of the link peak, the collective is on the wrong wire or the wrong algorithm. NCCL_DEBUG=INFO prints the topology and algorithm NCCL chose — the first thing to check when a collective is slow.

9) Boundary: where collectives are cheap and where they wreck you¶

Collectives are cheap and tensor parallelism is a clean win when the group stays inside a single NVLink/NVSwitch domain with large messages — exactly the DGX H100 serving a 70B at tensor-parallel-4-or-8. The fabric gives every pair full bandwidth, the ring/NVLS algorithm is bandwidth-optimal, and the sync tax hides under compute.

It becomes pathological the instant the hot collective crosses a domain boundary. Spanning tensor parallelism across nodes puts ~160 per-layer all-reduces on InfiniBand at ~50 GB/s; the endpoint's latency collapses and adding GPUs makes it worse. PCIe-only servers (no NVLink) are nearly as bad for per-layer collectives. The scale limit that invalidates naive intuition: "more GPUs = more throughput" is false for tensor parallelism past the NVLink domain — there, the right move is to stop growing the tensor-parallel group and instead add independent replicas or switch to a parallelism whose collectives are rarer (pipeline parallelism communicates once per pipeline stage, not once per layer).

10) Wrong model: "the interconnect is just plumbing"¶

The seductive wrong idea is that GPU-to-GPU links are interchangeable plumbing — bytes get there eventually, so topology is an ops detail, not a design parameter. For multi-GPU LLM serving it is the design parameter.

Replace it with: the interconnect is part of the compute substrate, and its bandwidth varies by more than an order of magnitude across NVLink (900 GB/s), PCIe (~64 GB/s), and InfiniBand (~50 GB/s). A collective fired ~160 times per forward step is on the critical path of every token; whether it runs on the 900 GB/s wire or the 50 GB/s wire decides whether multi-GPU helps or hurts. Topology is not plumbing; it is the second roofline.

11) Other failure shapes to recognize¶

PCIe-attached "GPU server." Cards present but not NVLink-connected; every per-layer all-reduce funnels through the CPU at ~64 GB/s. Fix: NVLink/NVSwitch hardware, or shrink the tensor-parallel group.
Tensor parallelism across nodes. Per-layer all-reduce on InfiniBand; latency collapse. Fix: confine TP to one node; scale out with pipeline/data parallelism or replicas.
NCCL choosing the wrong algorithm. Tree where ring would win, or vice versa, often after a topology change NCCL misdetected. Fix: check NCCL_DEBUG=INFO; tune NCCL_ALGO/NCCL_PROTO only as a last resort.
The straggler GPU. One slow card (thermal throttle, ECC errors) gates every all-reduce because all must arrive at the barrier. Fix: find and drain the straggler; the collective is only as fast as the slowest participant.
Mixing collective and compute streams badly. Failing to overlap communication with computation, so the GPU idles during all-reduce instead of computing the next chunk. Fix: overlap comms with compute where the algorithm allows.
Inter-node fabric misconfig. GPUDirect RDMA not enabled, so inter-node traffic bounces through host memory; effective bandwidth far below InfiniBand peak. Fix: enable GPUDirect RDMA; verify with nccl-tests.

12) Pattern transfer¶

Same cliff geometry as the roofline. The HBM cliff (file 01) sits between L2 and VRAM; the interconnect cliff sits between NVLink and the network. Both are "fast, scarce, near" vs "slow, plentiful, far," and the whole optimization game is keeping hot data on the near side. One physical pattern, two layers.
Same pressure as amortizing fixed cost. Ring all-reduce keeps per-GPU bandwidth constant as N grows by making each GPU talk only to its neighbor — the same "spread the cost so it doesn't concentrate" move as batching weight-reads (file 04) and fusing launches (file 02). Different layer, same amortization shape.
Same shape as distributed-systems quorum latency. A collective is a barrier: all participants must arrive before any proceeds, so the slowest gates the group — structurally identical to a quorum write waiting on the slowest replica. The straggler problem recurs wherever a group must synchronize.

13) Design test — five questions before you scale multi-GPU¶

Does your tensor-parallel group stay inside a single NVLink/NVSwitch domain, or does it cross PCIe or node boundaries?
Can you state which collective each parallelism strategy fires, and how often per forward step (TP: per layer; PP: per stage; DP: per step)?
Have you measured nccl-tests busbw and compared it to the theoretical link bandwidth — not just assumed the wire is fast?
When you add GPUs, are you growing the tensor-parallel group (more per-layer collective) or adding replicas (no new hot collective)? Which does your workload actually need?
If a collective is slow, have you checked NCCL_DEBUG=INFO for the detected topology and chosen algorithm before tuning anything?

Where this appears in production¶

The collectives and the library

NVIDIA NCCL — the standard collective library; auto-detects topology and selects ring/tree/NVLS per collective. Every multi-GPU PyTorch/Megatron job calls it under the hood.
Megatron-LM — uses NCCL all-reduce for tensor-parallel layers; an 80-layer 70B does ~160 all-reduces per forward step, the canonical heavy-collective workload.
PyTorch DDP / FSDP — DDP all-reduces gradients each step; FSDP all-gathers sharded weights before a layer and reduce-scatters gradients after — the all-reduce decomposition made explicit.
DeepSpeed ZeRO — partitions optimizer state/gradients/weights and reconstructs them with all-gather/reduce-scatter; collective cost is its central tuning knob.
NCCL SHARP / NVLS — offloads reduction into the NVSwitch or InfiniBand switch so the fabric does the arithmetic; used at thousand-GPU training scale.
nccl-tests (all_reduce_perf) — the busbw benchmark every cluster team runs to verify the interconnect delivers near peak before trusting a training run.

The interconnect hardware

NVLink (4th gen, H100) — 900 GB/s GPU-to-GPU; the fast wire that makes intra-node tensor parallelism cheap.
NVSwitch (DGX H100) — connects all 8 GPUs in a node at full NVLink bandwidth simultaneously, and scales to 256 GPUs in NVLink domains.
NVLink 5 / GB200 NVL72 — fifth-gen NVLink at 1,800 GB/s per GPU connects 72 Blackwell GPUs as one NVLink domain, pushing the "node boundary" out to a whole rack so larger tensor-parallel groups stay on the fast wire.
InfiniBand (400 Gb/s NDR) — the inter-node fabric; ~50 GB/s per link, an ~18× drop from NVLink — why per-layer collectives must not cross nodes.
PCIe Gen5 (~64 GB/s) — the fallback when there's no NVLink; routes through the CPU and strangles per-layer collectives.
GPUDirect RDMA — lets InfiniBand NICs read/write GPU memory directly, skipping host bounce; required to hit InfiniBand peak for inter-node collectives.
AWS EFA / GCP GPUDirect-TCPX — cloud inter-node fabrics that approximate InfiniBand for multi-node training when bare NVLink isn't available across nodes.
Meta / xAI training clusters — public notes describe topology-aware placement (TP within NVLink, DP/PP across nodes) as essential to hit MFU targets on tens of thousands of GPUs.

Pause and recall¶

Why does splitting a 70B model across GPUs create an all-reduce on the critical path of every token?
Name the three collective patterns and what each GPU ends up holding after each.
State the relationship between all-reduce, reduce-scatter, and all-gather.
Give the approximate bandwidths of NVLink, PCIe, and InfiniBand, and the ratio between NVLink and InfiniBand.
Why does ring all-reduce keep per-GPU bandwidth constant as you add GPUs, and when does tree win instead?
Why can adding GPUs to a tensor-parallel job make it slower?
State the rule for placing tensor-parallel groups relative to the NVLink domain.
A multi-GPU job shows high GPU-Util but poor per-token latency. What do you check first?

Interview Q&A¶

Q1. Your tensor-parallel 70B endpoint shows each GPU at 90% utilization but per-token latency is 3× what the per-GPU math predicts. Where do you look? A. High util with bad latency in a multi-GPU job is the collective signature: NCCL all-reduce kernels are resident (so util reads high) while the cards wait on each other. Open the Nsight Systems timeline and look for NCCL kernels and the gaps around them; run nccl-tests all_reduce_perf and compare busbw to the link peak. If busbw is far below NVLink's 900 GB/s, the per-layer all-reduce is on the wrong wire (PCIe or across nodes) — fix placement before anything else. Common wrong answer to avoid: "Utilization is high, so the GPUs are maxed — add more." NCCL kernels count as utilization; you'd add more participants and make every collective slower.

Q2. Why does scaling a tensor-parallel job from 4 to 8 GPUs sometimes reduce throughput? A. If the 8 GPUs don't all share one NVLink/NVSwitch domain, growing the tensor-parallel group pushes the ~per-layer all-reduce across a slower link (PCIe between groups or InfiniBand between nodes). The collective fires ~160 times per forward step, so its slowdown swamps the compute savings from splitting matmuls further. Keep TP inside the NVLink domain and add the extra GPUs as a separate replica instead. Common wrong answer to avoid: "More GPUs always means more throughput." For tensor parallelism past the NVLink domain, more GPUs adds collective traffic that can net-slow the job.

Q3. Why must tensor parallelism stay within a node but data/pipeline parallelism can span nodes? A. Tensor parallelism fires an all-reduce inside every layer — ~160 per forward step — all on the token's critical path, so it needs the 900 GB/s NVLink wire. Pipeline parallelism communicates only at stage boundaries (far rarer), and data parallelism all-reduces gradients once per step (training) or just load-balances (serving). The rarer, less latency-sensitive collectives tolerate the ~50 GB/s InfiniBand inter-node link; the per-layer one does not. Common wrong answer to avoid: "All parallelism is the same; just split however." The collective frequency and latency-sensitivity differ enormously by strategy, and that dictates placement.

Q4. Why does NCCL use a ring instead of having one GPU gather and broadcast? A. Gather-broadcast funnels every GPU's data through one GPU's single link, so total time scales with the number of participants — the collecting GPU's bandwidth is the bottleneck. The ring has each GPU send only to its neighbor, so per-GPU bandwidth stays constant as N grows; it's bandwidth-optimal. Trees trade some bandwidth for logarithmic latency, which wins only for small messages at large scale. Common wrong answer to avoid: "Gather to one GPU is simpler and fine." It serializes traffic through one link and scales badly with GPU count.

Q5. When is NVLink overkill and PCIe acceptable? A. When the GPUs run independent models (separate replicas, no per-layer collective) — e.g., eight 7B models each on its own card serving its own requests. There's no tensor-parallel all-reduce, so the interconnect only carries occasional data movement, and PCIe is fine. NVLink earns its cost specifically when GPUs must synchronize frequently, as in tensor-parallel serving of a model too big for one card. Common wrong answer to avoid: "Always buy NVLink." If your GPUs never run a shared, sharded model, the per-layer collective doesn't exist and NVLink is unused capital.

Q6. (Cumulative.) A multi-GPU 70B endpoint is slow. How do you tell whether it's a roofline (file 01) problem, a fusion (file 02) problem, or a collective (this file) problem? A. Open the Nsight Systems timeline. If time is inside one kernel with high DRAM-active and low Tensor-active, it's the memory roofline — fix with batching. If it's many tiny kernels with launch gaps or huge intermediate HBM writes, it's missing fusion. If it's NCCL kernels and barriers between the compute kernels, with busbw far below the link peak, it's the collective on the wrong wire. The timeline localizes the wall; each wall has a different file's fix. Common wrong answer to avoid: "Slow GPU means buy faster GPUs." Three different walls have three different fixes, and only one of them is hardware — and even that is usually the wrong one.

Design/debug exercise (10 min)¶

Step 1 — Model it. A tensor-parallel-4 layer must all-reduce a 100 MB activation partial. On NVLink (900 GB/s), the ring all-reduce moves roughly 2 × 100 MB of effective traffic per GPU; estimate ~0.1–0.2 ms. On PCIe (~64 GB/s through the CPU), the same all-reduce takes ~1.5 ms or more. Multiply each by ~160 per-layer collectives per forward step and write down the per-token communication time on each wire. The NVLink path adds ~tens of ms total hidden under compute; the PCIe path adds hundreds of ms on the critical path.

Step 2 — Your turn. For the 70B endpoint, you're offered two layouts: (a) tensor-parallel-8 spanning two DGX nodes over InfiniBand, or (b) tensor-parallel-4 on one node's NVLink, with a second TP-4 replica on the other node. Using the per-collective times from step 1 and the ~50 GB/s InfiniBand figure, argue which layout serves more tokens/sec and why the "more GPUs in one group" option (a) loses. Tie it to the rule: confine the per-layer collective to the NVLink domain.

Step 3 — Reproduce from memory. Redraw the "same collective, three wires" diagram from section 2, labeling NVLink (900 GB/s, NVSwitch full mesh), PCIe (~64 GB/s through CPU), and InfiniBand (~50 GB/s, node boundary). Then state in one sentence how this connects to file 01: the interconnect is a second roofline whose slope is link bandwidth, and the optimization is the same — keep hot traffic on the near, fast side of the cliff.

Operational memory¶

This chapter explained why a 70B model split across GPUs can run efficient per-GPU kernels yet serve tokens far slower than expected: every transformer layer fires an all-reduce to combine the partial results from each card, and that collective sits on the critical path of every token. The important idea is that the model fixes how many bytes the collective moves, but the interconnect topology fixes how fast they move — and NVLink, PCIe, and InfiniBand differ by more than an order of magnitude.

You learned to place the hot collective on the fast wire: confine tensor-parallel groups to a single NVLink/NVSwitch domain, recognize the three collective patterns (all-reduce, all-gather, reduce-scatter) and which strategy fires each, and verify the wire with nccl-tests busbw and NCCL_DEBUG=INFO. That solves the opening failure because it moves the ~160 per-layer all-reduces from a ~50–64 GB/s link onto the 900 GB/s NVLink fabric, where the sync tax hides under compute instead of dominating it.

Carry this diagnostic forward: when multi-GPU is slow, ask "which wire is the hot collective on?" before adding hardware. If you see high GPU-Util with bad per-token latency on a sharded model, suspect a collective on the wrong interconnect before blaming the model or buying cards.

Remember:

Tensor parallelism trades smaller per-GPU matmuls for a per-layer all-reduce on every token's critical path — a good trade only on a fast wire.
The model decides the collective's bytes; the topology decides its speed. NVLink 900 GB/s, PCIe ~64 GB/s, InfiniBand ~50 GB/s.
all-reduce = reduce-scatter + all-gather; ring is bandwidth-optimal, tree is latency-optimal at large scale with small messages.
Confine tensor-parallel groups to one NVLink/NVSwitch domain; scale beyond it with replicas or pipeline/data parallelism, never by stretching the per-layer collective across nodes.
More GPUs can make a tensor-parallel job slower if it pushes the collective across a domain boundary.
Next pressure: the model now fits and the cards talk fast — but we're still feeding them one request at a time, leaving the memory roofline's batching lever untouched. A compiler that fuses, batches, and pages the KV cache is the next layer.

Bridge. We made the model fit across GPUs and put the per-layer all-reduce on NVLink so the cards stop waiting on each other. But we're still serving badly: each forward pass handles too few requests, so the weight-reads the roofline told us to amortize are barely amortized, and the KV cache fragments memory until we can't fit a large batch anyway. The next file is the compiler that attacks all of this at once — TensorRT-LLM fuses the kernels (file 02's lever), keeps tensor-parallel collectives efficient (this file's lever), and adds the two big serving wins: in-flight batching to share weight-reads across requests, and paged KV cache to stop memory fragmentation from capping the batch. → 04-tensorrt-llm-compilation.md