08. Boundary and tradeoff review — where distributed-training intuition quietly lies

~22 min read. You can now make a 70B model fit, run fast, and survive a week. But every clean rule in this module has a frayed edge where the field still argues, where the theory ran out and people kept shipping anyway, and where the textbook answer is the wrong one in production. This file walks those edges — not to unteach the rules, but to mark where they stop being safe to trust.

Built on the whole module. We resolved the memory wall with sharding and model-splitting, paid the coordination cost along the interconnect hierarchy, traded compute for memory, and survived MTBF at scale. Each mechanism arrived as a clean answer to a clean pressure. This file is the opposite mood: it returns to the same mechanisms and asks where the clean answer frays — open problems, practices that work without theory, and live debates where reasonable senior engineers still disagree.


What the clean rules quietly assumed

Every prior file gave you a rule with a crisp justification. The 16-bytes-per-param arithmetic told you exactly what fits. Ring all-reduce costs about 2× gradient size per GPU (2(N−1)/N to be exact), nearly independent of N. ZeRO-3 trades memory for all-gather traffic. Tensor parallelism stays inside a node; pipeline parallelism crosses it. Mixed precision halves the activation footprint. Checkpoint interval is √(2 × write × MTBF). These rules are real and you should keep them.
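The first two rules are plain arithmetic, and a small sketch makes them concrete. It assumes the usual mixed-precision Adam breakdown behind the 16 bytes (2 for bf16 weights, 2 for gradients, 4 for the fp32 master copy, 8 for the two fp32 moments) and the 80 GB per-GPU budget used elsewhere in this file; the function names are illustrative, not taken from any framework.

```python
import math

BYTES_PER_PARAM = 16   # assumed: 2 weights + 2 grads + 4 master + 8 Adam moments
HBM_BYTES = 80e9       # 80 GB per GPU, the budget used in this file's example


def training_state_bytes(num_params: float) -> float:
    """Model-state bytes before any sharding (activations are not counted)."""
    return num_params * BYTES_PER_PARAM


def min_gpus_to_hold_state(num_params: float) -> int:
    """Fewest GPUs whose combined HBM could hold the fully sharded model state."""
    return math.ceil(training_state_bytes(num_params) / HBM_BYTES)


def ring_allreduce_bytes_per_gpu(payload_bytes: float, n_gpus: int) -> float:
    """Bytes each GPU sends in a textbook ring all-reduce: 2 * (N - 1) / N * payload."""
    return 2 * (n_gpus - 1) / n_gpus * payload_bytes


# 70B parameters, bf16 gradients (140 GB payload) on rings of different sizes.
print(training_state_bytes(70e9) / 1e9, "GB of model state")
print(min_gpus_to_hold_state(70e9), "GPUs minimum, ignoring activations")
for n in (8, 1024, 16000):
    print(n, "GPUs:", ring_allreduce_bytes_per_gpu(140e9, n) / 1e9, "GB sent per GPU")
```

The per-GPU volume converges to twice the payload as N grows, which is the sense in which the rule is "independent of N". Nothing in this arithmetic knows about switches, hierarchy, or contention, which is exactly the gap the rest of this file is about.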

But each one carried a silent assumption that holds at the scale where it was taught and bends at the scale where the field actually operates. "Ring all-reduce is N-independent per link" assumes the ring is a single clean loop; at 16,000 GPUs across hundreds of switches, the topology of the ring dominates and the simple formula stops predicting throughput. "Find the optimal parallelism config" assumes you can search the space, but the space is combinatorial and the search itself is an open research problem. "fp8 is just bf16 with fewer bits" assumes numerical stability you cannot actually guarantee, and whether fp8 pretraining is safe at frontier scale is contested right now, in production, with billions of dollars riding on the answer.

This file does not add a new mechanism. It does the harder thing: it takes the mechanisms you trust and marks exactly where trusting them costs you. The load-bearing rule for the whole file is one sentence.

Rule: at frontier scale, every clean tradeoff curve becomes a noisy search problem with no provably optimal answer

The neat curves of the earlier files — memory versus communication, compute versus precision, checkpoint cost versus failure cost — are local truths. Each holds when you vary one knob and freeze the rest. Real frontier runs vary all the knobs at once, on hardware that is heterogeneous and failing, under a wall-clock and dollar budget, with no closed-form optimum and no time to grid-search. The judgment this file builds is recognizing when you have left the regime where the formula is the answer and entered the regime where the formula is just a prior.

Why this rule exists. Each earlier mechanism solved one pressure in isolation, so its tradeoff could be written as a clean two-variable curve. Compose six mechanisms across thousands of failing GPUs and the curves couple: changing the parallelism degree moves the memory budget, which moves the recompute decision, which moves the arithmetic intensity, which moves whether fp8 is stable, which moves the checkpoint size. The composition has no analytic minimum. The naive approach — optimize each axis independently — fails because the axes are not independent, and the failure is invisible until you measure end-to-end MFU and find it 30% below what each local optimum predicted.


1) The parallelism-config search problem — how the field actually picks DP×TP×PP×ZeRO

Across the module you learned each parallelism axis in isolation and the rule for placing it: tensor parallelism inside a node, pipeline parallelism across nodes, data parallelism on the outside, ZeRO to free memory. So picking a config sounds like following rules. It is not. It is a search over a combinatorial space with a noisy, expensive objective, and there is no agreed optimal method.

Consider the 70B run on 1,024 H100s. You must choose a tensor-parallel degree (1, 2, 4, 8), a pipeline-parallel degree (1, 2, 4, 8, 16…), a ZeRO stage (0, 1, 2, 3), a micro-batch size, a number of pipeline micro-batches, an activation-recompute policy (none, selective, full), and a precision (bf16, fp8 on which ops). The product of valid combinations is in the thousands, and the objective — end-to-end MFU — costs a multi-hundred-GPU-hour run to measure for each point.

   THE CONFIG SPACE (illustrative, 1024 GPUs)
   ─────────────────────────────────────────
   TP      ∈ {1, 2, 4, 8}            (must stay intra-node → ≤ 8)
   PP      ∈ {1, 2, 4, 8, 16}        (crosses nodes; sets the bubble)
   ZeRO    ∈ {0, 1, 2, 3}            (frees memory, adds all-gather)
   recompute ∈ {none, selective, full}
   micro-batch, num-micro-batches, fp8 op-set ...
   ───────────────────────────────────────────────────────────
   constraint: TP × PP × DP = 1024,  and per-GPU memory ≤ 80 GB
   objective:  maximize end-to-end MFU  ← each measurement = $$$ + hours

So the real problem is not "what is the best config"; the real problem is that the objective is expensive to evaluate, the space is combinatorial, and the per-axis rules give you a starting region, not an answer. How does anyone pick a config without burning the budget on the search itself?

Three approaches coexist, none dominant. Analytic cost models (Megatron's, DeepSpeed's, Calculon, and similar) estimate MFU from FLOPs, bytes moved per axis, and interconnect bandwidth — fast to evaluate, but they mispredict because they cannot model kernel-level overlap, real NCCL behavior on a busy fabric, or stragglers. Empirical sweeps run a handful of candidate configs for a few hundred steps each and pick the best measured — accurate but expensive, and what wins at 1,000 steps can lose at 100,000 as the memory profile shifts. Auto-tuners and learned search (some frameworks search the space with heuristics or ML) promise to automate it but are not yet trusted for frontier runs where one bad config wastes a fortune. The honest state of practice: a senior engineer uses the cost model to prune to a few candidates, sweeps those empirically, and accepts that the chosen config is good, not provably optimal.
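The "prune with a cost model" step can be sketched as an enumeration plus a memory filter. This is a toy, not any framework's planner: it assumes 2 bytes each for weights and gradients and 12 bytes of optimizer state per parameter, reserves a made-up 40% of HBM for activations and buffers, and ignores framework compatibility rules between ZeRO stages and pipeline parallelism. All names are hypothetical.

```python
from itertools import product

GPUS = 1024
PARAMS = 70e9
HBM_BYTES = 80e9
ACTIVATION_RESERVE = 0.4   # assumed fraction of HBM kept free for activations


def model_state_bytes_per_gpu(tp: int, pp: int, dp: int, zero: int) -> float:
    """Toy per-GPU model-state footprint for one (TP, PP, DP, ZeRO) choice."""
    shard = PARAMS / (tp * pp)                       # params on one model-parallel rank
    params = 2 * shard / (dp if zero >= 3 else 1)    # ZeRO-3 shards weights over DP
    grads = 2 * shard / (dp if zero >= 2 else 1)     # ZeRO-2 shards gradients
    optim = 12 * shard / (dp if zero >= 1 else 1)    # ZeRO-1 shards optimizer state
    return params + grads + optim


def candidate_configs():
    """Enumerate configs with TP * PP * DP == GPUS that pass the memory filter."""
    survivors = []
    for tp, pp, zero in product([1, 2, 4, 8], [1, 2, 4, 8, 16], [0, 1, 2, 3]):
        if GPUS % (tp * pp):
            continue
        dp = GPUS // (tp * pp)
        mem = model_state_bytes_per_gpu(tp, pp, dp, zero)
        if mem <= HBM_BYTES * (1 - ACTIVATION_RESERVE):
            survivors.append((tp, pp, dp, zero, mem))
    return survivors


for tp, pp, dp, zero, mem in sorted(candidate_configs(), key=lambda c: c[-1])[:5]:
    print(f"TP={tp} PP={pp} DP={dp} ZeRO-{zero}: {mem / 1e9:.1f} GB of model state")
```

The filter only says what fits. Ranking the survivors by throughput is the part that needs either a real cost model or a measured sweep, and that is where the budget goes.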

Teacher voice. When an interviewer asks "how do you choose the parallelism config," the wrong answer is a formula. The right answer names the method: prune with a cost model, sweep the survivors empirically, validate at scale, and revisit when hardware or model shape changes. The config is a search result under a budget, not a derivation. Anyone who claims a single right answer hasn't run the search.

2) The mental model — a frontier run lives in a foggy valley, not on a clean curve

Picture the earlier files as a clean U-shaped curve: vary one knob, watch the cost dip to a minimum, pick the bottom. The boundary reality is different and worth holding as one ASCII picture for the whole chapter.

   THE FOGGY VALLEY (composed config landscape, end-to-end MFU)
   ════════════════════════════════════════════════════════════

   MFU
    │   .  cost-model says optimum is HERE  (clean curve's bottom)
    │  . .         ╳                          ← but you can't see clearly
    │ .   .   ┌────┴────┐                         (fog = measurement noise,
    │.     . .│ real best│  ●  measured points        stragglers, fp8 drift)
    │       .─┘ is OVER  └─.    × × ×
    │          HERE, shifted   × ●  ×    ← each ● = a 100s-step run, noisy
    │          by stragglers,    ×
    │          fp8 drift, fabric
    │__________contention_______________________→  config space (1000s of points)

   You optimize by sampling a fog-bound landscape under a budget.
   The analytic minimum is a PRIOR, not the answer. The valley moves
   as the run lengthens, hardware degrades, and precision drifts.

The cost model points at the clean curve's bottom. The true optimum sits somewhere nearby, shifted by effects the model cannot see: a few stragglers, fabric contention from other jobs, fp8 numerics that force you off the aggressive config, a memory profile that grows over 100,000 steps. You sample the fog with a handful of expensive runs and settle for a good basin, not a proven point. Hold this picture — every debate below is a different patch of fog.

3) The threaded example — the 70B run, re-examined where each rule frays

Take the 70B run that survived file 07 and walk it back through the module, this time stopping at every place the clean rule has a contested edge. This is the running example for the rest of the file.

  • File 02 said ring all-reduce is N-independent per link. At 1,024 GPUs the ring is no longer one loop on one fabric; it is a hierarchy of intra-node NVLink rings feeding inter-node InfiniBand, and the tree and hierarchical collectives NCCL actually uses behave differently from the textbook ring. The formula predicts the shape, not the number.
  • File 03 said ZeRO-3 frees the most memory. It also adds the most communication, and at frontier scale teams often prefer ZeRO-1 plus tensor parallelism over ZeRO-3, because ZeRO-3's all-gather traffic competes with everything else on the wire. The "most memory saved" rule loses to "least wire contention."
  • File 04 said the pipeline bubble shrinks with more micro-batches. It never reaches zero, and whether pipeline parallelism is worth its bubble versus more tensor parallelism plus ZeRO is workload-dependent and argued.
  • File 06 said mixed precision is a clean precision-for-memory trade. fp8 is not clean. Whether you can pretrain a frontier model end-to-end in fp8 without quality loss is an open, contested question right now.
  • File 07 said checkpoint interval is √(2 × write × MTBF). That formula assumes you know MTBF and that failures are independent. At 16,000 GPUs failures correlate (a rack loses power, a switch flaps a whole pod), and correlated failures break the independence the formula rests on.

Each bullet is a place where you stop deriving and start judging. The sections below take the three sharpest.
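The File 04 bullet is the easiest one to check numerically. A minimal sketch, assuming the standard idle-fraction estimate for a GPipe or 1F1B-style schedule, (p − 1) / (m + p − 1) for p stages and m micro-batches; interleaved schedules change the constant, and the function name is illustrative.

```python
def bubble_fraction(pp_stages: int, micro_batches: int) -> float:
    """Idle fraction of a simple pipeline schedule: (p - 1) / (m + p - 1)."""
    return (pp_stages - 1) / (micro_batches + pp_stages - 1)


# The bubble shrinks as micro-batches grow, but only reaches zero when p == 1.
for m in (8, 32, 128, 1024):
    print(f"PP=8, micro-batches={m}: bubble = {bubble_fraction(8, m):.3f}")
```

More micro-batches cost activation memory and shrink the per-micro-batch matmuls, so the bubble is traded against other axes rather than eliminated.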

4) Why teams pick ZeRO-1 + TP over ZeRO-3, against the "save the most memory" rule

The plausible textbook move: you are tight on memory, so reach for ZeRO-3, which shards parameters, gradients, and optimizer state and frees the most per GPU. File 03 taught exactly this ladder. Under the frontier workload, many teams do the opposite — ZeRO-1 (optimizer state only) combined with tensor parallelism — and the reason is the same coordination cost that has haunted the whole module.

ZeRO-3 must all-gather the full parameters of each layer just before it computes that layer, every forward and every backward, then discard them. That is a parameter-sized communication on the critical path of every layer. At 70B across 1,024 GPUs, the all-gather traffic is large and it lands on the same InfiniBand the data-parallel all-reduce needs.

   ZeRO-3 vs (ZeRO-1 + TP), 70B, frontier scale
   ─────────────────────────────────────────────
   ZeRO-3:    + frees the most memory (params sharded too)
              − all-gather full params per layer, every fwd/bwd
              − that traffic crosses inter-node fabric, competes with all-reduce
              → great on fast uniform NVLink; degrades when sharded across nodes

   ZeRO-1+TP: + TP splits the layer compute & memory inside a node (NVLink)
              + ZeRO-1 only shards optimizer state (smaller, less frequent comm)
              − TP limited to ≤ 8 (intra-node), so caps how far it scales alone
              → less inter-node contention; the frontier default for dense LLMs

So the real problem is not "which stage frees the most memory"; the real problem is which option puts the least traffic on the slowest wire while still fitting. ZeRO-3 wins the memory question and can lose the throughput question, because the memory it frees is paid for in inter-node all-gather that contends with everything else. Match the choice to the fabric: on a small fast-NVLink pod ZeRO-3 can be excellent; spread across InfiniBand-connected nodes, ZeRO-1 + TP (+ PP) frequently beats it. This is a genuine production reversal of the textbook ladder, and the deciding variable is interconnect, not memory.

Mini-FAQ. "So is ZeRO-3 wrong?" No — it is right when the shard group lives on fast uniform interconnect (a single NVLink-connected node or NVSwitch pod), where the all-gather is cheap. It becomes the wrong default when the shard group is spread across slow inter-node links, because then its defining operation runs on the wire you most want to protect. The rule isn't "ZeRO-3 bad"; it's "ZeRO-3's cost is communication, so place it where communication is cheap."
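A rough way to see why the fabric decides this: compare the data-parallel traffic each GPU generates per step. The sketch below assumes the commonly quoted volume factors (about 2× the local parameter count for plain data parallelism and ZeRO-1/2, about 3× for ZeRO-3, which adds parameter all-gathers) and bf16 payloads. It deliberately ignores tensor parallelism's own intra-node traffic and any overlap with compute, and the function name is hypothetical.

```python
BYTES_BF16 = 2


def dp_comm_bytes_per_gpu(params: float, zero_stage: int, tp: int = 1, pp: int = 1) -> float:
    """Approximate data-parallel bytes moved per GPU per step.

    Assumed volume factors: 2x local params for ZeRO-0/1/2, 3x for ZeRO-3.
    """
    local_params = params / (tp * pp)          # parameters owned by one model-parallel rank
    volume_factor = 3 if zero_stage == 3 else 2
    return volume_factor * local_params * BYTES_BF16


zero3 = dp_comm_bytes_per_gpu(70e9, zero_stage=3)
zero1_tp8 = dp_comm_bytes_per_gpu(70e9, zero_stage=1, tp=8)
print(f"ZeRO-3, no model parallelism: {zero3 / 1e9:.0f} GB per GPU per step")
print(f"ZeRO-1 + TP=8:                {zero1_tp8 / 1e9:.0f} GB per GPU per step")
```

Under these assumptions the ZeRO-1 + TP layout puts roughly an order of magnitude less data-parallel traffic on the inter-node wire, at the price of tensor-parallel traffic that stays on NVLink. Whether that trade wins end to end is still something to measure.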

5) The fp8 stability debate — the dimension where theory ran out and the field shipped anyway

File 06 taught mixed precision as a clean trade: store activations in bf16, halve the footprint, keep an fp32 master copy, scale the loss to keep gradients in range. bf16 pretraining is settled — the field trusts it. fp8 is where the clean story breaks, and the breakage is not academic: it is the single most contested numerical question in frontier training right now, because fp8 roughly doubles tensor-core throughput on Hopper/Blackwell and the prize for making it work is enormous.

fp8 has two formats: E4M3 (more mantissa, less range, used for forward activations and weights) and E5M2 (more range, less mantissa, used for gradients). With only 2–3 mantissa bits, the representable values are coarse, and the dynamic range is narrow enough that a layer's activations can drift out of the representable window mid-training. The main defense is per-tensor scaling (each tensor gets its own scale factor, updated from running amax statistics), as implemented in NVIDIA's Transformer Engine. It works: for inference and for fine-tuning, fp8 is broadly accepted.
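To make the range numbers concrete, here is a small sketch that derives the largest finite value of each format from its bit layout and computes a delayed per-tensor scale from an amax history. It assumes the E4M3 variant without infinities (only the all-ones pattern at the top exponent is NaN) and an IEEE-like E5M2. The helper names are illustrative and are not Transformer Engine's API.

```python
def fp8_max(exp_bits: int, man_bits: int, ieee_like: bool) -> float:
    """Largest finite value of a small float format."""
    bias = 2 ** (exp_bits - 1) - 1
    if ieee_like:
        # Top exponent code is reserved for inf/NaN (assumed for E5M2).
        max_exp = (2 ** exp_bits - 2) - bias
        max_frac = 2 - 2 ** -man_bits
    else:
        # Only the all-ones mantissa at the top exponent is NaN (assumed for E4M3).
        max_exp = (2 ** exp_bits - 1) - bias
        max_frac = 2 - 2 ** -(man_bits - 1)
    return max_frac * 2 ** max_exp


def delayed_scale(amax_history, fmt_max: float, margin_bits: int = 0) -> float:
    """Scale that maps the largest recently seen |x| onto the format ceiling."""
    return fmt_max / max(amax_history) / 2 ** margin_bits


E4M3_MAX = fp8_max(4, 3, ieee_like=False)
E5M2_MAX = fp8_max(5, 2, ieee_like=True)
print("E4M3 max:", E4M3_MAX, " E5M2 max:", E5M2_MAX)
print("scale for amax history [1, 2, 4]:", delayed_scale([1.0, 2.0, 4.0], E4M3_MAX))
```

The scale is computed from past statistics, so it is only as good as the assumption that the next step looks like the last few. That lag is the weak point the skeptics point at.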

The contested question is end-to-end fp8 pretraining of a frontier model. Here the two sides are both serious.

The case for fp8 pretraining

Argues: with per-tensor delayed scaling, careful op selection (keep the sensitive ops — layernorm, softmax, the master weights — in higher precision), and amax monitoring, fp8 pretraining matches bf16 loss curves while nearly doubling throughput. Vendors and several large labs report production fp8 pretraining at scale with no measurable quality loss. The throughput is real and the money is decisive.

The case against (or "not yet")

Argues: fp8 pretraining is fragile in ways that surface late. Loss can track bf16 for tens of thousands of steps and then diverge as a few layers' dynamics push activations out of range; the scaling statistics lag the true distribution; and the failure is expensive precisely because it appears far into a run. Skeptics keep the forward and backward matmuls in fp8 but the optimizer and accumulation in higher precision, or restrict fp8 to the most compute-heavy layers, and treat full-network fp8 pretraining as not-yet-proven for the longest frontier runs.

   PRECISION TRUST GRADIENT (where the field actually stands)
   ──────────────────────────────────────────────────────────
   fp32        ▓▓▓▓▓▓▓▓  fully trusted, rarely needed end-to-end now
   bf16        ▓▓▓▓▓▓▓░  the settled default for pretraining
   fp8 (infer) ▓▓▓▓▓▓░░  broadly accepted for inference & fine-tune
   fp8 (pretrain, selective ops) ▓▓▓▓░░░░  growing, with guardrails
   fp8 (pretrain, end-to-end full) ▓▓░░░░░░  CONTESTED — works for some, fragile for others

So the honest answer to "should we pretrain in fp8" is not yes or no; it is "with per-tensor scaling and sensitive ops kept higher, fp8 buys ~2× throughput, and the risk is a late-training divergence you must monitor amax and gradient-norm to catch early." This is the chapter's clearest case of a practice that works empirically while the theory of why and when it stays stable is incomplete.

Teacher voice. fp8 is the place to be honest in an interview. "fp8 doubles throughput and it's free" is the answer of someone who hasn't run a frontier pretrain. "fp8 is unstable, avoid it" is the answer of someone who hasn't read the recent results. The senior answer names the mechanism (per-tensor scaling, sensitive ops higher), the prize (~2× tensor-core throughput), and the risk (late divergence the loss hides until it's expensive), and treats it as a monitored bet, not a settled default.

6) One failure walked through — the fp8 run that diverged at step 60,000

Thread the 70B example into the fp8 debate concretely. The team enables end-to-end fp8 to capture the ~2× throughput. For 55,000 steps the loss tracks the bf16 baseline within noise. MFU is excellent. Everyone is happy. Then around step 60,000 the loss starts climbing, slowly at first, then sharply, and within 2,000 steps the run is clearly diverging.

   FP8 LATE DIVERGENCE (the expensive failure)
   ────────────────────────────────────────────
   loss
    │\
    │ \____ tracks bf16 baseline, steps 0–55k  (looks perfect)
    │      \________________
    │                       \___ step ~58k: gradient-norm spikes appear
    │                            \
    │                             \___ step ~60k: loss climbs
    │                                  \
    │                                   \____ step ~62k: clear divergence
    │__________________________________________________→ step
   root cause: a few layers' activation amax drifted past E4M3 range;
   delayed scaling lagged the shift; those layers' matmuls saturated/clipped;
   bad gradients compounded over thousands of steps until the loss broke.

The post-mortem finds it: a handful of deep layers developed activation outliers whose magnitude crept up over training. The delayed (running-statistics) scaling factor lagged the real distribution, so for those layers the fp8 quantization began clipping, the matmuls returned subtly wrong results, and the error compounded. The loss, a single scalar that averages over the behavior of thousands of tensors, hid it until the corruption was gross enough to break through. This is the silent-corruption shape from file 07, now sourced not from a cosmic ray but from a numerical format the field chose on purpose. The fix in practice: keep those layers in bf16, switch to finer-grained or more responsive scaling, and monitor per-tensor amax and gradient norm as first-class signals, not just the loss.
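The lag mechanism is simple enough to reproduce in a toy. The sketch below assumes the scale at each step is derived from the maximum of the previous few amax values (never the current one) and flags steps where the tensor's true amax exceeds what that scale can represent. The series, window, and margin are made up for illustration; real recipes differ in detail.

```python
def clipped_steps(amax_series, window: int = 3, margin_bits: int = 0):
    """Return step indices where the true amax exceeds the delayed scale's range.

    The representable ceiling at step t comes from the max of the previous
    `window` amax values, widened by 2**margin_bits of headroom.
    """
    clipped = []
    for t in range(1, len(amax_series)):
        history = amax_series[max(0, t - window):t]
        representable = max(history) * 2 ** margin_bits
        if amax_series[t] > representable:
            clipped.append(t)
    return clipped


# Stable activations for five steps, then outliers growing 1.5x per step.
flat_then_drift = [1.0] * 5 + [1.5 ** k for k in range(1, 6)]
print("no margin:   ", clipped_steps(flat_then_drift, margin_bits=0))
print("1 bit margin:", clipped_steps(flat_then_drift, margin_bits=1))
```

With no headroom, every step of the drift clips because the scale is always one step behind. One bit of margin absorbs growth slower than 2× per window but costs a bit of already scarce precision, which is why headroom is something to monitor per tensor rather than set once.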

The cost movement is brutal and asymmetric: fp8 saved ~45% wall-clock for 55,000 steps, then lost the entire run if there was no recent-enough checkpoint to roll back to a pre-divergence step, plus the engineering time to diagnose. The subsystem that pays for fp8's throughput win is observability: you must instrument numerics you never had to watch in bf16.

7) The checkpoint formula meets correlated failure — where √(2 × write × MTBF) bends

File 07's checkpoint-interval rule is genuinely useful, and it rests on an assumption worth naming: failures are independent and arrive at a roughly constant rate (a Poisson process), so MTBF is a stable number you plug in. At a few hundred GPUs that is close enough. At 16,000 GPUs across many racks and switches, it bends in two ways.

First, failures correlate. A rack loses power and 64 GPUs die at once. A spine switch flaps and a whole pod's inter-node links drop together. A bad batch of cables, a cooling event, a kernel bug triggered by a specific input — these take out many ranks in one correlated event, not one independent rank at a time. Correlated failures mean the number of incidents is lower than independence predicts but each is far more destructive, and the checkpoint cadence that's optimal for independent single-rank deaths can be wrong for bursty correlated ones.

Second, MTBF is non-stationary. Hardware fails more in the first days (infant mortality) and again as it ages; a run on freshly racked GPUs sees a different failure rate at week one than week four. Plugging one MTBF number into the formula assumes a flat rate it does not have.

   FAILURE-RATE REALITY at 16k GPUs
   ──────────────────────────────────
   textbook:   independent, constant-rate → MTBF stable → √(2·w·MTBF) exact
   reality:    correlated bursts (rack/switch/cooling) + non-stationary rate
               → fewer-but-bigger incidents, rate drifts over the run
   practice:   checkpoint MORE often than the formula's independent optimum,
               keep MULTIPLE generations (roll back past a correlated event or
               a silently-corrupt save), and treat the formula's interval as a ceiling.

So the production practice contradicts the clean derivation in a specific way: teams checkpoint more frequently than the independent-failure optimum suggests and keep several checkpoint generations, because the real risk is a correlated event or a corrupt save that the single-MTBF formula does not price. The formula gives an upper bound on the interval, not the answer. This is the file-07 rule re-entering under the harder constraint of correlation, an orphan-mechanism check the module passes.
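A minimal sketch of the arithmetic, under the textbook assumptions the section just questioned: independent failures, so cluster MTBF is per-GPU MTBF divided by GPU count, and a constant rate. The per-GPU MTBF, the write time, and the safety factor below are made-up illustrative values, not measurements.

```python
import math


def cluster_mtbf_hours(per_gpu_mtbf_hours: float, n_gpus: int) -> float:
    """Cluster MTBF if failures are independent with a constant rate."""
    return per_gpu_mtbf_hours / n_gpus


def young_interval_seconds(write_seconds: float, mtbf_seconds: float) -> float:
    """Textbook checkpoint interval: sqrt(2 * write * MTBF)."""
    return math.sqrt(2 * write_seconds * mtbf_seconds)


def production_interval_seconds(write_seconds: float, mtbf_seconds: float,
                                safety_factor: float = 0.5) -> float:
    """Shorter interval used in practice; safety_factor is an assumed knob."""
    return safety_factor * young_interval_seconds(write_seconds, mtbf_seconds)


mtbf_s = cluster_mtbf_hours(50_000, 16_000) * 3600   # hypothetical 50k-hour GPUs
print(f"cluster MTBF:       {mtbf_s / 3600:.2f} hours")
print(f"formula interval:   {young_interval_seconds(60, mtbf_s) / 60:.1f} minutes")
print(f"practice (0.5x):    {production_interval_seconds(60, mtbf_s) / 60:.1f} minutes")
```

The safety factor is where judgment enters: it stands in for correlation and drift that the formula cannot see, and it should be revisited as the observed failure pattern of the actual cluster accumulates.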

8) Operational signals — telling a fog-bound run from a healthy one

Healthy behavior. Measured MFU lands within a few points of the cost model's prediction for the chosen config; per-tensor amax (if fp8) sits comfortably inside the format range with margin; gradient norm is steady; checkpoint generations accumulate and a rollback target always exists; the config chosen by the sweep still wins when re-checked at 10× the step count.

First metric to degrade. The gap between predicted and measured MFU. When the cost model says 50% and you measure 38%, something the model can't see — stragglers, fabric contention, an exposed collective, a suboptimal config — is eating the difference, and that gap is the first thing a frontier team chases. For fp8 specifically, the first tell of trouble is rising per-tensor amax creeping toward the format ceiling, before the loss moves.

The misleading metric. The loss curve, again — it tracks the baseline through an fp8 numerics drift, through a straggler, through a suboptimal-but-stable config that's leaving 15% MFU on the table. A clean loss says the optimization is proceeding, not that the run is efficient or that the numerics are safe. The second misleading metric is peak MFU on a short run: a config that peaks high at 1,000 steps can fall behind at 100,000 as the memory profile and numerics shift.

The graph an expert opens first. A predicted-vs-measured MFU panel per config, alongside (for fp8) a per-tensor amax-headroom heatmap across layers, and the checkpoint-generation/rollback-availability view. The diagnostic skill is correlating an MFU gap to a cause: exposed collective → topology/config; rising amax → fp8 layer; widening step-time tail → straggler (file 07); growing memory → recompute/ZeRO config.
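The predicted-versus-measured panel rests on one small calculation. A sketch, assuming the common approximation of 6 FLOPs per parameter per token for a dense transformer (attention FLOPs ignored) and a user-supplied peak FLOP rate; the 989e12 figure for H100 dense bf16 and the throughput number are assumptions to replace with your own hardware's spec and your own measurement.

```python
def mfu(params: float, tokens_per_sec: float, n_gpus: int,
        peak_flops_per_gpu: float) -> float:
    """Model FLOPs utilization using the 6 * params * tokens approximation."""
    achieved = 6 * params * tokens_per_sec
    return achieved / (n_gpus * peak_flops_per_gpu)


def mfu_gap_points(predicted: float, measured: float) -> float:
    """Gap between cost-model and measured MFU, in percentage points."""
    return (predicted - measured) * 100


H100_BF16_PEAK = 989e12   # assumed dense bf16 peak per GPU

measured = mfu(params=70e9, tokens_per_sec=900_000, n_gpus=1024,
               peak_flops_per_gpu=H100_BF16_PEAK)
print(f"measured MFU: {measured:.1%}")
print(f"gap vs a 50% prediction: {mfu_gap_points(0.50, measured):.1f} points")
```

Tracking the gap per config, rather than MFU alone, is what lets you separate "this config is slow" from "the model cannot see something on this cluster".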

9) Boundary of applicability — where these debates even matter

Where the contested edges bite hardest. Frontier-scale dense LLM pretraining — thousands to tens of thousands of GPUs, runs of weeks to months, budgets where 10% MFU is millions of dollars and one diverged run is a catastrophe. Here every debate in this file is a live, expensive decision: fp8 versus bf16, ZeRO-3 versus ZeRO-1+TP, checkpoint cadence against correlated failure.

Where the debates evaporate. Small and mid-scale training (a few nodes, hours to a day). At that scale the clean rules are simply correct: ring all-reduce is N-independent enough, the cost model is close, bf16 is fine, independent-failure checkpointing is right, and ZeRO-3 on a single NVLink pod is great. Bringing frontier-scale paranoia (fp8 monitoring, correlated-failure checkpointing, config sweeps) to a 16-GPU fine-tune is wasted engineering. The boundary is scale: the rules are exact at small scale and contested at large scale.

The scale limit on intuition. The deepest one in the module: "there is an optimal configuration I can derive." True in the small, where one knob varies on a clean curve. False at frontier scale, where the knobs couple, the objective is noisy and expensive, the hardware is heterogeneous and failing, and the optimum drifts as the run lengthens. At scale you do not derive the answer; you search a fog under a budget and defend a good basin.

10) The wrong model to carry, and the right one

The seductive-but-wrong intuition: "distributed training is a solved engineering discipline — pick the right config from the rules and execute." It reads as competence. It is the model of someone who has run clean mid-scale jobs and never sat through a frontier run where the cost model lied by 15%, fp8 diverged at step 60,000, and a switch flap took out a pod between checkpoints. The rules are real, but at the frontier they are priors over a contested, drifting landscape, not a closed-form solution.

The right model: distributed training at scale is an empirical search under physical and budget constraints, where clean tradeoff curves are local truths that couple and drift when composed. You carry the rules as starting points, instrument heavily because the loss hides the failures that matter, sweep and re-check because the optimum moves, and treat fp8, ZeRO choice, and checkpoint cadence as monitored bets rather than settled answers. The senior signal is not knowing the rules — it is knowing exactly where each rule stops being safe to trust.

11) Other edges where intuition frays

  • Large-batch generalization wall — scaling data parallelism grows the global batch, but past a workload-dependent size, convergence per token degrades; "more GPUs = faster" hits a statistical ceiling, not just a communication one, and the critical batch size is itself an open empirical question.
  • MoE changes every rule — mixture-of-experts adds expert parallelism and all-to-all routing traffic with load-imbalance across experts; the dense-model placement rules (TP intra-node, etc.) only partly transfer, and the all-to-all is a new dominant collective.
  • Cost model vs reality gap — analytic MFU predictions routinely miss by 10–20% because they can't model kernel overlap, real NCCL behavior on a contended fabric, or stragglers; trusting the model's number over a measured one is a classic frontier mistake.
  • Activation recompute granularity — "recompute everything" (file 06) is rarely optimal; selective recompute (only the cheap-to-recompute, memory-heavy ops) usually wins, and choosing which ops is unsettled and workload-specific.
  • Optimizer-state precision — whether the Adam moments can live in bf16 or fp8 instead of fp32 (cutting the 8 bytes/param of moments) is actively explored and not settled; it directly attacks the 16-bytes-per-param wall but risks convergence.
  • Asynchronous and decoupled methods resurface — for some regimes (RLHF rollouts, certain MoE setups) fully synchronous training is being questioned again, reopening the file-02 sync-vs-async debate that looked closed for dense pretraining.
  • Network topology beyond the simple ring — at 16k GPUs the fat-tree/rail-optimized fabric, in-network reduction (SHARP), and NCCL's tree/hierarchical algorithms make the file-02 ring formula a rough guide, not a predictor.
  • Checkpoint storage bandwidth wall — at 16k GPUs even async sharded checkpoints can saturate the storage fabric; the bottleneck moves from compute to the checkpoint write path, an edge file 07's formula assumes away.
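The optimizer-state bullet above is easy to quantify. A sketch, assuming the usual breakdown of the 16 bytes (bf16 weights and gradients, fp32 master copy, two Adam moments) and changing only the storage width of the moments; whether training still converges at those widths is the open question, which this arithmetic says nothing about.

```python
def bytes_per_param(moment_bytes: int) -> int:
    """Model-state bytes per parameter when each Adam moment uses moment_bytes."""
    weights, grads, master = 2, 2, 4   # assumed bf16 weights/grads, fp32 master copy
    return weights + grads + master + 2 * moment_bytes


for name, width in (("fp32", 4), ("bf16", 2), ("fp8", 1)):
    total = bytes_per_param(width)
    print(f"{name} moments: {total} bytes/param, {total * 70e9 / 1e9:.0f} GB for 70B")
```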

12) Where this fits the larger systems map

  • Constraint echo — local optima don't compose. The lesson that six clean per-axis curves have no clean joint optimum is the same shape as query-plan optimization in databases (locally optimal joins, globally suboptimal plan) and as microservice tuning (each service optimized, the system slow). Coupled local optima with an expensive global objective is a recurring systems geometry, not a training quirk.
  • Failure geometry — the silent failure, again. fp8 late divergence is the file-07 silent-corruption shape with a new source: the format you chose, not the cosmic ray you didn't. Both hide behind the loss; both need instrumentation outside the primary metric. Same geometry, different layer.
  • Same pressure, harder constraint — MTBF aggregation under correlation. File 07's independent-failure MTBF math re-enters here under correlated, non-stationary failure — the same reliability pressure that forces RAID to worry about correlated disk-batch failures and storage systems to spread replicas across failure domains.
  • Empirical-over-theoretical pattern. "It works in production before the theory explains why" recurs across the curriculum — learned indexes, certain caching heuristics, and now fp8 pretraining. The senior move is the same: ship with guardrails and instrumentation, not with a proof.

Where this appears in production

  • Meta Llama 3 / Llama 3.1 training — published infrastructure notes describe config selection by cost-model-plus-sweep, bf16 pretraining (not full fp8) at the 16k-GPU scale, and correlated/frequent hardware failures driving aggressive checkpointing — the canonical record that frontier practice diverges from clean rules.
  • NVIDIA Transformer Engine (fp8) — the per-tensor delayed-scaling implementation at the center of the fp8 debate; ships the E4M3/E5M2 formats and amax tracking teams monitor for drift.
  • DeepSeek-V3 fp8 training — a widely-cited public example of large-scale fp8 pretraining with fine-grained scaling, a key data point for the "fp8 pretraining works" side.
  • Megatron-LM config heuristics — encodes the TP-intra-node / PP-across-nodes placement rules as defaults, and exposes the knobs whose joint optimum is the search problem of §1.
  • DeepSpeed ZeRO stages 1/2/3 — the ladder whose textbook "stage 3 frees most memory" advice production frequently overrides with ZeRO-1 + TP for fabric reasons (§4).
  • Calculon / analytic LLM-training cost models — the predict-MFU-before-running tools whose 10–20% gap from reality is the §8 first-degrading signal.
  • NCCL tree / hierarchical / SHARP collectives — the real algorithms that make the file-02 ring formula a guide rather than a predictor at scale.
  • PyTorch selective activation checkpointing — the "recompute only some ops" mechanism that beats the textbook "recompute everything," with op selection left to the engineer.
  • OPT-175B / BLOOM training logbooks — public records of how often frontier runs diverge, stall, and get rolled back, evidence that correlated failure and rollback-to-a-good-generation are the real checkpointing problem.
  • Mosaic/Databricks & Hugging Face training stacks — expose parallelism, precision, and checkpoint knobs as config, implicitly handing users the §1 search problem.
  • MoE training (Mixtral, DeepSeek-MoE, GShard-lineage) — where expert parallelism and all-to-all routing rewrite the dense-model placement rules (§11).
  • NVIDIA DCGM / amax & gradient-norm dashboards — the instrumentation frontier teams watch instead of the loss to catch fp8 drift and stragglers early.
  • bf16 as the conservative default — most production pretraining still chooses bf16 over fp8 for the longest runs, the clearest sign the fp8 debate is unsettled.
  • Critical-batch-size studies — empirical work mapping where large-batch convergence degrades, the open question behind the §11 generalization wall.
  • Spare-node pools + multi-generation checkpoint retention — standard frontier infrastructure that exists precisely because the clean MTBF formula under-prices correlated failure.

Pause and recall

  1. Why does picking a parallelism config become a search problem rather than a derivation at frontier scale?
  2. Name the three approaches to choosing a config and one weakness of each.
  3. Under what fabric condition does ZeRO-1 + TP beat ZeRO-3, and why — despite ZeRO-3 freeing more memory?
  4. State both sides of the fp8-pretraining debate fairly.
  5. Walk the fp8 late-divergence failure: what drifts, why the loss hides it, and which signal catches it first.
  6. What two assumptions in √(2 × write × MTBF) break at 16,000 GPUs, and how does production practice respond?
  7. Which single intuition from small-scale training is the most dangerous to carry to frontier scale, and what replaces it?
  8. Why is peak MFU on a short run a misleading metric for config selection?

Interview Q&A

Q1. How do you choose a parallelism configuration for a new model on a new cluster? A. It's a constrained search, not a formula. Start from the placement rules (TP intra-node ≤ 8, PP across nodes, DP outermost, ZeRO to fit memory) to prune to a few candidates, estimate each with an analytic cost model, then sweep the survivors empirically for a few hundred steps and measure end-to-end MFU. Pick the best measured, validate it still wins at a larger step count, and revisit when the model shape or hardware changes. The config is a good basin found under a budget, not a provable optimum. Common wrong answer to avoid: "Use this formula / always TP=8, PP=… " — there's no closed-form optimum; anyone offering one hasn't run the sweep.
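
The pruning step in that answer can be sketched as a small enumeration. This is a minimal sketch under stated assumptions, not any framework's API: the function name `candidate_configs`, the power-of-two TP degrees, and the `max_pp` cap are all hypothetical. It only applies the placement rules quoted above and returns candidates to hand to a cost model and an empirical sweep; it does not rank them.

```python
def candidate_configs(n_gpus, gpus_per_node=8, max_pp=16):
    """Enumerate (tp, pp, dp) triples with tp * pp * dp == n_gpus.

    Pruning rules (max_pp is an assumed cap, not a rule from the text):
      - tp is a power of two that divides gpus_per_node, so TP stays intra-node
      - pp is capped at max_pp to bound the pipeline bubble
      - dp takes whatever factor is left over (outermost axis)
    """
    configs = []
    tp = 1
    while tp <= gpus_per_node:
        if gpus_per_node % tp == 0 and n_gpus % tp == 0:
            rest = n_gpus // tp
            for pp in range(1, max_pp + 1):
                if rest % pp == 0:
                    configs.append((tp, pp, rest // pp))
        tp *= 2
    return configs


configs = candidate_configs(1024)
print(len(configs), "candidates, e.g.", configs[:3])
```

Even with these aggressive pruning rules the list is not a single answer, which is the point: the rules shrink the space, and measurement picks within it.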

Q2. Your cost model predicts 50% MFU; you measure 38%. Where do you look? A. The gap is something the model can't see. Open a profiler timeline for exposed collectives (topology/config issue), a per-rank step-time heatmap for a straggler (file 07), the memory profile for a recompute/ZeRO misconfig, and check fabric contention from co-located jobs. The predicted-vs-measured MFU gap is the primary frontier diagnostic; the cost model is a prior, and a 10–20% miss is normal and worth chasing. Common wrong answer to avoid: "Trust the cost model; the measurement is noise" — the measured number is the truth; the model is the approximation.
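
The predicted-versus-measured comparison can be made concrete with a few lines of arithmetic. This is a rough sketch: it assumes the common estimate of about 6 FLOPs per parameter per token for a dense transformer (which ignores attention FLOPs), and the function names `mfu` and `mfu_gap` are hypothetical. The peak FLOPs figure must come from your own hardware spec.

```python
def mfu(params, tokens_per_sec, n_gpus, peak_flops_per_gpu):
    """Model FLOPs utilization, assuming ~6 * params FLOPs per trained token."""
    achieved_flops = 6.0 * params * tokens_per_sec
    return achieved_flops / (n_gpus * peak_flops_per_gpu)


def mfu_gap(predicted, measured):
    """Return (absolute gap in MFU points, gap as a fraction of the prediction)."""
    absolute = predicted - measured
    return absolute, absolute / predicted


# The Q2 numbers: cost model says 50% MFU, the run measures 38%.
abs_gap, rel_gap = mfu_gap(0.50, 0.38)
print(f"gap: {abs_gap:.2f} MFU points, {rel_gap:.0%} of predicted")
```

Reporting both the absolute and relative gap avoids ambiguity when someone quotes "a 12% miss" without saying which one they mean.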

Q3. Should you pretrain a frontier model in fp8? A. It's a monitored bet, not a yes/no. fp8 with per-tensor (delayed) scaling and sensitive ops kept in higher precision can roughly double tensor-core throughput, and several labs report it matching bf16. The risk is a late-training divergence: a few layers' activations drift past the E4M3 range, delayed scaling lags, matmuls clip, and the loss hides it until it's expensive. So: enable it with sensitive ops kept in higher precision, monitor per-tensor amax and gradient norm as first-class signals, and keep frequent checkpoints to roll back. Many teams still pick bf16 for the longest runs. Name the prize and the risk; don't pick a side blindly. Common wrong answer to avoid: "fp8 is free 2× / fp8 is unstable, never use it." Both ignore that it works with guardrails and fails late without them.

Q4. A teammate reaches for ZeRO-3 because it frees the most memory. When do you push back? A. When the shard group is spread across slow inter-node links. ZeRO-3 all-gathers full parameters per layer every forward and backward, and that traffic crosses the inter-node fabric and contends with the all-reduce. On a single fast-NVLink pod it's great; spread across InfiniBand nodes, ZeRO-1 (optimizer state only) plus tensor parallelism often beats it because it puts far less traffic on the slow wire. The deciding variable is interconnect, not memory freed. Common wrong answer to avoid: "ZeRO-3 is always best because it frees the most memory" — the memory it frees is paid for in inter-node communication.

Q5. Your fp8 run tracked bf16 perfectly for 55,000 steps, then diverged. What happened and how could you have caught it earlier? A. A few deep layers developed activation outliers that crept up over training; the delayed (running-statistics) fp8 scale lagged the shift, so those layers' matmuls began clipping in E4M3, returning subtly wrong results that compounded until the loss broke. The loss hid it because the optimizer averages over thousands of tensors. Catch it earlier by monitoring per-tensor amax headroom (it rises toward the ceiling before the loss moves) and gradient norm, keeping those layers in bf16, and using more responsive scaling. It's the file-07 silent-corruption shape sourced from a chosen numerical format. Common wrong answer to avoid: "The loss would have spiked immediately if fp8 were the problem" — subtle clipping compounds silently; the loss breaks late.
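
The amax-headroom signal described in that answer can be illustrated with a toy monitor. This is a simplified sketch, not the behavior of any particular fp8 library: the class `AmaxHeadroomMonitor`, its `margin` and `warn_at` parameters, and the amax trace are all hypothetical. The only hard fact used is that the largest finite magnitude in E4M3 is 448. The scale is computed from previous steps only, which is what lets it lag a fast-growing outlier.

```python
from collections import deque

E4M3_MAX = 448.0  # largest finite magnitude representable in fp8 E4M3


class AmaxHeadroomMonitor:
    """Toy per-tensor delayed-scaling tracker (illustrative only)."""

    def __init__(self, history_len=16, margin=2.0, warn_at=0.9):
        self.history = deque(maxlen=history_len)  # amax values from past steps
        self.margin = margin      # assumed safety margin built into the scale
        self.warn_at = warn_at    # fraction of the E4M3 range that triggers a warning

    def step(self, amax):
        """Return (fraction of E4M3 range used, status) for this step's amax."""
        # The scale uses history only; the current amax is not yet known to it.
        ref = max(self.history) if self.history else amax
        scale = E4M3_MAX / (ref * self.margin)
        used = amax * scale / E4M3_MAX
        self.history.append(amax)
        if used > 1.0:
            return used, "clip"
        if used >= self.warn_at:
            return used, "warn"
        return used, "ok"


monitor = AmaxHeadroomMonitor()
amax_trace = [10.0] * 5 + [12.0, 23.0, 50.0]  # made-up outlier growth
statuses = [monitor.step(a)[1] for a in amax_trace]
print(statuses)
```

In this made-up trace the warning fires one step before the tensor clips, while an aggregate loss would show nothing at that point. A real dashboard would track this per tensor and alert on the trend, not a single step.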

Q6. The √(2 × write × MTBF) checkpoint formula gave you a 20-minute interval. Why might you checkpoint more often at 16,000 GPUs? A. The formula assumes independent, constant-rate failures. At 16k GPUs failures correlate (a rack, a switch, a cooling event takes out many ranks at once) and the rate is non-stationary (infant mortality, aging). Correlated bursts and the risk of saving a silently corrupt checkpoint mean you checkpoint more often than the independent optimum and keep multiple generations to roll back past a bad event or a corrupt save. The formula is a floor, not the answer. Common wrong answer to avoid: "The formula is exact, just plug in MTBF" — independence and a flat rate both break at scale.
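
For reference, the formula itself is a one-liner. The inputs below are assumed values chosen so the result matches the 20-minute interval in the question, and `safety_factor` is a hypothetical knob standing in for "checkpoint more often because failures correlate"; the formula gives no way to derive it.

```python
import math


def checkpoint_interval(write_seconds, mtbf_seconds):
    """Interval from the module's formula: sqrt(2 * write time * MTBF)."""
    return math.sqrt(2.0 * write_seconds * mtbf_seconds)


# Assumed inputs: a 60 s checkpoint write and a 12,000 s (200 min) cluster MTBF.
independent = checkpoint_interval(60, 12_000)

# Hypothetical shrink factor for correlated, non-stationary failure.
safety_factor = 0.5
adjusted = independent * safety_factor

print(f"independent optimum: {independent / 60:.0f} min")
print(f"with safety factor:  {adjusted / 60:.0f} min")
```

The adjusted number is a policy choice informed by observed failure bursts, which is why it sits outside the formula.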

Q7. (Cumulative.) End-to-end MFU is 15% below predicted, the loss is clean, nothing has crashed, and you're running fp8 with ZeRO-3 across 8 nodes. Walk the diagnosis across the module. A. Clean loss rules nothing out — it hides stragglers and fp8 drift. Check three module layers in order. First (file 07): per-rank step-time heatmap for a straggler dragging the barrier. Second (this file, §4): ZeRO-3's all-gather crossing the 8-node InfiniBand fabric and contending with the all-reduce — try ZeRO-1 + TP intra-node to cut inter-node traffic. Third (§5): even if MFU recovers, watch fp8 amax headroom, because a stable-but-drifting fp8 config can be both slower-than-ideal and heading toward a late divergence. The 15% gap is most likely the ZeRO-3 inter-node contention; the fp8 monitoring is insurance against a worse failure later. Common wrong answer to avoid: "Loss is clean, so the run is healthy, accept the 15%" — at scale the loss is not a sufficient health signal, and 15% MFU is enormous money.
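
The first check in that walk, the per-rank step-time comparison, reduces to a median test. This is a minimal sketch: `find_stragglers` and its `tolerance` are hypothetical, and the timings are invented. It assumes each rank logs its own compute time before the collective, since time measured after the barrier looks identical on every rank.

```python
from statistics import median


def find_stragglers(step_times_by_rank, tolerance=1.10):
    """Flag ranks whose median step time exceeds the fleet median by `tolerance`x."""
    per_rank = {rank: median(times) for rank, times in step_times_by_rank.items()}
    fleet_median = median(per_rank.values())
    return sorted(r for r, t in per_rank.items() if t > tolerance * fleet_median)


# Invented pre-collective compute times in seconds, three steps per rank.
timings = {
    0: [1.00, 1.01, 0.99],
    1: [1.02, 1.00, 1.01],
    2: [1.31, 1.29, 1.30],  # a slow rank that would drag the barrier
    3: [0.99, 1.00, 1.00],
}
print(find_stragglers(timings))
```

Medians are used on both levels so that one noisy step or one slow rank does not shift the baseline it is compared against.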

Design/debug exercise (10 min)

Step 1 — modeled example. Rank three candidate configs for the 70B run on 1,024 H100s (8 GPUs/node, NVLink intra-node, InfiniBand inter-node) by reasoning, not running:

   A) TP=8, PP=1, ZeRO-3, DP=128          → TP fills the node; ZeRO-3 all-gathers
                                             cross 128-way DP over IB → heavy IB contention
   B) TP=8, PP=2, ZeRO-1, DP=64           → TP intra-node, PP across 2 nodes,
                                             ZeRO-1 shards only optimizer state → light IB → likely best
   C) TP=1, PP=1, ZeRO-3, DP=1024         → no model split; ZeRO-3 all-gathers every
                                             layer's full params over 1024-way IB → IB saturated, worst
   verdict: B. TP saturates fast NVLink, ZeRO-1 keeps inter-node traffic small,
            PP=2 adds a small bubble but avoids ZeRO-3's per-layer all-gather on IB.
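
A quick arithmetic check of the three candidates, as a sketch. It assumes the module's 16 bytes per parameter splits as 2 (bf16 weights) + 2 (bf16 gradients) + 12 (fp32 optimizer state), which is the usual mixed-precision Adam breakdown but is an assumption here. It counts model state only and ignores activations, buffers, and fragmentation; `state_gb_per_gpu` is a hypothetical helper.

```python
GB = 1e9
PARAMS = 70e9
# Assumed split of 16 bytes/param: bf16 weights, bf16 grads, fp32 Adam state.
WEIGHT_B, GRAD_B, OPTIM_B = 2, 2, 12


def state_gb_per_gpu(tp, pp, dp, zero_stage):
    """Model-state memory per GPU in GB (no activations, no buffers)."""
    local_params = PARAMS / (tp * pp)   # TP and PP split the model itself
    weights = local_params * WEIGHT_B
    grads = local_params * GRAD_B
    optim = local_params * OPTIM_B
    if zero_stage >= 1:
        optim /= dp                     # ZeRO-1 shards optimizer state over DP
    if zero_stage >= 2:
        grads /= dp                     # ZeRO-2 also shards gradients
    if zero_stage >= 3:
        weights /= dp                   # ZeRO-3 also shards parameters
    return (weights + grads + optim) / GB


candidates = {"A": (8, 1, 128, 3), "B": (8, 2, 64, 1), "C": (1, 1, 1024, 3)}
for name, (tp, pp, dp, stage) in candidates.items():
    assert tp * pp * dp == 1024         # every config must use all 1,024 GPUs
    print(f"{name}: {state_gb_per_gpu(tp, pp, dp, stage):.1f} GB of model state per GPU")
```

Under these assumptions A and C need about 1.1 GB of model state per GPU and B about 18.3 GB, so all three leave room on an 80 GB H100 before activations. Memory does not separate them; what crosses InfiniBand does, which is why the verdict turns on interconnect.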

Step 2 — your turn. Continue the 70B example. (a) The team wants fp8 to claw back throughput on config B — list the two signals you'd add to the dashboard before enabling it and the one config guardrail you'd set. (b) At 4,096 GPUs the aggregate failure pattern shifts from "one rank every few hours" to "a 64-GPU rack every ~6 hours" — how does that change your checkpoint cadence and retention versus the file-07 formula's answer? (c) Your cost model says config B should hit 48% MFU but you measure 39% — name the three things you'd inspect, in order.

Step 3 — reproduce from memory. Without looking, redraw the foggy-valley picture (cost-model optimum vs the shifted, fog-bound true optimum) and write the one-line truth of this file: at frontier scale every clean tradeoff curve becomes a noisy, drifting search problem with no provably optimal answer — the rules are priors, instrumentation is the answer. Then connect to file 06 in one sentence: mixed precision looked like a clean compute-for-memory trade, but fp8 turns it into a contested numerical bet whose failure hides behind the loss exactly like file 07's silent corruption.

Operational memory

This chapter took the clean rules of the module — the parallelism placement rules, ZeRO's memory ladder, mixed precision, the checkpoint formula — and walked them to the edge where they stop being safe to trust. The important idea is that those rules are local truths: each is exact when you vary one knob on a clean curve, and each frays when composed across thousands of failing, heterogeneous GPUs under a budget. At frontier scale you don't derive the optimal config; you search a foggy, drifting landscape and defend a good basin with heavy instrumentation, because the failures that matter — fp8 drift, stragglers, suboptimal configs, correlated hardware death — all hide behind a clean loss curve.

You learned to treat config selection as cost-model-prune-then-empirical-sweep rather than a formula; to override the "ZeRO-3 frees the most memory" rule with ZeRO-1 + TP when the fabric is slow; to hold both sides of the fp8-pretraining debate and run it as a monitored bet on amax and gradient norm; and to treat the checkpoint formula's interval as an upper bound, because correlated, non-stationary failure pushes you to checkpoint more often than it suggests. That resolves the gap between the module's clean lessons and what frontier runs actually do: the rules are where you start, not where you stop.

Carry this diagnostic forward: when measured MFU sits well below the cost model's prediction, chase the gap (stragglers, fabric contention, ZeRO traffic, fp8 config) rather than trusting the model; when a run looks perfect on the loss, remember the loss is blind to the failures most likely to ruin a frontier run; and when someone states a single right answer for fp8, ZeRO stage, or checkpoint cadence, ask what scale and fabric they're assuming, because the answer flips with both.

Remember:

  • At frontier scale every clean tradeoff curve couples and drifts — no provably optimal config; you search a fog under a budget.
  • Config selection is prune with a cost model, sweep empirically, re-check at scale — not a formula.
  • ZeRO-1 + TP can beat ZeRO-3 when the shard group spans slow inter-node links; the deciding variable is interconnect, not memory freed.
  • fp8 pretraining is contested: ~2× throughput with per-tensor scaling and sensitive ops higher, but a late divergence the loss hides — monitor amax and gradient norm, keep frequent checkpoints.
  • The checkpoint formula is a floor: correlated, non-stationary failure at 16k GPUs pushes you to checkpoint more often and keep multiple generations.
  • The most dangerous small-scale intuition is "I can derive the optimum" — replace it with "I search and instrument, because the clean rule is a prior."

Bridge. We can split the model across GPUs, place each axis on the right wire, trade compute for memory, survive the failures, and we now know exactly where each of those rules stops being safe to trust at scale. But the entire module assumed one thing it never examined: that once a GPU is handed its shard of work, it runs that work fast. The foggy-valley search optimizes how GPUs are arranged and how they talk, yet a single GPU at 35% MFU is wasting nearly two-thirds of the hardware we fought to coordinate, and no parallelism config rescues a kernel that doesn't saturate the tensor cores. The next module opens one GPU and asks the question this module kept assuming away: why does a kernel run slow? It builds the roofline model (memory bandwidth versus compute, arithmetic intensity, kernel-launch overhead, and fusion) that decides whether your expensive H100 is computing or just waiting on memory. → ../09_gpu_acceleration_stack/00-first-principles.md