Skip to content

07. Fault tolerance and checkpointing at scale — surviving the week when something is always broken

~20 min read. The 70B run trains: model split, topology placed, activations recomputed, high MFU. Now scale it to a thousand GPUs for a week. Forty hours in, GPU 487 throws an Xid error and dies. Because the all-reduce is synchronous, all 999 other GPUs freeze waiting for it. If your last checkpoint was eight hours ago, you just lost eight hours × 1,000 GPUs of compute. At this scale the question stops being "how fast does it train" and becomes "how does it survive the failures that are guaranteed to happen."

Built on every prior file. We made a 70B model fit and run fast across 64 GPUs with data, tensor, and pipeline parallelism, 3D placement, and compute-for-memory trades. All of that is single-run, single-moment correctness. This file confronts the module's last pressure — MTBF at scale — where a synchronous all-reduce (file 02) turns one dead rank into a frozen cluster, and survival depends on checkpointing, restart, straggler detection, and elastic recovery rather than raw speed.


What "the run trains" quietly assumed

Every file so far optimized a single training step and assumed it would simply repeat. Fit the model, place the collectives, recompute the activations — and then run the loop a few hundred thousand times. That assumption holds beautifully on 64 GPUs for an afternoon. It falls apart on 1,000 GPUs for a week, because the loop's correctness depended on every GPU being alive and in lockstep every single step, and at that scale they are not.

The arithmetic is unforgiving. Suppose one H100 has a mean time between failures of, say, 50,000 hours — extremely reliable for one card. Put 1,000 of them in a run and the aggregate failure rate is 1,000× higher: expect a failure roughly every 50 hours, and that's just the GPUs. Add NICs, cables, power, host kernels, and cosmic-ray bit flips, and frontier runs see a hardware incident every few hours. The published large-model training logs are full of them. So the design question inverts: not "how do I make a step faster" but "when (not if) a rank dies mid-step, how much work do I lose, and how fast do I recover?"

This file answers that. It covers why synchronous training is so fragile to a single failure, how checkpointing bounds the loss, how to choose checkpoint frequency from the failure rate, how stragglers silently rob throughput, how elastic training recovers without a full restart, and the nastiest failure of all — silent corruption that doesn't crash anything.

What this file solves

A synchronous distributed run freezes the moment any one of its thousands of ranks dies, and without checkpoints every failure restarts from step zero. This file shows how to make a week-long thousand-GPU run survivable: checkpoint the sharded model and optimizer state frequently enough that a failure costs minutes not hours, detect stragglers before a sick-but-alive GPU drags every step, recover via elastic training (torchrun rendezvous) without rescheduling the whole job, and guard against silent data corruption — the bit flip that corrupts the model while the loss keeps falling. The concrete move: pick a checkpoint interval from MTBF, write asynchronous distributed checkpoints, and wire up a rendezvous backend that re-forms the group when a node drops.

Why one dead rank freezes a thousand

The fragility traces straight back to file 02. Data-parallel training is synchronous: every step, all ranks compute gradients and then all-reduce them, and no rank can proceed to optimizer.step() until the all-reduce completes — which requires every participant to send its chunk around the ring. The all-reduce is a barrier. One rank that never sends its chunk blocks the collective forever.

   SYNCHRONOUS ALL-REDUCE as a barrier
   ────────────────────────────────────
   GPU0 ─send→ GPU1 ─send→ ... ─send→ GPU487 ✗ (dead) ─send→ ... ─send→ GPU999
                                       └─ never sends its chunk
   → the ring stalls. Every other GPU blocks in NCCL, waiting.
   → without a timeout, the whole 1,000-GPU job hangs indefinitely.

So a single Xid (NVIDIA's GPU error code), a flapped InfiniBand link, or a host OOM-killer reaping the training process takes one rank out — and the synchronous barrier converts that one local failure into a global freeze. NCCL has a timeout (NCCL_TIMEOUT), so eventually the collective errors out instead of hanging forever, but the default behavior is then to crash the whole job. Without recovery machinery, the run dies and restarts from its last checkpoint.

So the real problem is not "GPUs fail" — they always will. The real problem is that synchronous training has no tolerance for partial failure: the barrier makes every rank's progress depend on every other rank's liveness. How do we bound the damage of a failure that is guaranteed to happen, and recover from it in minutes instead of restarting from zero?

The first answer: checkpoint, and bound the loss

The damage from a crash is "everything since the last save." So save often. A checkpoint is a snapshot of everything needed to resume: the sharded model parameters, the sharded optimizer state (the fp32 moments and master weights from file 01), the learning-rate schedule position, the data loader's position, and the RNG state. Write it to durable storage; on restart, every rank loads its shard and resumes from that step.

The frequency is a direct tradeoff, and you can compute it. Checkpoint too rarely and a failure loses hours of work across thousands of GPUs. Checkpoint too often and the checkpoint writes themselves eat throughput — a 70B model's full state is over a terabyte, and writing it stalls training unless overlapped.

   CHECKPOINT FREQUENCY tradeoff
   ──────────────────────────────
   too rare:  failure → lose hours × 1000 GPUs of compute (expensive)
   too often: every save stalls the step to write >1 TB (also expensive)

   wasted time per failure ≈ checkpoint_interval / 2   (lose half an interval on average)
   overhead from saving    ≈ write_time / checkpoint_interval
   ────────────────────────────────────────────────────────────
   optimal interval balances the two: roughly  √(2 × write_time × MTBF)

That square-root rule (Young's / Daly's formula for optimal checkpoint interval) says: the more often you fail (low MTBF) or the cheaper a save is (fast write), the more frequently you should checkpoint. For a thousand-GPU run failing every ~3 hours with a checkpoint that takes ~1 minute to write asynchronously, checkpointing every ~20–30 minutes is in the right range — losing at most ~15 minutes per failure instead of hours.

Teacher voice. Checkpoint frequency is not a vibe; it's arithmetic on two costs. The cost of failure (lose half an interval) falls as you checkpoint more; the cost of saving (write overhead) rises. The optimum is roughly the square root of their product. When someone asks "how often should we checkpoint," the answer is "what's your MTBF and your write time," not "every N steps because that feels safe."

Making the save cheap: asynchronous, sharded, distributed checkpoints

Writing 1 TB synchronously would stall every checkpoint for minutes. Two ideas make it cheap. First, sharded checkpointing: each rank writes only its shard of the parameters and optimizer state (which under ZeRO/FSDP it already holds), in parallel, to many storage targets at once — turning one 1 TB serial write into 1,000 parallel ~1 GB writes. Second, asynchronous checkpointing: snapshot the tensors to host RAM quickly (a fast GPU→CPU copy), then let a background thread write to durable storage while training continues. The step stalls only for the GPU→CPU copy, not the disk write.

   ASYNC SHARDED CHECKPOINT (PyTorch DCP-style)
   ─────────────────────────────────────────────
   step N:  [compute] [quick GPU→CPU snapshot of this rank's shard] [resume step N+1]
                                    └─ background thread writes shard to storage
                                       (overlapped with steps N+1, N+2, ...)
   each of 1000 ranks writes ~1/1000 of the state, in parallel
   → step stall ≈ snapshot copy time (seconds), not full write time (minutes)

For the 70B run, PyTorch's Distributed Checkpoint (DCP) writes each FSDP rank's shard in parallel and supports async save, so the per-checkpoint training stall drops from minutes to seconds — which is what makes a 20-minute interval affordable.

The second answer: detect the straggler before it robs you

Not every failure is a clean death. A GPU can be alive but slow — thermally throttled, a degraded NVLink lane, ECC errors forcing retries. Because the all-reduce is a barrier, the slowest rank sets the pace for all of them (head-of-line blocking, the same shape as the slow pipeline stage in file 04). One GPU running 20% slow makes the entire thousand-GPU run 20% slow, and nothing crashes — the loss keeps falling, just at lower throughput, and the waste hides in plain sight.

Straggler detection watches per-rank step times and flags a rank whose performance falls below a threshold. PyTorch and NeMo ship straggler detection that has run on Llama-class training at the 1,000-GPU scale: it identifies the slow rank, and the run can drain it, swap in a spare, or terminate so the job is rescheduled off the bad hardware — without a noticeable throughput hit when nothing is wrong.

   STRAGGLER (alive but slow) vs DEAD RANK
   ────────────────────────────────────────
   dead rank:     all-reduce hangs → NCCL timeout → job crashes → restart from ckpt
   straggler:     all-reduce completes, but at the SLOW rank's pace
                  → no crash, no error, throughput silently down 10–30%
                  → detect via per-rank step-time outlier, then evict

Mini-FAQ. "If nothing crashes, why is a straggler urgent?" Because the cost is continuous and invisible. A dead rank is loud — the job stops, you notice, you restart. A straggler is silent — every step is a bit slower, MFU sits 20% below target for days, and the bill is enormous before anyone looks. The dead rank costs you one restart; the straggler costs you a fraction of every step until found.

The third answer: elastic recovery without a full restart

When a rank does die, the crude recovery is to crash the whole job and have the scheduler relaunch all 1,000 processes — minutes of teardown, queue wait, and re-init. Elastic training does better. PyTorch's torchrun puts an elastic agent on each node, and the agents coordinate through a rendezvous backend (the c10d backend is preferred on HPC clusters — it runs on the training nodes with no external dependency). When a node drops, the surviving agents detect it, tear down the process group, re-form a new group with the remaining (or replacement) nodes, and every rank reloads from the last checkpoint and resumes.

   ELASTIC RECOVERY (torchrun + rendezvous)
   ─────────────────────────────────────────
   node fails → elastic agents notice via rendezvous → barrier + re-form group
            → all ranks reload last checkpoint → resume training
   vs. crude restart: no full job reschedule, no cold queue wait
   requires: idempotent init, checkpoint-detect-and-resume logic, a spare-node pool

The training script must be written to cooperate: idempotent initialization, the ability to detect an existing checkpoint and resume rather than overwrite, and ideally a pool of spare nodes so a failed node is replaced rather than the world size shrinking. With elastic recovery and a 20-minute checkpoint interval, a single GPU death costs minutes — re-form, reload, resume — instead of hours.

The picture: the survival loop

The canonical mental model for this file — the loop that keeps a week-long run alive through guaranteed failures:

   THE SURVIVAL LOOP (1,000-GPU, week-long run)
   ══════════════════════════════════════════════

        ┌──────────────── train steps ───────────────┐
        │                                             │
        ▼                                             │
   [step] → [async sharded checkpoint every ~20 min] ─┤  bound loss to half an interval
        │                                             │
        ├─ straggler? ── detect slow rank ── evict ───┤  stop silent throughput loss
        │                                             │
        ├─ rank dies? ── NCCL timeout ── elastic ─────┤  re-form group, reload ckpt
        │               rendezvous re-forms group     │  recover in minutes
        │                                             │
        └─ silent corruption? ── checksum/validate ───┘  catch the bit flip the loss hides

   speed is necessary but not sufficient: at this scale, SURVIVAL is the design axis.

Read it as defense in depth. Checkpoints bound how much a failure can cost. Straggler detection catches the failure that doesn't crash. Elastic rendezvous recovers the failure that does. And corruption detection catches the failure that hides. A frontier run needs all four, because at a thousand GPUs each of these will fire during a week.

The nastiest failure: silent data corruption

The failures so far are loud (a crash) or measurable (a slow rank). The worst kind is silent. A cosmic ray flips a bit in HBM. A marginal GPU computes a matmul slightly wrong on one in a billion ops. A NIC corrupts a gradient chunk in transit and the checksum happens to pass. Nothing crashes. The all-reduce completes. The loss keeps falling — because one corrupted gradient among thousands is noise the optimizer averages over. And the model slowly, invisibly degrades, or a checkpoint is written with corrupt weights and every restart from it inherits the corruption.

This is the failure that haunts frontier teams, because the usual signal (the loss) is exactly the signal that doesn't fire. Defenses are partial and layered: ECC memory catches many bit flips (and reports them as correctable/uncorrectable error counts to watch); checkpoint checksums detect corruption at write/read; periodic gradient-norm and loss-spike monitoring catches gross corruption that does perturb the loss; and deterministic re-execution of a suspect step on different hardware can confirm a silent miscompute. Some teams run periodic "self-check" steps comparing a known input's output across replicas.

Teacher voice. The lesson of silent corruption is that you cannot trust the loss as your only health signal. A falling loss means the average gradient is fine; it says nothing about a single corrupted rank or a bit flip the optimizer smoothed over. At a thousand GPUs you instrument the hardware (ECC counts, NVLink error counters), checksum the checkpoints, and watch gradient norms — because the thing most likely to ruin a week-long run is the failure your primary metric is blind to.

Why frequent checkpoints, not just faster hardware

The plausible alternative: buy more reliable hardware so failures are rare enough to ignore. But MTBF scales with GPU count — even perfect cards fail often enough in aggregate that a thousand-GPU week-long run sees multiple incidents. You cannot buy your way out of the aggregation; you can only bound the cost of each failure. Reliability per card helps the constant, not the scaling, and the scaling is what bites.

A second alternative: asynchronous training (let ranks proceed without the barrier, à la old parameter servers) so one slow or dead rank doesn't block the others. This trades the convergence guarantee — async updates use stale gradients, which hurts large-model convergence (the same reason file 02 averages gradients synchronously rather than letting replicas drift). For frontier dense LLMs, the field keeps synchronous training and pays for fault tolerance with checkpointing and elastic recovery, because convergence quality is worth more than failure tolerance bought by going async. The decision is governed by whether convergence can tolerate staleness — for large LLMs it largely cannot.

Operational signals — is the run surviving well?

Healthy behavior. Checkpoints write asynchronously with a sub-10-second step stall; per-rank step times cluster tightly (no straggler); failures, when they happen, trigger elastic recovery that resumes within minutes; ECC error counts and NVLink error counters stay near zero; gradient norm steady.

First metric to degrade. Per-rank step-time spread — the earliest sign of a straggler, visible before throughput drops noticeably. A widening tail (one rank consistently 10–20% slower) is the tell. Second: rising correctable-ECC counts on a GPU, a precursor to an uncorrectable error and a crash.

The misleading metric. The loss curve — it keeps falling through a straggler (just slower) and through silent corruption (the optimizer averages it out), so a healthy-looking loss does not mean a healthy run. Watch MFU/throughput against target (catches stragglers), per-rank step time (catches the slow rank), and ECC/NVLink error counters (catches degrading hardware) — not the loss alone.

The graph an expert opens first. A per-rank step-time heatmap over the cluster — a slow rank lights up as a hot row, instantly localizing the straggler to a physical GPU. Alongside it, the checkpoint-write-duration trace (to confirm async saves aren't stalling) and the hardware-error dashboard (ECC, Xid, NVLink). The diagnostic skill is correlating a throughput dip to a specific rank, then to a specific node, then evicting it.

Boundary of applicability — where the survival design bends

Where it shines. Long, large synchronous runs (days to months, hundreds to thousands of GPUs) — frontier LLM pretraining, where failures are guaranteed and the compute lost per failure is enormous. Checkpointing + elastic + straggler detection is mandatory infrastructure here, not optional.

Where it becomes pathological. Tiny short runs (a few GPUs, an hour) where failures are rare and the checkpointing/elastic machinery is pure overhead — a single restart-from-zero is cheaper than the engineering. Also pathological: checkpointing so frequently that the write overhead dominates (the square-root rule prevents this), or checkpoints so large and storage so slow that even async snapshots stall (then you need faster local NVMe staging or more sharding).

The scale limit on intuition. "Failures are rare" is true per card and catastrophically false in aggregate — the intuition from single-GPU work that you can ignore hardware failure breaks completely at scale. "The loss tells me the run is healthy" breaks at scale too, because stragglers and silent corruption hide from it. Both intuitions, safe at small scale, become liabilities at a thousand GPUs.

The wrong model to carry, and the right one

The seductive-but-wrong intuition: "a fast run is a good run; optimize throughput and reliability takes care of itself." At a thousand GPUs for a week, the fastest possible per-step design that ignores failure will never finish — it'll crash every few hours and restart from a stale checkpoint, making negative progress. Speed without survival is wasted.

The right model: at scale, survival is a first-class design axis, co-equal with throughput. You design the checkpoint interval from MTBF, instrument for stragglers and silent corruption, and build elastic recovery — because the run that finishes is the one that survives its guaranteed failures cheaply, not the one with the highest peak MFU. A run at 90% MFU that recovers in minutes beats a run at 95% MFU that loses hours per crash.

Other ways failure shows up at scale

  • NCCL timeout / hang — a dead rank stalls the all-reduce; the job hangs until NCCL_TIMEOUT, then crashes. Elastic recovery re-forms the group.
  • Straggler with no crash — one slow GPU drags every step via the barrier; throughput down, loss fine. Detect via per-rank step time, evict.
  • Checkpoint write stalls the step — synchronous or unsharded save; switch to async sharded (DCP) so only the snapshot copy stalls.
  • Restart from corrupt checkpoint — a silent corruption was saved; every resume inherits it. Checksum checkpoints and keep a few generations to roll back.
  • Data loader desync after restart — resume didn't restore the data position, so the model re-sees or skips data. Checkpoint the loader state too.
  • Rendezvous timeout on restart — replacement nodes don't join in time; the c10d rendezvous errors. Tune timeouts and keep a warm spare pool.
  • Rising ECC errors → impending death — correctable-ECC count climbing on one GPU precedes an uncorrectable crash; proactively drain that node.
  • Loss spike from a corrupted batch or gradient — gross corruption that does perturb the loss; skip the step or roll back to the last good checkpoint.

Where this fits the larger systems map

  • Failure geometry — head-of-line blocking, again. A straggler gating the all-reduce is the same shape as the slow pipeline stage (file 04), the slow GPU in the ring (file 02), and the degraded node in a 3D layout (file 05). Synchronous systems are paced by their slowest member at every layer.
  • Same problem, different layer — database checkpoints and WAL. Bounding crash-recovery loss with periodic checkpoints is exactly the database write-ahead-log + checkpoint pattern: snapshot state periodically so a crash replays only since the last snapshot. Same recovery math, storage layer.
  • Constraint echo — MTBF aggregation. The "1,000× the failure rate" arithmetic is the same reliability-engineering truth behind RAID, replicated storage quorums, and distributed-systems failure budgets: independent failure rates add, so redundancy and recovery must scale with component count.
  • Silent failure — the Byzantine shape. A rank that computes wrong but doesn't crash is a (mild) Byzantine fault — the participant that lies rather than dies. The defenses (checksums, re-execution, cross-replica comparison) echo Byzantine fault tolerance, scaled down to "trust but verify the hardware."

Where this appears in production

  • Meta Llama 3 training — published logs detail frequent hardware failures across 16k GPUs, with checkpointing, automated failure detection, and rapid recovery as core infrastructure; straggler detection ran at the 1,000-H100 scale.
  • PyTorch Distributed Checkpoint (DCP) — sharded, async checkpointing where each FSDP rank writes its shard in parallel; the standard save/load path for large runs.
  • PyTorch torchrun + TorchElastic — the elastic agent + rendezvous backend that re-forms the process group when a node drops, without a full job reschedule.
  • c10d rendezvous backend — the on-node rendezvous preferred over etcd on HPC clusters, no external dependency.
  • NVIDIA NeMo resiliency features — straggler detection, fault detection, and auto-resume integrated into the training framework.
  • NVIDIA Xid error codes / DCGM — the GPU error and health-monitoring signals teams watch (ECC, NVLink errors) to predict and localize failures.
  • DeepSpeed checkpoint engine — async, sharded checkpointing integrated with ZeRO state.
  • Slurm / Kubernetes auto-requeue + spare pools — schedulers that relaunch or replace failed nodes so elastic recovery has hardware to re-form on.
  • Google Pathways / TPU resilience — the TPU-pod equivalent: slice rescheduling and checkpoint-based recovery for multi-thousand-chip runs.
  • OPT-175B training logbook (Meta) — a famous public record of how often runs crashed, stalled, and had to be restarted from checkpoints — the canonical evidence that survival, not speed, dominates frontier runs.
  • MosaicML / Databricks resumption — automatic resume-from-checkpoint and node replacement marketed as a core feature for long runs.
  • Checkpoint checksumming / multi-generation retention — guarding against restarting from a silently corrupted checkpoint.

Pause and recall

  1. Why does a single dead rank freeze the entire synchronous run, and what eventually unblocks it?
  2. Why does MTBF collapse as GPU count rises, and roughly how often does a thousand-GPU run fail?
  3. What does a checkpoint contain, and what does the square-root rule say about how often to write it?
  4. How do async and sharded checkpointing each cut the per-checkpoint training stall?
  5. What is a straggler, why is it more insidious than a dead rank, and how do you detect it?
  6. What does elastic training (torchrun rendezvous) do that a crude full restart doesn't?
  7. Why is the loss curve a dangerous primary health signal at scale? Name two failures it hides.
  8. What is silent data corruption, and what defenses catch it since the loss won't?

Interview Q&A

Q1. A 1,000-GPU run crashes every few hours and makes no net progress. The per-step MFU is excellent. What's wrong? A. The run is fast but not survivable. With excellent MFU but frequent crashes and no progress, the checkpoint interval is too long (each crash loses hours) and/or there's no elastic recovery (each crash triggers a full reschedule). Fix: compute the checkpoint interval from MTBF and write time (≈ √(2 × write_time × MTBF)), use async sharded checkpoints to make frequent saves cheap, and add torchrun elastic rendezvous so a failure re-forms the group and resumes in minutes. Speed without survival makes negative progress. Common wrong answer to avoid: "Optimize the step more / add GPUs" — more GPUs raises the failure rate; the problem is recovery, not speed.

Q2. How do you choose a checkpoint interval? A. From two costs: the work lost per failure (≈ half an interval) falls as you checkpoint more often; the write overhead rises. The optimum balances them — roughly √(2 × write_time × MTBF) (Young/Daly). So a low MTBF (frequent failures) or a cheap async write pushes toward more frequent checkpoints. For a thousand-GPU run failing every ~3 hours with a ~1-minute async write, ~20–30 minutes is a reasonable interval. Common wrong answer to avoid: "Every N steps, a round number that feels safe" — the interval must come from MTBF and write cost, not intuition.

Q3. Throughput is 20% below target, nothing has crashed, and the loss is falling normally. Diagnose. A. A straggler — one rank alive but slow (thermal throttle, degraded NVLink, ECC retries). Because the all-reduce is a barrier, the slowest rank paces all of them (head-of-line blocking), so throughput drops with no crash and a normal loss. Find it via a per-rank step-time heatmap (the slow rank lights up), localize to the physical GPU/node, and evict or replace it. Don't trust the loss — it hides the straggler. Common wrong answer to avoid: "The model converged into a slower regime / it's fine" — uniform 20% throughput loss with a tight loss curve is a hardware straggler, not a model effect.

Q4. Why keep synchronous training instead of going async to tolerate failures? A. Async training (stale gradients, no barrier) tolerates a slow/dead rank but degrades convergence — the same reason file 02 averages gradients synchronously rather than letting replicas drift. For frontier dense LLMs, convergence quality is worth more than the failure tolerance async buys, so the field keeps synchronous training and pays for fault tolerance with checkpointing and elastic recovery instead. The decision hinges on whether convergence can tolerate staleness — for large LLMs it largely can't. Common wrong answer to avoid: "Async is strictly better for fault tolerance" — it trades away convergence; that's why synchronous + checkpointing won for LLMs.

Q5. What is silent data corruption and why is it the scariest failure? A. A bit flip or marginal-hardware miscompute that doesn't crash anything — the all-reduce completes, the loss keeps falling (the optimizer averages one bad gradient away), and the model quietly degrades or a corrupt checkpoint is saved. It's scariest because the primary health signal (loss) is blind to it. Defenses: ECC monitoring, checkpoint checksums, gradient-norm/loss-spike watches, deterministic re-execution of suspect steps, and cross-replica self-checks. Common wrong answer to avoid: "The loss would spike if the model were corrupting" — one corrupted gradient among thousands is averaged out; gross corruption spikes, subtle corruption hides.

Q6. (Cumulative.) Trace how a single GPU death propagates through the mechanisms of this module, and what each layer must do to recover. A. The dead rank stops sending its all-reduce chunk (file 02's collective), so the synchronous barrier stalls every rank in its 3D group (files 02/05). NCCL times out and the job would crash. Elastic rendezvous (this file) detects the drop, re-forms the process group (possibly with a spare node), every rank reloads its sharded checkpoint (FSDP shards from file 03 + optimizer state from file 01), restores the data-loader and RNG position, and resumes. The 3D layout (file 05) must be re-established on the new group. Recovery cost = time since last checkpoint + re-form + reload. Common wrong answer to avoid: "Just the dead GPU restarts" — the barrier means the whole synchronous group must re-form and reload; one rank can't recover alone.

Design/debug exercise (10 min)

Step 1 — modeled example. Compute the checkpoint interval and per-failure loss for a 1,000-GPU run:

   MTBF (aggregate) ≈ 3 hours = 180 min
   async sharded checkpoint write ≈ 1 min (overlapped; step stall ~5 s)
   optimal interval ≈ √(2 × write_time × MTBF) = √(2 × 1 × 180) ≈ 19 min  → round to 20
   work lost per failure ≈ interval / 2 ≈ 10 min   (vs hours with a stale checkpoint)
   save overhead ≈ step_stall / interval ≈ 5 s / 20 min ≈ 0.4%   (negligible)

Step 2 — your turn. For the 70B running example scaled to 1,000 H100s for a 10-day run: (a) If the aggregate MTBF is ~2 hours and async writes stall the step ~5 s, what checkpoint interval would you pick and how much work do you lose per failure? (b) Throughput drops 15% with no crash and a normal loss — what's the failure, and which graph do you open to localize it? (c) A restart resumes but the loss is mysteriously higher than before the crash — name two state items the checkpoint may have failed to save.

Step 3 — reproduce from memory. Without looking, redraw the survival loop (train → checkpoint → straggler-detect → elastic-recover → corruption-check), write the checkpoint-interval rule (√(2 × write × MTBF) and "lose half an interval"), and state the one-line truth: at scale, survival is a design axis co-equal with throughput — the loss is not a sufficient health signal. Connect to file 02 in one sentence: the synchronous all-reduce that keeps replicas identical is exactly what makes one dead rank freeze the cluster.

Operational memory

This chapter explained why a run that trains perfectly on 64 GPUs for an afternoon does not survive on 1,000 GPUs for a week: MTBF aggregates, so failures are guaranteed, and the synchronous all-reduce turns any one dead rank into a frozen cluster. The important idea is that at scale survival is a first-class design axis — you bound the cost of guaranteed failures with frequent async sharded checkpoints, catch the slow-but-alive straggler before it silently robs throughput, recover dead ranks with elastic rendezvous instead of full restarts, and guard against silent corruption that the loss curve hides.

You learned to set the checkpoint interval from MTBF and write time (≈ √(2 × write × MTBF)), to make saves cheap with async sharded checkpointing, to detect stragglers via per-rank step time, to wire elastic recovery with torchrun's rendezvous, and to instrument hardware (ECC, NVLink, checksums) because the loss won't warn you. That solves the week-long-run problem: a failure now costs minutes, not hours, and the run finishes.

Carry this diagnostic forward: when a run makes no net progress despite high MFU, it's a survival problem, not a speed problem — fix the checkpoint interval and recovery first. When throughput drops with a healthy loss, suspect a straggler and open the per-rank step-time heatmap. And never trust the loss as your only health signal at scale — instrument the hardware, because the failure most likely to ruin a week is the one your primary metric can't see.

Remember:

  • MTBF aggregates: 1,000 GPUs fail ~1,000× more often than one — failures are guaranteed, design for them.
  • The synchronous all-reduce is a barrier: one dead rank freezes every other rank until the NCCL timeout.
  • Checkpoint interval ≈ √(2 × write_time × MTBF); you lose ~half an interval per failure.
  • Async + sharded checkpointing cuts the per-save step stall from minutes to seconds.
  • Stragglers (alive but slow) silently pace the whole run via the barrier — detect via per-rank step time, not the loss.
  • Elastic rendezvous re-forms the group and resumes in minutes; silent corruption needs ECC/checksums/gradient-norm watches because the loss hides it.

Bridge. We can now keep a thousand-GPU run alive for a week: split the model, place it on the right wires, trade compute for memory, and survive the failures that are guaranteed at scale. That is the full distributed-training picture — every mechanism a response to the memory wall and the coordination cost it created. But one assumption has run silently under all of it: that each individual GPU, once fed its shard of work, actually runs that work fast. A GPU at 35% MFU is wasting two-thirds of the hardware you fought so hard to coordinate — and no amount of better parallelism fixes a kernel that doesn't saturate the tensor cores. The next module opens a single GPU and asks why its kernels run slow: memory-bandwidth limits, kernel-launch overhead, fusion, and the arithmetic-intensity wall that decides whether your expensive H100 is computing or just waiting on memory. → ../09_gpu_acceleration_stack/00-first-principles.md