07. NeMo customization — the training-side framework for the model NIM serves¶

~21 min read. NIM serves a model. Triton hosts it. TensorRT-LLM compiles it. But every one of those layers assumes a model already exists — trained, fine-tuned, aligned. For a model that is actually yours, someone built it on your data, across many GPUs, with the same "keep the hardware fed" pressure that has run through this whole module. NeMo is the framework that does that build. The question is the same one NIM forced: when does the vendor framework earn its place over the PyTorch and HuggingFace you already know?

Built on the same "feed the beast" invariant, now on the training side, and on the buy-vs-build boundary from NIM, applied to building models instead of serving them. This file introduces distributed-training parallelism (the training echo of file 03's collectives), data curation at scale, and the SFT/PEFT/alignment customization stages, and weighs NeMo against raw PyTorch + HuggingFace the way file 06 weighed NIM against the hand-built stack.

What every serving layer quietly assumed¶

Trace the module backward. NIM serves Llama-3-70B. Triton hosts the engine. TensorRT-LLM compiled it. The roofline, fusion, and NCCL files made the forward pass fast. Every one of those layers takes the model weights as given — they make an existing model run fast on existing hardware. None of them produced the weights.

But a 70B model did not appear from nowhere, and a 70B that knows your domain certainly did not. Someone curated trillions of tokens, ran pretraining across hundreds of GPUs for weeks, then fine-tuned on instruction data, then aligned the model to preferences. That training pipeline hits the same wall as serving — a 70B does not fit on one GPU, so its weights and activations are split across many, and the GPUs spend time in collectives (file 03's all-reduce) keeping the copies in sync. Keeping those training GPUs fed is its own hard problem, and it is the problem NeMo exists to package.

This file is NeMo: NVIDIA's framework for the training and customization side of the model lifecycle — data curation, pretraining, supervised fine-tuning (SFT), parameter-efficient fine-tuning (PEFT), and alignment — and the judgment of when it beats the raw PyTorch + HuggingFace stack most teams reach for first.

What this file solves¶

A team needs to customize Llama-3-70B on their own domain data and finds their HuggingFace Trainer script either OOMs on the 70B or crawls because it splits the model across GPUs naively and the collectives stall. The naive read is "we need more GPUs." The real cause is that customizing a 70B is a distributed training problem — the model spans GPUs, the data must be curated at scale, and the parallelism and collectives must be efficient — and a single-GPU-shaped HuggingFace script was never built for that. This file teaches you when to reach for NeMo, which packages the distributed-training stack (Megatron-Core parallelism, scalable data curation, SFT/PEFT/alignment recipes) the way TensorRT-LLM packaged the inference stack — and when raw PyTorch + HuggingFace is still the right tool.

Why a training framework, and not just a bigger training script¶

A HuggingFace Trainer script is the right tool for an enormous range of work: fine-tune a 7B on a single node, run LoRA on a 13B, prototype anything. It is flexible, familiar, and you control every line. It breaks at the same place serving broke — when the model is too big for one GPU. Then you need the weights, the gradients, the optimizer states, and the activations split across GPUs, with collectives keeping everything consistent, and you need that split to be efficient or the GPUs stall in communication instead of computing. Hand-rolling that on top of a Trainer is exactly the weeks-of-specialist-time problem file 06 described, now on the training side.

NeMo packages it. Under the hood NeMo builds on Megatron-Core, NVIDIA's library for the parallelism that makes large-model training efficient — tensor parallelism (split each layer's matrices across GPUs), pipeline parallelism (assign consecutive layers to different GPUs), and data parallelism (replicate and average gradients), composed together. NeMo wraps that in PyTorch Lightning so you configure a training run instead of writing the distributed plumbing, and it ships recipes for the customization stages — SFT, PEFT (LoRA and friends), and alignment (DPO, RLHF-style methods). The pieces are tuned to work together on NVIDIA clusters, the way the serving stack was tuned for NVIDIA GPUs.

Teacher voice. This is file 06's decision wearing training clothes. NIM was "take the prebuilt serving stack when the optimization is a commodity." NeMo is "take the prebuilt training stack when the distributed-training engineering is undifferentiated." And the counter-case is identical: when the model is small enough for a Trainer, or your method is too custom for a recipe, the flexible tool wins. Same buy-vs-build shape, different side of the lifecycle.

The training script that OOMs on the 70B¶

A team takes their working HuggingFace Trainer fine-tuning script (fine on a 7B on one node) and points it at Llama-3-70B across eight H100s. It OOMs immediately: 70B parameters in BF16 is ~140 GB of weights, and training needs gradients, optimizer states (Adam keeps two more tensors the size of the weights), and activations on top. That is easily 5-6x the weight memory, roughly 700-840 GB, which is more than the 640 GB that eight 80 GB cards hold in total and far more than any single card.
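The arithmetic behind that OOM can be sketched in a few lines. This is a rough estimate under stated assumptions, not a profiler: the byte widths are illustrative choices, the helper name `training_state_gb` is hypothetical, and activations are left out because they depend on batch size and sequence length.

```python
GB = 1e9  # decimal gigabytes, matching the "~140 GB" figure in the text


def training_state_gb(params, weight_bytes=2, grad_bytes=2, adam_bytes_per_state=2):
    """Rough memory for weights + gradients + Adam's two moment tensors.

    Activations are excluded; they come on top of these numbers.
    All byte widths are assumptions chosen for illustration.
    """
    weights = params * weight_bytes / GB
    grads = params * grad_bytes / GB
    adam = 2 * params * adam_bytes_per_state / GB  # first and second moments
    return {"weights": weights, "grads": grads, "adam": adam,
            "total": weights + grads + adam}


all_bf16 = training_state_gb(70e9)
# Assumption: a mixed-precision setup that keeps the Adam moments in FP32.
fp32_moments = training_state_gb(70e9, adam_bytes_per_state=4)

for name, est in (("all BF16", all_bf16), ("FP32 Adam moments", fp32_moments)):
    ratio = est["total"] / est["weights"]
    print(f"{name}: {est['total']:.0f} GB before activations ({ratio:.0f}x the weights)")
```

Even the optimistic all-BF16 case lands at four times the weights before a single activation is stored, which is why the text's 5-6x figure is a reasonable working number.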

They reach for the obvious fix: turn on data-parallel multi-GPU. But data parallelism replicates the whole model on each GPU and only splits the batch — so each GPU still needs the full 70B plus its training state, which still doesn't fit. They add gradient checkpointing and offloading; now it fits but crawls, because the naive setup spends most of its time moving data between GPUs and to CPU instead of computing.

The visible break: the script either OOMs or trains so slowly the run would take months. Adding GPUs the naive way doesn't help — data parallelism can't make a model fit that doesn't fit on one GPU.

So the real problem is not that they have too few GPUs; it is that a 70B must be split across GPUs — the model itself, not just the batch — and doing that split efficiently (tensor + pipeline parallelism with collectives on the fast wire) is a distributed-systems problem a single-GPU-shaped script was never built to solve. The fix is not more GPUs or more checkpointing tricks. It is a framework built for model-parallel training.

So how do we split a model too big for one GPU across many, efficiently, without hand-writing the parallelism?

When the model itself must be cut across cards¶

Take the smallest case that breaks data parallelism. One layer's weight matrix is 2 GB; the full model is 140 GB; one GPU holds 80 GB. Data parallelism puts a full copy of all 140 GB on each GPU, which is impossible. Tensor parallelism instead cuts that 2 GB matrix itself across, say, four GPUs, so each holds 0.5 GB of it and they combine partial results with an all-reduce (file 03's collective) every layer. Now the 140 GB of weights live across four GPUs as four 35 GB shards, and each shard fits on its card. Pipeline parallelism cuts a different way (GPU 0 holds layers 1-20, GPU 1 holds 21-40), so a long model spans cards by depth. Compose them and a 70B trains across a node; the collectives keeping the shards consistent are the training-side echo of the serving all-reduce.

That composition — tensor across the width, pipeline across the depth, data across replicas — is what NeMo configures for you and a Trainer does not.
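To make the tensor-parallel cut concrete, here is a toy simulation in plain Python. It is a sketch of the idea only: the "GPUs" are list slices, the all-reduce is an element-wise sum, and the function names are hypothetical rather than anything from Megatron-Core.

```python
def matvec(W, x):
    """y = W @ x for a list-of-rows matrix W and a vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def split_columns(W, shards):
    """Cut W's columns into `shards` equal slices, one per simulated GPU."""
    width = len(W[0]) // shards
    return [[row[i * width:(i + 1) * width] for row in W] for i in range(shards)]


def tensor_parallel_matvec(W, x, shards):
    width = len(x) // shards
    W_shards = split_columns(W, shards)
    x_shards = [x[i * width:(i + 1) * width] for i in range(shards)]
    # Each simulated GPU multiplies only its own slice of the matrix.
    partials = [matvec(Ws, xs) for Ws, xs in zip(W_shards, x_shards)]
    # The all-reduce: sum the partial outputs element-wise across GPUs.
    return [sum(vals) for vals in zip(*partials)]


W = [[1, 2, 3, 4], [5, 6, 7, 8]]  # a 2x4 weight matrix
x = [1, 0, 2, 1]
full = matvec(W, x)
sharded = tensor_parallel_matvec(W, x, shards=4)
print(full, sharded)  # both are [11, 27]
```

No simulated GPU ever holds the whole matrix, yet the summed result matches the unsharded product. The sum is the step that costs a collective on every layer.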

Rule: a model too big for one GPU is a distributed-training problem, and the parallelism must be efficient¶

The training rule. When a model fits on one GPU, use the flexible tool (HuggingFace Trainer, raw PyTorch) — it's simpler and you control everything. When the model is too big for one GPU, you must split the model itself across GPUs with tensor + pipeline parallelism (not just the batch with data parallelism), and those splits must keep their collectives on the fast interconnect or the GPUs stall communicating instead of computing. NeMo (built on Megatron-Core) packages that efficient parallelism plus the data-curation and SFT/PEFT/alignment stages; reach for it when the distributed-training engineering would otherwise be undifferentiated specialist toil.

Why the rule exists. The primitive is memory: a 70B's weights, gradients, optimizer states, and activations far exceed one GPU. The constraint is that data parallelism replicates the model, so it can't make a too-big model fit — only splitting the model (tensor/pipeline parallelism) does that, and those splits add per-layer collectives that must be fast (file 03). Hand-rolling that on a Trainer breaks because the parallelism, the collectives, and the memory layout are a distributed-systems problem the script wasn't built for. NeMo relieves it by packaging Megatron-Core's tuned parallelism; the new pressure it creates is the same as NIM's — a vendor framework's conventions, configs, and lock-in to learn and depend on.

1) The four pieces NeMo packages — the model lifecycle, fed¶

NeMo covers the whole "make a model" lifecycle, each piece sized for scale:

        DATA              PRETRAIN              CUSTOMIZE            ALIGN
   ┌───────────┐     ┌────────────────┐    ┌──────────────┐   ┌─────────────┐
   │ NeMo      │     │ Megatron-Core  │    │ SFT          │   │ DPO / RLHF  │
   │ Curator   │ ──▶ │ tensor +       │──▶ │ full or      │──▶│ preference  │
   │ dedup,    │     │ pipeline +     │    │ PEFT (LoRA)  │   │ alignment   │
   │ quality   │     │ data parallel  │    │              │   │             │
   │ filter,   │     │ across 100s of │    │ teach        │   │ choose      │
   │ PII, lang │     │ GPUs           │    │ instructions │   │ between     │
   └───────────┘     └────────────────┘    └──────────────┘   │ good answers│
   trillions of       weeks of fed          hours-to-days       └─────────────┘
   tokens, GPU-        GPUs, collectives     on your data
   accelerated         on the fast wire      (PEFT = cheap)

Data curation (NeMo Curator). Before any training, trillions of tokens must be downloaded, cleaned, quality-filtered, deduplicated (exact and fuzzy), PII-redacted, and language-classified. Curator does this GPU-accelerated, scaling across multi-node multi-GPU using RAPIDS (cuDF/cuML) and Dask/Ray — the same "feed the beast" pressure applied to data processing, not model math.
Pretraining. Megatron-Core's composed parallelism trains the base model across hundreds of GPUs, keeping the collectives on the fast interconnect.
Customization (SFT / PEFT). Teach an existing base model your task. Full SFT updates all weights (expensive); PEFT (LoRA and variants) trains a tiny set of added parameters (cheap, fast, the common path for most teams).
Alignment (DPO, RLHF-style). Once the model gives good answers, alignment teaches it to prefer the better of two good answers using preference data.

Most teams touch only the right half — customize a base model with PEFT and maybe align it — which is exactly where NeMo competes most directly with HuggingFace.
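The curation stage is easier to picture with a toy version. The sketch below is plain Python and deliberately not NeMo Curator's API: `curate`, the word-count threshold, and the email-only regex are all illustrative assumptions, and fuzzy deduplication and language ID are left out.

```python
import hashlib
import re

EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def normalize(text):
    """Lowercase and collapse whitespace so trivial variants hash the same."""
    return " ".join(text.lower().split())


def curate(docs, min_words=5):
    seen = set()
    kept = []
    for doc in docs:
        if len(doc.split()) < min_words:          # crude quality filter
            continue
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest in seen:                        # exact-duplicate removal
            continue
        seen.add(digest)
        kept.append(EMAIL.sub("[EMAIL]", doc))    # PII redaction (emails only)
    return kept


docs = [
    "Contact ops@example.com when the training job stalls on node 3.",
    "contact ops@example.com   when the training job stalls on node 3.",
    "too short",
    "Tensor parallelism keeps its all-reduce on the fast intra-node wire.",
]
cleaned = curate(docs)
print(cleaned)
```

The same three passes (filter, dedup, redact) run in a single loop here. At trillion-token scale each pass becomes its own distributed job, which is the part Curator packages.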

2) The picture — where the model itself gets cut¶

The mental model that lands this file is the two cuts that make a too-big model fit, drawn against the data-parallel approach that can't:

DATA PARALLEL (can't fit a too-big model)     MODEL PARALLEL (NeMo / Megatron-Core)
─────────────────────────────────────────     ────────────────────────────────────
 GPU0: [FULL 140GB model] + batch slice 0       tensor parallel (cut each layer's width):
 GPU1: [FULL 140GB model] + batch slice 1         GPU0:[½ of layer matrices]  ┐ all-reduce
 GPU2: [FULL 140GB model] + batch slice 2         GPU1:[½ of layer matrices]  ┘ every layer
   each GPU needs the WHOLE model → OOM
                                                pipeline parallel (cut the depth):
 only splits the BATCH, replicates weights        GPU2:[layers 1–20] → GPU3:[layers 21–40]
 → useless when the model won't fit on one        activations flow GPU2→GPU3 (pipeline)

                                                compose both → 140GB model spans the cards,
                                                gradients averaged across data-parallel replicas

Data parallelism puts the whole model on every GPU and splits only the batch — it scales throughput once the model fits, but it cannot make a model fit that doesn't. Tensor parallelism cuts each layer's matrices across GPUs (combining partial results with an all-reduce every layer, file 03's collective). Pipeline parallelism cuts the model by depth across GPUs (passing activations between stages). Composed, they make the 140 GB model live across the cards. The collectives are the cost, and keeping them on NVLink/InfiniBand (file 03) is what keeps the training GPUs fed.
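The depth cut is simple enough to sketch directly. The helper below is hypothetical (not a Megatron-Core function) and assumes contiguous, near-equal stages; the 80-layer depth is an assumed figure for a 70B-class model, used only for illustration.

```python
def assign_stages(num_layers, num_stages):
    """Split layers 1..num_layers into contiguous, near-equal pipeline stages."""
    base, extra = divmod(num_layers, num_stages)
    stages, start = [], 1
    for s in range(num_stages):
        size = base + (1 if s < extra else 0)  # spread any remainder over early stages
        stages.append((start, start + size - 1))
        start += size
    return stages


# Assumed depth of 80 layers, cut into 4 pipeline stages.
for stage, (first, last) in enumerate(assign_stages(80, 4)):
    print(f"stage {stage}: layers {first}-{last}")
```

Each stage only talks to its neighbours, passing activations forward and gradients back, so the traffic is far less frequent than tensor parallelism's per-layer all-reduce.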

3) Customizing the 70B — the running example moves upstream¶

Our endpoint serves Llama-3-70B. Now suppose it must serve our 70B — fine-tuned on our domain documents and aligned to our preferences. That model is the input to everything downstream: NeMo builds it, then TensorRT-LLM compiles it (file 04), Triton serves it (file 05), or a NIM wraps it (file 06).

In NeMo, customizing the 70B looks like: curate the domain corpus with Curator (dedup, quality-filter, PII-strip), then run PEFT (LoRA) rather than full SFT — training a small adapter on top of the frozen 70B base, which fits and finishes in hours-to-days on a node instead of the weeks full pretraining would take, because LoRA updates a tiny fraction of the parameters. Then optionally align with DPO on preference pairs. The output is a customized 70B (or a base-plus-adapter) ready for the serving stack.
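The optional DPO step at the end of that recipe reduces to a small formula per preference pair. The sketch below follows the standard published DPO objective; it is not NeMo's implementation, and `beta=0.1` and the example log-probabilities are assumed values for illustration.

```python
import math


def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss for one preference pair, from summed log-probabilities.

    Each argument is log p(response | prompt) under either the trainable
    policy or the frozen reference model. beta=0.1 is an assumed value.
    """
    chosen_gain = policy_chosen - ref_chosen
    rejected_gain = policy_rejected - ref_rejected
    margin = beta * (chosen_gain - rejected_gain)
    # -log(sigmoid(margin)) is the same as log(1 + exp(-margin)).
    return math.log1p(math.exp(-margin))


untrained = dpo_loss(-10.0, -12.0, -10.0, -12.0)      # policy equals reference
prefers_chosen = dpo_loss(-8.0, -14.0, -10.0, -12.0)  # moved toward the chosen answer
print(untrained, prefers_chosen)
```

The loss falls as the policy raises the chosen answer's likelihood relative to the reference and lowers the rejected one's, which is all "prefer the better of two answers" means numerically.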

The LoRA choice connects straight back to file 06: a base-plus-LoRA model is exactly the "stock base plus adapters" case where a NIM can load the adapter onto its optimized base — so customizing with PEFT keeps the prebuilt-serving path open, while a full-SFT custom model is more likely to need the hand-built stack.

Mini-FAQ. "Why PEFT instead of full SFT for our 70B?" Full SFT updates all 70B parameters — it needs the full model-parallel training setup, lots of GPUs, and produces a wholly new 140 GB model. LoRA freezes the base and trains a small adapter (often <1% of parameters), so it fits more easily, trains far faster and cheaper, and yields a tiny adapter you can swap or serve on top of the stock base. For most domain-customization, PEFT is the right default; full SFT is for when the task needs to move the base model substantially.
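The "often <1% of parameters" claim is easy to check with arithmetic. In the sketch below the layer shapes, the rank, and the choice of adapting only the four attention projections are all assumptions made for illustration, not values read from any checkpoint or NeMo recipe.

```python
def lora_params(d_in, d_out, rank):
    """Trainable parameters LoRA adds to one d_out x d_in weight.

    LoRA learns A (rank x d_in) and B (d_out x rank) and leaves W frozen.
    """
    return rank * (d_in + d_out)


# Assumed shapes for a 70B-class decoder (illustrative only).
HIDDEN, KV_DIM, LAYERS, RANK = 8192, 1024, 80, 16

per_layer = (
    lora_params(HIDDEN, HIDDEN, RANK)    # query projection
    + lora_params(HIDDEN, KV_DIM, RANK)  # key projection
    + lora_params(HIDDEN, KV_DIM, RANK)  # value projection
    + lora_params(HIDDEN, HIDDEN, RANK)  # output projection
)
adapter = per_layer * LAYERS
fraction = adapter / 70e9
print(f"{adapter:,} trainable parameters, {fraction:.3%} of the base")
```

Under these assumptions the adapter is tens of millions of parameters against a 70-billion-parameter base. Gradients and optimizer state are needed only for that sliver, which is where the memory and time savings come from.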

4) Why NeMo and not raw PyTorch + HuggingFace?¶

The plausible alternative is the stack every ML engineer already knows: HuggingFace transformers/peft/trl on PyTorch, with accelerate or DeepSpeed for multi-GPU. Why reach for NeMo?

Because under our workload — customizing a 70B that must be split across GPUs, with curation at scale and a path to alignment — the distributed-training engineering is exactly the undifferentiated specialist toil file 06 warned about. NeMo packages Megatron-Core's tuned tensor+pipeline parallelism, GPU-accelerated curation, and SFT/PEFT/alignment recipes that work together on NVIDIA clusters, so you configure a run instead of hand-wiring parallelism and collectives. The honest counter-case is sharp and common: for models that fit on one GPU, for LoRA on a 7B-13B, for rapid prototyping, or when you want the enormous HuggingFace ecosystem and community recipes, the flexible stack is simpler and faster to results. NeMo earns its place at large scale (the 70B that must be model-parallel) and on NVIDIA clusters; HuggingFace + PyTorch wins for the long tail of smaller, faster-iterating, more-custom work.

Why this instead of HuggingFace, under our workload? Our model is a 70B that genuinely must be split across GPUs, with trillions of tokens to curate and alignment ahead. That's the case where NeMo's packaged Megatron-Core parallelism and curation pay for the framework's learning curve. If we were doing LoRA on a 7B, HuggingFace's peft on a single node would get us there faster with a stack we already know.

5) The property that decides the framework: does the model fit on one GPU?¶

The one dimension that decides NeMo-vs-HuggingFace is whether the model — with its training state — fits on a single GPU. Below that line, the flexible stack wins on simplicity; above it, the model must be split and NeMo's packaged parallelism earns its place.

Workload	Fits on one GPU (with training state)?	Framework verdict
LoRA on 7B-13B, single node	yes	HuggingFace `peft` — simpler, familiar, faster to results
Full SFT of a 7B	no on one card; yes on one node with sharding	HuggingFace + DeepSpeed/FSDP; NeMo optional
Full SFT / pretraining of 70B+ across many GPUs	no — must be model-parallel	NeMo (Megatron-Core) — packaged tensor+pipeline parallelism
Trillion-token curation before pretraining	n/a (data, not model)	NeMo Curator — GPU-accelerated, multi-node
Rapid prototyping, exotic custom training loop	yes, usually	raw PyTorch — full control

The asymmetry to remember: data parallelism scales throughput once the model fits but can never make a too-big model fit — only model parallelism (tensor/pipeline) does that. So the question is never "how many GPUs"; it is "does the model fit on one." Below the line, reaching for NeMo's heavy parallelism is overkill; above it, hand-rolling model parallelism on a Trainer is the expensive mistake.

6) The failure walked through: the parallelism config that stalled the GPUs¶

A team moves their 70B fine-tune to NeMo, sets a large tensor-parallel degree (TP=8) to be safe, and launches across two nodes of four GPUs each. It runs, but GPU utilization sits low and the step time is far worse than expected. They're confused: they used the framework built for this, so why is it slow?

Trace it. Tensor parallelism does an all-reduce every layer, and at TP=8 spanning two nodes, those collectives cross the slower inter-node InfiniBand instead of staying on intra-node NVLink (file 03's exact pressure). With TP=8 split across nodes, every layer pays a slow cross-node all-reduce, so the GPUs spend most of each step waiting on the collective rather than computing — the training echo of file 03's "all-reduce on PCIe instead of NVLink." The fix was to keep tensor parallelism within a node (TP=4 on the 4 NVLink-connected GPUs) and use pipeline parallelism across nodes (where the inter-stage transfer is less frequent than per-layer all-reduce), matching the parallelism layout to the interconnect topology. The lesson: NeMo packages the parallelism, but choosing the degrees to match the topology is still a real decision — the framework gives you efficient collectives only if you place them on the fast wire.
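A small pre-flight check would have caught that layout before launch. The helper below is hypothetical and is not part of NeMo; it assumes tensor-parallel groups are built from consecutive GPU ranks and that each node has four NVLink-connected GPUs, as in the walkthrough. In NeMo 2.0 the corresponding knobs are the `tensor_model_parallel_size` and `pipeline_model_parallel_size` settings.

```python
def check_layout(tp, pp, gpus_per_node, num_nodes):
    """Return the data-parallel degree and whether TP groups stay inside a node.

    Assumes tensor-parallel groups are formed from consecutive GPU ranks,
    so a group stays on one node only when tp divides gpus_per_node.
    """
    world = gpus_per_node * num_nodes
    if world % (tp * pp) != 0:
        raise ValueError("tp * pp must divide the total GPU count")
    return {
        "dp": world // (tp * pp),
        "tp_intra_node": tp <= gpus_per_node and gpus_per_node % tp == 0,
    }


stalled = check_layout(tp=8, pp=1, gpus_per_node=4, num_nodes=2)
fixed = check_layout(tp=4, pp=2, gpus_per_node=4, num_nodes=2)
print("TP=8:", stalled)
print("TP=4, PP=2:", fixed)
```

Both layouts use all eight GPUs, but only the second keeps every per-layer all-reduce on NVLink and sends just the stage-to-stage activations over the inter-node link.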

7) Cost movement: what NeMo buys and what it costs¶

What it fixes: packages Megatron-Core's tensor+pipeline+data parallelism so a too-big model trains across GPUs without hand-wiring; ships GPU-accelerated curation for trillion-token datasets; provides tuned SFT/PEFT/alignment recipes — turning a distributed-systems build into a configured run.
What it costs: a learning curve and conventions distinct from HuggingFace (different config, different checkpoint format, NeMo 2.0's Lightning-based API), heavier setup than a Trainer, and the same vendor-dependency shape as NIM — tuned for NVIDIA clusters, with its own ecosystem to adopt over the larger HuggingFace one.
Which subsystem pays: the training-platform/ML-infra team owns the cluster config, the parallelism layout, and the NeMo conventions; the reward lands as efficient large-model training (fed GPUs, faster convergence) and a curation pipeline that scales. The new pressure: the parallelism degrees must match the interconnect topology (section 6), so the team now owns a placement decision that, done wrong, stalls the very GPUs it was meant to feed.

For the running example: customizing our 70B in NeMo with curated data and LoRA gets an efficient model-parallel run instead of an OOMing Trainer, at the cost of learning NeMo's conventions and owning the parallelism-vs-topology placement — a trade that's clearly right for a 70B and clearly overkill for a 7B LoRA we could do in HuggingFace.

8) Signals: healthy, first to degrade, and the liar¶

Healthy: training GPUs at high utilization with step time close to the compute-bound floor; collectives (all-reduce) overlapped with compute and staying on the fast intra-node wire; loss curve descending smoothly; curation throughput scaling with added nodes.
First metric to degrade: GPU utilization drops while step time climbs — the giveaway that the GPUs are stalling on collectives (tensor-parallel all-reduce crossing a slow link, or a parallelism layout mismatched to the topology, section 6) rather than computing. Tokens/sec/GPU sags before the loss curve shows anything.
The misleading metric: the loss curve looking fine while the run is slow — correctness (loss descending) says nothing about whether the GPUs are fed. A run can converge correctly at a fraction of the achievable speed because the parallelism is stalling on communication.
The graph an expert opens first: per-GPU utilization alongside the collective/communication time per step. High communication time with low compute means the parallelism degrees don't match the interconnect — pull tensor parallelism inside the node, push pipeline parallelism across nodes. This is file 03's interconnect signal, now on the training side.
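A back-of-envelope model shows why that graph moves so sharply when the collective leaves the node. This is a bandwidth-only sketch of a ring all-reduce that ignores latency and overlap. The message size and both link speeds are assumed round numbers chosen for illustration, not measurements.

```python
def ring_allreduce_ms(message_bytes, gpus, link_gb_per_s):
    """Bandwidth-only estimate of a ring all-reduce (latency ignored)."""
    traffic = 2 * (gpus - 1) / gpus * message_bytes  # bytes each GPU sends
    return traffic / (link_gb_per_s * 1e9) * 1e3


def comm_fraction(compute_ms, comm_ms):
    """Share of a step spent on exposed (non-overlapped) communication."""
    return comm_ms / (compute_ms + comm_ms)


# Assumptions: a 64 MiB activation all-reduce, 450 GB/s for the intra-node
# link, 50 GB/s for the inter-node link, 1 ms of compute per layer.
msg = 64 * 1024 * 1024
fast = ring_allreduce_ms(msg, gpus=4, link_gb_per_s=450)
slow = ring_allreduce_ms(msg, gpus=8, link_gb_per_s=50)

print(f"intra-node TP=4: {fast:.2f} ms, {comm_fraction(1.0, fast):.0%} of the layer")
print(f"cross-node TP=8: {slow:.2f} ms, {comm_fraction(1.0, slow):.0%} of the layer")
```

Under these assumptions the cross-node collective takes roughly ten times longer than the intra-node one. That is the pattern of high communication time and low compute that the utilization graph surfaces.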

9) Boundary: where NeMo shines and where it doesn't¶

NeMo shines for large-model training and customization on NVIDIA clusters — the 70B+ that must be model-parallel, trillion-token curation, and the full pretraining-to-alignment lifecycle on a fleet sized for it. There its packaged Megatron-Core parallelism and GPU-accelerated curation turn a distributed-systems build into a configured run, and the collectives stay fed.

It becomes overkill for small-model work: LoRA on a 7B, full SFT of a model that fits a node with FSDP, or rapid prototyping — there NeMo's heavy machinery and learning curve cost more than HuggingFace's simpler, more familiar stack buys. It's also a poorer fit when you need the breadth of the HuggingFace ecosystem (every new model and recipe lands there first) or an unusual custom training loop NeMo's recipes don't express. The scale limit that invalidates naive intuition: "use the framework built for scale for everything" is wrong below the one-GPU line — model parallelism's collectives are pure overhead when the model already fits, and a Trainer is faster to results.

10) Wrong model: "more GPUs (data parallelism) will let me train any model"¶

The seductive wrong idea is that training a bigger model is just a matter of throwing more GPUs at it with data parallelism. Plain data parallelism replicates the whole model on each GPU and splits only the batch, so it scales throughput once the model fits, but it can never make a model fit that doesn't fit on one GPU. (Sharded variants such as FSDP and DeepSpeed ZeRO are a different tool: they split the training state across GPUs, which is why they get their own row in section 5.) Adding GPUs the data-parallel way to a 70B that OOMs on one card just OOMs on more cards.

Replace it with: a model too big for one GPU must be split itself across GPUs — tensor parallelism (cut each layer's width) and pipeline parallelism (cut the depth) — and those splits add per-layer collectives that must stay on the fast interconnect or the GPUs stall. Data parallelism is for throughput after the model fits; model parallelism is for fitting it at all. NeMo's job is packaging the model parallelism efficiently; "just add GPUs" without it solves nothing for a too-big model.
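The asymmetry shows up as soon as per-GPU memory is written as a function of GPU count. The sketch below is idealized: `per_gpu_gb` is a hypothetical helper, the 560 GB figure assumes 70B weights plus gradients plus two Adam moments all in BF16, and the model-parallel split is treated as perfectly even. Sharded data parallelism is not modeled.

```python
def per_gpu_gb(state_gb, gpus, strategy):
    """Per-GPU share of the training state under two idealized strategies.

    'data'  : every GPU holds a full replica, so the share never shrinks.
    'model' : tensor/pipeline parallelism divides the state evenly
              (an idealization; real splits are not perfectly even).
    """
    if strategy == "data":
        return state_gb
    if strategy == "model":
        return state_gb / gpus
    raise ValueError(f"unknown strategy: {strategy}")


STATE_GB = 560  # assumed: 70B weights + gradients + two Adam moments, all BF16
GPU_GB = 80

for gpus in (1, 8, 64):
    dp = per_gpu_gb(STATE_GB, gpus, "data")
    mp = per_gpu_gb(STATE_GB, gpus, "model")
    verdict = "fits" if mp <= GPU_GB else "OOM"
    print(f"{gpus:>2} GPUs: data-parallel {dp:.0f} GB/GPU (OOM), "
          f"model-parallel {mp:.0f} GB/GPU ({verdict})")
```

The data-parallel column never changes however many GPUs are added, while the model-parallel column drops under the 80 GB line at eight GPUs, before activations are counted.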

11) Other failure shapes to recognize¶

Parallelism-topology mismatch. Tensor parallelism spanning nodes so every-layer all-reduce crosses slow InfiniBand (section 6). Fix: keep TP intra-node on NVLink, pipeline parallelism across nodes.
Data parallelism on a too-big model. Expecting more GPUs to fit a 70B; still OOMs. Fix: model parallelism (tensor/pipeline), not just data parallelism.
Skipping curation. Training on raw, un-deduplicated, PII-laden data; the result is a poorer model and compliance risk. Fix: run Curator (dedup, quality filter, PII redaction) first.
Full SFT when PEFT would do. Burning GPUs on full fine-tuning when LoRA fits the need cheaper and faster, and keeps the base-plus-adapter serving path open (file 06). Fix: default to PEFT for domain customization.
Framework overkill. Reaching for NeMo's heavy parallelism for a 7B LoRA HuggingFace handles in a notebook. Fix: stay on the flexible stack below the one-GPU line.
Checkpoint-format lock-in. NeMo checkpoints differ from HuggingFace; surprise at conversion time for serving or sharing. Fix: plan the convert-to-serving path (NeMo → TensorRT-LLM / HF) up front.

12) Pattern transfer¶

Same buy-vs-build shape as NIM (file 06). NeMo is to training what NIM is to serving: a packaged vendor stack that's right when the engineering is undifferentiated (large-scale distributed training) and wrong when the flexible tool fits (small models, custom loops). The "vendor framework vs raw control" decision recurs on both sides of the lifecycle.
Same interconnect pressure as NCCL collectives (file 03). Tensor-parallel all-reduce every layer is file 03's collective on the training side; placing it on the fast wire is the same fix. "Keep the collective on NVLink" recurs from serving to training.
Same fit-on-one-unit threshold as locality decisions everywhere. "Does it fit on one GPU?" decides data-vs-model parallelism the way "does the working set fit in cache/RAM?" decides locality strategy at every layer. The fit-or-split decision is a recurring systems shape.

13) Design test — five questions before you reach for NeMo¶

Does your model, with its full training state (gradients + optimizer + activations), fit on one GPU? If yes, prefer the flexible HuggingFace/PyTorch stack.
If it doesn't fit, are you using model parallelism (tensor + pipeline), not just data parallelism, to make it fit?
Is your tensor-parallel all-reduce staying on the fast intra-node wire (NVLink), with pipeline parallelism across the slower inter-node links?
Have you curated the data (dedup, quality filter, PII) before training, at the scale your corpus needs?
Is PEFT (LoRA) enough for your customization — cheaper, faster, and keeping the base-plus-adapter serving path open — or do you genuinely need full SFT?

Where this appears in production¶

The framework and its parts

NVIDIA NeMo Framework — end-to-end training/customization for LLMs and multimodal models: data curation, pretraining, SFT/PEFT, alignment; the training-side counterpart to the serving stack.
Megatron-Core — the parallelism library NeMo builds on; tensor, pipeline, and data parallelism composed for efficient large-model training across hundreds of GPUs.
NeMo Curator — GPU-accelerated, multi-node data curation (download, clean, quality-filter, exact/fuzzy dedup, PII redaction, language ID) using RAPIDS and Dask/Ray; curated the 8T+-token Nemotron-4 pretraining set.
NeMo 2.0 (PyTorch Lightning + MegatronStrategy) — the Lightning-based API that configures parallelism via tensor_model_parallel_size / pipeline_model_parallel_size instead of hand-wiring it.
PEFT / LoRA in NeMo — trains a small adapter on a frozen base; the cheap, fast default for domain customization, and the base-plus-adapter form NIM can serve (file 06).
Alignment (DPO, RLHF-style) in NeMo — preference-based alignment after SFT; teaches the model to prefer the better of two good answers.

Where it shows up

NVIDIA Nemotron models — pretrained and curated with NeMo + Curator at multi-trillion-token scale; the dogfooded proof of the stack.
Enterprises building domain LLMs — curate proprietary corpora with Curator and customize a base model with NeMo PEFT for healthcare, finance, telco domains.
HuggingFace transformers + peft + trl — the flexible counterpart; the standard comparison for small-model, single-node, fast-iterating customization.
DeepSpeed / PyTorch FSDP — the non-NVIDIA-framework way to shard a large model's training state; what teams use when they want model-scaling without NeMo's conventions.
Megatron-LM — the research training library Megatron-Core descends from; where large-scale parallelism techniques are pioneered.
NeMo → TensorRT-LLM export — a model customized in NeMo is compiled by TensorRT-LLM (file 04) for serving; the training-to-serving handoff.
Domain-adaptive pretraining (DAPT) — continued pretraining of a base model on domain data via NeMo, between generic pretraining and task SFT.
Synthetic data generation pipelines — Curator and NeMo used to generate and filter synthetic training data at scale for instruction tuning.

Pause and recall¶

Why can data parallelism never make a 70B fit when it OOMs on one GPU?
What do tensor parallelism and pipeline parallelism each cut, and what collective does tensor parallelism add?
What four stages of the model lifecycle does NeMo package?
State the NeMo-vs-HuggingFace rule in one sentence (the deciding property).
Why is PEFT (LoRA) the common default for domain customization, and how does it connect to NIM (file 06)?
Why did TP=8 across two nodes stall the GPUs, and what's the fix?
What does NeMo Curator do before any training happens?
Why is a smoothly descending loss curve a misleading signal about training speed?

Interview Q&A¶

Q1. Your HuggingFace Trainer script OOMs trying to fine-tune a 70B across eight GPUs. You turn on data-parallel multi-GPU and it still OOMs. Why, and what do you actually need? A. Data parallelism replicates the whole model on each GPU and only splits the batch, so each GPU still needs all 140 GB of weights plus gradients, optimizer states, and activations — which doesn't fit, on one GPU or eight. You need model parallelism: tensor parallelism to cut each layer's matrices across GPUs and pipeline parallelism to cut the model by depth, composed so the model itself spans the cards. NeMo (Megatron-Core) packages that. Common wrong answer to avoid: "Add more GPUs with data parallelism." Data parallelism scales throughput after the model fits; it can't make a too-big model fit.

Q2. When does NeMo earn its place over HuggingFace + PyTorch, and when is it overkill? A. NeMo earns it when the model is too big for one GPU and must be model-parallel — large-scale pretraining or full SFT of a 70B+ on NVIDIA clusters, plus trillion-token curation — because hand-rolling efficient tensor+pipeline parallelism is undifferentiated specialist toil. It's overkill below the one-GPU line: LoRA on a 7B, single-node fine-tunes, or prototyping, where HuggingFace's simpler, more familiar stack and bigger ecosystem get you to results faster. Common wrong answer to avoid: "Use the framework built for scale for everything." Model parallelism's collectives are pure overhead when the model already fits a GPU.

Q3. You moved a 70B fine-tune to NeMo, set tensor-parallel degree 8 across two nodes, and GPUs sit at low utilization with high step time. Diagnose. A. Tensor parallelism does an all-reduce every layer; at TP=8 spanning two nodes, that collective crosses slow inter-node InfiniBand instead of fast intra-node NVLink, so the GPUs stall on communication every layer instead of computing. Keep tensor parallelism within a node (TP=4 on NVLink) and use pipeline parallelism across nodes, where transfers are less frequent; in short, match the parallelism layout to the interconnect topology. Common wrong answer to avoid: "Increase the tensor-parallel degree further." The all-reduce already runs every layer; a higher TP degree adds participants to each one and leaves less compute per GPU to hide it behind, and spanning more nodes makes the stall worse, not better.

Q4. For customizing the 70B on your domain data, why default to PEFT (LoRA) over full SFT, and how does that choice interact with serving? A. LoRA freezes the 70B base and trains a small adapter (often <1% of parameters), so it fits more easily, trains far faster and cheaper, and produces a tiny swappable adapter — versus full SFT updating all 140 GB. It also keeps the base-plus-adapter serving path open: a NIM can load the LoRA adapter onto its optimized base (file 06), whereas a full-SFT custom model is more likely to need a hand-built serving stack. Common wrong answer to avoid: "Always full fine-tune for best quality." Full SFT is expensive and only needed when the task must move the base substantially; PEFT is the right default for most domain customization.

Q5. The loss curve is descending smoothly but the run is slow and will take weeks. Is this a data problem, a model problem, or an infrastructure problem? A. An infrastructure problem — the loss descending says the math is correct, but says nothing about whether the GPUs are fed. A correct run can crawl because the parallelism is stalling on collectives (mismatched to the interconnect topology). Look at per-GPU utilization and per-step communication time, not the loss curve, to find the stall. Common wrong answer to avoid: "The loss is fine, so training is fine." Loss measures convergence, not GPU efficiency; a fed and a starved run can have identical loss curves at very different speeds.
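
The "look at communication time, not the loss curve" advice can be sketched as a small helper. This assumes you already have per-step wall time and per-step time spent in collectives from whatever profiler you use; the function names, the sample timings, and the 0.5 threshold are all hypothetical choices for illustration, not measured values or a recommended cutoff.

```python
from statistics import mean


def comm_fraction(step_times_s: list[float], comm_times_s: list[float]) -> float:
    """Average share of each step's wall time spent in collectives."""
    if not step_times_s or len(step_times_s) != len(comm_times_s):
        raise ValueError("need matching, non-empty timing lists")
    return mean(c / s for s, c in zip(step_times_s, comm_times_s))


def diagnose(step_times_s: list[float], comm_times_s: list[float],
             threshold: float = 0.5) -> str:
    """Label a run by whether collectives dominate the step (threshold is arbitrary)."""
    frac = comm_fraction(step_times_s, comm_times_s)
    return "communication-bound" if frac > threshold else "not communication-bound"


# Made-up timings: the loss could look identical in both runs.
print(diagnose([2.0, 2.1, 2.0], [1.5, 1.6, 1.4]))  # collectives dominate each step
print(diagnose([0.6, 0.6, 0.7], [0.1, 0.1, 0.1]))  # mostly compute
```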

Q6. (Cumulative.) Trace our own 70B from data to served endpoint across the module. Which tool owns each stage? A. NeMo Curator curates the domain corpus (dedup, quality, PII); NeMo + Megatron-Core trains/customizes the 70B with model parallelism and PEFT, keeping collectives on the fast wire (file 03's pressure on the training side); TensorRT-LLM compiles the result into an engine with in-flight batching and paged KV (file 04); Triton serves it with dynamic batching and packing (file 05); or a NIM wraps the engine-plus-server with an OpenAI API (file 06) — and because we used LoRA, a NIM can serve our adapter on its optimized base. NeMo produces the model the rest of the module makes fast. Common wrong answer to avoid: "NeMo and TensorRT-LLM do the same thing." NeMo builds the model (training side); TensorRT-LLM makes an existing model's forward pass fast (serving side) — different ends of the lifecycle.

Design/debug exercise (10 min)¶

Step 1 — Model it. Take a 70B (~140 GB weights) on 80 GB H100s. Show why one GPU can't hold even the weights plus training state (gradients + Adam's two optimizer copies ≈ 3-4x weights → ~500 GB+). Then show tensor parallelism cutting each layer across 4 NVLink GPUs (≈ 35 GB shard each, fits) with an all-reduce every layer, and pipeline parallelism cutting depth across nodes. Write the memory per GPU under data-parallel (impossible) vs tensor-parallel-4 (fits).
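
One way to lay out the Step 1 arithmetic, using the exercise's own numbers (140 GB of weights, 80 GB per H100, training state of roughly 3x the weights). This is a sketch under simplifying assumptions: memory is split evenly across the model-parallel group and activations are ignored. The helper names are hypothetical.

```python
GPU_GB = 80        # H100 memory, from the exercise
WEIGHTS_GB = 140   # 70B weights, from the exercise


def per_gpu_gb(weights_gb: float, state_multiple: float, shards: int) -> float:
    """Weights plus training state per GPU, split evenly over `shards` GPUs.

    state_multiple is gradients + optimizer state as a multiple of the weights
    (0 means weights only). Activations are ignored in this sketch.
    """
    return weights_gb * (1 + state_multiple) / shards


def fits(gb: float, gpu_gb: float = GPU_GB) -> bool:
    """True if the per-GPU footprint is within one GPU's memory."""
    return gb <= gpu_gb


cases = {
    "data-parallel replica, full state (3x)": per_gpu_gb(WEIGHTS_GB, 3, 1),
    "TP=4, weights only": per_gpu_gb(WEIGHTS_GB, 0, 4),
    "TP=4, full state (3x)": per_gpu_gb(WEIGHTS_GB, 3, 4),
}
for name, gb in cases.items():
    print(f"{name}: {gb:.0f} GB per GPU, fits={fits(gb)}")
```

Under these assumptions the 35 GB figure covers the weight shard only. With full training state also split four ways, the per-GPU total is still above 80 GB, which is one reason Step 2 asks you to combine tensor parallelism with pipeline parallelism and to decide between PEFT and full SFT.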

Step 2 — Your turn. For our 70B domain customization, lay out a parallelism plan on 8 H100s across 2 nodes (4 NVLink GPUs each): pick tensor-parallel and pipeline-parallel degrees that keep the every-layer all-reduce intra-node, and decide PEFT vs full SFT. Argue from the interconnect (file 03) why TP must stay within a node, and from file 06 why LoRA keeps the NIM serving path open. Tie it to the rule: split the model only as much as the topology rewards.

Step 3 — Reproduce from memory. Redraw the data-parallel-vs-model-parallel diagram from section 2 (the two cuts and where the all-reduce lives) and the four-stage lifecycle from section 1. Then state in one sentence how this file connects to file 03 (tensor-parallel all-reduce is the same collective on the training side) and to file 06 (same buy-vs-build shape, training instead of serving; LoRA keeps the NIM path open).

Operational memory¶

This chapter explained why a HuggingFace Trainer that fine-tunes a 7B happily OOMs or crawls on a 70B: customizing a model too big for one GPU is a distributed-training problem — the model itself must be split across GPUs, and the splits add per-layer collectives that must stay on the fast wire. The important idea is that data parallelism scales throughput only after the model fits a GPU; making a too-big model fit at all requires model parallelism (tensor cuts the width, pipeline cuts the depth), which is the engineering NeMo packages via Megatron-Core.

You learned when to reach for NeMo: above the one-GPU line, where its packaged parallelism, GPU-accelerated curation, and SFT/PEFT/alignment recipes turn a distributed-systems build into a configured run. You also learned to default to PEFT (LoRA) for domain customization, which trains cheaply and keeps the base-plus-adapter serving path open for a NIM. That solves the opening OOM/crawl because the 70B now lives across the cards efficiently instead of failing to fit or stalling on misplaced collectives.

Carry this diagnostic forward: before choosing a training framework, ask whether the model fits on one GPU — below the line, prefer the flexible HuggingFace stack; above it, use model parallelism and match the parallelism degrees to the interconnect topology. If a NeMo run is slow with a fine loss curve, look at per-GPU utilization and per-step communication time before blaming the data or the model — a mismatched parallelism layout stalls the GPUs the framework was meant to feed.

Remember:

Data parallelism replicates the model and splits the batch — it can't make a too-big model fit; only model parallelism (tensor + pipeline) does.
NeMo packages Megatron-Core parallelism, data curation at scale, and SFT/PEFT/alignment; it earns its place above the one-GPU line and is overkill below it.
Tensor-parallel all-reduce happens every layer — keep it intra-node on NVLink, push pipeline parallelism across nodes (file 03's pressure on the training side).
Default to PEFT (LoRA) for domain customization: cheap, fast, and keeps the base-plus-adapter NIM serving path open (file 06).
A smooth loss curve says the math is right, not that the GPUs are fed — watch utilization and communication time for stalls.
Next pressure: training a 70B needs many GPUs for weeks and serving needs only some of them — so once the model exists, who decides which jobs get which GPUs, and how do you stop expensive cards sitting idle?

Bridge. NeMo trains the model across many GPUs; the serving stack runs it on some GPUs. Both assume the GPUs are there and yours to use — but in a real organization the same fleet must run training jobs that want hundreds of cards for weeks, serving jobs that want a few cards always-on, and experiments that want a slice for an hour. Who decides which job gets which GPU? What happens when a 70B reserves four cards it barely touches, or a tiny experiment holds a whole H100 at 5%? The next file is the scheduling and partitioning layer — MIG, the Kubernetes GPU operator, SLURM — that shares expensive GPUs across all of it, and confronts the cost pressure that has been lurking under the whole module: an idle GPU bills the same as a busy one. → 08-gpu-cluster-scheduling-and-mig.md