08. GPU cluster scheduling and MIG — the idle GPU bills the same as the busy one¶
~22 min read. Every file so far asked "what is the GPU waiting for, and how do we stop the wait." This one asks the question lurking under all of them: what about the GPUs waiting on nothing at all? A 70B reserves four cards it barely saturates. A tiny experiment holds a whole H100 at 5%. The fleet is "fully allocated" and half-idle, and every idle card bills the same as a busy one. This is the scheduling and partitioning layer that shares expensive GPUs across many jobs — and the cost pressure that justifies the entire module.
Built on the whole module: the same "feed the beast" invariant, now at fleet granularity. The roofline fought idle SMs; the server fought idle GPUs between requests; this file fights idle GPUs reserved by jobs that don't fill them — the slowest, most expensive idle of all. It introduces MIG (hardware partitioning of one GPU into isolated instances), the Kubernetes GPU operator / device plugin (how a cluster schedules GPUs), SLURM (the batch scheduler for training), and bin-packing under the fleet's dominant pressure, the cost of idle GPUs.
What "fully allocated" hides¶
The module's throughput climb is done. The 70B endpoint hits its target; the model is trained and served fast. Now zoom out from one endpoint to the fleet: dozens of GPUs running a mix of jobs. The dashboard says the cluster is fully allocated — every GPU is assigned to some job. The bill confirms you're paying for all of them. And yet aggregate utilization across the fleet is 40%.
The contradiction is the same shape as the module's opening, one layer up. There, eight H100s were "in use" at 38% utilization because the GPUs were waiting for data. Here, the fleet is "fully allocated" at 40% because the GPUs are reserved but not filled: a serving job that holds a card at 30%, a training job that reserved 8 GPUs and uses 6, an experiment that grabbed a whole H100 to run a notebook. Allocation is not utilization. A reserved-and-idle GPU is the most expensive waste in the stack, because — unlike an idle SM that costs nanoseconds — an idle card bills 24/7 whether or not it computes.
This file is the layer that attacks that waste: partitioning one GPU into isolated slices for jobs too small to fill it (MIG), scheduling GPUs across many jobs (the Kubernetes GPU operator and device plugin, SLURM for batch training), and packing jobs onto cards to drive utilization up (bin-packing). The pressure is no longer latency or bandwidth — it is cost.
What this file solves¶
A team's GPU fleet shows "100% allocated" but 40% utilized, and finance is asking why the GPU bill is climbing while throughput isn't. The naive read is "we need more GPUs." The real cause is that GPUs are reserved in whole-card units by jobs that don't fill a whole card, so allocation looks full while silicon sits idle and bills the same as if it were busy. This file teaches you to drive utilization up instead of buying more cards: partition a GPU into isolated slices for small jobs (MIG), schedule GPUs across the fleet with the right granularity (K8s device plugin / SLURM), and bin-pack so a card's tenants add up to its capacity — confronting the cost pressure that justifies every optimization in this module.
Why a scheduler and partitioning, and not just more GPUs¶
The instinct when the fleet looks full is to buy more cards. But the fleet isn't full of work — it's full of reservations. The GPU is allocated in coarse, whole-card units: ask for a GPU, you get an entire H100, even if your job uses 5% of it. Most jobs don't fill a card. A safety classifier (file 05) uses a few percent. A small experiment uses a sliver. A serving replica might use a third. Each holds a whole card, and the other 95%, 80%, or 67% sits idle — billing.
Two mechanisms fix this. Partitioning makes the unit of allocation smaller than a whole GPU, so a small job gets a small slice and several share a card. Scheduling decides which job lands on which GPU (or slice), packing them so a card's tenants sum to its capacity instead of one job hoarding it. Together they convert "one job per card at low utilization" into "several jobs per card at high utilization" — fewer cards for the same work, which is money.
Teacher voice. This is the same fight as every other file, at the slowest clock and the highest cost. The roofline kept SMs fed across nanosecond weight-reads; in-flight batching kept a GPU fed across millisecond tokens; the serving server kept GPUs fed across second-scale requests; the scheduler keeps the fleet fed across hour-to-week-scale jobs. Same enemy — idle hardware — but here the idle unit is a whole card billing around the clock, so the waste is the largest of all.
The fleet that's full and idle at the same time¶
A team runs a shared GPU cluster: serving replicas, training jobs, and a pile of experiments. They allocate one GPU per job because that's the default unit. The cluster fills up — every GPU has an owner — and new jobs queue, waiting for a free card. So they buy more GPUs. The new ones fill too, and utilization stays at 40%.
The visible break: nvidia-smi across the fleet shows most GPUs at single-to-low-double-digit utilization, while the scheduler reports zero free GPUs and a growing queue. Buying more cards made the bill bigger and the utilization no better — the new cards got reserved by jobs that don't fill them either. The queue and the idleness coexist: jobs wait for whole cards while allocated cards sit mostly empty.
So the real problem is not too few GPUs; it is that GPUs are reserved in whole-card units by jobs that don't fill a whole card, so allocation hits 100% long before utilization does, and buying more cards just adds more under-filled reservations. The fix is not more hardware. It is making the allocation unit match the job size, and packing jobs so cards fill up.
So how do we let a small job take a small slice of a GPU, and pack jobs so a card's tenants add up to its capacity?
When five jobs could share one card but each holds its own¶
Take the smallest case. Five small jobs — a classifier, two embedders, two notebooks — each use ~15% of an H100. Allocated one-card-each: five whole H100s, each ~85% idle, five cards billing for the work of less than one. Partition each H100 into MIG slices (say seven 1g.10gb instances) and the five jobs land on slices of a single card, which now runs at ~75% with hard isolation between tenants. Five cards become one. The other four can run other work or be switched off. Same jobs, one-fifth the silicon, one-fifth the bill for that work.
That is the whole game: when the allocation unit is a whole card and jobs are smaller than a card, you waste the difference; shrink the unit and pack, and the waste collapses.
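The arithmetic behind the five-job example can be written out as a short sketch. The ~15% per-job load is the figure from the text; the variable names are illustrative only.

```python
# Worked arithmetic for the five-job example (each job uses ~15% of one H100).
job_loads = [0.15] * 5  # fraction of a whole card each job actually uses

# Before: whole-card allocation, one card per job.
cards_before = len(job_loads)
util_before = sum(job_loads) / cards_before

# After: all five jobs packed onto MIG slices of a single card.
cards_after = 1
util_after = sum(job_loads) / cards_after

print(f"before: {cards_before} cards at {util_before:.0%} average utilization")
print(f"after:  {cards_after} card at {util_after:.0%} utilization")
print(f"cards freed: {cards_before - cards_after}")
```

The work is identical in both cases; only the allocation unit changes.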
Rule: match the allocation unit to the job, then pack the cards¶
The scheduling rule. Drive utilization, not allocation. Match the allocation unit to the job size — partition a GPU with MIG when jobs are smaller than a card and need isolation, use time-slicing or MPS when they can share softly — and schedule so a card's tenants sum to its capacity (bin-packing). A 70B that needs more than one whole GPU is the opposite case: it gets dedicated cards (file 03's tensor-parallel pair) and cannot be partitioned. The unit of waste is a reserved-and-idle card billing 24/7, so the fleet's job is to keep every card filled with work, not just assigned to an owner.
Why the rule exists. The primitive is economics: a GPU bills whether or not it computes, so idle reserved silicon is pure cost. The constraint is that the default allocation unit is a whole card, while most jobs are smaller than a card — so allocation saturates while utilization stays low, and buying more cards just multiplies the under-filled reservations. Coarse whole-card allocation breaks because it can't pack small jobs; MIG and the scheduler relieve it by shrinking the unit and packing. The new pressure they create: isolation-vs-utilization and complexity — partitioning trades flexibility for guarantees, and a misconfigured scheduler can starve or thrash the very cards it was meant to fill.
1) MIG — cutting one GPU into isolated GPUs¶
MIG (Multi-Instance GPU) partitions a single physical GPU into up to seven hardware-isolated GPU instances, each with its own dedicated SMs, L2 cache slices, memory channels, and fault domain. This is not time-sharing — it's a hardware split. A job on one MIG instance physically cannot touch the SMs or memory of another, so there's no noisy-neighbor interference and no shared failure: one instance crashing doesn't take down the others.
WHOLE-CARD ALLOCATION
─────────────────────
H100 [ job A: 15% used ]
     [ 85% idle        ] ← billing
one job owns the whole card
the rest bills for nothing

MIG-PARTITIONED H100 (80GB, 7 slices)
─────────────────────────────────────
H100 ┌──────┬──────┬──────┬──────┐
     │1g.10 │1g.10 │1g.10 │1g.10 │
     │ jobA │ jobB │ jobC │ jobD │
     ├──────┼──────┼──────┴──────┤
     │1g.10 │1g.10 │ (or 2g.20)  │
     │ jobE │ jobF │    jobG     │
     └──────┴──────┴─────────────┘
hard isolated: own SMs, cache, mem, fault
7 jobs, hard-isolated, card ~full
H100 profiles are fixed slices — 1g.10gb (one-seventh compute, 10 GB), 2g.20gb, 3g.40gb, up to 7g.80gb (the whole card) — and you choose a layout that matches your jobs. The isolation is MIG's defining property and its defining limit: instances are hard-partitioned, so a 1g.10gb instance gets exactly one-seventh of the compute and cannot borrow idle capacity from a quiet neighbor. You trade the ability to dynamically share for strong guarantees — predictable performance and security between tenants, which is exactly what a multi-tenant or regulated environment needs.
For the running example, MIG is for the small models, not the 70B: the safety classifier and embedder from file 05 each fit comfortably in a 1g.10gb slice with hard isolation, so several share one card instead of each hoarding one. The 70B does the opposite — it needs more than a whole card, so it spans dedicated GPUs and MIG is irrelevant to it.
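Choosing a layout amounts to spending a fixed budget of slices. Below is a minimal sketch of that budget check, assuming a simplified model of an 80 GB H100 with 7 compute slices and 80 GB of memory. Real MIG placement has additional rules that this ignores, so treat a passing result as necessary, not sufficient. `PROFILES` and `layout_fits` are hypothetical names, not part of any NVIDIA tool.

```python
# Simplified slice budget for one 80 GB H100 (assumption: 7 compute slices, 80 GB).
# Profile name -> (compute slices, memory in GB), read directly off the profile name.
PROFILES = {
    "1g.10gb": (1, 10),
    "2g.20gb": (2, 20),
    "3g.40gb": (3, 40),
    "7g.80gb": (7, 80),
}
MAX_COMPUTE_SLICES = 7
MAX_MEMORY_GB = 80


def layout_fits(layout):
    """Return True if the profiles in `layout` stay within the card's budget."""
    compute = sum(PROFILES[name][0] for name in layout)
    memory = sum(PROFILES[name][1] for name in layout)
    return compute <= MAX_COMPUTE_SLICES and memory <= MAX_MEMORY_GB


print(layout_fits(["1g.10gb"] * 7))                              # seven small slices
print(layout_fits(["3g.40gb", "2g.20gb", "1g.10gb", "1g.10gb"]))  # mixed layout
print(layout_fits(["3g.40gb", "3g.40gb", "1g.10gb"]))             # memory over budget
```

The last layout uses all 7 compute slices but would need 90 GB, which shows that memory, not only compute, bounds what can be composed on one card.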
2) The picture — MIG vs time-slicing vs MPS, the three ways to share¶
MIG isn't the only way to share a GPU. The mental model that lands this file is the three sharing modes and what each trades:
THREE WAYS TO SHARE ONE GPU
| | MIG (hardware partition) | Time-slicing (round-robin) | MPS (concurrent, soft) |
|---|---|---|---|
| How jobs share | A, B, C each on dedicated SMs/mem | A and B take turns in time on the full GPU | A + B run together and share SMs |
| Isolation | HARD isolation | NO isolation | NO memory isolation |
| Properties | fixed slices, no overcommit | overcommit, context-switch cost | concurrent, low overhead |
| Fits | predictable, secure | dev clusters, bursty light work | cooperative trusted jobs |
| In a phrase | "guaranteed slice" | "share by taking turns" | "run at the same time" |
- MIG hard-partitions the silicon: dedicated resources, full isolation, fixed slices, no overcommit. Use it for production multi-tenant isolation and predictable performance.
- Time-slicing lets jobs take turns on the full GPU, round-robin. No memory or fault isolation, and a context-switch cost, but it lets many jobs oversubscribe a GPU — good for dev clusters and bursty, light work where strict isolation doesn't matter.
- MPS (Multi-Process Service) runs multiple processes' kernels concurrently on the GPU's SMs with low overhead, but no memory isolation — good for cooperative, trusted jobs that want to genuinely overlap.
The choice is isolation-vs-flexibility: MIG gives guarantees and gives up dynamic sharing; time-slicing and MPS give flexible oversubscription and give up isolation. Production multi-tenant serving leans MIG; experimentation clusters lean time-slicing.
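The isolation-vs-flexibility choice can be sketched as a small decision function. This is an illustration of the rule of thumb above, not a policy any scheduler ships with. The `Job` fields and `sharing_mode` are hypothetical names, and burstiness is deliberately left out of this sketch.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    needs_isolation: bool  # tenants must not be able to interfere with each other
    trusted_peers: bool    # co-tenants are cooperative processes you control


def sharing_mode(job):
    """Pick one of the three sharing modes from the job's isolation needs."""
    if job.needs_isolation:
        return "MIG"           # guarantees, but no borrowing of idle capacity
    if job.trusted_peers:
        return "MPS"           # concurrent kernels, no memory isolation
    return "time-slicing"      # take turns, oversubscription allowed


for job in [
    Job("tenant-classifier", needs_isolation=True, trusted_peers=False),
    Job("batch-embedder-pair", needs_isolation=False, trusted_peers=True),
    Job("dev-notebook", needs_isolation=False, trusted_peers=False),
]:
    print(f"{job.name}: {sharing_mode(job)}")
```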
3) The fleet scheduled — the running example becomes a cluster¶
Our endpoint is no longer one thing; it's part of a fleet. The 70B serving stack (files 04-05) wants dedicated tensor-parallel GPU pairs, always on. The small models (embedder, classifier) want MIG slices, packed several per card. A nightly NeMo customization run (file 07) wants a big batch allocation for a few hours. Experiments want slivers for an hour each. One fleet, wildly different job shapes.
Two schedulers split the work by job type. Kubernetes, with the NVIDIA GPU operator, schedules the serving and inference side: the operator installs the device plugin (which advertises GPUs and MIG slices to the cluster as schedulable resources like nvidia.com/gpu or nvidia.com/mig-1g.10gb), plus driver, monitoring, and feature-discovery components. A pod requests a whole GPU or a MIG slice, and Kubernetes bin-packs pods onto nodes. SLURM schedules the batch training side: the NeMo run submits as a SLURM job requesting N GPUs for a time budget, queues, runs when the cards free up, and releases them — the classic HPC batch model that fits long, fixed-size training jobs better than a long-running pod.
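On the Kubernetes side, a pod asks for a MIG slice the same way it asks for any extended resource: as a limit on the container. The sketch below builds such a manifest as a Python dict and prints it as JSON, which `kubectl apply -f` accepts. The resource name `nvidia.com/mig-1g.10gb` comes from the text; the pod name and image are placeholders, and the exact resource names a cluster exposes depend on how its device plugin is configured.

```python
import json


def mig_pod_manifest(name, image, mig_resource="nvidia.com/mig-1g.10gb"):
    """Build a minimal Pod manifest that requests one MIG slice."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "containers": [
                {
                    "name": name,
                    "image": image,
                    # Extended resources are requested under limits, in whole units.
                    "resources": {"limits": {mig_resource: 1}},
                }
            ]
        },
    }


# Placeholder name and image, for illustration only.
manifest = mig_pod_manifest("embedder", "registry.example.com/embedder:latest")
print(json.dumps(manifest, indent=2))
```

Requesting `nvidia.com/gpu` instead would reserve a whole card, which is the default this file argues against for sub-card jobs.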
The packing decision threads back through the whole module: the 70B can't be partitioned (file 03's tensor-parallel pair needs whole cards on NVLink), the small models should be packed onto MIG slices (file 05's concurrent-execution insight, now at the hardware-partition level), and the training run is a periodic burst (file 07) that SLURM time-shares against the always-on serving. The fleet's utilization is the sum of getting all three placements right.
Mini-FAQ. "Why Kubernetes for serving but SLURM for training?" They fit different job shapes. Serving is long-running services that scale up and down with traffic, need rolling updates and health checks, and coexist with the rest of the microservice fleet — Kubernetes' world. Training is a fixed-size, long, batch job that wants to queue for a big GPU allocation, run to completion, and release — HPC's world, where SLURM's gang-scheduling and queueing fit better. Many orgs run both, or run training on Kubernetes too (with Volcano/Kueue for gang-scheduling), but the batch-vs-service split is why SLURM persists for training.
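The queue-run-release shape of a SLURM job shows up directly in its batch script: a GPU count and a time budget are declared up front. The helper below renders such a script as text, as a sketch only. `--job-name`, `--nodes`, `--gres`, and `--time` are standard `sbatch` options, while the job name, GPU count, and training command are made-up values. A single-node job is assumed.

```python
def sbatch_script(job_name, gpus, hours, command):
    """Render a single-node SLURM batch script requesting `gpus` GPUs for `hours` hours."""
    lines = [
        "#!/bin/bash",
        f"#SBATCH --job-name={job_name}",
        "#SBATCH --nodes=1",
        f"#SBATCH --gres=gpu:{gpus}",       # GPUs per node
        f"#SBATCH --time={hours:02d}:00:00",  # wall-clock budget; cards are released after
        "",
        f"srun {command}",
        "",
    ]
    return "\n".join(lines)


# Hypothetical nightly run: 8 GPUs for up to 4 hours.
script = sbatch_script("nemo-nightly", gpus=8, hours=4, command="python train.py")
print(script)
```

The time limit is what lets the scheduler plan around the burst: it knows when the cards come back.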
4) Why partition-and-pack and not just buy more GPUs?¶
The plausible alternative is the one finance keeps hearing: the cluster is full, buy more cards. Why partition and schedule instead?
Because under our workload the cluster isn't full of work — it's full of under-filled reservations, and more cards just add more of them. Most jobs are smaller than a whole GPU, so whole-card allocation strands the difference; buying hardware multiplies the strand. Partitioning matches the unit to the job (a 1g.10gb slice for a job that uses a seventh of a card) and scheduling packs the slices so a card's tenants sum to capacity — turning a 40%-utilized fleet into an 80%-utilized one with the same cards. The same work then fits on half as many cards, which halves the effective GPU cost per unit of work without buying anything. The honest limit: this only helps jobs smaller than a card; a job that needs a whole card or more (the 70B) can't be packed and genuinely needs its own cards. So the real decision is per-job: pack the small, dedicate the large, and buy only when the packed fleet is genuinely saturated.
Why this instead of buying, under our workload? Our fleet is a mix dominated by jobs smaller than a card, reserved one-per-card at 40% utilization. Partitioning and bin-packing drive that toward 80% on the same hardware — the cheapest throughput in the module. Buying more cards before packing just adds idle silicon that bills 24/7.
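The 40%-to-80% claim is simple arithmetic, sketched below. The 40-card fleet size is a made-up number for illustration; the utilization figures are the ones from the text. Percentages are kept as integers so the ceiling division is exact.

```python
def cards_at_target(cards, current_util_pct, target_util_pct):
    """Cards needed to carry the same work at a higher average utilization."""
    work = cards * current_util_pct          # useful work, in card-percent units
    return -(-work // target_util_pct)       # integer ceiling division


fleet_cards = 40  # hypothetical fleet size
packed_cards = cards_at_target(fleet_cards, current_util_pct=40, target_util_pct=80)

print(f"cards today:        {fleet_cards} at 40% utilization")
print(f"cards after packing: {packed_cards} at 80% utilization")
print(f"cards freed:         {fleet_cards - packed_cards}")
```

The freed cards can absorb the queue that previously looked like a reason to buy hardware.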
5) The property that decides the strategy: job size relative to a card¶
The one dimension that decides the sharing strategy is how big each job is relative to a whole GPU. Below a card, partition and pack; at or above a card, dedicate.
| Job | Size vs one GPU | Strategy |
|---|---|---|
| 70B serving (TP-pair) | needs >1 whole GPU | dedicated cards on NVLink; cannot partition |
| Serving replica (mid-size model) | ~30-50% of a card | MIG 3g.40gb, or share via MPS if cooperative |
| Embedder / classifier | ~5-15% of a card | MIG 1g.10gb, several per card, hard-isolated |
| Dev notebook / experiment | a sliver, bursty | time-slicing (oversubscribe), no isolation needed |
| Nightly training burst | N whole GPUs for hours | SLURM batch allocation, released after |
The asymmetry to remember: partitioning and packing only help jobs smaller than a card — they're powerless on a 70B that needs more than one. So a fleet's utilization strategy is bimodal: aggressively pack everything below the card line, dedicate whole cards to everything above it, and never try to slice a model that doesn't fit one GPU to begin with. Getting this split right is the difference between a 40%- and an 80%-utilized fleet.
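Packing sub-card jobs onto cards is a bin-packing problem. The sketch below uses first-fit decreasing, a standard textbook heuristic chosen here for illustration; the text does not say which algorithm any particular scheduler uses. Job sizes are expressed in compute slices out of 7, memory is ignored, and the job names are hypothetical.

```python
def first_fit_decreasing(job_slices, slices_per_card=7):
    """Pack jobs (name -> compute slices) onto as few cards as the heuristic finds."""
    cards = []  # each card is a list of (job name, slices)
    ordered = sorted(job_slices.items(), key=lambda item: item[1], reverse=True)
    for name, size in ordered:
        if size > slices_per_card:
            # Above the card line: this job needs dedicated cards, not packing.
            raise ValueError(f"{name} does not fit on one card")
        for card in cards:
            if sum(s for _, s in card) + size <= slices_per_card:
                card.append((name, size))
                break
        else:
            cards.append([(name, size)])  # no existing card had room; open a new one
    return cards


jobs = {
    "replica-a": 3, "replica-b": 2,
    "embedder-1": 1, "embedder-2": 1, "embedder-3": 1, "classifier": 1,
}
packed = first_fit_decreasing(jobs)
for index, card in enumerate(packed):
    used = sum(s for _, s in card)
    print(f"card {index}: {used}/7 slices -> {[name for name, _ in card]}")
print(f"{len(jobs)} jobs on {len(packed)} cards instead of {len(jobs)}")
```

Six one-card-each reservations collapse to two cards here, and the `ValueError` branch is the bimodal rule in code: anything larger than a card is dedicated, never sliced.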
6) The failure walked through: the MIG slice that starved a bursty job¶
A team MIG-partitions their inference cards into 1g.10gb slices to pack small models, and puts a bursty request-handler — quiet most of the time, occasional heavy spikes — onto one slice. Most of the time it's fine. During spikes, that job's latency blows up while six neighboring slices sit nearly idle on the same card. The team is confused: there's clearly free compute right next door.
Trace it. MIG is a hard partition — the 1g.10gb slice gets exactly one-seventh of the card's compute and cannot borrow idle capacity from quiet neighbors, even on the same physical GPU. That's the whole point of MIG (isolation, predictability), but it's the wrong tool for a bursty job that needs to occasionally use more than its slice. The fix was to move the bursty job off MIG onto a time-sliced or MPS-shared GPU where it can use spare capacity when neighbors are quiet, and keep MIG for the steady, predictable small models that genuinely fit in a slice. The lesson: MIG's hard isolation is a guarantee and a ceiling — it gives predictable performance by forbidding the borrowing that a bursty workload depends on. Match the sharing mode to the workload's burstiness, not just its average size.
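A toy queue model makes the ceiling visible. Work arrives each tick, measured in slices of compute; whatever exceeds the capacity available that tick piles up as backlog. All numbers are invented for illustration, and the "shared" case simply assumes 5 of the card's 7 slices are free when the spike hits. It is not a model of how time-slicing or MPS actually schedule kernels.

```python
def max_backlog(demand, capacity_per_tick):
    """Largest queue of unfinished work seen while serving `demand` tick by tick."""
    backlog = 0.0
    worst = 0.0
    for work in demand:
        backlog = max(0.0, backlog + work - capacity_per_tick)
        worst = max(worst, backlog)
    return worst


# Quiet, then a three-tick spike needing four slices' worth of compute, then quiet.
demand = [0.2] * 5 + [4.0] * 3 + [0.2] * 5

hard_slice = max_backlog(demand, capacity_per_tick=1.0)  # 1g.10gb: one slice, no borrowing
shared = max_backlog(demand, capacity_per_tick=5.0)      # assumed spare capacity on a shared card

print(f"worst backlog on a hard MIG slice: {hard_slice}")
print(f"worst backlog with borrowable capacity: {shared}")
```

The average load is well under one slice in both cases; only the peak differs, which is why average size alone is the wrong sizing signal for a bursty job.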
7) Cost movement: what scheduling buys and what it costs¶
- What it fixes: raises fleet utilization by matching the allocation unit to the job (MIG slices for small jobs) and packing cards (bin-packing), so the same hardware does more work — fewer cards for the same workload, the largest cost lever in the module.
- What it costs: complexity (the GPU operator's components, MIG layout planning, scheduler tuning, the K8s-vs-SLURM split) and the isolation-vs-flexibility trade (MIG's hard slices can't borrow; time-slicing/MPS share but don't isolate). A misconfigured scheduler can starve jobs or thrash a card, idling what it meant to fill.
- Which subsystem pays: the platform/infra team owns the scheduler config, the MIG layouts, and the placement policy; the reward lands directly in the GPU bill (utilization up means cards down) and as shorter queues. The new pressure: utilization and isolation pull against each other — pack too hard with the wrong mode and you get noisy neighbors or starved bursts; isolate too hard (all MIG) and you strand the spare capacity bursty jobs need.
For the running example: MIG-packing the small models and SLURM-scheduling the nightly training against the always-on 70B serving turns a 40%-utilized fleet toward 80% on the same cards — roughly halving the effective GPU cost for that work — at the price of owning the scheduler, the MIG layouts, and the per-job placement decisions.
8) Signals: healthy, first to degrade, and the liar¶
- Healthy: fleet utilization (not allocation) is high — most cards doing real work, not just owned; MIG slices filled by jobs sized to them; the scheduler queue short; bin-packing keeping nodes' GPU requests near capacity without thrash.
- First metric to degrade: the gap between allocation and utilization widens — cards 100% allocated but average utilization sliding — which means jobs are reserving more than they use, or small jobs are still grabbing whole cards. Cost-per-useful-token climbs before any queue or latency alarm fires.
- The misleading metric: allocation percentage — "the cluster is 100% allocated" looks healthy and is the metric that tempts you to buy more GPUs, while it says nothing about whether the cards are doing work. A fully-allocated, half-utilized fleet is the exact failure this file fixes, and allocation is the liar that hides it.
- The graph an expert opens first: per-GPU (and per-MIG-instance) utilization across the fleet next to allocation — a sea of allocated-but-idle cards means whole-card allocation of sub-card jobs (partition and pack); a few cards thrashing means over-aggressive time-slicing (back off oversubscription or move to MIG). This is the fleet-scale version of the module's opening "38% utilization" graph.
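The allocation-next-to-utilization view can be computed from per-GPU samples. The sketch below uses a hand-written sample list; in practice the numbers would come from whatever monitoring the fleet already exports. The field names and the 20% idle threshold are assumptions made for this example.

```python
def fleet_report(gpus, idle_threshold_pct=20):
    """Return (allocation %, average utilization %, ids of allocated-but-idle GPUs)."""
    allocated = [g for g in gpus if g["allocated"]]
    allocation_pct = 100 * len(allocated) / len(gpus)
    utilization_pct = sum(g["util_pct"] for g in gpus) / len(gpus)
    idle_but_allocated = [g["id"] for g in allocated if g["util_pct"] < idle_threshold_pct]
    return allocation_pct, utilization_pct, idle_but_allocated


# Hypothetical four-GPU sample: every card has an owner, half are nearly idle.
sample = [
    {"id": "gpu-0", "allocated": True, "util_pct": 12},
    {"id": "gpu-1", "allocated": True, "util_pct": 8},
    {"id": "gpu-2", "allocated": True, "util_pct": 90},
    {"id": "gpu-3", "allocated": True, "util_pct": 50},
]

allocation_pct, utilization_pct, idle_ids = fleet_report(sample)
print(f"allocation:  {allocation_pct:.0f}%")
print(f"utilization: {utilization_pct:.0f}%")
print(f"allocated but idle: {idle_ids}")
```

The list of allocated-but-idle cards is the actionable part: those are the candidates for partitioning and packing.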
9) Boundary: where partitioning shines and where it doesn't¶
MIG and bin-packing shine for fleets dominated by jobs smaller than a card with a need for isolation — multi-tenant inference, many small models, regulated environments where tenants must not interfere. There, partitioning collapses many under-filled cards into a few well-packed ones and hard isolation gives predictable, secure performance.
It becomes pathological when applied to the wrong shape. MIG on a bursty job strands the spare capacity the burst needs (section 6). MIG on a model that needs a whole card or more (the 70B) is impossible: partitioning only produces instances that are at most one card in size, so it cannot help a model that already fills or exceeds a card. Time-slicing on latency-sensitive production work adds context-switch jitter and noisy neighbors. And over-aggressive packing of any kind thrashes a card. The scale limit that invalidates naive intuition: "partition everything for max utilization" is wrong — large jobs need whole cards, bursty jobs need borrowable capacity, and isolation-sensitive jobs need MIG's guarantees; the right fleet uses different sharing modes for different job shapes, not one mode everywhere.
10) Wrong model: "the cluster is fully allocated, so we need more GPUs"¶
The seductive wrong idea — the one finance and ops both reach for — is that a fully-allocated cluster with a job queue means you're out of GPUs and must buy more. Allocation is not utilization. A fully-allocated fleet can be half-idle because jobs reserve whole cards they don't fill, and buying more cards just adds more under-filled reservations at full price.
Replace it with: the unit of waste is a reserved-and-idle card, and the fix is to raise utilization, not allocation — match the allocation unit to the job size (MIG slices for sub-card jobs), pack cards so tenants sum to capacity, and dedicate whole cards only to jobs that need them. Buy more GPUs only when the packed fleet is genuinely saturated. The allocation number is the liar; the utilization-next-to-allocation graph is the truth.
11) Other failure shapes to recognize¶
- Allocation-utilization gap. Cluster 100% allocated, 40% utilized; the buy-more-GPUs trap. Fix: partition sub-card jobs onto MIG slices and bin-pack; raise utilization, not allocation.
- MIG on a bursty job. Hard slice can't borrow idle neighbor capacity during spikes (section 6). Fix: time-slicing/MPS for bursty work; MIG for steady, predictable jobs.
- Slicing a too-big model. Trying to MIG-partition for a 70B that needs more than a card. Fix: dedicate whole cards on NVLink (file 03); MIG is for sub-card jobs only.
- Time-slicing latency-sensitive serving. Context-switch jitter and noisy neighbors hurt P99. Fix: MIG (hard isolation) or dedicated cards for latency-critical work.
- Wrong scheduler for the shape. Long batch training as a never-ending pod, or always-on serving as a SLURM job. Fix: SLURM for batch training, Kubernetes for long-running services.
- Device-plugin/MIG-strategy misconfig. The device plugin advertises only whole cards, so the cluster never sees MIG slices as schedulable resources. Fix: set the GPU operator's MIG strategy and device plugin to expose the slices.
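The last failure shape in the list is easy to detect from a node's advertised resources. The sketch below inspects a dict shaped like the `status.allocatable` map of a Kubernetes Node object and lists any MIG resources in it. Both sample maps are invented, and the helper name is hypothetical.

```python
def advertised_mig_resources(allocatable):
    """Return the MIG resource names present in a node's allocatable map."""
    return sorted(key for key in allocatable if key.startswith("nvidia.com/mig-"))


# Invented examples of what a node might report.
whole_cards_only = {"cpu": "64", "memory": "512Gi", "nvidia.com/gpu": "8"}
slices_exposed = {"cpu": "64", "memory": "512Gi", "nvidia.com/mig-1g.10gb": "56"}

for label, allocatable in [("node-a", whole_cards_only), ("node-b", slices_exposed)]:
    mig = advertised_mig_resources(allocatable)
    status = f"MIG slices schedulable: {mig}" if mig else "no MIG resources advertised"
    print(f"{label}: {status}")
```

If a node that is supposed to be partitioned reports nothing under the `nvidia.com/mig-` prefix, pods requesting a slice will stay pending no matter how the cards are physically configured.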
12) Pattern transfer¶
- Same idle-hardware fight as the whole module, slowest clock, highest cost. The roofline kept SMs fed (nanoseconds); the serving server kept GPUs fed across requests (seconds); the scheduler keeps the fleet fed across jobs (hours-to-weeks). The idle unit grows from an SM to a whole 24/7-billing card — the same "don't let the hardware idle" invariant, now where the money is.
- Same bin-packing shape as OS/VM scheduling. Packing GPU jobs onto cards is the OS packing processes onto cores and the hypervisor packing VMs onto hosts — match the request to the resource, oversubscribe carefully, keep the resource busy without thrashing. The scheduling tradeoff recurs from cores to cards.
- Same isolation-vs-utilization tension as multi-tenancy everywhere. MIG's hard partition vs time-slicing's soft share is the same trade as VMs vs containers, or dedicated vs shared database instances: stronger isolation costs flexible capacity-sharing. "Guarantees cost the ability to borrow" recurs at every shared-resource layer.
13) Design test — five questions before you size a GPU fleet¶
- Are you measuring utilization, not just allocation — and is the gap between them small?
- For each job smaller than a card, is it on a right-sized MIG slice (or a soft-shared GPU), not hoarding a whole card?
- For bursty jobs, are you using a sharing mode that can borrow idle capacity (time-slicing/MPS) rather than a hard MIG slice that strands the burst?
- Are whole cards dedicated only to jobs that need a full card or more (the 70B's tensor-parallel pair), never partitioned?
- Is batch training on SLURM (queue-and-release) and long-running serving on Kubernetes (services), with the device plugin actually advertising MIG slices?
Where this appears in production¶
The mechanisms and tools
- MIG (Multi-Instance GPU) — hardware-partitions an A100/H100 into up to 7 isolated GPU instances with dedicated SMs, cache, memory, and fault domains; predictable performance and security for multi-tenant sub-card jobs.
- MIG profiles (1g.10gb, 2g.20gb, 3g.40gb, 7g.80gb) — the fixed slice sizes you compose into a layout matching your jobs; the H100 supports up to seven 1g.10gb instances.
- GPU instances (GI) and compute instances (CI) — the MIG hierarchy: a GI is the hard-isolated partition; a CI subdivides a GI with soft isolation sharing the GI's memory.
- NVIDIA GPU operator — installs and manages the GPU stack on every Kubernetes node (driver, device plugin, monitoring, feature discovery, MIG manager); automates what would be manual per-node setup.
- Kubernetes device plugin — advertises GPUs and MIG slices to the cluster as schedulable resources (nvidia.com/gpu, nvidia.com/mig-1g.10gb) so pods can request them.
- Time-slicing — oversubscribes a GPU by round-robin turns; no isolation, the dev-cluster default for bursty light work.
- MPS (Multi-Process Service) — runs multiple processes' kernels concurrently on the SMs with low overhead; cooperative, trusted sharing without memory isolation.
- SLURM — the HPC batch scheduler for training: queue an N-GPU job for a time budget, gang-schedule it, run, release; fits fixed-size long training runs.
Where it shows up
- Cloud GPU instances with MIG (AWS EKS, GCP, Crusoe, Scaleway) — offer MIG-backed slices so multiple pods share one physical card at lower cost per pod.
- Multi-tenant inference platforms — MIG-isolate tenants on shared cards for predictable performance and security, packing many small models per GPU.
- NVIDIA DGX / SuperPOD clusters — run SLURM for large training jobs (NeMo pretraining, file 07) requesting hundreds of GPUs.
- Kubernetes AI platforms (Run:ai, Volcano, Kueue, KAI Scheduler) — add gang-scheduling and queueing on top of Kubernetes so training jobs get all-or-nothing GPU allocations and fair-share queues.
- Dynamic Resource Allocation (DRA, GA in recent Kubernetes) — finer-grained GPU partitioning and sharing in Kubernetes, evolving beyond the original device-plugin model.
- Dev/notebook clusters — time-slice GPUs so many data scientists share cards for bursty interactive work without dedicating a card each.
- Cost-optimization / FinOps teams — track utilization-vs-allocation to catch the fully-allocated-but-idle fleet and justify packing over buying.
- Vision/embedding serving fleets — MIG-pack many small models per card, the hardware-partition version of file 05's concurrent execution.
Pause and recall¶
- Why does a fleet hit 100% allocation long before 100% utilization?
- Why doesn't buying more GPUs fix a fully-allocated, half-utilized fleet?
- What does MIG partition, and what makes its isolation "hard"?
- Name the three ways to share one GPU and the key thing each trades.
- Why can't a bursty job borrow idle neighbor capacity on a MIG slice, and what should it use instead?
- Why is the 70B excluded from MIG, while the embedder is a good fit?
- Why Kubernetes for serving but SLURM for training?
- Which metric is the "liar" that tempts teams to buy more GPUs?
Interview Q&A¶
Q1. Your GPU cluster reports 100% allocated with a job queue, but fleet utilization is 40%. Finance wants to buy more GPUs. What do you do?
A. Don't buy yet — allocation isn't utilization. The fleet is full of whole-card reservations by jobs smaller than a card, so allocation saturates while silicon idles, and more cards just add more under-filled reservations at full price. Partition the sub-card jobs onto right-sized MIG slices, bin-pack so a card's tenants sum to capacity, and dedicate whole cards only to jobs that need them — driving utilization up on the same hardware. Buy only once the packed fleet is genuinely saturated.
Common wrong answer to avoid: "100% allocated means we're out of GPUs, buy more." Allocation hides idle silicon; the fix is raising utilization, not adding cards.
Q2. What does MIG give you that time-slicing doesn't, and when would you choose time-slicing anyway?
A. MIG hard-partitions the GPU — dedicated SMs, cache, memory, and fault domain per instance — giving isolation, predictable performance, and security with no noisy neighbors. Time-slicing has no isolation (jobs take turns on the full GPU) but allows oversubscription and lets a job use spare capacity when neighbors are quiet. Choose time-slicing for bursty, light, dev-cluster work where isolation doesn't matter and you want flexible sharing; choose MIG for production multi-tenant isolation and predictable performance.
Common wrong answer to avoid: "MIG is strictly better, always use it." MIG's hard slices can't borrow idle capacity — wrong for bursty jobs that occasionally need more than their slice.
Q3. You put a bursty handler on a 1g.10gb MIG slice; during spikes its latency explodes while six neighbor slices sit idle. Why, and what's the fix?
A. MIG is a hard partition: the slice gets exactly one-seventh of the card and physically cannot borrow idle capacity from quiet neighbors — that's the isolation guarantee, and the ceiling. A bursty job needs to occasionally exceed its slice, which MIG forbids. Move it to a time-sliced or MPS-shared GPU where it can use spare capacity when neighbors are quiet, and keep MIG for steady, predictable small models.
Common wrong answer to avoid: "Give it a bigger MIG slice." A bigger fixed slice still can't borrow and wastes the extra when the job is quiet — the issue is hard partitioning, not slice size.
Q4. Why run Kubernetes for serving and SLURM for training instead of one scheduler for both?
A. They fit different job shapes. Serving is long-running services that scale with traffic, need rolling updates and health checks, and live with the microservice fleet — Kubernetes' model. Training is a fixed-size, long batch job that wants to queue for a large all-or-nothing GPU allocation, run, and release — SLURM's HPC gang-scheduling model. You can run training on Kubernetes with Volcano/Kueue for gang-scheduling, but the batch-vs-service distinction is why SLURM persists for training.
Common wrong answer to avoid: "Kubernetes does everything, drop SLURM." Without gang-scheduling, K8s can deadlock partial training allocations; the batch model fits training better unless you add it.
Q5. Which jobs in the 70B endpoint's fleet should be MIG-packed, and which must not be?
A. Pack the small models — embedder (~10%), safety classifier (~5%) — onto MIG slices several per card, hard-isolated; they fit a 1g.10gb slice and waste a whole card otherwise. The 70B must not be partitioned: it needs more than one whole GPU and spans a dedicated tensor-parallel pair on NVLink (file 03). MIG only helps jobs smaller than a card; it's powerless and irrelevant for a model larger than a whole GPU.
Common wrong answer to avoid: "MIG the whole fleet for max utilization." You can't fit a model bigger than a card into a slice, and bursty jobs need borrowable capacity that MIG denies.
Q6. (Cumulative.) Connect this file's "40% utilization" to the module's opening "38% utilization." Same problem or different?
A. Same invariant, idle hardware, at different scales. The opening was idle SMs within a card because the GPU waited on memory (the roofline). This file is idle cards within a fleet because jobs reserve whole cards they don't fill. The fix differs by scale: there it was batching, fusion, and compilation to feed the SMs; here it's partitioning and bin-packing to feed the cards. Every file fought "feed the beast"; this one fights it where an idle unit bills 24/7, so the waste (and the cost lever) is the largest in the module.
Common wrong answer to avoid: "Different problems entirely." They're the same idle-hardware invariant at different scales; recognizing that is the module's whole point.
Design/debug exercise (10 min)¶
Step 1 — Model it. Take five small jobs each using ~15% of an H100, allocated one-card-each: 5 cards, each ~85% idle, so ~85% of five cards' cost is wasted. Now MIG-partition one H100 into seven 1g.10gb slices and pack all five jobs onto it: 1 card at ~75% utilization, 4 cards freed. Write the before/after card count and the utilization, and note that the work didn't change; only the allocation unit did.
Step 2 — Your turn. For the full 70B endpoint fleet (70B serving on TP-pairs, embedder ~10%, classifier ~5%, a nightly NeMo training burst, and a handful of bursty experiments), assign each job a sharing strategy: dedicated cards, MIG slice, time-slicing, or SLURM batch. Justify each from job size and burstiness, and explain why the 70B is excluded from MIG and why the bursty experiments should not be on MIG. Estimate the fleet utilization before (one-card-each) and after (packed).
Step 3 — Reproduce from memory. Redraw the three-ways-to-share diagram from section 2 (MIG vs time-slicing vs MPS and what each trades) and the whole-card-vs-MIG-partition diagram from section 1. Then state in one sentence how this file connects to file 05 (MIG-packing small models is concurrent execution at the hardware-partition level) and to the module opening (same idle-hardware invariant, now at fleet scale where an idle card bills 24/7).
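The Step 1 arithmetic can be checked with a short sketch. The helper names are hypothetical, loads are whole-card percentages, and the packing rule assumes every job fits in one slice. That holds only approximately here, since a one-seventh slice is about 14.3% of the card's compute and the jobs are stated as ~15%.

```python
import math


def one_card_each(job_loads_pct):
    """Each job reserves a whole card; return (cards used, mean utilization %)."""
    cards = len(job_loads_pct)
    return cards, sum(job_loads_pct) / cards


def mig_packed(job_loads_pct, slices_per_card=7):
    """One job per slice, slices filled card by card.
    Assumes every job fits inside a single slice."""
    cards = math.ceil(len(job_loads_pct) / slices_per_card)
    return cards, sum(job_loads_pct) / cards


loads = [15] * 5  # five jobs, each ~15% of an H100

before_cards, before_util = one_card_each(loads)
after_cards, after_util = mig_packed(loads)
freed = before_cards - after_cards

print(f"before: {before_cards} cards at {before_util:.0f}% utilization")
print(f"after:  {after_cards} card(s) at {after_util:.0f}% utilization")
print(f"cards freed: {freed}")
```

The total work, `sum(loads)`, is identical in both calls. Only the divisor changes, which is the point of the exercise: the allocation unit moved, not the workload.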
Operational memory¶
This chapter explained why a fleet can report 100% allocation while sitting at 40% utilization: GPUs are reserved in whole-card units by jobs that don't fill a whole card, so allocation saturates long before utilization, and buying more cards just multiplies the under-filled reservations. The important idea is that the unit of waste is a reserved-and-idle card billing 24/7, so the fleet's job is to raise utilization, not allocation — by matching the allocation unit to the job size and packing cards.
You learned to drive utilization three ways: partition a GPU with MIG into hard-isolated slices for steady sub-card jobs, share softly (time-slicing/MPS) for bursty work that needs to borrow capacity, and schedule with the right tool (Kubernetes + GPU operator for serving, SLURM for batch training) so cards pack toward capacity. That solves the opening failure because the small models land on slices several-per-card instead of hoarding cards, while the 70B keeps its dedicated tensor-parallel pair — turning a 40%-utilized fleet toward 80% on the same hardware.
Carry this diagnostic forward: watch utilization next to allocation, not allocation alone. If the gap is wide, you have sub-card jobs on whole cards — partition and pack before buying. If a MIG-sliced job stalls while neighbors idle, it's bursty and needs a borrowable sharing mode, not a bigger slice. And never try to partition a model larger than a card — dedicate whole cards to it.
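The diagnostic above can be written down as a rough triage rule. This is a sketch under stated assumptions, not a scheduler policy: the `Job` fields, the `cards_needed` metric, the example job sizes, and the 0.25 gap threshold are all hypothetical, and a real fleet would derive them from its own monitoring.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    cards_needed: float  # peak demand in whole-card units (assumed metric)
    bursty: bool         # does it occasionally need more than its steady share?


def sharing_mode(job):
    """Map job size and burstiness to a sharing strategy."""
    if job.cards_needed >= 1.0:
        return "dedicated whole cards"  # never partition a card-or-larger job
    if job.bursty:
        return "time-slicing or MPS"    # needs capacity it can borrow
    return "MIG slice"                  # steady and smaller than a card


def pack_before_buying(allocated, utilized, gap_threshold=0.25):
    """True when allocation outruns utilization by more than the threshold.
    The default threshold is an arbitrary illustrative value."""
    return (allocated - utilized) >= gap_threshold


fleet = [
    Job("70b-serving", 2.0, bursty=False),
    Job("embedder", 0.10, bursty=False),
    Job("classifier", 0.05, bursty=False),
    Job("experiment", 0.30, bursty=True),
]

for job in fleet:
    print(f"{job.name}: {sharing_mode(job)}")
print("pack before buying:", pack_before_buying(allocated=1.00, utilized=0.40))
```

The order of the checks matters: size is tested first because a job that needs a card or more is excluded from sharing regardless of how bursty it is.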
Remember:
- Allocation is not utilization; a fully-allocated fleet can be half-idle, and every idle card bills 24/7.
- MIG hard-partitions a GPU into isolated slices (own SMs, cache, memory) — predictable and secure, but the slice can't borrow idle neighbor capacity.
- Three ways to share: MIG (hard isolation, no overcommit), time-slicing (oversubscribe, no isolation), MPS (concurrent, no memory isolation) — pick by isolation-vs-flexibility.
- Pack jobs smaller than a card; dedicate whole cards to jobs that need a card or more (the 70B); buy only when the packed fleet is saturated.
- Kubernetes (GPU operator + device plugin) schedules serving; SLURM queues batch training — different job shapes, different schedulers.
- Next pressure: we've fed the SMs, the GPUs, and the fleet — but every layer was an NVIDIA-shaped answer, and the honest questions are which of these are settled, which are contested, and how much we've locked ourselves to one vendor.
Bridge. We've now closed every gap between the model's math and the hardware's peak, from idle SMs up to idle cards in a fleet. But the whole stack was a single vendor's answer — TensorRT-LLM, Triton, NIM, NeMo, MIG, NCCL — and a thoughtful engineer should be uneasy about how cleanly it all fit. Which of these mechanisms are physics (true on any accelerator) and which are NVIDIA conventions you could have done differently? Where does the published throughput number come with an asterisk? How much of this stack can you actually leave, and what would it cost? The final file steps back to the boundaries, the contested practices, the things that work without theory, and an honest accounting of the lock-in the convenience of this module quietly bought. → 09-boundary-tradeoff-review.md