04. GPU Scheduling and Node Pools — Scarcity changes the rules

⏱️ Estimated time: 26 min | Level: advanced

ELI5 callback: Think of a busy shipping port. The dock manager must place every container on the right ship. Heavy ML work needs a cargo crane, and port security keeps lanes and permissions clean.

Why GPU scheduling is different from CPU scheduling

A GPU is scarce, discrete, and expensive in ways CPU cores are not. By default Kubernetes hands out GPUs in whole units, so a workload cannot request one-third of a GPU the way it requests 300m of CPU. Keep the analogy close. The dock manager reads the manifest, the container carries one workload unit, the ship offers capacity, the cargo crane handles ML-heavy lifts, and port security blocks unsafe access. Simple, no? That scarcity makes placement policy a first-class design decision. See. Count alone is never the full story for accelerators. Now watch.

scarcity loop
┌────────────┐    ┌────────────┐    ┌────────────┐
│ request    │ -> │ schedule   │ -> │ gpu node   │
└─────┬──────┘    └─────┬──────┘    └─────┬──────┘
      │                 │                 │
      v                 v                 v
  quota check | fit check | expensive slot
Rare hardware changes every policy around it.
Kubernetes usually exposes GPUs as extended resources such as nvidia.com/gpu. Extended resources are requested in whole numbers, and if no node advertises the resource, pods that request it stay Pending. GPU memory, topology, and interconnect matter beyond simple counts. Training jobs and inference jobs stress the same GPU very differently. Idle GPU time still costs real money, so utilization matters. Queueing work can be wiser than keeping many GPUs warm.

So what to do?

  • Balance GPU, CPU, and memory requests together.
  • Separate accelerator workloads from general app workloads.
  • Monitor utilization and waiting time, not only running pods.
  • Publish a clear priority policy for scarce hardware.
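
To make the balanced-request advice concrete, here is a minimal sketch that builds a Pod manifest as a plain Python dict. The helper name gpu_pod_manifest, the image path, and the CPU and memory sizes are illustrative assumptions, not values from a real cluster. It sets the GPU count under both requests and limits, because Kubernetes requires the two to be equal for extended resources.

```python
import json


def gpu_pod_manifest(name: str, image: str, gpus: int, cpu: str, memory: str) -> dict:
    """Build a Pod manifest that requests whole GPUs alongside CPU and memory."""
    # Extended resources such as nvidia.com/gpu only accept whole numbers.
    if isinstance(gpus, bool) or not isinstance(gpus, int) or gpus < 1:
        raise ValueError(f"GPUs are requested in whole units, got {gpus!r}")
    resources = {
        # CPU and memory sit next to the GPU count so the pod fits as a whole.
        "requests": {"cpu": cpu, "memory": memory, "nvidia.com/gpu": gpus},
        # For extended resources the limit must equal the request.
        "limits": {"memory": memory, "nvidia.com/gpu": gpus},
    }
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "restartPolicy": "Never",
            "containers": [{"name": "main", "image": image, "resources": resources}],
        },
    }


if __name__ == "__main__":
    # Hypothetical name, image, and sizes, chosen only for illustration.
    pod = gpu_pod_manifest("trainer-0", "registry.example.com/trainer:dev", 1, "8", "32Gi")
    print(json.dumps(pod, indent=2))
```

The JSON it prints can be saved to a file and applied with kubectl, which accepts JSON as well as YAML. Passing a fractional count such as 0.5 raises an error here, which mirrors the whole-unit rule.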

Device plugins teach kubelet what hardware exists

A device plugin registers with the kubelet and reports accelerator inventory, and the kubelet advertises that inventory in node status so the scheduler can count it. Without that plugin, fancy GPU nodes look ordinary to the cluster. The NVIDIA device plugin is the common implementation for NVIDIA GPUs today. The NVIDIA GPU Operator often bundles drivers, container toolkit, metrics exporter, and plugin together. See. Discovery first, scheduling second. Now watch.

device discovery
┌────────────┐    ┌────────────┐    ┌────────────┐
│ driver     │ -> │ plugin     │ -> │ node status │
└─────┬──────┘    └─────┬──────┘    └─────┬──────┘
      │                 │                 │
      v                 v                 v
  kernel ready | gpu seen | resource exposed
Allocatable resources appear only after setup succeeds.
Node allocatable resources show whether the plugin advertised GPUs properly. Version skew between the driver and the CUDA stack causes silent misery: pods schedule normally and then fail when CUDA initializes. Broken health checks can mark a GPU unavailable even when the node itself is Ready. The container runtime must expose device files and libraries correctly. GPU nodes often take longer to become ready after creation. Alerting on dropped GPU inventory catches drift early.

So what to do?

  • Pin one tested driver stack before scaling fleet size.
  • Separate driver upgrades from workload rollout windows.
  • Inspect allocatable resources during every GPU incident.
  • Keep node bootstrap logs accessible for longer than usual.
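
The "inspect allocatable resources" habit can be scripted. The sketch below assumes a node list shaped like the output of `kubectl get nodes -o json`, where each node carries status.capacity and status.allocatable maps. The function names, node names, and expected counts are hypothetical, and the expected-count table is something you would maintain yourself.

```python
GPU_RESOURCE = "nvidia.com/gpu"


def gpu_inventory(node_list: dict, resource: str = GPU_RESOURCE) -> dict:
    """Map node name -> (capacity, allocatable) for one extended resource."""
    inventory = {}
    for node in node_list.get("items", []):
        status = node.get("status", {})
        # A node without the plugin simply lacks the key, so default to zero.
        capacity = int(status.get("capacity", {}).get(resource, "0"))
        allocatable = int(status.get("allocatable", {}).get(resource, "0"))
        inventory[node["metadata"]["name"]] = (capacity, allocatable)
    return inventory


def dropped_gpus(inventory: dict, expected: dict) -> dict:
    """Return {node: (expected, allocatable)} where allocatable fell short."""
    short = {}
    for name, want in expected.items():
        have = inventory.get(name, (0, 0))[1]
        if have < want:
            short[name] = (want, have)
    return short


if __name__ == "__main__":
    # Hypothetical sample data in the shape of a Kubernetes NodeList.
    nodes = {
        "items": [
            {
                "metadata": {"name": "gpu-pool-a-1"},
                "status": {
                    "capacity": {"cpu": "64", GPU_RESOURCE: "8"},
                    "allocatable": {"cpu": "63", GPU_RESOURCE: "7"},
                },
            },
            {
                "metadata": {"name": "cpu-pool-1"},
                "status": {"capacity": {"cpu": "16"}, "allocatable": {"cpu": "15"}},
            },
        ]
    }
    inventory = gpu_inventory(nodes)
    print(dropped_gpus(inventory, {"gpu-pool-a-1": 8}))
```

Running a check like this on a schedule is one way to get the "dropped GPU inventory" alert without waiting for a user to report Pending pods.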

Taints, tolerations, and selectors protect expensive nodes

GPU nodes should not accept random web pods by default. Taints repel unwanted workloads unless pods carry matching tolerations. Selectors and affinity target the exact hardware family you need. Labels should encode GPU model, zone, and intended workload class. See. Placement needs both allow rules and deny rules. Now watch.

placement policy
┌────────────┐    ┌────────────┐    ┌────────────┐
│ taint      │ -> │ tolerate   │ -> │ select     │
└─────┬──────┘    └─────┬──────┘    └─────┬──────┘
      │                 │                 │
      v                 v                 v
  repel default | allow gpu job | choose model
Protection and targeting work together.
A taint like accelerator=nvidia:NoSchedule protects the pool by default. Only GPU workloads should receive the matching toleration. A toleration only permits placement, so pair it with a selector or affinity rule that actually steers the pod to the pool. Selectors can distinguish A100 from T4 or L4 nodes. Affinity can express preferred versus required hardware choices. Topology spread still matters so one node failure hurts less. Dedicated node pools simplify cost attribution and troubleshooting.

So what to do?

  • Do not spray GPU tolerations across every namespace.
  • Keep label vocabulary small and reviewed regularly.
  • Document the default landing zone for unspecialized pods.
  • Audit selectors after every hardware generation change.
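
Here is a simplified model of how the deny rule (taint) and the allow rule (selector) combine. It is a sketch, not the scheduler's real code: it covers nodeSelector plus the NoSchedule and NoExecute effects, and ignores affinity and PreferNoSchedule. The label key gpu.example.com/model is a made-up example.

```python
from dataclasses import dataclass

BLOCKING_EFFECTS = {"NoSchedule", "NoExecute"}


@dataclass(frozen=True)
class Taint:
    key: str
    value: str
    effect: str


@dataclass(frozen=True)
class Toleration:
    key: str = ""            # empty key with operator "Exists" matches every key
    operator: str = "Equal"  # "Equal" or "Exists"
    value: str = ""
    effect: str = ""         # empty effect matches every effect


def tolerates(tol: Toleration, taint: Taint) -> bool:
    """True if this toleration matches this taint."""
    if tol.effect and tol.effect != taint.effect:
        return False
    if tol.operator == "Exists":
        return tol.key == "" or tol.key == taint.key
    return tol.key == taint.key and tol.value == taint.value


def can_schedule(node_labels: dict, node_taints: list,
                 node_selector: dict, tolerations: list) -> bool:
    # Allow rule: every selector entry must match a node label.
    if any(node_labels.get(k) != v for k, v in node_selector.items()):
        return False
    # Deny rule: every blocking taint needs at least one matching toleration.
    return all(
        any(tolerates(tol, taint) for tol in tolerations)
        for taint in node_taints
        if taint.effect in BLOCKING_EFFECTS
    )


if __name__ == "__main__":
    taint = Taint("accelerator", "nvidia", "NoSchedule")
    labels = {"gpu.example.com/model": "a100"}  # hypothetical label key
    gpu_tol = Toleration("accelerator", "Equal", "nvidia", "NoSchedule")
    print("web pod:", can_schedule(labels, [taint], {}, []))
    print("a100 job:", can_schedule(labels, [taint], {"gpu.example.com/model": "a100"}, [gpu_tol]))
    print("t4 job:", can_schedule(labels, [taint], {"gpu.example.com/model": "t4"}, [gpu_tol]))
```

Note what the toleration alone does not do: a pod that tolerates the taint but carries no selector is merely allowed on the GPU node, and is equally allowed on every untainted node.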

Multi-GPU jobs care about topology, not just quantity

Eight GPUs on one node behave differently from eight across four nodes. Distributed training depends on network bandwidth and local interconnect quality. MIG (Multi-Instance GPU) can slice some NVIDIA GPUs, such as the A100, into smaller isolated profiles. That helps inference packing, but it adds a planning tax. See. Topology is a performance input, not a footnote. Now watch.

topology choices
┌────────────┐    ┌────────────┐    ┌────────────┐
│ 1 node     │ -> │ many nodes │ -> │ mig slices │
└─────┬──────┘    └─────┬──────┘    └─────┬──────┘
      │                 │                 │
      v                 v                 v
  fast local links | network hop | smaller chunks
Same count, very different throughput.
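
A rough way to see why the same count gives different throughput is the bandwidth term of a ring all-reduce, where each GPU moves about 2(N-1)/N times the payload over its slowest link. The sketch below uses only that term and ignores latency, overlap with compute, and whichever algorithm the communication library actually picks. The payload size and both bandwidth figures are placeholder assumptions, not hardware specifications.

```python
def ring_allreduce_seconds(num_gpus: int, payload_bytes: float,
                           link_bytes_per_s: float) -> float:
    """Bandwidth-only estimate of one ring all-reduce, in seconds."""
    if num_gpus < 2:
        return 0.0  # a single GPU has nothing to exchange
    if link_bytes_per_s <= 0:
        raise ValueError("link bandwidth must be positive")
    # Each GPU sends and receives 2 * (N - 1) / N of the payload in total.
    traffic = 2 * (num_gpus - 1) / num_gpus * payload_bytes
    return traffic / link_bytes_per_s


if __name__ == "__main__":
    payload = 10e9       # assumed: 10 GB of gradients per step
    local_link = 300e9   # assumed: intra-node link, bytes per second
    network = 25e9       # assumed: inter-node link, bytes per second
    one_node = ring_allreduce_seconds(8, payload, local_link)
    four_nodes = ring_allreduce_seconds(8, payload, network)
    print(f"8 GPUs on one node:      {one_node:.3f} s per all-reduce")
    print(f"8 GPUs across the network: {four_nodes:.3f} s per all-reduce")
```

In this model the ring runs at the speed of its slowest hop, so once any worker sits behind the network the whole job pays network prices. Treat the numbers as a reason to benchmark, not as a prediction.
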
NCCL, NVIDIA's collective communication library, performs very differently depending on placement and interconnect. Gang scheduling helps when all workers must start together: either every worker is placed or the whole job waits. Pod anti-affinity can accidentally hurt training locality. Dataset access and storage throughput must match the training shape. Request full GPU counts explicitly for each worker or launcher. Use MIG only when its operational complexity is justified.

So what to do?

  • Benchmark each topology instead of trusting vendor slides.
  • Tag MIG profiles clearly so users request the right slice.
  • Choose a gang-capable job scheduler for all-or-nothing distributed launches.
  • Watch interconnect saturation, not only GPU core utilization.
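
The all-or-nothing idea behind gang scheduling fits in a few lines. This is a toy placement function under simple assumptions: every worker needs the same GPU count, a worker cannot span nodes, and nodes with the most free GPUs are filled first to keep workers close together. The function name gang_place and the node names are hypothetical, and real gang schedulers handle far more (queues, priorities, timeouts).

```python
from typing import Optional


def gang_place(free_gpus: dict, workers: int, gpus_per_worker: int) -> Optional[dict]:
    """Return {worker_index: node} if every worker fits, else None.

    Nothing is placed unless the whole gang fits, so a half-started job
    never sits on GPUs waiting for peers that cannot be scheduled.
    """
    if workers < 1 or gpus_per_worker < 1:
        raise ValueError("workers and gpus_per_worker must be positive")
    placement = {}
    next_worker = 0
    # Largest free capacity first, node name as a stable tie-breaker.
    for node, free in sorted(free_gpus.items(), key=lambda kv: (-kv[1], kv[0])):
        while free >= gpus_per_worker and next_worker < workers:
            placement[next_worker] = node
            free -= gpus_per_worker
            next_worker += 1
    return placement if next_worker == workers else None


if __name__ == "__main__":
    fleet = {"node-a": 8, "node-b": 4}  # hypothetical free GPU counts
    print(gang_place(fleet, workers=3, gpus_per_worker=4))
    print(gang_place(fleet, workers=4, gpus_per_worker=4))
    print(gang_place({"node-a": 3, "node-b": 3}, workers=1, gpus_per_worker=4))
```

The last call is the interesting one: six GPUs are free in total, yet a four-GPU worker cannot be placed because no single node has four. Free count alone hides that kind of fragmentation.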

Fairness and cost control matter as much as raw scheduling

One giant training job can starve many tiny inference workloads. Quotas, queues, and priority classes make those tradeoffs explicit. Without policy, the loudest team often wins scarce GPU time. Platform teams must show both wait time and cost per experiment. See. Scarce hardware needs policy before it needs another dashboard. Now watch.

fleet fairness
┌────────────┐    ┌────────────┐    ┌────────────┐
│ queue      │ -> │ quota      │ -> │ fleet      │
└─────┬──────┘    └─────┬──────┘    └─────┬──────┘
      │                 │                 │
      v                 v                 v
  job waits | team share | gpu allocated
Fairness is part of platform design.
Reserve some GPU capacity for latency-sensitive serving paths. Use preemption carefully because partial training loss can be costly. Collect GPU memory, utilization, queue time, and job age. Batch work tolerates queues better than live user traffic. Spot GPUs change cost and interruption risk at the same time. Publish a request playbook so users size jobs realistically.

So what to do?

  • Define quota per team before the fleet gets crowded.
  • Review stale reservations and zombie jobs every week.
  • Add TTL cleanup for finished or failed training Jobs and their pods.
  • Make experiment cost visible next to success metrics.
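
Quota and cost visibility can both be illustrated with a small sketch. The admission function below walks a queue in order and admits a job only if its team stays within a GPU quota, and the cost helper multiplies GPUs, hours, and a price. Team names, quotas, and the price per GPU-hour are invented for the example, and letting small jobs pass a blocked large one is one possible policy, not the only reasonable one.

```python
def admit(queue: list, quota: dict, in_use: dict) -> tuple:
    """Admit queued jobs in order while each team stays within its GPU quota.

    queue holds (job, team, gpus) tuples. Returns (admitted, waiting) job names.
    """
    used = dict(in_use)  # copy so the caller's accounting is not modified
    admitted, waiting = [], []
    for job, team, gpus in queue:
        if used.get(team, 0) + gpus <= quota.get(team, 0):
            used[team] = used.get(team, 0) + gpus
            admitted.append(job)
        else:
            # The job waits; later jobs may still be admitted (backfill).
            waiting.append(job)
    return admitted, waiting


def experiment_cost(gpus: int, hours: float, price_per_gpu_hour: float) -> float:
    """GPU-hours times price, the simplest cost-per-experiment figure."""
    return gpus * hours * price_per_gpu_hour


if __name__ == "__main__":
    # Hypothetical teams, quotas, and price.
    queue = [
        ("train-big", "research", 8),
        ("serve-a", "serving", 1),
        ("train-small", "research", 2),
    ]
    quota = {"research": 8, "serving": 4}
    in_use = {"research": 4}
    print(admit(queue, quota, in_use))
    print(experiment_cost(gpus=8, hours=3, price_per_gpu_hour=2.5))
```

With these numbers the large training job waits because research already holds four GPUs, while the serving pod and the small training job go through. Whether that backfill is fair to the waiting job is exactly the kind of tradeoff a published priority policy should settle.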

Where this lives in the wild

  • Research clusters use tainted GPU pools so notebooks do not land on premium nodes.
  • Inference platforms use MIG or small GPU pools to avoid wasting full accelerators.
  • Distributed training stacks pair Kubernetes with gang schedulers for worker coordination.
  • FinOps teams track queue time and GPU-hours to keep experiment costs sane.

Pause and recall

  1. Why is a GPU request different from a CPU request?
  2. What role does the device plugin play before scheduling even begins?
  3. Why combine taints with selectors for GPU nodes?
  4. Why does topology matter for multi-GPU jobs?

Interview Q&A

Q: Why are taints useful on GPU node pools?
A: They keep general workloads off expensive nodes unless a pod explicitly opts in. That protects both cost efficiency and placement clarity.
Common wrong answer to avoid: “Because GPUs are insecure without taints.”

Q: Why can a job request the right GPU count and still perform poorly?
A: Topology, storage throughput, and interconnect bandwidth may still be wrong for the job. Count is necessary, but it does not guarantee good communication patterns.
Common wrong answer to avoid: “Because Kubernetes scheduled the wrong CUDA version.”

Q: Why might MIG help some teams and hurt others?
A: It improves packing for small inference jobs but adds profile management and operational complexity. Not every fleet gains enough efficiency to justify that extra control surface.
Common wrong answer to avoid: “Because MIG is only for legacy GPUs.”

Q: Why is fairness policy part of scheduling design?
A: GPU scarcity creates real opportunity cost, so teams need explicit sharing rules. Otherwise queue time, cost, and criticality stay hidden until conflict explodes.
Common wrong answer to avoid: “Because finance teams ask for it later.”

Apply now (5 min)

Imagine you own a fleet with T4, L4, and A100 nodes. Write one taint and two labels you would place on each pool. Now pick one inference workload and one training workload. Decide selectors, tolerations, and quota rules for both. Finally, note one metric that proves the fleet is fragmented.

Bridge. GPUs scheduled. But containers need persistent data. → 05 → 05-storage-in-kubernetes.md