04. GPU Scheduling and Node Pools — Scarcity changes the rules¶
⏱️ Estimated time: 26 min | Level: advanced
ELI5 callback: Think of a busy shipping port. The dock manager must place every container on the right ship. Heavy ML work needs a cargo crane, and port security keeps lanes and permissions clean.
Why GPU scheduling is different from CPU scheduling¶
A GPU is scarce, discrete, and expensive in ways CPU cores are not. By default, Kubernetes exposes a GPU as an extended resource that is requested in whole units, so a pod cannot ask for one-third of a GPU the way it asks for a fraction of a CPU core. That scarcity makes placement policy a first-class design decision. Keep the analogy close: the dock manager reads the manifest, the container carries one workload unit, the ship offers capacity, the cargo crane handles ML-heavy lifts, and port security blocks unsafe access. Simple, no? See. Count alone is never the full story for accelerators. Now watch.
scarcity loop
┌────────────┐    ┌────────────┐    ┌────────────┐
│  request   │ -> │  schedule  │ -> │  gpu node  │
└─────┬──────┘    └─────┬──────┘    └─────┬──────┘
      │                 │                 │
      v                 v                 v
 quota check   |    fit check    | expensive slot
Rare hardware changes every policy around it.
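GPUs are requested in whole units, and a small Python sketch can make that concrete. The snippet below builds a pod manifest as a plain dictionary and rejects fractional GPU quantities. The resource name `nvidia.com/gpu` is the one advertised by the NVIDIA device plugin; the pod name, image, and the helpers `validate_gpu_quantity` and `gpu_pod` are hypothetical and exist only for illustration.

```python
import json
import re

GPU_RESOURCE = "nvidia.com/gpu"  # extended resource name from the NVIDIA device plugin


def validate_gpu_quantity(quantity: str) -> int:
    """Return the GPU count, or raise if it is not a positive whole number.

    Extended resources are requested in whole units, unlike CPU,
    which accepts fractions such as "500m".
    """
    if not re.fullmatch(r"[1-9][0-9]*", quantity):
        raise ValueError(f"{GPU_RESOURCE} must be a positive integer, got {quantity!r}")
    return int(quantity)


def gpu_pod(name: str, image: str, gpus: str) -> dict:
    """Build a minimal pod manifest (name and image are placeholders)."""
    validate_gpu_quantity(gpus)
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "containers": [{
                "name": "main",
                "image": image,
                "resources": {
                    # CPU may be fractional; the GPU limit may not.
                    "requests": {"cpu": "500m"},
                    "limits": {GPU_RESOURCE: gpus},
                },
            }],
        },
    }


if __name__ == "__main__":
    print(json.dumps(gpu_pod("train-demo", "example.com/trainer:latest", "1"), indent=2))
```

Calling `gpu_pod("train-demo", "example.com/trainer:latest", "0.5")` raises `ValueError`, which is the point: sharing a GPU needs a separate mechanism such as MIG, not a smaller number in the request.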
Device plugins teach kubelet what hardware exists¶
Device plugins report accelerator inventory to the kubelet, which advertises it as an extended resource (for example nvidia.com/gpu) in the node's capacity and allocatable fields so Kubernetes can schedule correctly. Without that plugin, fancy GPU nodes look ordinary to the cluster. The NVIDIA device plugin is the common implementation today. The NVIDIA GPU Operator often bundles drivers, container toolkit, metrics exporter, and plugin together. See. Discovery first, scheduling second. Now watch.
device discovery
┌─────────────┐    ┌─────────────┐    ┌─────────────┐
│   driver    │ -> │   plugin    │ -> │ node status │
└──────┬──────┘    └──────┬──────┘    └──────┬──────┘
       │                  │                  │
       v                  v                  v
 kernel ready   |     gpu seen     | resource exposed
Allocatable resources appear only after setup succeeds.
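To see why discovery comes first, here is a minimal sketch that reads the GPU count from a node's `status.allocatable` map. The node objects are hypothetical and trimmed to the fields used; `gpu_allocatable` and `schedulable_gpu_nodes` are illustrative helper names, not part of any Kubernetes client. The sketch ignores GPUs already assigned to running pods.

```python
GPU_RESOURCE = "nvidia.com/gpu"


def gpu_allocatable(node: dict) -> int:
    """GPUs the node advertises as allocatable (0 if the resource is absent)."""
    allocatable = node.get("status", {}).get("allocatable", {})
    return int(allocatable.get(GPU_RESOURCE, "0"))


def schedulable_gpu_nodes(nodes: list[dict], wanted: int) -> list[str]:
    """Names of nodes that advertise at least `wanted` GPUs."""
    return [n["metadata"]["name"] for n in nodes if gpu_allocatable(n) >= wanted]


# Hypothetical node objects for illustration.
nodes = [
    {
        # GPU hardware is installed, but no device plugin has registered yet.
        "metadata": {"name": "gpu-node-a"},
        "status": {"allocatable": {"cpu": "32", "memory": "128Gi"}},
    },
    {
        # Plugin is running, so the extended resource appears.
        "metadata": {"name": "gpu-node-b"},
        "status": {"allocatable": {"cpu": "32", "memory": "128Gi", GPU_RESOURCE: "4"}},
    },
]

if __name__ == "__main__":
    for node in nodes:
        print(node["metadata"]["name"], gpu_allocatable(node))
    print(schedulable_gpu_nodes(nodes, wanted=2))
```

In this sketch `gpu-node-a` is invisible to GPU workloads even though the hardware is physically there, because nothing has advertised it.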
Taints, tolerations, and selectors protect expensive nodes¶
GPU nodes should not accept random web pods by default. Taints repel unwanted workloads unless pods carry matching tolerations. A toleration only permits placement on a tainted node; it does not steer the pod there. Selectors and affinity do the steering and target the exact hardware family you need. Labels should encode GPU model, zone, and intended workload class. See. Placement needs both allow rules and deny rules. Now watch.
placement policy
┌────────────┐    ┌────────────┐    ┌────────────┐
│   taint    │ -> │  tolerate  │ -> │   select   │
└─────┬──────┘    └─────┬──────┘    └─────┬──────┘
      │                 │                 │
      v                 v                 v
repel default  |  allow gpu job  |  choose model
Protection and targeting work together.
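The interplay of deny rules (taints) and allow rules (selectors) can be sketched as a simplified placement check. This is not the scheduler's actual code: it models only hard taints and `nodeSelector`, and skips affinity, resources, and soft preferences. The taint key, the label key `gpu.example.com/model`, and the pod and node dictionaries are hypothetical.

```python
def tolerates(toleration: dict, taint: dict) -> bool:
    """Simplified toleration-matching rule."""
    # An empty effect on the toleration matches every effect.
    if toleration.get("effect") and toleration["effect"] != taint["effect"]:
        return False
    # An empty key matches every key (meant for use with operator Exists).
    if toleration.get("key") and toleration["key"] != taint["key"]:
        return False
    if toleration.get("operator", "Equal") == "Exists":
        return True
    return toleration.get("value", "") == taint.get("value", "")


def can_schedule(pod: dict, node: dict) -> bool:
    # Deny rule: every hard taint must be tolerated by the pod.
    for taint in node.get("taints", []):
        if taint["effect"] in ("NoSchedule", "NoExecute") and not any(
            tolerates(t, taint) for t in pod.get("tolerations", [])
        ):
            return False
    # Allow rule: every nodeSelector entry must match a node label.
    labels = node.get("labels", {})
    return all(labels.get(k) == v for k, v in pod.get("nodeSelector", {}).items())


# Hypothetical nodes and pods for illustration.
gpu_node = {
    "taints": [{"key": "nvidia.com/gpu", "value": "present", "effect": "NoSchedule"}],
    "labels": {"gpu.example.com/model": "a100"},
}
cpu_node = {"taints": [], "labels": {}}

gpu_toleration = {"key": "nvidia.com/gpu", "operator": "Exists", "effect": "NoSchedule"}
web_pod = {}
tolerant_pod = {"tolerations": [gpu_toleration]}
training_pod = {
    "tolerations": [gpu_toleration],
    "nodeSelector": {"gpu.example.com/model": "a100"},
}

if __name__ == "__main__":
    for name, pod in [("web", web_pod), ("tolerant", tolerant_pod), ("training", training_pod)]:
        print(name, "gpu:", can_schedule(pod, gpu_node), "cpu:", can_schedule(pod, cpu_node))
```

Note the middle case: a pod with only a toleration is still allowed on the CPU node. Only the pod that also carries a selector is pinned to the GPU pool.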
Multi-GPU jobs care about topology, not just quantity¶
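Eight GPUs on one node behave differently from eight across four nodes. Distributed training depends on both the local interconnect inside a node (such as NVLink or PCIe) and the network bandwidth between nodes. MIG (Multi-Instance GPU) can slice supported NVIDIA GPUs, such as the A100, into smaller isolated instances. That helps inference packing, but it adds a planning tax: someone has to choose and maintain the slice profiles. See. Topology is a performance input, not a footnote. Now watch.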
topology choices
┌────────────┐    ┌────────────┐    ┌────────────┐
│   1 node   │ -> │ many nodes │ -> │ mig slices │
└─────┬──────┘    └─────┬──────┘    └─────┬──────┘
      │                 │                 │
      v                 v                 v
fast local links | network hop    | smaller chunks
Same count, very different throughput.
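A rough way to put numbers on "same count, different throughput" is the standard bandwidth-only estimate for a ring all-reduce, in which each GPU sends about 2(N-1)/N times the payload over its slowest link. The sketch below applies that formula. The payload size and both bandwidth figures are assumptions chosen for illustration, not measurements of any particular hardware, and latency is ignored.

```python
def ring_allreduce_seconds(payload_bytes: float, gpus: int, link_bytes_per_s: float) -> float:
    """Bandwidth-only estimate for one ring all-reduce (latency ignored)."""
    if gpus < 2:
        return 0.0  # nothing to exchange with a single GPU
    traffic_per_gpu = 2 * (gpus - 1) / gpus * payload_bytes
    return traffic_per_gpu / link_bytes_per_s


GB = 1e9

# Assumed numbers, for illustration only.
GRADIENT_BYTES = 10 * GB        # payload synchronized each step
INTRA_NODE_LINK = 300 * GB      # assumed fast local interconnect, bytes/s
INTER_NODE_LINK = 12.5 * GB     # assumed 100 Gbit/s network link, bytes/s

one_node = ring_allreduce_seconds(GRADIENT_BYTES, 8, INTRA_NODE_LINK)
four_nodes = ring_allreduce_seconds(GRADIENT_BYTES, 8, INTER_NODE_LINK)

if __name__ == "__main__":
    print(f"8 GPUs on one node:     {one_node:.3f} s per all-reduce")
    print(f"8 GPUs over four nodes: {four_nodes:.3f} s per all-reduce")
    print(f"slowdown: {four_nodes / one_node:.0f}x")
```

Under these assumed figures the ring is limited by the network hop, so the spread-out layout is 24 times slower per synchronization even though both jobs requested eight GPUs.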
Fairness and cost control matter as much as raw scheduling¶
One giant training job can starve many tiny inference workloads. Quotas, queues, and priority classes make those tradeoffs explicit. Without policy, the loudest team often wins scarce GPU time. Platform teams should surface both queue wait time and cost per experiment so the tradeoffs stay visible. See. Scarce hardware needs policy before it needs another dashboard. Now watch.
fleet fairness
┌────────────┐    ┌────────────┐    ┌────────────┐
│   queue    │ -> │   quota    │ -> │   fleet    │
└─────┬──────┘    └─────┬──────┘    └─────┬──────┘
      │                 │                 │
      v                 v                 v
  job waits    |   team share    |  gpu allocated
Fairness is part of platform design.
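The queue, quota, fleet flow can be sketched as a toy admission loop. This is an illustration of the idea, not how the Kubernetes scheduler or any specific queueing system is implemented. The `Job` class, the `admit` function, the team names, and all quota numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    team: str
    gpus: int
    priority: int = 0  # higher priority is considered first


def admit(jobs: list[Job], quotas: dict[str, int], fleet_gpus: int) -> tuple[list[str], list[str]]:
    """Return (admitted, waiting) job names under per-team GPU quotas."""
    used = {team: 0 for team in quotas}
    free = fleet_gpus
    admitted, waiting = [], []
    # sorted() is stable, so equal priorities keep submission order.
    for job in sorted(jobs, key=lambda j: -j.priority):
        within_quota = used.get(job.team, 0) + job.gpus <= quotas.get(job.team, 0)
        if within_quota and job.gpus <= free:
            used[job.team] = used.get(job.team, 0) + job.gpus
            free -= job.gpus
            admitted.append(job.name)
        else:
            waiting.append(job.name)
    return admitted, waiting


# Hypothetical fleet: 8 GPUs shared by two teams.
QUOTAS = {"research": 6, "serving": 2}
JOBS = [
    Job("big-train", "research", gpus=8),
    Job("infer-a", "serving", gpus=1),
    Job("infer-b", "serving", gpus=1),
    Job("small-train", "research", gpus=4),
]

if __name__ == "__main__":
    admitted, waiting = admit(JOBS, QUOTAS, fleet_gpus=8)
    print("admitted:", admitted)
    print("waiting: ", waiting)
```

Without the quota check, `big-train` would arrive first and take all eight GPUs. With it, the two inference jobs and the smaller training job run while the oversized request waits, which makes the tradeoff an explicit policy choice rather than an accident of arrival order.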
Where this lives in the wild¶
- Research clusters use tainted GPU pools so notebooks do not land on premium nodes.
- Inference platforms use MIG or small GPU pools to avoid wasting full accelerators.
- Distributed training stacks pair Kubernetes with gang schedulers for worker coordination.
- FinOps teams track queue time and GPU-hours to keep experiment costs sane.
Pause and recall¶
- Why is a GPU request different from a CPU request?
- What role does the device plugin play before scheduling even begins?
- Why combine taints with selectors for GPU nodes?
- Why does topology matter for multi-GPU jobs?
Interview Q&A¶
Q: Why are taints useful on GPU node pools?
A: They keep general workloads off expensive nodes unless a pod explicitly opts in. That protects both cost efficiency and placement clarity.
Common wrong answer to avoid: “Because GPUs are insecure without taints.”

Q: Why can a job request the right GPU count and still perform poorly?
A: Topology, storage throughput, and interconnect bandwidth may still be wrong for the job. Count is necessary, but it does not guarantee good communication patterns.
Common wrong answer to avoid: “Because Kubernetes scheduled the wrong CUDA version.”

Q: Why might MIG help some teams and hurt others?
A: It improves packing for small inference jobs but adds profile management and operational complexity. Not every fleet gains enough efficiency to justify that extra control surface.
Common wrong answer to avoid: “Because MIG is only for legacy GPUs.”

Q: Why is fairness policy part of scheduling design?
A: GPU scarcity creates real opportunity cost, so teams need explicit sharing rules. Otherwise queue time, cost, and criticality stay hidden until conflict explodes.
Common wrong answer to avoid: “Because finance teams ask for it later.”
Apply now (5 min)¶
Imagine you own a fleet with T4, L4, and A100 nodes. Write one taint and two labels you would place on each pool. Now pick one inference workload and one training workload. Decide selectors, tolerations, and quota rules for both. Finally, note one metric that proves the fleet is fragmented.
Bridge. GPUs scheduled. But containers need persistent data. → 05-storage-in-kubernetes.md