07. Autoscaling and Capacity: Grow fast, waste less
⏱️ Estimated time: 24 min | Level: advanced
ELI5 callback: Think of a busy shipping port. The dock manager must place every container on the right ship. Heavy ML work needs a cargo crane, and port security keeps lanes and permissions clean.
Requests and limits decide how the cluster even thinks
Autoscaling starts with resource requests because the scheduler places pods by what they declare, not by what they actually use. Limits cap usage bursts, but they do not replace proper requests. Keep the analogy close. The dock manager reads the manifest, the container carries one workload unit, the ship offers capacity, the cargo crane handles ML-heavy lifts, and port security blocks unsafe access. Simple, no? Bad requests create bad packing, noisy neighbors, and fake utilization math, because HPA measures CPU utilization as a percentage of the request. See. Capacity math begins before any autoscaler loop wakes up. Now watch.
resource basics
┌────────────┐    ┌────────────┐    ┌────────────┐
│  request   │ -> │  schedule  │ -> │    run     │
└─────┬──────┘    └─────┬──────┘    └─────┬──────┘
      │                 │                 │
      v                 v                 v
  fit node     |  reserve room   |  actual usage
Scheduler and autoscaler trust declared intent first.
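The fit check can be sketched in a few lines. This is a toy model, not the real scheduler: the `Pod` and `Node` classes, the names, and the numbers are all made up for illustration. It only shows that placement is decided from declared requests, never from live usage.

```python
from dataclasses import dataclass, field


@dataclass
class Pod:
    name: str
    cpu_request_m: int    # CPU request in millicores
    mem_request_mi: int   # memory request in MiB


@dataclass
class Node:
    name: str
    cpu_allocatable_m: int
    mem_allocatable_mi: int
    pods: list = field(default_factory=list)

    def free_cpu_m(self) -> int:
        # Room left after subtracting every placed pod's CPU request.
        return self.cpu_allocatable_m - sum(p.cpu_request_m for p in self.pods)

    def free_mem_mi(self) -> int:
        return self.mem_allocatable_mi - sum(p.mem_request_mi for p in self.pods)

    def fits(self, pod: Pod) -> bool:
        # Only declared requests count; live usage is never consulted.
        return (pod.cpu_request_m <= self.free_cpu_m()
                and pod.mem_request_mi <= self.free_mem_mi())


# Hypothetical ship (node) that already carries one container (pod).
node = Node("ship-1", cpu_allocatable_m=4000, mem_allocatable_mi=8192)
node.pods.append(Pod("api", cpu_request_m=3000, mem_request_mi=2048))

small = Pod("worker", cpu_request_m=500, mem_request_mi=512)
large = Pod("batch", cpu_request_m=1500, mem_request_mi=512)

print(node.fits(small))  # True: 500m fits in the 1000m left unreserved
print(node.fits(large))  # False: 1500m does not, even if "api" sits idle
```

If "api" requested 3000m but used only 300m, the node would still look full to this check. That gap is exactly the waste that bad requests create.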
HPA scales pods, while VPA tunes pod size
HPA adds or removes replicas when a chosen metric moves away from its target. VPA recommends or updates CPU and memory requests for each pod. They solve different problems, so pointing both at the same CPU or memory signal creates confusion: VPA changes the request while HPA measures utilization against it. Feedback loops need clear ownership or they fight each other. See. First decide whether you need more pods or bigger pods. Now watch.
pod autoscaling
┌────────────┐    ┌────────────┐    ┌────────────┐
│   metric   │ -> │    hpa     │ -> │    vpa     │
└─────┬──────┘    └─────┬──────┘    └─────┬──────┘
      │                 │                 │
      v                 v                 v
 load signal   |  replica count  |  request size
Count and size are separate levers.
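The replica lever follows the formula given in the Kubernetes HPA documentation: desired replicas = ceil(current replicas × current metric / target metric), with a default tolerance of 0.1 around the target. The sketch below implements only that core step. The function name and the min/max defaults are assumptions for illustration, and the real controller also handles missing metrics, unready pods, and stabilization windows.

```python
import math


def desired_replicas(current_replicas: int,
                     current_metric: float,
                     target_metric: float,
                     min_replicas: int = 1,
                     max_replicas: int = 10,
                     tolerance: float = 0.1) -> int:
    """Core HPA ratio step, clamped to the configured replica bounds."""
    ratio = current_metric / target_metric
    # Inside the tolerance band, do nothing; this avoids flapping.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    wanted = math.ceil(current_replicas * ratio)
    return max(min_replicas, min(max_replicas, wanted))


# 4 pods at 90% average CPU against a 60% target: scale out.
print(desired_replicas(4, 90, 60))    # 6
# 63% against 60% is inside the tolerance band: hold.
print(desired_replicas(4, 63, 60))    # 4
# 8 pods at 120% would want 16, but max_replicas caps it.
print(desired_replicas(8, 120, 60))   # 10
```

Note what is missing: nothing here changes pod size. That is VPA's lever, and it moves the request that this utilization percentage is measured against.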
Cluster Autoscaler and Karpenter add or remove nodes
When pods cannot fit, node-level autoscalers decide whether to add machines. Cluster Autoscaler reasons from node groups and pending pods. Karpenter reasons more directly from pod requirements and available instance types. Both must balance speed, price, and placement constraints. See. Node autoscaling is slower, heavier, and costlier than pod autoscaling. Now watch.
node scaling
┌─────────────┐    ┌─────────────┐    ┌─────────────┐
│ pending pod │ -> │ autoscaler  │ -> │  new node   │
└──────┬──────┘    └──────┬──────┘    └──────┬──────┘
       │                  │                  │
       v                  v                  v
 unschedulable  |   capacity plan  |  instance boots
Pending pods are the demand signal here.
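To get a feel for how pending pods turn into a node count, here is a rough estimate using first-fit decreasing on CPU requests for a single node size. This is a textbook bin-packing heuristic, not the algorithm Cluster Autoscaler or Karpenter actually runs. It ignores memory, affinity, taints, and daemonset overhead, and `nodes_needed` is a made-up helper.

```python
def nodes_needed(pending_cpu_m: list[int], node_cpu_m: int) -> int:
    """Estimate new nodes for pending pods using first-fit decreasing."""
    free = []  # remaining CPU (millicores) on each node opened so far
    for request in sorted(pending_cpu_m, reverse=True):
        if request > node_cpu_m:
            raise ValueError(f"{request}m can never fit on a {node_cpu_m}m node")
        for i, room in enumerate(free):
            if request <= room:
                free[i] = room - request  # place on the first node with room
                break
        else:
            free.append(node_cpu_m - request)  # nothing fits: open a new node
    return len(free)


# Mixed sizes pack well: 7000m of demand fits on two 4000m nodes.
print(nodes_needed([1500, 1500, 1000, 500, 2500], 4000))  # 2
# Awkward sizes do not: 7500m of demand still needs three nodes.
print(nodes_needed([2500, 2500, 2500], 4000))             # 3
```

The second case is why dividing total demand by node size underestimates. Pod shape decides the answer, not just the sum.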
Bin packing is a cost and reliability game
Packing pods tightly saves money until it removes all failure headroom. Packing loosely feels safe until bills expose the hidden waste. Good capacity design chooses a deliberate middle, not an accidental one. The right answer depends on startup time, SLA, and workload shape. See. Utilization and resilience must be tuned together. Now watch.
packing tradeoff
┌────────────┐    ┌────────────┐    ┌────────────┐
│ tight pack │ -> │  balanced  │ -> │ loose pack │
└─────┬──────┘    └─────┬──────┘    └─────┬──────┘
      │                 │                 │
      v                 v                 v
cheap but risky |  better mix    | safe but costly
Every fleet lives on this spectrum.
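One way to put numbers on the spectrum is to check utilization and single-node-failure headroom together. The sketch below assumes equal-sized nodes and looks only at aggregate CPU requests, so it is optimistic: it ignores fragmentation and placement rules. The three fleets are made-up examples.

```python
def utilization(requested_m: list[int], node_cpu_m: int) -> float:
    """Share of total fleet CPU that is reserved by requests."""
    return sum(requested_m) / (node_cpu_m * len(requested_m))


def survives_one_node_loss(requested_m: list[int], node_cpu_m: int) -> bool:
    """True if all requests would still fit, in aggregate, on one fewer node."""
    remaining_capacity = (len(requested_m) - 1) * node_cpu_m
    return sum(requested_m) <= remaining_capacity


# Requested CPU per node (millicores) on three hypothetical 4000m nodes.
fleets = {
    "tight pack": [3600, 3500, 3700],
    "balanced": [2600, 2500, 2700],
    "loose pack": [1200, 800, 1000],
}

for name, requested in fleets.items():
    used = utilization(requested, 4000)
    safe = survives_one_node_loss(requested, 4000)
    print(f"{name}: {used:.0%} reserved, survives node loss: {safe}")
```

The tight fleet looks best on a utilization dashboard and is the only one that cannot absorb a node failure. With three nodes, anything above two thirds reserved fails this check.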
GPU capacity needs even more headroom discipline
GPU node startup is often slower and availability is often tighter. That means capacity mistakes surface as queues, not just mild latency. Inference fleets may need warm GPUs while training fleets can wait. Fragmentation can leave plenty of total GPUs but no usable shape. See. Rare hardware punishes lazy capacity planning much faster. Now watch.
gpu capacity
┌────────────┐    ┌────────────┐    ┌────────────┐
│   demand   │ -> │   shape    │ -> │   fleet    │
└─────┬──────┘    └─────┬──────┘    └─────┬──────┘
      │                 │                 │
      v                 v                 v
 jobs queue    |   fit or fail   |  limited pool
Total capacity and usable capacity can differ.
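Fragmentation is easy to see with a small count. The sketch assumes every job needs all of its GPUs on a single node, which is a simplification, and the free-GPU numbers are invented. `placeable_jobs` is a hypothetical helper, not part of any scheduler.

```python
def placeable_jobs(free_gpus_per_node: list[int], gpus_per_job: int) -> int:
    """Count jobs that fit when each job's GPUs must sit on one node."""
    return sum(free // gpus_per_job for free in free_gpus_per_node)


free = [2, 3, 1, 2]  # free GPUs on four hypothetical nodes

print(sum(free))                # 8 GPUs free in total
print(placeable_jobs(free, 1))  # 8 single-GPU jobs fit
print(placeable_jobs(free, 2))  # 3 two-GPU jobs fit, leaving 2 GPUs stranded
print(placeable_jobs(free, 4))  # 0 four-GPU jobs fit, despite 8 free GPUs
```

Eight free GPUs and a four-GPU job that still queues: that is total capacity versus usable capacity.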
Where this lives in the wild
- Consumer APIs use HPA for replicas and node autoscaling for weekend or festival bursts.
- ML serving clusters keep warm GPU headroom because cold starts hurt latency badly.
- Batch analytics workers scale from queue lag while spot nodes absorb cheap overflow.
- FinOps dashboards compare requested versus used resources to expose waste by team.
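The requested-versus-used comparison from the last bullet reduces to one ratio per team. A minimal sketch, assuming you already have per-workload samples of requested and used CPU; the team names and numbers are invented, and `waste_by_team` is a hypothetical helper.

```python
from collections import defaultdict


def waste_by_team(samples):
    """Return the unused share of requested CPU per team.

    samples: iterable of (team, requested_cpu_m, used_cpu_m) tuples.
    """
    requested = defaultdict(int)
    used = defaultdict(int)
    for team, req, use in samples:
        requested[team] += req
        used[team] += use
    # Skip teams with zero requests to avoid dividing by zero.
    return {team: 1 - used[team] / requested[team]
            for team in requested if requested[team] > 0}


# Made-up samples: (team, requested millicores, used millicores).
samples = [
    ("checkout", 2000, 1500),
    ("checkout", 2000, 1300),
    ("search", 4000, 800),
]

for team, waste in waste_by_team(samples).items():
    print(f"{team}: {waste:.0%} of requested CPU unused")
```

A high number is a prompt to look, not a verdict. Some of that idle room may be deliberate burst or failure headroom.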
Pause and recall
- Why do requests matter before autoscaling even starts?
- How do HPA and VPA solve different capacity problems?
- Why is node autoscaling slower than pod autoscaling?
- What makes GPU capacity planning harsher than CPU capacity planning?
Interview Q&A
Q: Why can high utilization still mean poor capacity design? A: The fleet may be packed so tightly that one failure causes major disruption. High utilization is good only when recovery and latency targets still hold. Common wrong answer to avoid: “Because utilization should always stay below fifty percent.”
Q: Why is Karpenter attractive for mixed workloads? A: It can choose instance types closer to actual pending pod needs. That flexibility can reduce waste compared with rigid node-group planning. Common wrong answer to avoid: “Because it replaces HPA.”
Q: Why do some teams keep warm spare capacity? A: Node startup and application startup time may be too slow for the SLA. A little idle headroom can be cheaper than missed user traffic. Common wrong answer to avoid: “Because autoscalers are unreliable by nature.”
Q: Why is fragmentation a real GPU problem? A: You may have enough total GPUs, but not in the shapes or locations jobs require. That turns capacity into stranded inventory instead of usable throughput. Common wrong answer to avoid: “Because GPU drivers report the wrong numbers.”
Apply now (5 min)
Pick one API and write its steady load, burst load, and cold-start time. Now choose one HPA metric and one max replica value. Decide how much node headroom you need during a rollout. If the workload used GPUs, decide whether warm spare capacity is worth it. Finally, write one sign that the cluster is packed too tightly.
Bridge. Scaling is automated. But how do we deploy safely without downtime? → 08-rollouts-and-health.md