07. Autoscaling and Capacity — Grow fast, waste less

⏱️ Estimated time: 24 min | Level: advanced

ELI5 callback: Think of a busy shipping port. The dock manager must place every container on the right ship. Heavy ML work needs a cargo crane, and port security keeps lanes and permissions clean.

Requests and limits decide how the cluster even thinks

Autoscaling starts with resource requests because scheduling needs declared intent. Limits cap usage bursts, but they do not replace proper requests. Keep the analogy close. The dock manager reads the manifest, the container carries one workload unit, the ship offers capacity, the cargo crane handles ML-heavy lifts, and port security blocks unsafe access. Simple, no? Bad requests create bad packing, noisy neighbors, and fake utilization math. See. Capacity math begins before any autoscaler loop wakes up. Now watch.

resource basics
┌────────────┐    ┌────────────┐    ┌────────────┐
│ request    │ -> │ schedule   │ -> │ run        │
└─────┬──────┘    └─────┬──────┘    └─────┬──────┘
      │                 │                 │
      v                 v                 v
  fit node | reserve room | actual usage
Scheduler and autoscaler trust declared intent first.
CPU requests drive many utilization-based autoscaling decisions, because HPA measures CPU utilization as a percentage of the request. Memory requests strongly affect node packing and eviction risk. Over-requesting wastes fleet space even when workloads stay mostly idle. Under-requesting invites CPU contention, memory-pressure evictions, and unstable latency, while tight CPU limits cause surprise throttling. Limits are safety rails, not a substitute for measurement. GPU jobs need balanced CPU, memory, and storage requests too. So what to do? Collect baseline usage before choosing default request values. Separate peak bursts from steady-state demand in planning. Review requests after big code or model changes. Teach teams why request honesty helps everyone share the cluster.
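One way to turn "collect baseline usage" into a number is a percentile over observed samples plus a margin. This is a minimal sketch, not a Kubernetes default: the helper names `percentile` and `suggest_request`, the 90th percentile, the 15% margin, and the sample values are all assumptions made for illustration.

```python
def percentile(samples, pct):
    """Nearest-rank percentile; pct is an integer from 1 to 100."""
    if not samples:
        raise ValueError("need at least one usage sample")
    ordered = sorted(samples)
    # Integer ceiling of pct * n / 100 avoids float rounding surprises.
    rank = (pct * len(ordered) + 99) // 100
    return ordered[max(rank, 1) - 1]


def suggest_request(samples_millicores, pct=90, margin_pct=15):
    """Hypothetical helper: chosen percentile plus a safety margin, in millicores."""
    base = percentile(samples_millicores, pct)
    # Integer ceiling of base * (1 + margin).
    return (base * (100 + margin_pct) + 99) // 100


# Made-up CPU samples in millicores; the 900m reading is a short burst.
usage = [120, 130, 125, 140, 135, 128, 132, 900, 138, 127]
request = suggest_request(usage)  # sized from steady-state demand
peak = max(usage)                 # informs the limit discussion, not the request
print(request, peak)
```

With these samples the burst stays out of the request, which is the "separate peak bursts from steady-state demand" idea in practice. The percentile and margin should come from your own measurements.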

HPA scales pods, while VPA tunes pod size

HPA adds or removes replicas when a chosen metric moves away from its target. VPA recommends or updates CPU and memory requests for each pod. They solve different problems, so mixing them blindly creates confusion. Feedback loops need clear ownership or they fight each other. See. First decide whether you need more pods or bigger pods. Now watch.

pod autoscaling
┌────────────┐    ┌────────────┐    ┌────────────┐
│ metric     │ -> │ hpa        │ -> │ vpa        │
└─────┬──────┘    └─────┬──────┘    └─────┬──────┘
      │                 │                 │
      v                 v                 v
  load signal | replica count | request size
Count and size are separate levers.
HPA is great for stateless services with horizontal elasticity. VPA is useful for discovering request drift and right-sizing workloads. Running both on one workload requires care: if HPA scales on CPU or memory utilization while VPA rewrites those same requests, each loop keeps reacting to the other's changes. Stateful systems may scale size more safely than replica count. Startup-heavy apps need stabilization windows to avoid oscillation. Queue-driven systems often want custom metrics, not CPU. So what to do? Choose one primary scaling lever per workload first. Review VPA recommendations before enabling automatic updates widely. Document why a metric matches user experience or job throughput. Watch for oscillation after every policy change.
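The replica lever follows a simple ratio rule. The sketch below mirrors the core formula described in the Kubernetes HPA documentation, desired = ceil(current × currentMetric / targetMetric), with a tolerance band around the target. It is a simplification: the function name `desired_replicas` is hypothetical, and real HPA behavior also involves stabilization windows, pod readiness, and missing-metric handling, none of which are modeled here.

```python
import math


def desired_replicas(current, metric_value, target_value,
                     min_replicas=1, max_replicas=10, tolerance=0.1):
    """Simplified HPA ratio rule; ignores stabilization and readiness."""
    ratio = metric_value / target_value
    # Inside the tolerance band, leave the replica count alone.
    if abs(ratio - 1.0) <= tolerance:
        return current
    wanted = math.ceil(current * ratio)
    # Clamp to the configured floor and ceiling.
    return max(min_replicas, min(max_replicas, wanted))


print(desired_replicas(4, metric_value=90, target_value=60))   # scale out
print(desired_replicas(4, metric_value=63, target_value=60))   # within tolerance
print(desired_replicas(4, metric_value=30, target_value=60))   # scale in
print(desired_replicas(8, metric_value=120, target_value=60))  # capped at max
```

The same rule works for a queue-length metric: only the meaning of `metric_value` and `target_value` changes.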

Cluster Autoscaler and Karpenter add or remove nodes

When pods cannot fit, node-level autoscalers decide whether to add machines. Cluster Autoscaler reasons from node groups and pending pods. Karpenter reasons more directly from pod requirements and available instance types. Both must balance speed, price, and placement constraints. See. Node autoscaling is slower, heavier, and costlier than pod autoscaling. Now watch.

node scaling
┌────────────┐    ┌────────────┐    ┌────────────┐
│ pending pod│ -> │ autoscaler │ -> │ new node   │
└─────┬──────┘    └─────┬──────┘    └─────┬──────┘
      │                 │                 │
      v                 v                 v
  unschedulable | capacity plan | instance boots
Pending pods are the demand signal here.
Node launch time can dominate recovery during sudden traffic spikes. Warm pools or spare headroom reduce that cold-start pain. Taints, affinity, and topology constraints shape which nodes are useful. Bin-packing efficiency directly affects cloud bills at scale. Karpenter can choose richer instance mixes than fixed node groups. Still, simpler node groups are easier for some teams to operate. So what to do? Measure node startup time before promising rapid elasticity. Leave buffer capacity for rollouts and zone failures. Keep instance-family choices aligned with workload patterns. Audit why pods stayed pending before adding more node types.
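Measured startup times translate into buffer size with back-of-envelope arithmetic. The sketch below assumes a deliberately simple model: demand grows at a steady rate, and no new capacity arrives until a node has booted and a pod has started on it. The function names and all input numbers are hypothetical.

```python
def ceil_div(a, b):
    """Integer ceiling division for non-negative a and positive b."""
    return -(-a // b)


def headroom_pods(growth_pods_per_min, node_boot_s, pod_start_s):
    """Pods of spare room needed to cover demand while a new node comes up."""
    lag_s = node_boot_s + pod_start_s
    return ceil_div(growth_pods_per_min * lag_s, 60)


def spare_nodes(pods_needed, pods_per_node):
    """Whole nodes required to hold that many buffer pods."""
    return ceil_div(pods_needed, pods_per_node)


# Assumed inputs: demand adds 6 pods per minute during a spike,
# nodes take 150 s to join, pods take 30 s to become ready.
pods = headroom_pods(growth_pods_per_min=6, node_boot_s=150, pod_start_s=30)
nodes = spare_nodes(pods, pods_per_node=8)
print(pods, nodes)
```

Halving node boot time shrinks the buffer directly, which is why measuring startup time comes before promising elasticity.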

Bin packing is a cost and reliability game

Packing pods tightly saves money until it removes all failure headroom. Packing loosely feels safe until bills expose the hidden waste. Good capacity design chooses a deliberate middle, not an accidental one. The right answer depends on startup time, SLA, and workload shape. See. Utilization and resilience must be tuned together. Now watch.

packing tradeoff
┌────────────┐    ┌────────────┐    ┌────────────┐
│ tight pack │ -> │ balanced   │ -> │ loose pack │
└─────┬──────┘    └─────┬──────┘    └─────┬──────┘
      │                 │                 │
      v                 v                 v
  cheap but risky | better mix | safe but costly
Every fleet lives on this spectrum.
Bursty APIs often need buffer because user latency is unforgiving. Batch jobs can tolerate queues and denser packing much better. Topology spread constraints protect availability but reduce packing efficiency. Reserved nodes for critical traffic can coexist with opportunistic spare usage. Spot capacity improves cost, but interruption handling must be real. Measure waste per team so optimization becomes a shared conversation. So what to do? Set target utilization ranges, not vague feelings. Reserve explicit headroom for the busiest user-facing paths. Classify workloads by interruption tolerance before using spot. Revisit packing policy after architecture or traffic shifts.
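The tight-versus-loose tradeoff can be made concrete with a toy packer. The sketch below uses first-fit decreasing on a single resource (CPU millicores), which is a stand-in heuristic and not how the Kubernetes scheduler actually scores nodes. The function names, pod sizes, and node capacity are assumptions.

```python
def first_fit_decreasing(requests, capacity):
    """Pack requests onto as few equal-sized nodes as this heuristic finds."""
    nodes = []
    for req in sorted(requests, reverse=True):
        if req > capacity:
            raise ValueError("request larger than a whole node")
        for node in nodes:
            if sum(node) + req <= capacity:
                node.append(req)
                break
        else:
            nodes.append([req])  # nothing fit, open a new node
    return nodes


def utilization(nodes, capacity):
    """Requested share of total node capacity."""
    return sum(sum(node) for node in nodes) / (len(nodes) * capacity)


def survives_node_loss(nodes, capacity):
    """True if the pods of any single lost node can be re-placed on the rest."""
    for i, lost in enumerate(nodes):
        others = [list(n) for j, n in enumerate(nodes) if j != i]
        for req in sorted(lost, reverse=True):
            for node in others:
                if sum(node) + req <= capacity:
                    node.append(req)
                    break
            else:
                return False
    return True


pods = [2000, 1500, 1500, 1000, 1000, 500, 500]  # made-up requests, millicores
tight = first_fit_decreasing(pods, capacity=4000)
loose = tight + [[]]  # same packing plus one empty spare node
print(len(tight), utilization(tight, 4000), survives_node_loss(tight, 4000))
print(len(loose), round(utilization(loose, 4000), 2), survives_node_loss(loose, 4000))
```

The tight fleet looks perfect on a utilization dashboard yet cannot absorb one node failure. The spare node costs a third of the fleet and buys that recovery room.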

GPU capacity needs even more headroom discipline

GPU node startup is often slower and availability is often tighter. That means capacity mistakes surface as queues, not just mild latency. Inference fleets may need warm GPUs while training fleets can wait. Fragmentation can leave plenty of total GPUs but no usable shape. See. Rare hardware punishes lazy capacity planning much faster. Now watch.

gpu capacity
┌────────────┐    ┌────────────┐    ┌────────────┐
│ demand     │ -> │ shape      │ -> │ fleet      │
└─────┬──────┘    └─────┬──────┘    └─────┬──────┘
      │                 │                 │
      v                 v                 v
  jobs queue | fit or fail | limited pool
Total capacity and usable capacity can differ.
Track pending time by GPU model, count, and job type. Keep some serving capacity insulated from experimental jobs. Batch queue depth can guide when to buy or rent more GPUs. MIG and mixed node pools may improve fit for smaller inference jobs. Cost per served request can justify warm capacity for latency-critical paths. Explain the tradeoff between idle reserve and missed demand clearly. So what to do? Publish separate autoscaling policies for training and serving. Forecast demand by hardware shape, not only total GPU-hours. Alert on fragmentation symptoms, not just utilization averages. Review pending-job age during every capacity planning cycle.
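Fragmentation is easy to see with a few numbers. The sketch below assumes each job needs all of its GPUs on a single node, which holds for many single-node multi-GPU jobs but not for every workload. The function names and the free-GPU counts are hypothetical.

```python
def can_place(free_per_node, gpus_needed):
    """True if at least one node has enough free GPUs for the job shape."""
    return any(free >= gpus_needed for free in free_per_node)


def stranded_gpus(free_per_node, gpus_needed):
    """Free GPUs on nodes that are too fragmented to host this job shape."""
    return sum(free for free in free_per_node if free < gpus_needed)


free = [2, 1, 3, 2]  # assumed free GPUs on four nodes
print(sum(free))                                # total free capacity
print(can_place(free, 4), stranded_gpus(free, 4))  # 4-GPU job shape
print(can_place(free, 2), stranded_gpus(free, 2))  # 2-GPU job shape
```

Eight free GPUs cannot host one 4-GPU job here, so a utilization average would hide the problem. Tracking stranded GPUs per job shape is one possible fragmentation alert.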

Where this lives in the wild

  • Consumer APIs use HPA for replicas and node autoscaling for weekend or festival bursts.
  • ML serving clusters keep warm GPU headroom because cold starts hurt latency badly.
  • Batch analytics workers scale from queue lag while spot nodes absorb cheap overflow.
  • FinOps dashboards compare requested versus used resources to expose waste by team.

Pause and recall

  1. Why do requests matter before autoscaling even starts?
  2. How do HPA and VPA solve different capacity problems?
  3. Why is node autoscaling slower than pod autoscaling?
  4. What makes GPU capacity planning harsher than CPU capacity planning?

Interview Q&A

Q: Why can high utilization still mean poor capacity design?
A: The fleet may be packed so tightly that one failure causes major disruption. High utilization is good only when recovery and latency targets still hold.
Common wrong answer to avoid: “Because utilization should always stay below fifty percent.”

Q: Why is Karpenter attractive for mixed workloads?
A: It can choose instance types closer to actual pending pod needs. That flexibility can reduce waste compared with rigid node-group planning.
Common wrong answer to avoid: “Because it replaces HPA.”

Q: Why do some teams keep warm spare capacity?
A: Node startup and application startup time may be too slow for the SLA. A little idle headroom can be cheaper than missed user traffic.
Common wrong answer to avoid: “Because autoscalers are unreliable by nature.”

Q: Why is fragmentation a real GPU problem?
A: You may have enough total GPUs, but not in the shapes or locations jobs require. That turns capacity into stranded inventory instead of usable throughput.
Common wrong answer to avoid: “Because GPU drivers report the wrong numbers.”

Apply now (5 min)

Pick one API and write its steady load, burst load, and cold-start time. Now choose one HPA metric and one max replica value. Decide how much node headroom you need during a rollout. If the workload used GPUs, decide whether warm spare capacity is worth it. Finally, write one sign that the cluster is packed too tightly.

Bridge. Scaling is automated. But how do we deploy safely without downtime? → 08-rollouts-and-health.md