03. Deployments and Scaling: Desired state with moving replicas
⏱️ Estimated time: 24 min | Level: intermediate
ELI5 callback: Think of a busy shipping port. The dock manager must place every container on the right ship. Heavy ML work needs a cargo crane, and port security keeps lanes and permissions clean.
Deployments manage desired state for stateless apps
A Deployment declares how many identical pods should exist and which pod template they run. Underneath it, a ReplicaSet owns the pods for one exact revision of that template. Keep the analogy close: the dock manager reads the manifest, the container carries one workload unit, the ship offers capacity, the cargo crane handles ML-heavy lifts, and port security blocks unsafe access. Change the pod template, and Kubernetes creates a fresh ReplicaSet for the new revision. Change only the replica count, and the existing ReplicaSet is scaled instead. The control loop matters more than how the YAML looks.
deployment chain
┌────────────┐    ┌────────────┐    ┌────────────┐
│  git spec  │ -> │   deploy   │ -> │    pods    │
└─────┬──────┘    └─────┬──────┘    └─────┬──────┘
      │                 │                 │
      v                 v                 v
desired state  | controller loop |   running set
The controller keeps reconciling until reality matches.
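The reconciliation idea above can be sketched in a few lines of Python. This is a toy model, not the real Deployment controller: the `Cluster` class, the `reconcile` function, and the pod naming scheme are all hypothetical names invented for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Cluster:
    """Toy stand-in for cluster state: a list of running pod names."""
    pods: list = field(default_factory=list)
    next_id: int = 0


def reconcile(cluster: Cluster, desired_replicas: int) -> list:
    """Run one pass of the loop and return the actions it took."""
    actions = []
    # Too few pods: create until the running set matches the desired count.
    while len(cluster.pods) < desired_replicas:
        name = f"pod-{cluster.next_id}"
        cluster.next_id += 1
        cluster.pods.append(name)
        actions.append(("create", name))
    # Too many pods: delete the newest ones first (an arbitrary choice here).
    while len(cluster.pods) > desired_replicas:
        actions.append(("delete", cluster.pods.pop()))
    return actions


cluster = Cluster()
print(reconcile(cluster, 3))   # three create actions
cluster.pods.remove("pod-1")   # simulate a pod that crashed or was deleted
print(reconcile(cluster, 3))   # one create action restores the count
print(reconcile(cluster, 3))   # [] because reality already matches intent
```

The point of the sketch is that the loop compares intent with reality on every pass, so a lost pod is replaced without anyone re-applying the spec.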
Rolling updates replace pods without full downtime
A rolling update gradually shifts replicas from the old pod template to the new one. maxSurge caps how many extra pods may exist above the desired count, and maxUnavailable caps how many may be missing below it. Together they trade rollout speed against safety. Readiness probes keep traffic away from cold or broken pods. A badly designed probe can stall an otherwise healthy rollout, because old pods are retired only as new pods report ready. Rollout speed is a business decision, not a vanity metric.
rolling update
┌────────────┐    ┌────────────┐    ┌────────────┐
│   old rs   │ -> │   new rs   │ -> │  service   │
└─────┬──────┘    └─────┬──────┘    └─────┬──────┘
      │                 │                 │
      v                 v                 v
 scale down    |    scale up     |  traffic shift
Healthy pods must exist before traffic moves.
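To make the speed-versus-safety trade concrete, here is a small Python sketch that turns maxSurge and maxUnavailable into pod-count bounds. The helper names `resolve` and `rollout_bounds` are hypothetical. The rounding follows the documented Kubernetes behavior: a percentage maxSurge rounds up and a percentage maxUnavailable rounds down, and both default to 25%.

```python
def resolve(value, replicas: int, round_up: bool) -> int:
    """Turn an absolute count or a percentage string like '25%' into pods."""
    if isinstance(value, str) and value.endswith("%"):
        pct = int(value[:-1])
        if round_up:
            return -(-replicas * pct // 100)   # integer ceiling
        return replicas * pct // 100           # integer floor
    return int(value)


def rollout_bounds(replicas: int, max_surge="25%", max_unavailable="25%") -> dict:
    """Upper bound on total pods and lower bound on available pods mid-rollout."""
    surge = resolve(max_surge, replicas, round_up=True)
    unavailable = resolve(max_unavailable, replicas, round_up=False)
    return {
        "max_total_pods": replicas + surge,
        "min_available_pods": replicas - unavailable,
    }


print(rollout_bounds(10))         # {'max_total_pods': 13, 'min_available_pods': 8}
print(rollout_bounds(10, 1, 0))   # {'max_total_pods': 11, 'min_available_pods': 10}
```

The second call shows a cautious setting: one pod at a time and never below full capacity, which is slower but needs room for only one extra pod.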
Horizontal Pod Autoscaler reacts to metrics
The Horizontal Pod Autoscaler (HPA) adjusts the replica count by comparing observed metrics against declared targets. CPU utilization is measured as a percentage of each pod's CPU request, so CPU-based scaling only makes sense when requests are set honestly. Custom metrics unlock scaling on queue length, requests per second (RPS), or consumer lag. Scaling always lags demand, because metrics take time to arrive and new pods take time to start. Metrics without sensible requests create false confidence.
hpa loop
┌────────────┐    ┌────────────┐    ┌────────────┐
│  traffic   │ -> │  metrics   │ -> │  replicas  │
└─────┬──────┘    └─────┬──────┘    └─────┬──────┘
      │                 │                 │
      v                 v                 v
 load rises    | controller reads |  pods adjust
Autoscaling is feedback control, not magic.
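The feedback loop can be sketched with the scaling rule from the Kubernetes HPA documentation: desired replicas = ceil(current replicas × current metric / target metric), skipped when the ratio is within a tolerance (0.1 by default). The function names, the min/max defaults, and the millicore figures below are illustrative assumptions, and the real controller also handles missing metrics, unready pods, and stabilization windows that this sketch ignores.

```python
import math


def cpu_utilization_pct(usage_millicores: int, request_millicores: int) -> float:
    """CPU utilization as the HPA sees it: usage relative to the request."""
    return 100 * usage_millicores / request_millicores


def hpa_desired_replicas(current_replicas, current_value, target_value,
                         min_replicas=1, max_replicas=10, tolerance=0.1):
    """One evaluation of the core HPA rule, clamped to the replica bounds."""
    ratio = current_value / target_value
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas            # close enough to target: no change
    desired = math.ceil(current_replicas * ratio)
    return max(min_replicas, min(max_replicas, desired))


# The same 300m of real usage looks very different under different requests.
print(cpu_utilization_pct(300, 1000))      # 30.0 (inflated request hides load)
print(cpu_utilization_pct(300, 500))       # 60.0

print(hpa_desired_replicas(4, 90, 60))     # 6
print(hpa_desired_replicas(4, 63, 60))     # 4 (within tolerance)
print(hpa_desired_replicas(8, 120, 60))    # 10 (capped at max_replicas)
```

The first two lines show why honest requests matter: an inflated request halves the reported utilization, so the autoscaler never sees the target being crossed.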
Rollback works only when data contracts stay safe
Rollback means quickly returning to an older pod template revision. It feels easy until a data migration leaves the old code unable to read what the new code wrote. Feature flags and backward-compatible contracts reduce rollback pain. Versioning belongs in deployment design, not in hindsight notes. Safe rollback starts before the forward release begins.
release safety
┌────────────┐    ┌────────────┐    ┌────────────┐
│  new code  │ -> │   check    │ -> │  rollback  │
└─────┬──────┘    └─────┬──────┘    └─────┬──────┘
      │                 │                 │
      v                 v                 v
 schema safe   |  metrics watch  |    undo path
Forward and backward paths need equal respect.
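One way to picture a backward-compatible contract is the expand-then-contract pattern, sketched below in Python. Everything here is hypothetical: the record shape, the field names, and the `rollback_is_safe` check are invented for illustration and are not part of any Kubernetes API. The assumption is that the previous revision needs the `id` and `name` fields to function.

```python
# Fields the previous code revision reads (an assumption for this example).
OLD_CODE_REQUIRES = {"id", "name"}


def write_expand(user_id: int, first: str, last: str) -> dict:
    """Expand phase: add the new fields but keep writing the old one."""
    return {
        "id": user_id,
        "name": f"{first} {last}",   # still present for the old revision
        "first_name": first,
        "last_name": last,
    }


def write_contract(user_id: int, first: str, last: str) -> dict:
    """Contract phase: the old field is dropped once no reader needs it."""
    return {"id": user_id, "first_name": first, "last_name": last}


def rollback_is_safe(record: dict, old_requires=OLD_CODE_REQUIRES) -> bool:
    """Old code can run against this record only if its fields all exist."""
    return old_requires.issubset(record.keys())


print(rollback_is_safe(write_expand(1, "Ada", "Lovelace")))    # True
print(rollback_is_safe(write_contract(1, "Ada", "Lovelace")))  # False
```

While the release stays in the expand phase, undoing the Deployment is enough. Once the contract phase ships, rolling the pods back no longer restores a working system, which is why the two phases are usually separate releases.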
Scaling policy starts with right-sized resources
Autoscaling cannot help if pods request arbitrary resource numbers. Over-requesting wastes money by reserving capacity nobody uses, while under-requesting packs nodes too tightly and creates noisy neighbors. Capacity planning must include headroom for rollout surge and for losing a zone. CPU scaling and GPU scaling behave very differently under burst. Measure baselines first, then tune scaling knobs.
capacity plan
┌────────────┐    ┌────────────┐    ┌────────────┐
│  baseline  │ -> │   burst    │ -> │  headroom  │
└─────┬──────┘    └─────┬──────┘    └─────┬──────┘
      │                 │                 │
      v                 v                 v
 steady load   |    peak load    |  failure room
Good plans assume something breaks during traffic.
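The baseline, burst, and headroom steps can be turned into rough arithmetic. The Python sketch below is one possible way to do it under stated assumptions: pods are spread evenly across zones, the plan should survive the loss of one zone at peak load, and rollout surge is a percentage rounded up. The function name `plan_replicas` and the example numbers are hypothetical.

```python
import math


def plan_replicas(peak_rps: float, rps_per_pod: float,
                  zones: int = 3, max_surge_pct: int = 25) -> dict:
    """Estimate replica counts with zone-failure and rollout headroom."""
    # Pods needed to serve peak load with every zone healthy.
    base = math.ceil(peak_rps / rps_per_pod)
    # If one zone fails, the remaining zones must still hold `base` pods.
    per_zone = math.ceil(base / (zones - 1))
    replicas = per_zone * zones
    # Extra pods that may exist briefly during a rolling update.
    surge = math.ceil(replicas * max_surge_pct / 100)
    return {
        "base": base,
        "per_zone": per_zone,
        "replicas": replicas,
        "rollout_peak": replicas + surge,
    }


print(plan_replicas(peak_rps=1800, rps_per_pod=100))
# {'base': 18, 'per_zone': 9, 'replicas': 27, 'rollout_peak': 34}
```

In this example 18 pods cover the peak, 27 cover the peak with one of three zones down, and the cluster needs room for 34 pods while a rollout is in flight. The per-pod throughput figure has to come from a measured baseline, not a guess.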
Where this lives in the wild
- Consumer apps use Deployments for stateless APIs and HPA for lunch-hour bursts.
- Background workers scale on queue lag when campaign traffic spikes suddenly.
- Platform teams pair rolling updates with canary analysis for safer releases.
- Model serving fleets often use Deployments but scale from request rate, not CPU alone.
Pause and recall
- Why does a Deployment manage pods through ReplicaSets instead of creating pods directly?
- How do maxSurge and maxUnavailable change rollout risk?
- Why does HPA need honest resource requests?
- When does rollback stop being easy in real systems?
Interview Q&A
Q: Why use a Deployment instead of manually creating pods?
A: A Deployment gives desired-state reconciliation, rollout history, and self-healing behavior. Manual pods disappear or drift without a controller to restore intent.
Common wrong answer to avoid: “Because kubectl apply prefers Deployments.”

Q: Why can an HPA still fail even when CPU crosses the target?
A: Scaling reacts after metrics arrive, and new pods still need startup time. If the bottleneck is elsewhere, extra pods may not improve latency.
Common wrong answer to avoid: “Because the HPA formula is inaccurate.”

Q: Why is rollback harder after database changes?
A: Old code may not understand new writes or schema expectations. That means application rollback and data rollback become separate problems.
Common wrong answer to avoid: “Because Kubernetes cannot store previous revisions.”

Q: Why should rollout settings be discussed with product and SRE teams?
A: Those settings trade safety, speed, capacity, and user-visible risk. They are operational policy choices, not mere YAML decoration.
Common wrong answer to avoid: “Because only SRE teams care about maxSurge.”
Apply now (5 min)
Pick one API service and write its desired replicas for normal load. Add one burst estimate and decide a safe max replica count. Choose maxSurge and maxUnavailable with one sentence of reasoning. Now decide which metric should drive HPA for this service. Finally, name one change that would make rollback dangerous.
Bridge. Scaling works for CPU. But what about GPU workloads? → 04-gpu-scheduling-node-pools.md