03. Deployments and Scaling — Desired state with moving replicas

⏱️ Estimated time: 24 min | Level: intermediate

ELI5 callback: Think of a busy shipping port. The dock manager must place every container on the right ship. Heavy ML work needs a cargo crane, and port security keeps lanes and permissions clean.

Deployments manage desired state for stateless apps

A Deployment says how many identical pods should exist right now. A ReplicaSet tracks one exact pod template revision underneath that. Keep the analogy close. The dock manager reads the manifest, the container carries one workload unit, the ship offers capacity, the cargo crane handles ML-heavy lifts, and port security blocks unsafe access. Simple, no? Change the template, and Kubernetes creates a fresh ReplicaSet revision. See. Control loops matter more than YAML appearance. Now watch.

deployment chain
┌────────────┐    ┌────────────┐    ┌────────────┐
│ git spec   │ -> │ deploy     │ -> │ pods       │
└─────┬──────┘    └─────┬──────┘    └─────┬──────┘
      │                 │                 │
      v                 v                 v
  desired state | controller loop | running set
The controller keeps reconciling until reality matches.
A Deployment is for long-running stateless workloads, not one-off jobs. ReplicaSets exist so old and new versions can coexist during rollout. Editing pods directly fights the controller and usually loses. Labels connect Deployments, ReplicaSets, and Services into one story. Revision history gives you rollback points when releases go wrong. In interviews, start here before adding autoscaling layers.

So what to do? Do not create unmanaged pods for normal application lifecycles. Choose label selectors carefully before the first release, because a Deployment's selector is immutable in apps/v1. Name deployments after the service, not the environment. Watch rollout status instead of guessing from dashboards.
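To make the control-loop idea concrete, here is a toy sketch in Python. It is not the real Deployment controller: the function names, the dict layout, and the use of SHA-256 as a stand-in for the pod-template-hash label are all assumptions made for illustration. It shows two things from the text above: a loop pass always works to close the gap between desired and running pods, and only a template change (not a replica change) leads to a different ReplicaSet.

```python
import hashlib
import json


def reconcile(desired: int, running: list) -> list:
    """Return the actions one loop pass would take to close the gap."""
    gap = desired - len(running)
    if gap > 0:
        return [f"create pod #{i + 1}" for i in range(gap)]
    if gap < 0:
        # Remove the surplus; which pods go first is a controller detail.
        return [f"delete {name}" for name in running[:-gap]]
    return []


def template_hash(pod_template: dict) -> str:
    """Hypothetical stand-in for the pod-template-hash label (real hashing differs)."""
    canonical = json.dumps(pod_template, sort_keys=True)
    return hashlib.sha256(canonical.encode()).hexdigest()[:10]


def replicaset_name(deployment: dict) -> str:
    # Only the pod template feeds the hash; the replica count does not.
    return f"{deployment['name']}-{template_hash(deployment['template'])}"


# Someone deletes a pod by hand: the next pass simply recreates it.
actions_after_manual_delete = reconcile(desired=3, running=["pod-a", "pod-b"])

v1 = {"name": "checkout", "replicas": 3, "template": {"image": "checkout:1.0"}}
scaled = {**v1, "replicas": 10}                     # same template, more pods
v2 = {**v1, "template": {"image": "checkout:1.1"}}  # new template

same_rs_when_scaled = replicaset_name(v1) == replicaset_name(scaled)
new_rs_when_template_changes = replicaset_name(v1) != replicaset_name(v2)
print(actions_after_manual_delete, same_rs_when_scaled, new_rs_when_template_changes)
```

Scaling keeps the same ReplicaSet name, while a new image produces a new one, which is what lets old and new versions coexist during a rollout.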

Rolling updates replace pods without full downtime

A rolling update gradually shifts replicas from the old template to the new one. maxSurge and maxUnavailable decide speed versus safety during that shift. Readiness probes stop traffic from hitting cold or broken pods. Bad probe design can stall an otherwise healthy rollout. See. Rollout speed is a business decision, not a vanity metric. Now watch.

rolling update
┌────────────┐    ┌────────────┐    ┌────────────┐
│ old rs     │ -> │ new rs     │ -> │ service    │
└─────┬──────┘    └─────┬──────┘    └─────┬──────┘
      │                 │                 │
      v                 v                 v
  scale down | scale up | traffic shift
Healthy pods must exist before traffic moves.
maxSurge needs spare capacity, so quotas and budgets matter. maxUnavailable controls how much risk you accept during replacement. minReadySeconds makes a new pod stay ready for that long before it counts as available, which slows the rollout until pods prove stable. progressDeadlineSeconds marks a stalled rollout as failed so it does not sit in silent limbo; it does not roll anything back by itself. PodDisruptionBudgets limit voluntary evictions such as node drains; they do not throttle a Deployment's own rolling update, which is governed by maxSurge and maxUnavailable. Database migrations must be planned separately from pod replacement.

So what to do? Keep images immutable so you can trust each rollout step. Alert when progress deadlines are exceeded or pods flap. Test the rollback path before you need it under pressure. Treat schema changes as separate release decisions.
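The capacity cost of maxSurge and the risk cost of maxUnavailable are easier to see as numbers. The sketch below is a simplified model in Python, with hypothetical function names. It relies on two documented behaviors: percentage values for maxSurge round up, percentage values for maxUnavailable round down, and both default to 25%. Edge cases such as both values resolving to zero are not modeled.

```python
def resolve(value, replicas: int, round_up: bool) -> int:
    """Turn an int or a percentage string like '25%' into a pod count."""
    if isinstance(value, int):
        return value
    scaled = int(value.rstrip("%")) * replicas
    # Integer math avoids floating-point surprises: ceil or floor of scaled / 100.
    return -(-scaled // 100) if round_up else scaled // 100


def rollout_bounds(replicas: int, max_surge="25%", max_unavailable="25%") -> dict:
    surge = resolve(max_surge, replicas, round_up=True)               # rounds up
    unavailable = resolve(max_unavailable, replicas, round_up=False)  # rounds down
    return {
        "max_total_pods": replicas + surge,            # quota you must have free
        "min_available_pods": replicas - unavailable,  # capacity floor mid-rollout
    }


bounds = rollout_bounds(10)
print(bounds)  # {'max_total_pods': 13, 'min_available_pods': 8}
```

With 10 replicas and the defaults, the rollout may briefly need room for 13 pods and may serve from as few as 8, so both quota and load headroom have to cover that window.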

Horizontal Pod Autoscaler reacts to metrics

HPA changes replica count based on observed metrics and declared targets. CPU-based scaling only makes sense when requests are set honestly. Custom metrics unlock scaling on queue length, RPS, or lag. Scaling always has delay because metrics and pods need time. See. Metrics without sensible requests create fake confidence. Now watch.

hpa loop
┌────────────┐    ┌────────────┐    ┌────────────┐
│ traffic    │ -> │ metrics    │ -> │ replicas   │
└─────┬──────┘    └─────┬──────┘    └─────┬──────┘
      │                 │                 │
      v                 v                 v
  load rises | controller reads | pods adjust
Autoscaling is feedback control, not magic.
Target utilization compares current use against requested CPU or memory. Without a resource request on a container, the HPA cannot compute a utilization percentage for that pod at all. Stabilization windows reduce flapping when traffic is noisy. HPA cannot fix a slow database or bad algorithm by itself. Scale-to-zero usually needs extra tooling or event-driven patterns. GPU workloads often need queue-based metrics instead of CPU.

So what to do? Pick one main scaling signal before adding five more. Set clear minimum and maximum replica counts. Protect downstream systems from sudden burst amplification. Track latency and saturation, not only average CPU.
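The core HPA calculation is small enough to work by hand. The Python sketch below follows the documented formula, desired = ceil(current replicas × current metric / target metric), with a tolerance band (0.1 by default) and min/max clamping. The function name and arguments are assumptions for illustration, and the real controller also handles missing metrics, unready pods, and stabilization windows, none of which are modeled here.

```python
import math


def desired_replicas(current: int, current_metric: float, target_metric: float,
                     min_replicas: int, max_replicas: int,
                     tolerance: float = 0.1) -> int:
    ratio = current_metric / target_metric
    if abs(ratio - 1.0) <= tolerance:
        return current  # close enough to target: do nothing, avoid flapping
    wanted = math.ceil(current * ratio)
    return max(min_replicas, min(max_replicas, wanted))


# 4 pods averaging 90% of requested CPU against a 60% target.
scale_up = desired_replicas(4, 90, 60, min_replicas=2, max_replicas=10)
# 63% against 60% sits inside the tolerance band, so nothing changes.
no_change = desired_replicas(4, 63, 60, min_replicas=2, max_replicas=10)
# A 5x spike wants 20 pods but is capped at the declared maximum.
capped = desired_replicas(4, 300, 60, min_replicas=2, max_replicas=10)
print(scale_up, no_change, capped)  # 6 4 10
```

Because the metric is a percentage of the request, halving a pod's CPU request doubles its reported utilization, which is why dishonest requests distort every scaling decision.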

Rollback works only when data contracts stay safe

Rollback means returning to an older pod template revision quickly. It feels easy until data migrations make old code incompatible. Feature flags and backward-compatible contracts reduce rollback pain. Versioning belongs in deployment design, not in hindsight notes. See. Safe rollback starts before the forward release begins. Now watch.

release safety
┌────────────┐    ┌────────────┐    ┌────────────┐
│ new code   │ -> │ check      │ -> │ rollback   │
└─────┬──────┘    └─────┬──────┘    └─────┬──────┘
      │                 │                 │
      v                 v                 v
  schema safe | metrics watch | undo path
Forward and backward paths need equal respect.
If a new version writes unreadable data, rollback becomes theatre. Dual-read or dual-write patterns can cushion risky transitions. Canary releases detect bad behavior before the whole fleet changes. Good alerts shorten the time between defect and reversal. Pause rollouts when error rate rises faster than expected. Keep ownership clear for both app code and platform settings.

So what to do? Make incompatible schema changes a planned migration project. Record who can approve a rollback during incidents. Keep metrics close to each rollout event on dashboards. Prefer reversible changes over heroic midnight fixes.
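The data-contract point can be reduced to one check, sketched here in Python. Everything in it is hypothetical: the idea of tagging records with an integer schema version, the function name, and the version numbers are assumptions used only to illustrate when a pod-level rollback is safe.

```python
def safe_to_roll_back(written_versions: set, old_code_reads: set) -> bool:
    """Rollback is safe only if old code can read every schema version already written."""
    return written_versions <= old_code_reads


# Old release understands only v1, but the new release has already written v2 rows.
risky = safe_to_roll_back(written_versions={1, 2}, old_code_reads={1})

# Safer sequence: first ship an old-line release that tolerates v2,
# then let the new release start writing v2.
prepared = safe_to_roll_back(written_versions={1, 2}, old_code_reads={1, 2})
print(risky, prepared)  # False True
```

The second case is the dual-read pattern in miniature: readers learn the new format one release before any writer produces it, so the undo path stays open.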

Scaling policy starts with right-sized resources

Autoscaling is worthless if every pod requests nonsense resource numbers. Over-requesting wastes money, while under-requesting creates noisy neighbors. Capacity planning must include rollout surge and zone failure headroom. CPU scaling and GPU scaling behave very differently under burst. See. Measure baselines first, then tune scaling knobs. Now watch.

capacity plan
┌────────────┐    ┌────────────┐    ┌────────────┐
│ baseline   │ -> │ burst      │ -> │ headroom   │
└─────┬──────┘    └─────┬──────┘    └─────┬──────┘
      │                 │                 │
      v                 v                 v
  steady load | peak load | failure room
Good plans assume something breaks during traffic.
Separate steady traffic from campaign spikes when sizing replicas. Spread pods across zones so one zone failure does not take down the service. Combine Deployment, HPA, and disruption rules as one system. Async workers often scale on queue lag better than CPU. Batch work may wait; user-facing latency usually cannot. Document the tradeoff between cost efficiency and fast recovery.

So what to do? Benchmark one real burst pattern before claiming autoscaling works. Reserve enough quota for surge during rollouts. Capture a simple runbook for stuck or oscillating HPA. Review requests after major code changes, not just once a year.
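The baseline, burst, and headroom steps can be worked as simple arithmetic. The Python sketch below is one possible sizing model, not a standard formula: it assumes pods are spread evenly across zones, that per-pod throughput was measured in a benchmark, and that the fleet must carry peak load with one zone lost. All names and numbers are illustrative.

```python
import math


def capacity_plan(peak_rps: int, rps_per_pod: int, zones: int, surge_pods: int) -> dict:
    if zones < 2:
        raise ValueError("zone-failure headroom needs at least two zones")
    needed = math.ceil(peak_rps / rps_per_pod)
    # After losing one zone, the remaining zones must still carry the peak.
    per_zone = math.ceil(needed / (zones - 1))
    with_headroom = per_zone * zones
    return {
        "needed_at_peak": needed,
        "replicas_with_zone_headroom": with_headroom,
        "quota_during_rollout": with_headroom + surge_pods,
    }


plan = capacity_plan(peak_rps=1200, rps_per_pod=100, zones=3, surge_pods=3)
print(plan)
# {'needed_at_peak': 12, 'replicas_with_zone_headroom': 18, 'quota_during_rollout': 21}
```

Here 12 pods cover the peak, 18 survive a zone loss, and 21 is the quota to reserve so a rollout surge does not collide with that headroom.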

Where this lives in the wild

  • Consumer apps use Deployments for stateless APIs and HPA for lunch-hour bursts.
  • Background workers scale on queue lag when campaign traffic spikes suddenly.
  • Platform teams pair rolling updates with canary analysis for safer releases.
  • Model serving fleets often use Deployments but scale from request rate, not CPU alone.

Pause and recall

  1. Why does a Deployment manage ReplicaSets instead of creating pods directly?
  2. How do maxSurge and maxUnavailable change rollout risk?
  3. Why does HPA need honest resource requests?
  4. When does rollback stop being easy in real systems?

Interview Q&A

Q: Why use a Deployment instead of manually creating pods?
A: A Deployment gives desired-state reconciliation, rollout history, and self-healing behavior. Manual pods disappear or drift without a controller to restore intent.
Common wrong answer to avoid: “Because kubectl apply prefers Deployments.”

Q: Why can an HPA still fail even when CPU crosses the target?
A: Scaling reacts after metrics arrive, and new pods still need startup time. If the bottleneck is elsewhere, extra pods may not improve latency.
Common wrong answer to avoid: “Because the HPA formula is inaccurate.”

Q: Why is rollback harder after database changes?
A: Old code may not understand new writes or schema expectations. That means application rollback and data rollback become separate problems.
Common wrong answer to avoid: “Because Kubernetes cannot store previous revisions.”

Q: Why should rollout settings be discussed with product and SRE teams?
A: Those settings trade safety, speed, capacity, and user-visible risk. They are operational policy choices, not mere YAML decoration.
Common wrong answer to avoid: “Because only SRE teams care about maxSurge.”

Apply now (5 min)

Pick one API service and write its desired replicas for normal load. Add one burst estimate and decide a safe max replica count. Choose maxSurge and maxUnavailable with one sentence of reasoning. Now decide which metric should drive HPA for this service. Finally, name one change that would make rollback dangerous.

Bridge. Scaling works for CPU. But what about GPU workloads? → 04-gpu-scheduling-node-pools.md