08. Rollouts and Health — Change fast without breaking trust

⏱️ Estimated time: 23 min | Level: advanced

ELI5 callback: Think of a busy shipping port. The dock manager must place every container on the right ship. Heavy ML work needs a cargo crane, and port security keeps lanes and permissions clean.

Probes tell Kubernetes whether traffic should flow

Liveness, readiness, and startup probes answer different operational questions. Startup asks whether the application has finished booting; until it succeeds, Kubernetes holds back the other two probes. Readiness asks whether traffic should enter this instance right now. Liveness asks whether the process is stuck and should restart. Keep the analogy close: the dock manager reads the manifest, the container carries one workload unit, the ship offers capacity, the cargo crane handles ML-heavy lifts, and port security blocks unsafe access. Simple, no? See. One bad probe can be louder than a real application bug: a failing liveness probe restarts healthy pods, and a failing readiness probe pulls them out of rotation. Now watch.

probe roles
┌────────────┐    ┌────────────┐    ┌────────────┐
│ startup    │ -> │ readiness  │ -> │ liveness   │
└─────┬──────┘    └─────┬──────┘    └─────┬──────┘
      │                 │                 │
      v                 v                 v
  cold boot | serve or wait | restart or keep
Each probe protects a different moment.
Startup probes shield slow boot paths from premature liveness restarts. Readiness should fail when dependencies make serving unsafe or useless. Liveness should detect deadlock, not temporary downstream slowness. A cheap health check is almost always better than a heavy one, because probes run repeatedly on every pod. Probe timing must match real startup and warmup behavior: a startup probe tolerates roughly failureThreshold × periodSeconds of boot time. Probe failures should create logs and metrics teams can actually see. So what to do? Keep readiness endpoints fast and dependency-aware. Do not point liveness at expensive database queries. Measure cold-start time before setting probe thresholds. Review probe behavior after every major runtime change.
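The three roles above can be sketched in plain Python. This is a minimal illustration, not a real framework: the HealthState class, its field names, and the 30-second stall threshold are all assumptions made up for this example. Each method returns the HTTP status a probe endpoint would send.

```python
import time
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class HealthState:
    """Hypothetical in-process state read by three probe handlers."""
    started: bool = False          # set True once cold boot (e.g. model load) finishes
    dependencies_ok: bool = True   # set False when serving would be unsafe or useless
    shutting_down: bool = False    # set True when shutdown begins
    stall_seconds: float = 30.0    # assumed threshold for "the work loop is stuck"
    last_heartbeat: float = field(default_factory=time.monotonic)

    def beat(self) -> None:
        """Called by the main work loop to prove it is still making progress."""
        self.last_heartbeat = time.monotonic()

    def startup(self) -> int:
        # Startup: has boot finished?
        return 200 if self.started else 503

    def readiness(self) -> int:
        # Readiness: should traffic enter right now? Dependency-aware, but cheap.
        ok = self.started and self.dependencies_ok and not self.shutting_down
        return 200 if ok else 503

    def liveness(self, now: Optional[float] = None) -> int:
        # Liveness: is the process stuck? Deliberately ignores dependencies,
        # so downstream slowness never triggers a restart.
        now = time.monotonic() if now is None else now
        return 200 if now - self.last_heartbeat < self.stall_seconds else 503


def startup_budget_seconds(period_seconds: int, failure_threshold: int) -> int:
    """Boot time a startup probe tolerates before the container is restarted."""
    return period_seconds * failure_threshold
```

Note that a broken dependency flips readiness to 503 while liveness stays at 200, which stops traffic without causing a restart loop. With periodSeconds=10 and failureThreshold=30, the startup budget is 300 seconds, so a measured cold start should sit comfortably below that.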

Rolling updates need health gates and surge planning

A rollout is safe only when new pods become healthy before old ones leave. That means readiness, surge capacity, and observability must cooperate. maxSurge sets how many extra pods may run above the desired count, which speeds the change; maxUnavailable caps how many pods may be out of service at once, which limits pain. Without health gates, a rollout is just coordinated luck. See. Healthy replacement beats fast replacement. Now watch.

safe rollout
┌────────────┐    ┌────────────┐    ┌────────────┐
│ new pod    │ -> │ ready check│ -> │ traffic    │
└─────┬──────┘    └─────┬──────┘    └─────┬──────┘
      │                 │                 │
      v                 v                 v
  boot first | gate open | shift users
Traffic should move after health is proven.
Set rollout speed according to user impact and spare capacity. The minReadySeconds field makes a new pod stay ready for a set time before it counts as available, which catches pods that fail shortly after start. The progressDeadlineSeconds field flags a stalled rollout as failed to progress, so stuck releases surface before humans notice late. PodDisruptionBudgets do not throttle the rollout itself, but pods the rollout takes down still count against the budget, so node drains can stall during a busy release. Metrics should be time-aligned with each rollout step. Canary analysis is stronger when rollback can happen automatically. So what to do? Budget extra capacity for rollout surge and health checking. Pause the rollout when error budgets burn too quickly. Keep dashboards annotated with deployment events. Use immutable images so each step is auditable.
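The surge budget is simple arithmetic, sketched below. The function name rollout_bounds is hypothetical. The rounding follows the documented Deployment behavior: a percentage maxSurge is rounded up, a percentage maxUnavailable is rounded down, and both default to 25%.

```python
def rollout_bounds(replicas: int,
                   max_surge_pct: int = 25,
                   max_unavailable_pct: int = 25) -> dict:
    """Pod-count envelope for a rolling update with percentage settings."""
    # Integer arithmetic avoids floating-point surprises.
    surge = -(-replicas * max_surge_pct // 100)          # ceiling division
    unavailable = replicas * max_unavailable_pct // 100  # floor division
    return {
        "max_total_pods": replicas + surge,              # capacity to budget for
        "min_available_pods": replicas - unavailable,    # floor during the rollout
    }
```

For 10 replicas at the defaults, the cluster must fit up to 13 pods at once while at least 8 stay available. For 3 replicas, the same settings allow 4 pods and no drop below 3, so spare capacity for one extra pod is a hard requirement for the rollout to move at all.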

Canary and blue-green are different safety shapes

Canary sends a small slice of traffic to the new version first. Blue-green keeps two complete environments and switches the router cleanly. Canary is gradual; blue-green is decisive but doubles capacity temporarily. Choose the pattern that matches risk, state, and rollback needs. See. Release style should match failure style. Now watch.

release shapes
┌────────────┐    ┌────────────┐    ┌────────────┐
│ baseline   │ -> │ canary     │ -> │ blue-green │
└─────┬──────┘    └─────┬──────┘    └─────┬──────┘
      │                 │                 │
      v                 v                 v
  gradual risk | small sample | full switch
Both patterns trade cost for confidence differently.
Canary works well when metrics detect regressions quickly. Blue-green works well when environment parity is trustworthy. Stateful dependencies can make blue-green harder than it looks. Feature flags can complement either release pattern nicely. Traffic splitting and health checks are the practical backbone. Do not copy a pattern just because the internet calls it best. So what to do? Pick one success metric before choosing the release style. Define rollback trigger points ahead of time. Ensure both versions can handle current data contracts. Measure extra capacity cost for each safer release option.
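One way to define rollback trigger points ahead of time is to write them down as a small function. The sketch below is illustrative only: VersionStats, canary_verdict, and every threshold (500 requests, one percentage point of extra errors, 1.2x p95 latency) are assumptions chosen for the example, not recommended values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VersionStats:
    """Hypothetical per-version metrics for one observation window."""
    requests: int
    errors: int
    p95_latency_ms: float

    @property
    def error_rate(self) -> float:
        return self.errors / self.requests if self.requests else 0.0


def canary_verdict(baseline: VersionStats,
                   canary: VersionStats,
                   min_requests: int = 500,
                   max_error_delta: float = 0.01,
                   max_latency_ratio: float = 1.2) -> str:
    """Return 'hold', 'rollback', or 'promote' for the canary slice."""
    # Too little traffic: the sample cannot support a decision either way.
    if canary.requests < min_requests:
        return "hold"
    # Trigger 1: canary fails noticeably more often than the baseline.
    if canary.error_rate - baseline.error_rate > max_error_delta:
        return "rollback"
    # Trigger 2: canary is noticeably slower than the baseline.
    if canary.p95_latency_ms > baseline.p95_latency_ms * max_latency_ratio:
        return "rollback"
    return "promote"
```

Comparing the canary against the live baseline, rather than against a fixed number, keeps the verdict meaningful when both versions degrade together because of a shared dependency.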

Graceful shutdown protects users during scale down

Stopping safely matters as much as starting safely in distributed systems. SIGTERM, readiness drop, and connection draining must happen in order. Grace periods should match real in-flight request durations. Shutdown bugs often hide until peak load or node drain events. See. A pod should leave traffic before the process leaves memory. Now watch.

shutdown order
┌────────────┐    ┌────────────┐    ┌────────────┐
│ sigterm    │ -> │ not ready  │ -> │ exit       │
└─────┬──────┘    └─────┬──────┘    └─────┬──────┘
      │                 │                 │
      v                 v                 v
  start drain | stop new work | finish old work
Order matters more than speed here.
Readiness should fail quickly once the app begins shutdown. PreStop hooks can help, but they run before SIGTERM is delivered and share the same termination grace period, so they should stay predictable and short. Long-running requests need either a longer grace period or a resumable design. Workers should stop accepting new jobs before process exit. Node drains and spot interruptions exercise the same shutdown path. Connection draining must also align with load balancer behavior: Kubernetes removes a terminating pod from Service endpoints in parallel with signalling it, so a few requests can still arrive after SIGTERM. So what to do? Test graceful shutdown under real traffic patterns. Keep grace periods based on evidence, not folklore. Make worker checkpoints or idempotency part of job design. Document which layer owns connection draining.
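The shutdown order can be sketched with the standard library alone. GracefulServer and its method names are hypothetical, and the 25-second drain window is an assumed value that would need to fit inside the pod's termination grace period. The handler only flips flags, which keeps it safe to run inside a signal handler.

```python
import signal
import threading
import time


class GracefulServer:
    """Hypothetical request tracker that drains before exit."""

    def __init__(self, grace_seconds: float = 25.0):
        self.ready = True                 # what the readiness endpoint reports
        self.in_flight = 0                # requests currently being served
        self.grace_seconds = grace_seconds
        self._lock = threading.Lock()
        self._stopping = threading.Event()

    def handle_sigterm(self, signum=None, frame=None) -> None:
        # Step 1: fail readiness and refuse new work. No locks taken here.
        self.ready = False
        self._stopping.set()

    def try_accept(self) -> bool:
        """Admit a request unless shutdown has begun."""
        with self._lock:
            if self._stopping.is_set():
                return False
            self.in_flight += 1
            return True

    def finish(self) -> None:
        """Mark one admitted request as complete."""
        with self._lock:
            self.in_flight -= 1

    def drain(self, poll_seconds: float = 0.01) -> bool:
        # Step 2: wait for old work to finish, bounded by the grace window.
        deadline = time.monotonic() + self.grace_seconds
        while time.monotonic() < deadline:
            with self._lock:
                if self.in_flight == 0:
                    return True
            time.sleep(poll_seconds)
        return False  # grace window expired with work still running


def install(server: GracefulServer) -> None:
    """Register the handler; must be called from the main thread."""
    signal.signal(signal.SIGTERM, server.handle_sigterm)
```

A main loop would call install(server) at boot, then run drain() after the stop flag is set and exit only when it returns. A False result from drain() is worth logging, because it means the grace period is shorter than real in-flight request durations.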

Observability decides whether safe rollout claims are real

You cannot call a rollout safe if you cannot see regressions quickly. Metrics, logs, traces, and deployment markers need to line up. Error rate alone is a weak signal, because latency and saturation can rise well before requests start failing. Good rollback is mostly fast detection plus clear ownership. See. Confidence comes from signals, not from hope. Now watch.

decision loop
┌────────────┐    ┌────────────┐    ┌────────────┐
│ deploy     │ -> │ observe    │ -> │ decide     │
└─────┬──────┘    └─────┬──────┘    └─────┬──────┘
      │                 │                 │
      v                 v                 v
  event mark | slo watch | continue or undo
Releases need a visible feedback loop.
Track user-facing latency, error rate, and resource saturation together. Separate rollout dashboards by service, version, and environment. Log build id and git sha so correlation is obvious. Alerting should distinguish probe noise from real customer pain. Postmortems should include whether health gates behaved as designed. Boring release runbooks beat heroic intuition every time. So what to do? Annotate all dashboards with rollout and rollback events. Tie alarms to SLOs, not just infrastructure symptoms. Keep one clear rollback owner for each service. Review false-positive probe and alert noise monthly.
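The "continue or undo" step can be made explicit by tying it to an SLO. The sketch below uses the common burn-rate definition, observed error rate divided by the error budget (1 minus the SLO target). The function names and every threshold (burn rate of 10, 90% CPU) are assumptions for illustration.

```python
from typing import List, Tuple


def burn_rate(error_rate: float, slo_target: float) -> float:
    """How many times faster than sustainable the error budget is being spent."""
    budget = 1.0 - slo_target
    if budget <= 0:
        raise ValueError("slo_target must be below 1.0")
    return error_rate / budget


def rollout_decision(error_rate: float,
                     slo_target: float,
                     p95_latency_ms: float,
                     latency_slo_ms: float,
                     cpu_utilization: float,
                     max_burn: float = 10.0,
                     max_cpu: float = 0.9) -> Tuple[str, List[str]]:
    """Combine errors, latency, and saturation into one decision with reasons."""
    reasons = []
    if burn_rate(error_rate, slo_target) > max_burn:
        reasons.append("error budget burn")
    if p95_latency_ms > latency_slo_ms:
        reasons.append("latency")
    if cpu_utilization > max_cpu:
        reasons.append("saturation")
    # Any single breached signal is enough to undo; the reasons list
    # tells the rollback owner which signal fired.
    return ("undo" if reasons else "continue", reasons)
```

Returning the reasons alongside the decision gives the postmortem a record of which gate fired, and makes it easier to tell probe noise from real customer pain.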

Where this lives in the wild

  • Product APIs use readiness probes and rolling updates for everyday releases.
  • SRE teams use canaries with mesh or ingress traffic splitting on risky changes.
  • Batch workers need graceful shutdown to avoid dropping in-flight work.
  • Model serving stacks rely on startup probes because large models load slowly.

Pause and recall

  1. How do liveness, readiness, and startup probes answer different questions?
  2. Why is surge capacity part of rollout safety?
  3. When is canary a better fit than blue-green?
  4. Why must graceful shutdown drop readiness before process exit?

Interview Q&A

Q: Why should readiness and liveness rarely hit the same endpoint logic?
A: They answer different operational questions and should fail for different reasons. Mixing them often creates restart loops for problems that only needed traffic to stop.
Common wrong answer to avoid: “Because Kubernetes requires unique probe URLs.”

Q: Why can blue-green still be risky even with two environments?
A: Data contracts, background jobs, and hidden dependencies may still be shared. A clean router switch does not magically isolate every side effect.
Common wrong answer to avoid: “Because blue-green only works on stateless apps.”

Q: Why is graceful shutdown part of reliability, not just politeness?
A: Without it, scale-down and node-drain events drop user requests or work in progress. That turns normal operations into customer-visible failures.
Common wrong answer to avoid: “Because it helps logs flush nicely.”

Q: Why are deployment markers useful on dashboards?
A: They let teams line up behavioral changes with the exact rollout step. That speeds both rollback decisions and post-incident explanation.
Common wrong answer to avoid: “Because auditors like screenshots.”

Apply now (5 min)

Take one service and write one startup, readiness, and liveness rule. Now pick canary or blue-green and justify the choice. Describe the exact order of events during graceful shutdown. List three metrics you would watch during rollout minute one. Finally, choose the rollback trigger that would stop the release.

Bridge. Deployments safe. What don't we fully understand about K8s? → 09 → 09-honest-admission.md