08. Rollouts and Health: Change fast without breaking trust
⏱️ Estimated time: 23 min | Level: advanced
ELI5 callback: Think of a busy shipping port. The dock manager must place every container on the right ship. Heavy ML work needs a cargo crane, and port security keeps lanes and permissions clean.
Probes tell Kubernetes whether traffic should flow
Liveness, readiness, and startup probes answer different operational questions. Startup asks whether the process has finished booting; until it succeeds, Kubernetes holds off the other two probes. Readiness asks whether traffic should enter this instance right now. Liveness asks whether the process is stuck and should restart. Keep the analogy close: the dock manager reads the manifest, the container carries one workload unit, the ship offers capacity, the cargo crane handles ML-heavy lifts, and port security blocks unsafe access. See. One bad probe can be louder than a real application bug. Now watch.
probe roles
┌────────────┐    ┌────────────┐    ┌────────────┐
│  startup   │ -> │ readiness  │ -> │  liveness  │
└─────┬──────┘    └─────┬──────┘    └─────┬──────┘
      │                 │                 │
      v                 v                 v
  cold boot    |  serve or wait  | restart or keep
Each probe protects a different moment.
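A minimal sketch of the three probe roles in Python, assuming a service that exposes one HTTP path per probe. The paths `/startupz`, `/readyz`, and `/livez`, the `ProbeState` class, and port 8081 are illustrative names chosen here; the real paths are whatever the pod spec points at. Kubernetes treats an HTTP status from 200 to 399 as a pass, so 503 reads as a failure.

```python
import threading
import time
from http.server import BaseHTTPRequestHandler, HTTPServer


class ProbeState:
    """Tracks the three facts the probes ask about (all names are illustrative)."""

    def __init__(self, stall_after_s: float = 30.0):
        self.started = False      # flipped once boot work (e.g. model load) is done
        self.accepting = False    # flipped when dependencies are usable
        self.stall_after_s = stall_after_s
        self.last_heartbeat = time.monotonic()

    def beat(self) -> None:
        """Called by the main work loop to show it is still making progress."""
        self.last_heartbeat = time.monotonic()

    def status_for(self, path: str) -> int:
        """Map a probe path to an HTTP status: 200 passes, 503 fails."""
        if path == "/startupz":
            return 200 if self.started else 503
        if path == "/readyz":
            return 200 if (self.started and self.accepting) else 503
        if path == "/livez":
            stalled = time.monotonic() - self.last_heartbeat > self.stall_after_s
            return 503 if stalled else 200
        return 404


def make_handler(state: ProbeState):
    class ProbeHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            self.send_response(state.status_for(self.path))
            self.end_headers()

        def log_message(self, *args):
            pass  # keep probe traffic out of stderr

    return ProbeHandler


def serve_probes(state: ProbeState, port: int = 8081) -> HTTPServer:
    """Start the probe server on a background thread and return it."""
    server = HTTPServer(("0.0.0.0", port), make_handler(state))
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

The design choice worth noticing: losing a dependency flips `accepting` and fails only readiness, so traffic stops without a restart. Liveness fails only when the work loop stops calling `beat()`.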
Rolling updates need health gates and surge planning
A rollout is safe only when new pods become healthy before old ones leave. That means readiness, surge capacity, and observability must cooperate. maxSurge sets how many extra pods may run above the desired count, which speeds the change, while maxUnavailable caps how many pods may be out of service at once, which limits the pain. Without health gates, a rollout is just coordinated luck. See. Healthy replacement beats fast replacement. Now watch.
safe rollout
┌─────────────┐    ┌─────────────┐    ┌─────────────┐
│   new pod   │ -> │ ready check │ -> │   traffic   │
└──────┬──────┘    └──────┬──────┘    └──────┬──────┘
       │                  │                  │
       v                  v                  v
  boot first    |     gate open    |    shift users
Traffic should move after health is proven.
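The surge arithmetic can be sketched in a few lines of Python. This assumes the Deployment rolling-update rule that a percentage maxSurge rounds up and a percentage maxUnavailable rounds down; the function names `resolve` and `rollout_bounds` are made up for this sketch.

```python
from typing import Dict, Union


def resolve(value: Union[int, str], replicas: int, round_up: bool) -> int:
    """Turn an absolute count or a percentage string like '25%' into a pod count."""
    if isinstance(value, str) and value.endswith("%"):
        scaled = replicas * int(value[:-1])
        # Integer math avoids float surprises: ceil for surge, floor for unavailable.
        return -(-scaled // 100) if round_up else scaled // 100
    return int(value)


def rollout_bounds(
    replicas: int,
    max_surge: Union[int, str] = "25%",
    max_unavailable: Union[int, str] = "25%",
) -> Dict[str, int]:
    """Return the pod-count ceiling and the availability floor during a rollout."""
    surge = resolve(max_surge, replicas, round_up=True)
    unavailable = resolve(max_unavailable, replicas, round_up=False)
    if surge == 0 and unavailable == 0:
        # With no room to add and no room to remove, the rollout cannot move.
        raise ValueError("max_surge and max_unavailable cannot both be zero")
    return {
        "max_total_pods": replicas + surge,
        "min_available_pods": replicas - unavailable,
    }


print(rollout_bounds(10))
```

With 10 replicas and 25% for both settings, the cluster needs room for 13 pods at the peak and never drops below 8 available. If the nodes cannot fit the extra 3, the rollout stalls, which is why surge capacity belongs in the safety plan.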
Canary and blue-green are different safety shapes
Canary sends a small slice of traffic to the new version first. Blue-green keeps two complete environments and switches the router cleanly. Canary is gradual; blue-green is decisive but doubles capacity temporarily. Choose the pattern that matches risk, state, and rollback needs. See. Release style should match failure style. Now watch.
release shapes
┌────────────┐    ┌────────────┐    ┌────────────┐
│  baseline  │ -> │   canary   │ -> │ blue-green │
└─────┬──────┘    └─────┬──────┘    └─────┬──────┘
      │                 │                 │
      v                 v                 v
 gradual risk  |  small sample   |   full switch
Both patterns trade cost for confidence differently.
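To make the two shapes concrete, here is a toy router in Python. It is a sketch, not how any particular mesh or ingress splits traffic: `bucket`, `canary_route`, and `BlueGreenRouter` are hypothetical names, and hashing the user id is just one assumed way to keep each user on a stable version.

```python
import hashlib


def bucket(user_id: str) -> int:
    """Stable bucket in [0, 100) so a user keeps seeing the same version."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def canary_route(user_id: str, canary_percent: int) -> str:
    """Canary: a small, adjustable slice of users reaches the new version."""
    return "canary" if bucket(user_id) < canary_percent else "stable"


class BlueGreenRouter:
    """Blue-green: every user follows one pointer, and the pointer flips at once."""

    def __init__(self, live: str = "blue"):
        self.live = live

    def idle(self) -> str:
        return "green" if self.live == "blue" else "blue"

    def switch(self) -> str:
        """Flip the live environment; returns the previous one for rollback."""
        previous = self.live
        self.live = self.idle()
        return previous

    def route(self, user_id: str) -> str:
        return self.live
```

Raising `canary_percent` only ever adds users to the canary group, so exposure grows gradually. `switch()` moves everyone in one step, and calling it again is the rollback.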
Graceful shutdown protects users during scale down
Stopping safely matters as much as starting safely in distributed systems. SIGTERM, readiness drop, and connection draining must happen in order. The grace period (terminationGracePeriodSeconds) should cover real in-flight request durations, because once it expires the kubelet sends SIGKILL and unfinished work is lost. Shutdown bugs often hide until peak load or node drain events. See. A pod should leave traffic before the process leaves memory. Now watch.
shutdown order
┌────────────┐    ┌────────────┐    ┌────────────┐
│  sigterm   │ -> │ not ready  │ -> │    exit    │
└─────┬──────┘    └─────┬──────┘    └─────┬──────┘
      │                 │                 │
      v                 v                 v
 start drain   |  stop new work  | finish old work
Order matters more than speed here.
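The shutdown order can be sketched in Python with the standard `signal` and `threading` modules. The `Lifecycle` class and its method names are hypothetical, and the sketch assumes the readiness endpoint reports the `ready` flag and that request handlers call `try_begin_request` and `end_request`.

```python
import signal
import threading


class Lifecycle:
    """Tracks readiness and in-flight work for an orderly exit (illustrative)."""

    def __init__(self):
        self.ready = True
        self._in_flight = 0
        self._lock = threading.Lock()
        self._drained = threading.Event()
        self._drained.set()  # nothing is in flight yet

    def try_begin_request(self) -> bool:
        """Admit a request only while the instance is still ready."""
        with self._lock:
            if not self.ready:
                return False
            self._in_flight += 1
            self._drained.clear()
            return True

    def end_request(self) -> None:
        with self._lock:
            self._in_flight -= 1
            if self._in_flight == 0:
                self._drained.set()

    def begin_shutdown(self) -> None:
        # A plain attribute write with no lock, so it is safe inside a signal handler.
        self.ready = False

    def wait_drained(self, timeout_s: float) -> bool:
        """Block until in-flight work finishes or the grace budget runs out."""
        return self._drained.wait(timeout_s)


def install_sigterm_handler(lifecycle: Lifecycle) -> None:
    """Step 1: SIGTERM flips readiness instead of killing the process."""

    def on_sigterm(signum, frame):
        lifecycle.begin_shutdown()

    signal.signal(signal.SIGTERM, on_sigterm)


def shutdown_sequence(lifecycle: Lifecycle, grace_s: float) -> bool:
    """Steps 2 and 3: stop new work, then wait for old work before exiting."""
    lifecycle.begin_shutdown()
    return lifecycle.wait_drained(grace_s)
```

In a real service `grace_s` should be shorter than the pod's grace period, so the process exits on its own terms before SIGKILL arrives.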
Observability decides whether safe rollout claims are real
You cannot call a rollout safe if you cannot see regressions quickly. Metrics, logs, traces, and deployment markers need to line up. Error rate alone is a weak signal: a release can hold errors flat while latency and saturation climb. Good rollback is mostly fast detection plus clear ownership. See. Confidence comes from signals, not from hope. Now watch.
decision loop
┌────────────┐    ┌────────────┐    ┌────────────┐
│   deploy   │ -> │  observe   │ -> │   decide   │
└─────┬──────┘    └─────┬──────┘    └─────┬──────┘
      │                 │                 │
      v                 v                 v
 event mark    |    slo watch    | continue or undo
Releases need a visible feedback loop.
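The decide step can be written down as a small Python check. The thresholds, the `Snapshot` and `Limits` types, and the `decide` function are all assumptions made for illustration; real limits should come from the service's own SLOs.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Snapshot:
    error_rate: float      # fraction of failed requests, 0.0 to 1.0
    p99_latency_ms: float
    saturation: float      # e.g. CPU or queue utilization, 0.0 to 1.0


@dataclass(frozen=True)
class Limits:
    max_error_rate: float = 0.01      # assumed SLO: at most 1% errors
    max_latency_ratio: float = 1.25   # assumed: p99 may grow 25% over baseline
    max_saturation: float = 0.85      # assumed headroom limit


def decide(
    baseline: Snapshot, current: Snapshot, limits: Limits = Limits()
) -> Tuple[str, List[str]]:
    """Return ('continue' or 'undo', reasons) by checking all three signals."""
    reasons = []
    if current.error_rate > limits.max_error_rate:
        reasons.append("error rate")
    if current.p99_latency_ms > baseline.p99_latency_ms * limits.max_latency_ratio:
        reasons.append("latency")
    if current.saturation > limits.max_saturation:
        reasons.append("saturation")
    return ("undo" if reasons else "continue", reasons)


before = Snapshot(error_rate=0.002, p99_latency_ms=200.0, saturation=0.50)
after = Snapshot(error_rate=0.002, p99_latency_ms=320.0, saturation=0.60)
print(decide(before, after))
```

In this example the error rate never moves, yet the release is undone because p99 latency went from 200 ms to 320 ms against a 250 ms limit. An error-only check would have waved it through.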
Where this lives in the wild
- Product APIs use readiness probes and rolling updates for everyday releases.
- SRE teams use canaries with mesh or ingress traffic splitting on risky changes.
- Batch workers need graceful shutdown to avoid dropping in-flight work.
- Model serving stacks rely on startup probes because large models load slowly.
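For the slow-loading model case, the startup probe's time budget is roughly failureThreshold × periodSeconds. A small Python sketch of that arithmetic follows; the function names and the 1.5× headroom factor are assumptions for illustration, not recommended values.

```python
import math


def startup_budget_s(period_seconds: int, failure_threshold: int) -> int:
    """Approximate time a container gets to pass its startup probe."""
    return period_seconds * failure_threshold


def failure_threshold_for(
    load_time_s: float, period_seconds: int, headroom: float = 1.5
) -> int:
    """Smallest failureThreshold that covers the load time plus headroom."""
    return math.ceil(load_time_s * headroom / period_seconds)


# A model that takes about 4 minutes to load, probed every 10 seconds.
threshold = failure_threshold_for(load_time_s=240, period_seconds=10)
print(threshold, startup_budget_s(10, threshold))
```

Sizing the startup probe this way lets liveness stay strict after boot, since it no longer has to tolerate the long load.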
Pause and recall
- How do liveness, readiness, and startup probes answer different questions?
- Why is surge capacity part of rollout safety?
- When is canary a better fit than blue-green?
- Why must graceful shutdown drop readiness before process exit?
Interview Q&A
Q: Why should readiness and liveness rarely hit the same endpoint logic? A: They answer different operational questions and should fail for different reasons. Mixing them often creates restart loops for problems that only needed traffic to stop. Common wrong answer to avoid: “Because Kubernetes requires unique probe URLs.”
Q: Why can blue-green still be risky even with two environments? A: Data contracts, background jobs, and hidden dependencies may still be shared. A clean router switch does not magically isolate every side effect. Common wrong answer to avoid: “Because blue-green only works on stateless apps.”
Q: Why is graceful shutdown part of reliability, not just politeness? A: Without it, scale-down and node-drain events drop user requests or work in progress. That turns normal operations into customer-visible failures. Common wrong answer to avoid: “Because it helps logs flush nicely.”
Q: Why are deployment markers useful on dashboards? A: They let teams line up behavioral changes with the exact rollout step. That speeds both rollback decisions and post-incident explanation. Common wrong answer to avoid: “Because auditors like screenshots.”
Apply now (5 min)
Take one service and write one startup, readiness, and liveness rule. Now pick canary or blue-green and justify the choice. Describe the exact order of events during graceful shutdown. List three metrics you would watch during rollout minute one. Finally, choose the rollback trigger that would stop the release.
Bridge. Deployments are safe. What don't we fully understand about K8s? → 09 (09-honest-admission.md)