05. Storage in Kubernetes: Data should outlive the pod

⏱️ Estimated time: 23 min | Level: intermediate

ELI5 callback: Think of a busy shipping port. The dock manager must place every container on the right ship. Heavy ML work needs a cargo crane, and port security keeps lanes and permissions clean.

Pods are temporary, so storage must be explicit

A pod can be rescheduled at any time, so local writable state is fragile. An emptyDir volume or the container filesystem helps with scratch work, not durable records. Keep the analogy close: the dock manager reads the manifest, the container carries one workload unit, the ship offers capacity, the cargo crane handles ML-heavy lifts, and port security blocks unsafe access. Simple, no? Persistent data needs an explicit storage contract that outlives any single pod. Start by deciding what must survive a restart. Now watch.

state choices
┌────────────┐    ┌────────────┐    ┌────────────┐
│ pod fs     │ -> │ scratch    │ -> │ persistent │
└─────┬──────┘    └─────┬──────┘    └─────┬──────┘
      │                 │                 │
      v                 v                 v
  ephemeral | cache space | real data
Not all writes deserve durability.

Use ephemeral storage for temp files, caches, and short-lived intermediates. Use persistent storage for databases, queues, user uploads, and checkpoints. A container restart wipes the container's writable layer, and an emptyDir disappears when its pod is removed from the node, so any state you never named as durable is lost. Kubernetes makes that pain visible faster than pets-style servers do. Stateful design starts with failure, not with happy-path demos. If data matters tomorrow, plan storage today.

So what to do?
List every write path before choosing a volume type.
Separate cache loss from data loss in architecture reviews.
Treat local node storage as risky unless you accept node coupling.
Make durability needs explicit in team docs.
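Listing every write path can start as a tiny script. The sketch below is illustrative only: the `WritePath` record, the example paths, and the `suggest_storage` rule are hypothetical names invented for this example, not part of Kubernetes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WritePath:
    """One location a workload writes to (hypothetical inventory record)."""
    path: str
    purpose: str
    must_survive_pod_loss: bool


def suggest_storage(wp: WritePath) -> str:
    """Apply the rule from the text: durable data gets a claim, the rest
    can live in ephemeral storage such as an emptyDir volume."""
    return "persistentVolumeClaim" if wp.must_survive_pod_loss else "emptyDir"


# Assumed inventory for an imaginary upload service.
inventory = [
    WritePath("/tmp/thumbnails", "render cache", False),
    WritePath("/var/lib/uploads", "user uploads", True),
    WritePath("/var/lib/checkpoints", "model checkpoints", True),
]

plan = {wp.path: suggest_storage(wp) for wp in inventory}

for path, kind in plan.items():
    print(f"{path:22} -> {kind}")
```

The useful part is the inventory, not the one-line rule: once each path has an explicit "must survive" flag, cache loss and data loss stop being confused in reviews.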

PersistentVolumes and claims split supply from demand

A PersistentVolume (PV) describes available storage capacity and capabilities. A PersistentVolumeClaim (PVC) expresses what the workload wants to consume. That split lets app teams request storage without hand-wiring devices. A StorageClass adds the policy layer for provisioning behavior. Separate demand from supply so platforms can evolve safely. Now watch.

pv claim flow
┌────────────┐    ┌────────────┐    ┌────────────┐
│ claim      │ -> │ class      │ -> │ volume     │
└─────┬──────┘    └─────┬──────┘    └─────┬──────┘
      │                 │                 │
      v                 v                 v
  app asks | policy picks | storage binds
Claims talk intent, volumes provide reality.
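As a rough sketch of the policy layer, the snippet below builds a StorageClass manifest as a Python dict and prints it as JSON, which kubectl accepts alongside YAML. The class name `fast-retained` and the provisioner `csi.example.com` are placeholders; a real cluster would use the name of its installed driver.

```python
import json

RECLAIM_POLICIES = {"Delete", "Retain"}
BINDING_MODES = {"Immediate", "WaitForFirstConsumer"}


def build_storage_class(name, provisioner, reclaim_policy="Delete",
                        binding_mode="WaitForFirstConsumer",
                        allow_expansion=True):
    """Return a StorageClass manifest; reject values the API would refuse."""
    if reclaim_policy not in RECLAIM_POLICIES:
        raise ValueError(f"unknown reclaim policy: {reclaim_policy}")
    if binding_mode not in BINDING_MODES:
        raise ValueError(f"unknown volume binding mode: {binding_mode}")
    return {
        "apiVersion": "storage.k8s.io/v1",
        "kind": "StorageClass",
        "metadata": {"name": name},
        "provisioner": provisioner,              # placeholder driver name
        "reclaimPolicy": reclaim_policy,         # fate of the volume after the claim is deleted
        "volumeBindingMode": binding_mode,       # bind now, or wait for a pod to be scheduled
        "allowVolumeExpansion": allow_expansion,
    }


storage_class = build_storage_class(
    "fast-retained", "csi.example.com", reclaim_policy="Retain"
)
print(json.dumps(storage_class, indent=2))
```

Naming the class after its tier and reclaim behavior is a design choice here, so a claim that references it states its durability expectations in one word.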

Claims ask for size, access mode, and storage class. A bound claim then mounts into the pod as a usable volume. ReadWriteOnce is common and often misunderstood by new teams: it allows read-write mounting by a single node, not a single pod, so several pods on that node can share the volume. ReadWriteMany sounds nice but depends on backend support and cost. The reclaim policy decides whether the underlying volume is deleted or retained after the claim goes away. PVCs keep backend-specific details out of app manifests.

So what to do?
Choose access modes from backend reality, not wishful thinking.
Name storage classes clearly by performance and durability tier.
Review reclaim policies before deleting test environments.
Track which claims are orphaned and still billing money.
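To make the three claim fields concrete, here is a minimal sketch that builds a PersistentVolumeClaim manifest and the matching pod volume entry. The claim name `uploads-data`, the size `20Gi`, and the class name `standard-ssd` are assumptions for illustration; the manifest field names follow the core `v1` API.

```python
import json

ACCESS_MODES = {"ReadWriteOnce", "ReadOnlyMany", "ReadWriteMany", "ReadWriteOncePod"}


def build_pvc(name, size, access_mode, storage_class):
    """Return a PVC manifest asking for size, access mode, and storage class."""
    if access_mode not in ACCESS_MODES:
        raise ValueError(f"unknown access mode: {access_mode}")
    return {
        "apiVersion": "v1",
        "kind": "PersistentVolumeClaim",
        "metadata": {"name": name},
        "spec": {
            "accessModes": [access_mode],
            "storageClassName": storage_class,
            "resources": {"requests": {"storage": size}},
        },
    }


def pod_volume_for(claim_name, volume_name="data"):
    """Return the entry a pod spec lists under `volumes` to use the claim."""
    return {
        "name": volume_name,
        "persistentVolumeClaim": {"claimName": claim_name},
    }


pvc = build_pvc("uploads-data", "20Gi", "ReadWriteOnce", "standard-ssd")
volume = pod_volume_for("uploads-data")

print(json.dumps(pvc, indent=2))
print(json.dumps(volume, indent=2))
```

Note that nothing in either object names a disk, a zone, or a vendor. That is the point of the claim: the pod references only `claimName`, and the backend stays a platform decision.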

CSI drivers automate provisioning and attachment

CSI (Container Storage Interface) is the standard interface between Kubernetes and storage systems. It lets vendors build one integration model instead of a custom plugin for every platform. Dynamic provisioning means volumes appear when claims are created. That removes ticket-driven storage work for common paths. Provisioning should feel boring when the control plane is healthy. Now watch.

csi path
┌────────────┐    ┌────────────┐    ┌────────────┐
│ claim      │ -> │ csi        │ -> │ disk       │
└─────┬──────┘    └─────┬──────┘    └─────┬──────┘
      │                 │                 │
      v                 v                 v
  request | provision | attach mount
Storage arrives through a controller workflow.

Block, file, and object-style integrations behave differently in practice. Volume expansion may require both backend support and filesystem support. Snapshots and cloning depend on CSI capabilities, not your wishes. Provisioning delay can slow pod startup more than teams expect. Per-node attach limits can surprise dense workload packing. Failure events usually tell you whether binding, attaching, or mounting failed.

So what to do?
Test expansion and restore before you promise them to users.
Alert on provisioning latency, not just outright failure.
Know your node attach limits for each cloud or backend.
Read CSI controller logs during tricky incidents.
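Attach limits are easy to reason about with a little arithmetic. The sketch below counts distinct volumes per node and flags nodes over a limit. The pod names, node names, and the limit of 3 are made up for the example; real limits vary by driver and instance type.

```python
def volumes_per_node(pod_volumes, pod_to_node):
    """Count distinct volumes attached to each node.

    A volume shared by two pods on the same node is counted once,
    because it is attached to the node only once.
    """
    attached = {}
    for pod, volumes in pod_volumes.items():
        attached.setdefault(pod_to_node[pod], set()).update(volumes)
    return {node: len(vols) for node, vols in attached.items()}


def nodes_over_limit(counts, attach_limit):
    """Return the nodes whose attached-volume count exceeds the limit."""
    return sorted(node for node, count in counts.items() if count > attach_limit)


# Assumed placement of pods and their claims.
pod_volumes = {
    "db-0": ["pvc-db-0"],
    "db-1": ["pvc-db-1"],
    "search-0": ["pvc-search-0", "pvc-search-logs-0"],
    "cache-0": [],
}
pod_to_node = {
    "db-0": "node-a",
    "db-1": "node-a",
    "search-0": "node-a",
    "cache-0": "node-b",
}
ATTACH_LIMIT = 3  # assumed value, check your own backend

counts = volumes_per_node(pod_volumes, pod_to_node)
over = nodes_over_limit(counts, ATTACH_LIMIT)
print(counts, over)
```

A check like this is only a planning aid. The scheduler already accounts for the limits that CSI drivers report, so a pod that would exceed them stays pending rather than failing at attach time.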

Stateful workloads need identity and careful rollout

Databases, brokers, and some ML stores need a stable identity per replica. StatefulSets provide ordered, stable identities and per-replica volume claims for that. Rolling out changes to stateful systems needs more care than replacing stateless web pods. Data layout and leader election often matter more than YAML shape. Stateful does not mean impossible; it means less careless. Now watch.

stateful set
┌────────────┐    ┌────────────┐    ┌────────────┐
│ ordinal    │ -> │ volume     │ -> │ restart    │
└─────┬──────┘    └─────┬──────┘    └─────┬──────┘
      │                 │                 │
      v                 v                 v
  stable name | same data | ordered move
Identity stays attached to storage.

A StatefulSet pod keeps a stable ordinal name across restarts. Each replica usually gets its own dedicated claim, created from a volume claim template. Ordered startup and shutdown can protect quorum-based systems. Headless Services commonly expose per-pod DNS identities to peers. Do not force every stateful system into Kubernetes blindly. Sometimes managed databases remain the saner operational choice.

So what to do?
Benchmark storage latency before running databases on the cluster.
Plan backups before the first production write arrives.
Test node failure and rescheduling for each stateful workload.
Document who owns restore drills and retention windows.
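The stable identities follow a predictable naming scheme, which the sketch below reproduces. It assumes the default starting ordinal of 0 and the default cluster domain `cluster.local`; the StatefulSet name `broker`, the Service name `broker-headless`, the namespace `messaging`, and the claim template name `data` are hypothetical.

```python
def pod_names(statefulset, replicas):
    """Pods are named <statefulset>-<ordinal>, starting at ordinal 0."""
    return [f"{statefulset}-{ordinal}" for ordinal in range(replicas)]


def pvc_names(statefulset, replicas, claim_template):
    """Each pod's claim is named <claim template>-<pod name>."""
    return [f"{claim_template}-{pod}" for pod in pod_names(statefulset, replicas)]


def pod_dns(pod, service, namespace, cluster_domain="cluster.local"):
    """Per-pod DNS name published through the governing headless Service."""
    return f"{pod}.{service}.{namespace}.svc.{cluster_domain}"


def shutdown_order(statefulset, replicas):
    """With ordered pod management, pods terminate from highest ordinal down."""
    return list(reversed(pod_names(statefulset, replicas)))


pods = pod_names("broker", 3)
claims = pvc_names("broker", 3, "data")
peers = [pod_dns(p, "broker-headless", "messaging") for p in pods]

print(pods)
print(claims)
print(peers)
print(shutdown_order("broker", 3))
```

Because the claim name is derived from the pod name, a rescheduled `broker-1` finds the same claim again, which is what ties identity to data.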

Backups, recovery, and cost need equal attention

Persistent storage is only useful when you can recover data safely. Snapshots, backups, and restore drills prove whether durability is real. Fast storage tiers are attractive until the monthly bill appears. Retention policy is part of storage design, not a legal afterthought. Recovery time and recovery point objectives should guide storage choices. Now watch.

recovery view
┌────────────┐    ┌────────────┐    ┌────────────┐
│ write      │ -> │ backup     │ -> │ restore    │
└─────┬──────┘    └─────┬──────┘    └─────┬──────┘
      │                 │                 │
      v                 v                 v
  live data | snapshot copy | usable state
A backup is theory until restore works.

Choose a storage tier by latency, throughput, and recovery goals together. Cold data rarely deserves premium SSD pricing forever. Backups must capture an application-consistent state, not only copy files. Model checkpoints can be huge, so transfer cost matters too. Delete test environments carefully, because a Retain reclaim policy keeps disks alive and billing. Tag volumes with team, service, and environment for cost visibility.

So what to do?
Schedule restore drills, not only backup jobs.
Measure storage spend per workload family every month.
Expire unused snapshots before they become quiet waste.
Tie retention windows to business and compliance needs.
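Snapshot expiry and its cost impact can be estimated with a few lines. The following is a sketch under assumptions: the `Snapshot` record, the 30-day retention window, the snapshot sizes, and the price of 0.05 per GiB-month are invented for illustration and do not reflect any provider's pricing.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class Snapshot:
    """Hypothetical snapshot record; real metadata comes from your backend."""
    name: str
    created_at: datetime
    size_gib: int


def expired_snapshots(snapshots, now, retention):
    """Return snapshots older than the retention window."""
    cutoff = now - retention
    return [s for s in snapshots if s.created_at < cutoff]


def monthly_cost(snapshots, price_per_gib_month):
    """Estimate monthly spend for keeping the given snapshots."""
    return sum(s.size_gib for s in snapshots) * price_per_gib_month


now = datetime(2024, 6, 30, tzinfo=timezone.utc)
snapshots = [
    Snapshot("ckpt-april", datetime(2024, 4, 1, tzinfo=timezone.utc), 400),
    Snapshot("ckpt-may", datetime(2024, 5, 15, tzinfo=timezone.utc), 400),
    Snapshot("ckpt-june", datetime(2024, 6, 20, tzinfo=timezone.utc), 400),
]
RETENTION = timedelta(days=30)  # assumed retention window
PRICE = 0.05                    # assumed price per GiB-month

stale = expired_snapshots(snapshots, now, RETENTION)
print([s.name for s in stale])
print(f"monthly cost of stale snapshots: {monthly_cost(stale, PRICE):.2f}")
```

Listing what would expire before deleting anything keeps the retention decision reviewable, which matters when the window is tied to compliance needs.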

Where this lives in the wild

StatefulSets power self-hosted brokers and search clusters when teams accept the operational load.
ML platforms store checkpoints on persistent volumes during long training jobs.
SaaS teams use dynamic provisioning so app teams request storage without tickets.
Fintech systems pair Kubernetes volumes with strict backup and restore drills.

Pause and recall

Why is pod-local storage unsafe for durable application data?
How do PVs, PVCs, and StorageClasses divide responsibilities?
What extra work does CSI remove for platform teams?
Why do stateful systems need different rollout thinking than stateless APIs?

Interview Q&A

Q: Why do PVCs help app teams? A: They let workloads request storage intent without embedding backend details everywhere. That keeps application manifests portable while platform policy evolves underneath. Common wrong answer to avoid: “Because PVs are deprecated.”

Q: Why can dynamic provisioning still feel slow? A: Real storage must be created, attached to the node, and mounted before the pod becomes ready. That workflow can take noticeable time, especially on fresh nodes or busy backends. Common wrong answer to avoid: “Because Kubernetes storage is always slower than local disks.”

Q: Why are backups not enough without restore drills? A: A backup file proves only that a copy exists somewhere. Recovery counts only when teams can restore usable state within the expected time. Common wrong answer to avoid: “Because auditors demand extra paperwork.”

Q: Why are managed databases often still attractive? A: They offload replication, backup, patching, and failure management work. Running stateful systems on Kubernetes is possible, but not automatically cheaper or easier. Common wrong answer to avoid: “Because Kubernetes cannot run databases at all.”

Apply now (5 min)

List one workload you know and separate its cache data from its durable data. Choose a claim size, access mode, and storage class for it. Now write one sentence for its backup frequency and restore target. Decide whether a StatefulSet or a managed service feels saner. Finally, note one hidden cost in that storage decision.

Bridge. Storage attached. But how do containers talk securely? → 06-service-mesh-network-policy.md