
00. Kubernetes and GPU Platforms — The Five-Year-Old Version

Containers are boxes. Kubernetes is the warehouse manager that stacks, moves, and replaces them.


Imagine a shipping port. Thousands of containers arrive daily. Each container holds different goods. Some are fragile, some heavy, some need refrigeration.

You need a dock manager who decides where each container goes. Which ship? Which stack position? If a container falls off, the dock manager replaces it immediately. If a ship is full, the dock manager routes containers to another ship.

That is Kubernetes. Your applications are packaged as container images (Docker is the most familiar tool for building them) and run as containers. Kubernetes is the dock manager: it schedules containers onto servers, restarts crashed ones, scales them up when traffic spikes, and rolls out new versions gradually so a well-configured service stays up during the upgrade.

Each container sits on a ship — a physical or virtual server (a node). Multiple containers share a ship. The dock manager decides which containers go on which ship based on resource requests: "This container needs 4 CPUs and 16 GB RAM. This ship has room."
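To make the "does this ship have room?" check concrete, here is a toy first-fit placement in Python. It is only a sketch: the `Node` class, the `schedule` function, and the ship names are made up for illustration, and the real Kubernetes scheduler filters and scores nodes on many more criteria than free CPU and memory.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A hypothetical 'ship': one server with some unallocated capacity."""
    name: str
    cpu_free: float      # CPU cores not yet promised to any pod
    mem_free_gb: float   # GB of RAM not yet promised to any pod
    pods: list = field(default_factory=list)


def schedule(pod_name, cpu, mem_gb, nodes):
    """Place the pod on the first node with enough free CPU and memory.

    Returns the chosen node's name, or None if nothing fits
    (in a real cluster the pod would stay Pending).
    """
    for node in nodes:
        if node.cpu_free >= cpu and node.mem_free_gb >= mem_gb:
            node.cpu_free -= cpu
            node.mem_free_gb -= mem_gb
            node.pods.append(pod_name)
            return node.name
    return None


nodes = [Node("ship-a", cpu_free=2, mem_free_gb=8),
         Node("ship-b", cpu_free=8, mem_free_gb=32)]

print(schedule("api", 4, 16, nodes))      # ship-b (ship-a is too small)
print(schedule("worker", 4, 16, nodes))   # ship-b (it still had 4 CPUs, 16 GB)
print(schedule("big-job", 4, 16, nodes))  # None (no ship has room left)
```

The key idea carried over from the real system is that placement is decided from what a container requests, not from what it happens to be using at the moment.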

Some containers are special. They need GPUs, expensive cargo cranes that only certain ships have. GPU scheduling is harder. By default a container asks for whole GPUs, and a GPU is not shared between containers unless you set up something extra such as NVIDIA MIG partitioning or time-slicing. The dock manager must match GPU-hungry containers to GPU-equipped ships.
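As a rough illustration of how a container asks for a crane, the sketch below builds a Pod manifest as a Python dict and prints it as JSON. It assumes the cluster runs the NVIDIA device plugin, which advertises GPUs under the resource name `nvidia.com/gpu`; the helper `gpu_pod_manifest`, the pod name, and the image URL are hypothetical.

```python
import json


def gpu_pod_manifest(name, image, gpus):
    """Build a Pod manifest (as a dict) that requests whole GPUs."""
    # GPUs are requested as whole integers; fractions are rejected here.
    if isinstance(gpus, bool) or not isinstance(gpus, int) or gpus < 1:
        raise ValueError("GPUs are requested in whole units (1, 2, ...)")
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "restartPolicy": "Never",
            "containers": [{
                "name": "trainer",
                "image": image,  # hypothetical image, for illustration only
                # Assumes the NVIDIA device plugin is installed on GPU nodes.
                "resources": {"limits": {"nvidia.com/gpu": gpus}},
            }],
        },
    }


manifest = gpu_pod_manifest("train-demo", "example.com/trainer:latest", 2)
print(json.dumps(manifest, indent=2))
```

A pod built this way can only land on a ship that still has two free cranes; if no such ship exists, it waits.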

The port has rules. Containers from the shipping company can't access containers from the military. Port security (network policies, service mesh, RBAC) keeps things isolated. That isolation is not automatic: out of the box, pods in a cluster can reach each other over the network, whether or not they share a ship, until a network policy says otherwise.
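One common starting point for port security is a "deny all inbound traffic" rule per namespace, with explicit allowances added later. The sketch below builds such a NetworkPolicy as a Python dict. The helper name `default_deny_ingress` and the namespace `team-a` are made up, and the policy only takes effect if the cluster's network plugin enforces network policies.

```python
import json


def default_deny_ingress(namespace):
    """Build a NetworkPolicy that blocks inbound traffic to every pod
    in the given namespace."""
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": "default-deny-ingress", "namespace": namespace},
        "spec": {
            # An empty selector matches every pod in the namespace.
            "podSelector": {},
            # Ingress is listed but no ingress rules are given,
            # so nothing inbound is allowed.
            "policyTypes": ["Ingress"],
        },
    }


policy = default_deny_ingress("team-a")  # hypothetical namespace
print(json.dumps(policy, indent=2))
```

Further policies then open specific paths, for example letting only the frontend pods talk to the model server.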

When the port gets busy, you add more ships. When it's quiet, you remove them. Auto-scaling means the port shrinks and grows with demand, and you only pay for ships that are actually docked.

Why does AI care about Kubernetes? Three reasons:

First, model training needs many GPUs working together. Kubernetes manages the cluster — assigns GPUs, handles failures mid-training, and schedules jobs in priority order. Without the dock manager, GPU allocation is manual chaos.

Second, model serving needs elasticity. At 3 AM, traffic is low and one replica is enough. At noon, traffic spikes and you need twenty replicas. Kubernetes auto-scales the replicas, typically on CPU utilization, or on request volume once custom metrics are wired in.
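The replica arithmetic can be sketched in a few lines. The Horizontal Pod Autoscaler documentation gives the core rule as desired = ceil(current replicas × current metric / target metric). The function below applies that rule plus min/max bounds. It is a simplification: the real autoscaler also applies a tolerance band and stabilization windows, and the per-pod request rate used as the metric here is an assumed custom metric.

```python
import math


def desired_replicas(current_replicas, current_metric, target_metric,
                     min_replicas=1, max_replicas=20):
    """Core HPA scaling rule, clamped to [min_replicas, max_replicas].

    current_metric and target_metric are per-pod averages,
    e.g. requests per second per replica (an assumed custom metric).
    """
    if target_metric <= 0:
        raise ValueError("target_metric must be positive")
    raw = math.ceil(current_replicas * current_metric / target_metric)
    return max(min_replicas, min(max_replicas, raw))


# Noon: one replica is seeing 2000 req/s, and the target is 100 req/s per pod.
print(desired_replicas(1, current_metric=2000, target_metric=100))  # 20

# 3 AM: twenty replicas each see 5 req/s against the same target.
print(desired_replicas(20, current_metric=5, target_metric=100))    # 1
```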

Third, ML teams run many experiments simultaneously. Researcher A trains a vision model. Researcher B trains an LLM. Researcher C runs batch inference. They all share the same GPU cluster. Kubernetes provides namespaces, resource quotas, and priority classes so they don't step on each other; fair-share queueing on top of that usually comes from add-ons such as Kueue or Volcano.

The ecosystem around Kubernetes is vast. Helm packages applications. ArgoCD deploys them GitOps-style. Istio adds a service mesh (mTLS, traffic splitting). Prometheus collects metrics and drives alerts. The dock manager is just the core; the port has many supporting services that make operations smooth.


The placeholders you will see called back

| Placeholder | Meaning |
| --- | --- |
| dock manager | Kubernetes control plane: scheduler, controller manager, API server |
| ship | node: the physical or virtual server running containers |
| container | pod: the smallest deployable unit (one or more containers) |
| cargo crane | GPU: the specialized hardware for ML workloads |
| port security | network policies, RBAC, service mesh: isolation and access control |

What's coming

  1. 01-containers-and-images.md — Docker, images, layers, and why containers exist
  2. 02-pods-services-ingress.md — the core K8s objects: pods, services, ingress controllers
  3. 03-deployments-and-scaling.md — ReplicaSets, Deployments, HPA, and rolling updates
  4. 04-gpu-scheduling-node-pools.md — taints, tolerations, device plugins, and multi-GPU jobs
  5. 05-storage-in-kubernetes.md — PVs, PVCs, StorageClasses, and stateful workloads
  6. 06-service-mesh-network-policy.md — Istio, mTLS, traffic splitting, and network isolation
  7. 07-autoscaling-and-capacity.md — HPA, VPA, cluster autoscaler, Karpenter, and right-sizing
  8. 08-rollouts-and-health.md — canary, blue-green, probes, and graceful shutdown
  9. 09-honest-admission.md — what we don't fully understand about K8s and GPU scheduling

Bridge. Before the dock manager can do anything, goods must be packed into containers. Let's start there. → 01-containers-and-images.md