01. Cloud primitives and compute — VMs vs containers vs serverless

⏱️ Estimated time: 18 min | Level: intermediate

ELI5 callback: In the dragon farm, the barn runs the work, the feeding trough holds the data, the fence limits access, the breeding ground scales the herd, and the ledger stops waste. Today we decide where AI work should actually run.

1) See the shape clearly

VMs, containers, and serverless functions all matter here. They do not optimise for the same pressure. See. Start with workload shape, not vendor branding.

Check startup time, runtime length, and host control. Check who patches the base layer. Check whether scale is steady or bursty. Check whether warm state must survive. Simple, no?

VMs give full machine control and clear hardware ownership. Containers give faster packaging and denser multi-service deployment. Serverless functions give burst handling without long-lived hosts. AI teams often use all three, but for different lanes.

So what to do? Write the fit matrix before provisioning anything.

  • Prioritise the slowest or costliest path.
  • Measure idle time honestly.
  • Record operational ownership.
  • Record rollback method.
  • Record debugging path.
  • Record compliance limits.

Good teams choose boring defaults first. Fancy choices can wait.
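A fit matrix can be as small as one record per workload. The sketch below is one possible shape, not a standard: the class name, field names, and the two sample workloads are all made up for illustration. It simply captures the checks listed above so that they are written down before anything is provisioned.

```python
from dataclasses import dataclass, fields


@dataclass
class WorkloadFit:
    """One row of a fit matrix. Every field name here is illustrative."""
    name: str
    startup_budget_s: float    # longest acceptable startup time, in seconds
    typical_runtime_s: float   # how long one unit of work usually runs
    needs_host_control: bool   # root, kernel settings, or device access
    base_layer_owner: str      # who patches the OS or base image
    traffic_shape: str         # "steady" or "bursty"
    warm_state_required: bool  # must loaded state survive between requests?


def render_matrix(rows):
    """Return the matrix as aligned plain-text lines, header first."""
    headers = [f.name for f in fields(WorkloadFit)]
    table = [headers] + [[str(getattr(r, h)) for h in headers] for r in rows]
    widths = [max(len(row[i]) for row in table) for i in range(len(headers))]
    return [
        " | ".join(cell.ljust(w) for cell, w in zip(row, widths)).rstrip()
        for row in table
    ]


# Hypothetical workloads, only to show the shape of the matrix.
rows = [
    WorkloadFit("chat-api", 0.5, 2.0, False, "platform-team", "bursty", True),
    WorkloadFit("nightly-embed", 120.0, 3600.0, True, "ml-team", "steady", False),
]
for line in render_matrix(rows):
    print(line)
```

The value is in filling the rows honestly, not in the code. A spreadsheet with the same columns would do the same job.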

2) Read the decision signals

Use VMs when GPU drivers, kernel settings, or local SSDs matter. Use containers when many services need identical images and quick rollouts. Use serverless when tasks are short, stateless, and event-driven. Cold starts hurt synchronous inference more than batch cleanup jobs. Long runtimes hurt serverless economics and run into timeout limits. Shared container clusters need guardrails against noisy neighbours.

Now use thresholds, not feelings. If latency is sacred, keep capacity warm and ready. If cost is sacred, chase utilisation carefully. If control is sacred, reduce abstraction. If delivery speed is sacred, buy managed pieces.

Quick decision prompts:

  • Can the job tolerate a cold start?
  • Does the job need root or device access?
  • Will traffic sit idle for long periods?
  • Does the team already operate a cluster?
  • Can the workload be split into short tasks?
  • Is rollback image based or machine based?

See. One clear 'no' can eliminate a whole option. Trade-offs are normal. Document the fallback path. Now watch.
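The elimination idea can be sketched as a small screening function. This is a deliberately crude illustration: the function name, the parameter names, and the specific rules are assumptions chosen to mirror four of the prompts above, and a real team would tune or replace them (for example, a managed container service may remove the cluster objection).

```python
def screen_runtimes(tolerates_cold_start, needs_root_or_device,
                    splits_into_short_tasks, team_operates_cluster):
    """Return (survivors, objections) after applying simple elimination rules.

    The rules are illustrative assumptions, not vendor guidance.
    """
    objections = {"vm": [], "container": [], "serverless": []}

    if not tolerates_cold_start:
        objections["serverless"].append("cold start not tolerable")
    if needs_root_or_device:
        objections["serverless"].append("needs root or device access")
    if not splits_into_short_tasks:
        objections["serverless"].append("work cannot be split into short tasks")
    if not team_operates_cluster:
        objections["container"].append("team does not operate a cluster yet")

    # An option survives only if nothing was recorded against it.
    survivors = [name for name, why in objections.items() if not why]
    return survivors, objections


# Hypothetical GPU inference service: latency-sensitive, needs device access.
survivors, objections = screen_runtimes(
    tolerates_cold_start=False,
    needs_root_or_device=True,
    splits_into_short_tasks=False,
    team_operates_cluster=True,
)
print(survivors)                 # ['vm', 'container']
print(objections["serverless"])  # the recorded reasons against serverless
```

Keeping the objections alongside the survivors is the useful part: the recorded reasons are what goes into the design note and the fallback path.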

3) Map the working path

A clean compute path starts with a trigger. Then you place work on the right runtime. Then you connect logs, metrics, and data stores. Now watch the simplest sketch.

    ┌────────────┐   ┌────────────┐   ┌────────────┐
    │   Client   │──→│  Gateway   │──→│  Compute   │
    └────────────┘   └─────┬──────┘   └─────┬──────┘
                           │                │
                           ▼                ▼
                     ┌────────────┐   ┌────────────┐
                     │ Data/Logs  │   │    Cost    │
                     └────────────┘   └────────────┘

Interactive APIs usually enter through a gateway or load balancer. Batch jobs may enter from a queue or scheduler instead. The compute box can be a VM pool, a container service, or functions. Logs and artifacts should leave the runtime quickly. Cost views should tag every request path or job type. Without tags, compute arguments become emotional, not factual.

At every arrow, ask who retries. At every box, ask who pays. At every store, ask what expires. Now watch. One metric should sit beside each box. That is how operations stays sane.
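The tagging rule is easy to test on paper. The sketch below assumes a hypothetical billing export where each record carries a cost in cents and a dictionary of tags. The record layout and the `path` tag key are invented for illustration and do not match any particular provider's format.

```python
from collections import defaultdict


def cost_by_tag(records, tag_key="path"):
    """Sum cost per tag value; records without the tag land in 'UNTAGGED'."""
    totals = defaultdict(int)
    for rec in records:
        tag = rec.get("tags", {}).get(tag_key, "UNTAGGED")
        totals[tag] += rec["cost_cents"]
    return dict(totals)


# Hypothetical billing records (costs in cents to avoid float rounding).
records = [
    {"cost_cents": 120, "tags": {"path": "chat-api"}},
    {"cost_cents": 30, "tags": {"path": "batch-embed"}},
    {"cost_cents": 80, "tags": {"path": "chat-api"}},
    {"cost_cents": 250, "tags": {}},
]
totals = cost_by_tag(records)
print(totals)  # {'chat-api': 200, 'batch-embed': 30, 'UNTAGGED': 250}
```

In this made-up data the untagged bucket is the largest one. That is the situation where nobody can answer "who pays" for a box.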

4) Notice the common traps

  • Picking serverless because "no servers" sounds modern.
  • Running stateful GPU inference on tiny ephemeral runtimes.
  • Using VMs everywhere and forgetting image automation.
  • Ignoring container image bloat and startup drag.
  • Skipping autoscaling tests under burst traffic.
  • Treating CI builds and production runtime as the same thing.

See. Most outages start as silent assumptions. Review these traps before launch:

  • Cold starts can wreck chat latency targets.
  • Quota exhaustion can stall sudden launch traffic.
  • Noisy neighbours can flatten container throughput.
  • Patch drift can create security and driver mismatches.
  • Runaway retries can multiply cost fast.
  • Idle GPU VMs can burn money silently.

Simple, no? Write failure drills for the top three risks. Decide what degrades first. Decide what must never degrade. Review quotas before launch day. Prefer explicit limits over wishful thinking. Now watch.
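The retry trap is worth a quick calculation. If every layer in the path retries independently, the worst case at the bottom layer is the product of each layer's attempts (one original try plus its retries). The layer names and retry counts below are hypothetical.

```python
from math import prod


def worst_case_attempts(retries_per_layer):
    """Worst-case attempts reaching the last layer when layers retry independently.

    Each layer makes 1 original attempt plus its configured retries, and every
    attempt at one layer can trigger the full retry sequence at the next.
    """
    return prod(1 + r for r in retries_per_layer)


# Hypothetical retry settings for three layers in a request path.
layers = {"client": 2, "gateway": 2, "compute": 3}
attempts = worst_case_attempts(layers.values())
print(attempts)  # 3 * 3 * 4 = 36
```

One failing request can therefore cost up to 36 executions in this example. Naming a single layer that owns retries, and setting the others to zero, brings the product back down.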

5) Lock the operating routine

  1. List every workload: API, batch, training, cron, and admin.
  2. Map each workload to latency, duration, and hardware needs.
  3. Declare who patches the host or base image.
  4. Choose rollout style: replace, canary, or blue-green.
  5. Set shutdown rules for idle compute.
  6. Publish one default path for new services.

Lock the language across the team. Use the same terms in code, dashboards, and reviews. Review this quick operating list:

  • Benchmark startup time.
  • Benchmark steady throughput.
  • Benchmark failure recovery.
  • Benchmark idle cost.
  • Document hard limits.
  • Keep one escape hatch.

Good platform design keeps the barn, feeding trough, fence, breeding ground, and ledger aligned.

So what to do? Create a one-page runbook. Create a one-page cost note. Create a one-page rollback note. Teach the team the same words. That alignment saves real money.

See. Consistency beats cleverness.

  • Benchmark first; opinions come second.
  • Name the owner of every limit.
  • Prefer reversible choices whenever the future is foggy.
  • Document what changes during incidents.
  • Keep one small default path for newcomers.
  • Automate the boring thing as soon as it stabilises.
  • Vendor docs help, but workload data matters more.
  • Good naming prevents bad tickets.
  • Observe p95, not only averages.
  • Small runbooks beat heroic memory.
  • Teach cost with the same seriousness as latency.

Now watch how much confusion disappears.
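The p95 advice is easy to show with numbers. The sketch below uses the nearest-rank definition of a percentile (one of several common definitions) on made-up latency samples where most requests are fast and two hit something slow, such as a cold start.

```python
def percentile_nearest_rank(samples, pct):
    """Nearest-rank percentile for an integer pct between 1 and 100."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    # Ceiling of pct * n / 100 using integer arithmetic only.
    rank = (pct * len(ordered) + 99) // 100
    return ordered[max(rank, 1) - 1]


# Hypothetical latencies in milliseconds: 18 fast requests, 2 slow ones.
latencies_ms = [100] * 18 + [900, 1500]

mean_ms = sum(latencies_ms) / len(latencies_ms)
p95_ms = percentile_nearest_rank(latencies_ms, 95)

print(mean_ms)  # 210.0
print(p95_ms)   # 900
```

The average looks healthy at 210 ms while one request in twenty waits 900 ms or more. A dashboard showing only the mean would hide exactly the users who complain.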

Where this lives in the wild

  • AWS Batch and EC2 Auto Scaling for batch AI jobs. Classic pattern for preprocessing, training prep, and large queues.
  • Kubernetes clusters on EKS, GKE, or AKS. Common home for containerised inference and supporting services.
  • Cloudflare Workers for lightweight edge logic. Useful when prompts need auth checks or tiny transformations near users.
  • Azure VM Scale Sets for GPU-backed inference fleets. Teams use them when they need strong host control and private networking.
  • Modal or RunPod style burst compute platforms. Good example of mixing container packaging with elastic infrastructure.

Pause and recall

  1. When does a VM beat a container for AI work? Say it without looking up vendor names.
  2. Why can serverless fail for warm model inference? Give one concrete example.
  3. Which metric decides between density and control? State the trade-off in one line.
  4. What should every compute choice document up front? Mention one failure mode too.

Interview Q&A

Q. How do you choose between VMs, containers, and serverless for inference?
A. Start with latency, runtime length, state warmth, and hardware control. Then pick the simplest option that satisfies those constraints.
Common wrong answer to avoid: Containers are always best because everyone uses them.
Better direction: Say which workload fits each runtime, and why.

Q. When do containers beat VMs?
A. Containers win when packaging consistency, quick rollouts, and cluster density matter more than raw host control.
Common wrong answer to avoid: Containers are just lighter VMs, so always cheaper.
Better direction: Mention orchestration overhead and noisy-neighbour risk too.

Q. When does serverless fit AI workloads?
A. It fits short, stateless, bursty tasks like validation, routing, or asynchronous enrichment.
Common wrong answer to avoid: Serverless is perfect for any inference because it auto-scales.
Better direction: Call out cold starts, timeout limits, and large model loading.

Q. What must be written in the design note?
A. Write expected traffic shape, ownership model, rollback path, and measured cost profile.
Common wrong answer to avoid: Just list the chosen service name.
Better direction: Show the decision criteria, not only the answer.

Apply now (5 min)

  1. List three workloads you run today or expect soon.
  2. Mark each one as steady, bursty, or scheduled.
  3. Write the maximum acceptable cold start for each.
  4. Write whether host access is required.
  5. Choose VM, container, or serverless for each workload.
  6. Add one reason tied to latency.
  7. Add one reason tied to cost.
  8. Circle the riskiest assumption and plan a benchmark.
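For step 7, a cost reason is stronger when it comes from a break-even estimate than from a hunch. The sketch below compares an always-on host with a pay-per-use runtime. The prices are invented, and the model ignores cold starts, minimum billing increments, and storage, so treat it as a first pass only.

```python
def break_even_busy_fraction(always_on_cost_per_hour, per_use_cost_per_busy_hour):
    """Busy fraction above which the always-on host becomes the cheaper choice."""
    if per_use_cost_per_busy_hour <= 0:
        raise ValueError("per-use cost must be positive")
    return always_on_cost_per_hour / per_use_cost_per_busy_hour


def cheaper_option(busy_fraction, always_on_cost_per_hour, per_use_cost_per_busy_hour):
    """Name the cheaper option for a measured busy fraction between 0 and 1."""
    threshold = break_even_busy_fraction(
        always_on_cost_per_hour, per_use_cost_per_busy_hour
    )
    return "pay-per-use" if busy_fraction < threshold else "always-on"


# Hypothetical prices: 1.00 per hour always-on, 4.00 per fully busy hour per-use.
threshold = break_even_busy_fraction(1.00, 4.00)
print(threshold)                         # 0.25
print(cheaper_option(0.10, 1.00, 4.00))  # pay-per-use
print(cheaper_option(0.60, 1.00, 4.00))  # always-on
```

The input that matters is the measured busy fraction, which is why the idle-time measurement has to be honest before this number means anything.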

Bridge. Compute chosen. But dragons need food stored somewhere. → 02