00. Cloud Infrastructure for AI — The Five-Year-Old Version

AI models are hungry beasts. They need special farms to feed them, not regular ones.


Imagine a farm. A regular farm grows wheat. It needs soil, water, sun, and a barn. Simple infrastructure.

Now imagine a dragon farm. Dragons eat ten times more. They breathe fire. They need reinforced barns that won't burn down, special feeding troughs that hold a thousand pounds of meat, fireproof fences around each dragon, and a breeding ground with enough space for them to stretch their wings.

Cloud infrastructure for AI is building the dragon farm. AI models are the dragons. They need GPUs (the reinforced barns), massive object storage (the feeding troughs that hold terabytes of training data), IAM and network isolation (the fireproof fences), and scalable compute clusters (the breeding ground where models train).

A regular web app can run on a single CPU and 2 GB of RAM. A large LLM can need 8 GPUs with 80 GB of memory each. A large training run can chew through 100 TB of data, and the infrastructure bill for that one run can exceed $1M. Different beasts, different farms.

The cloud providers — AWS, GCP, Azure — offer pre-built dragon farms. Managed GPU clusters, serverless inference, object storage with lifecycle policies, secret vaults, and cost controls. Your job is choosing the right pieces and wiring them together without burning money.

One more thing. Dragons are expensive to keep alive. A single A100 GPU rents for roughly $3/hour. At that rate a 64-GPU training cluster runs $192/hour, or $4,608/day. If you forget to turn it off over a weekend, that's $9,216 wasted. Cost control isn't optional; it's survival. The farm's ledger, meaning its budgets and alerts, tracks every dollar spent.
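The arithmetic above is easy to reproduce. Here is a minimal sketch that uses the $3/hour figure from the text as an assumed flat rate; the function name `cluster_cost` is made up for illustration, and real prices vary by provider, region, and purchase model.

```python
# Assumed flat rental rate per GPU-hour, taken from the text above.
A100_HOURLY_USD = 3.0


def cluster_cost(gpus, hours, hourly_rate_per_gpu=A100_HOURLY_USD):
    """Return the rental cost in USD for `gpus` GPUs running for `hours` hours."""
    return gpus * hourly_rate_per_gpu * hours


hourly = cluster_cost(64, 1)     # one hour of a 64-GPU cluster
daily = cluster_cost(64, 24)     # one full day
weekend = cluster_cost(64, 48)   # forgotten from Saturday through Sunday

print(f"hourly:  ${hourly:,.0f}")
print(f"daily:   ${daily:,.0f}")
print(f"weekend: ${weekend:,.0f}")
```

Changing the rate or the GPU count shows how quickly an idle cluster becomes the largest line in the ledger.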

But why "cloud" and not on-premises? Simple. You don't buy dragons — you rent them. Need 64 GPUs for 3 days? Rent them. Training done? Release them. No hardware sitting idle for months. The cloud is a rental service for compute, storage, and networking. You pay by the hour, minute, or even second.
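Billing granularity matters more than it sounds, especially for short jobs. The sketch below assumes a simplified model in which usage is rounded up to the next whole billing unit; the function `billed_cost` and the $3/hour rate are illustrative assumptions, not any provider's actual billing rules.

```python
import math


def billed_cost(runtime_seconds, hourly_rate_usd, granularity_seconds):
    """Cost of a job when usage is rounded up to whole billing units.

    granularity_seconds: 3600 for per-hour, 60 for per-minute, 1 for per-second.
    """
    units = math.ceil(runtime_seconds / granularity_seconds)
    billed_seconds = units * granularity_seconds
    return billed_seconds * hourly_rate_usd / 3600


# A 90-second job on one GPU at an assumed $3/hour.
per_hour = billed_cost(90, 3.0, 3600)
per_minute = billed_cost(90, 3.0, 60)
per_second = billed_cost(90, 3.0, 1)

print(f"per-hour billing:   ${per_hour:.3f}")
print(f"per-minute billing: ${per_minute:.3f}")
print(f"per-second billing: ${per_second:.3f}")
```

For a three-day training run the rounding barely registers. For many short inference or test jobs it can dominate the bill.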

The three big providers each have strengths:

- AWS: broadest service catalog, largest GPU fleet
- GCP: best ML tooling (Vertex AI, TPUs), strong networking
- Azure: enterprise integration, OpenAI partnership

You don't pick one and ignore the others. Most teams use one primary provider and keep their options open. Vendor lock-in is real: every proprietary service you use is a fence that's hard to climb over later.

The farm has layers. At the bottom: raw compute (barns). Above that: storage (feeding troughs). Then: networking and security (fences). Then: managed platforms that wire everything together (breeding grounds). At the top: cost visibility and controls (ledger). Each layer builds on the one below. Miss any layer and the farm doesn't function.
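One way to picture the layering is as an ordered checklist. The sketch below is purely illustrative: the layer names and the helpers `missing_layers` and `farm_is_functional` are assumptions made for this example, not part of any cloud API.

```python
# Layers from bottom to top, paired with their farm placeholder.
LAYERS = [
    ("compute", "barn"),
    ("storage", "feeding trough"),
    ("networking and security", "fence"),
    ("managed platform", "breeding ground"),
    ("cost controls", "ledger"),
]


def missing_layers(provisioned):
    """Return the layers not yet provisioned, in bottom-to-top order."""
    return [name for name, _ in LAYERS if name not in provisioned]


def farm_is_functional(provisioned):
    """The farm only works when every layer is present."""
    return not missing_layers(provisioned)


# Hypothetical team that has set up compute and storage but nothing else.
have = {"compute", "storage"}
print(farm_is_functional(have))
print(missing_layers(have))
```

Here the check fails and lists the three layers still to build, which is the point of the paragraph above: GPUs and data alone do not make a working farm.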


The placeholders you will see called back

| Placeholder | Meaning |
| --- | --- |
| barn | compute instances: CPU/GPU VMs, containers, serverless functions |
| feeding trough | object storage and data lakes: S3, GCS, Azure Blob |
| fence | security boundaries: IAM, VPC, encryption, network policies |
| breeding ground | training/inference clusters: GPU pools, managed ML platforms |
| ledger | cost controls: budgets, alerts, spot instances, auto-shutdown |

What's coming

  1. 01-cloud-primitives-compute.md — VMs, containers, serverless — choosing the right barn
  2. 02-object-storage-and-data.md — S3, GCS, lifecycle policies, and feeding the data pipeline
  3. 03-iam-vpc-security.md — identity, networking, and building fireproof fences
  4. 04-managed-databases-caches.md — RDS, ElastiCache, DynamoDB — the managed shelf
  5. 05-gpu-instances-and-clusters.md — A100, H100, instance types, and multi-GPU training
  6. 06-managed-ml-platforms.md — SageMaker, Vertex AI, Azure ML — managed breeding grounds
  7. 07-secrets-config-management.md — Vault, SSM, environment configs, and rotation
  8. 08-cost-controls-budgets.md — spot instances, reserved capacity, auto-shutdown, and alerts
  9. 09-serverless-patterns.md — Lambda, Cloud Functions, and when serverless fits AI workloads
  10. 10-edge-and-hybrid.md — edge inference, on-device models, and cloud-edge split
  11. 11-honest-admission.md — what we don't fully understand about cloud AI infra

Bridge. The dragon farm needs a barn first. Let's understand compute options — VMs, containers, and serverless. → 01-cloud-primitives-compute.md