00. Cloud Infrastructure for AI: The Five-Year-Old Version

AI models are hungry beasts. They need special farms to feed them — not regular offices.

Imagine a farm. A regular farm grows wheat. It needs soil, water, sun, and a barn. Simple infrastructure.

Now imagine a dragon farm. Dragons eat ten times more. They breathe fire. They need reinforced barns that won't burn down, special feeding troughs that hold a thousand pounds of meat, fireproof fences around each dragon, and a breeding ground with enough space for them to stretch their wings.

Cloud infrastructure for AI is building the dragon farm. AI models are the dragons. They need GPUs (the reinforced barns), massive object storage (the feeding troughs that hold terabytes of training data), IAM and network isolation (the fireproof fences), and scalable compute clusters (the breeding ground where models train).

A regular web app can run on a single CPU and 2 GB of RAM. A large LLM can need 8 GPUs with 80 GB of memory each, 640 GB in total. A training run can chew through 100 TB of data, and the infrastructure bill for one large training run can exceed $1M. Different beasts, different farms.

The cloud providers — AWS, GCP, Azure — offer pre-built dragon farms. Managed GPU clusters, serverless inference, object storage with lifecycle policies, secret vaults, and cost controls. Your job is choosing the right pieces and wiring them together without burning money.

One more thing. Dragons are expensive to keep alive. A single A100 GPU costs roughly $3/hour on demand. At that rate, a 64-GPU training cluster runs $192/hour, or $4,608/day. If you forget to turn it off over a weekend (48 hours), that's $9,216 wasted. Cost control isn't optional; it's survival. The ledger tracks every dollar spent.
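The arithmetic above can be sketched in a few lines of Python. The $3/hour rate and the 64-GPU cluster are the illustrative figures from the text. The function name `cluster_cost` is hypothetical, and real prices vary by provider, region, and instance type.

```python
def cluster_cost(gpus: int, rate_per_gpu_hour: float, hours: float) -> float:
    """Return the cost in dollars of running `gpus` GPUs for `hours` hours."""
    return gpus * rate_per_gpu_hour * hours


A100_RATE = 3.0  # dollars per GPU-hour; illustrative figure, not a quoted price

hourly = cluster_cost(64, A100_RATE, 1)    # one hour of the 64-GPU cluster
daily = cluster_cost(64, A100_RATE, 24)    # one full day
weekend = cluster_cost(64, A100_RATE, 48)  # a forgotten two-day weekend

print(f"hourly=${hourly:,.0f} daily=${daily:,.0f} weekend=${weekend:,.0f}")
# prints: hourly=$192 daily=$4,608 weekend=$9,216
```

Changing `hours` or `gpus` shows how quickly an idle cluster adds up, which is why auto-shutdown belongs in the ledger.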

But why "cloud" and not on-premises? Simple. You don't buy dragons — you rent them. Need 64 GPUs for 3 days? Rent them. Training done? Release them. No hardware sitting idle for months. The cloud is a rental service for compute, storage, and networking. You pay by the hour, minute, or even second.
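Billing granularity matters for short jobs. The sketch below assumes a simplified model in which usage is rounded up to the next billing unit. The function `billed_cost` and the 61-minute job are hypothetical, and real providers may add minimum charges or other rules not modeled here.

```python
import math


def billed_cost(run_seconds: int, rate_per_hour: float, granularity_seconds: int) -> float:
    """Cost when usage is rounded up to the next billing unit (simplified model)."""
    units = math.ceil(run_seconds / granularity_seconds)
    return units * granularity_seconds * rate_per_hour / 3600


run = 61 * 60  # a hypothetical job that runs for 61 minutes
rate = 192.0   # dollars per hour for a 64-GPU cluster at $3 per GPU-hour

per_hour = billed_cost(run, rate, 3600)  # rounded up to 2 full hours
per_minute = billed_cost(run, rate, 60)  # 61 one-minute units
per_second = billed_cost(run, rate, 1)   # 3,660 one-second units

print(per_hour, per_minute, per_second)
```

Under this model, hourly billing charges for two full hours, while per-minute and per-second billing charge only for the 61 minutes actually used.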

The three big providers each have strengths:

- AWS: broadest service catalog, largest GPU fleet
- GCP: best ML tooling (Vertex AI, TPUs), strong networking
- Azure: enterprise integration, OpenAI partnership

You don't pick one and ignore the others. Most teams use one primary and keep options open. Vendor lock-in is real — every proprietary service you use is a fence that's hard to climb over later.

The farm has layers. At the bottom: raw compute (barns). Above that: storage (feeding troughs). Then: networking and security (fences). Then: managed platforms that wire everything together (breeding grounds). At the top: cost visibility and controls (ledger). Each layer builds on the one below. Miss any layer and the farm doesn't function.
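One way to picture the layering is as an ordered checklist. This is a minimal sketch, not a real provisioning tool: `FARM_LAYERS` and `missing_layers` are hypothetical names, and the entries simply restate the layers described above.

```python
# Layers from bottom to top, using the document's placeholder names.
FARM_LAYERS = [
    ("barn", "raw compute"),
    ("feeding trough", "storage"),
    ("fence", "networking and security"),
    ("breeding ground", "managed platforms"),
    ("ledger", "cost visibility and controls"),
]


def missing_layers(provisioned: set[str]) -> list[str]:
    """Return the layers not yet provisioned, in bottom-to-top order."""
    return [name for name, _ in FARM_LAYERS if name not in provisioned]


# A farm with compute, storage, and a training platform, but no security or cost controls.
print(missing_layers({"barn", "feeding trough", "breeding ground"}))
# prints: ['fence', 'ledger']
```

An empty result means every layer is accounted for; anything else names the gaps, lowest layer first.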

The placeholders you will see called back

| Placeholder | Meaning |
| --- | --- |
| barn | compute instances: CPU/GPU VMs, containers, serverless functions |
| feeding trough | object storage and data lakes: S3, GCS, Azure Blob |
| fence | security boundaries: IAM, VPC, encryption, network policies |
| breeding ground | training/inference clusters: GPU pools, managed ML platforms |
| ledger | cost controls: budgets, alerts, spot instances, auto-shutdown |

Top resources

AWS Well-Architected ML Lens — AWS patterns for ML workloads across all pillars
Google Cloud AI Infrastructure Guide — GCP reference architectures for training and serving
The GPU Cost Handbook (Anyscale) — practical GPU pricing, spot strategies, and cost optimization
Terraform Up & Running by Yevgeniy Brikman — infrastructure-as-code for reproducible cloud setups
Vantage Cloud Cost Handbook — cloud cost management strategies and unit economics

What's coming

01-cloud-primitives-compute.md — VMs, containers, serverless — choosing the right barn
02-object-storage-and-data.md — S3, GCS, lifecycle policies, and feeding the data pipeline
03-iam-vpc-security.md — identity, networking, and building fireproof fences
04-managed-databases-caches.md — RDS, ElastiCache, DynamoDB — the managed shelf
05-gpu-instances-and-clusters.md — A100, H100, instance types, and multi-GPU training
06-managed-ml-platforms.md — SageMaker, Vertex AI, Azure ML — managed breeding grounds
07-secrets-config-management.md — Vault, SSM, environment configs, and rotation
08-cost-controls-budgets.md — spot instances, reserved capacity, auto-shutdown, and alerts
09-serverless-patterns.md — Lambda, Cloud Functions, and when serverless fits AI workloads
10-edge-and-hybrid.md — edge inference, on-device models, and cloud-edge split
11-honest-admission.md — what we don't fully understand about cloud AI infra

Bridge. The dragon farm needs a barn first. Let's understand compute options — VMs, containers, and serverless. → 01-cloud-primitives-compute.md