14. AI System HLD Patterns: The city where compute is expensive and timing is everything

~16 min read. AI systems look modern, but their traffic math is still very physical.

Built on the ELI5 in 00-eli5.md. The warehouse — the storage layer where goods sit ready — matters here because feature stores are specialized warehouses filled with pre-computed goods for instant pickup.

1) Why the AI city has a different blueprint

A normal CRUD app usually spends more attention on request routing and database indexes. An AI system still needs those, but the weight shifts. The warehouses get heavier because features, embeddings, model artifacts, training data, and experiment logs all need different storage behaviour. The roads get stranger because GPU workers need fast data movement, not only HTTP. And the overflow lane becomes expensive. Adding ten web servers is one conversation. Adding ten GPUs is a budget meeting.

See the mental map.

┌──────────────┐
│ Online path  │ user request → feature lookup → model inference → response
├──────────────┤
│ Offline path │ data ingest → preprocess → train → evaluate → deploy
├──────────────┤
│ Shared core  │ model registry, feature store, experiment logs, monitoring
└──────────────┘

So what is the real HLD question? Not only, "Can it work?" The sharper question is, "Can it serve fast, train safely, and improve without breaking users?"

2) Model serving: low latency with expensive brains

Online inference means a user request hits a model and expects an answer in milliseconds or low hundreds of milliseconds.

A common blueprint is this.

┌────────┐ request ┌──────────────┐ route ┌──────────────┐
│ Client │ ─────→  │ API gateway  │ ───→  │ LB / router  │
└────────┘         └──────────────┘       └──────┬───────┘
                                                 ▼
                                          ┌──────────────┐
                                          │ Model pods   │
                                          ├──────────────┤
                                          │ replica A    │
                                          │ replica B    │
                                          │ replica C    │
                                          └──────┬───────┘
                                                 ▼
┌──────────────┐      feature fetch       ┌──────────────┐
│ Response     │ ←─────────────────────── │ Online store │
└──────────────┘                          └──────────────┘

Map it clearly.
- toll booth: API gateway and model router doing auth, routing, and rate limits.
- road: request path to model replicas plus feature fetch path to the online store.
- warehouse: model registry, model artifact store, online feature store, cache for repeated prompts or embeddings.
- overflow lane: more replicas, traffic shedding, fallback model, queueing for non-urgent inference.
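The toll booth and overflow lane above can be sketched as a toy router. This is a minimal illustration, not a production load balancer: the `Replica` and `Router` classes, the per-replica `capacity` limit, and the `"shed"` / `"queued"` return values are all hypothetical names chosen for this example.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Replica:
    name: str
    capacity: int       # assumed max in-flight requests per replica
    in_flight: int = 0


@dataclass
class Router:
    primary: list[Replica]
    fallback: list[Replica] = field(default_factory=list)

    def route(self, urgent: bool = True) -> str:
        """Return the chosen replica name, or 'queued' / 'shed' when full."""
        for pool in (self.primary, self.fallback):
            candidates = [r for r in pool if r.in_flight < r.capacity]
            if candidates:
                # Least-loaded pick keeps traffic even across replicas.
                chosen = min(candidates, key=lambda r: r.in_flight / r.capacity)
                chosen.in_flight += 1
                return chosen.name
        # Every pool is full: urgent traffic is shed, the rest waits.
        return "shed" if urgent else "queued"


router = Router(
    primary=[Replica("A", 2), Replica("B", 2)],
    fallback=[Replica("small", 1)],
)
results = [router.route() for _ in range(6)]
# Four requests fill A and B, the fifth overflows to the smaller
# fallback model, and the sixth is shed.
background = router.route(urgent=False)
```

The design choice worth noticing is the order of pools: the expensive primary replicas are tried first, and the cheaper fallback model only absorbs what they cannot.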

Worked example. Suppose you need 12,000 inference requests per second, and one model replica handles 150 requests per second while keeping p95 latency within target.
Step 1: base replicas = 12,000 / 150 = 80.
Step 2: 20% headroom = 80 × 0.20 = 16.
Step 3: planned replicas = 80 + 16 = 96.
Now assume each request also makes two sequential feature lookups averaging 3 ms each, and inference takes 42 ms.
Step 4: feature time = 2 × 3 = 6 ms.
Step 5: total core time = 6 + 42 = 48 ms.
Step 6: if gateway and network add 12 ms, end-to-end estimate = 48 + 12 = 60 ms.
Simple, no? Your model may be accurate, but the HLD fails if the roads into features are slow or if the toll booth sends traffic unevenly.
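The same arithmetic can be written as two small helpers. This is a sketch under stated assumptions: the function names are made up for illustration, headroom is passed as a whole percentage, and the `parallel_lookups` flag is an added what-if that the worked example does not cover (it assumes lookups run one after another).

```python
import math


def planned_replicas(target_rps, rps_per_replica, headroom_percent=20):
    """Base replicas plus headroom, both rounded up to whole replicas."""
    base = math.ceil(target_rps / rps_per_replica)
    return base + math.ceil(base * headroom_percent / 100)


def end_to_end_ms(lookups, lookup_ms, inference_ms, overhead_ms,
                  parallel_lookups=False):
    """Latency estimate: feature fetch + inference + gateway/network."""
    if lookups == 0:
        feature_ms = 0
    elif parallel_lookups:
        feature_ms = lookup_ms           # lookups overlap (assumption)
    else:
        feature_ms = lookups * lookup_ms  # lookups run back to back
    return feature_ms + inference_ms + overhead_ms


replicas = planned_replicas(12_000, 150)
sequential = end_to_end_ms(2, 3, 42, 12)
parallel = end_to_end_ms(2, 3, 42, 12, parallel_lookups=True)
print(replicas, sequential, parallel)
```

With the numbers from the example, this gives 96 replicas and a 60 ms estimate; issuing the two lookups in parallel would trim the estimate to 57 ms, which shows that the feature road is a latency lever independent of replica count.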

3) Feature stores: specialized warehouses for fresh pickup

Many ML systems need the same feature definitions in training and serving. Without discipline, training uses one definition and production uses another. That is training-serving skew. So teams build feature stores with two sides: an offline store for bulk history and backfills, and an online store for low-latency point lookups. Think of them as two connected warehouses. One is huge and slow. One is small and fast.

┌──────────────┐ batch write ┌──────────────┐ materialize ┌──────────────┐
│ Raw data     │ ─────────→  │ Offline FS   │ ─────────→  │ Online FS    │
└──────────────┘             └──────┬───────┘             └──────┬───────┘
                                    │                            │
                                    ▼                            ▼
                              training jobs               live inference
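One common way to limit training-serving skew is to define each feature once and call that single definition from both paths. The sketch below assumes a hypothetical `order_features` function and uses plain dictionaries as stand-ins for the offline and online stores; real feature stores add versioning and point-in-time joins that are not shown here.

```python
def order_features(order_amounts):
    """Single feature definition shared by the offline and online paths."""
    count = len(order_amounts)
    average = sum(order_amounts) / count if count else 0.0
    return {"order_count": count, "avg_order_value": average}


def backfill_offline(history):
    """Offline path: compute features in bulk over historical data."""
    return {user: order_features(amounts) for user, amounts in history.items()}


def update_online(online_store, user, amounts):
    """Online path: write one user's features for low-latency lookup."""
    online_store[user] = order_features(amounts)


history = {"u1": [10.0, 30.0], "u2": []}
offline_table = backfill_offline(history)

online_store = {}
update_online(online_store, "u1", [10.0, 30.0])

# Same inputs through either path produce identical feature values.
print(offline_table["u1"] == online_store["u1"])
```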

Map the placeholders.
- warehouse: offline feature tables, online key-value feature store, metadata catalog, point-in-time snapshots.
- road: batch pipelines for backfills and low-latency reads for inference.
- overflow lane: precomputed hot-feature caches, replication, TTL cleanup, fallback defaults when the online store is late.
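The overflow-lane items "TTL cleanup" and "fallback defaults" can be illustrated with a tiny in-memory online store. This is a sketch only: `OnlineFeatureStore`, its method names, the feature names, and the 120-second TTL are assumptions for the example, and a real online store would be a networked key-value service.

```python
import time


class OnlineFeatureStore:
    """In-memory stand-in for an online key-value feature store."""

    def __init__(self, ttl_seconds, defaults, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._defaults = dict(defaults)
        self._clock = clock          # injectable so tests can control time
        self._rows = {}              # entity_id -> (written_at, features)

    def write(self, entity_id, features):
        self._rows[entity_id] = (self._clock(), dict(features))

    def read(self, entity_id):
        """Return (features, is_fresh); fall back to defaults if stale."""
        row = self._rows.get(entity_id)
        if row is None:
            return dict(self._defaults), False
        written_at, features = row
        if self._clock() - written_at > self._ttl:
            del self._rows[entity_id]            # TTL cleanup on read
            return dict(self._defaults), False
        # Defaults fill in any feature the writer did not provide.
        return {**self._defaults, **features}, True


now = [0.0]  # fake clock, advanced by hand
store = OnlineFeatureStore(
    ttl_seconds=120,
    defaults={"txn_count_5m": 0, "avg_basket": 0.0},
    clock=lambda: now[0],
)
store.write("u1", {"txn_count_5m": 3})
fresh_row = store.read("u1")
now[0] = 121.0
stale_row = store.read("u1")
```

Returning the freshness flag alongside the features lets the caller decide whether a default-filled row is acceptable or whether the request should take a fallback path.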

Worked example. Suppose each user feature record is 2 KB and you serve 5,000 requests per second. Each request needs 12 features, but they are fetched as one joined record.
Step 1: per-second feature bytes = 5,000 × 2 KB = 10,000 KB.
Step 2: convert to MB = 10,000 / 1024 ≈ 9.8 MB per second.
Step 3: per-minute volume = 9.8 × 60 ≈ 588 MB.
Now freshness. Suppose offline recomputation runs every 30 minutes, but fraud scores need freshness under 2 minutes.
Step 4: batch lag gap = 30 - 2 = 28 minutes too stale.
So what to do? Keep slow features in the offline warehouse and stream the hot features into the online warehouse every few seconds.
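The bandwidth and freshness arithmetic can be captured in two helpers. The function names are hypothetical, and the sketch assumes one joined record per request, as in the example.

```python
def feature_bandwidth_mb_per_s(rps, record_kb):
    """Read bandwidth on the online store, one joined record per request."""
    return rps * record_kb / 1024


def staleness_gap_minutes(batch_interval_min, freshness_target_min):
    """How far a batch schedule overshoots the freshness target."""
    return max(0, batch_interval_min - freshness_target_min)


per_second = feature_bandwidth_mb_per_s(5_000, 2)
per_minute = per_second * 60
gap = staleness_gap_minutes(30, 2)
print(round(per_second, 1), round(per_minute), gap)
```

Carried without intermediate rounding, the per-minute figure comes out nearer 586 MB than 588 MB; the difference is only the rounding of 9.77 to 9.8 and does not change the design conclusion. Any positive gap signals that the feature needs a streaming path rather than the batch one.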

4) Training pipelines: factories connected by roads, not one giant script

Many candidates say, "We train a model." That sentence hides five systems. Data ingestion, preprocessing, training, evaluation, and deployment all want separate ownership and observability.

A practical blueprint looks like this.

┌──────────────┐   ┌──────────────┐   ┌──────────────┐   ┌──────────────┐
│ Data ingest  │ → │ Preprocess   │ → │ Train        │ → │ Evaluate     │
└──────┬───────┘   └──────┬───────┘   └──────┬───────┘   └──────┬───────┘
       ▼                  ▼                  ▼                  ▼
 raw data lake     cleaned dataset    model artifact     metrics report
       └────────────────────────────┬───────────────────────────┘
                                    ▼
                             ┌──────────────┐
                             │ Deploy stage │
                             └──────────────┘

Why break it up? Because retries, lineage, approvals, and rollback happen at stage boundaries. One giant script gives you one giant headache.

Map the city pieces.
- road: scheduled batch jobs, event triggers, artifact handoff, approval workflow.
- warehouse: raw lake, processed dataset store, model artifact registry, experiment log store.
- toll booth: orchestrator or pipeline controller that decides what runs, when it runs, and with which version.
- overflow lane: extra workers for preprocessing, spot instances for cheap training, queueing for pending jobs, fallback to previous good model.
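To make "retries and lineage happen at stage boundaries" concrete, here is a minimal orchestrator sketch. It is not any real workflow engine's API: `run_pipeline`, the stage tuples, and the lineage record format are hypothetical, and the stages are toy functions.

```python
def run_pipeline(stages, payload, max_retries=2):
    """Run stages in order, retrying each one on its own.

    Returns (final_payload, lineage). If a stage exhausts its retries,
    returns (None, lineage) so the caller can keep the previous model.
    """
    lineage = []
    for name, stage_fn in stages:
        for attempt in range(1, max_retries + 2):
            try:
                payload = stage_fn(payload)
                lineage.append((name, attempt, "ok"))
                break
            except Exception as exc:
                lineage.append((name, attempt, f"failed: {exc}"))
        else:
            # No attempt succeeded: stop before later stages run.
            return None, lineage
    return payload, lineage


calls = {"n": 0}


def flaky_preprocess(rows):
    """Fails once, then succeeds, to show a retry at the stage boundary."""
    calls["n"] += 1
    if calls["n"] == 1:
        raise RuntimeError("transient read error")
    return [r for r in rows if r is not None]


stages = [
    ("ingest", lambda _: [1, None, 2]),
    ("preprocess", flaky_preprocess),
    ("train", lambda rows: {"weights": sum(rows)}),
]
result, lineage = run_pipeline(stages, None)
```

Only the failed stage is retried; ingest does not run again. That is the practical payoff of splitting the work: one giant script would have to restart from the top.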

Worked example. Suppose you ingest 4 TB per day and preprocessing discards 70% of it, so 30% remains usable for training.
Step 1: cleaned data = 4 TB × 0.30 = 1.2 TB.
Step 2: if one preprocessing worker produces 100 GB of cleaned data per hour, one worker needs 1,200 / 100 = 12 hours.
Step 3: if you need preprocessing done in 3 hours, workers needed = 12 / 3 = 4.
Now training. Suppose one GPU trains 0.15 epochs per hour and you need 3 full epochs before the daily deadline.
Step 4: total GPU-hours = 3 / 0.15 = 20 GPU-hours.
Step 5: with 5 GPUs and near-linear scaling, wall-clock time = 20 / 5 = 4 hours.
See how the pipeline capacity math shapes the blueprint.
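The pipeline capacity math can be sketched as follows. The function names are invented for illustration, the worker rate is taken as cleaned gigabytes per hour, and the training estimate assumes GPUs scale linearly, which real multi-GPU jobs only approximate.

```python
import math


def cleaned_gb(raw_gb, keep_percent):
    """Usable data left after preprocessing (whole percent avoids float fuzz)."""
    return raw_gb * keep_percent / 100


def preprocessing_workers(data_gb, gb_per_worker_hour, deadline_hours):
    """Workers needed to finish within the deadline, rounded up."""
    single_worker_hours = data_gb / gb_per_worker_hour
    return math.ceil(single_worker_hours / deadline_hours)


def training_wall_clock_hours(epochs, epochs_per_gpu_hour, gpus):
    """Wall-clock estimate assuming linear scaling across GPUs."""
    gpu_hours = epochs / epochs_per_gpu_hour
    return gpu_hours / gpus


usable = cleaned_gb(4_000, 30)                     # 4 TB in, 30% kept
workers = preprocessing_workers(usable, 100, 3)
train_hours = training_wall_clock_hours(3, 0.15, 5)
print(usable, workers, round(train_hours, 2))
```

Rounding workers up matters: a requirement of 4.1 workers means provisioning 5, not 4, if the deadline is firm.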

5) GPU clusters and model rollout: special roads, costly overflow lanes

Web traffic scaling is annoying. GPU scaling is expensive and capacity-constrained. That is why AI HLD must treat the cluster itself as a first-class component. GPU clusters need job scheduling, quota enforcement, data locality, and often fast interconnects between machines. These are special roads. If one model needs multi-GPU training, PCIe or NVLink bandwidth inside a node, and the network fabric between nodes, becomes part of the design.

A common pattern is this.

┌──────────────┐ submit ┌──────────────┐ assign ┌────────────────┐
│ ML platform  │ ────→  │ Scheduler    │ ────→  │ GPU nodes      │
└──────────────┘        └──────┬───────┘        ├────────────────┤
                               │                │ training jobs  │
                               │                │ inference jobs │
                               ▼                └───────┬────────┘
                        ┌──────────────┐                ▼
                        │ Quota store  │         model artifacts
                        └──────────────┘                │
                                                        ▼
                                                 ┌──────────────┐
                                                 │ Serving pool │
                                                 └──────────────┘
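The scheduler and quota store in the diagram can be sketched as a small admission check. This is a toy model, not how any particular cluster scheduler works: `GpuScheduler`, `Job`, the team names, and the three status strings are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Job:
    team: str
    gpus: int
    kind: str  # "training" or "inference"


class GpuScheduler:
    """Admits jobs against per-team quotas and a shared GPU pool."""

    def __init__(self, total_gpus, team_quota):
        self.free = total_gpus
        self.quota = dict(team_quota)   # stand-in for the quota store
        self.used = {team: 0 for team in team_quota}
        self.pending = []

    def submit(self, job):
        used = self.used.get(job.team, 0)
        if used + job.gpus > self.quota.get(job.team, 0):
            return "rejected: over quota"
        if job.gpus > self.free:
            # Within quota but the cluster is full: wait in the queue.
            self.pending.append(job)
            return "queued"
        self.free -= job.gpus
        self.used[job.team] = used + job.gpus
        return "running"


scheduler = GpuScheduler(total_gpus=8, team_quota={"ranking": 6, "search": 4})
statuses = [
    scheduler.submit(Job("ranking", 4, "training")),
    scheduler.submit(Job("ranking", 4, "training")),   # would exceed quota of 6
    scheduler.submit(Job("search", 4, "inference")),
    scheduler.submit(Job("ranking", 2, "inference")),  # in quota, no free GPUs
]
```

Quota and capacity are checked separately on purpose: the first protects fairness across teams, the second reflects the physical scarcity of GPUs.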

Now rollout. Never send 100% of users to a fresh model instantly. Use A/B testing or canary deployment.

Worked example. Suppose model V1 serves 10,000 RPS and you want a 5% canary for V2.
Step 1: canary traffic = 10,000 × 0.05 = 500 RPS.
Step 2: if one V2 replica handles 125 RPS, canary replicas = 500 / 125 = 4.
Step 3: if the error budget allows 0.5% failures and V2 shows 2% failures after 20,000 requests, then 2% is 4× the allowed rate. Roll back.
Now GPU scaling. Suppose each inference GPU costs 4× a CPU node, and traffic spikes from 10,000 to 16,000 RPS.
Step 4: extra traffic = 16,000 - 10,000 = 6,000 RPS.
Step 5: at 250 RPS per GPU, extra GPUs = 6,000 / 250 = 24.
That is an expensive overflow lane. So what to do? Mix tactics. Add replicas, cache stable results, route some traffic to a smaller fallback model, and keep canary traffic isolated at the toll booth.
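The rollout and spike arithmetic can be expressed as three helpers. The names are hypothetical, and the rollback rule here is a bare threshold comparison; a real rollout gate would usually also check sample size and latency.

```python
import math


def canary_replicas(total_rps, canary_percent, rps_per_replica):
    """Replicas needed to carry the canary share of traffic."""
    canary_rps = total_rps * canary_percent / 100
    return math.ceil(canary_rps / rps_per_replica)


def should_roll_back(failures, requests, allowed_failure_rate):
    """True when the observed failure rate exceeds the error budget."""
    return failures / requests > allowed_failure_rate


def extra_gpus(baseline_rps, peak_rps, rps_per_gpu):
    """GPUs needed to absorb a spike above the baseline."""
    return math.ceil(max(0, peak_rps - baseline_rps) / rps_per_gpu)


replicas = canary_replicas(10_000, 5, 125)
# 2% of 20,000 requests is 400 failures, against a 0.5% budget.
rollback = should_roll_back(400, 20_000, 0.005)
spike_gpus = extra_gpus(10_000, 16_000, 250)
print(replicas, rollback, spike_gpus)
```

With the example's numbers this yields 4 canary replicas, a rollback decision, and 24 extra GPUs for the spike.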

Where this lives in the wild

Uber Michelangelo — ML platform engineer manages feature pipelines, training jobs, and model rollout through a shared production platform.
DoorDash recommendations — ML infrastructure engineer serves online features with low-latency lookups so ranking models stay fresh during meal rushes.
YouTube recommendations — serving engineer runs many model replicas behind traffic routers because inference latency directly affects session depth.
Netflix personalization — data platform engineer keeps offline historical features and online serving features aligned to reduce training-serving skew.
OpenAI-style LLM serving stacks — capacity planner treats GPU pools, batching, and canary rollouts as first-class HLD decisions, not afterthoughts.

Pause and recall

Why does an AI system usually need both offline and online feature stores?
In model serving, why is replica count not the only latency decision?
Why should a training pipeline be split into stages instead of one giant job?
Why is GPU auto-scaling a more expensive overflow lane than CPU scaling?

Interview Q&A

Q: Why keep offline and online feature stores instead of one universal store? A: Training needs huge historical scans and point-in-time correctness, while serving needs millisecond key lookups. One storage shape rarely does both well. Common wrong answer to avoid: "Because teams like separate databases" — the split is about workload mismatch, not organizational preference.

Q: Why use model replicas behind a load balancer and not one very large server? A: Replicas improve availability, parallelism, rolling deploys, and tail-latency control. One giant server creates a bigger blast radius. Common wrong answer to avoid: "Because one server cannot run ML models" — it can, but HLD cares about resilience and scaling, not only raw possibility.

Q: Why A/B test or canary a new model instead of replacing the old model at once? A: Offline metrics miss real-user feedback loops, skew, and latency surprises. Small traffic first lets you measure impact before widening the blast radius. Common wrong answer to avoid: "Because accuracy on validation data is never useful" — offline evaluation is useful, just insufficient by itself.

Q: Why treat GPU scheduling as part of HLD and not a low-level ops detail? A: GPU scarcity, interconnect topology, quotas, and job mix decide throughput, cost, and fairness across teams. That changes the whole system plan. Common wrong answer to avoid: "Because Kubernetes will handle it automatically" — a scheduler helps, but it does not remove capacity trade-offs.

Apply now (5 min)

Exercise: Design a movie-recommendation serving path for 2,000 RPS. Mark the toll booth, online warehouse, model replicas, and one cheap fallback path if GPU capacity is exhausted.

Sketch from memory: Draw the offline path from raw data to deployed model. Then draw the online path from request to feature lookup to inference. Circle the most expensive overflow lane in your design.

Bridge. We've designed cities of every kind. But some questions stay open, especially where trade-offs refuse to become formulas. → 15-honest-admission.md