00. AI Platform System Design: The Five-Year-Old Version
An AI platform is a kitchen that turns raw ingredients into meals, serves them hot, and improves the recipe over time.
Imagine a restaurant chain. Hundreds of locations. The central kitchen develops recipes (trains models). Each restaurant has a serving counter that plates meals for customers in under 2 seconds (inference). A recipe book stores every version of every recipe so you can roll back if customers complain (model registry).
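The recipe book idea can be sketched in a few lines of Python. This is a toy in-memory registry, not how any particular product (MLflow included) implements it; the class name `ModelRegistry` and its methods are hypothetical and exist only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class ModelRegistry:
    """Toy in-memory recipe book: keeps every version and tracks which one is live."""
    versions: dict = field(default_factory=dict)  # name -> list of stored artifacts
    live: dict = field(default_factory=dict)      # name -> live version number (1-based)

    def register(self, name, artifact):
        """Store a new version and return its version number."""
        self.versions.setdefault(name, []).append(artifact)
        return len(self.versions[name])

    def promote(self, name, version):
        """Mark a stored version as the one the serving counter should use."""
        if not 1 <= version <= len(self.versions.get(name, [])):
            raise KeyError(f"{name} has no version {version}")
        self.live[name] = version

    def rollback(self, name):
        """Step back one version, e.g. after customers complain."""
        current = self.live[name]
        if current <= 1:
            raise ValueError("no earlier version to roll back to")
        self.live[name] = current - 1
        return self.live[name]

    def get_live(self, name):
        """Return the artifact that is currently live."""
        return self.versions[name][self.live[name] - 1]


registry = ModelRegistry()
registry.register("pasta", "recipe-v1")
v2 = registry.register("pasta", "recipe-v2")
registry.promote("pasta", v2)
registry.rollback("pasta")          # customers disliked v2
print(registry.get_live("pasta"))   # prints recipe-v1
```

Nothing is ever deleted here, which is the point: rolling back only moves a pointer, so it is fast and reversible.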
Before cooking, chefs prep ingredients — washing, chopping, marinating. The prep station does this work ahead of time so the kitchen isn't slow during dinner rush. That is the feature pipeline — computing features offline so inference is fast.
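A minimal sketch of the prep station, assuming a made-up order log and two made-up features (`order_count`, `avg_order_total`). A real feature store adds storage, freshness, and consistency guarantees that this toy ignores; the split between a slow offline step and a cheap online lookup is the part that carries over.

```python
from collections import defaultdict

# Hypothetical raw order log: (customer_id, order_total)
raw_orders = [("alice", 20.0), ("bob", 12.5), ("alice", 30.0)]


def build_feature_table(orders):
    """Offline prep step: aggregate raw events into per-customer features."""
    totals, counts = defaultdict(float), defaultdict(int)
    for customer, amount in orders:
        totals[customer] += amount
        counts[customer] += 1
    return {
        c: {"order_count": counts[c], "avg_order_total": totals[c] / counts[c]}
        for c in counts
    }


# Run ahead of time (for example nightly); the result stands in for the online store.
feature_table = build_feature_table(raw_orders)

# Fallback for customers the offline job has not seen yet.
DEFAULT_FEATURES = {"order_count": 0, "avg_order_total": 0.0}


def get_features(customer):
    """Online step: a dictionary lookup, with no aggregation at request time."""
    return feature_table.get(customer, DEFAULT_FEATURES)


print(get_features("alice"))
```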
Sometimes a new recipe flops. Customers spit it out. The taste test catches this before rollout: a small group tries the new dish, their ratings are compared with ratings for the old recipe, and only winners go chain-wide. That is model evaluation through A/B testing. Showing the new version to a small share of customers first is a canary deployment.
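One common way to compare the two groups of ratings is a two-proportion z-test. The sketch below uses only the standard library; the counts (400 of 1,000 diners liked the old recipe, 460 of 1,000 liked the new one) and the 0.05 significance threshold are assumptions chosen for illustration, not figures from a real experiment.

```python
import math


def two_proportion_z_test(success_a, n_a, success_b, n_b):
    """Return (z, two-sided p-value) for H0: both variants have the same success rate."""
    p_a, p_b = success_a / n_a, success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal distribution, via erfc.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical taste test: old recipe (A) vs. new recipe (B).
z, p_value = two_proportion_z_test(400, 1000, 460, 1000)

# Roll out only if the new recipe is better AND the gap is unlikely to be noise.
roll_out = z > 0 and p_value < 0.05
print(f"z={z:.2f}, p={p_value:.4f}, roll_out={roll_out}")
```

A positive `z` alone is not enough: with small groups, a new recipe can look better by chance, which is why the rollout decision also checks the p-value.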
Finally, ingredients change with seasons. Tomatoes in summer taste different from tomatoes in winter. If the kitchen doesn't adapt, meal quality drifts. The quality inspector checks if today's dishes still taste as good as last month's. That is monitoring for model drift: performance degrades over time as the data the model sees shifts away from the data it was trained on.
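One simple drift check is the Population Stability Index (PSI), which compares the histogram of a feature at training time with its histogram today. The sketch below is a standard-library version under stated assumptions: equal-width bins taken from the reference sample, a small `eps` to avoid dividing by zero, and the common rule of thumb that a PSI above 0.25 signals a large shift. The sample data is synthetic.

```python
import math


def psi(expected, actual, bins=10, eps=1e-6):
    """Population Stability Index between a reference sample and a recent sample."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 if the reference is constant

    def histogram(values):
        counts = [0] * bins
        for v in values:
            # Clamp values outside the reference range into the edge bins.
            idx = min(max(int((v - lo) / width), 0), bins - 1)
            counts[idx] += 1
        # Floor each share at eps so empty bins do not break the logarithm.
        return [max(c / len(values), eps) for c in counts]

    e, a = histogram(expected), histogram(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


# Synthetic "summer" feature values and a shifted "winter" sample.
reference = [i / 100 for i in range(100)]
recent = [v + 0.5 for v in reference]

drift_score = psi(reference, recent)
needs_attention = drift_score > 0.25  # rule-of-thumb threshold, an assumption
print(needs_attention)
```

PSI only looks at inputs. It can flag that the tomatoes changed, but confirming that the dishes got worse still requires comparing predictions against real outcomes.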
An AI platform ties all these together: training pipelines, feature stores, model registries, serving infrastructure, evaluation frameworks, and monitoring systems. It's not one tool — it's the full kitchen operation from raw data to served prediction.
Why is this different from regular software platforms? Because ML has three things to maintain where regular software has one. Regular software has code. ML has code AND data AND models. When data changes, the model may break even though no code changed. When the model updates, the serving infrastructure must handle the new version without downtime.
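One way to make "no code changed, but something did" visible is to fingerprint all three parts of a release and compare them. This is an illustrative sketch only: the `Release` class, the `changed_parts` helper, and the byte strings being hashed are hypothetical stand-ins for a real commit, dataset snapshot, and weights file.

```python
import hashlib
from dataclasses import dataclass


def fingerprint(payload: bytes) -> str:
    """Short content hash; any change to the bytes changes the fingerprint."""
    return hashlib.sha256(payload).hexdigest()[:12]


@dataclass(frozen=True)
class Release:
    """What is actually running: code, training data, and model, versioned together."""
    code: str
    data: str
    model: str


def changed_parts(old: Release, new: Release) -> list:
    """Name the parts that differ between two releases."""
    return [part for part in ("code", "data", "model")
            if getattr(old, part) != getattr(new, part)]


yesterday = Release(
    code=fingerprint(b"train.py v1"),
    data=fingerprint(b"summer tomatoes"),
    model=fingerprint(b"weights trained on summer tomatoes"),
)
today = Release(
    code=yesterday.code,                       # no code change
    data=fingerprint(b"winter tomatoes"),      # the data moved anyway
    model=yesterday.model,                     # same model, now facing new data
)

print(changed_parts(yesterday, today))  # prints ['data']
```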
The iteration cycle matters too. A software engineer can often ship a code fix in hours. Retraining a model can take days or weeks. The kitchen must support rapid experimentation, with dozens of training experiments running at once, while the serving counter keeps production stable.
Scale complicates everything. Training a large language model can cost millions of dollars and take weeks on hundreds or thousands of GPUs. Serving it requires specialized accelerators such as A100 or H100 GPUs, or TPUs. A single model might handle 10,000 requests per second. The infrastructure behind that is not a simple web server; it is a distributed system optimized for tensor operations.
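A back-of-envelope calculation shows why 10,000 requests per second does not fit on one machine. The per-replica throughput (250 requests per second) and the 30% headroom are made-up assumptions; the 2-second figure reuses the serving-counter latency from the restaurant example as an upper bound. Little's law (average concurrency = arrival rate × time in system) gives the number of requests in flight.

```python
import math


def replicas_needed(target_rps, per_replica_rps, headroom=0.3):
    """Replicas required to serve target_rps while keeping a share of capacity spare."""
    usable = per_replica_rps * (1 - headroom)
    return math.ceil(target_rps / usable)


def in_flight_requests(target_rps, latency_seconds):
    """Little's law: average concurrency = arrival rate x time in system."""
    return target_rps * latency_seconds


# Assumed numbers: 10,000 RPS target, 250 RPS per model replica, 30% headroom.
replicas = replicas_needed(10_000, 250)

# Worst case from the serving-counter example: each request takes 2 seconds.
concurrency = in_flight_requests(10_000, 2.0)

print(replicas, concurrency)
```

Changing either assumption moves the answer a lot, which is why load testing a real replica matters more than the formula.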
The AI platform sits at the intersection of data engineering, ML engineering, and infrastructure. Data flows from the data platform (Module 11) into training pipelines. Trained models get served on Kubernetes (Module 08). Monitoring connects to observability systems (Module 09). Security protects model weights and training data (Module 10). Everything connects.
The placeholders you will see called back
| Placeholder | Meaning |
|---|---|
| kitchen | training infrastructure — GPU clusters, experiment tracking, hyperparameter tuning |
| serving counter | inference system — model servers, load balancing, latency SLOs |
| recipe book | model registry — versioning, metadata, lineage, approval workflows |
| prep station | feature pipeline — offline/online feature computation and caching |
| quality inspector | monitoring — drift detection, performance metrics, automated retraining triggers |
Top resources
- Designing Machine Learning Systems by Chip Huyen — the definitive book on ML systems end-to-end
- MLOps.community — community resources, case studies, and practitioner talks
- Google ML Best Practices — practical rules for production ML
- Feast Feature Store Documentation — open-source feature store patterns
- MLflow Documentation — experiment tracking, model registry, and deployment
What's coming
- 01-ml-lifecycle-overview.md — from problem framing to production; the full loop
- 02-training-infrastructure.md — distributed training, experiment tracking, and GPU orchestration
- 03-feature-engineering-stores.md — offline features, online serving, and consistency guarantees
- 04-model-registry-versioning.md — storing, tagging, approving, and rolling back models
- 05-serving-and-inference.md — real-time vs. batch, autoscaling, and latency optimization
- 06-evaluation-ab-testing.md — offline metrics, shadow mode, canary rollouts, and statistical tests
- 07-monitoring-and-drift.md — data drift, concept drift, alerting, and retraining triggers
- 08-honest-admission.md — what we don't fully understand about AI platforms
Bridge. The restaurant chain starts with understanding the full menu. Let's see the ML lifecycle end-to-end before zooming into each station. → 01-ml-lifecycle-overview.md