AI Infrastructure

Use this track for the platform layer beneath AI products: backend APIs, model serving, vector retrieval infrastructure, MLOps, and cost/latency economics.

This track is not the starting point for AI Engineering. Start with `../01_ai_engineering/` for agent and product architecture, then come here when the system needs concrete infrastructure decisions.

| Module | Focus | Folder |
| --- | --- | --- |
| 00 | AI backend API engineering | `00_ai_backend_api_engineering/` |
| 01 | Model gateway and provider operations | `01_model_gateway_provider_ops/` (placeholder) |
| 02 | Inference serving systems | `02_inference_serving_systems/` |
| 03 | Vector retrieval infrastructure | `03_vector_retrieval_infrastructure/` |
| 04 | ML platform operations | `04_ml_platform_operations/` |
| 05 | Agent performance economics | `05_agent_performance_economics/` |
| 06 | AI runbooks and on-call operations | `06_ai_runbooks_oncall/` (placeholder) |
| 07 | Tool execution sandboxes | `07_tool_execution_sandboxes/` (placeholder) |
| 08 | Distributed training systems: memory wall, data/tensor/pipeline parallelism, ZeRO/FSDP, 3D parallelism, checkpointing at scale | `08_distributed_training_systems/` |
| 09 | GPU acceleration stack: roofline, CUDA/kernel fusion, NCCL, TensorRT-LLM, Triton, NIM, NeMo, MIG/cluster scheduling | `09_gpu_acceleration_stack/` |