02. Training infrastructure

⏱️ Estimated time: 22 min | Level: advanced

ELI5 callback: In our chain, the kitchen trains, the prep station prepares, the recipe book stores, the serving counter serves, and the quality inspector checks. Same restaurant chain, different platform layer. See.

Training infrastructure is shared factory capacity

Training infrastructure exists to turn many ideas into repeatable runs. It is not only about raw GPU count. It is also scheduling, isolation, artifacts, and metadata. See. The kitchen is a shared factory, not one laptop. The prep station must hand clean, point-in-time features into training. The recipe book records which run produced which artifact. The serving counter later depends on tensor shapes fixed here. The quality inspector will need run context when drift appears. So what to do? Design for many concurrent experiments. Different teams, datasets, and budgets will collide daily. Fair scheduling matters as much as peak throughput. Quota policies stop one giant job from starving everyone else. Simple, no? Shared platforms need boring rules.

┌────────┐   ┌────────────┐   ┌───────────┐
│Notebook│──→│ Job submit │──→│ GPU queue │
└────────┘   └────────────┘   └────┬──────┘
                                   │
                         ┌─────────v─────────┐
                         │ training workers  │
                         └───────────────────┘

Plan for multi-tenant fairness from the start.
Separate interactive work from long batch runs.
Track budget per team and per project.
Keep artifacts immutable once a run completes.
Standard templates reduce accidental misconfiguration.
Now watch. Capacity planning becomes the real design game.
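
The quota and fairness rules above can be sketched as a tiny fair-share picker. This is a minimal illustration, not how any particular scheduler works: the names `Job` and `pick_next`, the hard per-team quota, and the rule "the team using the smallest share of its quota goes first" are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Job:
    team: str
    gpus: int


def pick_next(queue, used, quota):
    """Return the next job to run under a simple fair-share rule.

    queue: pending jobs in submission order
    used:  GPUs currently in use per team
    quota: GPU quota per team (hypothetical hard limit)
    """
    # Keep only jobs that would not push their team over quota.
    eligible = [j for j in queue if used.get(j.team, 0) + j.gpus <= quota[j.team]]
    if not eligible:
        return None
    # Favor the team using the smallest share of its quota.
    # min() returns the first minimum, so ties fall back to submission order.
    return min(eligible, key=lambda j: used.get(j.team, 0) / quota[j.team])


# Hypothetical teams and numbers.
queue = [Job("vision", 8), Job("search", 2), Job("vision", 1)]
used = {"vision": 6, "search": 0}
quota = {"vision": 8, "search": 4}
next_job = pick_next(queue, used, quota)
print(next_job)
```

Here the giant `vision` job is held back by quota, and the idle `search` team is served before the small `vision` job. A real platform would likely add priorities and aging on top of a rule like this.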

Distributed training solves size and speed limits

Single-device training hits a wall when model state or batch size outgrows one device's memory, or when one device is simply too slow. Data parallelism splits batches across workers and synchronizes gradients. Model parallelism splits the model itself across devices. Pipeline parallelism splits the model into sequential stages and overlaps micro-batches across them to keep hardware busy. Each approach trades simplicity for scale. See. Communication cost decides whether extra GPUs help or hurt. All-reduce overhead can dominate if per-worker batches are too small. A straggler slows the full job, because every synchronous step waits for the slowest worker. Checkpointing matters because long jobs fail eventually. So what to do? Choose the simplest parallelism that meets the target. Do not add model parallelism just because it sounds elite. Network topology also matters for large clusters. Now watch. Storage and checkpoints become part of training speed.

GPU1 ─ batch A ─┐
GPU2 ─ batch B ─┼─ all-reduce ─ update
GPU3 ─ batch C ─┤
GPU4 ─ batch D ─┘
        │
        v
   next training step
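
The data-parallel step in the diagram can be simulated in plain Python. This is a toy sketch under stated assumptions: a one-parameter model `y = w * x`, equal-sized shards, and an `all_reduce_mean` function that only stands in for a real collective operation. None of these names come from a specific framework.

```python
def local_gradient(w, shard):
    """Gradient of mean squared error for y = w * x on one worker's shard."""
    return sum(2 * (w * x - y) * x for x, y in shard) / len(shard)


def all_reduce_mean(values):
    """Stand-in for all-reduce: every worker ends up with the same average."""
    mean = sum(values) / len(values)
    return [mean] * len(values)


def data_parallel_step(w, shards, lr):
    grads = [local_gradient(w, s) for s in shards]  # runs in parallel on real hardware
    synced = all_reduce_mean(grads)                 # one communication round per step
    return w - lr * synced[0]                       # every replica applies the same update


# Four equal-sized shards of data generated by y = 3x.
data = [(float(x), 3.0 * x) for x in range(1, 9)]
shards = [data[i:i + 2] for i in range(0, 8, 2)]

w = 0.0
for _ in range(200):
    w = data_parallel_step(w, shards, lr=0.01)
print(round(w, 3))
```

With equal-sized shards, the averaged gradient matches the full-batch gradient, so the replicas stay in lockstep. The communication round inside every step is where all-reduce overhead and stragglers show up.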

Scale compute and network together.
Save checkpoints often enough to cap restart pain.
Measure hardware utilization, not only epoch time.
Keep failure recovery automatic when possible.
Use spot capacity only with safe checkpoint intervals.
Simple, no? Parallelism is math plus plumbing.
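
"Often enough" can be made concrete with a back-of-envelope estimate. The sketch below uses Young's first-order approximation, interval ≈ sqrt(2 × checkpoint cost × mean time between failures). The function names and the numbers (a 60-second checkpoint write, one interruption every four hours) are hypothetical, and the approximation assumes checkpoint cost is small compared with the time between failures.

```python
import math


def young_interval(checkpoint_cost_s, mtbf_s):
    """First-order estimate of the checkpoint interval (Young's approximation)."""
    return math.sqrt(2 * checkpoint_cost_s * mtbf_s)


def expected_overhead(interval_s, checkpoint_cost_s, mtbf_s):
    """Approximate fraction of time lost to saving checkpoints and redoing work."""
    write_share = checkpoint_cost_s / interval_s  # time spent saving
    rework_share = (interval_s / 2) / mtbf_s      # about half an interval is lost per failure
    return write_share + rework_share


# Hypothetical numbers: 60 s per checkpoint, one interruption every 4 hours.
cost, mtbf = 60.0, 4 * 3600.0
best = young_interval(cost, mtbf)
print(f"interval ~ {best / 60:.1f} min, overhead ~ {expected_overhead(best, cost, mtbf):.1%}")
```

On spot capacity the time between interruptions shrinks, so the same formula pushes the interval down. Treat the result as a starting point to validate against measured failure rates.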

Experiment tracking keeps science from becoming gossip

A useful experiment log captures code, data, params, and outcomes. Tools like MLflow and Weights & Biases make that visible. But the platform must decide what is mandatory. See. If run names are free-form, search becomes comedy. Log dataset version, feature view version, and environment image. Log metrics over time, not only final numbers. Attach model artifacts, plots, and failure notes. Now compare runs by business goal, not only by timestamp. So what to do? Standardize metadata fields across teams. Require tags for owner, task, dataset, and candidate status. Link runs directly to code commits and pipeline executions. Then promotion and rollback become evidence-driven. Simple, no? Memory is weak; metadata is stronger.

run id → params → metrics
   │         │        │
   ├── code commit    │
   ├── data snapshot  │
   ├── artifacts      │
   └── owner / notes  │
              ↓

Make experiment metadata queryable across teams.
Default dashboards should compare baselines and candidates.
Store failures too; they prevent repeat waste.
Build lineage once, then reuse it everywhere.
The tracker is part of governance, not decoration.
See. Good logs shorten arguments dramatically.
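
Mandatory metadata is easiest to enforce at submission time. The sketch below is one possible shape for that check, written with the standard library only. The tag names in `REQUIRED_TAGS`, the helper names, and the example values are assumptions for illustration, not fields defined by MLflow or Weights & Biases.

```python
# Hypothetical mandatory fields, following the tags named in this section.
REQUIRED_TAGS = (
    "owner", "task", "dataset_version", "feature_view_version",
    "environment_image", "code_commit", "candidate_status",
)


def missing_tags(tags):
    """Return required tag names that are absent or blank, in a stable order."""
    missing = []
    for name in REQUIRED_TAGS:
        value = tags.get(name)
        if value is None or not str(value).strip():
            missing.append(name)
    return missing


def validate_run(tags):
    """Reject a run before it starts if mandatory metadata is missing."""
    missing = missing_tags(tags)
    if missing:
        raise ValueError("run rejected, missing tags: " + ", ".join(missing))
    return True


# Example values are made up.
run_tags = {
    "owner": "team-search",
    "task": "ranking",
    "dataset_version": "2024-05-01",
    "feature_view_version": "v7",
    "environment_image": "train-base:1.4",
    "code_commit": "abc1234",
    "candidate_status": "candidate",
}
validate_run(run_tags)
```

Once a run passes a check like this, the same dictionary can be attached to the run in whichever tracker the platform uses, so search and comparison work across teams.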

GPU orchestration is mostly scheduling discipline

Kubernetes, Slurm, or managed cloud schedulers all face the same tension. You want high utilization without chaos. Jobs need GPUs, CPUs, RAM, network, and storage together. Gang scheduling matters when distributed jobs need all workers simultaneously. Placement matters because cross-rack traffic can crush performance. See. A half-started distributed job is wasted money. Use node pools for hardware classes like A10, A100, or H100. Add admission controls for image size, quota, and priority. Preemption can help urgent jobs, but it must honor checkpoint safety. So what to do? Expose queue state and estimated wait time. Engineers tolerate delay better than mystery. Also track fragmentation, not only aggregate free GPUs. Now watch. Procurement planning depends on these metrics.

submit job
   │
   ├── quota check
   ├── image check
   ├── gang placement
   └── start workers
          ↓

Publish queue rules in plain language.
Match priority levels to business needs, not politics.
Keep capacity dashboards showing both historical trends and current state.
Design around maintenance windows and node failures.
Separate training, fine-tuning, and ad hoc experimentation pools.
Simple, no? Scheduling is product design for engineers.
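
Gang scheduling and fragmentation can be shown together in a few lines. This is a simplified sketch, not the algorithm of Kubernetes or Slurm: `gang_place` is a hypothetical helper that places all workers or none, and it ignores CPU, RAM, and rack topology.

```python
def gang_place(free_gpus, workers, gpus_per_worker):
    """All-or-nothing placement: return {node: worker_count} or None.

    free_gpus: free GPU count per node, e.g. {"node-a": 3}
    """
    placement = {}
    remaining = workers
    # Try nodes with the most free GPUs first so the job spans fewer nodes.
    for node, free in sorted(free_gpus.items(), key=lambda kv: -kv[1]):
        fit = min(free // gpus_per_worker, remaining)
        if fit > 0:
            placement[node] = fit
            remaining -= fit
        if remaining == 0:
            return placement
    return None  # never start a partial gang


# 8 GPUs are free in aggregate, but no single node has 4 free.
fragmented = {"node-a": 3, "node-b": 3, "node-c": 2}
print(gang_place(fragmented, workers=2, gpus_per_worker=4))

# Same job shape on less fragmented capacity.
print(gang_place({"node-a": 8, "node-b": 4}, workers=3, gpus_per_worker=4))
```

The first call fails even though the cluster reports eight free GPUs, which is why aggregate free capacity alone is a misleading dashboard number.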

Guardrails keep training fast, safe, and affordable

Unbounded experimentation can burn money without learning much. Set budget alerts by project, model family, and environment. Enforce approved base images for security and reproducibility. Scan artifacts before storing or promoting them. Keep secrets out of notebooks and job definitions. See. The best platform removes accidental heroics. Offer starter templates for fine-tuning, distributed training, and evaluation. Bake in logging, checkpoints, and metadata capture automatically. Then engineers spend time on models, not glue code. So what to do? Make the paved road obviously easier. Also publish cost per successful experiment family. That metric reveals waste hidden by vanity throughput numbers. Now the platform can improve itself with real evidence.

template → submit → train → track
   │         │       │       │
   ├─ policy ├─ quota├─ cost ├─ lineage
   │         │       │       │
   └─────────┴───────┴───────┘
              ↓
         safe iteration

Guardrails should feel helpful, not punitive.
Automate the boring parts before scaling cluster size.
Cost visibility changes behavior faster than lectures.
Security baselines must ride inside the platform path.
Reproducibility is a feature engineers notice only when absent.
See. Good infra teaches good habits by default.
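
Cost per successful experiment family can be computed from run records with very little code. The sketch below assumes a hypothetical record layout (`family`, `gpu_hours`, `usd_per_gpu_hour`, `succeeded`). What counts as "succeeded" is a policy choice the platform has to define.

```python
from collections import defaultdict


def cost_per_success(runs):
    """Total spend per family divided by its number of successful runs.

    Failed runs still count toward spend. Families with no successful
    run map to None rather than a misleading number.
    """
    spend = defaultdict(float)
    wins = defaultdict(int)
    for run in runs:
        spend[run["family"]] += run["gpu_hours"] * run["usd_per_gpu_hour"]
        wins[run["family"]] += 1 if run["succeeded"] else 0
    return {fam: (spend[fam] / wins[fam] if wins[fam] else None) for fam in spend}


# Made-up run records.
runs = [
    {"family": "ranker-v2", "gpu_hours": 10, "usd_per_gpu_hour": 2.0, "succeeded": True},
    {"family": "ranker-v2", "gpu_hours": 30, "usd_per_gpu_hour": 2.0, "succeeded": False},
    {"family": "llm-tune", "gpu_hours": 50, "usd_per_gpu_hour": 4.0, "succeeded": False},
]
report = cost_per_success(runs)
print(report)
```

Because failed runs stay in the numerator, a family that burns GPU hours without producing a usable model stands out immediately, which raw throughput numbers would hide.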

Where this lives in the wild

A foundation-model team tunes gang scheduling because partial starts waste expensive GPUs.
A fintech ML group mandates MLflow lineage so audit questions can be answered quickly.
A marketplace platform separates exploratory notebook pools from production training pools.
A research lab tracks cost per successful run family to justify hardware expansion.
A cloud AI platform exposes queue wait estimates so teams can plan experiments sanely.

Pause and recall

Why is training infrastructure more than a pile of GPUs?
When does extra parallelism stop helping and start hurting?
Why must experiment tracking include code and data together?
What makes gang scheduling important for distributed training jobs?

Interview Q&A

Q: How would you design a shared training platform? A: I would cover job submission, quotas, scheduling, distributed execution, experiment tracking, artifact storage, and budget guardrails as one system. Common wrong answer to avoid: I would buy more GPUs first and let each team manage itself.

Q: Why do teams use MLflow or Weights & Biases? A: They preserve run context, artifacts, comparisons, and lineage so experiments remain reproducible and promotion decisions stay explainable. Common wrong answer to avoid: They are mainly for pretty charts during demos.

Q: What is the hard part of GPU orchestration? A: The hard part is fair, efficient scheduling under multi-resource constraints, failures, fragmentation, and distributed job startup needs. Common wrong answer to avoid: The hard part is writing a Dockerfile.

Q: How do you control training cost? A: Use quotas, budget alerts, efficient checkpointing, right-sized hardware pools, and standardized templates that avoid wasteful retries. Common wrong answer to avoid: Tell engineers to be more careful and hope for the best.

Apply now (5 min)

List every field your experiment tracker must capture for one training run. Then circle three fields your current setup forgets most often. Next, sketch a queue policy for small, medium, and urgent jobs. Add one budget alert threshold for each job class. Finally, decide checkpoint frequency for a six-hour run. You now have the first draft of a training platform policy.

Bridge. Models trained. But they need features: preprocessed ingredients. → 03