05. GPU instances and clusters: A100, H100, multi-GPU, and NVLink
⏱️ Estimated time: 21 min | Level: advanced
ELI5 callback: In the dragon farm, the barn runs the work, the feeding trough holds the data, the fence limits access, the breeding ground scales the herd, and the ledger stops waste. Today we look at the hardware that actually trains and serves serious models.
1) See the shape clearly
A100 GPUs, H100 GPUs, and multi-GPU clusters all matter here, but they do not relieve the same pressure. See. Start with workload shape, not vendor branding. Check startup time, runtime length, and host control. Check who patches the base layer. Check whether scale is steady or bursty. Check whether warm state must survive. Simple, no?
A100 class hardware remains common for training and heavy inference. H100 class hardware brings higher performance, faster memory, and better scaling. Multi-GPU clusters matter when one card cannot hold the model or batch. The cloud menu is not enough; interconnect and topology matter too.
So what to do? Write the fit matrix before provisioning anything.
- Prioritise the slowest or costliest path.
- Measure idle time honestly.
- Record operational ownership.
- Record rollback method.
- Record debugging path.
- Record compliance limits.
Good teams choose boring defaults first. Fancy choices can wait.
2) Read the decision signals
Choose by model size, batch size, precision plan, and latency target. Instance families differ in GPU count, CPU balance, local NVMe, and network. NVLink or similar fast interconnect helps when GPUs talk constantly. Data parallel, tensor parallel, and pipeline parallel choices drive cluster design. Inference may need fewer GPUs but stricter latency and memory placement. Scarcity and quota can matter as much as raw benchmark numbers.
Now use thresholds, not feelings. If latency is sacred, keep capacity warm and ready. If cost is sacred, chase utilisation carefully. If control is sacred, reduce abstraction. If delivery speed is sacred, buy managed pieces.
Quick decision prompts:
- How large is the model in serving precision?
- How much activation memory appears at target batch size?
- Do GPUs need fast peer-to-peer traffic?
- Can jobs tolerate queueing for scarce capacity?
- Will spot or reserved capacity change the plan?
- Is the bottleneck compute, memory, or input pipeline?
See. One clear 'no' can eliminate a whole option. Trade-offs are normal. Document the fallback path. Now watch.
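The first prompt is mostly arithmetic: parameter count times bytes per parameter. Here is a minimal sketch of that estimate. The function names, the 30% headroom, the 80 GB card size, and the example model sizes are assumptions for illustration, not measured values. Activation memory is not modelled; it has to come from profiling.

```python
# Bytes per parameter for common precisions (standard storage sizes).
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "bf16": 2, "int8": 1}


def weight_memory_gb(num_params: float, precision: str) -> float:
    """Memory for the model weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[precision] / 1e9


def fits_on_one_gpu(num_params: float, precision: str,
                    gpu_memory_gb: float, headroom: float = 0.3) -> bool:
    """True if the weights fit while `headroom` of the card stays free.

    `headroom` is an assumed reserve for activations, caches, and
    framework overhead. Replace it with a profiled figure.
    """
    usable_gb = gpu_memory_gb * (1.0 - headroom)
    return weight_memory_gb(num_params, precision) <= usable_gb


# Hypothetical 7B and 70B parameter models on an assumed 80 GB card.
print(weight_memory_gb(7e9, "fp16"))         # 14.0
print(fits_on_one_gpu(7e9, "fp16", 80.0))    # True
print(fits_on_one_gpu(70e9, "fp16", 80.0))   # False
```

A 'False' here is the one clear 'no' that pushes the plan toward sharding or lower precision.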
3) Map the working path
GPU planning starts before the first cluster request. Model size, data path, and interconnect must agree. Otherwise you rent expensive cards and still wait. Now watch the clean sketch.
┌────────────┐   ┌────────────┐   ┌────────────┐
│  Dataset   │──→│ Scheduler  │──→│  GPU Node  │
└────────────┘   └─────┬──────┘   └─────┬──────┘
                       │                │
                       ▼                ▼
                 ┌────────────┐   ┌────────────┐
                 │ Checkpoint │   │ Telemetry  │
                 └────────────┘   └────────────┘
The scheduler places the job on suitable instance families. Each GPU node must have the right driver, image, and network setup. Checkpoints need durable storage because training will fail eventually. Telemetry should show GPU utilisation, memory, and input stall time. Cluster health also depends on networking between nodes, not only cards. If GPUs wait for data, the storage plan is the bottleneck.
At every arrow, ask who retries. At every box, ask who pays. At every store, ask what expires. Now watch. One metric should sit beside each box. That is how operations stays sane.
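Input stall time is one metric that can sit beside the GPU Node box. A rough sketch follows, assuming the training loop already records how long each step waited for data and how long it computed. `StepTiming` and `input_stall_fraction` are hypothetical names, not part of any library.

```python
from dataclasses import dataclass


@dataclass
class StepTiming:
    data_wait_s: float  # seconds the step waited for the next batch
    compute_s: float    # seconds spent on forward, backward, optimizer


def input_stall_fraction(steps: list[StepTiming]) -> float:
    """Share of wall-clock step time spent waiting on the input pipeline."""
    total_wait = sum(s.data_wait_s for s in steps)
    total = total_wait + sum(s.compute_s for s in steps)
    return total_wait / total if total > 0 else 0.0


# Made-up timings for two steps: 0.4 s waiting out of 1.0 s in total.
timings = [StepTiming(0.1, 0.3), StepTiming(0.3, 0.3)]
print(round(input_stall_fraction(timings), 2))  # 0.4
```

A high fraction points at storage or data loading, not at the cards.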
4) Notice the common traps
- Choosing by GPU name only and ignoring memory size.
- Ignoring node-to-node bandwidth in distributed training.
- Using giant clusters before single-node profiling is done.
- Forgetting checkpoint cadence on long expensive runs.
- Buying premium GPUs for workloads bottlenecked elsewhere.
- Assuming inference and training need the same topology.
See. Most outages start as silent assumptions. Review these traps before launch:
- Quota shortages can block launch dates.
- Fragmented capacity can delay scale-out.
- Poor interconnect can erase multi-GPU gains.
- Mis-sized CPU or RAM can starve expensive GPUs.
- Checkpoint loss can destroy days of spend.
- Thermal or driver issues can reduce usable fleet size.
Simple, no? Write failure drills for the top three risks. Decide what degrades first. Decide what must never degrade. Review quotas before launch day. Prefer explicit limits over wishful thinking. Now watch.
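The checkpoint trap can be put in money terms. A minimal sketch of the worst-case spend lost between two checkpoints: the function name is hypothetical, and the GPU count and hourly price are made-up numbers, not a quote from any provider.

```python
def spend_at_risk(checkpoint_interval_h: float, gpu_count: int,
                  price_per_gpu_hour: float) -> float:
    """Worst-case compute cost lost if a failure hits just before
    the next checkpoint is written."""
    return checkpoint_interval_h * gpu_count * price_per_gpu_hour


# Assumed example: 64 GPUs at a hypothetical 3.0 per GPU-hour,
# checkpointing every 2 hours.
print(spend_at_risk(2.0, 64, 3.0))  # 384.0
```

Halving the interval halves the exposure, at the price of more checkpoint writes.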
5) Lock the operating routine
Profile one GPU first, then scale the topology deliberately. Write model memory estimates before booking hardware. Choose instance families by GPU count and network shape together. Plan checkpoint storage, resume strategy, and image versions. Track utilisation, memory pressure, and data loading stalls. Reserve scarce capacity early when launches matter. Lock the language across the team. Use the same terms in code, dashboards, and reviews.
Review this quick operating list:
- Measure tokens or samples per dollar.
- Measure scaling efficiency across GPU counts.
- Review peer-to-peer interconnect support.
- Keep golden images reproducible.
- Test resume from checkpoint.
- Align quotas with roadmap dates.
Good platform design keeps the barn, feeding trough, fence, breeding ground, and ledger aligned. So what to do? Create a one-page runbook. Create a one-page cost note. Create a one-page rollback note. Teach the team the same words. That alignment saves real money.
See. Consistency beats cleverness. Benchmark first; opinions come second. Name the owner of every limit. Prefer reversible choices whenever the future is foggy. Document what changes during incidents. Keep one small default path for newcomers. Automate the boring thing as soon as it stabilises. Vendor docs help, but workload data matters more. Good naming prevents bad tickets. Observe p95, not only averages. Small runbooks beat heroic memory. Teach cost with the same seriousness as latency. Now watch how much confusion disappears.
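The first two items on the operating list can be computed from numbers a benchmark already produces. A small sketch follows; the function names, throughputs, and prices are hypothetical and only show the arithmetic.

```python
def scaling_efficiency(throughput_n: float, throughput_1: float,
                       n_gpus: int) -> float:
    """Measured throughput on n GPUs divided by ideal linear scaling."""
    return throughput_n / (throughput_1 * n_gpus)


def samples_per_dollar(throughput_per_s: float, n_gpus: int,
                       price_per_gpu_hour: float) -> float:
    """Samples (or tokens) processed per unit of currency spent."""
    cost_per_s = n_gpus * price_per_gpu_hour / 3600.0
    return throughput_per_s / cost_per_s


# Assumed measurements: 100 samples/s on 1 GPU, 700 samples/s on 8 GPUs.
print(scaling_efficiency(700.0, 100.0, 8))  # 0.875
# Same runs at a hypothetical 3.6 per GPU-hour.
print(round(samples_per_dollar(100.0, 1, 3.6)))  # 100000
print(round(samples_per_dollar(700.0, 8, 3.6)))  # 87500
```

In this made-up case the 8-GPU run finishes sooner but each sample costs more. That is the trade-off to write into the cost note.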
Where this lives in the wild
- AWS p4d and p5 style GPU clusters. Common reference point for A100 and H100 cloud training capacity.
- NVIDIA DGX Cloud environments. Shows a managed route when teams want fast access to high-end GPU fleets.
- CoreWeave style specialised GPU clouds. Good example of providers optimised around scarce accelerator supply.
- Vertex AI or SageMaker distributed training pools. Managed platforms often hide part of the cluster plumbing.
- On-prem or hosted NVLink rich boxes for heavy inference. Useful when one machine must hold huge models with fast peer links.
Pause and recall
- Why is GPU name alone not enough? Say it without looking up vendor names.
- When does NVLink or similar interconnect matter most? Give one concrete example.
- What can starve a GPU even when the card is powerful? State the trade-off in one line.
- Why should quota planning sit beside architecture planning? Mention one failure mode too.
Interview Q&A
Q. How do you choose between A100 and H100 class hardware?
A. Compare model size, target throughput, memory needs, software support, and actual availability.
Common wrong answer to avoid: Always choose the newest GPU if budget allows.
Better direction: Mention scarcity, software fit, and cost efficiency too.
Q. When does multi-GPU become necessary?
A. When one card cannot hold the model, or throughput targets need parallel execution.
Common wrong answer to avoid: Use more GPUs whenever training feels slow.
Better direction: Explain memory limits and scaling efficiency.
Q. Why does NVLink matter?
A. Fast peer traffic helps workloads where GPUs exchange activations or shards frequently.
Common wrong answer to avoid: It is just a marketing feature.
Better direction: Tie it to communication-heavy parallel training.
Q. What should every long run include?
A. Checkpointing, telemetry, resume testing, and a capacity fallback plan.
Common wrong answer to avoid: A big cluster and optimism.
Better direction: Show how you protect both time and money.
Apply now (5 min)
- Choose one model size you care about.
- Estimate memory at your target precision.
- Estimate whether one GPU can hold it.
- If not, note whether sharding or a smaller batch size is needed.
- Write the likely bottleneck: compute, memory, or input.
- Write the checkpoint interval you would start with.
- Write one quota risk.
- Write one cheaper fallback hardware plan.
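For the checkpoint interval step, one widely used starting point is Young's approximation: interval ≈ sqrt(2 × checkpoint write time × mean time between failures). It is a rule of thumb brought in here for illustration, not something this lesson prescribes, and both inputs below are assumed numbers you would replace with your own measurements.

```python
import math


def young_checkpoint_interval_s(checkpoint_write_s: float,
                                mtbf_s: float) -> float:
    """Young's approximation for the checkpoint interval, in seconds.

    checkpoint_write_s: time to write one checkpoint (assumed, measure it).
    mtbf_s: mean time between failures for the whole job (assumed).
    """
    return math.sqrt(2.0 * checkpoint_write_s * mtbf_s)


# Assumed inputs: 60 s per checkpoint write, one failure per day on average.
interval = young_checkpoint_interval_s(60.0, 24 * 3600.0)
print(f"{interval / 60:.1f} minutes")
```

Treat the result as a first guess, then adjust it after the first real resume test.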
Bridge. GPUs provisioned. But managing them manually is painful. → 06