11. Honest admission: multi-cloud, GPU economics, sustainability, and lock-in
⏱️ Estimated time: 16 min | Level: advanced
ELI5 callback: In the dragon farm, the barn runs the work, the feeding trough holds the data, the fence limits access, the breeding ground scales the herd, and the ledger stops waste. Today we talk about the parts of cloud AI infrastructure that remain genuinely messy.
1) See the shape clearly
Multi-cloud portability, GPU shortage economics, and sustainability trade-offs all matter here. They do not optimise for the same thing. See.

Start with workload shape, not vendor branding. Check startup time, runtime length, and host control. Check who patches the base layer. Check whether scale is steady or bursty. Check whether warm state must survive. Simple, no?

Multi-cloud sounds safe, but portability often costs speed and simplicity. GPU economics depend on scarce supply, changing prices, and queue risk. Sustainability goals can conflict with performance and latency goals. Vendor lock-in is not evil by itself; hidden lock-in is the problem.

So what to do? Write the fit matrix before provisioning anything.

- Prioritise the slowest or costliest path.
- Measure idle time honestly.
- Record operational ownership.
- Record rollback method.
- Record debugging path.
- Record compliance limits.

Good teams choose boring defaults first. Fancy choices can wait.
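A fit matrix can be as small as one record per workload. The sketch below is one possible shape, not a prescribed format: the `FitRow` fields mirror the checks and records listed above, while the class name, field names, and the example workload are assumptions made for illustration. The helper reports which questions nobody has answered yet.

```python
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass
class FitRow:
    """One fit-matrix row: workload shape first, then the recorded answers.

    All field names are hypothetical; adapt them to your own checklist.
    """
    workload: str
    startup_seconds: Optional[float] = None       # how fast it must start
    runtime_minutes: Optional[float] = None       # typical run length
    bursty: Optional[bool] = None                 # steady or bursty scale
    warm_state_required: Optional[bool] = None    # must warm state survive?
    base_layer_patched_by: Optional[str] = None   # who patches the base layer
    operational_owner: Optional[str] = None
    rollback_method: Optional[str] = None
    debugging_path: Optional[str] = None
    compliance_limits: Optional[str] = None


def unanswered(row: FitRow) -> List[str]:
    """Return the names of fields that are still None, in declaration order."""
    return [f.name for f in fields(row) if getattr(row, f.name) is None]


# Hypothetical workload with only part of the matrix filled in.
row = FitRow(
    workload="nightly-finetune",
    startup_seconds=300,
    runtime_minutes=240,
    bursty=False,
    warm_state_required=False,
    operational_owner="ml-platform",
)
gaps = unanswered(row)
print("Fill these in before provisioning:", gaps)
```

Any non-empty `gaps` list is a prompt to keep writing before anything gets provisioned.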
2) Read the decision signals
Portability is easiest at the container, model, and workflow boundary, not at every managed feature. Scarce accelerators make capacity planning partly an economic problem. Carbon, power, and utilisation should influence scheduling more often than they do. Different clouds will not expose identical networking, IAM, and ML features. Sometimes the right answer is deliberate lock-in for a time-boxed win. The mature move is naming trade-offs clearly, not pretending they vanished.

Now use thresholds, not feelings. If latency is sacred, keep capacity warm and ready. If cost is sacred, chase utilisation carefully. If control is sacred, reduce abstraction. If delivery speed is sacred, buy managed pieces.

Quick decision prompts:

- Which layer truly needs portability?
- What is the cost of abstraction over vendor features?
- How scarce is the target accelerator family?
- Can workload timing shift to cheaper or greener windows?
- What would migration really require?
- What lock-in is acceptable for the next two years?

See. One clear 'no' can eliminate a whole option. Trade-offs are normal. Document the fallback path. Now watch.
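"Thresholds, not feelings" can be sketched as a screening pass in which each hard limit is a yes/no check and a single failure removes an option. Everything in this example is made up for illustration: the option names, their attributes, and the two thresholds are assumptions, not recommendations.

```python
from typing import Callable, Dict, List

Attrs = Dict[str, float]

# Hypothetical options with illustrative numbers only.
options: Dict[str, Attrs] = {
    "managed_platform": {"cold_start_s": 5, "exit_months": 9},
    "containers_on_k8s": {"cold_start_s": 40, "exit_months": 3},
    "bare_gpu_vms": {"cold_start_s": 120, "exit_months": 2},
}

# Hard thresholds written down in advance (assumed values).
checks: Dict[str, Callable[[Attrs], bool]] = {
    "cold start within 60 s": lambda o: o["cold_start_s"] <= 60,
    "exit within 6 months": lambda o: o["exit_months"] <= 6,
}


def screen(opts: Dict[str, Attrs],
           rules: Dict[str, Callable[[Attrs], bool]]) -> Dict[str, List[str]]:
    """Map each option to the list of checks it fails."""
    return {
        name: [label for label, passes in rules.items() if not passes(attrs)]
        for name, attrs in opts.items()
    }


failures = screen(options, checks)
# One clear 'no' is enough to drop an option.
survivors = [name for name, failed in failures.items() if not failed]

for name, failed in failures.items():
    print(name, "->", "ok" if not failed else "eliminated by: " + ", ".join(failed))
```

Keeping the failed checks alongside each eliminated option gives you the "why" to record in the trade-off document.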
3) Map the working path
Open problems are still design problems. You still need a map, even when the answer is imperfect. Good teams make uncertainty visible early. Now watch the pressure points.

    ┌────────────┐   ┌────────────┐   ┌────────────┐
    │    Need    │──→│   Choice   │──→│   Vendor   │
    └────────────┘   └─────┬──────┘   └─────┬──────┘
                           │                │
                           ▼                ▼
                     ┌────────────┐   ┌────────────┐
                     │  Fallback  │   │   Review   │
                     └────────────┘   └────────────┘

A real need should drive whether you optimise for portability or depth. Vendor choice then shapes cost, hardware access, and managed features. Fallback planning matters because capacity and pricing move. Regular review is essential because the market changes quickly. Sustainability metrics are only useful when they change scheduling or design. Honesty beats fake certainty in architecture documents.

At every arrow, ask who retries. At every box, ask who pays. At every store, ask what expires. Now watch. One metric should sit beside each box. That is how operations stay sane.
4) Notice the common traps
The common traps:

- Saying multi-cloud is mandatory without naming the concrete risk.
- Believing portability comes free if you use containers.
- Ignoring GPU supply risk until the quarter of launch.
- Calling a design sustainable without measuring utilisation or energy context.
- Pretending vendor lock-in can be eliminated completely.
- Confusing strategic options with immediate product needs.

See. Most outages start as silent assumptions. Review these traps before launch:

- Abstraction layers can slow teams and hide useful cloud features.
- Capacity shortages can derail timelines even with budget ready.
- Migration rehearsals can be far costlier than slide decks suggest.
- Green goals can conflict with latency and geography constraints.
- Deep managed integrations can multiply exit cost.
- Unclear trade-offs can split teams into ideology camps.

Simple, no? Write failure drills for the top three risks. Decide what degrades first. Decide what must never degrade. Review quotas before launch day. Prefer explicit limits over wishful thinking. Now watch.
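Reviewing quotas before launch day can be turned into an explicit check rather than a hope. The sketch below walks a preference-ordered list of accelerator options and picks the first one whose granted quota covers the need. The family labels and numbers are hypothetical, and in practice the quota figures would come from whatever your provider reports rather than from hard-coded values.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class CapacityOption:
    accelerator: str  # hypothetical family label
    quota: int        # accelerators currently granted
    needed: int       # accelerators the launch needs on this family


def pick_capacity(preferences: List[CapacityOption]) -> Optional[CapacityOption]:
    """Return the first option, in preference order, whose quota covers the need."""
    for option in preferences:
        if option.quota >= option.needed:
            return option
    return None  # nothing fits: this is the explicit limit to escalate


# Illustrative plan: the fallback family needs more units for the same work.
plan = [
    CapacityOption("primary-gpu", quota=8, needed=16),
    CapacityOption("fallback-gpu", quota=32, needed=24),
]
choice = pick_capacity(plan)
print(choice.accelerator if choice else "no capacity: escalate before launch day")
```

Returning `None` instead of silently picking a partial fit keeps the shortage visible, which is the point of preferring explicit limits.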
5) Lock the operating routine
Write where portability truly matters and where it does not. Track hardware availability as a roadmap input. Measure utilisation before claiming efficiency. Record which vendor features create meaningful lock-in. Plan at least one fallback for scarce GPU families. Review the strategy quarterly because reality moves fast.

Lock the language across the team. Use the same terms in code, dashboards, and reviews. Review this quick operating list:

- Prefer open model formats when practical.
- Keep deployment packaging reproducible.
- Treat supply risk like a first-class dependency.
- Measure cost, latency, and carbon together when possible.
- Use managed features deliberately, not accidentally.
- Document acceptable lock-in openly.

Good platform design keeps the barn, feeding trough, fence, breeding ground, and ledger aligned.

So what to do? Create a one-page runbook. Create a one-page cost note. Create a one-page rollback note. Teach the team the same words. That alignment saves real money. See.

Consistency beats cleverness. Benchmark first; opinions come second. Name the owner of every limit. Prefer reversible choices whenever the future is foggy. Document what changes during incidents. Keep one small default path for newcomers. Automate the boring thing as soon as it stabilises. Vendor docs help, but workload data matters more. Good naming prevents bad tickets. Observe p95, not only averages. Small runbooks beat heroic memory. Teach cost with the same seriousness as latency. Now watch how much confusion disappears.
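Two of the habits above, measuring utilisation before claiming efficiency and observing p95 rather than only averages, come down to small pieces of arithmetic. The sketch below is one way to compute them; the hourly price, hours, and latency samples are invented for illustration, and the nearest-rank percentile is just one common definition among several.

```python
from typing import List


def utilisation(busy_hours: float, billed_hours: float) -> float:
    """Fraction of billed accelerator time that did useful work."""
    if billed_hours <= 0:
        raise ValueError("billed_hours must be positive")
    return busy_hours / billed_hours


def cost_per_busy_hour(hourly_price: float, util: float) -> float:
    """Effective price of one useful hour once idle time is counted."""
    if not 0 < util <= 1:
        raise ValueError("util must be in (0, 1]")
    return hourly_price / util


def p95(samples: List[float]) -> float:
    """Nearest-rank 95th percentile (integer arithmetic avoids float rounding)."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = (95 * len(ordered) + 99) // 100  # ceil(0.95 * n)
    return ordered[rank - 1]


# Made-up month: 720 billed hours, 300 of them busy, at a hypothetical 2.00 per hour.
util = utilisation(busy_hours=300, billed_hours=720)
effective = cost_per_busy_hour(hourly_price=2.0, util=util)

# Made-up latencies in ms: most requests are fast, a few are very slow.
latencies = [100.0] * 18 + [900.0] * 2
mean_latency = sum(latencies) / len(latencies)
tail_latency = p95(latencies)

print(f"utilisation={util:.2f} effective_cost={effective:.2f}")
print(f"mean={mean_latency:.0f} ms p95={tail_latency:.0f} ms")
```

With these invented numbers the list price understates the cost of a useful hour by more than half, and the mean latency hides a tail five times larger, which is why both figures belong on the same dashboard.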
Where this lives in the wild
- Kubernetes plus Terraform as a portability layer. Helpful, but not magical, when teams span vendors or regions.
- Ray, Spark, and open model formats across clouds. Good example of portable pieces sitting above different infrastructure.
- Specialised GPU clouds rising during accelerator shortages. Shows how economics and supply chain shape architecture choices.
- Carbon-aware scheduling and region selection experiments. A signal that sustainability is becoming an operational knob.
- Deep use of proprietary ML platforms for speed. Sometimes the right short-term choice, if the lock-in is consciously accepted.
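To make the carbon-aware scheduling item concrete, here is a minimal sketch that shifts a deferrable job to the start hour with the lowest mean forecast carbon intensity while still finishing before a deadline. The forecast values are invented, and the function name and parameters are assumptions for illustration; a real scheduler would take its forecast from a grid-data source and weigh it against cost and latency.

```python
from typing import List, Tuple


def greenest_start(intensity: List[float], duration: int, deadline: int) -> Tuple[int, float]:
    """Pick the start hour with the lowest mean carbon intensity.

    intensity[h] is the forecast (e.g. gCO2e/kWh) for hour h.
    The job runs `duration` consecutive hours and must end by hour `deadline`.
    Ties go to the earliest start.
    """
    if duration <= 0 or duration > deadline or deadline > len(intensity):
        raise ValueError("need 0 < duration <= deadline <= len(intensity)")
    best_start, best_mean = 0, float("inf")
    for start in range(deadline - duration + 1):
        mean = sum(intensity[start:start + duration]) / duration
        if mean < best_mean:
            best_start, best_mean = start, mean
    return best_start, best_mean


# Hypothetical 8-hour forecast with a cleaner window in the middle.
forecast = [450, 420, 300, 180, 160, 200, 380, 460]
start, mean = greenest_start(forecast, duration=3, deadline=8)
print(f"start at hour {start}, mean intensity {mean:.0f}")
```

Tightening `deadline` shrinks the set of allowed windows, which is exactly where green goals start to trade against latency.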
Pause and recall
- Why is multi-cloud harder than slide decks suggest? Say it without looking up vendor names.
- What makes GPU planning partly an economic problem? Give one concrete example.
- How can sustainability affect infrastructure design choices? State the trade-off in one line.
- What is the difference between deliberate and accidental lock-in? Mention one failure mode too.
Interview Q&A
Q. Should every AI platform be multi-cloud from day one?
A. No. Start where business risk justifies portability, and avoid fake complexity elsewhere.
Common wrong answer to avoid: Yes, otherwise you are irresponsible.
Better direction: Tie the answer to concrete business risk and team capacity.

Q. Why does vendor lock-in happen?
A. It happens when teams depend deeply on proprietary workflows, networking, data services, or hardware contracts.
Common wrong answer to avoid: It only happens if the model file is proprietary.
Better direction: Show that lock-in surrounds the model, not just the model.

Q. How do GPU shortages change design?
A. They push teams to plan quotas, reservations, fallback hardware, and provider alternatives earlier.
Common wrong answer to avoid: They only change procurement, not architecture.
Better direction: Explain the effect on timelines and topology choices.

Q. How should sustainability enter the discussion?
A. Use it as a measurable design input beside cost, latency, and reliability.
Common wrong answer to avoid: Mention it in the introduction slide and move on.
Better direction: Show where it can change scheduling, regions, or model size.
Apply now (5 min)
- Choose one vendor feature your plan depends on.
- Write the speed benefit it gives you.
- Write the exit cost it may create.
- Choose one scarce hardware dependency.
- Write one fallback if supply tightens.
- Choose one efficiency metric beyond total cost.
- Write where portability truly matters in your design.
- Write one trade-off you will state openly in interviews.
Bridge. Infrastructure built. Now let us orchestrate containers on it. → ../08_kubernetes_gpu_platforms/00-eli5.md