Skip to content

09. Boundaries and tradeoffs — what's physics, what's a vendor convention, and what it cost to forget the difference

~21 min read. The stack fit together too cleanly. Roofline → fusion → NCCL → TensorRT-LLM → Triton → NIM → NeMo → MIG, each file relieving the wall the last one exposed, all of it one vendor's answer. A thoughtful engineer should be uneasy about that tidiness. This file pulls the stack apart along the seam that matters: which mechanisms are physics that hold on any accelerator, which are NVIDIA conventions you could have done differently, and what the convenience of not asking quietly cost.

The capstone for the feed the beast invariant. Every prior file relieved one wall; this one asks which walls are real (memory bandwidth, interconnect, idle-card economics) and which are contingent (CUDA, the specific tools, the OpenAI-compatible API). It revisits buy-vs-build, vendor lock-in, and the idle-GPU cost pressure as open questions, not solved ones, and is more contemplative than the rest — because the honest answers are "it depends" and "we don't fully know yet."


What we know so far and what we deliberately didn't ask

The module built a working stack and a working number. The 70B endpoint went from 280 tokens/sec at 38% utilization to its target, and we can trace every gain to a wall we moved: the roofline named the memory wall, fusion deleted HBM round-trips, NCCL kept the all-reduce on NVLink, TensorRT-LLM compiled and batched, Triton served and packed, NIM packaged, NeMo trained, MIG and the scheduler stopped whole cards from idling. Each file solved a real pressure and exposed the next.

What we didn't ask, on purpose, is whether the answers were forced or chosen. We took TensorRT-LLM as the compiler, Triton as the server, NIM as the package, CUDA as the substrate — and the stack fit so well that the seams disappeared. That seamlessness is itself a thing to interrogate. Some of what we learned is physics: a memory-bound decode is memory-bound on any accelerator, an idle card bills the same whether it's NVIDIA or AMD, a model too big for one chip must be split across chips no matter whose chips. Some of it is convention: that the compiler is TensorRT-LLM, that the server speaks Triton's config, that the easy path is NVIDIA's — those were choices that felt like defaults.

This file teaches you to hold the two apart, because the difference decides what you can change, what you can leave, and what the convenience cost. It is the contested-and-unsettled file: open problems, things that work without theory, debates with two real sides, and an honest lock-in accounting.


What this file solves

A team built the whole stack, hit their number, and now faces a board question: "how locked in are we, and what would leaving cost?" The naive read is "we picked the best tools, lock-in is the price of performance." The real issue is that they never separated the physics they were obeying from the conventions they were adopting, so they can't tell which parts of the stack are irreplaceable and which are merely convenient. This file teaches you to draw that line — to name what's a hardware law versus a vendor choice, what's settled versus contested, and what the lock-in actually costs — so the next architecture decision is made with open eyes instead of momentum.


Why interrogate a stack that works

The instinct after a successful build is to stop — it works, ship it. But a stack that works is exactly when the seams are invisible and the lock-in is cheapest to ignore and most expensive to discover later (file 06's lesson, now fleet-wide). The pressure isn't a failure; it's a decision debt. Every "we'll use the NVIDIA tool" was a small bet, and the bets compounded into a dependency nobody priced.

The way to interrogate it is to sort every mechanism into two buckets. Physics: would this be true on a TPU, an AMD MI300, an Inferentia chip, or a hypothetical future accelerator? Memory bandwidth limits, interconnect topology, the fit-or-split decision, the idle-card economics — yes, these recur on any hardware. Convention: is this specific because NVIDIA built it this way? CUDA as the kernel language, TensorRT-LLM as the compiler, the OpenAI-compatible API surface, the NeMo checkpoint format — these could have been otherwise, and on other hardware they are. The physics you must obey; the conventions you chose, and could re-choose at a cost.

Teacher voice. The test for "physics or convention" is simple: imagine the hardware vendor changed tomorrow. What survives the change is physics — your roofline reasoning, your batching instinct, your interconnect-topology awareness, your utilization economics. What you'd have to relearn or rebuild is convention — the CUDA kernels, the engine format, the container catalog. Skills transfer; tooling locks in. The module taught both; this file separates them so you know which is which.


The number that came with an asterisk

A team reports their win: the 70B endpoint hits its throughput target, matching NVIDIA's published benchmark. The board is happy. Then a competitor on different hardware reports a higher number on a similar model, and a careful engineer asks what's actually being compared.

The asterisk unfolds. The published throughput assumed FP8 — which the team validated, but FP8 shifts outputs, and "tokens/sec at FP8" and "tokens/sec at BF16" aren't the same product. It assumed a specific batch size and sequence-length mix that matched the benchmark, not necessarily their traffic. It assumed the optimized engine engaged (file 06's profile-fallback trap). And the competitor's higher number assumed its favorable conditions. Two impressive numbers, neither directly comparable, both true under their own asterisks.

The visible break: the headline metric the whole module climbed toward is not a single comparable quantity — it's a number conditioned on dtype, batch, sequence mix, and hardware, and comparing across vendors without normalizing those is comparing nothing.

So the real problem is not which vendor is faster; it is that throughput is not a scalar — it's a surface over dtype, batch, sequence length, and quality tolerance, and any single number hides the conditions. The fix isn't a better benchmark. It's the discipline to ask "under what conditions, at what quality, on what traffic?" before believing any tokens/sec figure, including your own.

So how do we reason about a stack honestly when its central metric comes with hidden conditions and its tools come with hidden lock-in?


When the same model gives two true, incomparable numbers

Take the smallest case. The same 70B serves at "3000 tokens/sec" on FP8 with a benchmark-shaped batch, and "1400 tokens/sec" on BF16 with your actual bursty traffic. Both are measured, both are true, neither is wrong — and they describe the same model on the same GPU. The 2x gap is entirely conditions: quantization (which trades a little quality for speed) and traffic shape (which sets the achievable batch). A vendor quotes the first; your production sees the second; a competitor quotes their own first. Three numbers, no contradiction, no comparison possible without naming the conditions.

That's the whole epistemics of this layer in miniature: every performance claim is conditioned, and the condition is usually omitted.


Rule: separate the physics you obey from the conventions you chose

The boundary rule. Sort every mechanism into physics (true on any accelerator — memory-bandwidth limits, interconnect topology, the fit-or-split decision, idle-card economics) or convention (NVIDIA's specific answer — CUDA, TensorRT-LLM, Triton's config, the NeMo format, the NIM catalog). Obey the physics; price the convention. Treat every throughput number as a surface over dtype, batch, sequence mix, and quality tolerance, never a scalar. And treat lock-in as a real, quantifiable cost — the engineering to leave plus the foregone portability — not a free side effect of picking good tools.

Why the rule exists. The primitive is that hardware laws and vendor choices feel identical inside a stack that works — both present as "how it's done." The constraint is that they have opposite properties under change: physics survives a hardware switch and transfers as skill; convention must be rebuilt and is the lock-in. Conflating them breaks two ways — you either over-credit the vendor (thinking CUDA is why decode is fast, when batching is) or under-price the lock-in (thinking the tools were free, when leaving costs a rebuild). Separating them relieves both errors; the cost it imposes is intellectual honesty about a stack you're invested in, which is uncomfortable precisely when it matters most.


1) The physics-vs-convention split, drawn

The core mental model of this file is the stack sorted along the seam that survives a hardware change:

        PHYSICS (true on any accelerator)        CONVENTION (NVIDIA's specific answer)
        ─────────────────────────────────        ────────────────────────────────────
roofline  memory-bound vs compute-bound      │   CUDA as the kernel language
          arithmetic intensity decides        │   cuDNN / cuBLAS as the kernel libs
fusion    round-trips to memory cost          │   TensorRT-LLM as THE compiler
          amortize fixed costs                 │   the trtllm-build engine format
nccl      interconnect topology decides        │   NCCL as the collective lib
          collective bandwidth                 │   NVLink/NVSwitch as the fast wire
batching  share weight-read across tokens      │   in-flight batching's TRT-LLM impl
          decode throughput ∝ avg batch        │   paged KV's specific API
serving   don't idle GPUs between requests     │   Triton config.pbtxt, ensembles
          pack the hardware                     │   the OpenAI-compatible surface
package   buy-vs-build is a real decision      │   NIM catalog, AI Enterprise license
train     model too big → split across chips   │   NeMo / Megatron-Core conventions
fleet     idle card bills 24/7; raise util     │   MIG profiles, GPU operator, DRA
        ───────────────────────────────────        ───────────────────────────────────
        survives a hardware switch              must be rebuilt → this is the lock-in
        transfers as portable skill             priced as engineering + foregone options

The left column is what you take to any accelerator — the reasoning the module actually taught. The right column is what you'd rebuild if you left NVIDIA — the lock-in the module quietly accumulated. A 70B is memory-bound on a TPU too (physics); but you'd rewrite the kernels, re-compile in a different toolchain, and re-learn a different collective library (convention). The line between the columns is the answer to "how locked in are we."


2) The picture — the lock-in ladder, rung by rung

Lock-in isn't one decision; it's a ladder, each rung harder to climb back down:

LOCK-IN LADDER (cheapest to leave at the bottom, hardest at the top)
─────────────────────────────────────────────────────────────────
  NIM container        ████████████  hardest: catalog + API + license + atrophied build skill
  NeMo training        ██████████    checkpoint format, parallelism conventions
  TensorRT-LLM engine  ████████      engine format, FP8 path, build pipeline
  Triton serving       ██████        config, ensembles, backends (some portable)
  CUDA kernels         ████████████  deepest: everything sits on it; rewriting is enormous
  NVLink/hardware      ██████        physical; switching means new silicon
─────────────────────────────────────────────────────────────────
 each rung you adopt for convenience is a rung you must climb to leave
 vLLM / OpenAI-compatible APIs LOWER some rungs (portable serving surface)

Each layer is a convenience you took and a dependency you owe. CUDA is the deepest rung — the entire stack sits on it, and AMD's ROCm or a TPU's XLA is a different world you'd port to. NIM is the highest-friction rung at the top because it bundles all the others plus a license and erodes your build skill (file 06). The mitigations are real and worth naming: an OpenAI-compatible API surface makes the serving interface portable (swap NIM for vLLM behind the same API), vLLM as an engine lowers the compiler rung, and keeping in-house build skill (files 04-05, 07) keeps the upper rungs climbable. Lock-in is not binary; it's a ladder you can choose how far to climb.


3) The 70B endpoint, re-examined honestly — the running example one last time

Our endpoint hit its target. Now sort its stack along the seam. The physics it obeys: decode is memory-bound (so we batched), the all-reduce wants NVLink (so we placed it), the 70B doesn't fit one card (so we split it), the small models waste whole cards (so we MIG-packed them), idle cards bill 24/7 (so we drove utilization). Every one of those holds on any accelerator — that's the transferable lesson.

The conventions it chose: TensorRT-LLM compiled it, Triton served it, FP8 hit the headline number, NIM could have packaged it, NeMo could have trained the custom version, MIG profiles partitioned the small models. Swap to AMD or a TPU and the physics reasoning is identical — you'd still batch, still place collectives on the fast wire, still split the 70B, still pack the fleet — but you'd rebuild every convention: different kernels (ROCm/XLA), a different compiler, a different collective library, a different partitioning scheme.

So the honest answer to the board's "how locked in are we": deeply on the tooling, not at all on the reasoning. Leaving NVIDIA means rebuilding the right column of section 1 — months of engineering — while the left column (the skill that got us here) transfers for free. The throughput number we're proud of comes with the FP8-and-traffic-shape asterisk from the opening. Both of those — the lock-in cost and the conditioned number — are things the clean build hid, and naming them is the point of this file.

Mini-FAQ. "If the reasoning transfers, is the lock-in even a problem?" It depends on what you optimize. If raw performance-per-dollar on today's hardware is everything and NVIDIA leads, the lock-in is a price worth paying. If multi-vendor leverage, supply resilience, or future portability matter, the lock-in is a strategic cost you under-priced. There's no universal answer — which is exactly why this is a boundary file and not a solved one.


4) Why interrogate lock-in now and not just optimize harder?

The plausible alternative is to keep optimizing — there's always more throughput to chase, another FP8 trick, another batching gain. Why stop and audit the dependency instead?

Because under our workload — a six-figure GPU bill, a board asking strategic questions, and a stack whose seams are invisible — the marginal throughput is no longer the binding constraint; the unexamined dependency is. The module's whole climb relieved performance pressure; what remains is the decision debt that performance work accumulated. Auditing it now, while the build is fresh and the alternatives (vLLM, ROCm, TPUs, OpenAI-compatible portability) are real, costs a few days and buys strategic optionality. Auditing it after the footprint triples and the build skill has atrophied (file 06's failure, fleet-wide) costs a painful migration or a permanent dependency. The honest counter-case: if you're certain NVIDIA's performance lead and ecosystem will persist for your horizon, the optimization is the better use of time and the lock-in is a deliberate, defensible bet. The point isn't that lock-in is bad — it's that it should be a chosen bet, not an accumulated accident.

Why this instead of more optimization, under our workload? We've relieved the performance walls; the remaining risk is the dependency we never priced. A few days of honest accounting now — while alternatives are live and skills are fresh — buys optionality that a later migration would cost dearly. Optimization has diminishing returns here; the audit has a one-time strategic payoff.


5) The property that decides how much lock-in to accept: how fixed your hardware future is

The one dimension that decides how much convention-lock-in is acceptable is how committed you are to NVIDIA hardware for your planning horizon. The more committed (and the more NVIDIA leads for your workload), the more the deep tooling integration pays; the more uncertain, the more portability is worth its cost.

Your situation How fixed is the hardware future? Lock-in stance
All-in on NVIDIA, performance-per-dollar is everything committed embrace deep integration (TensorRT-LLM, NIM); price lock-in as accepted
Want multi-vendor leverage or supply resilience uncertain favor portable surfaces (vLLM, OpenAI-compatible API); keep build skill
Regulated / sovereignty / must run on non-NVIDIA somewhere partly forced off build on portable stacks; treat CUDA-deep tools as optional
Small team, ship-now, NVIDIA cloud anyway committed by default take NIM; revisit when footprint grows
Research / frontier, hardware churns uncertain keep portable, expect to re-platform

The asymmetry to remember: the physics reasoning is portable at zero cost regardless of your stance — so the only thing your hardware-future commitment changes is how much convention lock-in to accept. A committed shop rationally goes deep; an uncertain one rationally pays for portability. Both are right under their own conditions, which is why the throughput-is-a-surface lesson and the lock-in-is-a-ladder lesson both resolve to "it depends on what you're optimizing for" — the contemplative honesty this file exists to deliver.


6) The failure walked through: the unexamined dependency that became a migration

A team goes all-in on the NVIDIA stack — TensorRT-LLM, NIM for the stock models, NeMo for the custom one, MIG-packed fleet. It works, the number is good, nobody questions it. Two years later, a strategic shift demands running a portion of the workload on a different accelerator (a cloud deal, a supply constraint, a sovereignty requirement). Now the bill comes due: the kernels are CUDA, the engines are trtllm-format, the training checkpoints are NeMo-format, the serving is Triton-config, and the team that could have kept a portable path alive instead specialized entirely into the NVIDIA conventions. The "migration" is a near-rebuild.

Trace it. Every convenience was locally correct — each NVIDIA tool was the best choice for that decision in isolation. But nobody priced the accumulation: each adopted convention was a rung climbed, and the rungs compounded into a ladder too tall to descend cheaply, while the physics skill that would have transferred sat unused because the team never built on anything portable. The fix that should have happened years earlier: keep the serving interface OpenAI-compatible (portable surface), keep a vLLM path exercised for at least some models, keep enough non-CUDA-deep build skill alive, and write down the lock-in cost at each adoption so the ladder's height was always visible. The lesson: lock-in isn't created by any one decision; it accumulates across many locally-correct ones, and the only defense is pricing it continuously while the alternatives are still live.


7) Cost movement: what the honest audit buys and what it costs

  • What it fixes: turns invisible decision debt into a priced, visible dependency — you now know which parts of the stack are physics (free to keep, transferable) and which are convention (the lock-in, with a quantified exit cost), so the next architecture bet is deliberate.
  • What it costs: intellectual discomfort (admitting the clean number has an asterisk and the clean stack has a bill), and real engineering if you choose to keep portable paths alive (exercising vLLM, keeping non-CUDA build skill, maintaining an OpenAI-compatible boundary) — portability isn't free; it's a hedge you pay premiums on.
  • Which subsystem pays: the architecture/leadership layer absorbs the strategic decision and the portability premium; the engineering team pays the cost of keeping alternatives exercised. The reward lands as optionality (you can leave, or credibly threaten to, which itself improves vendor terms) and as honest metrics (numbers reported with conditions). The pressure this creates is ongoing: portability hedges decay if not exercised, so the audit is a recurring discipline, not a one-time report.

For the running example: auditing the 70B stack costs a few days and the ongoing premium of keeping a vLLM/OpenAI-compatible path warm, and buys the ability to answer the board honestly and re-platform if strategy demands — a trade that's clearly worth it when the future is uncertain and clearly optional when you're committed to NVIDIA anyway.


8) Signals: healthy, first to degrade, and the liar

  • Healthy: every reported throughput number carries its conditions (dtype, batch, sequence mix, quality tolerance); the team can name which stack layers are physics vs convention; a portable path (vLLM / OpenAI-compatible) is exercised, not just theoretical; lock-in cost is written down and revisited.
  • First metric to degrade: the portability hedge goes stale — the vLLM path stops being tested, the OpenAI-compatible boundary leaks NVIDIA-specifics, build skill atrophies — so the option to leave quietly expires before anyone notices. The decay is silent because nothing breaks; the optionality just evaporates.
  • The misleading metric: the headline tokens/sec reported as a scalar — impressive, comparable-looking, and stripped of the dtype/batch/traffic conditions that make it meaningful. A bare number is the liar that lets you believe you're winning (or losing) a comparison that was never valid.
  • The graph an expert opens first: the throughput surface — tokens/sec across dtype (FP8 vs BF16) and across your real traffic's batch/sequence distribution, not the benchmark's — alongside a written lock-in ladder with an exit cost per rung. The surface kills false comparisons; the ladder keeps the dependency honest.

9) Boundary: where this honesty helps and where it's paralysis

The physics-vs-convention discipline helps most when you face a real strategic choice — a hardware decision, a vendor negotiation, a board question, a re-platform — where knowing what's portable and what's locked-in changes the answer. There it converts momentum into a deliberate bet.

It becomes pathology as analysis-paralysis: refusing to commit to any tool because everything is "lock-in," re-deriving portability premiums for decisions that don't matter, or treating a committed-NVIDIA shop's deep integration as a mistake when it's a rational bet. Not all lock-in is bad — depth of integration is often exactly how you get NVIDIA's performance lead, and paying for portability you'll never use is its own waste. The scale limit that invalidates naive intuition: "always stay portable" is as wrong as "never question the stack" — the right amount of lock-in is a function of how fixed your hardware future is (section 5), and a contemplative file's job is to give you the axis, not a fixed answer on it.


10) Wrong model: "we picked the best tools, so lock-in is just the price of performance"

The seductive wrong idea is that lock-in is an unavoidable, unmanageable byproduct of choosing good tools — you wanted performance, NVIDIA has it, the dependency is the bill, end of story. That conflates physics with convention and treats an accumulated, manageable ladder as a single forced cost.

Replace it with: most of what got you the performance is physics (batching, interconnect placement, the fit-or-split decision, utilization economics) and transfers for free; the lock-in is the convention layer (CUDA, the specific tools, the format and catalog), it's a ladder you climbed rung by rung, and it's priceable and partly mitigable (portable surfaces, exercised alternatives, retained build skill). Lock-in isn't the price of performance — it's a separate, chosen bet that you can size deliberately. Believing it's forced is how you end up two years deep with no path down (section 6).


11) Other failure shapes to recognize

  • Scalar-number trap. Reporting or comparing tokens/sec without dtype, batch, and traffic conditions. Fix: report throughput as a surface; normalize conditions before any comparison.
  • Accumulated lock-in. Many locally-correct NVIDIA choices compounding into an undescendable ladder (section 6). Fix: price lock-in at each adoption; keep a portable path exercised.
  • Stale hedge. A vLLM/portable path that exists on paper but is never run, so the option silently expires. Fix: exercise the alternative on real traffic periodically.
  • Physics-convention conflation. Crediting CUDA for speed that's really batching, or treating transferable reasoning as locked-in. Fix: run the "if hardware changed tomorrow" test.
  • FP8 quality blind spot. Quoting FP8 throughput without validating the quality it traded away. Fix: report throughput at a stated quality bar, not in isolation.
  • Analysis paralysis. Refusing to commit to any tool because everything is "lock-in." Fix: size lock-in to your hardware-future commitment (section 5); deep integration is a valid bet.

12) Pattern transfer

  • Same buy-vs-build seam as files 06 and 07, generalized. NIM and NeMo each posed "take the vendor's packaged answer or keep control"; this file lifts that to the whole stack and the whole company's hardware future. The "convenience now, dependency later" shape recurs at every scale from a kernel library to a vendor relationship.
  • Same physics-vs-implementation split as any systems abstraction. "Memory-bandwidth limits are real; CUDA is one way to obey them" is the same shape as "the CAP theorem is real; this database is one way to live with it." Separating the law from the tool is a recurring move across all of systems.
  • Same conditioned-metric honesty as every benchmark debate. "Tokens/sec is a surface, not a scalar" is the GPU version of "QPS depends on the query," "latency depends on the percentile," "throughput depends on the batch." Refusing the bare number recurs wherever performance is sold.

13) Design test — five questions before you trust the stack or the number

  1. For each layer of your stack, can you say whether it's physics (survives a hardware switch) or convention (you'd rebuild it)?
  2. Does every throughput number you report or believe come with its dtype, batch, sequence mix, and quality bar?
  3. Can you state your lock-in ladder and an exit cost per rung — and is any portable path actually exercised, not just theoretical?
  4. Is your acceptance of lock-in matched to how fixed your hardware future genuinely is, or is it accumulated momentum?
  5. If your hardware vendor changed tomorrow, what transfers for free (the reasoning) and what would you rebuild (the tooling) — and have you written that down?

Where this appears in production

Physics that holds on any accelerator

  • Memory-bandwidth limits (the roofline) — decode is memory-bound on NVIDIA, AMD MI300, TPU, and Inferentia alike; the batching instinct transfers everywhere.
  • Interconnect topology decides collective cost — NVLink, AMD's Infinity Fabric, TPU's ICI; the "keep the collective on the fast wire" reasoning is universal.
  • Fit-or-split for large models — a model too big for one chip must be model-parallel on any hardware; tensor/pipeline parallelism is a hardware-independent idea.
  • Idle-card economics — a reserved accelerator bills 24/7 on every cloud and vendor; utilization-over-allocation is universal FinOps.
  • Throughput as a surface — every accelerator's tokens/sec depends on dtype, batch, and sequence mix; the conditioned-metric discipline is vendor-agnostic.

Conventions that are NVIDIA's specific answer (the lock-in)

  • CUDA / cuDNN / cuBLAS — the deepest rung; AMD ROCm and TPU XLA are the alternative worlds you'd port to.
  • TensorRT-LLM engine format — the compiler convention; vLLM and SGLang are more portable engines that lower this rung.
  • Triton config / NIM catalog / NVIDIA AI Enterprise license — serving and packaging conventions; the OpenAI-compatible API surface is the portability mitigation.
  • NeMo / Megatron-Core checkpoint and parallelism conventions — training-side lock-in; HuggingFace + DeepSpeed/FSDP is the portable counterpart.
  • MIG profiles and the GPU operator — partitioning convention; Kubernetes DRA is evolving a more vendor-neutral model.

Where the honesty matters

  • Cloud GPU contract negotiations — a credible portable path improves vendor terms; pure lock-in weakens your position.
  • AMD MI300 / Google TPU / AWS Inferentia adoption — teams that kept physics reasoning and portable surfaces re-platform far cheaper than those who specialized into CUDA.
  • vLLM as a portability hedge — widely run alongside or instead of TensorRT-LLM precisely to keep the compiler rung low.
  • MLPerf Inference benchmarks — the public attempt to normalize the throughput surface across vendors so numbers are comparable at a stated accuracy.
  • Sovereignty / regulated deployments — force at least partial non-NVIDIA hardware, making the lock-in cost concrete and unavoidable.

Pause and recall

  1. Give two examples of stack "physics" and two of stack "convention."
  2. Why do physics and convention feel identical inside a working stack, and what test separates them?
  3. Why is tokens/sec a surface, not a scalar — and what conditions does it depend on?
  4. What is the "lock-in ladder," and which rung is deepest? Which is highest-friction?
  5. How does an OpenAI-compatible API surface lower a lock-in rung?
  6. Why does accumulated lock-in become a near-rebuild even when every choice was locally correct?
  7. What decides how much convention lock-in you should accept?
  8. Why is "always stay portable" as wrong as "never question the stack"?

Interview Q&A

Q1. Your board asks "how locked into NVIDIA are we, and what would leaving cost?" How do you answer? A. Separate physics from convention. The reasoning that got us our performance — batching for memory-bound decode, placing collectives on the fast interconnect, splitting the too-big model, packing the fleet — is physics that transfers to any accelerator for free. The lock-in is the convention layer: CUDA kernels, the TensorRT-LLM engine format, Triton/NIM serving, NeMo checkpoints. Leaving means rebuilding that layer — months of engineering — while the skill transfers. So: deeply locked on tooling, not at all on reasoning, with a quantifiable exit cost per stack layer. Common wrong answer to avoid: "We picked the best tools, lock-in is just the price of performance." That conflates the physics that transfers with the convention that locks in, and treats a manageable ladder as a forced cost.

Q2. A vendor quotes 3000 tokens/sec; your production sees 1400 on the same model and GPU. Is someone lying? A. No — both can be true under different conditions. The 3000 likely assumes FP8 and a benchmark-shaped batch/sequence mix; your 1400 is BF16 (or your real quality bar) under bursty production traffic with a smaller achievable batch. Throughput is a surface over dtype, batch, sequence mix, and quality, not a scalar, so the numbers describe different points on the surface. Normalize the conditions before comparing. Common wrong answer to avoid: "The vendor's benchmark is dishonest." It's conditioned, not dishonest; the error is comparing two points on a surface as if they were one scalar.

Q3. Which parts of everything this module taught would survive switching to AMD or a TPU, and which wouldn't? A. Survives (physics): the roofline reasoning, the batching instinct, interconnect-topology awareness, the fit-or-split decision for large models, and idle-card utilization economics — these hold on any accelerator. Doesn't survive (convention): CUDA/cuDNN kernels, the TensorRT-LLM engine format, NCCL specifics, Triton config, NIM, and NeMo formats — you'd rebuild these on ROCm/XLA with different tools. The reasoning is portable; the tooling is the lock-in. Common wrong answer to avoid: "Everything's NVIDIA-specific, you'd start over." The reasoning — most of the actual value — transfers; only the tooling layer is rebuilt.

Q4. Every individual decision to use an NVIDIA tool was correct, yet the team ended up unable to leave. How? A. Lock-in accumulates across locally-correct choices. Each adopted convention — CUDA kernels, trtllm engines, NeMo checkpoints, Triton config — was the best choice in isolation, but each was a rung climbed, and the rungs compounded into a ladder too tall to descend, especially as portable build skill atrophied from disuse. The defense is to price lock-in at each adoption and keep a portable path (vLLM, OpenAI-compatible boundary) exercised, so the ladder's height stays visible and climbable. Common wrong answer to avoid: "Then the individual decisions were wrong." They weren't — the failure is not pricing the accumulation, not any single choice.

Q5. When is going deep on NVIDIA-specific tooling the right call rather than a lock-in mistake? A. When your hardware future is genuinely committed to NVIDIA and NVIDIA leads on performance-per-dollar for your workload — then deep integration (TensorRT-LLM, NIM, NeMo) is exactly how you capture the lead, and paying for portability you'll never use is its own waste. The amount of lock-in to accept is a function of how fixed your hardware future is; for a committed shop, depth is a rational bet, not a mistake. Common wrong answer to avoid: "Lock-in is always bad, stay portable." Portability has a premium; paying it when you'll never switch is wasted, and depth is often how you get the performance.

Q6. (Cumulative.) Pick any one mechanism from the module and show it as both physics and convention. A. In-flight batching: the physics is that memory-bound decode throughput scales with the average number of tokens sharing each weight-read, so you keep the batch full — true on any accelerator. The convention is TensorRT-LLM's specific in-flight-batching implementation and paged-KV API, which you'd rebuild on another stack (vLLM has its own, a TPU serving stack another). Same idea, portable; same code, locked-in. That split is the whole file: obey the batching physics anywhere, and know the specific implementation is the rung you'd re-climb. Common wrong answer to avoid: "Batching is just a TensorRT-LLM feature." The principle is hardware-independent physics; only the implementation is the convention.


Design/debug exercise (10 min)

Step 1 — Model it. Take the module's stack and sort five mechanisms into physics vs convention: (1) decode is memory-bound, (2) TensorRT-LLM compiles the engine, (3) the all-reduce wants the fast interconnect, (4) NIM's OpenAI-compatible container, (5) an idle card bills 24/7. Mark each P or C, and for each C, name what you'd rebuild on a different accelerator. This is the section-1 split applied by hand.

Step 2 — Your turn. For our 70B endpoint, write the lock-in ladder rung by rung (CUDA → TensorRT-LLM → Triton → NIM/NeMo → MIG) with a rough exit cost per rung, then mark which rungs an OpenAI-compatible API surface and a kept-warm vLLM path would lower. Decide, given a stated hardware-future commitment (pick committed or uncertain), how much of that ladder you'd accept and what portability premium you'd pay. Tie it to section 5's axis.

Step 3 — Reproduce from memory. Redraw the physics-vs-convention split from section 1 and the lock-in ladder from section 2. Then state in one sentence how this file connects to file 06 (the buy-vs-build dependency, now generalized to the whole stack and the hardware future) and to the module opening (the same "38% utilization / feed the beast" reasoning that opened the module is the physics that transfers — the tooling that achieved it is the convention that locks in).


Operational memory

This chapter explained why a stack that works is exactly when its seams are invisible and its lock-in is cheapest to ignore: hardware physics and vendor conventions feel identical inside a working build, both presenting as "how it's done." The important idea is that they have opposite properties under change — physics (memory-bandwidth limits, interconnect topology, fit-or-split, idle-card economics) survives a hardware switch and transfers as free skill, while convention (CUDA, TensorRT-LLM, Triton, NIM, NeMo, MIG profiles) must be rebuilt and is the lock-in.

You learned to sort the stack along that seam, to treat every throughput number as a surface over dtype, batch, sequence mix, and quality rather than a comparable scalar, and to read lock-in as a priceable ladder climbed rung by rung — mitigable with portable surfaces (OpenAI-compatible APIs, exercised vLLM paths, retained build skill) and sized to how fixed your hardware future is. That solves the opening board question because you can now answer it honestly: deeply locked on tooling, not on reasoning, with a quantified exit cost.

Carry this diagnostic forward: before trusting a number, ask "under what dtype, batch, traffic, and quality?"; before adopting a tool, ask "is this physics I obey or convention I'm choosing, and what's the exit cost?"; and revisit the portability hedge periodically, because options decay silently when not exercised. The amount of lock-in to accept is never universal — it's a deliberate bet matched to your hardware-future commitment, not an accident of momentum.

Remember:

  • Physics (bandwidth, interconnect, fit-or-split, idle-card cost) transfers to any accelerator for free; convention (CUDA, the specific tools) is the lock-in you'd rebuild.
  • Throughput is a surface over dtype, batch, sequence mix, and quality — never a comparable scalar; report the conditions.
  • Lock-in is a ladder climbed rung by rung, priceable per rung and partly mitigable by portable surfaces and exercised alternatives.
  • Lock-in accumulates across locally-correct choices; price it continuously while alternatives are live, or the ladder becomes undescendable.
  • The right amount of lock-in depends on how fixed your hardware future is — deep integration is a rational bet for a committed shop, a risk for an uncertain one.
  • Next pressure: the model is now fast, served, and cost-efficient — but to run it as a product you must track which version is live, reproduce any result, and roll back safely, which is a platform-operations problem, not a GPU one.

Bridge. We've closed every gap between the model's math and the hardware's peak, and then stepped back to price the dependency that closing them created. The model is fast, served, packed, and honestly accounted for. But "fast" is only one property a production model needs. Which exact model version is serving right now? Can you reproduce the result a customer complained about last Tuesday? Can you roll back a bad deploy in seconds, run A/B experiments, track lineage from training data to served checkpoint, and prove what was live when? None of that is a GPU problem — it's a platform-operations problem, and the speed this module bought is worthless if you can't track, version, and reproduce the model that's running. The next module builds that operational layer. → ../04_ml_platform_operations/00-eli5.md