01. Opening failure — when the giant model cannot enter the room¶

~12 min read. The thing that kills a launch before the first token.

Built on the ELI5 in 00-eli5.md. The site constraint — the GPU memory limit that makes the blueprint too heavy to carry — decides whether the system can even stand on the production floor.

The problem is arithmetic, not optimism¶

Look. A 70B model at fp16 uses two bytes per parameter. So the quick math is brutal. 70B × 2 bytes = 140GB. Your GPU has 80GB. The model does not fit. No clever prompt changes this. No dashboard can hide it. This is not a toy classroom problem. This is the production gate. If the weights do not fit, serving does not start. Simple, no? The first lesson is mental. A model is not only an intelligence object. It is also a memory object. It is a latency object. It is a cost object. It is even a scheduling object. Once the site constraint bites, every downstream choice changes. Think like a builder carrying stone slabs. The blueprint may be accurate. The floor still has weight limits. That is why quantization matters later. But first, face the wall. An 80GB GPU cannot carry 140GB of weights. See.

One table changes executive conversations¶

Engineers who can do this math become trusted quickly. Because they turn vague ambition into deployable plans. Start with the raw weight table. | Model size | fp32 (4B) | fp16/bf16 (2B) | int8 (1B) | int4 (0.5B) | |---|---:|---:|---:|---:| | 7B | 28 GB | 14 GB | 7 GB | 3.5 GB | | 13B | 52 GB | 26 GB | 13 GB | 6.5 GB | | 70B | 280 GB | 140 GB | 70 GB | 35 GB | Now pause. A 7B model in fp16 fits on many single GPUs. A 13B model in fp16 may fit, but leaves less headroom. A 70B model in fp16 misses the box completely. Under the site constraint, size class matters more than hype. One table can save weeks of wrong planning. Here is the same picture as a mental ladder.

memory needed ──→
7B  fp16  ────────────── 14 GB
13B fp16  ────────────────────────── 26 GB
70B fp16  ───────────────────────────────────────────────────────────── 140 GB
A100/H100 box ─────────────────────── 80 GB

So what do we see? The 70B checkpoint is not a little too large. It is wildly too large. It overshoots by 60GB even before extra runtime memory. That gap is strategy, not tuning.

Raw weights are only the opening bill¶

Many people stop at weight memory. Production cannot stop there. Serving also needs room for activations. It needs room for CUDA kernels and buffers. It needs room for the KV cache. It needs room for fragmentation and safety margin. So the real memory line is higher than the checkpoint line. This matters especially for long context and bigger batches. More users mean more live cache. More tokens mean larger cache pages. More headroom means fewer nasty surprises. See the shape.

total serving memory
┌──────────────────────────────────────────────┐
│ weights │ kv cache │ runtime buffers │ slack │
└──────────────────────────────────────────────┘

So if raw weights already exceed the card, the game is over. And if raw weights barely fit, the game is still risky. That is why the site constraint is not an ops footnote. It is part of model selection itself.

One worked example makes the pain concrete¶

Assume one 80GB GPU. Assume fp16 weights. Assume a target of serving the full 70B model. Step one is raw weights. 70 × 2 = 140. So raw weights need 140GB. Step two is the gap. 140GB - 80GB = 60GB. You are already short by 60GB. Step three is honesty. Say the live batch and context need another 12GB of KV cache. Now the shortfall becomes roughly 72GB. If you want healthy margin, the gap grows again. So what to do? Do not say, "Maybe it squeezes." Say, "The current design is impossible on one card." That sentence is valuable. It protects time, budget, and credibility. It also tells you which options are still open. Yes?

Five broad options exist, and none are free¶

When the model does not fit, teams usually have five levers. Option one is more GPUs. You shard the model across cards. That preserves model size. But it adds networking and orchestration pain. Latency can rise. Cost definitely rises. Failure domains multiply. Option two is a smaller model. Memory drops. Latency drops. Cost drops. But quality may drop too. Option three is quantization. You keep roughly the same model structure. You store weights in fewer bits. Memory drops sharply. Sometimes latency improves. Accuracy can bend if you quantize badly. Option four is distillation. A smaller student learns from the bigger teacher. Serving becomes lighter. But distillation takes work, data, and evaluation. Option five is changing the requirement itself. Maybe 70B is unnecessary for this task. Maybe hard requests can route elsewhere. Maybe offline generation is enough. Look at the choice map.

Need 70B quality?
        │
        ├── yes ──→ more GPUs or quantize carefully
        │
        ├── maybe ─→ distill or route to smaller model
        │
        └── no ───→ shrink model or change product requirement

A senior engineer does not worship one lever. A senior engineer compares consequences.

The engineer who does this math is valuable¶

Many people speak about models as if they are only benchmarks. Production does not permit that luxury. A model is memory. A model is throughput. A model is p95 latency. A model is GPU reservation policy. A model is unit economics. If you miss the first arithmetic step, all later planning becomes theatre. So speak clearly. "The blueprint is strong, but the site constraint rejects it." That is a real engineering statement. Not a vibes statement. The moment you can turn parameter count into deployment consequences, people listen. Because now you can answer practical questions. How many cards? What datatype? What concurrency? What cost per request? What compromise is acceptable? That is why this opening failure matters. We are not compressing for fun. We are compressing because the room has limits. See.

Where this lives in the wild¶

Meta Llama 3 70B — self-hosting teams must choose tensor parallelism, quantization, or a smaller variant before launch.
Databricks DBRX Instruct — deployment planning starts with checkpoint memory, not just benchmark charts.
GitHub Copilot code completion — interactive latency budgets often push routing toward smaller or optimized models.
Perplexity answer engine — model routing balances quality against GPU memory and serving cost.
Mistral Large API — many buyers pick managed serving because owning the memory footprint is operationally heavy.

Pause and recall¶

Why is 70B × 2 bytes = 140GB the first production question?
Why is raw weight memory not the whole serving story?
What are the five broad levers when the model misses the GPU box?
Why does a model become a latency and cost object, not only a quality object?

Interview Q&A¶

Q1. Why quantize a 70B model instead of simply buying one more giant GPU? A1. Because more GPUs increase coordination cost, latency, and failure complexity, while quantization may solve the memory gap inside the existing box. Common wrong answer to avoid: "Extra GPUs are always cheaper than engineering work."

Q2. Why choose a smaller model instead of sharding the 70B model across many cards? A2. Because the product may not need 70B quality, and a smaller model can dramatically simplify latency, reliability, and cost. Common wrong answer to avoid: "Bigger models are always worth the serving pain."

Q3. Why do memory math before benchmarking throughput? A3. Because a model that does not fit cannot be benchmarked honestly for single-card serving in the first place. Common wrong answer to avoid: "We can optimize memory later if quality looks good."

Q4. Why change the requirement instead of forcing the original model choice? A4. Because product constraints, latency targets, and unit economics may make the original requirement irrational. Common wrong answer to avoid: "Requirements are fixed, so infrastructure must absorb everything."

Apply now (5 min)¶

Quick exercise. Compute raw weight memory for 8B, 34B, and 72B models in fp16 and int4. Then mark which ones fit on one 24GB, one 48GB, and one 80GB GPU. Sketch from memory the simple ladder: model size → bytes per parameter → total memory → does it fit? Under it, write one sentence on how the site constraint changes architecture choices.

Bridge. Good. We now know the room is too small. Before compressing anything, we must understand what the numbers themselves look like inside. → 02-number-formats.md