Memory Math¶
Formula¶
Raw weight memory starts with:
weight_memory = parameter_count × bytes_per_parameter
Useful byte assumptions:
- fp16 / bf16 = 2 bytes
- int8 = 1 byte
- int4 = 0.5 byte
Worked examples¶
Representative 7B model¶
| Precision | Approx weight memory |
|---|---|
| fp16 / bf16 | 13.04 GB |
| int8 | 6.52 GB |
| int4 | 3.26 GB |
Computation:
- fp16:
7e9 × 2 / 1024^3 ≈ 13.04 GB - int8:
7e9 × 1 / 1024^3 ≈ 6.52 GB - int4:
7e9 × 0.5 / 1024^3 ≈ 3.26 GB
KV-cache caveat¶
Even if the weights fit after quantization, serving memory still grows with:
- sequence length
- batch size / concurrency
- layer count
- number of KV heads
So a quantized model can still fail under long prompts or many simultaneous requests.
Hardware assumption for the real hands_on_lab¶
Assume one 24 GB GPU for a practical local decision:
- fp16 7B can be tight once runtime buffers and KV cache are included
- int8 is much safer for serving
- int4 may be needed if concurrency or context is large
- LoRA is feasible if the base model already fits for training
- QLoRA becomes attractive if only the compressed base fits
Smoke-path note¶
The runnable scripts in this folder use a tiny local model to validate the workflow. The formulas above keep the decision memo grounded in a realistic 7B-style deployment question.