Memory Math¶

Formula¶

Raw weight memory starts with:

weight_memory = parameter_count × bytes_per_parameter

Useful byte assumptions:

fp16 / bf16 = 2 bytes
int8 = 1 byte
int4 = 0.5 byte

Worked examples¶

Representative 7B model¶

Precision	Approx weight memory
fp16 / bf16	13.04 GB
int8	6.52 GB
int4	3.26 GB

Computation:

fp16: 7e9 × 2 / 1024^3 ≈ 13.04 GB
int8: 7e9 × 1 / 1024^3 ≈ 6.52 GB
int4: 7e9 × 0.5 / 1024^3 ≈ 3.26 GB

KV-cache caveat¶

Even if the weights fit after quantization, serving memory still grows with:

sequence length
batch size / concurrency
layer count
number of KV heads

So a quantized model can still fail under long prompts or many simultaneous requests.

Hardware assumption for the real hands_on_lab¶

Assume one 24 GB GPU for a practical local decision:

fp16 7B can be tight once runtime buffers and KV cache are included
int8 is much safer for serving
int4 may be needed if concurrency or context is large
LoRA is feasible if the base model already fits for training
QLoRA becomes attractive if only the compressed base fits

Smoke-path note¶

The runnable scripts in this folder use a tiny local model to validate the workflow. The formulas above keep the decision memo grounded in a realistic 7B-style deployment question.