Skip to content

Memory Math

Formula

Raw weight memory starts with:

weight_memory = parameter_count × bytes_per_parameter

Useful byte assumptions:

  • fp16 / bf16 = 2 bytes
  • int8 = 1 byte
  • int4 = 0.5 byte

Worked examples

Representative 7B model

Precision Approx weight memory
fp16 / bf16 13.04 GB
int8 6.52 GB
int4 3.26 GB

Computation:

  • fp16: 7e9 × 2 / 1024^3 ≈ 13.04 GB
  • int8: 7e9 × 1 / 1024^3 ≈ 6.52 GB
  • int4: 7e9 × 0.5 / 1024^3 ≈ 3.26 GB

KV-cache caveat

Even if the weights fit after quantization, serving memory still grows with:

  • sequence length
  • batch size / concurrency
  • layer count
  • number of KV heads

So a quantized model can still fail under long prompts or many simultaneous requests.

Hardware assumption for the real hands_on_lab

Assume one 24 GB GPU for a practical local decision:

  • fp16 7B can be tight once runtime buffers and KV cache are included
  • int8 is much safer for serving
  • int4 may be needed if concurrency or context is large
  • LoRA is feasible if the base model already fits for training
  • QLoRA becomes attractive if only the compressed base fits

Smoke-path note

The runnable scripts in this folder use a tiny local model to validate the workflow. The formulas above keep the decision memo grounded in a realistic 7B-style deployment question.