04. Week 6 — Daily Recall

Spaced practice.

Answer from memory first.

If you stall, jump to the referenced section in 02_explainer.md.

Monday (after ELI5 + chapter 1)

  1. In the sculptor analogy, what are the blueprint, the field notes, the rounding error, the overlay sketch, and the site constraint? (explainer ELI5)
  2. A 70B model in fp16 needs roughly how much raw memory for weights? (explainer §1.1)
  3. Why is “the model loads on my machine” not yet a production answer? (explainer §1.3)
  4. Name four options when the model does not fit and one trade-off for each. (explainer §1.2-§1.3)
  5. What does a Lead AI Engineer own here besides just picking a model name? (explainer §1.3)

Tuesday (after chapter 2)

  1. fp32 vs fp16 vs bf16 — what changes in bits, and what changes in practice? (explainer §2.2)
  2. Precision vs range — define each in one sentence. (explainer §2.3)
  3. Why is bf16 usually preferred to fp16 for training? (explainer §2.4)
  4. Compute the raw weight memory for 7B, 13B, and 70B models in fp16, from memory. (explainer §2.5)
  5. Why is int4 exciting and dangerous at the same time? (explainer §2.5)
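
Once you have answered from memory, a small calculator like the one below can check the weight-memory arithmetic in question 4. It is a minimal sketch, not taken from the explainer: the function name `weight_memory_gb` and the `BYTES_PER_PARAM` table are hypothetical. It counts weights only, uses decimal GB (1 GB = 1e9 bytes), and ignores the scale and zero-point metadata that real int4 formats also store.

```python
# Bytes per parameter for common weight formats.
# Assumption: int4 packs two weights per byte and quantization metadata is ignored.
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "bf16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gb(params_billion: float, dtype: str = "fp16") -> float:
    """Raw weight memory in decimal GB (weights only, no KV cache or activations)."""
    total_bytes = params_billion * 1e9 * BYTES_PER_PARAM[dtype]
    return total_bytes / 1e9


if __name__ == "__main__":
    for size in (7, 13, 70):
        fp16 = weight_memory_gb(size, "fp16")
        int4 = weight_memory_gb(size, "int4")
        print(f"{size}B  fp16: {fp16:.0f} GB  int4: {int4:.1f} GB")
```

The int4 column shows why question 5 calls it exciting: a quarter of the fp16 footprint. The danger (accuracy loss) is not visible in this arithmetic at all.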

Wednesday (after chapter 3)

  1. Write the core quantization equations. (explainer §3.1)
  2. In the worked int4 example, which value got hurt most by rounding and why? (explainer §3.2)
  3. Why can per-tensor quantization crush a small channel? (explainer §3.3)
  4. GPTQ — what is it trying to preserve during quantization? (explainer §3.4)
  5. AWQ — why can a small weight still be very important? (explainer §3.5)
  6. What usually breaks first when moving from int8 to int4? Name at least three failure surfaces. (explainer §3.6)
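
To check questions 1 and 3 after answering, the sketch below uses one common form of the quantization equations: symmetric absmax quantization, q = clamp(round(x / s)) and x_hat = s * q, with s = max|x| / q_max. This is an assumption. The explainer's §3.1 may use a different form (for example, one with a zero point), and the weight values here are made up for illustration rather than taken from its worked example.

```python
def quantize_symmetric(values, bits=4):
    """Symmetric absmax quantization: returns integer codes and the scale."""
    qmax = 2 ** (bits - 1) - 1                      # 7 for int4
    scale = (max(abs(v) for v in values) / qmax) or 1.0  # avoid a zero scale
    codes = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return codes, scale


def dequantize(codes, scale):
    """Map integer codes back to approximate real values."""
    return [scale * c for c in codes]


if __name__ == "__main__":
    # Hypothetical weights: one large-magnitude channel, one small one.
    large_channel = [3.0, -7.0]
    small_channel = [0.02, -0.03]

    # Per-tensor: one scale for everything, set by the largest value.
    codes, scale = quantize_symmetric(large_channel + small_channel)
    print("per-tensor codes:", codes, "scale:", scale)

    # Per-channel: the small channel gets its own scale.
    codes_small, scale_small = quantize_symmetric(small_channel)
    print("per-channel codes (small channel):", codes_small)
    print("reconstructed:", dequantize(codes_small, scale_small))
```

With a single per-tensor scale, both small-channel weights round to code 0 and the channel's signal is gone. With its own scale, the same channel uses most of the int4 range.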

Thursday (after chapter 4)

  1. Write the KV-cache memory formula from memory. (explainer §4.2)
  2. For a 70B-style model with GQA at 8K context, what is the rough per-request KV-cache size? (explainer §4.2)
  3. Why can a quantized model still OOM under long-context traffic? (explainer §4.1-§4.2)
  4. MQA vs GQA — what gets shared, and why does that help memory? (explainer §4.3-§4.4)
  5. PagedAttention — what systems idea is it borrowing? (explainer §4.5)
  6. Prefill vs decode — why are they different serving phases? (explainer §4.6)
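
After writing the formula from memory, the sketch below can check the arithmetic in questions 1, 2, and 4. It assumes the usual form of the formula, 2 x layers x KV heads x head dimension x sequence length x bytes per element, and a 70B-style configuration of 80 layers, 8 KV heads (GQA), 64 query heads, and head dimension 128 in fp16. Those configuration numbers are assumptions for illustration; the explainer's §4.2 may use different ones. The function name `kv_cache_bytes` is hypothetical.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len,
                   bytes_per_elem=2, batch=1):
    """KV-cache size in bytes. The leading 2 covers one K and one V tensor per layer."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem * batch


if __name__ == "__main__":
    GIB = 2 ** 30
    seq_len = 8192  # 8K context

    # Assumed 70B-style shape: 80 layers, head_dim 128, fp16 cache.
    gqa = kv_cache_bytes(n_layers=80, n_kv_heads=8, head_dim=128, seq_len=seq_len)
    mha = kv_cache_bytes(n_layers=80, n_kv_heads=64, head_dim=128, seq_len=seq_len)

    print(f"GQA (8 KV heads):  {gqa / GIB:.1f} GiB per request")
    print(f"MHA (64 KV heads): {mha / GIB:.1f} GiB per request")
    print(f"16 concurrent GQA requests: {16 * gqa / GIB:.1f} GiB")
```

The last line is the point of question 3: the cache scales with sequence length and concurrency, so a model whose quantized weights fit comfortably can still run out of memory under long-context traffic.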

Friday (after chapter 5)

  1. Why is full fine-tuning expensive beyond just the raw weight memory? (explainer §5.1)
  2. Write the LoRA formula and define rank r. (explainer §5.2)
  3. For a 4096 x 4096 matrix, how many trainable params does rank-16 LoRA add? (explainer §5.2)
  4. QLoRA — what two ideas are combined? (explainer §5.4)
  5. When is prompt improvement enough, and when do you need PEFT? (explainer §5.6)
  6. When is RAG better than fine-tuning? (explainer §5.6, §6.6)
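
To check question 3 after answering, the sketch below counts LoRA parameters under the standard formulation W' = W + (alpha / r) * B A, where B is d_out x r and A is r x d_in, so the adapter adds r * (d_in + d_out) trainable parameters. The function name `lora_trainable_params` is hypothetical, and the explainer's §5.2 may write the formula with slightly different notation.

```python
def lora_trainable_params(d_in: int, d_out: int, rank: int) -> int:
    """Parameters in the two low-rank factors: A (rank x d_in) and B (d_out x rank)."""
    return rank * d_in + d_out * rank


if __name__ == "__main__":
    d = 4096
    rank = 16
    adapter = lora_trainable_params(d, d, rank)
    full = d * d  # parameters in the frozen 4096 x 4096 matrix

    print(f"LoRA rank-{rank} params: {adapter:,}")
    print(f"Full matrix params:   {full:,}")
    print(f"Adapter / full:       {adapter / full:.4%}")
```

The ratio is under 1% for this matrix, which is the intuition behind question 1: the adapter is cheap to train because gradients and optimizer state are only kept for the small factors.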

Weekend (cumulative)

  1. Recreate the failure-fix table from explainer §6.1 with at least 8 rows.
  2. Do the Scenario A serving math from explainer §6.4 without looking.
  3. Explain, in one clean paragraph, why “weights are fixed cost and KV cache is traffic-shaped cost.” (explainer §4.1)
  4. Give one honest admission about what this module does not make you an expert in. (explainer §5.8)
  5. Say the bridge to Module 07 from memory: what problem does RAG solve that quantization/fine-tuning do not? (explainer §6.7)