04. Week 6: Daily Recall
Spaced practice.
Answer from memory first.
If you stall, jump to the referenced section in 02_explainer.md.
Monday (after ELI5 + chapter 1)
- In the sculptor analogy, what are the blueprint, the field notes, the rounding error, the overlay sketch, and the site constraint? (explainer ELI5)
- A 70B model in fp16 needs roughly how much raw memory for weights? (explainer §1.1)
- Why is “the model loads on my machine” not yet a production answer? (explainer §1.3)
- Name four options when the model does not fit and one trade-off for each. (explainer §1.2-§1.3)
- What does a Lead AI Engineer own here besides just picking a model name? (explainer §1.3)
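
Once you have answered the weight-memory question from memory, a minimal sketch like the one below can check the arithmetic. It assumes the usual back-of-the-envelope rule (parameter count times bytes per parameter) and decimal gigabytes. The helper name `weight_memory_gb` is made up for this illustration and is not taken from the explainer.

```python
# Bytes per parameter for common storage formats.
# int4 packs two weights per byte; real quantized checkpoints also store
# scales and zero-points, so they come out slightly larger than this.
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "bf16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gb(n_params: float, dtype: str) -> float:
    """Raw weight memory in decimal gigabytes (1 GB = 1e9 bytes).

    Hypothetical helper: counts weights only, with no KV cache,
    activations, or framework overhead.
    """
    return n_params * BYTES_PER_PARAM[dtype] / 1e9


print(f"70B in fp16: {weight_memory_gb(70e9, 'fp16'):.0f} GB")
```

The same function works for other model sizes and formats, which is useful for comparing the "does not fit" options against a given GPU budget.
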
Tuesday (after chapter 2)
- fp32 vs fp16 vs bf16 — what changes in bits, and what changes in practice? (explainer §2.2)
- Precision vs range — define each in one sentence. (explainer §2.3)
- Why is bf16 usually preferred to fp16 for training? (explainer §2.4)
- Compute the raw weight memory for 7B, 13B, and 70B models in fp16, from memory. (explainer §2.5)
- Why is int4 exciting and dangerous at the same time? (explainer §2.5)
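
The precision-versus-range questions above can be checked with a small experiment after you have answered them. The sketch below uses only the Python standard library: fp16 comes from the `struct` format code `e`, and bf16 is emulated by keeping the top 16 bits of the fp32 encoding with round-to-nearest-even. The emulation is an assumption made for illustration (it ignores NaN), not how any particular framework implements bf16.

```python
import math
import struct


def to_fp16(x: float) -> float:
    """Round x to IEEE half precision (5 exponent bits, 10 fraction bits)."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        # struct refuses to pack finite values beyond the fp16 maximum (65504);
        # this sketch reports that overflow as infinity.
        return math.copysign(math.inf, x)


def to_bf16(x: float) -> float:
    """Emulate bfloat16 (8 exponent bits, 7 fraction bits).

    Keeps the top 16 bits of the fp32 bit pattern, rounding to nearest even.
    NaN inputs are not handled in this sketch.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits += 0x7FFF + ((bits >> 16) & 1)  # round-to-nearest-even on the dropped bits
    bits &= 0xFFFF0000                   # discard the low 16 bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]


# A large value probes range; a value just above 1 probes precision.
for value in (70000.0, 1.001):
    print(f"{value!r:>10} -> fp16 {to_fp16(value)!r}, bf16 {to_bf16(value)!r}")
```

The large value overflows fp16 but survives in bf16 with a coarse rounding, while the value near 1 keeps more digits in fp16 than in bf16. That is the range-versus-precision trade in miniature.
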
Wednesday (after chapter 3)
- Write the core quantization equations. (explainer §3.1)
- In the worked int4 example, which value got hurt most by rounding and why? (explainer §3.2)
- Why can per-tensor quantization crush a small channel? (explainer §3.3)
- GPTQ — what is it trying to preserve during quantization? (explainer §3.4)
- AWQ — why can a small weight still be very important? (explainer §3.5)
- What usually breaks first when moving from int8 to int4? Name at least three failure surfaces. (explainer §3.6)
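
To self-check the per-tensor question after answering it, here is a minimal sketch of symmetric absmax quantization to int4. This is one common form and is an assumption on my part: explainer §3.1 may use the affine (zero-point) variant instead. The weights and the function name `quantize_symmetric` are invented for the illustration.

```python
def quantize_symmetric(values, n_bits=4, scale=None):
    """Symmetric absmax quantization sketch.

    q = clamp(round(w / s), -qmax, qmax) and w_hat = q * s,
    where s = max|w| / qmax unless a shared scale is passed in.
    Returns (integer codes, dequantized values, scale).
    """
    qmax = 2 ** (n_bits - 1) - 1  # 7 for int4
    if scale is None:
        scale = (max(abs(v) for v in values) / qmax) or 1.0  # avoid a zero scale
    codes = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return codes, [c * scale for c in codes], scale


# Made-up weights: one channel with large values, one with small values.
channels = {"large": [8.0, -6.0, 4.0], "small": [0.05, -0.03, 0.02]}

# Per-tensor: a single scale, set by the largest weight anywhere in the tensor.
shared_scale = max(abs(v) for ch in channels.values() for v in ch) / 7
per_tensor = {name: quantize_symmetric(ch, scale=shared_scale) for name, ch in channels.items()}

# Per-channel: each channel gets its own scale.
per_channel = {name: quantize_symmetric(ch) for name, ch in channels.items()}

for name in channels:
    print(name, "per-tensor codes:", per_tensor[name][0], "per-channel codes:", per_channel[name][0])
```

With the shared scale, every weight in the small channel rounds to code 0, so the channel is erased. With its own scale it keeps distinct codes.
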
Thursday (after chapter 4)
- Write the KV-cache memory formula from memory. (explainer §4.2)
- For a 70B-style model with GQA at 8K context, what is the rough per-request KV-cache size? (explainer §4.2)
- Why can a quantized model still OOM under long-context traffic? (explainer §4.1-§4.2)
- MQA vs GQA — what gets shared, and why does that help memory? (explainer §4.3-§4.4)
- PagedAttention — what systems idea is it borrowing? (explainer §4.5)
- Prefill vs decode — why are they different serving phases? (explainer §4.6)
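
After writing the KV-cache formula from memory, you can check your number against a sketch like this one. It uses the standard per-request estimate (keys plus values, for every layer, KV head, and token). The model shape below (80 layers, 8 KV heads, head dimension 128, fp16 cache) is an assumption chosen to resemble a 70B-style GQA model; the explainer may use different values.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_elem=2, batch=1):
    """Per-batch KV-cache size in bytes.

    The leading factor 2 covers one K tensor and one V tensor per layer.
    Hypothetical helper; ignores paging overhead and allocator fragmentation.
    """
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem * batch


GIB = 2 ** 30

# Assumed 70B-style shape with GQA (8 KV heads) at 8K context, fp16 cache.
gqa = kv_cache_bytes(n_layers=80, n_kv_heads=8, head_dim=128, seq_len=8192)

# Same shape without sharing: one KV head per query head (64 assumed).
mha = kv_cache_bytes(n_layers=80, n_kv_heads=64, head_dim=128, seq_len=8192)

print(f"GQA: {gqa / GIB:.1f} GiB per request")
print(f"MHA: {mha / GIB:.1f} GiB per request")
```

Multiplying the per-request figure by the number of concurrent requests shows why a model whose quantized weights fit comfortably can still run out of memory under long-context traffic.
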
Friday (after chapter 5)
- Why is full fine-tuning expensive beyond just the raw weight memory? (explainer §5.1)
- Write the LoRA formula and define rank `r`. (explainer §5.2)
- For a `4096 x 4096` matrix, how many trainable params does rank-16 LoRA add? (explainer §5.2)
- QLoRA: what two ideas are combined? (explainer §5.4)
- When is prompt improvement enough, and when do you need PEFT? (explainer §5.6)
- When is RAG better than fine-tuning? (explainer §5.6, §6.6)
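
For the LoRA parameter-count question, the sketch below can confirm your answer once you have worked it out. It assumes the standard low-rank update, where a frozen weight `W` of shape `d_out x d_in` is adapted as `W + B A`, with `B` of shape `d_out x r` and `A` of shape `r x d_in` (often with an extra scaling factor that adds no parameters). The function name is made up for the illustration.

```python
def lora_trainable_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters added by one LoRA adapter.

    A has rank * d_in entries and B has d_out * rank entries;
    the frozen base matrix contributes nothing trainable.
    """
    return rank * (d_in + d_out)


full = 4096 * 4096                                # parameters in the frozen matrix
added = lora_trainable_params(4096, 4096, rank=16)

print(f"full matrix:  {full:,} params")
print(f"rank-16 LoRA: {added:,} trainable params ({added / full:.2%} of the matrix)")
```
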
Weekend (cumulative)
- Recreate the failure-fix table from explainer §6.1 with at least 8 rows.
- Do the Scenario A serving math from explainer §6.4 without looking.
- Explain, in one clean paragraph, why “weights are fixed cost and KV cache is traffic-shaped cost.” (explainer §4.1)
- Give one honest admission about what this module does not make you an expert in. (explainer §5.8)
- Say the bridge to Module 07 from memory: what problem does RAG solve that quantization/fine-tuning do not? (explainer §6.7)