04. Week 13 — Daily Recall
Spaced practice. Answer from memory. If stuck, jump to the explainer chapter referenced in parentheses.
Monday (after ELI5 + chapter 1)
- Tell the painter analogy using all five placeholders: eye, translator, canvas, patch, frame tape. (ELI5)
- Why did the multimodal model miss the burned capacitor? Give three plausible reasons. (§1.1)
- Why is multimodal AI strategically important for senior AI roles now? (§1.2)
- What does it mean to say pixels are not meanings yet? (§2.1)
- Give one real production risk from over-trusting a VLM answer. (§1.3)
Tuesday (after chapter 2)
- How does ViT split a 224×224 image into tokens? (§2.2)
- What exactly is a patch token? (§2.2)
- Why are positional embeddings needed in vision transformers? (§2.2)
- CLIP training objective — describe it without using the word “contrastive.” (§2.4)
- Zero-shot image classification with CLIP — workflow from memory. (§2.4)
- [W12] Reasoning model vs prompt-based reasoning — underlying difference? (Module 12)
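
To check your answer to the ViT splitting question above once you have attempted it from memory, the patch arithmetic can be sketched as follows. The patch size of 16 is an assumption (the common ViT-B/16 setting; the question does not state one), and `patchify` is a hypothetical helper written for this sheet, not a library function. In a real ViT each flattened patch is then passed through a learned linear projection and combined with a positional embedding before it is used as a token; that step is left out here.

```python
def patchify(image, patch_size):
    """Split an H x W x C nested-list image into flattened patches, row-major order."""
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image sides must be divisible by patch_size")
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            flat = []
            # Walk the patch row by row and append every channel value.
            for row in image[top:top + patch_size]:
                for pixel in row[left:left + patch_size]:
                    flat.extend(pixel)
            patches.append(flat)
    return patches


# Dummy all-black 224 x 224 RGB image; patch size 16 is an assumed setting.
image = [[[0, 0, 0] for _ in range(224)] for _ in range(224)]
patches = patchify(image, patch_size=16)

num_tokens = len(patches)    # (224 // 16) ** 2 = 196 patches
patch_dim = len(patches[0])  # 16 * 16 * 3 = 768 raw values per patch
print(num_tokens, patch_dim)
```

Changing `patch_size` to 32 gives a 7 x 7 grid, which is one way to see why smaller patches mean longer token sequences.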
Wednesday (after chapter 3)
- Draw the generic VLM architecture from memory. (§3.2)
- Why can’t we just paste raw image vectors into an LLM? (§3.2)
- LLaVA training recipe — two stages in plain English. (§3.3)
- What broad architectural idea do GPT-4V-, Claude-, and Gemini-style systems share? (§3.4)
- Name three tasks where VLMs help, and three where they still fail. (§3.5)
- [W11] LLM-as-Judge — one rubric dimension you would reuse for multimodal evals. (Module 11)
Thursday (after chapters 4-5)
- GAN generator vs discriminator — one sentence each. (§4.1)
- What is latent space, and why does it matter for image generation? (§4.2)
- Why did diffusion mostly replace GANs for mainstream image generation? (§4.3)
- Why is video generation harder than image generation? Give four reasons. (§5.2)
- What does temporal attention buy you? (§5.3)
- Noise vs signal — explain using a photo, not math. (§4.3, §6.5)
Friday (cumulative)
- Re-tell the full ladder: classification → captioning → generation → video. (ELI5)
- ViT, CLIP, VLM, diffusion — put them in one causal chain. (§2-§4)
- What is the single biggest hallucination risk in visual systems? (§5.5)
- Which failure from the chapter-6 table worries you most in production? Why? (§6.1)
- What exact bridge sentence takes you into Module 14? (§6.6)
Weekend (pre-hands_on_lab)
- Sketch the ViT patch grid and CLIP training loop without notes. (§2.2, §2.4)
- Explain the burned-capacitor failure to a non-technical manager in five sentences. (§1.1)
- Give one retrieval prompt, one diagnosis prompt, and one generation prompt for the same image set. (§3.6)
- List the four foundation concepts Module 14 assumes. (§6.5)
- Name two honest limitations of image models and two honest limitations of video models. (§5.5)
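
After sketching the CLIP training loop from memory, you can compare it against the toy version below. It is a minimal sketch under stated assumptions, not the reference implementation: the embeddings are hand-written stand-ins for encoder outputs, `clip_style_loss` is a hypothetical name, and the temperature is fixed at 0.07 for illustration, whereas CLIP learns it during training. The idea shown is that each image should score highest against its own caption and each caption against its own image.

```python
import math


def normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def cross_entropy(logits, target):
    """Negative log-softmax of the target entry, computed stably."""
    peak = max(logits)
    log_sum = peak + math.log(sum(math.exp(l - peak) for l in logits))
    return log_sum - logits[target]


def clip_style_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric loss over a batch where image k is paired with text k."""
    imgs = [normalize(v) for v in image_embs]
    txts = [normalize(v) for v in text_embs]
    n = len(imgs)
    # logits[i][j] = scaled similarity between image i and text j.
    logits = [[sum(a * b for a, b in zip(i, t)) / temperature for t in txts]
              for i in imgs]
    # Image-to-text: each row should pick its own column.
    loss_images = sum(cross_entropy(logits[k], k) for k in range(n)) / n
    # Text-to-image: each column should pick its own row.
    loss_texts = sum(
        cross_entropy([logits[r][k] for r in range(n)], k) for k in range(n)
    ) / n
    return (loss_images + loss_texts) / 2


# Hypothetical toy embeddings standing in for encoder outputs.
image_embs = [[1.0, 0.0], [0.0, 1.0]]
matched_texts = [[1.0, 0.0], [0.0, 1.0]]
swapped_texts = [[0.0, 1.0], [1.0, 0.0]]

matched_loss = clip_style_loss(image_embs, matched_texts)
swapped_loss = clip_style_loss(image_embs, swapped_texts)
print(matched_loss < swapped_loss)
```

Correctly paired batches give a loss near zero and swapped pairs give a large one, which is the signal that pulls matching image and text embeddings together during training.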