04. Week 13 — Daily Recall

Spaced practice. Answer from memory. If stuck, jump to the explainer chapter referenced in parentheses.

Monday (after ELI5 + chapter 1)

  1. Tell the painter analogy using all five placeholders: eye, translator, canvas, patch, frame tape. (ELI5)
  2. Why did the multimodal model miss the burned capacitor? Give three plausible reasons. (§1.1)
  3. Why is multimodal AI strategically important for senior AI roles now? (§1.2)
  4. What does it mean to say pixels are not meanings yet? (§2.1)
  5. Give one real production risk from over-trusting a VLM answer. (§1.3)

Tuesday (after chapter 2)

  1. How does ViT split a 224×224 image into tokens? (§2.2)
  2. What exactly is a patch token? (§2.2)
  3. Why are positional embeddings needed in vision transformers? (§2.2)
  4. CLIP training objective — describe it without using the word “contrastive.” (§2.4)
  5. Zero-shot image classification with CLIP — workflow from memory. (§2.4)
  6. [W12] Reasoning model vs prompt-based reasoning — underlying difference? (Module 12)
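Self-check for question 1, to be read only after answering from memory. This is a minimal sketch and not the chapter's implementation: it assumes a patch size of 16, which the question does not state, and the helper name `patchify` is made up for illustration. It only cuts the image into flattened, non-overlapping patches; the learned linear projection, the class token, and the positional embeddings asked about in questions 2 and 3 are deliberately left out.

```python
import numpy as np


def patchify(image: np.ndarray, patch: int) -> np.ndarray:
    """Split an (H, W, C) image into flattened, non-overlapping square patches.

    Returns an array of shape (num_patches, patch * patch * C), ordered
    row by row across the patch grid.
    """
    h, w, c = image.shape
    if h % patch != 0 or w % patch != 0:
        raise ValueError("image height and width must be divisible by the patch size")
    grid_h, grid_w = h // patch, w // patch
    # (grid_h, patch, grid_w, patch, C) -> (grid_h, grid_w, patch, patch, C)
    blocks = image.reshape(grid_h, patch, grid_w, patch, c).transpose(0, 2, 1, 3, 4)
    return blocks.reshape(grid_h * grid_w, patch * patch * c)


# Assumed setup: a 224x224 RGB image and 16x16 patches.
image = np.zeros((224, 224, 3), dtype=np.float32)
tokens = patchify(image, patch=16)
print(tokens.shape)  # (196, 768): a 14x14 grid, each patch 16*16*3 values
```

Under these assumptions the sequence has 196 entries before any extra tokens are added. A different patch size changes the count, so compare the reasoning with your answer, not just the number.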

Wednesday (after chapter 3)

  1. Draw the generic VLM architecture from memory. (§3.2)
  2. Why can’t we just paste raw image vectors into an LLM? (§3.2)
  3. LLaVA training recipe — two stages in plain English. (§3.3)
  4. What broad architectural idea do GPT-4V-, Claude-, and Gemini-style systems share? (§3.4)
  5. Name three tasks where VLMs help, and three where they still fail. (§3.5)
  6. [W11] LLM-as-Judge — one rubric dimension you would reuse for multimodal evals. (Module 11)

Thursday (after chapters 4-5)

  1. GAN generator vs discriminator — one sentence each. (§4.1)
  2. What is latent space, and why does it matter for image generation? (§4.2)
  3. Why did diffusion mostly replace GANs for mainstream image generation? (§4.3)
  4. Why is video generation harder than image generation? Give four reasons. (§5.2)
  5. What does temporal attention buy you? (§5.3)
  6. Noise vs signal — explain using a photo, not math. (§4.3, §6.5)

Friday (cumulative)

  1. Re-tell the full ladder: classification → captioning → generation → video. (ELI5)
  2. ViT, CLIP, VLM, diffusion — put them in one causal chain. (§2-§4)
  3. What is the single biggest hallucination risk in visual systems? (§5.5)
  4. Which failure from the chapter-6 table worries you most in production? Why? (§6.1)
  5. What exact bridge sentence takes you into Module 14? (§6.6)

Weekend (before the hands-on lab)

  1. Sketch the ViT patch grid and CLIP training loop without notes. (§2.2, §2.4)
  2. Explain the burned-capacitor failure to a non-technical manager in five sentences. (§1.1)
  3. Give one retrieval prompt, one diagnosis prompt, and one generation prompt for the same image set. (§3.6)
  4. List the four foundation concepts Module 14 assumes. (§6.5)
  5. Name two honest limitations of image models and two honest limitations of video models. (§5.5)