05. Assignment 13 — Multimodal Retrieval + Visual Inspection

Week 13. Build a small system that retrieves images well and admits when visual reasoning is weak.

Required reading first: 02_explainer.md chapters 1-3. The hands-on lab is much easier once the roles of the eye, the translator, and the patch are clear.

Goal

Build a multimodal system with two user flows:

  • Retrieval: text → relevant images, and image → similar images
  • Inspection: run a VLM-style prompt on a defect-oriented subset, then analyze failures honestly

This is not just a demo. It is a measurement exercise.

Constraints

  • Use an existing vision model family: CLIP, SigLIP, OpenCLIP, or equivalent
  • You may use a hosted VLM API for the inspection part
  • Keep the dataset small enough to run locally if needed
  • Do not hide failure cases; document them explicitly

Required dataset

Create or source two subsets:

  1. Retrieval subset
     • 500-3000 images
     • diverse objects/scenes/styles
     • captions or tags preferred for evaluation

  2. Inspection subset
     • 30-100 images with visible issues or fine details
     • examples: damaged products, circuit boards, manufacturing defects, packaging errors
     • create a tiny gold label file describing the visible problem
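The assignment does not prescribe a format for the gold label file, so the following is only a minimal sketch of one possible layout. The field names `image`, `has_defect`, and `note`, the example file names, and the `validate_labels` helper are all assumptions made for illustration. Including a few defect-free images (with `has_defect` set to false) is also an assumption; it makes it possible to measure false alarms later.

```python
import json

# Hypothetical schema: the assignment does not prescribe these field names.
EXAMPLE_LABELS = [
    {"image": "board_001.jpg", "has_defect": True,
     "note": "cracked solder joint near the top-left connector"},
    {"image": "box_014.jpg", "has_defect": True,
     "note": "shipping label printed upside down"},
    {"image": "board_002.jpg", "has_defect": False, "note": ""},
]

REQUIRED_FIELDS = {"image": str, "has_defect": bool, "note": str}


def validate_labels(records):
    """Return a list of human-readable problems; an empty list means the file is usable."""
    problems = []
    seen = set()
    for i, rec in enumerate(records):
        # Every record needs all required fields with the expected type.
        for field, kind in REQUIRED_FIELDS.items():
            if not isinstance(rec.get(field), kind):
                problems.append(f"record {i}: '{field}' missing or not {kind.__name__}")
        # A defect without a description cannot be compared to model output.
        if rec.get("has_defect") is True and not rec.get("note"):
            problems.append(f"record {i}: defect image has an empty note")
        # Each image should be labeled exactly once.
        image = rec.get("image")
        if image in seen:
            problems.append(f"record {i}: duplicate image {image!r}")
        seen.add(image)
    return problems


if __name__ == "__main__":
    text = json.dumps(EXAMPLE_LABELS, indent=2)  # what inspection_labels.json could contain
    print(text)
    print(validate_labels(json.loads(text)))
```

Keeping the note as free text is a deliberate choice in this sketch: it records what a human actually sees, which is what the model's answer has to be compared against.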

Required deliverables

  1. index.py — embed and store image vectors
  2. search.py — text-to-image and image-to-image retrieval
  3. inspect.py — prompt a VLM or multimodal model on the defect subset
  4. eval.py — precision@5 or recall@5 for retrieval, plus failure-rate summary for inspection
  5. gold_queries.json — at least 20 text queries with expected relevant images
  6. inspection_labels.json — ground-truth defect notes for the inspection subset
  7. README.md — architecture, metrics, examples, honest failure analysis
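The retrieval half of eval.py can be quite small. The sketch below assumes a hypothetical gold file layout (a list of objects with `query` and `relevant` keys) and a hypothetical `search_fn` callable that returns ranked image IDs; neither is specified by the assignment. Precision@k here divides by k, which is the usual convention, so a query with fewer than k relevant images can never score 1.0.

```python
def precision_at_k(retrieved, relevant, k=5):
    """Fraction of the top-k retrieved IDs that appear in the relevant set."""
    if k <= 0:
        raise ValueError("k must be positive")
    relevant = set(relevant)
    hits = sum(1 for item in retrieved[:k] if item in relevant)
    return hits / k


def recall_at_k(retrieved, relevant, k=5):
    """Fraction of the relevant IDs that appear in the top-k retrieved IDs."""
    relevant = set(relevant)
    if not relevant:
        return 0.0  # nothing to find; treated as zero in this sketch
    hits = len(relevant & set(retrieved[:k]))
    return hits / len(relevant)


def evaluate(gold_queries, search_fn, k=5):
    """Average precision@k and recall@k over a gold set.

    gold_queries: list of {"query": str, "relevant": [image_id, ...]} (assumed layout)
    search_fn:    hypothetical callable, query text -> ranked list of image IDs
    """
    if not gold_queries:
        raise ValueError("gold set is empty")
    precisions, recalls = [], []
    for entry in gold_queries:
        ranked = search_fn(entry["query"])
        precisions.append(precision_at_k(ranked, entry["relevant"], k))
        recalls.append(recall_at_k(ranked, entry["relevant"], k))
    n = len(gold_queries)
    return {"precision": sum(precisions) / n, "recall": sum(recalls) / n}
```

Reporting the per-query scores alongside the averages makes it much easier to find the failure slices the writeup asks for.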

Minimum architecture

User query
  ↓
Text encoder / CLIP text tower
  ↓
Shared embedding space
  ↓
Nearest-neighbor retrieval
  ↓
Top-K images
  ↓
(optional) VLM reranker / inspector
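The nearest-neighbor step in this pipeline can be sketched without any model at all. The example below uses short hand-written vectors as stand-ins for real CLIP embeddings (an assumption purely for illustration), and the `top_k` helper and file names are hypothetical. Because text and images share one embedding space, the same function could serve both text → image and image → image search; only the source of the query vector changes.

```python
import math


def normalize(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def top_k(query_vec, index, k=5):
    """Brute-force cosine search.

    index: dict mapping image ID -> embedding (toy vectors here, not real CLIP output)
    Returns a list of (image_id, similarity), best match first.
    """
    q = normalize(query_vec)
    scored = []
    for image_id, vec in index.items():
        similarity = sum(a * b for a, b in zip(q, normalize(vec)))
        scored.append((image_id, similarity))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


if __name__ == "__main__":
    # Toy 3-dimensional "embeddings"; real ones have hundreds of dimensions.
    toy_index = {
        "cat.jpg": [0.9, 0.1, 0.0],
        "dog.jpg": [0.8, 0.3, 0.1],
        "car.jpg": [0.0, 0.1, 0.95],
    }
    for image_id, score in top_k([1.0, 0.0, 0.0], toy_index, k=2):
        print(image_id, round(score, 3))
```

Brute force is usually sufficient for a few thousand images; an approximate index only becomes worth the complexity at much larger scales.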

Required experiments

| Experiment | What to measure |
| --- | --- |
| CLIP model A vs model B | precision@5 and latency |
| Short query vs descriptive query | retrieval quality |
| Text-only retrieval vs text + image reranking | quality change |
| Normal image vs defect image prompting | hallucination / miss rate |
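Two of these measurements are easy to get subtly wrong, so here is a minimal sketch of each. The definitions are assumptions, not part of the assignment: a "miss" is a defect image on which the model reported no defect, and a "hallucination" is a normal image on which the model reported one. The record keys `has_defect` and `model_flagged` and both helper names are hypothetical.

```python
import statistics
import time


def inspection_rates(results):
    """Compute miss and hallucination rates from judged inspection results.

    results: list of dicts with
      'has_defect'    - gold label (bool)
      'model_flagged' - whether the model claimed a defect (bool, judged by you)
    A rate is None when its denominator is empty.
    """
    defect = [r for r in results if r["has_defect"]]
    normal = [r for r in results if not r["has_defect"]]
    misses = sum(1 for r in defect if not r["model_flagged"])
    hallucinations = sum(1 for r in normal if r["model_flagged"])
    return {
        "miss_rate": misses / len(defect) if defect else None,
        "hallucination_rate": hallucinations / len(normal) if normal else None,
    }


def median_latency_ms(fn, inputs):
    """Median wall-clock time of fn(item) in milliseconds over a non-empty input list."""
    timings = []
    for item in inputs:
        start = time.perf_counter()
        fn(item)
        timings.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(timings)
```

The median is used rather than the mean because the first call to a model often includes one-time loading cost. Note that a model can flag the right image for the wrong reason, so `model_flagged` alone does not show that the described defect matches the gold note; that check still needs a human read.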

Success criteria

  • Retrieval feels semantically relevant on real text queries
  • Precision@5 is reported on a gold set
  • Inspection pipeline identifies some visible defects correctly
  • Failure analysis is specific, not vague
  • You can explain why the system failed on at least five examples

Common pitfalls

  • Treating cosine similarity as magic without inspecting nearest neighbors
  • Using only easy, centered images; that hides real failure modes
  • Assuming the VLM “saw everything” without zoom/crop experiments
  • Ignoring OCR, tiny objects, rotated views, or partial occlusion
  • Reporting accuracy without showing failure slices
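A zoom/crop experiment needs a systematic way to cut each image into regions. The sketch below generates overlapping tile boxes that cover the whole image; the `tile_boxes` name, the tile size, and the 25% default overlap are illustrative assumptions. Boxes are `(left, upper, right, lower)` tuples, the same order that Pillow's `Image.crop` accepts, but no imaging library is needed to compute them.

```python
def tile_boxes(width, height, tile, overlap=0.25):
    """Return (left, upper, right, lower) boxes that cover a width x height image.

    tile:    side length of each square tile in pixels
    overlap: fraction of a tile shared with its neighbor, so that a defect
             sitting on a tile border is fully visible in at least one crop
    """
    if tile <= 0 or not 0 <= overlap < 1:
        raise ValueError("tile must be positive and overlap must be in [0, 1)")
    step = max(1, int(tile * (1 - overlap)))

    def starts(size):
        # An image smaller than one tile needs a single crop starting at 0.
        if size <= tile:
            return [0]
        positions = list(range(0, size - tile, step))
        positions.append(size - tile)  # last tile sits flush with the far edge
        return positions

    return [
        (x, y, min(x + tile, width), min(y + tile, height))
        for y in starts(height)
        for x in starts(width)
    ]


if __name__ == "__main__":
    for box in tile_boxes(1000, 600, tile=512):
        print(box)
```

Running the same inspection prompt on the full image and on each tile, then comparing answers, gives direct evidence of whether the model missed a detail because of resolution.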

Suggested stretch goals

  • Add a simple web UI with thumbnail results
  • Add crop-based inspection for tiny defects
  • Compare CLIP retrieval against a caption-first baseline
  • Add hybrid ranking: embedding score + metadata filter
  • Add “I am uncertain” prompting for the inspection model
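As one possible shape for the hybrid ranking goal, the sketch below applies a hard metadata filter first and then orders the survivors by embedding score. The candidate layout (`id`, `score`, `tags`) and the `hybrid_rank` helper are assumptions for illustration; a weighted blend of score and metadata match would be an equally valid design.

```python
def hybrid_rank(candidates, required_tags=(), k=5):
    """Filter candidates by metadata, then rank the rest by embedding score.

    candidates:    list of dicts with 'id', 'score' (e.g. cosine similarity), 'tags'
    required_tags: every tag listed here must be present on a candidate
    Returns the IDs of the top-k surviving candidates, best first.
    """
    required = set(required_tags)
    kept = [c for c in candidates if required <= set(c["tags"])]
    kept.sort(key=lambda c: c["score"], reverse=True)
    return [c["id"] for c in kept[:k]]


if __name__ == "__main__":
    # Toy candidates with made-up scores and tags.
    pool = [
        {"id": "img_01", "score": 0.91, "tags": ["outdoor", "car"]},
        {"id": "img_02", "score": 0.88, "tags": ["indoor", "car"]},
        {"id": "img_03", "score": 0.75, "tags": ["outdoor", "bicycle"]},
    ]
    print(hybrid_rank(pool, required_tags=["outdoor"], k=2))
```

Filtering before ranking keeps the behavior easy to explain, at the cost of dropping images whose metadata is missing or wrong, which is itself a failure mode worth documenting.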

What to show in the writeup

  • Dataset choice and why it matters
  • Retrieval architecture diagram
  • Example good queries and bad queries
  • Three inspection failures with screenshots
  • One lesson about grounding versus language fluency
  • One lesson that will matter for diffusion in Module 14

Writeup questions

  1. Which errors came from the vision encoder, and which from the language side?
  2. Did cropping or zooming change inspection results materially?
  3. Which queries exposed shared-embedding strengths best?
  4. Which inspection cases produced confident but wrong answers?
  5. What would you change before shipping this into production?

LinkedIn post template

"This week I built a multimodal retrieval system with CLIP and then stress-tested a vision-language model on small visual defects.

The biggest insight: a model sounding confident about an image is not the same as a model being grounded in the right pixels.

Three things I learned:
1. [your insight]
2. [your insight]
3. [your insight]

Repo/demo: [link]"

Why this hands-on lab matters

Search, cataloging, moderation, inspection, and support are all going multimodal. A Lead AI Engineer must know two things simultaneously:

  • how to make these systems useful
  • how to explain their failure modes before users discover them the hard way