05. Assignment 13: Multimodal Retrieval + Visual Inspection

Week 13. Build a small system that retrieves images well and admits when visual reasoning is weak.

Required reading first: 02_explainer.md chapters 1-3. The hands-on lab is much easier once the roles of the eye, the translator, and the patch are clear.

Goal

Build a multimodal system with two user flows:

- Retrieval: text → relevant images, and image → similar images
- Inspection: run a VLM-style prompt on a defect-oriented subset, then analyze failures honestly

This is not just a demo. It is a measurement exercise.

Constraints

Use an existing vision model family: CLIP, SigLIP, OpenCLIP, or equivalent
You may use a hosted VLM API for the inspection part
Keep the dataset small enough to run locally if needed
Do not hide failure cases; document them explicitly

Required dataset

Create or source two subsets:

Retrieval subset
  500-3000 images
  diverse objects/scenes/styles
  captions or tags preferred for evaluation
Inspection subset
  30-100 images with visible issues or fine details
  examples: damaged products, circuit boards, manufacturing defects, packaging errors
  create a tiny gold label file describing the visible problem
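
One possible shape for the gold label file, as a minimal sketch. The field names (`image`, `has_defect`, `defect`, `region`) and the file paths are assumptions made for illustration, not a required schema. Including a few defect-free images is also an assumption; it makes the normal-versus-defect comparison possible later.

```python
import json

# Hypothetical gold labels: one record per inspection image.
# Field names and paths are illustrative; adapt them to your dataset.
labels = [
    {"image": "inspect/board_001.jpg", "has_defect": True,
     "defect": "cracked solder joint", "region": "top-left"},
    {"image": "inspect/box_014.jpg", "has_defect": True,
     "defect": "label printed upside down", "region": "center"},
    {"image": "inspect/board_007.jpg", "has_defect": False,
     "defect": None, "region": None},
]


def validate(records):
    """Return a list of human-readable problems found in the label records."""
    problems = []
    seen = set()
    for i, rec in enumerate(records):
        if rec["image"] in seen:
            problems.append(f"record {i}: duplicate image {rec['image']}")
        seen.add(rec["image"])
        if rec["has_defect"] and not rec["defect"]:
            problems.append(f"record {i}: defect image without a description")
        if not rec["has_defect"] and rec["defect"]:
            problems.append(f"record {i}: normal image with a defect description")
    return problems


# Serialize to the JSON text that would be saved as inspection_labels.json.
serialized = json.dumps(labels, indent=2)
```

Running a check like `validate` before every evaluation run keeps label mistakes from being counted as model failures.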

Required deliverables

index.py — embed and store image vectors
search.py — text-to-image and image-to-image retrieval
inspect.py — prompt a VLM or multimodal model on the defect subset
eval.py — precision@5 or recall@5 for retrieval, plus failure-rate summary for inspection
gold_queries.json — at least 20 text queries with expected relevant images
inspection_labels.json — ground-truth defect notes for the inspection subset
README.md — architecture, metrics, examples, honest failure analysis
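
The retrieval metrics named for eval.py can be sketched as follows. The record layout (`query`, `ranked`, `relevant`) is an assumption about how gold_queries.json and the search output might be joined; the image IDs are placeholders.

```python
def precision_at_k(ranked_ids, relevant_ids, k=5):
    """Fraction of the top-k results that are relevant."""
    hits = sum(1 for image_id in ranked_ids[:k] if image_id in relevant_ids)
    return hits / k


def recall_at_k(ranked_ids, relevant_ids, k=5):
    """Fraction of all relevant images that appear in the top-k results."""
    if not relevant_ids:
        return 0.0
    hits = sum(1 for image_id in ranked_ids[:k] if image_id in relevant_ids)
    return hits / len(relevant_ids)


def mean_metric(metric, runs, k=5):
    """Average a per-query metric over all gold queries."""
    return sum(metric(r["ranked"], set(r["relevant"]), k) for r in runs) / len(runs)


# Hypothetical joined records: the ranked search output next to the gold set.
runs = [
    {"query": "red bicycle", "ranked": ["a", "b", "c", "d", "e"],
     "relevant": ["a", "c", "z"]},
    {"query": "rusty bolt", "ranked": ["f", "g", "h", "i", "j"],
     "relevant": ["g"]},
]

mean_p5 = mean_metric(precision_at_k, runs)
mean_r5 = mean_metric(recall_at_k, runs)
```

Precision@5 penalizes queries with fewer than five relevant images in the gold set, while recall@5 penalizes queries with more than five. Reporting which one you chose, and why, is part of the measurement.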

Minimum architecture

User query
  ↓
Text encoder / CLIP text tower
  ↓
Shared embedding space
  ↓
Nearest-neighbor retrieval
  ↓
Top-K images
  ↓
(optional) VLM reranker / inspector
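
The nearest-neighbor step in the diagram above can be sketched in plain Python. The three-dimensional vectors are toy stand-ins for real text and image embeddings, and the file names are hypothetical. A real index would hold encoder outputs and would likely use a vectorized or approximate search library instead of a Python loop.

```python
import math


def normalize(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def top_k(query_vec, index, k=5):
    """Return the k (image_id, score) pairs with the highest cosine similarity.

    `index` maps image_id -> unit-normalized embedding.
    """
    q = normalize(query_vec)
    scored = [
        (sum(a * b for a, b in zip(q, vec)), image_id)
        for image_id, vec in index.items()
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(image_id, score) for score, image_id in scored[:k]]


# Toy vectors standing in for image embeddings (assumption for illustration).
raw = {
    "cat.jpg": [0.9, 0.1, 0.0],
    "dog.jpg": [0.7, 0.3, 0.1],
    "car.jpg": [0.0, 0.2, 0.9],
}
index = {image_id: normalize(vec) for image_id, vec in raw.items()}

# A toy "text embedding" that points along the first axis.
results = top_k([1.0, 0.0, 0.0], index, k=2)
```

Because text and image vectors live in the same space, the same `top_k` call serves image-to-image search: pass an image embedding as the query instead of a text embedding.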

Required experiments

Experiment	What to measure
CLIP model A vs model B	precision@5 and latency
Short query vs descriptive query	retrieval quality
Text-only retrieval vs text + image reranking	quality change
Normal image vs defect image prompting	hallucination / miss rate
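
For the last row of the table, one way to turn model outputs into rates is sketched below. The definitions are assumptions: a miss is a defect image where the model reported no defect, and a hallucination is a normal image where the model reported one. The `model_flagged` field is hypothetical and would come from parsing or manually reviewing each model response.

```python
def inspection_rates(records):
    """Compute miss and hallucination rates from labeled inspection results."""
    defect = [r for r in records if r["has_defect"]]
    normal = [r for r in records if not r["has_defect"]]
    misses = sum(1 for r in defect if not r["model_flagged"])
    hallucinations = sum(1 for r in normal if r["model_flagged"])
    return {
        "miss_rate": misses / len(defect) if defect else None,
        "hallucination_rate": hallucinations / len(normal) if normal else None,
    }


# Hypothetical results: four defect images (one missed), two normal images
# (one falsely flagged).
records = [
    {"image": "d1.jpg", "has_defect": True, "model_flagged": True},
    {"image": "d2.jpg", "has_defect": True, "model_flagged": True},
    {"image": "d3.jpg", "has_defect": True, "model_flagged": False},
    {"image": "d4.jpg", "has_defect": True, "model_flagged": True},
    {"image": "n1.jpg", "has_defect": False, "model_flagged": False},
    {"image": "n2.jpg", "has_defect": False, "model_flagged": True},
]

rates = inspection_rates(records)
```

These rates only capture whether the model flagged something. Whether it described the right defect still needs a manual comparison against the gold notes.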

Success criteria

Retrieval feels semantically relevant on real text queries
Precision@5 is reported on a gold set
Inspection pipeline identifies some visible defects correctly
Failure analysis is specific, not vague
You can explain why the system failed on at least five examples

Common pitfalls

Treating cosine similarity as magic without inspecting nearest neighbors
Using only easy, centered images; that hides real failure modes
Assuming the VLM “saw everything” without zoom/crop experiments
Ignoring OCR, tiny objects, rotated views, or partial occlusion
Reporting accuracy without showing failure slices
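
Reporting failure slices, as the last pitfall asks, can be as simple as tagging each evaluated example and grouping by tag. This is a sketch under assumptions: the slice names (`centered`, `tiny_object`, `ocr`, `rotated`, `occluded`) and the `correct` field are illustrative, not prescribed.

```python
from collections import defaultdict


def accuracy_by_slice(results):
    """Return accuracy per slice tag; an example may belong to several slices."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for r in results:
        for tag in r["slices"]:
            totals[tag] += 1
            correct[tag] += int(r["correct"])
    return {tag: correct[tag] / totals[tag] for tag in totals}


# Hypothetical per-example outcomes with hand-assigned slice tags.
results = [
    {"image": "a.jpg", "slices": ["centered"], "correct": True},
    {"image": "b.jpg", "slices": ["centered"], "correct": True},
    {"image": "c.jpg", "slices": ["tiny_object"], "correct": False},
    {"image": "d.jpg", "slices": ["tiny_object", "occluded"], "correct": False},
    {"image": "e.jpg", "slices": ["ocr"], "correct": True},
    {"image": "f.jpg", "slices": ["rotated", "ocr"], "correct": False},
]

overall = sum(r["correct"] for r in results) / len(results)
by_slice = accuracy_by_slice(results)
```

In this toy data the overall accuracy looks moderate, while the per-slice view shows that every tiny-object example failed. That gap is what a single headline number hides.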

Suggested stretch goals

Add a simple web UI with thumbnail results
Add crop-based inspection for tiny defects
Compare CLIP retrieval against a caption-first baseline
Add hybrid ranking: embedding score + metadata filter
Add “I am uncertain” prompting for the inspection model
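
For the crop-based inspection goal, one approach is to split each image into overlapping tiles and prompt the model on each tile separately. The sketch below only computes the crop boxes. The tile size and overlap defaults are arbitrary assumptions, and the boxes use the (left, upper, right, lower) convention, which is what Pillow's `Image.crop` expects if you use that library.

```python
def tile_boxes(width, height, tile=512, overlap=64):
    """Return (left, upper, right, lower) boxes that cover the whole image."""
    if overlap >= tile:
        raise ValueError("overlap must be smaller than the tile size")
    step = tile - overlap

    def starts(size):
        # A single tile is enough when the image fits inside it.
        if size <= tile:
            return [0]
        positions = list(range(0, size - tile, step))
        positions.append(size - tile)  # final tile sits flush with the edge
        return positions

    return [
        (x, y, min(x + tile, width), min(y + tile, height))
        for y in starts(height)
        for x in starts(width)
    ]


# A 960x512 image needs two horizontally overlapping tiles.
boxes = tile_boxes(960, 512)
```

Overlap matters because a defect that straddles a tile boundary would otherwise be cut in half in every crop. Comparing full-image results against per-tile results also gives direct evidence for the cropping question in the writeup.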

What to show in the writeup

Dataset choice and why it matters
Retrieval architecture diagram
Example good queries and bad queries
Three inspection failures with screenshots
One lesson about grounding versus language fluency
One lesson that will matter for diffusion in Module 14

Writeup questions

Which errors came from the vision encoder, and which from the language side?
Did cropping or zooming change inspection results materially?
Which queries exposed shared-embedding strengths best?
Which inspection cases produced confident but wrong answers?
What would you change before shipping this into production?

LinkedIn post template

"This week I built a multimodal retrieval system with CLIP and then stress-tested a vision-language model on small visual defects.

The biggest insight: a model sounding confident about an image is not the same as a model being grounded in the right pixels.

Three things I learned:
1. [your insight]
2. [your insight]
3. [your insight]

Repo/demo: [link]"

Why this hands-on lab matters

Search, cataloging, moderation, inspection, and support are all going multimodal. A Lead AI Engineer must know two things simultaneously:

- how to make these systems useful
- how to explain their failure modes before users discover them the hard way