05. Assignment 13: Multimodal Retrieval + Visual Inspection
Week 13. Build a small system that retrieves images well and admits when visual reasoning is weak.
Required reading first:
02_explainer.md, chapters 1-3. The hands_on_lab is much easier once the roles of the eye, the translator, and the patch are clear.
Goal
Build a multimodal system with two user flows:
- Retrieval: text → relevant images, and image → similar images
- Inspection: run a VLM-style prompt on a defect-oriented subset, then analyze failures honestly
This is not just a demo. It is a measurement exercise.
Constraints
- Use an existing vision model family: CLIP, SigLIP, OpenCLIP, or equivalent
- You may use a hosted VLM API for the inspection part
- Keep the dataset small enough to run locally if needed
- Do not hide failure cases; document them explicitly
Required dataset
Create or source two subsets:
- Retrieval subset
  - 500-3000 images
  - diverse objects/scenes/styles
  - captions or tags preferred for evaluation
- Inspection subset
  - 30-100 images with visible issues or fine details
  - examples: damaged products, circuit boards, manufacturing defects, packaging errors
  - create a tiny gold label file describing the visible problem
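One possible shape for the tiny gold label file is sketched below. The assignment does not prescribe a schema, so the field names (`image`, `has_defect`, `note`) and the example file names are assumptions made for illustration. The validator only checks that each record is complete enough to score against later.

```python
import json

# Hypothetical schema: the assignment does not fix field names.
EXAMPLE_LABELS = [
    {"image": "board_001.jpg", "has_defect": True,
     "note": "cracked solder joint near the top-left connector"},
    {"image": "board_002.jpg", "has_defect": False, "note": ""},
]

REQUIRED_FIELDS = {"image": str, "has_defect": bool, "note": str}


def validate_labels(records):
    """Return a list of readable problems; an empty list means the file is usable."""
    problems = []
    seen = set()
    for i, rec in enumerate(records):
        # Every record needs all fields with the expected types.
        for field, expected in REQUIRED_FIELDS.items():
            if not isinstance(rec.get(field), expected):
                problems.append(
                    f"record {i}: '{field}' missing or not {expected.__name__}")
        # Each image should be labeled exactly once.
        image = rec.get("image")
        if isinstance(image, str):
            if image in seen:
                problems.append(f"record {i}: duplicate image '{image}'")
            seen.add(image)
        # A defect without a description cannot be compared to model output.
        if rec.get("has_defect") is True and not rec.get("note"):
            problems.append(f"record {i}: defect image needs a non-empty note")
    return problems


if __name__ == "__main__":
    print(json.dumps(EXAMPLE_LABELS, indent=2))
    print(validate_labels(EXAMPLE_LABELS))
```

Including a few defect-free images in the same file, as in the second record, makes it possible to measure false alarms as well as misses.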
Required deliverables
- index.py: embed and store image vectors
- search.py: text-to-image and image-to-image retrieval
- inspect.py: prompt a VLM or multimodal model on the defect subset
- eval.py: precision@5 or recall@5 for retrieval, plus failure-rate summary for inspection
- gold_queries.json: at least 20 text queries with expected relevant images
- inspection_labels.json: ground-truth defect notes for the inspection subset
- README.md: architecture, metrics, examples, honest failure analysis
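The core of index.py and search.py can be sketched without committing to a specific model. The example below is a minimal, dependency-free illustration of storing normalized vectors and ranking by cosine similarity. The class name `VectorIndex` and its methods are hypothetical, and the toy 3-dimensional vectors stand in for real CLIP-style embeddings. At the dataset sizes in this assignment, a NumPy matrix product or a vector library would be the more usual choice.

```python
import math


def normalize(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


class VectorIndex:
    """Hypothetical in-memory index: brute-force cosine search over image vectors."""

    def __init__(self):
        self.ids = []
        self.vectors = []

    def add(self, image_id, embedding):
        self.ids.append(image_id)
        self.vectors.append(normalize(embedding))

    def search(self, query_embedding, k=5, exclude_id=None):
        """Return the top-k (image_id, score) pairs, best first.

        The same method serves both flows: pass a text embedding for
        text-to-image, or an image embedding (with exclude_id set to the
        query image) for image-to-image.
        """
        q = normalize(query_embedding)
        scored = [
            (sum(a * b for a, b in zip(q, v)), image_id)
            for image_id, v in zip(self.ids, self.vectors)
            if image_id != exclude_id
        ]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [(image_id, score) for score, image_id in scored[:k]]


if __name__ == "__main__":
    index = VectorIndex()
    index.add("cat.jpg", [1.0, 0.0, 0.0])
    index.add("dog.jpg", [0.9, 0.1, 0.0])
    index.add("car.jpg", [0.0, 0.0, 1.0])
    print(index.search([1.0, 0.0, 0.0], k=2))
```

Normalizing once at indexing time keeps each query to a plain dot product, which is why text and image queries can share one code path.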
Minimum architecture
User query
↓
Text encoder / CLIP text tower
↓
Shared embedding space
↓
Nearest-neighbor retrieval
↓
Top-K images
↓
(optional) VLM reranker / inspector
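The stages in the diagram can be wired together as plain functions. This is a rough sketch under stated assumptions: `encode_text` and `rerank` are hypothetical callables supplied by the caller (for example a CLIP text tower and a VLM-based reranker), and the stored image vectors are assumed to be unit-normalized already.

```python
from typing import Callable, Dict, List, Optional, Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def retrieve(
    query: str,
    encode_text: Callable[[str], Vector],   # hypothetical: text tower wrapper
    image_index: Dict[str, Vector],         # image_id -> unit-normalized embedding
    k: int = 5,
    rerank: Optional[Callable[[str, List[str]], List[str]]] = None,
) -> List[str]:
    """Text query -> shared embedding space -> nearest neighbors -> optional rerank."""
    q = encode_text(query)
    # Nearest-neighbor retrieval by similarity in the shared space.
    ranked = sorted(image_index, key=lambda i: dot(q, image_index[i]), reverse=True)
    top_k = ranked[:k]
    # The reranker is optional and only reorders the candidates it is given.
    if rerank is not None:
        top_k = rerank(query, top_k)
    return top_k
```

Keeping the reranker as an optional argument makes the "text-only retrieval vs text + image reranking" comparison a one-argument change.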
Required experiments
| Experiment | What to measure |
|---|---|
| CLIP model A vs model B | precision@5 and latency |
| Short query vs descriptive query | retrieval quality |
| Text-only retrieval vs text + image reranking | quality change |
| Normal image vs defect image prompting | hallucination / miss rate |
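The retrieval rows of the table reduce to the same measurement loop run under different settings. The sketch below shows one way to compute precision@k, recall@k, and mean latency. It assumes that gold_queries.json loads into a mapping from query text to a list of relevant image IDs, which is a format chosen here for illustration, and `search_fn` is a hypothetical callable that returns ranked image IDs.

```python
import time
from statistics import mean


def precision_at_k(retrieved, relevant, k=5):
    """Fraction of the top-k results that are relevant (always divides by k)."""
    relevant = set(relevant)
    return sum(1 for image_id in retrieved[:k] if image_id in relevant) / k


def recall_at_k(retrieved, relevant, k=5):
    """Fraction of all relevant images that appear in the top-k results."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def evaluate(search_fn, gold, k=5):
    """Run every gold query once and average the metrics.

    gold: dict mapping query text -> list of relevant image ids (assumed format).
    search_fn: hypothetical callable, query text -> ranked list of image ids.
    """
    precisions, recalls, latencies = [], [], []
    for query, relevant in gold.items():
        start = time.perf_counter()
        retrieved = search_fn(query)
        latencies.append(time.perf_counter() - start)
        precisions.append(precision_at_k(retrieved, relevant, k))
        recalls.append(recall_at_k(retrieved, relevant, k))
    return {
        f"precision@{k}": mean(precisions),
        f"recall@{k}": mean(recalls),
        "mean_latency_s": mean(latencies),
    }
```

Calling `evaluate` once per configuration (model A, model B, short queries, descriptive queries) yields rows that can be compared directly.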
Success criteria
- Retrieval returns semantically relevant images for real text queries
- Precision@5 is reported on a gold set
- Inspection pipeline identifies some visible defects correctly
- Failure analysis is specific, not vague
- You can explain why the system failed on at least five examples
Common pitfalls
- Treating cosine similarity as magic without inspecting nearest neighbors
- Using only easy, centered images; that hides real failure modes
- Assuming the VLM “saw everything” without zoom/crop experiments
- Ignoring OCR, tiny objects, rotated views, or partial occlusion
- Reporting accuracy without showing failure slices
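The last pitfall can be avoided by breaking inspection results down by slice instead of reporting one number. The sketch below assumes each result record carries a hand-assigned `slice` tag (for example "tiny_object" or "rotated"), the gold label, and a boolean parsed from the model's answer. These field names are hypothetical. Here a miss means a real defect the model did not report, and a hallucination means a defect reported on a clean image.

```python
from collections import defaultdict


def slice_report(records):
    """Per-slice miss and hallucination rates.

    records: dicts with 'slice' (str), 'gold_defect' (bool), 'pred_defect' (bool).
    A rate is None when the slice has no images of the relevant kind.
    """
    counts = defaultdict(
        lambda: {"defect": 0, "miss": 0, "normal": 0, "hallucination": 0})
    for rec in records:
        c = counts[rec["slice"]]
        if rec["gold_defect"]:
            c["defect"] += 1
            if not rec["pred_defect"]:
                c["miss"] += 1           # real defect, model said nothing
        else:
            c["normal"] += 1
            if rec["pred_defect"]:
                c["hallucination"] += 1  # clean image, model reported a defect
    report = {}
    for name, c in counts.items():
        report[name] = {
            "n": c["defect"] + c["normal"],
            "miss_rate": c["miss"] / c["defect"] if c["defect"] else None,
            "hallucination_rate":
                c["hallucination"] / c["normal"] if c["normal"] else None,
        }
    return report
```

A slice with only a handful of images gives a noisy rate, so reporting `n` next to each rate keeps the table honest.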
Suggested stretch goals
- Add a simple web UI with thumbnail results
- Add crop-based inspection for tiny defects
- Compare CLIP retrieval against a caption-first baseline
- Add hybrid ranking: embedding score + metadata filter
- Add “I am uncertain” prompting for the inspection model
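Two of these stretch goals are small enough to sketch. The first function generates overlapping crop boxes for tile-by-tile inspection, in the (left, upper, right, lower) order that Pillow's `Image.crop` accepts. The second applies a metadata filter before ranking by embedding score. The function names, the default tile size, and the tag-based filter are illustrative assumptions, not requirements.

```python
def tile_boxes(width, height, tile=512, overlap=64):
    """Return (left, upper, right, lower) boxes that cover the whole image.

    Neighboring tiles overlap so a small defect on a tile border is still
    fully visible in at least one crop.
    """
    if tile <= overlap:
        raise ValueError("tile must be larger than overlap")
    step = tile - overlap

    def starts(size):
        if size <= tile:
            return [0]
        positions = list(range(0, size - tile, step))
        positions.append(size - tile)  # last tile sits flush with the edge
        return positions

    return [
        (x, y, min(x + tile, width), min(y + tile, height))
        for y in starts(height)
        for x in starts(width)
    ]


def hybrid_rank(candidates, required_tags, k=5):
    """Keep candidates that carry every required tag, then rank by score.

    candidates: iterable of (image_id, embedding_score, tags) tuples.
    """
    required = set(required_tags)
    kept = [c for c in candidates if required <= set(c[2])]
    kept.sort(key=lambda c: c[1], reverse=True)
    return [c[0] for c in kept[:k]]
```

Running the inspection prompt once per tile multiplies API calls, so it is worth recording how many tiles each image produced alongside the results.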
What to show in the writeup
- Dataset choice and why it matters
- Retrieval architecture diagram
- Example good queries and bad queries
- Three inspection failures with screenshots
- One lesson about grounding versus language fluency
- One lesson that will matter for diffusion in Module 14
Writeup questions
- Which errors came from the vision encoder, and which from the language side?
- Did cropping or zooming change inspection results materially?
- Which queries exposed shared-embedding strengths best?
- Which inspection cases produced confident but wrong answers?
- What would you change before shipping this into production?
LinkedIn post template
"This week I built a multimodal retrieval system with CLIP and then stress-tested a vision-language model on small visual defects.
The biggest insight: a model sounding confident about an image is not the same as a model being grounded in the right pixels.
Three things I learned:
1. [your insight]
2. [your insight]
3. [your insight]
Repo/demo: [link]"
Why this hands_on_lab matters
Search, cataloging, moderation, inspection, and support are all going multimodal. A Lead AI Engineer must know two things simultaneously:
- how to make these systems useful
- how to explain their failure modes before users discover them the hard way