03. CLIP and shared meaning: pull images and words into one coordinate system

~15 min read. The trick that makes a picture and a sentence land near each other without hand-built labels.

Built on the ELI5 in 00-eli5.md. The translator (the vision-language bridge, the part that turns visual numbers into words) starts here by learning a shared space where matching images and texts sit close together.

1) The big picture: two encoders, one meeting place

CLIP uses two encoders: one image encoder and one text encoder. The image side can be a CNN or a ViT. The text side is a transformer over text tokens. Both sides output embeddings of the same dimension, so an image vector and a text vector can be compared directly. Then training pulls matched image-text pairs together and pushes mismatched pairs apart.
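One way to picture "same dimension": each encoder's raw features can have a different size, and a linear projection maps each side into the shared space. The sketch below assumes tiny made-up feature sizes and weights; `project`, `W_image`, and `W_text` are illustrative names, not part of any CLIP library.

```python
def project(features, weights):
    """Multiply a feature vector by a (shared_dim x feature_dim) matrix."""
    return [sum(w * x for w, x in zip(row, features)) for row in weights]

# Hypothetical encoder outputs with different sizes (4-d image, 3-d text).
image_features = [0.5, 1.0, -0.2, 0.3]
text_features = [0.7, -0.1, 0.4]

# Made-up projection weights; in a trained model these would be learned.
W_image = [[0.1, 0.2, 0.0, 0.3],
           [0.4, -0.1, 0.2, 0.0]]
W_text = [[0.3, 0.1, 0.2],
          [0.0, 0.5, -0.2]]

u = project(image_features, W_image)  # image vector in the shared 2-d space
v = project(text_features, W_text)    # text vector in the same 2-d space

print(len(u), len(v))
```

Once both vectors have the same length, any similarity measure between them is well defined, which is all the rest of the pipeline needs.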

So what is really happening? The eye (the image encoder) maps a picture into a vector. The text encoder maps a caption into another vector. Alignment is learned through shared geometry, not through sentence generation. That point matters. CLIP does not explain the picture in free-form language. It learns where the picture should sit relative to text.

The training scale is huge. CLIP is trained on roughly 400M image-text pairs. That gives the model many weak but useful associations. A dog photo appears near dog-like captions. A sneaker photo appears near sneaker-like captions. A chart image appears near chart-like captions. Simple, no? The model learns a geometry of agreement.

2) Architecture as a matching system

Look at the flow. Each encoder works separately at first. They only meet in the shared embedding space. That design makes retrieval easy. It also makes training symmetrical.

image x                            text t
   │                                  │
   ▼                                  ▼
┌──────────────┐                ┌──────────────┐
│ image encoder│                │ text encoder │
│ vision tower │                │ word model   │
└──────┬───────┘                └──────┬───────┘
       │                               │
       ▼                               ▼
   image vector u                  text vector v
        └──────────────┬──────────────┘
                       ▼
             shared embedding space
                       │
            matched pairs close
            mismatched pairs far

Usually both vectors are normalized. Then similarity is measured with cosine similarity or a scaled dot product. High similarity means the image and text likely match. Low similarity means they probably do not.
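The normalize-then-compare step can be sketched in a few lines. This is a minimal illustration with invented 3-dimensional vectors; real embeddings are much larger, and the helper names are assumptions made for this example.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length (assumes it is not all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def cosine_similarity(u, v):
    """Cosine similarity, computed as the dot product of unit vectors."""
    u_hat, v_hat = l2_normalize(u), l2_normalize(v)
    return sum(a * b for a, b in zip(u_hat, v_hat))

# Toy embeddings, invented for illustration only.
image_vec = [0.9, 0.1, 0.4]
matching_text_vec = [0.8, 0.2, 0.5]
unrelated_text_vec = [-0.3, 0.9, -0.1]

sim_match = cosine_similarity(image_vec, matching_text_vec)
sim_mismatch = cosine_similarity(image_vec, unrelated_text_vec)
print(sim_match > sim_mismatch)
```

After normalization the dot product and the cosine similarity are the same number, which is why the two are used interchangeably here.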

See why this is powerful. You can precompute image embeddings for a product catalog. Then encode a query like "red running shoes". Nearest vectors become the results. That is the alignment trick in action. No hand-written rule table is needed.
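The catalog search described above might look like the sketch below. The file names and vectors are hypothetical stand-ins: in practice the catalog vectors would come from the image encoder and the query vector from the text encoder.

```python
import math

def l2_normalize(vec):
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

# Hypothetical precomputed, normalized image embeddings for a small catalog.
catalog = {
    "red_running_shoe.jpg": l2_normalize([0.9, 0.8, 0.1]),
    "blue_sandal.jpg": l2_normalize([0.2, 0.7, 0.6]),
    "desk_lamp.jpg": l2_normalize([0.1, -0.2, 0.9]),
}

def search(query_vec, catalog, top_k=2):
    """Return the top_k catalog items ranked by cosine similarity to the query."""
    q = l2_normalize(query_vec)
    scored = [(dot(q, emb), name) for name, emb in catalog.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [name for _, name in scored[:top_k]]

# Stand-in for the text-encoder output of "red running shoes".
query_vec = [0.85, 0.75, 0.05]
results = search(query_vec, catalog)
print(results)
```

The image side is encoded once and stored; only the short text query is encoded at search time. A large catalog would typically replace the linear scan with an approximate nearest-neighbor index.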

3) Worked example: batch of 4 image-text pairs

Now let us make the contrastive loss concrete. Suppose one training batch has four matched pairs:

1. image of a cat ↔ text "a cat on a sofa"
2. image of a pizza ↔ text "a cheese pizza"
3. image of a bicycle ↔ text "a parked bicycle"
4. image of a receipt ↔ text "a grocery receipt"

After encoding and normalizing, compute similarities between every image and every text. Assume we get this matrix:

                 text1   text2   text3   text4
              ┌────────────────────────────────┐
image1 cat    │  0.82    0.11    0.18    0.05 │
image2 pizza  │  0.09    0.77    0.14    0.12 │
image3 cycle  │  0.15    0.20    0.80    0.19 │
image4 receipt│  0.07    0.16    0.13    0.86 │
              └────────────────────────────────┘
                     ▲       ▲       ▲       ▲
                     target diagonal for matched pairs

The diagonal entries are the correct matches. Those are the targets. Everything off the diagonal is a mismatch. Training wants diagonal values high and off-diagonal values lower.

Now focus on image2. Its similarities are: [0.09, 0.77, 0.14, 0.12]. The correct text is text2. So the model should place most probability on column 2. Apply softmax over that row. For simplicity this example uses the raw similarities; CLIP itself first multiplies them by a learned temperature scale, which makes the distribution much sharper.

First compute exponentials approximately:

- e^0.09 ≈ 1.094
- e^0.77 ≈ 2.160
- e^0.14 ≈ 1.150
- e^0.12 ≈ 1.128

Row sum: 1.094 + 2.160 + 1.150 + 1.128 = 5.532

Probability of the correct match: 2.160 ÷ 5.532 ≈ 0.3905

Row loss for image2 (natural log): -log(0.3905) ≈ 0.940
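The arithmetic for the image2 row can be checked with a short script. This is a minimal sketch that applies softmax directly to the similarities from the matrix, as the worked example does; the function and variable names are chosen for this illustration.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

row_image2 = [0.09, 0.77, 0.14, 0.12]  # similarities of image2 to text1..text4
correct_index = 1                      # text2 is the matching caption

probs = softmax(row_image2)
p_correct = probs[correct_index]
row_loss = -math.log(p_correct)        # cross-entropy for this one row

print(round(p_correct, 4), round(row_loss, 3))
```

Both values agree with the hand calculation to about three decimal places; small differences come from rounding the exponentials by hand.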

Do the same from the text side too. That symmetry matters. Text2 should also prefer image2 over the other images. So contrastive learning is not just "pull one pair together". It is "rank the right partner above all wrong partners in the batch". Yes? That is why bigger and more diverse batches help.
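The two-sided version of the loss can be sketched over the full 4 × 4 matrix: cross-entropy along each row (image to text), cross-entropy along each column (text to image), then the average of the two. The `logit_scale` parameter is an assumption added for illustration; it stands in for a temperature scale, and 1.0 reproduces the unscaled arithmetic of the worked example.

```python
import math

def cross_entropy_rows(matrix, logit_scale=1.0):
    """Mean of -log softmax(row)[i], where the target for row i is column i."""
    losses = []
    for i, row in enumerate(matrix):
        scaled = [logit_scale * s for s in row]
        m = max(scaled)
        log_sum = m + math.log(sum(math.exp(s - m) for s in scaled))
        losses.append(log_sum - scaled[i])
    return sum(losses) / len(losses)

def symmetric_contrastive_loss(sim, logit_scale=1.0):
    """Average of the image-to-text and text-to-image cross-entropies."""
    transposed = [list(col) for col in zip(*sim)]
    image_to_text = cross_entropy_rows(sim, logit_scale)
    text_to_image = cross_entropy_rows(transposed, logit_scale)
    return 0.5 * (image_to_text + text_to_image)

# Similarity matrix from the worked example (rows: images, columns: texts).
sim = [
    [0.82, 0.11, 0.18, 0.05],
    [0.09, 0.77, 0.14, 0.12],
    [0.15, 0.20, 0.80, 0.19],
    [0.07, 0.16, 0.13, 0.86],
]

loss_raw = symmetric_contrastive_loss(sim)
loss_sharp = symmetric_contrastive_loss(sim, logit_scale=10.0)
print(loss_raw > loss_sharp)
```

With a larger scale the same similarities produce a much lower loss, because the softmax concentrates more probability on the diagonal. Adding more rows and columns adds more wrong partners to every softmax, which is the ranking pressure that larger batches provide.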

4) Zero-shot classification is just prompt matching

Now comes the famous trick. Once images and text live in one shared space, you can classify without training a task-specific head. How? Encode the image once. Then encode several text prompts that describe candidate classes. Choose the text with highest cosine similarity.

Suppose the image is a photo of a golden retriever. Candidate prompts are:

- "a photo of a dog"
- "a photo of a cat"
- "a photo of a bicycle"

Assume cosine similarities come out as:

- sim(image, dog) = 0.83
- sim(image, cat) = 0.41
- sim(image, bicycle) = 0.09

So the model predicts "dog". No extra classifier head was trained for this tiny task. That is zero-shot classification.
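The decision rule is just an argmax over prompt similarities. The sketch below uses the assumed similarity values from this example rather than real encoder outputs; `zero_shot_predict` and `logit_scale` are illustrative names, and the softmax is optional (it only turns the scores into probabilities).

```python
import math

prompts = ["a photo of a dog", "a photo of a cat", "a photo of a bicycle"]
similarities = [0.83, 0.41, 0.09]  # assumed cosine similarities from the text

def zero_shot_predict(prompts, sims, logit_scale=1.0):
    """Pick the prompt with the highest similarity; also return softmax probabilities."""
    scaled = [logit_scale * s for s in sims]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(prompts)), key=lambda i: sims[i])
    return prompts[best], probs

label, probs = zero_shot_predict(prompts, similarities)
print(label)
```

Changing the task means changing the list of prompts, not retraining anything: swapping in three new class descriptions gives a new classifier immediately.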

Look at what happened. The eye produced a visual embedding. The text encoder produced prompt embeddings. Then the translator aligned them inside one metric space. Prediction became a nearest-neighbor comparison between the image and the prompts. Very elegant.

5) What CLIP gives, and what it still cannot do

CLIP is excellent at alignment. It is great for search, retrieval, ranking, filtering, and zero-shot tagging. It is often used to score whether an image and a phrase belong together. It is also useful inside bigger multimodal systems.

But now what is the limitation? CLIP does not generate an explanation sentence token by token. It does not answer follow-up questions by itself. It does not reason in long dialogue. It places images and text near each other. That is all. A very valuable all, but still all.

So the translator here is partial. It builds shared semantics. It does not yet speak fluently. To get a real assistant, we must connect visual features to a language model that can decode text. That is the next topic. And yes, the eye stays important because the language model is only as grounded as the visual embedding it receives.

Where this lives in the wild

OpenAI CLIP image-text retrieval — research engineer: ranks captions and images in one embedding space for zero-shot recognition and search.
Pinterest visual discovery — recommendation engineer: matches product-like images to textual style queries such as "mid-century lamp" or "linen saree".
Shopify catalog search — commerce ML engineer: aligns merchant photos with text queries so product retrieval works even when metadata is messy.
Adobe Stock search relevance — search engineer: scores image-caption alignment to improve prompt-based asset discovery.
Instacart grocery understanding — applied scientist: matches product packshots with text labels and shelf queries for retail retrieval pipelines.

Pause and recall

Why does CLIP need two separate encoders instead of one model that only reads images?
In the 4 × 4 similarity matrix, why is the diagonal the training target?
For image2 in the worked example, what was the probability assigned to the correct text after softmax?
Why can CLIP do zero-shot classification with class prompts?

Interview Q&A

Q: Why does CLIP use a contrastive objective instead of directly generating the caption token by token?
A: Because contrastive alignment is enough to learn shared semantics for retrieval and zero-shot recognition without requiring an autoregressive decoder.
Common wrong answer to avoid: "Because generation and alignment are basically the same task."

Q: Why are large in-batch negatives valuable in CLIP training?
A: Because every non-matching text or image in the batch becomes a comparison target, sharpening the ranking pressure around the correct pair.
Common wrong answer to avoid: "Negatives matter only when the positive pair is mislabeled."

Q: Why is cosine similarity a natural choice after embedding normalization?
A: Because it measures directional agreement in the shared space, which is what matters once magnitude has been normalized away.
Common wrong answer to avoid: "Because cosine similarity preserves absolute scale better than dot product."

Q: Why is CLIP useful for zero-shot classification but insufficient for full visual question answering by itself?
A: Because it aligns images and texts in one space, but it does not decode grounded multi-step language responses on its own.
Common wrong answer to avoid: "Because zero-shot classification is harder than answering open-ended questions."

Apply now (5 min)

Quick exercise. Write down four image labels and four matching captions of your own. Make a 4 × 4 grid and invent similarity scores where the diagonal is strongest. Then pick one row and compute the softmax probability of the correct caption.

Sketch from memory the dual-encoder diagram with one shared embedding space. Under the sketch, write one line on why CLIP aligns meaning but still does not generate an answer sentence.

Bridge. CLIP aligns images and text in one space. But it cannot generate language. For that, the translator must connect to an LLM. → 04-vision-language-models.md