06. Similarity and models — how the librarian measures closeness

~12 min read. Two chunks point in the same direction but one is twice as long. Cosine says they are identical twins. Dot product says one wins by a mile. Which answer is right? By the end of this page, you will know exactly when each one is right.

Builds on 05-embeddings.md. The index card from the ELI5 is a vector. This page is about the ruler that the librarian uses to measure card-to-card distance on the meaning map.

1) The hook — when the metric flips the winner

Three small index cards. Same library. Same shelf.

Card A:  [ 0.8 ,  0.6 ]      length ≈ 1.0
Card B:  [ 1.6 ,  1.2 ]      length ≈ 2.0   (same direction as A, twice as long)
Card C:  [ 0.6 ,  0.8 ]      length ≈ 1.0   (slightly different direction)
Query q: [ 0.9 ,  0.4 ]      length ≈ 0.98

Now look at the same three cards through two different rulers.

metric         A vs q     B vs q     C vs q     winner
─────────────────────────────────────────────────────────
cosine          0.975      0.975      0.873      A == B   (tie!)
dot product     0.96       1.92       0.86       B wins big
L2 distance     0.22       1.06       0.50       A wins

Three rulers. Three different verdicts: a tie, a win for B, a win for A.

This is not a trick. This is the whole module. The metric you pick decides which chunk reaches the reading desk. Cosine ignores how long the card is. Dot product punishes short cards. L2 distance punishes far cards. Same library, three different librarians, three different answers.

So the question is not "which metric is best." The question is: what does my embedding model produce, and which ruler reads that output correctly?

2) The mental model — arrows on a meaning map

Recall the ELI5. Each chunk is an index card holding a meaning-signature. That signature is a vector. A vector is an arrow.

Two things about an arrow.

Direction — what the card is about.
Length (magnitude) — how strongly the model thinks so, or just an artefact of training.

Cosine reads direction only. Dot product reads direction times length. L2 reads the straight-line gap between the two arrow tips, so direction and length both count.

Picture in 2D. Same query, three chunks.

                 ▲ y
                 │
                 │           ◀── c2 (wide angle, longer)
                 │          ╱
                 │         ╱
                 │    ◀───╱── q  (the query arrow)
                 │       ╱
                 │      ╱  c1 (tight angle, short)
─────────────────┼──────────────────▶ x
                 │
                 │   ╲
                 │    ╲    c3 (opposite-ish direction)
                 ▼

c1 points almost where q points → high cosine. c2 points wider but is longer → high dot product, lower cosine. c3 points away → low on everything, ignored.

Hold this picture. We will come back to it.

3) The three rulers, side by side

Cosine similarity

cos(q, d) = (q · d) / (||q|| × ||d||)

Top: the dot product. Bottom: the product of the two lengths. The division strips out length. Only orientation survives.

Range: -1 to 1. 1 means perfectly aligned. 0 means orthogonal (unrelated in that space). -1 means opposite.

For text retrieval, real scores usually sit between 0.2 and 0.9. Negative cosine on text embeddings is rare and usually meaningful — it says these meanings actively disagree.
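A minimal sketch of the formula above in plain Python, assuming vectors are ordinary lists of floats. The function name cosine and the toy vectors are illustrative, not taken from any library.

```python
import math

def cosine(q, d):
    """Cosine similarity: dot product divided by the product of the two norms."""
    dot = sum(qi * di for qi, di in zip(q, d))
    norm_q = math.sqrt(sum(qi * qi for qi in q))
    norm_d = math.sqrt(sum(di * di for di in d))
    if norm_q == 0.0 or norm_d == 0.0:
        # A zero-length arrow has no direction, so the score is undefined.
        raise ValueError("cosine is undefined for a zero-length vector")
    return dot / (norm_q * norm_d)

aligned = cosine([1.0, 2.0], [2.0, 4.0])      # same direction, different length
orthogonal = cosine([1.0, 0.0], [0.0, 3.0])   # right angle
opposite = cosine([1.0, 2.0], [-1.0, -2.0])   # pointing the other way

print(f"aligned={aligned:.3f} orthogonal={orthogonal:.3f} opposite={opposite:.3f}")
```

The three calls land at the three ends of the range described above: close to 1, 0, and close to -1.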

Dot product

q · d = Σ(qᵢ × dᵢ)

No division. Length flows straight through. A vector with norm 2.0 will out-score an identically-aligned vector with norm 1.0. That can be a feature or a bug.

When is length a feature? Some embedding models bake confidence or information density into the norm. A short, generic chunk gets a small norm. A rich, specific chunk gets a larger norm. Dot product then rewards the richer chunk — which is often what you want.

When is length a bug? Most modern models do not encode useful information in length. The norm is noise from training. Then dot product introduces magnitude bias, and cosine is safer.
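To see magnitude bias in action, here is a small sketch with two invented chunks. One is nearly aligned with the query but short. The other is 45 degrees off but long. The names on_topic and long_off are made up for this illustration.

```python
import math

def dot(q, d):
    """Plain dot product: length flows straight through."""
    return sum(qi * di for qi, di in zip(q, d))

def cosine(q, d):
    """Dot product with both lengths divided out."""
    return dot(q, d) / (math.sqrt(dot(q, q)) * math.sqrt(dot(d, d)))

q = [1.0, 0.0]
chunks = {
    "on_topic": [0.9, 0.1],   # almost the same direction as q, norm about 0.91
    "long_off": [2.0, 2.0],   # 45 degrees away from q, norm about 2.83
}

best_by_dot = max(chunks, key=lambda name: dot(q, chunks[name]))
best_by_cosine = max(chunks, key=lambda name: cosine(q, chunks[name]))

print("dot product picks:", best_by_dot)
print("cosine picks:     ", best_by_cosine)
```

Dot product picks the long, off-direction chunk. Cosine picks the aligned one. Whether that is a feature or a bug depends on whether the model put meaning into the norm.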

Euclidean distance (L2)

L2(q, d) = sqrt(Σ(qᵢ - dᵢ)²)

Smaller is better here. Not bigger. L2 reads the straight-line gap between the two arrow tips.

L2 is well-defined and supported everywhere, but it conflates direction and magnitude. For text, where direction usually carries the meaning signal, L2 is rarely the first choice. For image embeddings, it sometimes wins.

Mini-FAQ — Cosine vs dot product, when does it matter? It matters whenever your embeddings are not normalized. With unnormalized vectors, dot product can be dominated by long vectors. With normalized vectors, cosine and dot product give the same ranking (see section 5). Most production teams just normalize once at index time and stop worrying.

4) The worked example — three cards, full math

Same three cards from section 1. We compute every score by hand.

A = [0.8, 0.6]      B = [1.6, 1.2]      C = [0.6, 0.8]
q = [0.9, 0.4]

Dot products.

q · A = 0.9×0.8 + 0.4×0.6 = 0.72 + 0.24 = 0.96
q · B = 0.9×1.6 + 0.4×1.2 = 1.44 + 0.48 = 1.92
q · C = 0.9×0.6 + 0.4×0.8 = 0.54 + 0.32 = 0.86

Norms.

||q|| = sqrt(0.81 + 0.16) = sqrt(0.97) ≈ 0.985
||A|| = sqrt(0.64 + 0.36) = 1.0
||B|| = sqrt(2.56 + 1.44) = sqrt(4.0) = 2.0
||C|| = sqrt(0.36 + 0.64) = 1.0

Cosines.

cos(q, A) = 0.96 / (0.985 × 1.0) ≈ 0.975
cos(q, B) = 1.92 / (0.985 × 2.0) ≈ 0.975
cos(q, C) = 0.86 / (0.985 × 1.0) ≈ 0.873

A and B tie under cosine. They have to. Same direction, different lengths only.

L2 distances.

L2(q, A) = sqrt((0.9-0.8)² + (0.4-0.6)²) = sqrt(0.01 + 0.04) ≈ 0.224
L2(q, B) = sqrt((0.9-1.6)² + (0.4-1.2)²) = sqrt(0.49 + 0.64) ≈ 1.063
L2(q, C) = sqrt((0.9-0.6)² + (0.4-0.8)²) = sqrt(0.09 + 0.16) ≈ 0.500

Final ranking under each ruler:

cosine:        A = B > C
dot product:   B > A > C
L2:            A > C > B   (ranked closest first; smaller distance is better)

Three rulers, three rankings, same data. This is why the metric is not a footnote.
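The hand calculation above can be checked with a short script. This is a sketch in plain Python using the same four vectors. The helper names dot, norm, cosine, and l2 are chosen here for readability.

```python
import math

cards = {"A": [0.8, 0.6], "B": [1.6, 1.2], "C": [0.6, 0.8]}
q = [0.9, 0.4]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def norm(v):
    return math.sqrt(dot(v, v))

def cosine(u, v):
    return dot(u, v) / (norm(u) * norm(v))

def l2(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

for name, vec in cards.items():
    print(f"{name}: dot={dot(q, vec):.2f}  cos={cosine(q, vec):.3f}  l2={l2(q, vec):.3f}")

# Similarities rank highest first; distance ranks smallest first.
by_dot = sorted(cards, key=lambda n: dot(q, cards[n]), reverse=True)
by_l2 = sorted(cards, key=lambda n: l2(q, cards[n]))
print("dot ranking:", by_dot)
print("L2 ranking: ", by_l2)
```

Cosine is left out of the sorted rankings on purpose. A and B tie, and floating-point rounding would decide the order between them, which says nothing about the data.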

5) Why normalization collapses cosine and dot product

If every vector has length exactly 1, the math simplifies brutally.

||q|| = 1
||d|| = 1
cos(q, d) = (q · d) / (1 × 1) = q · d

Cosine is dot product when vectors are unit-length. Not approximately. Exactly.

This is why production stacks love normalization. Normalize once at index time. Normalize the query once per request. Now you can use the cheaper dot product in the inner loop, but you get cosine semantics for ranking. Best of both worlds.

normalize(v) = v / ||v||

[1.6, 1.2]   length 2.0   ──normalize──▶   [0.8, 0.6]   length 1.0

Cost of normalization: one pass over the vector to sum the squares, one square root, and one division per component, done once at index time. Negligible compared to running the embedding model itself.

Mini-FAQ — What is normalization and why does it collapse the two? Normalization rescales each vector to length 1 while keeping its direction. Once both vectors have length 1, the denominator in the cosine formula becomes 1, so cosine equals dot product. Faster compute, identical ranking.
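A quick sketch of the collapse, using the three cards from section 4. It normalizes every vector once, then compares a plain dot product on the unit vectors against a full cosine on the raw vectors. Helper names are illustrative.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def normalize(v):
    """Rescale v to length 1 while keeping its direction."""
    length = math.sqrt(dot(v, v))
    if length == 0.0:
        raise ValueError("cannot normalize a zero-length vector")
    return [x / length for x in v]

def cosine(u, v):
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))

raw_cards = {"A": [0.8, 0.6], "B": [1.6, 1.2], "C": [0.6, 0.8]}
raw_q = [0.9, 0.4]

# Index time: normalize every card once. Query time: normalize the query once.
unit_cards = {name: normalize(vec) for name, vec in raw_cards.items()}
unit_q = normalize(raw_q)

for name in raw_cards:
    plain_dot = dot(unit_q, unit_cards[name])    # cheap inner-loop score
    true_cos = cosine(raw_q, raw_cards[name])    # cosine on the raw vectors
    print(f"{name}: dot on unit vectors={plain_dot:.3f}  cosine on raw={true_cos:.3f}")
```

The two columns match up to floating-point rounding, and the normalized B lands on top of A.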

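Many modern text embedding models return unit-norm vectors by default, or are normally loaded with a normalization step. OpenAI's text-embedding-3-small, Cohere's Embed v3, and most BGE checkpoints are the usual examples, but the behaviour depends on the checkpoint and on how you call it, so compute the norm of a few output vectors before relying on it. Once that check passes, "cosine vs dot product" is a non-debate: pick whichever your vector store prefers in its hot path.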

6) Asymmetric vs symmetric models — the encoder shape

A new failure mode. The metric is right. The math is clean. Retrieval is still terrible. Why?

Because there are two kinds of embedding models, and people mix them up.

Symmetric encoders

One encoder. Both query and document go through the same network. Both come out the same shape, same flavour. Used when both sides look alike. Examples — sentence similarity, duplicate detection, clustering.

"refund policy"    ──┐
                     ├──▶  one encoder  ──▶  vectors
"refund process"   ──┘

Asymmetric encoders

Two distinct flavours — query side and passage side. The model is trained on pairs where one is a short question and the other is a long answer paragraph. Often a single model, but with prefixes or role tokens that flip which side is being encoded.

"query: refund policy"        ──▶  query encoder side  ──▶  vector
"passage: Enterprise annual   ──▶  passage encoder    ──▶  vector
 plans may request..."             side

Skip the prefix and retrieval quietly degrades. No error. No warning. Just worse top-k.

Real model families that use this trick:

e5 family (intfloat/e5-base-v2, multilingual-e5-large): prefix queries with query:, passages with passage:.
BGE family (BAAI/bge-large-en): query prefix recommended, often an instruction like Represent this sentence for searching relevant passages:.
Nomic Embed (nomic-ai/nomic-embed-text-v1.5): uses search_query: and search_document:.
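Jina embeddings v3: task-conditioned encoding (retrieval query, retrieval passage, separation, classification), selected through a task option rather than a prefix you type yourself. Check the model card for the exact option names.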

Mini-FAQ — What's an asymmetric encoder? A model where queries and passages are encoded differently — by separate paths or by the same path with a routing prefix. The training objective treats them as different roles. You must tell the model which role each input plays.
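One way to make the prefix hard to forget is to route every string through a single helper. The sketch below is hypothetical: the function with_role_prefix, the table ROLE_PREFIXES, and the trailing space after each colon are assumptions for illustration. The prefix words themselves come from the conventions listed above. Confirm the exact strings against the model card before indexing.

```python
# Prefix strings follow the conventions listed on this page. The trailing space
# and this table layout are assumptions; verify them on the model card.
ROLE_PREFIXES = {
    "e5": {"query": "query: ", "passage": "passage: "},
    "nomic": {"query": "search_query: ", "passage": "search_document: "},
}

def with_role_prefix(text, role, family):
    """Return text with the prefix that tells the model which side it is encoding."""
    try:
        prefix = ROLE_PREFIXES[family][role]
    except KeyError:
        # Fail loudly: a missing prefix would otherwise degrade retrieval silently.
        raise ValueError(f"no prefix known for family={family!r}, role={role!r}")
    return prefix + text

asked = with_role_prefix("refund policy", "query", "e5")
indexed = with_role_prefix("Enterprise annual plans may request...", "passage", "e5")

print(asked)
print(indexed)
```

Raising on an unknown family or role is deliberate. The failure described above is silent, so the helper turns it into a loud one.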

7) ColBERT and late interaction — when one vector isn't enough

One vector per chunk is fast but blunt. A 500-token chunk gets squashed into a single 768-dim point. Nuance dies.

ColBERT keeps one vector per token instead. A chunk of 500 tokens becomes 500 small vectors. The query also stays as a bag of token vectors. At query time, ColBERT computes a MaxSim score — for each query token, find its best-matching document token, take the cosine, sum them up.

query tokens:     [q1, q2, q3]            (3 vectors)
doc tokens:       [d1, d2, d3, d4, d5]    (5 vectors)

For each qᵢ, find max cos(qᵢ, dⱼ):
   q1 ▶ best match d3 ▶ 0.81
   q2 ▶ best match d5 ▶ 0.74
   q3 ▶ best match d1 ▶ 0.69
                            ──────
   total MaxSim score        2.24

This is called late interaction. The query and doc tokens never fuse into one vector — they only "meet" at scoring time, late in the pipeline.
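The MaxSim rule described above fits in a few lines. This is a toy sketch in plain Python, not the ColBERT implementation: the 2D token vectors are invented, and a real system would use learned token embeddings and a dedicated index.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def maxsim(query_tokens, doc_tokens):
    """Sum, over query tokens, of the best cosine against any document token."""
    return sum(
        max(cosine(q_tok, d_tok) for d_tok in doc_tokens)
        for q_tok in query_tokens
    )

# Toy 2D token vectors, invented for illustration only.
query_tokens = [[1.0, 0.0], [0.0, 1.0]]
doc_tokens = [[1.0, 1.0], [0.0, 2.0]]

score = maxsim(query_tokens, doc_tokens)
print(f"MaxSim score: {score:.3f}")
```

Each query token picks its own best partner in the document, so two query tokens can match two different document tokens. That is the token-level precision a single pooled vector cannot give.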

Cost: storage explodes (one chunk now stores N vectors instead of 1). A 10-million chunk corpus with 200-token chunks blows up from about 30 GB to about 6 TB, assuming 768-dim float32 vectors on both sides. Worth it when precision matters more than storage.

In production: Vespa ships native ColBERT support. RAGatouille wraps ColBERTv2 for Python pipelines. JaColBERT is the Japanese-tuned variant. PLAID is the optimized index format used by ColBERTv2.

Cross-encoders — previewed, full coverage in module 10

Bi-encoder: query and doc each get their own vector. Compared by cosine or dot product. Fast at scale. This page's whole topic.

Cross-encoder: query and doc are concatenated and fed together into a small transformer that outputs a single relevance score. No vector at all. Slow per pair, very sharp.

bi-encoder:    [query] ──▶ vec_q
               [doc]   ──▶ vec_d        score = cos(vec_q, vec_d)

cross-encoder: [query | doc] ──▶ transformer ──▶ scalar relevance

Cross-encoders cannot be precomputed — every query-doc pair must be scored fresh. So they sit after retrieval, on the top-20 to top-100 candidates, as the rerank stage. We cover this fully in 10-reranking.md.
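The shape of that rerank stage can be sketched without any model. Below, toy_score is a stand-in for a cross-encoder and only counts shared words. The function names and sample texts are invented for illustration. A real pipeline would pass a model's pair-scoring function in its place.

```python
def rerank(query, candidates, score_pair, top_n=3):
    """Score each (query, doc) pair jointly and keep the best top_n docs."""
    scored = [(score_pair(query, doc), doc) for doc in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:top_n]]

def toy_score(query, doc):
    """Stand-in for a cross-encoder: fraction of query words present in doc."""
    q_words = set(query.lower().split())
    d_words = set(doc.lower().split())
    return len(q_words & d_words) / len(q_words)

# Pretend these came back from the bi-encoder retrieve stage.
candidates = [
    "shipping times for europe",
    "refund policy for enterprise annual plans",
    "how to request a refund",
]

top = rerank("enterprise refund policy", candidates, toy_score, top_n=2)
print(top)
```

The scorer sees the query and the document together, which is why nothing can be precomputed: the work is one call per pair, on a short candidate list only.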

8) Predict the cosine-vs-dot trap before reading on

Stop. Before reading section 9, answer these in your head.

Why do cosine and dot product give the same ranking for L2-normalized vectors?
Which metric punishes long vectors? Which metric rewards them?
Why does an e5 model give poor retrieval without query: and passage: prefixes?
In ColBERT, what does the "late" in late interaction refer to?

If you cannot answer all four, scroll back. The rest of the page assumes them.

9) Choosing the right embedding model

Metric is the ruler. Model is the cartographer that draws the map in the first place. A wrong model gives a perfect ruler over a broken map.

Five checks before picking one.

Dimension. Higher dims (1024, 1536, 3072) capture more nuance but cost more storage and ANN compute. 384-dim models (MiniLM) are tiny and fast — fine for small corpora.
Domain fit. General text embedders miss legal, biomedical, and code-specific patterns. For code, Voyage Code or Jina Code beats generic embedders. For biomedical, BGE-M3 or specialized clinical models help.
Language coverage. English-only models (all-MiniLM-L6-v2) collapse on Hindi or Chinese queries. Multilingual: multilingual-e5-large, BGE-M3, Cohere multilingual.
Cost and latency. OpenAI text-embedding-3-small runs ~$0.02 per 1M tokens. Self-hosted bge-small-en-v1.5 on a GPU can be free at the margin but you pay capacity.
Prefixes / instructions. Already covered. Check the model card before indexing.

Real model families you will actually encounter:

OpenAI — text-embedding-3-small (1536d, normalized), text-embedding-3-large (3072d, configurable dims via Matryoshka).
Cohere — embed-english-v3.0, embed-multilingual-v3.0 (1024d, compressible to int8/binary).
Voyage AI — voyage-3, voyage-code-3, voyage-finance-2, voyage-law-2.
BGE (BAAI) — bge-large-en-v1.5, bge-m3 (multilingual + multi-granularity).
e5 / multilingual-e5 — strong open asymmetric embedders.
Nomic Embed — open-weights, long-context (8192 tokens).
Jina Embeddings v3 — task-conditioned, long-context.
Sentence Transformers — the open-source backbone behind much of the above.
Google Gemini Embedding — text-embedding-004, multilingual.
Mistral Embed — 1024d, multilingual, normalized.

Pick the model that wins on your eval set, not on MTEB. Public benchmarks are averages; your corpus is specific.

10) Failure modes — three ways the metric fails silently

Magnitude bias. You used dot product on unnormalized vectors. A few long vectors dominate every top-k. Fix: normalize at index time, or switch to cosine.

Model mismatch. Query embedded with model A. Documents indexed with model B. Same dimension, totally different geometry. Top-k looks random. Fix: version-pin the embedder; log model name and version into the index metadata.

Asymmetry ignored. You used an e5 or BGE model without prefixes. Retrieval is mediocre, but nothing errors. Fix: read the model card; apply query:/passage: consistently at index and query time. This one is invisible until you A/B against the prefixed version.
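The model-mismatch fix can be enforced in code rather than by convention. This is a hypothetical sketch: the metadata keys embedder_name and embedder_version, the exception class, and the placeholder model name are all assumptions, not part of any vector store's API.

```python
class EmbedderMismatch(Exception):
    """Raised when the query embedder differs from the one that built the index."""

def check_embedder(index_metadata, query_model, query_version):
    """Refuse to search if the index was built with a different embedder."""
    indexed = (index_metadata.get("embedder_name"), index_metadata.get("embedder_version"))
    current = (query_model, query_version)
    if indexed != current:
        raise EmbedderMismatch(f"index built with {indexed}, query uses {current}")

# Metadata keys and version strings are hypothetical placeholders.
index_metadata = {"embedder_name": "example-embedder", "embedder_version": "v1"}

check_embedder(index_metadata, "example-embedder", "v1")   # same embedder: passes silently

try:
    check_embedder(index_metadata, "example-embedder", "v2")
    mismatch_caught = False
except EmbedderMismatch:
    mismatch_caught = True

print("mismatch caught:", mismatch_caught)
```

Running a check like this once per process start, or once per index load, is enough to turn random-looking top-k into an explicit error.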

11) Metric defaults across vector stores

Every vector store and retrieval system exposes some subset of these metrics. The defaults reveal the team's assumptions.

Pinecone — supports cosine, dot product, and Euclidean. Cosine is the documented default for text.
Weaviate — configurable distance: cosine, dot, L2-squared, Hamming, Manhattan.
Qdrant — cosine, dot, Euclidean, Manhattan. Cosine is the typical text default.
Milvus / Zilliz — supports L2, IP (inner product), cosine. Cosine for normalized text embeddings.
Vespa — cosine, dot product, Euclidean; native ColBERT and late-interaction support.
pgvector (Postgres extension) — operators <=> (cosine), <#> (negative dot), <-> (L2).
Elasticsearch / OpenSearch k-NN — cosine, dot product, L2; supports HNSW and IVF.
FAISS — IndexFlatIP, IndexFlatL2; cosine via pre-normalization plus IP.
Chroma — cosine, L2, IP; defaults to L2 unless overridden.
ScaNN (Google) — dot product is the canonical metric; cosine via normalization.
USearch — cosine, dot, L2, Haversine, and custom metrics.
LanceDB — cosine, L2, dot product over the Lance columnar format.
Marqo — cosine over normalized embeddings, managed pipeline.
Vald — supports cosine, L2, dot via underlying NGT engine.
Redis Vector Search — cosine, L2, inner product.
MongoDB Atlas Vector Search — cosine, dot product, Euclidean.
Azure AI Search — cosine, dot product, Euclidean; hybrid + semantic ranker layered on top.
Vertex AI Vector Search (Matching Engine) — dot product, cosine, L2.
Pinecone Assistant — managed RAG on top of Pinecone, cosine by default.
ColBERT in Vespa — production-grade late-interaction retrieval.
RAGatouille — Python wrapper around ColBERTv2 for late interaction.
JaColBERT — Japanese-tuned ColBERT used in production search.
Cohere Rerank — cross-encoder reranker that sits after a cosine-based retriever.
Jina Reranker — same role, different family.

The pattern: almost every text-retrieval stack ships cosine (or normalized dot product) as the default, exposes dot product and L2 as alternatives, and treats reranking as a separate later stage.

12) Numbers worth remembering

Cosine cost ≈ dot product cost + 2 norms + 1 division. For unit vectors, exactly equal to dot product.
Normalization at index time: one sum of squares, 1 sqrt, and one division per component, so O(d) work per vector. Negligible vs the embedding model forward pass (which is millions of FLOPs or more).
1536-dim float32 vector = 6.1 KB raw. 1M vectors = ~6.1 GB. Quantize to int8 → 1.5 GB. Quantize to binary → 192 MB.
ColBERT-style late interaction: ~200× storage blow-up at 200 tokens per chunk. Worth it only when precision matters more than disk.
e5 prefix bug (skipping query:/passage:): typically costs 3–7 points of NDCG@10. Invisible unless you measure.
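The storage figures above are plain arithmetic, so they are easy to recompute for your own corpus. A small sketch, assuming raw vector payload only (no index overhead, no metadata) and decimal gigabytes. The function name index_bytes is illustrative.

```python
def index_bytes(num_vectors, dims, bits_per_dim):
    """Raw vector payload in bytes, ignoring index overhead and metadata."""
    return num_vectors * dims * bits_per_dim // 8

GB = 10**9  # decimal gigabytes, to match the figures on this page

for label, bits in [("float32", 32), ("int8", 8), ("binary", 1)]:
    size = index_bytes(1_000_000, 1536, bits)
    print(f"1M x 1536d {label}: {size / GB:.3f} GB")

# Late interaction under the page's stated assumption:
# 768 dims, float32, 200 tokens per chunk, 10 million chunks.
single_vector = index_bytes(10_000_000, 768, 32)
per_token = index_bytes(10_000_000 * 200, 768, 32)
print(f"single-vector: {single_vector / GB:.1f} GB, per-token: {per_token / GB:.0f} GB")
```

Swap in your own vector count, dimension, and tokens per chunk to see where the storage line crosses your budget.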

13) Recall — eight questions on metric and model

What does cosine ignore that dot product preserves?
Under what condition are cosine and dot product the same?
Why is L2 distance less popular for text retrieval?
What does "asymmetric encoder" mean, and which prefix conventions have you seen?
In ColBERT, what does late interaction mean, and what does it cost?
What is the difference between a bi-encoder and a cross-encoder?
You forgot to normalize before indexing with dot product — what symptom would you see?
You upgraded your embedding model on the query side only — what happens?

14) Interview Q&A

Q1. Why is cosine similarity the default metric for text retrieval? A. Because text embedding models encode meaning in direction, not in magnitude. Cosine reads direction only, so it isolates the signal you actually care about. Length variations from training noise do not distort the ranking. Common wrong answer to avoid: "Because cosine is the most accurate metric in general."

Q2. When can I use dot product instead of cosine? A. When the embedding vectors are L2-normalized to unit length. Then cosine and dot product give the exact same ranking, and dot product is cheaper to compute in the inner loop. Common wrong answer to avoid: "Dot product is just a faster approximation of cosine."

Q3. My team uses dot product on unnormalized vectors. What can go wrong? A. Magnitude bias. A small number of long vectors will dominate every top-k regardless of their actual semantic relevance. Either normalize at index time or switch the metric to cosine. Common wrong answer to avoid: "Nothing — dot product and cosine are interchangeable."

Q4. What is an asymmetric encoder, and how do you use it correctly? A. A model trained to encode queries and passages differently, usually by routing them through the same network with distinct prefix tokens like query: and passage:. You must apply the right prefix on the right side at both index and query time, or retrieval quietly degrades. Common wrong answer to avoid: "Asymmetric encoder just means a two-tower model."

Q5. What does ColBERT's late interaction give you that a regular bi-encoder cannot? A. Token-level precision. Instead of one vector per chunk, ColBERT keeps one vector per token and scores via MaxSim — for each query token, take its best match among the doc tokens. This recovers fine-grained term-level relevance while staying retrievable at scale, unlike a cross-encoder. Common wrong answer to avoid: "ColBERT is just a faster cross-encoder."

Q6. Bi-encoder vs cross-encoder — when does each one fit? A. Bi-encoder embeds query and doc independently, compares by cosine or dot. Cheap at scale, used for the initial retrieve stage. Cross-encoder feeds both texts jointly into a transformer that outputs a single relevance score. Expensive per pair, used for reranking the top 20–100 from the bi-encoder. Common wrong answer to avoid: "Cross-encoders replace bi-encoders entirely."

Q7. You upgraded the embedding model but only re-embedded the queries. What happens? A. The query vector lives in the new model's geometry; the index lives in the old model's geometry. Same dimension, totally different coordinate systems. Top-k results become effectively random. Fix: re-embed the whole corpus on any model change, version-pin in the index metadata, and gate model upgrades behind a full re-indexing job. Common wrong answer to avoid: "As long as the dimension matches, mixing models is fine."

Q8. A negative cosine score appears in your top-k. What does that mean? A. The query and the chunk point in opposite directions in the embedding space. The model thinks they are semantically anti-aligned. In text retrieval this is rare and worth investigating — usually it means the chunk is genuinely off-topic, or the embedder is misbehaving on that input. Common wrong answer to avoid: "It means the chunk is missing or corrupt."

15) Apply now (10 min)

Step 1 — model the exercise. Here is the worked trace for our three cards:

A = [0.8, 0.6]    B = [1.6, 1.2]    C = [0.6, 0.8]    q = [0.9, 0.4]
cosine:   A and B tie (0.975), C lower (0.873)
dot:      B wins (1.92), then A (0.96), then C (0.86)
L2:       A closest (0.224), C next (0.500), B far (1.063)

Step 2 — your turn. Take q = [1, 1, 0, 1] and two candidates: d1 = [2, 2, 0, 2], d2 = [1, 0, 1, 1]. Compute dot product, cosine, and L2 for both. Which one wins under each metric? Then normalize d1 and d2 and recompute. Notice what changes.

Step 3 — sketch from memory. Redraw the 2D picture from section 2. Label one arrow that wins under cosine but loses under L2. Write one sentence on why.

Step 4 — model audit. Open the model card for the embedder you use in production. Find the prefix or instruction convention. Search your own indexing code for that string. If it is missing — that is a real bug you just found.

What you should remember

This chapter explained why cosine is the default ruler for text retrieval and what happens when you pick the wrong one. Embedding models encode meaning in direction, not magnitude — so cosine reads the signal and ignores the noise, while dot product on unnormalized vectors lets a few long vectors dominate the top-k for the wrong reason. L2 distance answers a different question entirely; use it only when coordinate distance has physical meaning.

You also learned that the model picks the map and the metric is just the ruler. A wrong embedder produces broken geometry that no metric can fix. The five checks before picking one — dimension, domain fit, language coverage, cost and latency, asymmetric-prefix conventions — are the audit you run before any indexing job, not after. The classic silent bug is forgetting query: / passage: prefixes on an asymmetric encoder: nothing errors, retrieval just gets quietly worse by 3–7 points NDCG.

Carry this diagnostic forward: when top-k looks random after a model or stack change, suspect geometry mismatch before suspecting the metric. Log the embedder model name and version into the index metadata. A change in either invalidates the whole index.

Remember:

Cosine = dot product only for L2-normalized vectors. Otherwise dot product punishes short vectors and rewards long ones.
Pick the model on your eval set, not on MTEB. Public benchmarks are averages; your corpus is specific.
Asymmetric encoders (e5, BGE) need prefixes. Skipping them is a silent retrieval regression.
ColBERT trades 100–200× storage for token-level precision. Reach for it when paraphrase-level matching is not enough.
Re-embed on any model change. Same dimension, different geometry — mixing destroys top-k.

Bridge. The ruler is settled. Cosine for normalized embeddings, dot product when the model encodes magnitude meaningfully, L2 for the rare cases where coordinate distance matters. But computing the score against every card on the bookshelf is still O(N) per query. At a million chunks, that is too slow. The next file shows how vector stores cheat — approximate nearest neighbour indexes that find the right neighbourhood without scanning the whole library.

→ 07-vector-stores-ann.md