02. Vector similarity metrics: the ruler decides the answer

~12 min read. Same vectors, different metric, different neighbors. That is why metric choice is a product decision.

Continues from the first-principles overview in 00-first-principles.md. The package tag — the coordinate label on each parcel — only becomes useful after we choose the ruler that compares tags.

1) One picture, three rulers

Begin with the basics. A vector is just coordinates. A similarity metric says how to compare coordinates. Without the metric, the coordinates are incomplete.

The three common rulers are cosine similarity, dot product, and Euclidean distance. All three look reasonable. All three reward different things.

Cosine cares about angle. Dot product cares about angle and magnitude. Euclidean distance cares about straight-line distance.

Read the picture before the formulas.

same direction, different length

           B (6,6)
          /
         /
        /
       /
      /
     /
A (3,3)
   /
  /
 /
q (1,1)

C (1,-1) points away in another direction

Query q points northeast. A and B point in the same direction. C points differently. Cosine scores A and B identically. Dot product prefers B because it is longer. Euclidean prefers A because it is closer to q in absolute distance.

The practical rule is: First understand what your embedding model encodes. Then choose the ruler that matches that encoding.

2) Cosine similarity: direction matters most

Cosine similarity asks a clean question. Do these two vectors point in the same direction? It divides out vector length, so magnitude drops out of the score.

The formula names that intuition.

cosine(q, x) = (q · x) / (||q|| ||x||)

If the angle is zero, cosine is 1. If the vectors are orthogonal, cosine is 0. If they point in opposite directions, cosine is -1.

Worked example, with new coordinates that exaggerate the length gap between A and B. Take:

q = (1, 1)
A = (2, 2)
B = (10, 10)
C = (1, -1)

First norms.

||q|| = sqrt(1^2 + 1^2) = sqrt(2)
||A|| = sqrt(4 + 4) = sqrt(8)
||B|| = sqrt(100 + 100) = sqrt(200)
||C|| = sqrt(1 + 1) = sqrt(2)

The dot products are:

q · A = 1*2 + 1*2 = 4
q · B = 1*10 + 1*10 = 20
q · C = 1*1 + 1*(-1) = 0

The cosine scores are:

cos(q,A) = 4 / (sqrt(2)*sqrt(8)) = 4 / 4 = 1
cos(q,B) = 20 / (sqrt(2)*sqrt(200)) = 20 / 20 = 1
cos(q,C) = 0 / (sqrt(2)*sqrt(2)) = 0

A and B tie. Why? Same direction. Different lengths do not matter.
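The arithmetic above can be reproduced in a few lines of plain Python. This is a minimal sketch: the helper name `cosine` is illustrative rather than taken from any library, and it assumes nonzero vectors of equal length.

```python
import math

def cosine(q, x):
    """Cosine similarity of two equal-length, nonzero vectors."""
    dot = sum(qi * xi for qi, xi in zip(q, x))
    norm_q = math.sqrt(sum(qi * qi for qi in q))
    norm_x = math.sqrt(sum(xi * xi for xi in x))
    return dot / (norm_q * norm_x)

q = (1, 1)
candidates = {"A": (2, 2), "B": (10, 10), "C": (1, -1)}

cosine_scores = {name: cosine(q, x) for name, x in candidates.items()}
for name, score in cosine_scores.items():
    # A and B come out at 1 up to floating-point rounding; C comes out at 0.
    print(name, round(score, 6))
```

Scaling A or B by any positive factor leaves the score unchanged, which is the tie described above.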

This is very attractive for text embeddings. Sentence length or embedding norm may vary. But semantic direction often matters more. The scout robot then ignores how long the arrow is. It follows orientation on the warehouse floor.

3) Dot product: direction plus strength

Dot product looks simpler.

dot(q, x) = q · x = Σ qi xi

No normalization step. So magnitude stays in the game. This can be helpful. It can also surprise you.

Use the same example. Scores are already computed.

dot(q,A) = 4
dot(q,B) = 20
dot(q,C) = 0

With dot product, B wins strongly even though direction did not change; its larger norm is now part of the score.

The product question is whether that magnitude has meaning. Sometimes embedding magnitude encodes confidence, frequency, or importance. Then dot product is useful. Recommendation systems often like this. A highly active user vector can rightfully carry more weight. A popular item vector may also have larger norm.

But there is a trap. If magnitude differences are accidental, dot product skews results. The route map then keeps favoring long vectors. You may think you built semantic search. Actually you built norm search. That hurts quietly.
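The trap is easy to reproduce. The sketch below uses two invented candidates (they are not from the worked example): one perfectly aligned with the query but short, one pointing 45 degrees away but long. The helper names `dot` and `cosine` are illustrative.

```python
import math

def dot(q, x):
    return sum(qi * xi for qi, xi in zip(q, x))

def cosine(q, x):
    return dot(q, x) / (math.sqrt(dot(q, q)) * math.sqrt(dot(x, x)))

q = (1, 1)
# Hypothetical candidates: "aligned" matches q's direction exactly,
# "long" points 45 degrees away but has a much larger norm.
candidates = {"aligned": (1, 1), "long": (10, 0)}

by_dot = sorted(candidates, key=lambda n: dot(q, candidates[n]), reverse=True)
by_cosine = sorted(candidates, key=lambda n: cosine(q, candidates[n]), reverse=True)

print("dot ranking:   ", by_dot)     # the long vector wins on raw dot product
print("cosine ranking:", by_cosine)  # the aligned vector wins on cosine
```

If the extra length of `long` carries no meaning, the dot-product ranking here is ordering by norm rather than by direction.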

A toy diagram makes the norm effect visible.

same angle

q ---->
A ------------>
B -------------------------------->

cosine says A = B
dot product says B > A

So in interviews, say this carefully. Dot product is not better or worse by itself. It assumes magnitude should matter. That assumption must be justified.

4) Euclidean distance: absolute closeness in space

Euclidean distance asks a different question. How far apart are the points physically? That is straight-line distance.

d(q, x) = sqrt(Σ (qi - xi)^2)

Use the same vectors again.

Distance to A: sqrt((1-2)^2 + (1-2)^2) = sqrt(1 + 1) = sqrt(2) ≈ 1.41

Distance to B: sqrt((1-10)^2 + (1-10)^2) = sqrt(81 + 81) = sqrt(162) ≈ 12.73

Distance to C: sqrt((1-1)^2 + (1+1)^2) = sqrt(0 + 4) = 2

So the Euclidean ranking is A, then C, then B. That is very different from dot product, which put B first. It also differs from cosine, which tied A and B at the top and put C last.

Why did B drop so hard? Because B is far away in absolute position. Same direction is not enough.
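A quick check of those three distances, as a sketch using only the standard library (`math.dist` requires Python 3.8 or later):

```python
import math

q = (1, 1)
candidates = {"A": (2, 2), "B": (10, 10), "C": (1, -1)}

# math.dist returns the Euclidean distance between two points.
distances = {name: math.dist(q, x) for name, x in candidates.items()}

# Smaller distance means closer, so sort ascending.
euclidean_ranking = sorted(distances, key=distances.get)

for name in euclidean_ranking:
    print(name, round(distances[name], 2))
```

Note the sort direction: distance is a "lower is better" score, while cosine and dot product are "higher is better". Mixing the two up is a common source of inverted rankings.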

Normalization changes this picture. If every vector is L2-normalized to unit length, something neat happens. Dot product equals cosine because both norms are 1, and Euclidean distance becomes a monotonic (decreasing) transform of cosine. All three give the same ranking. The engineering move is to check whether your system normalizes vectors before indexing.
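The link between cosine and Euclidean distance follows from expanding the squared distance. A short derivation, using the same q and x as the formulas above and assuming both have unit length:

```latex
\begin{aligned}
\|q - x\|^2 &= \|q\|^2 + \|x\|^2 - 2\,(q \cdot x) \\
            &= 1 + 1 - 2\,\cos(q, x) \\
            &= 2 - 2\,\cos(q, x)
\end{aligned}
```

Distance shrinks exactly as cosine grows, so sorting by ascending distance and sorting by descending cosine produce the same order.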

5) Practical selection rules

Use cosine when semantic direction matters and norms are noisy. This is common in text retrieval. Use dot product when norms carry signal. This is common in recommendation or contrastive setups. Use Euclidean when absolute geometry matters naturally. This appears in some vision, clustering, and anomaly workflows.

One more worked comparison makes the shortcut concrete. Suppose normalized vectors:

q = (0.6, 0.8)
A = (0.8, 0.6)
B = (0, 1)

All norms are 1. Dot products:

q·A = 0.48 + 0.48 = 0.96
q·B = 0 + 0.8 = 0.8

Cosine scores are the same numbers because norms are 1. Now Euclidean distances.

d(q,A) = sqrt((0.6-0.8)^2 + (0.8-0.6)^2) = sqrt(0.04 + 0.04) = sqrt(0.08) ≈ 0.283
d(q,B) = sqrt((0.6-0)^2 + (0.8-1)^2) = sqrt(0.36 + 0.04) = sqrt(0.40) ≈ 0.632

Same winner. That is the normalized-vector shortcut.
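The same comparison can be checked numerically. This sketch reuses the three unit vectors above and also confirms that, for unit vectors, squared distance equals 2 minus twice the cosine. The helper name `dot` is illustrative.

```python
import math

q = (0.6, 0.8)
candidates = {"A": (0.8, 0.6), "B": (0.0, 1.0)}

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

for name, x in candidates.items():
    cos = dot(q, x)          # equals cosine because all norms are 1
    dist = math.dist(q, x)
    # For unit vectors, dist**2 and 2 - 2*cos should match up to rounding.
    print(name, round(cos, 3), round(dist, 3), round(dist ** 2, 3), round(2 - 2 * cos, 3))

by_cosine = sorted(candidates, key=lambda n: dot(q, candidates[n]), reverse=True)
by_distance = sorted(candidates, key=lambda n: math.dist(q, candidates[n]))
print(by_cosine == by_distance)
```

The shortcut only holds if both the query and the stored vectors are normalized; normalizing one side alone is not enough for dot product or Euclidean distance.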

The loading dock should document this choice. If embeddings are normalized during ingestion, say so clearly. If the ANN engine expects one metric, match it consistently. If offline evaluation uses cosine but production index uses dot product, results drift. Then debugging becomes a mess.
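One way to make that documentation checkable is to record the metric and the normalization flag for each stage and compare them before a release. The sketch below is hypothetical: `MetricConfig`, `effective_ruler`, and `rankings_agree` are invented names, not part of any vector database API, and the check assumes both stages score the same underlying vectors.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class MetricConfig:
    """Hypothetical record of how one pipeline stage scores vectors."""
    stage: str        # e.g. "offline_eval" or "production_index"
    metric: str       # "cosine", "dot", or "l2"
    normalized: bool  # True if query and stored vectors are L2-normalized

def effective_ruler(cfg: MetricConfig) -> str:
    """Collapse configurations that are guaranteed to rank identically."""
    # Cosine ignores length by definition, and dot product or L2 on
    # unit-length vectors rank the same way cosine does.
    if cfg.metric == "cosine" or cfg.normalized:
        return "angular"
    return cfg.metric  # raw "dot" or raw "l2": magnitude still matters

def rankings_agree(a: MetricConfig, b: MetricConfig) -> bool:
    return effective_ruler(a) == effective_ruler(b)

offline = MetricConfig("offline_eval", "cosine", normalized=False)
production = MetricConfig("production_index", "dot", normalized=False)
fixed = MetricConfig("production_index", "dot", normalized=True)

print(rankings_agree(offline, production))  # mismatch: eval and serving can disagree
print(rankings_agree(offline, fixed))       # normalizing at ingestion closes the gap
```

A check like this belongs next to the index manifest, so a metric change cannot ship without someone seeing the mismatch.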

6) Why not just accept the default metric without matching the embedding model?

The tempting alternative is to accept the vector database's default metric without checking what the embedding model expects. It keeps the architecture small and makes the first demo look clean. That is fine for a prototype, but it becomes dangerous once the workload has real scale, filters, freshness pressure, and evaluation data.

It fails because the metric decides which vectors count as neighbors, so a mismatch can flip the result order. At that point the system needs an inspectable artifact: the same vectors scored by cosine, dot product, and L2. Without it, every bad answer turns into a vague argument about whether embeddings, ANN, metadata filters, lifecycle, or evaluation are guilty.
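A minimal version of that artifact can be produced with exact scoring over a handful of vectors. The sketch below reuses the worked-example vectors; `score_all` and `rank` are illustrative helper names, and `math.hypot` with more than two arguments requires Python 3.8 or later.

```python
import math

def score_all(q, x):
    """Score one candidate against the query under all three rulers."""
    dot = sum(qi * xi for qi, xi in zip(q, x))
    cosine = dot / (math.hypot(*q) * math.hypot(*x))
    l2 = math.dist(q, x)
    return {"cosine": cosine, "dot": dot, "l2": l2}

def rank(q, candidates, metric):
    scores = {name: score_all(q, x)[metric] for name, x in candidates.items()}
    # Higher is better for cosine and dot; lower is better for L2.
    return sorted(scores, key=scores.get, reverse=(metric != "l2"))

q = (1, 1)
candidates = {"A": (2, 2), "B": (10, 10), "C": (1, -1)}

report = {m: rank(q, candidates, m) for m in ("cosine", "dot", "l2")}
for metric, order in report.items():
    print(f"{metric:>6}: {' > '.join(order)}")
```

A and B tie on cosine, so their relative order in that row is arbitrary and may depend on floating-point rounding. The dot and L2 rows differ from each other and from cosine, which is the flip the section describes.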

Option	Works when	Fails when	Cost moves to
Accepting the default metric without matching the embedding model	corpus is small or low-risk	the mismatch changes which neighbors are returned or flips their order	latency, recall, or user trust
Choosing the metric to match the embedding model	the failure can be measured in the index path	traces or baselines are missing	memory, rebuilds, evals, operations

Mini-FAQ. "Is this always worth adding?" No. The RAG-fundamentals rule still applies: add machinery only when a measured workload pressure earns it. If exact search is cheap, if filters are simple, or if evaluation is missing, the clever index can become a more expensive way to stay confused.

7) Production signals: know whether your metric choice is working

Healthy behavior means the artifact (the same vectors scored by cosine, dot product, and L2) explains why the returned neighbors changed. In a real incident review, you should be able to point at that artifact and explain why the candidate set changed, not merely say that the database returned something.

The first metric to watch is metric-flip rate on judged queries. Track it by query family, tenant, corpus slice, and index version, because global averages hide exactly the failures users notice first.
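The text does not pin down how metric-flip rate is computed. One plausible reading, assumed here, is the fraction of judged queries whose exact top-k neighbor set changes when the metric changes. The function names and the toy corpus below are invented for illustration.

```python
import math

def top_k(q, corpus, metric, k):
    """Exact (brute-force) top-k ids for one query under one metric."""
    def score(x):
        dot = sum(qi * xi for qi, xi in zip(q, x))
        if metric == "dot":
            return dot
        if metric == "cosine":
            return dot / (math.hypot(*q) * math.hypot(*x))
        if metric == "l2":
            return -math.dist(q, x)  # negate so that higher is always better
        raise ValueError(f"unknown metric: {metric}")
    return sorted(corpus, key=lambda doc_id: score(corpus[doc_id]), reverse=True)[:k]

def metric_flip_rate(queries, corpus, metric_a, metric_b, k=1):
    """Fraction of queries whose top-k id set differs between two metrics."""
    flips = sum(
        set(top_k(q, corpus, metric_a, k)) != set(top_k(q, corpus, metric_b, k))
        for q in queries
    )
    return flips / len(queries)

# Toy data, invented for illustration only.
corpus = {"short_aligned": (1, 1), "long_off_axis": (10, 0)}
queries = [(1, 1), (1, 0)]

rate = metric_flip_rate(queries, corpus, "cosine", "dot", k=1)
print(rate)
```

To follow the slicing advice above, the same function can be called once per query family, tenant, or index version instead of once over all queries.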

The misleading metric is database uptime. A vector database can be perfectly available while recall, filtering, freshness, or embedding compatibility is broken, so uptime only proves the warehouse doors opened; it does not prove the scout robot found the right shelf.

The expert graph compares exact baseline recall, p50/p99 latency, filter selectivity, index version, embedding version, and bad-query examples by slice. That graph is the difference between tuning knobs and debugging a retrieval system.

bad retrieval
   -> query vector / filter
   -> index path
   -> candidate neighbors
   -> score and metadata trace
   -> exact baseline or judged list

8) Boundary: where similarity metrics help and where they do not

Use this mechanism when the failure happens inside vector geometry, index traversal, filtering, lifecycle, or serving operations. That is the zone where vector-database machinery can actually change the returned neighbors, the latency curve, or the operational envelope.

Do not expect it to fix cases where the source content is wrong, the embedding model is poor for the domain, or the product definition of relevance is unresolved. Those are upstream or product-definition failures, and better ANN settings will only make the wrong evidence arrive faster.

The common pathology is that teams keep tuning ANN knobs when the real issue is bad chunks, stale data, weak labels, or missing evals. In interviews, call this out explicitly: the index is not the whole retrieval system, it is one stage inside a pipeline that also depends on documents, chunks, labels, and evals.

The scale limit is blunt: every improvement spends something — RAM, disk, build time, query latency, engineering time, or vendor lock-in. The mature answer is not to pick the fanciest mechanism; it is to choose the pressure you are willing to pay for.

9) Wrong model: all vector metrics are interchangeable

The wrong model is attractive because it compresses the system into one easy story, and easy stories feel good in design docs. The trouble is that production vector search is not one story; it is embedding quality, distance metric, ANN index, metadata filters, lifecycle, sharding, vendor operations, and monitoring all interacting under traffic.

If the choice of similarity metric cannot change recall, latency, cost, freshness, or debug visibility, it is not carrying its weight; it is vocabulary without leverage.

10) Failure taxonomy for similarity metrics

Geometry failure — the embedding space does not put useful neighbors close enough.
Metric failure — the chosen similarity ruler disagrees with the model or workload.
Index failure — ANN skips relevant vectors or returns unstable candidates.
Filtering failure — metadata filters erase good candidates or violate scope.
Lifecycle failure — stale, mixed-version, or partially rebuilt indexes serve traffic.
Scale failure — fan-out, memory, or rebuild cost breaks the SLO.
Debugging failure — no trace connects query vector, index path, candidates, and final result.

11) Pattern transfer: where this returns later

RAG uses vector DBs as the evidence gateway before generation.
Retrieval and ranking supplies the metrics and fusion logic used here.
Data engineering supplies chunk quality, metadata, and embedding-version hygiene.
Production evals decide whether recall and relevance changes actually help users.

12) Design review checklist

What pressure is this mechanism relieving: latency, memory, filtering, freshness, scale, or evaluation?
What artifact would you inspect first: vector neighbors, index trace, filter plan, namespace manifest, or exact baseline?
Why is choosing the default metric without matching the embedding model weaker for this workload?
Which slice should improve first?
Which cost rises first: RAM, disk, build time, query latency, or operational complexity?
What rollback signal tells you the index change hurt retrieval?

Where this lives in the wild

OpenAI retrieval builders — search engineer. Cosine similarity is commonly used for normalized text embeddings in semantic search.
YouTube recommendations — recommender engineer. Dot product is natural when user and video embedding magnitudes carry engagement signal.
Pinterest visual search — computer vision engineer. Euclidean distance is often used after image-feature normalization and ANN indexing.
Qdrant deployments for support bots — platform engineer. Metric choice must match embedding model assumptions before HNSW tuning even begins.
pgvector inside PostgreSQL — database engineer. Teams explicitly choose vector_cosine_ops, vector_ip_ops, or vector_l2_ops based on workload semantics.
Enterprise RAG — vector DBs store policy, wiki, ticket, and document chunks for semantic retrieval.
Ecommerce search — vectors help with descriptive queries while filters protect catalog scope.
Support copilots — need metadata filters for tenant, product, language, and freshness.
Code search — mixes semantic vectors with exact identifiers and repository permissions.
Recommendation systems — use nearest-neighbor retrieval before ranking models.
Image and multimodal search — embeddings represent images, captions, and cross-modal queries.
Legal discovery — recall and auditability are more important than average latency alone.
Healthcare retrieval — metadata, permissions, and freshness are safety boundaries.
Fraud and anomaly systems — vector similarity finds nearby behavior patterns.
Personalization systems — user and item embeddings need versioned lifecycle management.

Recall checkpoint

What does cosine ignore that dot product keeps?
Why can dot product accidentally become a norm-ranking system?
When do cosine and Euclidean produce the same ranking?
Which metric would you test first for text embeddings, and why?
Which artifact would you inspect first when a metric choice is in doubt?
What query or corpus slice would prove the improvement is real?
What is the first operational cost this mechanism adds?

Interview Q&A

Q: Why use cosine similarity and not dot product for many text-search systems? A: Because cosine removes magnitude effects and focuses on direction, which often tracks semantic similarity better when norms are unstable.

Common wrong answer to avoid: "Cosine is always more accurate." It is only better when magnitude should not matter.

Q: Why might a recommender choose dot product and not cosine? A: Because user or item norm can encode strength, popularity, or confidence, so removing magnitude would discard useful signal.

Common wrong answer to avoid: "Because dot product is faster." The choice is usually semantic first, speed second.

Q: Why does Euclidean distance behave differently from cosine on unnormalized vectors? A: Because Euclidean punishes absolute position differences, while cosine only looks at angle between directions.

Common wrong answer to avoid: "They are basically the same metric." Only under normalization do rankings align.

Q: Why is metric mismatch between offline eval and production dangerous? A: Because you may tune embeddings and recall against one ruler, then deploy another ruler and silently change neighbor ranking.

Common wrong answer to avoid: "ANN indexes fix that automatically." The index only accelerates the chosen metric.

Q: What artifact would you inspect first when a similarity metric fails? A: I would inspect the same vectors scored by cosine, dot product, and L2, then compare that with the exact baseline, filter state, index version, and embedding version.

Common wrong answer to avoid: "Just check whether the vector DB is up." Availability does not prove recall, freshness, or relevance.

Q: How do you know the change helped? A: Track metric-flip rate on a representative slice of judged queries and compare it with latency, memory, build time, and filtered-result behavior.

Common wrong answer to avoid: "The average similarity score increased." Similarity scores are not product-quality metrics by themselves.

Q: When should you avoid this mechanism? A: Avoid it when the corpus is small, exact search is cheap, or the team lacks evaluation data to prove the extra complexity helps.

Common wrong answer to avoid: "Every production AI system needs the most advanced vector index." The right index depends on workload, scale, filters, and operational constraints.

Apply now (10 min)

Exercise. Take three vectors from this file. Compute cosine, dot product, and Euclidean rankings by hand. Then write one sentence on which ranking feels right for semantic search.

Sketch from memory. Draw one arrow diagram with same direction but different lengths. Label where cosine ties, where dot product separates, and where Euclidean punishes distance on the warehouse floor.

Reproduce from memory: explain metric choice in terms of its pressure, artifact, metric, boundary, and failure mode.

What you should remember

Metric choice matters because the similarity metric decides which vectors count as neighbors and can flip the result order. The point is not to memorize a vendor feature; it is to know which workload pressure the mechanism relieves and which cost it creates.

The artifact to inspect is the same vectors scored by cosine, dot product, and L2. If you cannot inspect it, vector search debugging becomes guesswork.

Remember:

Vector search fails through geometry, metrics, indexes, filters, lifecycle, scale, and monitoring.
Watch metric-flip rate on judged queries by query and corpus slice before trusting global averages.
Exact baselines and judged lists are how you keep ANN tuning honest.
Every vector database choice moves cost between recall, latency, memory, rebuilds, and operations.

Bridge. Once the ruler is fixed, we still need a baseline search method. Start with the honest one first: compare the query against every package tag. → 03-brute-force-baseline.md