Skip to content

Embedding Search — Analysis

Setup

  • 10,000 vectors, 64 dimensions, Gaussian random.
  • 20 query vectors, also Gaussian.
  • Comparing exact cosine over the full corpus vs. IVF with various n_centroids and probe settings.

Run output

corpus: 10000 vectors, dim=64, k=10

exact:          0.10 ms/query, recall@10 = 1.000
IVF n_c= 32 probe=2:    0.12 ms/query, recall@10 = 0.230
IVF n_c= 32 probe=4:    0.23 ms/query, recall@10 = 0.345
IVF n_c= 32 probe=8:    0.48 ms/query, recall@10 = 0.540
IVF n_c= 64 probe=2:    0.07 ms/query, recall@10 = 0.180
IVF n_c= 64 probe=4:    0.13 ms/query, recall@10 = 0.260
IVF n_c= 64 probe=8:    0.26 ms/query, recall@10 = 0.400
IVF n_c=128 probe=2:    0.05 ms/query, recall@10 = 0.095
IVF n_c=128 probe=4:    0.07 ms/query, recall@10 = 0.165
IVF n_c=128 probe=8:    0.14 ms/query, recall@10 = 0.275

What this shows

Gaussian random vectors are an unfriendly workload for IVF. Random points in high-dimensional space have no real cluster structure; the centroids don't carve meaningful neighbourhoods. Recall is poor across the board.

This is by design for the exercise: the implementation is correct, but the workload reveals the IVF method's structural limitation. Real embeddings from a trained model (sentence transformers, OpenAI ada, etc.) have actual semantic structure — clusters of topically-related vectors. IVF works much better there: recall@10 of 0.85-0.95 with similar probe counts.

The recall × latency trade-off

The output shows the trade-off clearly. For the same n_centroids, more probes → higher recall → higher latency. For fewer centroids, each shard is bigger, so even at small probe counts a larger fraction of the corpus is scanned.

The dial production teams use:

  • More centroids + more probes → high recall, modest latency win.
  • More centroids + few probes → low recall (this corpus is bad), big latency win.
  • Few centroids + few probes → recall is OK; latency is similar to exact.
  • Few centroids + many probes → equivalent to exact (you scan everything).

The takeaway: tune n_centroids to about sqrt(N) (here, 100ish for 10K vectors), then set probe to hit your recall target.

Why normalization makes cosine == dot product

cosine(u, v) = (u · v) / (||u|| × ||v||)

If ||u|| == ||v|| == 1, the denominator is 1 and cosine reduces to the dot product. Normalising once at insert lets the search be a single matmul (vecs @ q), which is what makes the exact-search code so fast. Without normalisation, every search would need to divide by ||v|| per vector — many more operations.

Why exact is competitive at this scale

On 10K × 64 vectors, the full matmul is ~640K float operations, which finishes in ~0.1 ms on modern CPUs. IVF adds overhead (centroid scoring, shard merging) that competes with this. IVF wins at much larger corpora — 1M+ vectors — where the linear scan becomes the bottleneck.

The mental model: at this scale, exact is faster than most ANN methods because the constant factors haven't been overcome. ANN's structural advantage manifests at 100K-1M vectors and above.

What this exercise teaches

  • The math is simple. Cosine = dot product after normalisation; IVF = cluster-and-probe.
  • The constants matter. Below ~100K vectors, exact often beats ANN.
  • The workload matters more than the algorithm. Random Gaussian vectors are terrible for IVF; real embeddings are much friendlier.
  • The dials are workload-specific. No universal n_centroids or probe; tune to the data.

Live-coding interviewers typically push on:

  • "Why normalise vectors?"
  • "Walk through the IVF lookup."
  • "What is recall@k and how do you measure it?"
  • "When would you use IVF vs. HNSW vs. exact?"
  • "What is the tradeoff dial in IVF?"

Each has a one-paragraph answer drawn from this exercise.