Embedding Search — Analysis¶
Setup¶
- 10,000 vectors, 64 dimensions, Gaussian random.
- 20 query vectors, also Gaussian.
- Comparing exact cosine over the full corpus vs. IVF with various
n_centroidsandprobesettings.
Run output¶
corpus: 10000 vectors, dim=64, k=10
exact: 0.10 ms/query, recall@10 = 1.000
IVF n_c= 32 probe=2: 0.12 ms/query, recall@10 = 0.230
IVF n_c= 32 probe=4: 0.23 ms/query, recall@10 = 0.345
IVF n_c= 32 probe=8: 0.48 ms/query, recall@10 = 0.540
IVF n_c= 64 probe=2: 0.07 ms/query, recall@10 = 0.180
IVF n_c= 64 probe=4: 0.13 ms/query, recall@10 = 0.260
IVF n_c= 64 probe=8: 0.26 ms/query, recall@10 = 0.400
IVF n_c=128 probe=2: 0.05 ms/query, recall@10 = 0.095
IVF n_c=128 probe=4: 0.07 ms/query, recall@10 = 0.165
IVF n_c=128 probe=8: 0.14 ms/query, recall@10 = 0.275
What this shows¶
Gaussian random vectors are an unfriendly workload for IVF. Random points in high-dimensional space have no real cluster structure; the centroids don't carve meaningful neighbourhoods. Recall is poor across the board.
This is by design for the exercise: the implementation is correct, but the workload reveals the IVF method's structural limitation. Real embeddings from a trained model (sentence transformers, OpenAI ada, etc.) have actual semantic structure — clusters of topically-related vectors. IVF works much better there: recall@10 of 0.85-0.95 with similar probe counts.
The recall × latency trade-off¶
The output shows the trade-off clearly. For the same n_centroids, more probes → higher recall → higher latency. For fewer centroids, each shard is bigger, so even at small probe counts a larger fraction of the corpus is scanned.
The dial production teams use:
- More centroids + more probes → high recall, modest latency win.
- More centroids + few probes → low recall (this corpus is bad), big latency win.
- Few centroids + few probes → recall is OK; latency is similar to exact.
- Few centroids + many probes → equivalent to exact (you scan everything).
The takeaway: tune n_centroids to about sqrt(N) (here, 100ish for 10K vectors), then set probe to hit your recall target.
Why normalization makes cosine == dot product¶
If ||u|| == ||v|| == 1, the denominator is 1 and cosine reduces to the dot product. Normalising once at insert lets the search be a single matmul (vecs @ q), which is what makes the exact-search code so fast. Without normalisation, every search would need to divide by ||v|| per vector — many more operations.
Why exact is competitive at this scale¶
On 10K × 64 vectors, the full matmul is ~640K float operations, which finishes in ~0.1 ms on modern CPUs. IVF adds overhead (centroid scoring, shard merging) that competes with this. IVF wins at much larger corpora — 1M+ vectors — where the linear scan becomes the bottleneck.
The mental model: at this scale, exact is faster than most ANN methods because the constant factors haven't been overcome. ANN's structural advantage manifests at 100K-1M vectors and above.
What this exercise teaches¶
- The math is simple. Cosine = dot product after normalisation; IVF = cluster-and-probe.
- The constants matter. Below ~100K vectors, exact often beats ANN.
- The workload matters more than the algorithm. Random Gaussian vectors are terrible for IVF; real embeddings are much friendlier.
- The dials are workload-specific. No universal
n_centroidsorprobe; tune to the data.
Live-coding interviewers typically push on:
- "Why normalise vectors?"
- "Walk through the IVF lookup."
- "What is recall@k and how do you measure it?"
- "When would you use IVF vs. HNSW vs. exact?"
- "What is the tradeoff dial in IVF?"
Each has a one-paragraph answer drawn from this exercise.