06. Product quantization — smaller tags, cheaper memory, fuzzier geometry¶

~13 min read. ANN speed alone is not enough. Storage cost becomes brutal unless we compress vectors intelligently.

Continues from the first-principles overview in 00-first-principles.md. The package tag — the coordinate label on each parcel — can be shrunk into short codes so the warehouse stores far more items in memory.

1) Why memory becomes the next bottleneck¶

Begin with a concrete workload: A 768-dimensional float32 vector needs 768 * 4 = 3072 bytes. That is about 3 KB. A million vectors need about 3 GB. A hundred million vectors need about 300 GB. That is only raw vectors. No graph edges yet. No metadata yet.

In production, the failure is specific: High recall wants vectors in memory. Memory is expensive. Cache misses hurt latency. The practical rule is: Compress the vectors. But do not destroy the neighborhoods too badly. That is where product quantization enters.

Use this picture as the mental model before the details.

full tag: [0.12, -0.31, 0.84, 0.07,  ... ]

split into chunks:
[0.12, -0.31] [0.84, 0.07] [...]
      │             │
      ▼             ▼
   code 5        code 12

store short code ids instead of full floats

The package tag becomes a sequence of small codebook IDs. The geometry becomes approximate. Memory drops sharply.

2) The idea: split dimensions into subspaces¶

PQ does not quantize the whole vector with one giant codebook. That would be unwieldy. Instead, it splits the vector into smaller chunks. Each chunk gets its own codebook.

Suppose dimension d = 8. Choose m = 4 subspaces. Then each subvector has dimension 2. For each subspace, learn k centroids. Usually k = 256 so one byte stores the centroid ID.

See the shape.

original vector
[x1 x2 | x3 x4 | x5 x6 | x7 x8]
   S1      S2      S3      S4

for each Si, choose nearest centroid id
[id1 | id2 | id3 | id4]

Storage becomes dramatically smaller. Instead of eight floats, you store four bytes. In real systems, 768 floats can become 96 bytes with m = 96 and 8-bit codes. That is a huge saving. The loading dock can now pack many more parcels into RAM.

3) Worked numerical example with tiny codebooks¶

Take a 4D vector:

x = [1.2, 0.9 | -0.4, 0.3]

Split into two 2D subspaces. So m = 2.

Codebook for subspace 1 has centroids: - c1_0 = [1.0, 1.0] - c1_1 = [0.0, 0.0]

Codebook for subspace 2 has centroids: - c2_0 = [-0.5, 0.5] - c2_1 = [0.5, -0.5]

Use this picture as the mental model before the details.

subspace 1                    subspace 2

x1=(1.2,0.9)                  x2=(-0.4,0.3)
  near c1_0                     near c2_0

store code 0                  store code 0

Compute the subspace distances:

Subspace 1 to c1_0: sqrt((1.2-1.0)^2 + (0.9-1.0)^2) = sqrt(0.04 + 0.01) = sqrt(0.05) ≈ 0.224

Subspace 1 to c1_1: sqrt((1.2-0)^2 + (0.9-0)^2) = sqrt(1.44 + 0.81) = sqrt(2.25) = 1.5

Choose code 0 for subspace 1.

Subspace 2 to c2_0: sqrt((-0.4+0.5)^2 + (0.3-0.5)^2) = sqrt(0.01 + 0.04) = sqrt(0.05) ≈ 0.224

Subspace 2 to c2_1: sqrt((-0.4-0.5)^2 + (0.3+0.5)^2) = sqrt(0.81 + 0.64) = sqrt(1.45) ≈ 1.204

Choose code 0 for subspace 2.

Final PQ code is [0 | 0]. Approximate reconstructed vector is: [1.0, 1.0 | -0.5, 0.5]

Compression succeeded. But we changed the point. That is the accuracy cost.

4) Distance estimation with ADC¶

Search usually works through a pattern called asymmetric distance computation, or ADC. One common pattern is asymmetric distance computation, or ADC. Store database vectors as PQ codes. Keep the query as a full vector. Then precompute distances from each query subvector to each centroid.

Use this picture as the mental model before the details.

query q split into subspaces
   │
   ├─ table for subspace 1: dist(q1, c1_0), dist(q1, c1_1), ...
   └─ table for subspace 2: dist(q2, c2_0), dist(q2, c2_1), ...

code [0|0] -> lookup two table entries -> add them

Worked mini example. Let query be: q = [1.1, 1.0 | -0.6, 0.4]

Subspace 1 distances: - to c1_0: sqrt((1.1-1.0)^2 + (1.0-1.0)^2) = 0.1 - to c1_1: sqrt(1.21 + 1.0) ≈ 1.487

Subspace 2 distances: - to c2_0: sqrt((-0.6+0.5)^2 + (0.4-0.5)^2) = sqrt(0.01 + 0.01) ≈ 0.141 - to c2_1: much larger

For database code [0|0], approximate distance is: 0.1 + 0.141 = 0.241 if using additive lookup intuition. The exact formula choice depends on implementation. But the key idea is clear. We score compressed codes quickly using lookup tables.

That is why IVF-PQ is powerful. Routing narrows the candidate set. PQ shrinks memory and speeds comparisons.

5) What you gain and what you lose¶

Gain one. Huge RAM savings. Gain two. Better cache behavior. Gain three. Cheaper billion-scale search.

Loss one. Approximation error. Loss two. Harder updates and rebuild logic. Loss three. More tuning knobs.

In production, the failure is specific: If the query needs fine semantic distinctions, coarse codes blur them. Two nearby vectors may quantize to the same code. Or one good vector may jump past a decision boundary. That hurts recall.

The blur picture shows the cost.

true points:     ●  ●   q   ●
PQ buckets:     [same code][same code]

fine ordering inside a bucket gets fuzzy

The practical rule is: Often teams use a two-stage design. First retrieve many candidates using compressed search. Then rerank top candidates with original vectors. That rescues quality. The scout robot first moves cheaply. Then it becomes precise near the finish line.

6) Why not buying more RAM forever under this workload¶

The tempting alternative is buying more RAM forever because it keeps the architecture small and makes the first demo look clean. That story is useful for a prototype, but it becomes dangerous once the workload has real scale, filters, freshness pressure, and evaluation data.

It fails when vector memory dominates cost, but compression distorts geometry. At that point the system needs an inspectable artifact — compressed codebooks and approximate distance table — because otherwise every bad answer turns into a vague argument about whether embeddings, ANN, metadata filters, lifecycle, or evaluation are guilty.

Option	Works when	Fails when	Cost moves to
buying more RAM forever	corpus is small or low-risk	vector memory dominates cost, but compression distorts geometry	latency, recall, or user trust
product quantization	the failure can be measured in the index path	traces or baselines are missing	memory, rebuilds, evals, operations

Mini-FAQ. "Is this always worth adding?" No. The RAG-fundamentals rule still applies: add machinery only when a measured workload pressure earns it. If exact search is cheap, if filters are simple, or if evaluation is missing, the clever index can become a more expensive way to stay confused.

7) Production signals — know whether product quantization is working¶

Healthy behavior means compressed codebooks and approximate distance table explains why the returned neighbors changed. In a real incident review, you should be able to point at that artifact and explain why the candidate set changed, not merely say that the database returned something.

The first metric to watch is recall drop per byte saved. Track it by query family, tenant, corpus slice, and index version, because global averages hide exactly the failures users notice first.

The misleading metric is database uptime. A vector database can be perfectly available while recall, filtering, freshness, or embedding compatibility is broken, so uptime only proves the warehouse doors opened; it does not prove the scout robot found the right shelf.

The expert graph compares exact baseline recall, p50/p99 latency, filter selectivity, index version, embedding version, and bad-query examples by slice. That graph is the difference between tuning knobs and debugging a retrieval system.

bad retrieval
   -> query vector / filter
   -> index path
   -> candidate neighbors
   -> score and metadata trace
   -> exact baseline or judged list

8) Boundary — where product quantization helps and where it does not¶

Use this mechanism when the failure happens inside vector geometry, index traversal, filtering, lifecycle, or serving operations. That is the zone where vector-database machinery can actually change the returned neighbors, the latency curve, or the operational envelope.

Do not expect it to fix cases where the source content is wrong, the embedding model is poor for the domain, or the product definition of relevance is unresolved. Those are upstream or product-definition failures, and better ANN settings will only make the wrong evidence arrive faster.

The common pathology is that teams keep tuning ANN knobs when the real issue is bad chunks, stale data, weak labels, or missing evals. In interviews, call this out explicitly: the index is not the whole retrieval system, it is one stage inside a pipeline that also depends on documents, chunks, labels, and evals.

The scale limit is blunt: every improvement spends something — RAM, disk, build time, query latency, engineering time, or vendor lock-in. The mature answer is not to pick the fanciest mechanism; it is to choose the pressure you are willing to pay for.

9) Wrong model — compression only affects storage, not relevance¶

The wrong model is attractive because it compresses the system into one easy story, and easy stories feel good in design docs. The trouble is that production vector search is not one story; it is embedding quality, distance metric, ANN index, metadata filters, lifecycle, sharding, vendor operations, and monitoring all interacting under traffic.

If product quantization cannot change recall, latency, cost, freshness, or debug visibility, it is not carrying its weight; it is vocabulary without leverage.

10) Failure taxonomy for product quantization¶

Geometry failure — the embedding space does not put useful neighbors close enough.
Metric failure — the chosen similarity ruler disagrees with the model or workload.
Index failure — ANN skips relevant vectors or returns unstable candidates.
Filtering failure — metadata filters erase good candidates or violate scope.
Lifecycle failure — stale, mixed-version, or partially rebuilt indexes serve traffic.
Scale failure — fan-out, memory, or rebuild cost breaks the SLO.
Debugging failure — no trace connects query vector, index path, candidates, and final result.

11) Pattern transfer — where this returns later¶

RAG uses vector DBs as the evidence gateway before generation.
Retrieval and ranking supplies the metrics and fusion logic used here.
Data engineering supplies chunk quality, metadata, and embedding-version hygiene.
Production evals decide whether recall and relevance changes actually help users.

12) Design review checklist¶

What pressure is this mechanism relieving: latency, memory, filtering, freshness, scale, or evaluation?
What artifact would you inspect first: vector neighbors, index trace, filter plan, namespace manifest, or exact baseline?
Why is buying more RAM forever weaker for this workload?
Which slice should improve first?
Which cost rises first: RAM, disk, build time, query latency, or operational complexity?
What rollback signal tells you the index change hurt retrieval?

Where this lives in the wild¶

FAISS IVF-PQ deployments — vector infra engineer. Product quantization keeps billion-scale search inside practical RAM budgets.
Image deduplication platforms — computer vision engineer. Compressed embeddings make large visual corpora searchable without absurd hardware spend.
Ad ranking candidate generation — recommender systems engineer. PQ reduces memory footprint for huge item towers before later-stage reranking.
Large multilingual support search — search platform engineer. Teams use compressed vector indexes so many tenant shards stay memory-resident.
Edge-device retrieval experiments — applied ML engineer. Quantized storage enables local semantic lookup on constrained hardware.
Enterprise RAG — vector DBs store policy, wiki, ticket, and document chunks for semantic retrieval.
Ecommerce search — vectors help with descriptive queries while filters protect catalog scope.
Support copilots — need metadata filters for tenant, product, language, and freshness.
Code search — mixes semantic vectors with exact identifiers and repository permissions.
Recommendation systems — use nearest-neighbor retrieval before ranking models.
Image and multimodal search — embeddings represent images, captions, and cross-modal queries.
Legal discovery — recall and auditability are more important than average latency alone.
Healthcare retrieval — metadata, permissions, and freshness are safety boundaries.
Fraud and anomaly systems — vector similarity finds nearby behavior patterns.
Personalization systems — user and item embeddings need versioned lifecycle management.

Recall checkpoint¶

Why does PQ split vectors into subspaces instead of one giant codebook?
What is stored in place of full float vectors?
Why does PQ help cache and RAM behavior?
Why do teams often rerank after PQ retrieval?
Which artifact would you inspect first for product quantization?
What query or corpus slice would prove the improvement is real?
What is the first operational cost this mechanism adds?

Interview Q&A¶

Q: Why use PQ and not only float32 vectors when recall matters? A: Because memory pressure can dominate cost and latency at scale, and modest quantization error may be acceptable if reranking restores quality.

Common wrong answer to avoid: "Because ANN requires compressed vectors." Many ANN systems work with raw floats too.

Q: Why split vectors into subspaces in PQ? A: Because smaller independent codebooks are tractable to train and store, while one full-space codebook would explode combinatorially.

Common wrong answer to avoid: "Just for parallelism." The main reason is representational tractability.

Q: Why does asymmetric distance computation help? A: Because it keeps the query exact while scoring compressed database codes efficiently via lookup tables.

Common wrong answer to avoid: "ADC makes search exact again." It remains approximate.

Q: Why might PQ hurt semantic search quality on fine-grained tasks? A: Because quantization blurs small geometric differences that may matter for ranking close candidates.

Common wrong answer to avoid: "Only very large distances are affected." Near-neighbor ordering is often where the pain appears.

Q: What artifact would you inspect first when product quantization fails? A: I would inspect compressed codebooks and approximate distance table, then compare it with exact baseline, filter state, index version, and embedding version.

Common wrong answer to avoid: "Just check whether the vector DB is up." — Availability does not prove recall, freshness, or relevance.

Q: How do you know the change helped? A: Track recall drop per byte saved on a representative query slice and compare it with latency, memory, build time, and filtered-result behavior.

Common wrong answer to avoid: "The average similarity score increased." — Similarity scores are not product-quality metrics by themselves.

Q: When should you avoid this mechanism? A: Avoid it when the corpus is small, exact search is cheap, or the team lacks evaluation data to prove the extra complexity helps.

Common wrong answer to avoid: "Every production AI system needs the most advanced vector index." — The right index depends on workload, scale, filters, and operational constraints.

Apply now (10 min)¶

Exercise. Take a 4D toy vector. Split it into two 2D chunks. Invent two centroids per chunk. Assign the nearest code in each chunk. Then reconstruct the approximate vector. Compare it with the original.

Sketch from memory. Draw a long package tag becoming short code IDs. Label one memory saving and one ranking risk for the route map.

Reproduce from memory: explain product quantization with its pressure, artifact, metric, boundary, and failure mode.

What you should remember¶

Product quantization exists because vector memory dominates cost, but compression distorts geometry. The point is not to memorize a vendor feature; it is to know which workload pressure the mechanism relieves and which cost it creates.

The artifact to inspect is compressed codebooks and approximate distance table. If you cannot inspect it, vector search debugging becomes guesswork.

Remember:

Vector search fails through geometry, metrics, indexes, filters, lifecycle, scale, and monitoring.
Watch recall drop per byte saved by query and corpus slice before trusting global averages.
Exact baselines and judged lists are how you keep ANN tuning honest.
Every vector database choice moves cost between recall, latency, memory, rebuilds, and operations.

Bridge. Fast search and compact storage still fail if results violate product rules. The next file asks how metadata filters interact with ANN search. → 07-metadata-filtering.md