07. Vector stores and ANN — the bookshelf at a hundred million chunks¶

~12 min read. You have great embeddings. You have a clean corpus. Then somebody loads 100M chunks at 1536 dimensions, and p99 latency walks off a cliff. By the end of this page you will know why brute force breaks, how IVF and HNSW dodge that cliff, and which vector store to pick when.

Builds on 06-similarity-and-models.md. The bookshelf from the ELI5 is exactly the vector store. ANN is the librarian's set of shortcut routes for skipping wide swaths of shelves without missing the right book.

1) The wall — when brute force breaks¶

Picture the running example for this page. A support corpus. 50M chunks. Each chunk embedded at 1536 dimensions in float32.

That is 50,000,000 × 1536 × 4 bytes. About 307 GB of raw vectors. Before you index a single edge.

Now compute one query the naive way. Cosine against every vector. 50M dot products at 1536 dims is roughly 77 billion multiply-adds per query. On a modern CPU with AVX2, optimistic throughput, that is hundreds of milliseconds per query. On one machine. With no concurrency.

Push to 100 QPS, watch p99 blow past 2 seconds. Push to 1000 QPS, the box is on fire.

That is the wall. Brute force is O(n) per query, and n grows because the product grows. You cannot embed your way out of this. You need a different shape on the shelf.
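The arithmetic behind the wall fits in a few lines. This is a minimal sketch; the helper names are illustrative and the inputs are the running example's 50M chunks at 1536 float32 dimensions.

```python
def raw_vector_bytes(n_vectors: int, dim: int, bytes_per_value: int = 4) -> int:
    """Bytes needed to store every vector uncompressed (float32 by default)."""
    return n_vectors * dim * bytes_per_value


def multiply_adds_per_query(n_vectors: int, dim: int) -> int:
    """Brute force does one dot product per stored vector, dim multiply-adds each."""
    return n_vectors * dim


n_vectors, dim = 50_000_000, 1536
print(f"raw vectors: {raw_vector_bytes(n_vectors, dim) / 1e9:.1f} GB")          # 307.2 GB
print(f"work per query: {multiply_adds_per_query(n_vectors, dim) / 1e9:.1f} "
      "billion multiply-adds")                                                  # 76.8 billion
```

Both quantities scale linearly with the number of vectors, which is the O(n) in the paragraph above.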

Approximate Nearest Neighbour search makes a clean trade. Accept a small recall loss (say recall@10 of 0.95 instead of 1.0) and search collapses from O(n) to roughly O(log n) or O(√n). On the same 50M corpus, queries drop from ~300 ms to under 10 ms, a 30x or better reduction. The same hardware now serves 30–100x more traffic.

The bookshelf metaphor sharpens here. Brute force is the librarian reading every spine, one by one. ANN is the librarian who knows the layout — fiction is upstairs, refunds are on aisle 7, GPU pricing is in the basement — and walks straight to the right section. Some books in odd corners may be missed. The librarian moves a hundred times faster.

Mini-FAQ. "What does recall@10 actually mean, and how is it measured?" Recall@10 is the fraction of the true top-10 nearest neighbours (computed by brute force on a held-out sample) that your ANN system also returns in its top-10. Measured offline: pick 1000 random queries, run exact KNN to get ground truth, then run your ANN index and count overlap. Production targets: 0.95–0.99 for most RAG. Below 0.9 starts hurting visible quality.
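The offline measurement described in the FAQ can be sketched as follows. This is a toy version under stated assumptions: `ann_search(query, k)` stands for whatever index you are evaluating, and the demo swaps in a deliberately lossy stand-in (an exact search over only half the corpus) so the recall number is visibly below 1.0.

```python
import math
import random


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def exact_top_k(query, vectors, k):
    """Ground truth: ids of the k most cosine-similar vectors, by brute force."""
    ranked = sorted(range(len(vectors)), key=lambda i: cosine(query, vectors[i]), reverse=True)
    return ranked[:k]


def recall_at_k(true_ids, ann_ids, k):
    """Fraction of the true top-k that the ANN result also contains."""
    return len(set(true_ids[:k]) & set(ann_ids[:k])) / k


def mean_recall_at_k(queries, vectors, ann_search, k=10):
    """Average recall@k over a query sample; ann_search(query, k) returns ids."""
    scores = [recall_at_k(exact_top_k(q, vectors, k), ann_search(q, k), k) for q in queries]
    return sum(scores) / len(scores)


rng = random.Random(0)
vectors = [[rng.gauss(0, 1) for _ in range(16)] for _ in range(500)]
queries = [[rng.gauss(0, 1) for _ in range(16)] for _ in range(20)]


def half_corpus_search(query, k):
    # Stand-in "ANN": exact search that never looks at the second half of the corpus.
    return exact_top_k(query, vectors[:250], k)


print(mean_recall_at_k(queries, vectors, half_corpus_search))
```

In a real evaluation the query sample should come from production traffic or a held-out set, not from random vectors.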

2) IVF — cluster first, search nearby¶

IVF stands for Inverted File index. The idea is older than HNSW and simpler to picture.

Step 1 — partition the space. Run k-means on a sample of vectors. Pick nlist centroids (typical: 1024, 4096, or 16384 for 50M corpora). Every vector now belongs to its nearest centroid's cluster.

Step 2 — at query time, probe a few clusters. The query vector is compared against every centroid, and the search runs only inside the nprobe nearest clusters (typical: 8 to 64). Everything else on the shelf is skipped.

Here is the picture. Two-dimensional projection of a clustered embedding space.

        ◦ ◦                       ◦ ◦ ◦
       ◦ C1 ◦                    ◦ C2 ◦
        ◦ ◦                       ◦ ◦
                                   ◦
              ◦ ◦                   ◦ ◦
             ◦ C3 ◦       q★      ◦ C4 ◦   ← probe these
              ◦ ◦  ── ── ── ──     ◦ ◦       two clusters
                  (nearest 2)        ◦
        ◦ ◦                       ◦ ◦ ◦
       ◦ C5 ◦                    ◦ C6 ◦
        ◦ ◦                       ◦ ◦

   Skipped: C1, C2, C5, C6      (the librarian
                                  never walks
                                  those aisles)

The math is friendly. If you have 1024 clusters of roughly equal size, and you probe 16 of them (nprobe=16), you search about 1.6% of the corpus. On our 50M shelf, that is roughly 780K dot products instead of 50M, plus 1024 cheap centroid comparisons. About a 64x speedup before any other trick.

The recall cliff. A relevant chunk that sits in a skipped cluster is gone: the query has no way to find it. Increase nprobe and recall recovers, but cost rises with it. Typical sweet spot for 50M vectors at 1536 dims with IVF: nlist=4096, nprobe=32, recall@10 ≈ 0.93, query time ≈ 15–25 ms on a single CPU thread.
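A toy IVF makes the nlist/nprobe trade concrete. This is a minimal pure-Python sketch, not how FAISS or any production engine implements it: plain Lloyd's k-means, tiny random data, and function names chosen for illustration.

```python
import random


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_centroid(point, centroids):
    return min(range(len(centroids)), key=lambda c: sq_dist(point, centroids[c]))


def train_centroids(vectors, nlist, iters=5, seed=0):
    """Plain Lloyd's k-means, seeded from nlist random vectors."""
    centroids = [list(v) for v in random.Random(seed).sample(vectors, nlist)]
    for _ in range(iters):
        members = [[] for _ in range(nlist)]
        for v in vectors:
            members[nearest_centroid(v, centroids)].append(v)
        for c, group in enumerate(members):
            if group:  # an empty cluster keeps its previous centroid
                centroids[c] = [sum(col) / len(group) for col in zip(*group)]
    return centroids


def build_inverted_lists(vectors, centroids):
    """Step 1: file every vector id under its nearest centroid."""
    lists = [[] for _ in centroids]
    for i, v in enumerate(vectors):
        lists[nearest_centroid(v, centroids)].append(i)
    return lists


def ivf_search(query, vectors, centroids, lists, k, nprobe):
    """Step 2: scan only the nprobe clusters whose centroids are closest to the query."""
    probe_order = sorted(range(len(centroids)), key=lambda c: sq_dist(query, centroids[c]))
    candidates = [i for c in probe_order[:nprobe] for i in lists[c]]
    candidates.sort(key=lambda i: sq_dist(query, vectors[i]))
    return candidates[:k], len(candidates)


rng = random.Random(42)
vectors = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(1000)]
centroids = train_centroids(vectors, nlist=16)
lists = build_inverted_lists(vectors, centroids)
query = [rng.gauss(0, 1) for _ in range(4)]
for nprobe in (1, 4, 16):
    ids, scanned = ivf_search(query, vectors, centroids, lists, k=10, nprobe=nprobe)
    print(f"nprobe={nprobe:2d} scanned={scanned:4d} top-10={ids}")
```

With nprobe equal to nlist every list is scanned and the result is exact; smaller values scan fewer vectors and may drop neighbours that live in unprobed clusters.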

IVF with PQ — the memory-saving variant. Plain IVF still stores full float32 vectors. Pair it with Product Quantization (IVF-PQ) and each 1536-dim vector compresses to ~96 bytes instead of 6144. Our 50M corpus drops from 307 GB to ~5 GB of index, fitting in RAM on one machine. Recall takes a small hit. The cost-per-query plummets.

This is the classic FAISS recipe for billion-scale search. Billions of vectors, single-digit milliseconds, mostly RAM.
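The ~96-byte figure follows from how PQ stores a vector: split it into sub-vectors and keep one small codeword index per sub-vector. The sketch below assumes 96 sub-vectors of 16 dimensions with 8-bit codes, which is one configuration that yields 96 bytes; the source does not state the exact split. The `pq_encode` helper and its toy codebooks are illustrative only.

```python
def pq_code_bytes(dim, n_subvectors, bits_per_code=8):
    """Size of one PQ code: one small integer per sub-vector."""
    if dim % n_subvectors:
        raise ValueError("dim must split evenly into sub-vectors")
    return n_subvectors * bits_per_code // 8


def pq_encode(vector, codebooks):
    """Replace each sub-vector with the index of its nearest codeword.

    codebooks[s] is the list of codewords for sub-vector s.
    """
    sub_dim = len(vector) // len(codebooks)
    codes = []
    for s, book in enumerate(codebooks):
        sub = vector[s * sub_dim:(s + 1) * sub_dim]
        codes.append(min(
            range(len(book)),
            key=lambda c: sum((x - y) ** 2 for x, y in zip(sub, book[c])),
        ))
    return codes


code_bytes = pq_code_bytes(dim=1536, n_subvectors=96)    # assumed split: 96 x 16 dims
print(code_bytes, 1536 * 4 // code_bytes, 50_000_000 * code_bytes / 1e9)   # 96 64 4.8

toy_books = [[[0.0, 0.0], [1.0, 1.0]], [[0.0, 0.0], [1.0, 1.0]]]
print(pq_encode([0.1, -0.2, 0.9, 1.2], toy_books))       # [0, 1]
```

The 64x compression is where the recall hit comes from: distances are computed against codewords, not the original vectors.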

Mini-FAQ. "IVF vs HNSW — when does each win?" IVF wins on memory and on very large corpora (100M+) where RAM is the binding constraint. HNSW wins on recall-at-the-same-latency for medium corpora (1M–100M). IVF is what you reach for when the dataset is bigger than RAM. HNSW is what you reach for when latency-per-recall is the metric.

3) HNSW — graph search with express lanes¶

HNSW stands for Hierarchical Navigable Small World. Long name. The idea is a multi-layer subway map.

Bottom layer (Layer 0). Every vector is a node. Each node is connected to its M nearest neighbours (typical: M=16 to M=64). Walking this graph means hopping from neighbour to neighbour, getting closer each step.

Upper layers. Same nodes, but progressively sparser. Layer 1 has maybe 1/16th of the nodes; Layer 2 has 1/256th; and so on. The top layer is tiny — a handful of nodes spanning the whole space. These are the express lanes.

Search starts at the top. Greedy walk on a sparse layer covers huge distance fast. Drop down a layer, refine. Drop again. At Layer 0, do a careful local search controlled by ef_search (the candidate-list size during query).
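The 1/16 and 1/256 thinning of the upper layers falls out of how HNSW assigns each node a top layer at insert time: a random draw floor(-ln(U) · mL) with mL = 1/ln(M), as in the original HNSW paper. A short sketch, assuming M=16 to match the fractions quoted above:

```python
import math
import random


def sample_level(rng, M):
    """Draw a node's top layer: floor(-ln(U) * mL) with mL = 1 / ln(M)."""
    m_l = 1.0 / math.log(M)
    u = 1.0 - rng.random()          # in (0, 1], so log(u) is always defined
    return int(-math.log(u) * m_l)


def expected_layer_sizes(n, M, n_layers):
    """Expected number of nodes present on layers 0..n_layers-1."""
    return [n / M ** layer for layer in range(n_layers)]


print(expected_layer_sizes(50_000_000, 16, 4))

rng = random.Random(0)
levels = [sample_level(rng, 16) for _ in range(20_000)]
print(sum(level >= 1 for level in levels) / len(levels))   # close to 1/16
```

A node drawn at level 2 is present on layers 0, 1, and 2, which is why each layer is a subset of the one below it.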

Here is the shape.

   Layer 2  ●━━━━━━━━━━━━━━━━━━━━━━●            ← express lanes
            ┃                       ┃              (few nodes,
            ┃    entry point        ┃               long edges)
            ┃    starts here ↓      ┃
   Layer 1  ●━━━━━●━━━━━●━━━━━●━━━━●            ← regional
              ╲   ┃  ╲  ┃  ╱  ┃   ╱                (denser,
               ╲  ┃   ╲ ┃ ╱   ┃  ╱                  shorter
                ╲ ┃    ╲┃╱    ┃ ╱                   edges)
   Layer 0  ●─●─●─●─●─●─●─●─●─●─●─●─●            ← full graph
            │ ╲ │ │ │ │q│ │ │ │ │ │ │              (every node,
            ●─●─●─●─●─●─●─●─●─●─●─●─●               local edges)
                       ↑
                  query lands here,
                  walks to nearest neighbours

Latency tracks log(N) for the walk plus ef_search for the bottom-layer refinement. On 50M vectors at 1536 dims: query in 2–8 ms with recall@10 of 0.97. That is roughly 5x faster than IVF flat (15–25 ms), at higher recall.
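The bottom-layer refinement is a best-first search with a bounded result list of size ef. The sketch below shows that one step on a single layer; it omits the upper layers and the graph construction, and it uses a 20 × 20 grid as a stand-in graph (an assumption for illustration, since a real HNSW graph is built incrementally from the data).

```python
import heapq
import math


def search_layer(query, entry, vectors, neighbours, ef):
    """Best-first search over one graph layer, keeping the ef closest nodes seen."""
    d_entry = math.dist(query, vectors[entry])
    visited = {entry}
    candidates = [(d_entry, entry)]      # min-heap: closest unexpanded node first
    results = [(-d_entry, entry)]        # max-heap (negated): worst kept result on top
    while candidates:
        d, node = heapq.heappop(candidates)
        if len(results) >= ef and d > -results[0][0]:
            break                        # closest candidate is worse than every kept result
        for nb in neighbours[node]:
            if nb in visited:
                continue
            visited.add(nb)
            d_nb = math.dist(query, vectors[nb])
            if len(results) < ef or d_nb < -results[0][0]:
                heapq.heappush(candidates, (d_nb, nb))
                heapq.heappush(results, (-d_nb, nb))
                if len(results) > ef:
                    heapq.heappop(results)
    return sorted((-neg_d, node) for neg_d, node in results), len(visited)


# Toy graph: a 20 x 20 grid where each node links to its 4 grid neighbours.
W = 20
vectors = [(x, y) for y in range(W) for x in range(W)]
neighbours = [
    [ny * W + nx
     for nx, ny in ((x - 1, y), (x + 1, y), (x, y - 1), (x, y + 1))
     if 0 <= nx < W and 0 <= ny < W]
    for y in range(W) for x in range(W)
]

hits, visited = search_layer((7.3, 12.6), entry=0, vectors=vectors, neighbours=neighbours, ef=8)
print(hits[0], f"visited {visited} of {len(vectors)} nodes")
```

Raising ef widens the result list, so the search expands more nodes before it stops. That is the recall-for-latency dial the text calls ef_search.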

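The price is memory. Each node stores M edges. At M=32, that is 32 × 4 bytes (int32 IDs) = 128 bytes per node per layer, times the layer multiplier; implementations such as HNSWlib also allow up to 2M edges on Layer 0. For our 50M corpus the edge lists come to roughly 6–13 GB, on top of the 307 GB of full-precision vectors, and all of it has to sit in RAM for the walk to stay fast. At 1536 dims the vectors dominate the bill; at low dimensions the graph can rival them. That is why HNSW is a "small-medium corpus, premium recall" tool, not a billion-vector tool.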

The three knobs that matter.

Knob	What it controls	Typical	Effect of raising it
`M`	edges per node	16–64	Higher recall, more RAM
`ef_construction`	build-time effort	100–500	Better graph, slower build
`ef_search`	query-time effort	50–400	Higher recall, slower queries

The standard tuning recipe: set M=32, ef_construction=200 at build time, then sweep ef_search against your latency target on a held-out query set. The ef_search knob is the runtime dial — you can change it per query if some queries deserve more effort.
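The sweep itself reduces to picking the cheapest setting that clears both targets. A minimal sketch, assuming you have already measured recall and p99 at several ef_search values; the numbers in `sweep` are invented placeholders, not benchmark results.

```python
def pick_ef_search(sweep, p99_budget_ms, min_recall):
    """Smallest ef_search whose measured recall and p99 both meet the targets, else None."""
    passing = [
        row for row in sweep
        if row["recall_at_10"] >= min_recall and row["p99_ms"] <= p99_budget_ms
    ]
    return min(row["ef_search"] for row in passing) if passing else None


# Placeholder measurements, invented for illustration. Replace with your own sweep.
sweep = [
    {"ef_search": 50,  "recall_at_10": 0.91,  "p99_ms": 4.0},
    {"ef_search": 100, "recall_at_10": 0.95,  "p99_ms": 6.5},
    {"ef_search": 200, "recall_at_10": 0.97,  "p99_ms": 11.0},
    {"ef_search": 400, "recall_at_10": 0.985, "p99_ms": 21.0},
]
print(pick_ef_search(sweep, p99_budget_ms=15.0, min_recall=0.97))   # 200
print(pick_ef_search(sweep, p99_budget_ms=5.0, min_recall=0.97))    # None
```

A `None` result means no runtime setting satisfies both targets, which points back at build-time knobs (M, ef_construction) or at the hardware.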

Mini-FAQ. "Why is HNSW so memory-hungry?" Because it stores the graph in addition to the vectors. Every edge is a pointer. M=32 means 32 pointers per node per layer, and layers stack. Memory grows roughly linearly with N but the constant is large. DiskANN (next section) is HNSW's "spill the graph to SSD" cousin, built precisely to dodge this cost.

4) DiskANN — when the corpus is too big for RAM¶

Microsoft's DiskANN solves the billion-vector case. Idea: keep only a compact footprint in RAM (compressed copies of the vectors plus a cache of frequently visited graph nodes), push the full-precision vectors and the graph's edge lists to NVMe SSD, and use a layout that turns each graph hop into roughly one SSD read.

On a single machine with a fast NVMe, DiskANN serves on the order of 1B vectors at roughly 100 dimensions with about 64 GB of RAM, recall@10 ≈ 0.95, and p99 query latency around 5–10 ms. The full-precision vectors and the graph live on SSD. That is a corpus that would need several hundred GB of RAM with HNSW. SSD pricing changes the economics of "vector search at scale."

DiskANN is what powers Microsoft's Bing and what shows up under the hood in Azure AI Search at very large scales. You will rarely tune it by hand. You will, however, recognize its signature: low RAM, NVMe SSD, very high vector counts.

5) The tradeoff triangle — recall, latency, index size¶

You cannot maximize all three. Pick two.

                      RECALL
                       ▲
                      ╱│╲
                     ╱ │ ╲
                    ╱  │  ╲
                   ╱   │   ╲
                  ╱   HNSW   ╲       (high recall,
                 ╱  (in-RAM)  ╲       low latency,
                ╱      ●       ╲      huge memory)
               ╱       │        ╲
              ╱   DiskANN        ╲   (high recall,
             ╱      ●              ╲  medium latency,
            ╱       │                ╲ small memory)
           ╱   IVF-PQ                 ╲ (medium recall,
          ╱      ●                     ╲ low latency,
         ╱       │                      ╲ tiny memory)
        ╱        │                       ╲
       ╱─────────┴────────────────────────╲
   LATENCY ◀────────────────────────▶  INDEX SIZE

Concrete numbers for our 50M-chunk, 1536-dim corpus:

Index	RAM	Build time	p50 query	p99 query	recall@10
Brute force	307 GB	0	280 ms	320 ms	1.00
IVF flat	310 GB	30 min	15 ms	35 ms	0.93
IVF-PQ	6 GB	90 min	8 ms	20 ms	0.88
HNSW (M=32)	~320 GB	4 hr	3 ms	8 ms	0.97
DiskANN	30 GB RAM + 350 GB SSD	8 hr	6 ms	15 ms	0.95

Numbers vary by hardware and library; the ratios are what matter. Always benchmark on your own corpus before quoting any of these in a design doc.

Pick the index for a 5M-chunk corpus, then read on¶

Before reading the next section, predict: a 5M-chunk corpus where queries must finish in 20 ms, the cluster has 64 GB of RAM, and recall must be at least 0.97. Which index do you reach for, and why? Then continue and check.

6) The metadata problem — filtered search is harder than it looks¶

Real queries are never "give me chunks near this vector." They are "give me chunks near this vector, from tenant X, in product Y, written after Jan 2026, that this user is allowed to read."

Filters wreck ANN's assumptions. Three strategies, all with sharp edges.

Pre-filter. Apply the metadata filter first, then run ANN on the filtered subset. Sounds clean. Breaks when the filter is highly selective — if only 0.1% of vectors match, the ANN index is largely empty in the relevant region, and recall collapses because the graph's edges point to vectors that got filtered out. The walk runs into dead ends.

Post-filter. Run ANN on the full corpus, then drop results that fail the filter. Works when the filter is loose. Breaks when the filter is tight — you may pull top-100 candidates and zero pass the filter, and now you have no results. The fix is to overfetch dramatically (top-1000? top-10000?), which kills latency.

Hybrid (filtered HNSW / payload-aware graph). Qdrant, Weaviate, Milvus, and Pinecone all build variants of HNSW where the graph walk itself checks filters at each step. Better recall under tight filters, more bookkeeping. This is the modern default.
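The over-fetch tax of the post-filter strategy is easy to see in code. In this sketch `ann_search(n)` is a stand-in for any index call that returns the n best ids for a fixed query, and the doubling schedule and `max_fetch` cap are assumptions chosen for illustration.

```python
def post_filter_search(ann_search, passes_filter, k, max_fetch=10_000):
    """Post-filter with over-fetch: widen the ANN request until k results survive."""
    fetch = k
    while True:
        candidates = ann_search(fetch)                 # ranked ids, best first
        hits = [c for c in candidates if passes_filter(c)]
        exhausted = len(candidates) < fetch            # index has nothing more to give
        if len(hits) >= k or exhausted or fetch >= max_fetch:
            return hits[:k], fetch
        fetch = min(fetch * 2, max_fetch)


ranked = list(range(1000))                  # pretend ANN ranking for one query
ann = lambda n: ranked[:n]
one_percent = lambda doc_id: doc_id % 100 == 0

print(post_filter_search(ann, one_percent, k=3))            # ([0, 100, 200], 384)
print(post_filter_search(ann, lambda doc_id: True, k=3))    # ([0, 1, 2], 3)
```

With a filter that passes 1% of documents, three results cost a 384-candidate fetch after eight round trips. Filter-aware graph walks exist to avoid exactly this.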

Mini-FAQ. "Why is filtered search hard if filters are just SQL WHERE clauses?" Because ANN does not scan rows. It walks a graph or probes clusters that were built assuming all vectors were candidates. Filter out 95% of them and the graph's connectivity assumptions break — the librarian has memorized routes that now lead to closed doors. Filtered HNSW patches this by checking the filter mid-walk and steering around dead ends.

7) When does pgvector hit its wall?¶

pgvector is a Postgres extension. You get vector search inside the database that already holds your users, your orders, your invoices. Joins work. ACID works. Filtering by WHERE tenant_id = $1 works using ordinary B-tree indexes. For many teams, this is the right answer.
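A hedged sketch of what this looks like in practice. The table and column names (`chunks`, `tenant_id`, `body`, `embedding`) are hypothetical, the `%s` placeholders follow the psycopg convention, and the index options shown are pgvector's documented HNSW parameters. The code only builds SQL strings and parameters; it does not open a connection.

```python
SCHEMA_SQL = """
CREATE EXTENSION IF NOT EXISTS vector;
CREATE TABLE IF NOT EXISTS chunks (
    id bigserial PRIMARY KEY,
    tenant_id bigint NOT NULL,
    body text NOT NULL,
    embedding vector(1536) NOT NULL
);
CREATE INDEX IF NOT EXISTS chunks_tenant_idx ON chunks (tenant_id);
CREATE INDEX IF NOT EXISTS chunks_embedding_hnsw
    ON chunks USING hnsw (embedding vector_cosine_ops)
    WITH (m = 16, ef_construction = 64);
"""

# Query-time effort for HNSW scans in the current session.
SET_EF_SEARCH_SQL = "SET hnsw.ef_search = 100;"


def to_vector_literal(embedding):
    """Format a Python sequence as a pgvector text literal such as '[0.1,0.2]'."""
    return "[" + ",".join(repr(float(x)) for x in embedding) + "]"


def tenant_search_query(tenant_id, embedding, k=10):
    """Build a tenant-filtered cosine-distance query (<=> is pgvector's cosine operator)."""
    sql = (
        "SELECT id, body, embedding <=> %s::vector AS cosine_distance "
        "FROM chunks WHERE tenant_id = %s "
        "ORDER BY embedding <=> %s::vector LIMIT %s"
    )
    literal = to_vector_literal(embedding)
    return sql, (literal, tenant_id, literal, k)


sql, params = tenant_search_query(tenant_id=7, embedding=[0.1, 0.2], k=5)
print(sql)
print(params)
```

Whether Postgres uses the HNSW index or the B-tree on `tenant_id` for a given query is the planner's decision, so `EXPLAIN ANALYZE` on representative queries is worth the time.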

It also has limits. Rough operating ranges, as of 2026:

Under 1M vectors. pgvector with HNSW is fast enough. p50 query in 2–10 ms. No reason to add a dedicated vector DB.
1M to 10M vectors. Works, but tune carefully. HNSW builds can take hours on Postgres, even with the parallel index builds added in pgvector 0.6, and slow down sharply once the graph outgrows maintenance_work_mem. Memory pressure shows up. Filtered search with tight filters works well precisely because Postgres knows how to combine indexes.
10M to 100M vectors. Painful. HNSW build times stretch to many hours, memory pressure forces partitioning, and Postgres replication does not love giant indexes. Many teams who started on pgvector migrate out here.
Above 100M vectors. Wrong tool. Move to a dedicated vector DB or DiskANN-backed engine.

Mini-FAQ. "When does pgvector hit its wall?" Roughly 10–50M vectors, depending on dimensions and filter complexity. The warning signs: index build times measured in days, query p99 creeping above 100 ms, and replica lag spiking on write bursts. When you see all three, plan the migration.

8) The vector DB landscape — when each one fits¶

There are too many vector databases. The question is never "which is best?" — it is "which fits my constraints?"

┌────────────────────────────────────────────────────────┐
│  CORPUS SIZE     │  DEFAULT CHOICE                     │
├────────────────────────────────────────────────────────┤
│  < 100K          │  Chroma, LanceDB, in-memory FAISS   │
│  100K – 1M       │  pgvector, Qdrant, Weaviate         │
│  1M – 50M        │  Qdrant, Weaviate, Pinecone,        │
│                  │  Milvus, OpenSearch, Vespa          │
│  50M – 1B        │  Milvus, Vespa, Pinecone, Vertex    │
│                  │  Matching Engine, Turbopuffer       │
│  > 1B            │  Vespa, DiskANN-backed engines,     │
│                  │  Vertex Matching Engine, ScaNN      │
└────────────────────────────────────────────────────────┘

Managed (you don't run the box). Pinecone, Vertex Matching Engine, Azure AI Search, AWS OpenSearch Service vector, MongoDB Atlas Vector Search, Turbopuffer.

Self-hosted (open source, you run it). Weaviate, Qdrant, Milvus, Vespa, Marqo, Vald, Chroma, LanceDB.

Libraries (no server, you embed in your app). FAISS (Meta), ScaNN (Google), HNSWlib, USearch, DiskANN (Microsoft).

Inside-the-database (extensions). pgvector for Postgres, Redis Vector, Elastic kNN, OpenSearch kNN.

Mini-FAQ. "Dedicated vector DB vs pgvector vs an ANN library — when each?" Library (FAISS/USearch) when you want full control and the index lives inside your service. pgvector when the corpus fits and you already have Postgres. Dedicated vector DB when scale, filtered search, multi-tenancy, or operational concerns push past what your database can do. Start small. Migrate when you see the warning signs, not before.

9) Failure modes you will see in production¶

Eight things that go wrong, and what causes each.

Recall cliff under load. ef_search is fine on a quiet box but you bumped QPS by 10x. The shared candidate-list allocator hits contention, you lower ef_search to keep latency, recall drops below 0.9, retrieval quality visibly tanks.
Filtered search collapsing recall. Tight filter, pre-filter strategy, ANN graph has no path through the surviving subset. Returns 3 results when you asked for 20.
Memory ballooning with HNSW. You sized the box for vectors. You forgot the graph. Day 30, OOM at 2 am.
Oversharding. 8 shards for 5M vectors. Each query hits all 8 shards, network overhead dominates, p99 is worse than a single shard. Fewer shards are often faster until you are forced to grow.
Undersharding. 1 shard for 200M vectors. Build never finishes. Index too big to fit on one machine. The opposite mistake.
Stale index after re-embedding. You changed embedding models. Old vectors in the index. New query vectors in a different geometry. Recall is mathematically near zero and nobody notices until quality complaints pile up.
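Wrong distance metric. Index built for L2 distance, queries computed for cosine. Results look "reasonable" but precision is broken in subtle ways. (On unit-normalized vectors the two metrics rank identically; the bug bites when vectors are not normalized.) Always pin the metric to the embedding model's training metric.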
Background compaction freezing queries. Some engines (older Milvus, certain Elasticsearch versions) stall reads during segment merges. p99 spikes for 30–90 seconds, on a schedule.

Each of these is fixable. Each of these has bitten someone you respect.
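Two of the failures above, the wrong distance metric and the stale index, are silent, so they are worth guarding in code. The sketch below first shows L2 and cosine disagreeing on un-normalized vectors and agreeing once the vectors are unit length, then shows a small guard that refuses to search when the query side does not match how the index was built. `IndexMetadata` and `check_query_compatible` are hypothetical names, not part of any library.

```python
import math
from dataclasses import dataclass


def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return 1.0 - dot / (math.hypot(*a) * math.hypot(*b))


def normalize(v):
    norm = math.hypot(*v)
    return [x / norm for x in v]


def rank(query, vectors, distance):
    return sorted(range(len(vectors)), key=lambda i: distance(query, vectors[i]))


query = [1.0, 0.0]
vectors = [[10.0, 1.0], [0.9, 0.5]]
print(rank(query, vectors, cosine_distance))                      # [0, 1]
print(rank(query, vectors, math.dist))                            # [1, 0]
print(rank(query, [normalize(v) for v in vectors], math.dist))    # [0, 1]


@dataclass(frozen=True)
class IndexMetadata:
    embedder: str
    dim: int
    metric: str


def check_query_compatible(meta, embedder, dim, metric):
    """Refuse to search when the query side does not match how the index was built."""
    problems = [
        f"{name}: index={have!r} query={got!r}"
        for name, have, got in (
            ("embedder", meta.embedder, embedder),
            ("dim", meta.dim, dim),
            ("metric", meta.metric, metric),
        )
        if have != got
    ]
    if problems:
        raise ValueError("index/query mismatch: " + "; ".join(problems))
```

Storing the embedder identifier next to the index and checking it on every query path turns a slow quality regression into an immediate, loud error.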

ANN indexes across shipped vector stores¶

The same ANN ideas show up in every product that searches over embeddings. The list is long because every major vendor has shipped some flavour.

Pinecone — managed vector DB, custom hybrid index, the canonical "we run it for you" choice.
Weaviate — open-source vector DB, HNSW + filter-aware graph, modular embedding adapters.
Qdrant — open-source vector DB built in Rust, payload-filter-aware HNSW, popular for filtered search.
Milvus / Zilliz Cloud — open-source vector DB optimized for billions of vectors, IVF/HNSW/DiskANN backends.
Vespa — Yahoo's open-source engine, hybrid sparse+dense, used by Spotify, Yahoo Mail, large retailers.
pgvector — Postgres extension, HNSW and IVF-Flat, what most prototypes start on.
Elastic vector search — kNN inside Elasticsearch via HNSW, integrates with existing text search.
AWS OpenSearch Service vector — kNN plugin, HNSW and IVF, runs alongside OpenSearch text indexes.
Redis Vector (Redis Stack) — in-memory vector index, HNSW and FLAT, very low latency at small scale.
MongoDB Atlas Vector Search — vector index inside MongoDB collections, HNSW under the hood.
Chroma — local-first embedded vector store, HNSWlib backend, dominant in notebooks and prototypes.
LanceDB — columnar embedded vector DB, IVF-PQ, optimized for serverless and edge deployment.
Marqo — managed search engine combining text and vector with its own retrieval stack.
USearch — header-only C++ library, single-file embeddable HNSW, used inside many SaaS products.
FAISS (Meta) — the foundational ANN library, IVF/HNSW/PQ/OPQ, what most other engines borrow from.
ScaNN (Google) — Google's anisotropic-quantization ANN library, powers parts of Google Search and YouTube recommendations.
DiskANN (Microsoft) — SSD-based ANN, powers Bing and Azure Cognitive Services at scale.
HNSWlib — the reference HNSW implementation, embedded in Chroma and many others.
Vertex AI Matching Engine — Google Cloud's managed ANN, used by Mercari, Snap, and large e-commerce.
Azure AI Search — Microsoft's managed search with vector + semantic reranker, DiskANN-backed at scale.
Turbopuffer — object-storage-backed vector DB, optimized for very low cost per stored vector.
Vald — Kubernetes-native distributed ANN built by Yahoo Japan, used in their internal search.
Spotify Voyager — Spotify's open-source HNSW library, drives parts of music recommendation.
Jina AI's Finetuner / Vector DB — Jina's stack for embedding + retrieval pipelines.
Twitter's Embedding-Based Retrieval (HNSW + filters) — search and recommendation at multi-billion scale.

Different names. Same three ideas under the covers: cluster the space, walk a graph, or push the heavy parts to disk.

Recall — index choices cold, eight questions¶

Why is brute-force vector search O(n) per query, and at what corpus size does it visibly hurt?
What does recall@10 measure, and what is a typical production target?
IVF: what are nlist and nprobe, and how do they trade recall against latency?
HNSW: what is the role of upper layers, and why does memory grow with M?
When would you choose DiskANN over HNSW?
Why does pre-filtering break ANN recall under tight filters?
Name three signs that pgvector has reached its wall on your workload.
Why is "which vector DB should I use?" a trap question without context?

Interview Q&A¶

Q1. Why do production systems use ANN instead of exact nearest neighbours? A. Exact KNN is O(n) per query. At 50M vectors and 1536 dims, that is ~300 ms per query on one machine, which collapses any system above a handful of QPS. ANN trades a small recall loss (recall@10 of typically 0.95–0.99 instead of 1.0) for 30–100x lower latency. Common wrong answer to avoid: "ANN is required because exact similarity is computationally undefined."

Q2. Explain HNSW in two sentences. A. HNSW builds a multi-layer graph where upper layers are sparse express lanes and the bottom layer connects every node to its M nearest neighbours. Queries enter at the top, walk greedily toward the target, drop down layers, and refine — giving roughly log-N query time at the cost of large in-memory edge storage. Common wrong answer to avoid: "HNSW is cosine similarity with a tree on top."

Q3. Explain IVF in two sentences. A. IVF clusters all vectors into nlist buckets via k-means; at query time, the query picks the nearest nprobe clusters and searches only inside them. Recall depends on nprobe — too low and relevant vectors in skipped clusters are lost. Common wrong answer to avoid: "IVF stores only the top vectors and discards the rest."

Q4. Which vector database should I use for my project? A. The right answer is a question back: what are your constraints? Corpus size, filter selectivity, write rate, latency target, ops budget, existing stack. Under 1M with Postgres already in play → pgvector. 1–50M with tight filters → Qdrant or Weaviate. 50M–1B managed → Pinecone or Vertex Matching Engine. Above 1B → Vespa, Milvus, or DiskANN-backed. Picking before knowing constraints is the classic mistake. Common wrong answer to avoid: "Pinecone, it's the most popular."

Q5. Why is filtered vector search harder than it looks? A. ANN indexes assume all vectors are valid candidates. Tight pre-filters strand the graph in subspaces where edges lead to filtered-out neighbours; tight post-filters drop too many candidates after retrieval. Modern engines build filter-aware graph walks (Qdrant, Weaviate, Pinecone) that check the filter mid-walk and steer around dead ends. Common wrong answer to avoid: "You just add a WHERE clause after the search."

Q6. You see recall@10 drop from 0.97 to 0.85 under high QPS. What's likely happening? A. Probable causes: ef_search was lowered to meet latency, candidate-list allocator contention, or the index was sharded and per-shard ef_search is too low. Diagnose by measuring recall@10 at different QPS levels offline and by checking that ef_search is honoured under load. Fix: raise ef_search, scale horizontally, or move to a higher-recall index variant. Common wrong answer to avoid: "Just retrain the embedding model."

Q7. When does HNSW lose to DiskANN? A. When the corpus is too big to keep the graph in RAM. HNSW's per-vector memory overhead (vector + edges) makes 500M+ vectors prohibitively expensive in pure RAM. DiskANN keeps a small navigation graph in RAM and the bulk on NVMe SSD, hitting recall@10 ≈ 0.95 at billion-vector scale on a single machine. Common wrong answer to avoid: "DiskANN is always slower because it uses SSD."

Q8. Your team built RAG on pgvector. Queries are now p99 = 400 ms at 5M vectors. What do you investigate before migrating? A. Before migrating: (1) confirm HNSW index is actually built, not falling back to sequential scan; (2) check maintenance_work_mem was large enough during index build; (3) profile filtered queries — Postgres may be choosing the wrong index combination; (4) verify autovacuum on the table; (5) measure recall@10 separately so you know if you have a recall problem or a latency problem. After all that, if you still cannot meet target, migrate to Qdrant, Weaviate, or Pinecone. Common wrong answer to avoid: "Just switch to Pinecone."

Apply now (10 min)¶

Step 1 — model the exercise. Here is how I would size an index for a 20M-chunk legal corpus with tight tenant filters and a p95 target under 50 ms.

Decision	Choice	Why
Index type	HNSW with filter-aware walk	Tight filters demand it; 20M fits in RAM
Engine	Qdrant or Weaviate	Both support payload-aware HNSW
M	32	Standard for high-recall text retrieval
`ef_construction`	200	Build-time effort, one-shot cost
`ef_search`	128 (tuned)	Sweep against p95 target
Sharding	2 shards	Headroom for growth, low network overhead
Re-embed plan	Blue-green index	Avoid stale-index trap during model upgrades

Step 2 — your turn. Pick a real corpus you know. Write the same seven rows. For each row, justify the choice in one sentence. Then write the one number you would measure to know if the choice was wrong (e.g., recall@10 < 0.95, p95 > target, RAM > budget).

Step 3 — sketch from memory. Draw the IVF picture (centroids with probed regions) and the HNSW picture (layered graph with entry point and walk). Label where the librarian saves time in each. If you can do this cold, you understand ANN.

What you should remember¶

This chapter explained why brute-force nearest-neighbour search dies at scale and how three families — IVF, HNSW, and DiskANN — buy back acceptable latency by trading a small slice of recall for orders of magnitude speedup. Each one is a different bet about where the bottleneck lives: IVF clusters to skip most of the corpus, HNSW walks a layered graph in roughly log-N time, DiskANN pushes the heavy parts onto NVMe so billion-vector indexes fit on a single box.

You also learned that filtered ANN is the silent assassin of production RAG. Pre-filter strands the graph in subspaces with dead-end edges; post-filter collapses to zero results when the filter is tight. Filter-aware HNSW — Qdrant, Weaviate, Pinecone, Milvus — is the modern fix and the reason "just add a WHERE clause" is the wrong mental model.

Carry this diagnostic forward: when recall drops under load, suspect ef_search being lowered to meet latency before suspecting the embedder. When recall drops after a model swap, suspect a stale index built in a different geometry. When pgvector p99 creeps over 100 ms and index builds are measured in days, you have crossed the wall. Plan the migration before quality complaints arrive.

Remember:

Brute force is fine under 100K. Above that, an ANN index is mandatory.
HNSW for in-RAM workloads under ~500M. DiskANN above that. IVF for memory-constrained mid-scale.
Tight filters break vanilla ANN. Use filter-aware HNSW or pay the over-fetch tax.
pgvector is the right answer up to ~10M. After that the index-build and replication math turns against you.
Pin the metric to the embedder's training metric, and version-pin the embedder in index metadata. Stale-index bugs are silent.

Bridge. The bookshelf is now real. Cluster centroids and graph layers are not magic — they are the librarian's shortcut routes through a corpus too large to scan. With storage and search both grounded, it is time to walk the full path: from the user's question to the grounded answer, eight stages and eight failure modes.

→ 08-rag-pipeline.md