10. Scaling and sharding — one warehouse becomes many buildings¶

~13 min read. ANN quality is useless if a single box must serve every tenant, every query, and every write.

Continues from the first-principles overview in 00-first-principles.md. The warehouse floor — the space where parcels sit — eventually becomes too large for one machine, so we split the warehouse and coordinate many scout robots.

1) Why scaling is tricky for vector search¶

Begin with a concrete workload: a support assistant serves millions of document chunks across hundreds of tenants, and the index no longer fits comfortably on one machine. Scaling ordinary key-value storage is already work; vector search adds geometry, because useful neighbors can live anywhere on the warehouse floor. A careless shard plan either splits nearby vectors across buildings, which can hurt recall, or forces every query to fan out everywhere, which can hurt latency.

Use this picture as the mental model before the details.

single node
query -> one index -> results

many nodes
query -> router -> shard 1
                -> shard 2
                -> shard 3
                -> merge top k

The moment you shard a vector index, the system owns routing, fan-out, global top-k merge logic, replica consistency, and tenant isolation. That is more than "add servers." The scout robot is no longer walking one warehouse; it is coordinating a search across several buildings while still trying to return the closest parcels.

2) Common sharding strategies¶

Strategy one — hash by document ID. This balances load nicely, but it scatters geometric neighbors randomly, so most queries must fan out to many shards and wait for the slowest shard plus merge cost.

Strategy two — partition by tenant. This is excellent when most queries are tenant-scoped, because the aisle sticker and the shard boundary align: security is simpler, cache locality improves, and the scout robot avoids buildings it was never allowed to enter.

Strategy three — partition by semantic clusters. Nearby vectors live together more often, which can reduce fan-out, but rebalancing becomes painful when the corpus shifts and yesterday's clusters no longer match today's traffic.

Strategy four — hybrid partitioning. Split first by tenant or region, then by hash or semantic cluster within that boundary; this is common in SaaS systems because it respects authorization first and optimizes geometry second.

The trade table is easier to read as a picture.

hash shard      -> easy balance, high fan-out
tenant shard    -> low fan-out for tenant queries, possible hot tenants
semantic shard  -> locality friendly, rebalancing hard
hybrid shard    -> practical compromise, more operational complexity

3) Worked numerical example: fan-out cost¶

Suppose one query takes 20 ms on a single shard. After a hash split into four shards, every query must hit all four because geometric neighbors were scattered randomly.

shard 1 = 18 ms
shard 2 = 21 ms
shard 3 = 25 ms
shard 4 = 19 ms
merge overhead = 4 ms

Total latency is not the average shard time; it is roughly the slowest shard plus merge overhead. In this case the request takes about 25 + 4 = 29 ms. With 16 shards, one slow outlier at 40 ms can dominate the query; add a 5 ms merge step and the user sees roughly 45 ms even if most shards were healthy.

Tenant sharding changes the shape of the request. If the query belongs to tenant Acme and Acme has a dedicated shard, the router may search one shard, pay 21 ms of shard work, add 2 ms of merge overhead, and finish around 23 ms. The practical rule is to align the shard key with the dominant query filter when possible. The scout robot should not visit every building if the request slip already names one campus.

4) Replication and consistency¶

Sharding spreads data; replication protects availability. Most production systems need both, which means writes, reads, and freshness now have to be planned together. A primary-replica pattern is common: writes land on the primary shard, while reads may hit replicas once replication catches up.

The sketch looks like this.

write -> primary shard
          ├─ replica A sync later
          └─ replica B sync later

query right after write may miss newest vector on a stale replica

In production, the failure is that a newly indexed vector may not appear immediately on every replica. Users experience that as missing knowledge, not as an internal replication delay. If the product promises read-after-write behavior, the router may need to query the primary for a short window, merge a fresh-write delta index, or keep a small buffer for recent documents. Consistency is a product requirement, not merely a storage setting.

5) Hot shards, rebalancing, and multi-stage retrieval¶

Traffic is rarely uniform. A giant tenant, a celebrity catalog, or a trending corpus can turn one shard into the bottleneck while the rest of the fleet looks calm. The practical response is asymmetric: split that tenant further, add replicas, cache its top queries, or give it a custom retrieval path. There is no shame in treating hot workloads differently when the measurements justify it.

Rebalancing is harder for vector systems than for simple row stores. Moving vectors between shards may require rebuilding or compacting HNSW graphs, IVF lists, or payload indexes on the destination. The loading dock needs migration tooling, not only a copy command.

Multi-stage retrieval adds one more tradeoff. Each shard returns local top-r candidates, the router merges those candidates, and the final system keeps global top-k. If you need global top-10 across five shards and each shard returns only top-2, the merged pool contains at most 10 candidates; one shard might actually contain six of the true top-10, so four good results vanish. If each shard returns top-20, recall improves but network, merge, and rerank cost rise. The route map now lives inside each shard and across shards, which is why large vector systems are distributed systems wearing ANN clothes.

6) Why not one giant index forever under this workload¶

The tempting alternative is one giant index forever because it keeps the architecture small and makes the first demo look clean. That story is useful for a prototype, but it becomes dangerous once the workload has real scale, filters, freshness pressure, and evaluation data.

It fails when one vector index becomes many shards, and fan-out multiplies latency and recall risk. At that point the system needs an inspectable artifact — fan-out plan with shard key, replicas, merge step, and p99 latency — because otherwise every bad answer turns into a vague argument about whether embeddings, ANN, metadata filters, lifecycle, or evaluation are guilty.

Option	Works when	Fails when	Cost moves to
one giant index forever	corpus is small or low-risk	one vector index becomes many shards, and fan-out multiplies latency and recall risk	latency, recall, or user trust
scaling and sharding	the failure can be measured in the index path	traces or baselines are missing	memory, rebuilds, evals, operations

Mini-FAQ. "Is this always worth adding?" No. The RAG-fundamentals rule still applies: add machinery only when a measured workload pressure earns it. If exact search is cheap, if filters are simple, or if evaluation is missing, the clever index can become a more expensive way to stay confused.

7) Production signals — know whether scaling and sharding is working¶

Healthy behavior means fan-out plan with shard key, replicas, merge step, and p99 latency explains why the returned neighbors changed. In a real incident review, you should be able to point at that artifact and explain why the candidate set changed, not merely say that the database returned something.

The first metric to watch is p99 fan-out latency and cross-shard recall. Track it by query family, tenant, corpus slice, and index version, because global averages hide exactly the failures users notice first.

The misleading metric is database uptime. A vector database can be perfectly available while recall, filtering, freshness, or embedding compatibility is broken, so uptime only proves the warehouse doors opened; it does not prove the scout robot found the right shelf.

The expert graph compares exact baseline recall, p50/p99 latency, filter selectivity, index version, embedding version, and bad-query examples by slice. That graph is the difference between tuning knobs and debugging a retrieval system.

bad retrieval
   -> query vector / filter
   -> index path
   -> candidate neighbors
   -> score and metadata trace
   -> exact baseline or judged list

8) Boundary — where scaling and sharding helps and where it does not¶

Use this mechanism when the failure happens inside vector geometry, index traversal, filtering, lifecycle, or serving operations. That is the zone where vector-database machinery can actually change the returned neighbors, the latency curve, or the operational envelope.

Do not expect it to fix cases where the source content is wrong, the embedding model is poor for the domain, or the product definition of relevance is unresolved. Those are upstream or product-definition failures, and better ANN settings will only make the wrong evidence arrive faster.

The common pathology is that teams keep tuning ANN knobs when the real issue is bad chunks, stale data, weak labels, or missing evals. In interviews, call this out explicitly: the index is not the whole retrieval system, it is one stage inside a pipeline that also depends on documents, chunks, labels, and evals.

The scale limit is blunt: every improvement spends something — RAM, disk, build time, query latency, engineering time, or vendor lock-in. The mature answer is not to pick the fanciest mechanism; it is to choose the pressure you are willing to pay for.

9) Wrong model — sharding only adds more machines¶

The wrong model is attractive because it compresses the system into one easy story, and easy stories feel good in design docs. The trouble is that production vector search is not one story; it is embedding quality, distance metric, ANN index, metadata filters, lifecycle, sharding, vendor operations, and monitoring all interacting under traffic.

If scaling and sharding cannot change recall, latency, cost, freshness, or debug visibility, it is not carrying its weight; it is vocabulary without leverage.

10) Failure taxonomy for scaling and sharding¶

Geometry failure — the embedding space does not put useful neighbors close enough.
Metric failure — the chosen similarity ruler disagrees with the model or workload.
Index failure — ANN skips relevant vectors or returns unstable candidates.
Filtering failure — metadata filters erase good candidates or violate scope.
Lifecycle failure — stale, mixed-version, or partially rebuilt indexes serve traffic.
Scale failure — fan-out, memory, or rebuild cost breaks the SLO.
Debugging failure — no trace connects query vector, index path, candidates, and final result.

11) Pattern transfer — where this returns later¶

RAG uses vector DBs as the evidence gateway before generation.
Retrieval and ranking supplies the metrics and fusion logic used here.
Data engineering supplies chunk quality, metadata, and embedding-version hygiene.
Production evals decide whether recall and relevance changes actually help users.

12) Design review checklist¶

What pressure is this mechanism relieving: latency, memory, filtering, freshness, scale, or evaluation?
What artifact would you inspect first: vector neighbors, index trace, filter plan, namespace manifest, or exact baseline?
Why is one giant index forever weaker for this workload?
Which slice should improve first?
Which cost rises first: RAM, disk, build time, query latency, or operational complexity?
What rollback signal tells you the index change hurt retrieval?

Where this lives in the wild¶

Pinecone multi-tenant deployments — vector platform engineer. Tenant-aware partitioning and replicas keep noisy neighbors from dominating latency.
Weaviate clusters on Kubernetes — infrastructure engineer. Shards and replicas are balanced across nodes while HNSW indexes stay queryable.
Qdrant distributed collections — backend platform engineer. Replication plus shard transfer handles capacity growth and node failures.
Elasticsearch hybrid stacks — search SRE. Cross-shard fan-out and top-k merge behavior dominate latency tuning at scale.
Large recommendation retrieval services — distributed systems engineer. Candidate generation uses many partitions and aggressive merging to keep throughput high.
Enterprise RAG — vector DBs store policy, wiki, ticket, and document chunks for semantic retrieval.
Ecommerce search — vectors help with descriptive queries while filters protect catalog scope.
Support copilots — need metadata filters for tenant, product, language, and freshness.
Code search — mixes semantic vectors with exact identifiers and repository permissions.
Recommendation systems — use nearest-neighbor retrieval before ranking models.
Image and multimodal search — embeddings represent images, captions, and cross-modal queries.
Legal discovery — recall and auditability are more important than average latency alone.
Healthcare retrieval — metadata, permissions, and freshness are safety boundaries.
Fraud and anomaly systems — vector similarity finds nearby behavior patterns.
Personalization systems — user and item embeddings need versioned lifecycle management.

Recall checkpoint¶

Why can hash sharding hurt vector-query latency?
When is tenant sharding especially attractive?
Why is max shard latency more important than average latency?
Why might each shard need to return more than k / num_shards results?
Which artifact would you inspect first for scaling and sharding?
What query or corpus slice would prove the improvement is real?
What is the first operational cost this mechanism adds?

Interview Q&A¶

Q: Why shard by tenant and not by random hash in many SaaS vector systems? A: Because queries are usually tenant-scoped, so tenant sharding reduces fan-out, improves isolation, and simplifies filtering.

Common wrong answer to avoid: "Because random hashing is outdated." Hashing is still useful; it just mismatches tenant-heavy workloads.

Q: Why does distributed top-k merge require oversampling from each shard? A: Because the global winners may be concentrated unevenly, so a small local return count can miss true global top results.

Common wrong answer to avoid: "Each shard only needs to return its fair share." Real rankings are rarely uniform.

Q: Why are replicas not only for fault tolerance in vector systems? A: Because replicas also absorb read load, reduce hot-spot pressure, and support regional latency goals.

Common wrong answer to avoid: "Replicas do not affect query performance." They often do.

Q: Why is rebalancing harder for semantic or graph-aligned shards? A: Because moving data changes locality assumptions and may require rebuilding index structures, not just copying rows.

Common wrong answer to avoid: "It is the same as moving SQL rows between partitions." ANN structures make it more involved.

Q: What artifact would you inspect first when scaling and sharding fails? A: I would inspect fan-out plan with shard key, replicas, merge step, and p99 latency, then compare it with exact baseline, filter state, index version, and embedding version.

Common wrong answer to avoid: "Just check whether the vector DB is up." — Availability does not prove recall, freshness, or relevance.

Q: How do you know the change helped? A: Track p99 fan-out latency and cross-shard recall on a representative query slice and compare it with latency, memory, build time, and filtered-result behavior.

Common wrong answer to avoid: "The average similarity score increased." — Similarity scores are not product-quality metrics by themselves.

Q: When should you avoid this mechanism? A: Avoid it when the corpus is small, exact search is cheap, or the team lacks evaluation data to prove the extra complexity helps.

Common wrong answer to avoid: "Every production AI system needs the most advanced vector index." — The right index depends on workload, scale, filters, and operational constraints.

Apply now (10 min)¶

Exercise. Imagine 8 tenants and 4 shards. Decide whether you would shard by tenant, hash, or hybrid. Give one reason about latency and one about isolation. Then compute total latency from four shard times and merge overhead.

Sketch from memory. Draw a router sending one query to many warehouse buildings. Label shards, replicas, and one hot shard that needs special treatment.

Reproduce from memory: explain scaling and sharding with its pressure, artifact, metric, boundary, and failure mode.

What you should remember¶

Scaling and sharding exists because one vector index becomes many shards, and fan-out multiplies latency and recall risk. The point is not to memorize a vendor feature; it is to know which workload pressure the mechanism relieves and which cost it creates.

The artifact to inspect is fan-out plan with shard key, replicas, merge step, and p99 latency. If you cannot inspect it, vector search debugging becomes guesswork.

Remember:

Vector search fails through geometry, metrics, indexes, filters, lifecycle, scale, and monitoring.
Watch p99 fan-out latency and cross-shard recall by query and corpus slice before trusting global averages.
Exact baselines and judged lists are how you keep ANN tuning honest.
Every vector database choice moves cost between recall, latency, memory, rebuilds, and operations.

Bridge. After scaling patterns come product choices. You now know the internals well enough to compare managed services and database options honestly. → 11-managed-services.md