Vector Databases: Interview Questions

The "which vector DB would you use and why" round. Senior interviewers don't want a feature checklist — they want a decision, with a scale threshold, an operational trade-off, and a migration path. The 2026 default answer: start with pgvector if you already run Postgres; move to a dedicated vector DB only when you hit a specific limit (scale, multi-tenant isolation, filter performance, hybrid native support). For ANN-algorithm internals (HNSW vs IVF vs DiskANN vs product quantization), see retrieval-and-ranking.md.

The decision question

Q: "Which vector database would you use for a new RAG project?"

Tags: mid · very-common · scenario · source: Vector DB comparison guides 2026 (Encore, Second Talent, CallSphere, DEV.to); standard senior RAG-architecture probe

Answer outline:

- Reframe as "what's your starting scale and constraints?" The answer changes with the situation. Senior tell: candidate names a concrete scale threshold for switching.
- Default ladder:
    - Already on Postgres + <10M vectors: pgvector. Zero new infra; one transactional store for everything; HNSW matches dedicated DBs at this scale.
    - <10M vectors + want zero ops: Pinecone serverless. Time-to-prod measured in hours, no infra to run.
    - Best filtered-search performance + self-hostable: Qdrant. Rust core, filter-aware HNSW, strong multi-tenant story.
    - Native hybrid search + multi-tenant: Weaviate. Built-in BM25 + vector in one query, managed cloud available.
    - Billions of vectors at high QPS: Milvus or Vespa. Distributed, sharded, the heavy-hitter tier.
    - Already on Elasticsearch / OpenSearch: their built-in vector search is good enough up to mid-scale and saves an integration.
- The 2026 default I'd actually recommend: pgvector for v1, migrate later if a real constraint forces it. Most projects never need to migrate.
- Numbers to drop: "pgvector + HNSW at 1M vectors: 5-20ms p95 at 95%+ recall; matches dedicated DBs", "Pinecone serverless: free tier up to 100k vectors, $50-500/month moderate workload", "scale crossover: 10-50M vectors typical pgvector→Pinecone transition"

Common follow-ups: - "What's the scale at which you'd switch?" - "When would you regret picking pgvector?" - "What about Chroma / LanceDB / Milvus / Vespa?"

Traps: - Picking based on hype or vendor pitch. The right answer is workload-driven. - Comparing by feature list without naming the scale tier you're at.

Related cross-cutting: Architecture choices, Cost & latency Related module: learning/02_ai_infrastructure/03_vector_retrieval_infrastructure/

Q: "Walk me through pgvector vs Pinecone vs Qdrant vs Weaviate."

Tags: senior · very-common · conceptual · source: Vector DB comparison guides 2026 (Encore, Second Talent, MarkTechPost); standard senior probe

Answer outline:

- pgvector: Postgres extension. Vectors live alongside relational data: same DB, same transactions, same SQL. Wins on operational simplicity. Limits: scale (mostly comfortable up to 10M, painful past 50M without careful sharding), no native distributed setup, recall tuning takes some effort.
- Pinecone: managed SaaS. Push vectors, query vectors. Serverless tier scales transparently up to billions. Wins on time-to-prod and operational simplicity. Cons: closed-source, per-vector pricing at scale, less flexible filtering, eventual consistency window between writes and reads.
- Qdrant: open-source, Rust core, self-hostable or managed cloud. Best-in-class filtered-search performance: filter-aware HNSW lets you do "find similar where tenant=X and date>Y" with minimal recall loss. Strong multi-tenant story (per-collection or per-shard isolation).
- Weaviate: open-source with managed cloud. Native hybrid search (BM25 + vector in one query) is its differentiator. Class-based schema (object-oriented feel). Good for RAG over heterogeneous docs.
- Milvus: open-source, designed for billion-scale. Distributed by default, multiple index types, very flexible. More complex to operate; right when you genuinely need that scale.
- Vespa: Yahoo's engine, hardcore distributed search with ML ranking built-in. Steepest learning curve, most powerful for hybrid search + custom ranking at scale.
- Chroma / LanceDB: lightweight, dev-friendly, primarily for prototyping. Production usage exists but they're typically not the high-scale answer.
- Numbers to drop: "Pinecone: zero ops, $50-500/month moderate. pgvector: free if you have Postgres. Qdrant: self-host or ~$50-200 managed. Weaviate similar. Milvus: free to self-host, but higher ops cost.", "filter performance ranking: Qdrant > Weaviate > Pinecone ≈ pgvector"

Common follow-ups: - "Why does Qdrant win on filtered search?" - "What's the difference between Pinecone pod-based and serverless?" - "When does Milvus actually pay off?"

Traps: - Treating these as interchangeable. The differentiation is real and workload-specific.

Related cross-cutting: Architecture choices, Cost & latency Related module: learning/02_ai_infrastructure/03_vector_retrieval_infrastructure/

Q: "When does pgvector beat Pinecone?"

Tags: senior · common · scenario · source: pgvector vs Pinecone guides 2026 (Encore, DEV.to, JavaCodeGeeks); standard senior probe

Answer outline:

- pgvector wins when:
    - You already run Postgres and the AI feature is being added to an existing app. Zero new infra; one DB to operate.
    - Vectors <10M and growth is modest. At this scale pgvector + HNSW matches or beats dedicated DBs.
    - Transactional consistency matters: vectors and source rows update atomically. Pinecone is eventually consistent, so there's a window where a freshly-written vector isn't yet searchable.
    - Complex SQL filters: pgvector lets you JOIN against other tables, use CTEs, subqueries. Pinecone has a metadata filter syntax but can't cross-join.
    - Cost-sensitive: pgvector is free if you have Postgres; Pinecone's per-vector pricing adds up.
- Pinecone wins when:
    - >50M vectors and growing: pgvector requires careful tuning past this; Pinecone's architecture scales without operator intervention.
    - Zero-ops mandate: small team, no DB expertise.
    - Multi-tenant SaaS with thousands of tenants: namespace isolation and per-tenant scaling are built in.
    - Bursty workloads: serverless tier auto-scales without paying for idle capacity.
- The honest answer for a v1: start with pgvector. Most projects never hit the threshold where Pinecone wins. Migrating later is straightforward (export embeddings, re-ingest).
- Numbers to drop: "pgvector + HNSW at 1M vectors: 5-20ms p95 at 95%+ recall, competitive with Pinecone", "crossover scale: 10-50M vectors; below, pgvector wins on ops; above, Pinecone wins"
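The transactional-consistency and JOIN points are easier to defend with SQL on the whiteboard. A minimal sketch, assuming a hypothetical schema (`documents`, `chunks`, `tenant_id`, `updated_at` are illustrative names) and psycopg-style named placeholders; only the `vector` type and the `<=>` cosine-distance operator come from pgvector itself:

```python
# Hypothetical schema (not from any real app):
#   documents(id, tenant_id, updated_at)
#   chunks(id, document_id, body, embedding vector(1536))

# Run these two statements inside ONE transaction so the source row and its
# embedding change together; readers never see new text with a stale vector.
TOUCH_DOCUMENT = "UPDATE documents SET updated_at = now() WHERE id = %(doc_id)s;"
UPDATE_CHUNK = (
    "UPDATE chunks SET body = %(body)s, embedding = %(embedding)s::vector "
    "WHERE id = %(chunk_id)s;"
)

# Similarity search joined against relational data in the same query.
FILTERED_SEARCH = """
SELECT c.id, c.body, c.embedding <=> %(query_vec)s::vector AS distance
  FROM chunks AS c
  JOIN documents AS d ON d.id = c.document_id
 WHERE d.tenant_id = %(tenant_id)s
   AND d.updated_at > %(since)s
 ORDER BY c.embedding <=> %(query_vec)s::vector
 LIMIT 10;
"""
```

A metadata-filter API can express the `tenant_id` condition, but the join to `documents` has to be denormalized into per-vector metadata first; that is the practical meaning of "can't cross-join".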

Common follow-ups: - "What if I'm not on Postgres?" - "How painful is a pgvector → Pinecone migration?" - "What if I need to scale to 100M+?"

Traps: - Recommending Pinecone for a 100k-vector hobby project. Massive overkill. - Dismissing pgvector as "not a real vector DB". It's competitive up to mid-scale.

Related cross-cutting: Architecture choices, Cost & latency Related module: learning/02_ai_infrastructure/03_vector_retrieval_infrastructure/

Q: "When would you pick Qdrant over Weaviate?"

Tags: senior · common · scenario · source: vector DB comparison guides 2026; standard senior probe

Answer outline:

- Qdrant wins when:
    - Filtered search performance is paramount: filter-aware HNSW lets you efficiently query "find similar to X where tenant=Y, date>Z" without scanning. Best-in-class for this.
    - Raw QPS matters: Rust core, very fast single-node throughput.
    - You want self-hosting with simple ops: Docker / K8s, lighter operational footprint than Milvus.
    - Strict multi-tenant isolation: per-collection or per-shard.
- Weaviate wins when:
    - Native hybrid (BM25 + vector) in one query matters: built-in, no separate index to maintain.
    - Schema-rich object model fits your data: class-based, GraphQL API, feels OO.
    - Generative-search modules: Weaviate has built-in modules for embedding-and-store, generative-with-LLM, etc. Less integration glue for some patterns.
- Both have managed cloud + self-host options; both are open-source.
- For a generic RAG: roughly equivalent. Pick based on whether filter performance (Qdrant) or hybrid + schema convenience (Weaviate) matters more for your workload.
- Numbers to drop: "Qdrant filter-aware HNSW: <2× the unfiltered query latency on selective filters", "Weaviate hybrid: BM25 and vector in one query without separate index sync"

Common follow-ups: - "What's filter-aware HNSW?" - "Is Weaviate's BM25 as good as Elasticsearch's?"

Traps: - Treating these as interchangeable. The filter-performance and hybrid-native differentiation is real.

Related cross-cutting: Architecture choices, Retrieval Related module: learning/02_ai_infrastructure/03_vector_retrieval_infrastructure/

Q: "When do you reach for Milvus or Vespa?"

Tags: staff · occasional · scenario · source: MarkTechPost Best Vector DBs 2026; standard staff-tier probe

Answer outline:

- Both are heavy-hitters for billion-scale vector search.
- Milvus: distributed by default, multiple index types (HNSW, IVF, DiskANN, ScaNN), strong ecosystem (LlamaIndex/LangChain integration), Zilliz Cloud for managed. Right when you need billion-scale, multiple index strategies, or per-collection tuning.
- Vespa: Yahoo's open-source engine. Built-in ML ranking with custom scoring functions, hybrid search via ranking expressions, very flexible. Operational complexity is high and comes with a learning curve. Best for teams doing search at major scale who need ranking-as-code.
- Trade-offs at this tier:
    - Operational complexity is real. Both need dedicated infra + people who know them.
    - You're rarely doing this for a v1. Migrations into Milvus / Vespa usually come from outgrowing simpler stacks.
    - At billion-scale, you're also thinking about sharding, replication, multi-region. Both handle all of these, but each requires design effort.
- 2026 default: most teams won't need Milvus/Vespa. Use Pinecone / Qdrant / Weaviate for high-scale managed; reach for Milvus/Vespa when you need custom ranking, distributed control, or true billion-scale on self-host.
- Numbers to drop: "Milvus billion-scale: requires dedicated cluster, careful sharding", "Vespa custom ranking: define scoring functions as ranking expressions in rank profiles (YQL is the query language, not where ranking lives)"

Common follow-ups: - "What does Vespa's custom ranking buy you?" - "Is Zilliz Cloud worth it over self-hosted Milvus?"

Traps: - Recommending Milvus/Vespa for projects that don't need it. The ops cost is real.

Related cross-cutting: Architecture choices Related module: learning/02_ai_infrastructure/03_vector_retrieval_infrastructure/

pgvector deep-dive

Q: "How does pgvector work and what are its limits?"

Tags: senior · common · conceptual · source: pgvector vs Pinecone guides 2026; standard senior probe

Answer outline:

- pgvector is a Postgres extension that adds a vector column type and similarity-search operators.
- Two index types: IVFFlat (older, faster build, weaker recall/latency at scale) and HNSW (modern default, slower build, much better recall/latency). Use HNSW for production.
- HNSW parameters: m (max graph connections, default 16; higher = better recall, more memory) and ef_construction (build-time search quality, default 64; higher = better-quality graph, slower build). Query-time: ef_search (default 40; runtime knob for recall/latency).
- Operators: <-> (Euclidean), <=> (cosine), <#> (negative inner product). Match your embedding model's training metric.
- Storage: vectors stored as float32 by default. In newer versions, halfvec (FP16) halves storage and the bit type stores one bit per dimension.
- Limits:
    - Memory: HNSW only hits these latencies when the index is resident in RAM. A 10M-vector × 1536-dim index needs ~60 GB (vectors) + graph overhead. Past this point you need larger Postgres instances or PQ-style compression.
    - Index build time: HNSW build on 10M vectors takes hours. Plan for it.
    - Updates: every insert updates the graph; high-write workloads can stress the index. Reindex periodically.
    - No native distributed setup: pgvector lives on a single Postgres instance. For >50M vectors at scale, you're sharding via Citus or manually.
- Numbers to drop: "pgvector HNSW build: 1-3 hours per 10M vectors", "memory: ~6-12 GB per 1M vectors at 1536 dim FP32 + graph", "query: 5-20ms p95 at 1M, degrades past 50M without sharding"
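The index parameters and operators above map directly onto pgvector SQL. A small sketch that generates the statements; the table and column names (`items`, `embedding`) and the `%s` placeholder style are assumptions, while the operator classes and the `hnsw.ef_search` setting are pgvector's own names:

```python
# Distance operator -> index operator class (pgvector's names).
OPCLASS = {
    "<->": "vector_l2_ops",      # Euclidean
    "<=>": "vector_cosine_ops",  # cosine distance
    "<#>": "vector_ip_ops",      # negative inner product
}

def hnsw_index_sql(table, column, operator="<=>", m=16, ef_construction=64):
    """CREATE INDEX statement; identifiers must be trusted constants, not user input."""
    return (
        f"CREATE INDEX ON {table} USING hnsw ({column} {OPCLASS[operator]}) "
        f"WITH (m = {m}, ef_construction = {ef_construction});"
    )

def ef_search_sql(ef_search=40):
    """Per-session query-time knob: higher means better recall, slower queries."""
    return f"SET hnsw.ef_search = {ef_search};"

def knn_sql(table, column, operator="<=>", k=10):
    """Top-k query; the operator must match the index's operator class."""
    return f"SELECT id FROM {table} ORDER BY {column} {operator} %s::vector LIMIT {k};"

print(hnsw_index_sql("items", "embedding"))
print(ef_search_sql(100))
print(knn_sql("items", "embedding"))
```

The point worth saying out loud: the query operator has to match the operator class the index was built with, otherwise Postgres falls back to a sequential scan.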

Common follow-ups: - "How do you tune ef_search?" - "What about pgvector 0.7+ and halfvec / bit vectors?" - "How does pgvector compare to pgvectorscale?"

Traps: - Defaulting to IVFFlat in 2026. HNSW is the modern choice. - Forgetting that HNSW build is expensive. Plan ahead.

Related cross-cutting: Retrieval, Cost & latency Related module: learning/02_ai_infrastructure/03_vector_retrieval_infrastructure/, learning/01_ai_engineering/07_search_relevance_ranking/

Q: "What's pgvectorscale and when does it help?"

Tags: staff · occasional · conceptual · source: TimescaleDB pgvectorscale docs 2026; specialized senior probe

Answer outline:

- pgvectorscale (by Timescale) is an extension layered on pgvector adding StreamingDiskANN (disk-friendly ANN) and statistical binary quantization (SBQ).
- StreamingDiskANN: keeps the index partially on disk, streams from SSD as needed. Lets pgvector scale past the "must fit in RAM" wall of HNSW.
- SBQ: aggressive quantization that maintains recall via reranking with full-precision vectors on a smaller candidate set. Memory savings are substantial.
- Result: pgvector + pgvectorscale can comfortably handle 100M+ vectors on commodity hardware where vanilla pgvector would OOM.
- When to use: you want to stay on Postgres but push past 10-50M vectors. Avoids the migration to a dedicated vector DB.
- Operational caveat: it's a Timescale extension, so additional install + ops; not as ubiquitous as vanilla pgvector. Check if your hosted Postgres provider supports it.
- Numbers to drop: "StreamingDiskANN: scales to 100M+ vectors on commodity boxes", "SBQ + reranking: closes most of the recall gap from binary quantization"
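The quantize-then-rerank idea behind SBQ can be illustrated in a few lines. This is a toy sketch using plain sign-bit quantization, not the statistical variant pgvectorscale implements, and all function names are made up for illustration:

```python
def to_bits(vec):
    """One bit per dimension: 1 where the component is positive."""
    bits = 0
    for x in vec:
        bits = (bits << 1) | (1 if x > 0 else 0)
    return bits

def hamming(a, b):
    """Number of differing bits between two codes."""
    return bin(a ^ b).count("1")

def l2_squared(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def search(query, vectors, k, oversample=4):
    """Stage 1: cheap Hamming scan over 1-bit codes. Stage 2: exact rerank."""
    codes = [to_bits(v) for v in vectors]  # precomputed at index time in practice
    q = to_bits(query)
    candidates = sorted(range(len(vectors)), key=lambda i: hamming(q, codes[i]))
    candidates = candidates[: k * oversample]
    return sorted(candidates, key=lambda i: l2_squared(query, vectors[i]))[:k]

vectors = [[1.0, 1.0], [1.0, -1.0], [-1.0, -1.0], [0.9, 1.2]]
print(search([1.0, 1.0], vectors, k=2, oversample=2))  # -> [0, 3]
```

The codes are 32× smaller than float32 vectors, and the rerank over `k * oversample` candidates is what recovers most of the recall the coarse codes lose.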

Common follow-ups: - "How does this compare to running Qdrant or Milvus?" - "What's the recall impact?"

Traps: - Conflating pgvector and pgvectorscale. They're separate extensions.

Related cross-cutting: Retrieval, Cost & latency Related module: learning/02_ai_infrastructure/03_vector_retrieval_infrastructure/

Operational concerns

Q: "How do you handle multi-tenancy in a vector database?"

Tags: senior · very-common · design · source: Pinecone / Weaviate / Cosmos multi-tenancy docs 2026; Marcus Feldman multi-tenant scaling 2026; standard senior infra probe

Answer outline:

- Three patterns, picked by isolation and scale requirements:
    - Filter-by-tenant (logical isolation): single index, every vector tagged with tenant_id, every query includes a filter. Cheapest, easiest. Risk: a missed filter leaks across tenants.
    - Per-tenant namespace / collection (logical isolation, physical sub-partition): vector DB has a native "namespace" concept (Pinecone namespaces, Qdrant collections, Weaviate multi-tenancy). Vectors stored separately per tenant in the same cluster.
    - Per-tenant index / instance (physical isolation): dedicated index or cluster per tenant. Strongest isolation; expensive at scale.
- For SMB SaaS with 100s-1000s of tenants, namespace/collection isolation is the 2026 default. Weaviate, Qdrant, and Pinecone all support this natively.
- For 10000s of tenants: collection-per-tenant can break the DB (Marcus Feldman pushed Qdrant past 100k collections; limits exist). Hybrid: hot tenants get dedicated collections, cold tenants share a multi-tenant index with filter-by-tenant.
- For enterprise / regulated tenants: dedicated cluster or per-tenant encryption keys; air-gapped if required.
- Security:
    - Every query must include the tenant filter. Enforce at the SDK / middleware layer, not in application code.
    - Test with red-team queries (try to retrieve another tenant's data).
    - Audit log on cross-tenant queries.
- Numbers to drop: "Pinecone serverless: namespaces are free, auto-scale per namespace", "Qdrant: collections up to ~100k practical limit", "Weaviate native multi-tenancy: built for 10k+ tenants per cluster"
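"Enforce at the middleware layer" is worth sketching, because interviewers often ask what that looks like. A minimal wrapper, assuming a hypothetical backend callable `search_fn(vector, filters, top_k)` that returns dicts carrying a `tenant_id` field; no real vector DB SDK is implied:

```python
from typing import Any, Callable, Dict, List, Optional

SearchFn = Callable[[List[float], Dict[str, Any], int], List[Dict[str, Any]]]

class TenantScopedSearch:
    """Binds a tenant once; callers cannot issue an unscoped query."""

    def __init__(self, search_fn: SearchFn, tenant_id: str):
        if not tenant_id:
            raise ValueError("tenant_id is required")
        self._search = search_fn
        self._tenant_id = tenant_id

    def query(self, vector: List[float],
              filters: Optional[Dict[str, Any]] = None,
              top_k: int = 10) -> List[Dict[str, Any]]:
        merged = dict(filters or {})
        # Reject attempts to override the bound tenant.
        if merged.get("tenant_id", self._tenant_id) != self._tenant_id:
            raise PermissionError("cross-tenant filter rejected")
        merged["tenant_id"] = self._tenant_id
        hits = self._search(vector, merged, top_k)
        # Defense in depth: fail loudly if the backend returns foreign rows.
        if any(h.get("tenant_id") != self._tenant_id for h in hits):
            raise RuntimeError("backend returned another tenant's data")
        return hits
```

Application code receives a `TenantScopedSearch` built from the authenticated session and never sees the raw client, so a forgotten filter becomes impossible rather than merely unlikely.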

Common follow-ups: - "What's the failure mode of forgetting the tenant filter?" - "How do you handle a tenant with 1000× the data of others?" - "Per-collection vs per-shard isolation?"

Traps: - Filter-by-tenant alone for high-stakes data. One bug = cross-tenant leak. - Per-tenant cluster at scale. Cost explodes.

Related cross-cutting: Architecture choices, Production patterns Related module: learning/02_ai_infrastructure/03_vector_retrieval_infrastructure/, learning/03_ai_security_safety/00_safety_guardrail_design/

Q: "How do you handle updates and deletes in a vector database?"

Tags: senior · common · conceptual · source: standard senior operational probe; reported in 2026 RAG-infra loops

Answer outline:

- All major vector DBs support upsert; delete handling varies and is often expensive.
- HNSW-based indices (Qdrant, Weaviate, pgvector, Milvus) handle deletes via tombstones: the deleted node stays in the graph but is filtered from results. Over time, tombstones accumulate and degrade recall/latency. Periodic rebuild compacts them.
- IVF-based indices: similar tombstoning, with cluster-level repacking during compaction.
- For update-heavy workloads: append-only patterns work better than in-place updates. New version → new ID, old ID tombstoned, periodic GC.
- Pinecone serverless: handles upserts/deletes natively; you don't see the underlying compaction.
- For incremental re-embedding (e.g., embedding model upgrade): build the new index in parallel, dual-write during the transition window, switch reads atomically, then deprecate the old.
- For real-time freshness: write to the vector store, expect a propagation window (eventual consistency in Pinecone, near-immediate in pgvector with HNSW updates).
- Numbers to drop: "tombstone GC: trigger at 5-15% tombstone ratio", "incremental re-embed: parallel dual-write window of hours-days depending on corpus size"
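The append-only pattern and the tombstone-ratio trigger can be sketched as plain bookkeeping. The ID scheme (`doc#vN`) and the 10% default threshold are illustrative choices inside the 5-15% range quoted above, not any product's behavior:

```python
def tombstone_ratio(live: int, tombstoned: int) -> float:
    """Fraction of index entries that are deleted but not yet compacted."""
    total = live + tombstoned
    return tombstoned / total if total else 0.0

def needs_compaction(live: int, tombstoned: int, threshold: float = 0.10) -> bool:
    """True once the tombstone ratio reaches the GC trigger."""
    return tombstone_ratio(live, tombstoned) >= threshold

def versioned_id(doc_id: str, version: int) -> str:
    """Hypothetical ID scheme: every version of a document gets its own point ID."""
    return f"{doc_id}#v{version}"

def plan_update(doc_id: str, current_version: int) -> dict:
    """Append-only update: write the new version, then tombstone the old one."""
    return {
        "upsert": versioned_id(doc_id, current_version + 1),
        "delete": versioned_id(doc_id, current_version),
    }

print(plan_update("doc7", 3))        # {'upsert': 'doc7#v4', 'delete': 'doc7#v3'}
print(needs_compaction(900, 100))    # True
```

For the "updated 100 times an hour" follow-up, this bookkeeping is the argument for debouncing writes: each update costs one tombstone, so coalescing updates per document directly slows the march toward the compaction threshold.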

Common follow-ups: - "What's the cost of a deleted node staying in the index?" - "How do you handle a document being updated 100 times an hour?"

Traps: - Naive deletes without considering tombstone accumulation. Index quality degrades silently.

Related cross-cutting: Production patterns Related module: learning/02_ai_infrastructure/03_vector_retrieval_infrastructure/

Q: "Walk me through migrating from pgvector to Pinecone."

Tags: senior · common · scenario · source: standard senior scaling-migration probe; reported in 2026 RAG-infra loops

Answer outline:

- Plan the migration as zero-downtime. Four phases:
    - Phase 1, dual-write: every new vector written to both pgvector and Pinecone. Reads still served from pgvector. Run for a few days; verify Pinecone has the data.
    - Phase 2, backfill: bulk export historical pgvector data, batch-upsert into Pinecone. Pinecone's serverless tier handles bulk ingest gracefully; pod-based tiers may need pre-provisioning. At 10M vectors and typical upsert throughput, expect hours.
    - Phase 3, read-flip: shadow-route reads to Pinecone, compare top-K results with pgvector for a sample of queries. Confirm equivalent recall. Then flip reads atomically (feature flag); keep dual-writing.
    - Phase 4, decommission: after a stability window (1-2 weeks), stop writing to pgvector. Drop the table or keep as cold backup.
- Catches:
    - Distance metric mismatch: pgvector default is L2 (<->); Pinecone default depends on init config. Verify both compute the same metric.
    - Embedding normalization: cosine and dot-product are only guaranteed to give the same ranking when vectors are unit-normalized. Inconsistency causes silent recall degradation.
    - Metadata schema: Pinecone's metadata filter syntax differs from SQL. Translate queries; test thoroughly.
    - ID format: Pinecone IDs are strings; map your pgvector IDs (often integers) carefully.
    - Consistency window: Pinecone is eventually consistent. If your app relies on read-after-write, add a brief delay or retry.
- Eval: side-by-side comparison of recall@10 on a labeled eval set before and after. Should be ≥99% agreement; investigate any divergence.
- Numbers to drop: "10M vectors backfill: hours at typical bulk-upsert throughput", "dual-write window: days. Shadow-read comparison: thousands of queries before promotion."
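The shadow-read gate and the normalization check are the two pieces of this migration that are easy to show in code. A minimal sketch with hypothetical helper names; it compares ID lists returned by the old and new stores and makes no assumption about either client library:

```python
import math
from typing import List, Sequence, Tuple

def normalize(v: Sequence[float]) -> List[float]:
    """Scale a vector to unit length so cosine and dot-product rank identically."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]

def is_unit(v: Sequence[float], tol: float = 1e-6) -> bool:
    return abs(math.sqrt(sum(x * x for x in v)) - 1.0) <= tol

def topk_overlap(old_ids: Sequence[str], new_ids: Sequence[str], k: int) -> float:
    """Fraction of the old store's top-k that the new store also returns."""
    if k <= 0:
        raise ValueError("k must be positive")
    return len(set(old_ids[:k]) & set(new_ids[:k])) / k

def shadow_gate(pairs: Sequence[Tuple[Sequence[str], Sequence[str]]],
                k: int = 10, threshold: float = 0.99) -> Tuple[float, bool]:
    """Mean top-k agreement over sampled queries, and whether it clears the bar."""
    scores = [topk_overlap(old, new, k) for old, new in pairs]
    mean = sum(scores) / len(scores)
    return mean, mean >= threshold

# Integer primary keys become string IDs on the target side.
old = [str(i) for i in (1, 2, 3)]
new = ["3", "2", "9"]
print(topk_overlap(old, new, k=3))
```

Set-overlap ignores ordering inside the top-k; if the downstream prompt is order-sensitive, a rank-aware comparison is a reasonable extension.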

Common follow-ups: - "What if Pinecone returns different results?" - "How do you avoid downtime?" - "What if the embedding model also changes during migration?"

Traps: - Single-shot cutover. Always shadow first. - Forgetting metric / normalization mismatches. Silent quality regression.

Related cross-cutting: Production patterns, Architecture choices Related module: learning/02_ai_infrastructure/03_vector_retrieval_infrastructure/, learning/02_ai_infrastructure/04_ml_platform_operations/

Q: "How do you re-embed a corpus when your embedding model upgrades?"

Tags: senior · common · scenario · source: standard senior RAG-ops probe; reported in 2026 RAG-infra loops

Answer outline:

- The embedding model is bound to its index. Mix embeddings from two model versions in the same index and similarity becomes meaningless.
- Plan: build a parallel index for the new embedding model; cut over once it's caught up.
- Steps:
    - Provision a new collection / namespace / table for the new embeddings.
    - Run the new embedding model over the full corpus, write to the new index. At 10M chunks and 1k-10k chunks/sec, raw embedding time is roughly 17 minutes to 3 hours; rate limits, retries, and index writes usually stretch that to hours-to-a-day.
    - Dual-write incremental updates to both indices during the migration window.
    - Shadow comparison: route 1-5% of queries to the new index, compare top-K and answer quality. Eval-set verification.
    - Flip reads to the new index. Keep dual-writing for a stability window.
    - Decommission the old index after 1-2 weeks of stability.
- Cost: parallel storage during the migration window doubles vector store costs temporarily. Plan accordingly.
- Embedding cost: re-embedding 10M chunks via API is non-trivial. At $0.02 per 1M tokens, 10M chunks × 200 tokens/chunk = 2B tokens, or about $40 for a small model. Larger models scale up.
- The interview answer to nail: this is a pipeline change with shadow + canary, not a "drop the old index and rebuild" event.
- Numbers to drop: "re-embed 10M chunks: hours at API throughput; self-hosted GPU often comparable", "parallel-storage window: 1-2 weeks", "embed cost: $40-400 for 10M chunks depending on model"
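The cost and duration figures are quick arithmetic worth being able to reproduce on the spot. A small sketch using the numbers quoted above as inputs (the function names are illustrative):

```python
def embedding_cost_usd(chunks: int, tokens_per_chunk: int,
                       usd_per_million_tokens: float) -> float:
    """API cost of embedding the whole corpus once."""
    return chunks * tokens_per_chunk / 1_000_000 * usd_per_million_tokens

def embedding_hours(chunks: int, chunks_per_second: float) -> float:
    """Raw embedding time at a sustained throughput, ignoring retries and rate limits."""
    return chunks / chunks_per_second / 3600

cost = embedding_cost_usd(10_000_000, 200, 0.02)
slow = embedding_hours(10_000_000, 1_000)
fast = embedding_hours(10_000_000, 10_000)
print(f"cost: ${cost:.0f}")                          # cost: $40
print(f"time: {fast * 60:.0f} min to {slow:.1f} h")  # time: 17 min to 2.8 h
```

At the quoted 1k-10k chunks/sec the raw embedding pass is minutes to a few hours; the longer wall-clock estimates come from rate limits, retries, and writing into the new index.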

Common follow-ups: - "What if you can't afford double storage?" - "How do you decide it's safe to flip?" - "What if the new embedding model has different dimensions?"

Traps: - Trying to re-embed in place. Mid-migration queries return garbage.

Related cross-cutting: Production patterns Related module: learning/02_ai_infrastructure/03_vector_retrieval_infrastructure/, learning/01_ai_engineering/07_search_relevance_ranking/

Architecture & performance

Q: "How does HNSW behave under high update load?"

Tags: senior · common · conceptual · source: Marcus Feldman multi-tenant 2026; standard senior performance probe

Answer outline:

- HNSW updates are not free. Each insert walks the graph and updates connections; each delete tombstones a node. The graph degrades over time.
- Specific issues:
    - Concurrent inserts: most HNSW implementations serialize graph updates internally. High write QPS bottlenecks here.
    - Deletes accumulate as tombstones: queries still visit tombstoned nodes (filtering them after); search slows as tombstone ratio grows.
    - Connectivity degradation: inserts can create suboptimal connections; over time, recall drops below the index-build-time level.
- Mitigations:
    - Batched inserts: write in bulk where possible. Vector DBs handle batches more efficiently than single-row writes.
    - Periodic rebuild / compact: schedule a full rebuild during low-traffic windows. Most vector DBs support online rebuild with cutover.
    - Append-only patterns: tombstone-and-add instead of in-place update. Easier to GC.
    - Read replicas: if writes are bottlenecking reads, separate read and write replicas (Qdrant supports this; pgvector via Postgres replicas).
- For write-very-heavy workloads (>1k writes/sec sustained), HNSW may not be the right choice; consider IVF-based indices or DiskANN-style structures.
- Numbers to drop: "tombstone GC trigger: 5-15% ratio", "scheduled rebuild: weekly to monthly depending on write rate", "HNSW write throughput: 100s-1000s of writes/sec single-node"
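The batched-insert mitigation is simple enough to sketch. The `upsert_batch` callable stands in for whatever bulk-write call your store exposes (a hypothetical parameter, not a real SDK method), and the batch size of 500 is an arbitrary starting point to tune:

```python
from typing import Callable, Iterable, Iterator, List, TypeVar

T = TypeVar("T")

def batched(items: Iterable[T], size: int) -> Iterator[List[T]]:
    """Yield consecutive lists of at most `size` items."""
    if size <= 0:
        raise ValueError("size must be positive")
    batch: List[T] = []
    for item in items:
        batch.append(item)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:  # final partial batch
        yield batch

def bulk_upsert(points: Iterable[T],
                upsert_batch: Callable[[List[T]], None],
                size: int = 500) -> int:
    """Send points in batches; returns the number of write calls made."""
    calls = 0
    for chunk in batched(points, size):
        upsert_batch(chunk)
        calls += 1
    return calls

print(list(batched(range(5), 2)))  # [[0, 1], [2, 3], [4]]
```

Fewer, larger writes mean fewer round trips and fewer lock acquisitions on the graph, which is where single-row insert throughput usually stalls.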

Common follow-ups: - "What's the rebuild cost?" - "When would you not use HNSW?"

Traps: - Assuming HNSW is set-and-forget. It needs maintenance under update load.

Related cross-cutting: Production patterns, Retrieval Related module: learning/02_ai_infrastructure/03_vector_retrieval_infrastructure/

Q: "How do you size a vector database for a target workload?"

Tags: senior · common · design · source: standard senior infra-sizing probe; reported in 2026 RAG-infra loops

Answer outline:

- Inputs: vector count, dimensionality, target QPS, target p95 latency, filter selectivity, multi-tenant pattern.
- Memory sizing:
    - Raw vectors: vector_count × dimension × bytes_per_element. At 10M × 1536-dim × FP32 = ~60 GB raw.
    - HNSW index footprint: ~1.5-2× the raw vector size in total, i.e. ~0.5-1× of graph overhead on top of the raw vectors.
    - KV / metadata: depends on per-vector metadata.
    - Total: budget ~2.5-3× the raw vector size for HNSW + metadata + headroom.
- Compute:
    - QPS × per-query work. Typical HNSW query: <10ms on a single core. 1000 QPS → ~10 cores busy.
    - Reranking and embedding generation are usually larger compute consumers than the vector search itself.
- Storage:
    - On-disk for cold tier (DiskANN, pgvector): SSD sized for the corpus.
    - In-memory for hot tier (HNSW): RAM-dominated.
- Replication: typical 2-3× replicas for HA and read throughput.
- Plan for growth: vector count doubles in many products. Size with 6-12 months of headroom.
- Numbers to drop: "1M vectors × 1536-dim FP32: ~6 GB raw + ~3 GB HNSW overhead", "10M: ~60 GB raw + ~30 GB overhead", "1B: ~6 TB raw, IVF-PQ or DiskANN territory"
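The sizing rules reduce to a back-of-envelope calculator. A sketch using the multipliers quoted above as defaults (2.5× total budget, 10 ms per query, 2 replicas); these are planning assumptions to replace with measured values, and the function name is illustrative:

```python
def size_cluster(vector_count: int, dim: int, qps: float,
                 query_ms: float = 10.0, bytes_per_element: int = 4,
                 total_multiplier: float = 2.5, replicas: int = 2) -> dict:
    """Rough RAM and CPU budget for an in-memory HNSW deployment."""
    raw_gb = vector_count * dim * bytes_per_element / 1e9
    ram_per_replica = raw_gb * total_multiplier   # index + metadata + headroom
    return {
        "raw_gb": raw_gb,
        "ram_gb_per_replica": ram_per_replica,
        "ram_gb_total": ram_per_replica * replicas,
        # Cores kept busy by search alone: arrival rate x service time.
        "busy_cores": qps * query_ms / 1000,
    }

plan = size_cluster(10_000_000, 1536, qps=1000)
for name, value in plan.items():
    print(f"{name}: {value:.1f}")
# raw_gb: 61.4, ram_gb_per_replica: 153.6, ram_gb_total: 307.2, busy_cores: 10.0
```

For the compression follow-up, changing `bytes_per_element` to 2 (FP16) halves every memory figure, which is usually the cheapest lever before moving to a disk-based index.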

Common follow-ups: - "What if you don't know the QPS yet?" - "How does compression change the sizing?"

Traps: - Sizing only for current load. Vector counts grow.

Related cross-cutting: Cost & latency Related module: learning/02_ai_infrastructure/03_vector_retrieval_infrastructure/

Q: "What's the difference between Pinecone pods and serverless?"

Tags: senior · common · conceptual · source: Pinecone docs 2026; standard senior vendor-decision probe

Answer outline:

- Pods (legacy): dedicated capacity. You provision N pods; you pay per pod-hour regardless of usage. Predictable performance, predictable cost. Good for steady-state high QPS where you know your load.
- Serverless: auto-scaling, usage-based pricing. You pay for what you use (storage GB + read/write units). Scales transparently up to billions. Eventually consistent (some propagation window between write and read).
- 2026 default: serverless for new projects. Pods only when:
    - You need strict tail latency guarantees (serverless has cold-start variance).
    - You need consistent throughput at very high QPS where pod-based pricing wins.
    - You have specific regional / dedicated requirements.
- Cost crossover: at small-to-moderate workloads, serverless is much cheaper. At very high steady-state QPS, dedicated pods may match or beat it.
- Migration: namespaces are compatible between pods and serverless; switching is possible.
- Numbers to drop: "serverless free tier: 100k vectors", "moderate workload: $50-500/month serverless", "pod-based starts around $70/month per pod"

Common follow-ups: - "What's the cold-start cost on serverless?" - "When did you last hit a pod limit?"

Traps: - Defaulting to pods because "they sound more enterprise". Serverless is the 2026 default.

Related cross-cutting: Cost & latency Related module: learning/02_ai_infrastructure/03_vector_retrieval_infrastructure/, learning/01_ai_engineering/12_model_vendor_strategy/

Hybrid search & filtering

Q: "Which vector databases have native hybrid (BM25 + vector) search?"

Tags: senior · common · conceptual · source: vector DB comparison guides 2026; standard senior retrieval probe

Answer outline:

- Weaviate: best-in-class. BM25 and vector search in one query, fused with your choice of method (RRF or weighted). Native.
- Qdrant: supports BM25 via the FastEmbed integration (since 2024-2025); can combine in queries. Strong support.
- Elasticsearch / OpenSearch: started as text-search engines and added vector (HNSW), so hybrid is straightforward. Same query DSL.
- Vespa: hybrid native, with custom ranking expressions.
- Pinecone: supports sparse-dense hybrid via separately-indexed sparse vectors. Requires you to provide both sparse and dense vectors per item. Less ergonomic than Weaviate's BM25.
- pgvector: pair with Postgres full-text search (tsvector, GIN index, ts_rank). Works, but needs two indexes; combine in SQL via union + RRF.
- Milvus: supports hybrid via sparse + dense, but ergonomics depend on version.
- For RAG: if hybrid is core to your retrieval strategy and you want one-query simplicity, Weaviate or Elasticsearch are the easiest. Otherwise, build hybrid yourself with separate sparse + dense indices.
- Numbers to drop: "hybrid retrieval typically lifts NDCG@10 by 5-20% over pure dense", "RRF (Reciprocal Rank Fusion) with k=60 is the standard fusion algorithm"
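When you build hybrid yourself, the fusion step is Reciprocal Rank Fusion over the two ranked ID lists. A minimal sketch with k=60 as quoted above; the input lists stand in for whatever your BM25 and dense searches return:

```python
from collections import defaultdict
from typing import Dict, List, Sequence

def rrf(rankings: Sequence[Sequence[str]], k: int = 60) -> List[str]:
    """Fuse ranked lists: score(d) = sum over lists of 1 / (k + rank of d)."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties keep first-seen order (sort is stable).
    return sorted(scores, key=lambda d: -scores[d])

bm25_hits = ["a", "b", "c"]
dense_hits = ["b", "c", "d"]
print(rrf([bm25_hits, dense_hits]))  # ['b', 'c', 'a', 'd']
```

RRF only needs ranks, not scores, which is why it works across BM25 and cosine similarity without any score normalization.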

Common follow-ups: - "What's RRF?" (See retrieval-and-ranking.md) - "How does Pinecone's sparse-dense compare to Weaviate?"

Traps: - Implying hybrid is automatic. Most vector DBs need you to set it up explicitly.

Related cross-cutting: Retrieval Related module: learning/02_ai_infrastructure/03_vector_retrieval_infrastructure/, learning/01_ai_engineering/07_search_relevance_ranking/

Q: "Why is filtered vector search hard?"

Tags: senior · common · conceptual · source: Qdrant filter-aware HNSW docs 2026; standard senior retrieval probe

Answer outline:

- Naive HNSW traversal doesn't know about filters. Your options:
    - Pre-filter (filter first, then ANN-search the subset): cheap if the filter is very selective. Risk: if the filter rejects most of the corpus, the surviving nodes are too sparse in the graph for HNSW traversal to give good recall.
    - Post-filter (ANN-search first, filter after): simple. Risk: top-K results may be filtered away entirely, requiring you to retrieve top-NK to get K survivors.
    - Filter-aware search: the graph traversal skips nodes that fail the filter. Best quality at slight perf cost. This is what Qdrant pioneered.
- The hard case: medium-selective filters (filter retains 5-30% of corpus). Pre-filter loses graph quality; post-filter wastes most of the retrieved set. Filter-aware shines here.
- Selectivity matters: ultra-selective filters (<1% retained, e.g., tenant=X for 1000-tenant SaaS) benefit from segmented storage (one HNSW per tenant), not from filter-aware search.
- Practical guidance:
    - Tenant-only filters: use tenant-segmented storage (collections / namespaces / shards).
    - Multi-field filters with moderate selectivity: filter-aware search (Qdrant, modern Pinecone, Weaviate).
    - Very selective filters: pre-filter is fine.
- Numbers to drop: "Qdrant filter-aware HNSW: <2× the unfiltered query latency", "naive post-filter on 10% selectivity: need top-50 to get top-5 survivors"
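The post-filter oversampling number ("top-50 to get top-5 at 10% selectivity") comes from a one-line calculation. A sketch of that arithmetic plus the post-filter step itself; the hit dictionaries are illustrative, and the estimate assumes matches are spread evenly through the ranking, which real data often violates:

```python
import math
from typing import Callable, Iterable, List, TypeVar

T = TypeVar("T")

def oversample_k(k: int, selectivity: float) -> int:
    """Candidates to fetch so that about k survive a filter retaining `selectivity`."""
    if not 0 < selectivity <= 1:
        raise ValueError("selectivity must be in (0, 1]")
    # round() guards against float noise such as 49.99999999999999.
    return math.ceil(round(k / selectivity, 9))

def post_filter(ranked_hits: Iterable[T],
                predicate: Callable[[T], bool], k: int) -> List[T]:
    """Keep the first k hits, in rank order, that pass the filter."""
    kept: List[T] = []
    for hit in ranked_hits:
        if predicate(hit):
            kept.append(hit)
            if len(kept) == k:
                break
    return kept

print(oversample_k(5, 0.10))  # 50
hits = [{"id": i, "tenant": "a" if i % 2 == 0 else "b"} for i in range(10)]
print([h["id"] for h in post_filter(hits, lambda h: h["tenant"] == "a", 3)])  # [0, 2, 4]
```

If `post_filter` comes back with fewer than k hits, the caller has to re-query with a larger fetch size; that retry loop is the hidden latency cost of naive post-filtering.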

Common follow-ups: - "When does pre-filter beat filter-aware?" - "What's the cost of filter-aware vs naive HNSW?"

Traps: - Saying filters are free. They aren't, on HNSW. - Treating tenant filtering as "just another filter". For multi-tenant, segmentation is usually better.

Related cross-cutting: Retrieval Related module: learning/02_ai_infrastructure/03_vector_retrieval_infrastructure/, learning/01_ai_engineering/07_search_relevance_ranking/

Lightweight & specialized options

Q: "When would you use Chroma or LanceDB?"

Tags: mid · common · scenario · source: vector DB comparison guides 2026; standard senior tooling probe

Answer outline:

- Chroma: lightweight, dev-friendly, in-process or client-server mode. Built for quick prototyping and LLM apps. The default for "I want a vector DB to play with" in Python. Production usage exists but at modest scale.
- LanceDB: built on the Lance file format (columnar, optimized for ML). Sits on object storage; great for "vector DB + your data lake" patterns. Strong analytics integration; embeds in Python apps.
- Use Chroma for:
    - Prototyping; demos; notebooks.
    - In-process / embedded use cases where you don't want a separate service.
    - Small production workloads (<1M vectors) where simplicity beats scalability.
- Use LanceDB for:
    - Mixed analytics + retrieval workloads.
    - Data-lake-native patterns (vectors alongside parquet on S3).
    - Embedded apps where you want vector + analytics in one library.
- Neither is the right answer for high-QPS multi-tenant SaaS. Migrate when scale demands.
- Numbers to drop: "Chroma: typical comfort zone <1M vectors", "LanceDB: scales further but most teams use it embedded, not as primary OLTP retrieval"

Common follow-ups: - "Why isn't Chroma the right production answer?" - "When would you embed LanceDB instead of running a service?"

Traps: - Recommending Chroma for production multi-tenant at scale. Wrong tier.

Related cross-cutting: Architecture choices Related module: learning/02_ai_infrastructure/03_vector_retrieval_infrastructure/

Q: "What about using Elasticsearch / OpenSearch for vectors?"¶

Tags: senior · common · scenario · source: vector DB comparison guides 2026; standard senior infra probe

Answer outline: - Elasticsearch (and its fork, OpenSearch) added HNSW vector search in recent versions. Both are now competitive for hybrid retrieval where you already use ES for text. - Wins: - You're already on Elasticsearch / OpenSearch: no new infra. Hybrid is natural: BM25 + HNSW in one query. - Rich text-search features: stemming, synonyms, multi-field, query DSL. - Operational familiarity: if your team runs ES, the learning curve is small. - Limits: - Vector-specific features lag: less aggressive ANN tuning than Qdrant or Milvus; recall/latency is competitive but not best-in-class. - JVM overhead: heavier ops profile than Rust-based vector DBs. - Cost at scale: ES cluster sizing for billion-scale vector search can be expensive. - 2026 stance: if you already have Elasticsearch, use it for vectors up to mid-scale. If you're greenfield, a dedicated vector DB usually wins. - Numbers to drop: "ES/OpenSearch HNSW: competitive recall/latency up to 10s of millions of vectors", "added in ES 8.x; OpenSearch 2.x"
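To make "BM25 + HNSW in one query" concrete, here is a rough sketch of a hybrid request body for the Elasticsearch 8.x `_search` API, which accepts a top-level `knn` section alongside a regular `query`. The field names `body` and `embedding` and the helper name are assumptions for illustration. OpenSearch expresses k-NN with a different query syntax, so check the docs for the version you run.

```python
from typing import Any


def build_hybrid_search_body(
    query_text: str,
    query_vector: list[float],
    k: int = 10,
    num_candidates: int = 100,
) -> dict[str, Any]:
    """Build a request body that combines a BM25 match with approximate kNN.

    Assumes an index with a text field "body" and a dense_vector field
    "embedding"; both names are hypothetical.
    """
    if num_candidates < k:
        raise ValueError("num_candidates must be >= k")
    return {
        # Lexical side: a standard BM25-scored match query.
        "query": {"match": {"body": {"query": query_text}}},
        # Vector side: HNSW search that considers num_candidates per shard.
        "knn": {
            "field": "embedding",
            "query_vector": query_vector,
            "k": k,
            "num_candidates": num_candidates,
        },
        "size": k,
    }


if __name__ == "__main__":
    body = build_hybrid_search_body("reset password", [0.1, 0.2, 0.3], k=5)
    print(sorted(body))  # ['knn', 'query', 'size']
```

The dict would be sent as the JSON body of a search request. How the two scores are combined (and whether a rank-fusion option is available) depends on the Elasticsearch version and license tier, so treat the scoring behaviour as something to verify rather than assume.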

Common follow-ups: - "How does ES vector search compare to Qdrant?" - "What about Vespa?"

Traps: - Adding Elasticsearch just for vector search. Overkill if you don't already have it.

Related cross-cutting: Architecture choices Related module: learning/02_ai_infrastructure/03_vector_retrieval_infrastructure/, learning/01_ai_engineering/07_search_relevance_ranking/

Security & governance¶

Q: "How do you secure a vector database against prompt-injection and data exfiltration?"¶

Tags: senior · common · design · source: Blockchain-Council Vector DB Security 2026; standard senior security probe

Answer outline: - The risk: vectors are derived from text that may itself contain injection payloads or PII. Three angles: - Tenant isolation: see the multi-tenancy question. Hard isolation prevents cross-tenant leakage; logical isolation via tenant filter requires enforcement at the SDK / middleware layer. - PII at ingestion: detect and redact PII before embedding/storing. The stored chunks (and any source text captured in traces) inherit PII, so redact upstream. - Injection content in retrieved chunks: see safety-guardrails.md (indirect prompt injection). Chunks retrieved from the vector DB may contain "ignore prior instructions" payloads. Quarantine retrieved content before passing it to the LLM. - Access control: - API keys scoped per tenant / per role; rotate regularly. - Audit log on every query (who, when, what filter, top-K returned). - Network: vector DB in a VPC; no public ingress except via the gateway. - Encryption: at-rest (most managed services do this; self-hosted needs explicit config) and in-transit (TLS). - Backup / DR: vector DBs are not free to rebuild. Snapshot regularly; test restore. - Numbers to drop: "audit log retention: 30-180 days typical", "PII redaction at ingestion catches ~95-98% of structured PII"

Common follow-ups: - "What if a tenant uploads malicious content to their own corpus?" - "How do you handle backup encryption?"

Traps: - Treating vector DB security as just access control. The data inside matters.

Related cross-cutting: Production patterns Related module: learning/02_ai_infrastructure/03_vector_retrieval_infrastructure/, learning/03_ai_security_safety/00_safety_guardrail_design/, learning/03_ai_security_safety/01_prompt_injection_security/

Q: "What goes wrong in vector DBs in production?"¶

Tags: senior · common · debugging · source: standard senior production-debug probe; reported in 2026 RAG-ops loops

Answer outline: - Top failure modes: - Embedding model version drift: a developer updates the embedding model in one service but does not re-embed the corpus. Embeddings from different versions end up in the same index, and similarity scores become meaningless. Mitigate with version tagging on vectors. - Missing tenant filter: one code path forgot to add tenant_id to the filter; cross-tenant data appears in results. Mitigate with middleware enforcement + red-team tests. - Stale index from failed re-embed: the re-embed job only partially completed, so some chunks use the new model and some the old. Mitigate with dual-write + verification before cutover. - Tombstone accumulation: deletes pile up and query latency degrades. Mitigate with scheduled compaction. - Filter selectivity surprise: one tenant's data is much larger than the others'; their queries trigger pathological pre-filter performance. Mitigate with per-tenant routing between pre-filter, filter-aware, and post-filter strategies. - Embedding-quality drift: a new tenant uploads data outside the embedding model's training distribution. Recall drops silently for that tenant. Mitigate with per-tenant recall monitoring. - Index OOM: HNSW runs out of memory when the corpus grows past hardware limits. Mitigate with sizing alerts + planned scale-up. - Pinecone-style eventual-consistency window: the app writes a vector, immediately queries, and doesn't find it. Mitigate with retry-with-backoff or an explicit consistency tier. - Observability essentials: per-tenant query rate, per-tenant recall (on sampled labeled data), index size, tombstone ratio, p99 query latency by filter pattern. - Numbers to drop: "embed-version tag on every vector", "per-tenant recall monitoring weekly", "tombstone ratio alarm: >10%"
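Two of these signals, the embed-version tag and the tombstone ratio alarm, are simple enough to sketch as a periodic health check. The record shape (`embed_version`, `deleted`) and the function name are assumptions for illustration; a real check would read these from index metadata or a sampled scan rather than from an in-memory list.

```python
from collections import Counter
from typing import Any


def check_index_health(
    records: list[dict[str, Any]],
    expected_version: str,
    max_tombstone_ratio: float = 0.10,
) -> list[str]:
    """Return alert messages for mixed embedding versions and tombstone build-up.

    Each record is assumed to carry an "embed_version" tag and an optional
    "deleted" flag; both field names are hypothetical.
    """
    alerts: list[str] = []
    if not records:
        return alerts

    live = [r for r in records if not r.get("deleted", False)]

    # Version drift: any live vector not produced by the expected model.
    versions = Counter(r.get("embed_version", "untagged") for r in live)
    stale = {v: n for v, n in versions.items() if v != expected_version}
    if stale:
        alerts.append(f"mixed embedding versions: {dict(sorted(stale.items()))}")

    # Tombstone accumulation: share of records that are soft-deleted.
    ratio = (len(records) - len(live)) / len(records)
    if ratio > max_tombstone_ratio:
        alerts.append(
            f"tombstone ratio {ratio:.0%} exceeds {max_tombstone_ratio:.0%}"
        )
    return alerts
```

The 10% default mirrors the "tombstone ratio alarm: >10%" figure above. The same pattern extends to the other signals in the list, for example comparing per-tenant recall on a labeled sample against a baseline.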

Common follow-ups: - "How would you debug a sudden recall drop?" - "What alert would you put on the index?"

Traps: - Generic "monitoring" answer. Senior interviewers want specific failure modes and specific signals.

Related cross-cutting: Production patterns Related module: learning/02_ai_infrastructure/03_vector_retrieval_infrastructure/, learning/01_ai_engineering/03_agent_observability_debugging/