05. Finding the screenshot from a text question — multimodal indexing and incremental updates¶

~24 min read. The copilot asks "what's relevant to this payment failure for customer 88213?" The answer lives in three places: a chat (text), a transcript (audio-derived text), and a screenshot of the error (image). One query must reach all three — and the screenshot the customer uploaded 8 seconds ago must already be findable. This file builds the index that makes that true.

Built on incremental indexing, derived artifact, and modality cost asymmetry named in 00-first-principles.md, the idempotent-upsert sink from 04-streaming-transforms-and-embeddings.md, and the freshness gap's last leg Δx from 01-batch-vs-streaming-pressure.md. Chapter 04 produced embeddings; this file makes them retrievable, fresh, and cross-modal.

What chapter 04 settled, and what is still unreachable¶

Chapter 04 filled the derived zone: transcripts in the lakehouse, captions, and per-modality embeddings upserted into a vector index without duplicates. But an embedding in a sink is inert. The copilot does not scan embeddings; it issues a query and expects the most relevant chunks back in tens of milliseconds — and chapter 01's freshness gap had a final leg, Δx, "artifact → searchable in the index," that we deferred to here. Two hard questions remain. First, cross-modal reach: the customer typed a complaint and uploaded a screenshot; a text query must surface that image. Different modalities, different embedding shapes — how does one query span them? Second, freshness at the index: the screenshot embedded 8 seconds ago has to be findable now, which means the index must absorb a new vector continuously, not in a nightly rebuild.

Both are about the read path, and both are where the freshness gap finally closes or stays open. A perfectly fresh stream that lands embeddings in an index the copilot cannot query across modalities — or an index that only refreshes nightly — wastes every second the earlier chapters saved.

What this file solves¶

A copilot needs to retrieve relevant context across text, audio transcripts, and screenshots from a single query, while new embeddings stream in every second and must become searchable within seconds — not at the next index rebuild. This file shows how per-modality embeddings land in a vector index alongside metadata filters, how a shared embedding space lets a text query reach an image (cross-modal retrieval), and how incremental upserts and segment-based indexing keep an approximate-nearest-neighbor index fresh without the nightly full rebuild that would reopen the freshness gap.

1) Why a metadata filter alone can't find the screenshot — the need for vector retrieval¶

The first instinct is to retrieve by metadata: WHERE customer=88213 AND topic='payment' ORDER BY ts DESC. It works when you know the exact attributes. It fails the moment the relevant thing is described in different words than the query, which is most of the time. The customer's chat says "payment keeps failing," the transcript says "card got declined three times," the screenshot shows an error dialog with code E_CARD_DECLINED. A keyword/metadata filter on "payment failure" misses the transcript ("declined") and cannot read the image at all.

So the real need is retrieval by semantic similarity, not by exact attribute match — find chunks whose meaning is close to the query, even in other words or other modalities. That is what a vector embedding gives: each artifact becomes a point in a high-dimensional space where semantically close things are geometrically close, and retrieval is nearest-neighbor search. Metadata filtering does not disappear; it narrows the search (this customer, last 30 days) so the vector search runs over a relevant subset. The two compose: filter for correctness boundaries, vectors for semantic reach.

METADATA-ONLY (exact match)            VECTOR + METADATA (semantic + boundary)
 WHERE customer=88213                   filter: customer=88213, ts>now-30d   (narrow)
   AND text LIKE '%payment failure%'    then: nearest vectors to embed("payment failure")
 ✗ misses "card declined"               ✓ "card declined" is geometrically near
 ✗ cannot read the screenshot           ✓ screenshot embedding near the text query (cross-modal)
   exact words only                       meaning across words and modalities

Why this rule exists. Two artifacts about the same problem rarely share the same words and never share a modality (text vs image). Exact-match retrieval keys on surface form; the copilot needs to retrieve on meaning. Embeddings turn meaning into geometry so "close in meaning" becomes "close in space," and similarity search finds it. Metadata stays as the boundary (whose data, how recent, what type) that keeps the semantic search correct and scoped.
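A minimal sketch of the filter-then-rank composition described above. The `Artifact` fields, the `retrieve` helper, and the 3-dimensional toy vectors are all hypothetical and chosen for illustration; a real deployment would push both the filter and the similarity search into the vector engine rather than scan in Python.

```python
import math
from dataclasses import dataclass


@dataclass
class Artifact:
    id: str
    customer: int
    age_days: float
    modality: str
    vec: list[float]


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query_vec, artifacts, customer, max_age_days=30.0, k=5):
    """Filter on the correctness boundary first, then rank by meaning."""
    scoped = [a for a in artifacts
              if a.customer == customer and a.age_days <= max_age_days]
    scoped.sort(key=lambda a: cosine(query_vec, a.vec), reverse=True)
    return scoped[:k]


# Toy data: vectors are invented for illustration, not real embeddings.
artifacts = [
    Artifact("chat:90412", 88213, 0.0, "text", [0.9, 0.3, 0.1]),
    Artifact("img:90413", 88213, 0.0, "image", [0.8, 0.1, 0.4]),
    Artifact("chat:old", 88213, 45.0, "text", [1.0, 0.2, 0.0]),   # outside 30 days
    Artifact("chat:other", 77001, 0.0, "text", [1.0, 0.2, 0.0]),  # another customer
]
query = [1.0, 0.2, 0.0]  # stands in for embed("payment failure")
hits = retrieve(query, artifacts, customer=88213)
print([a.id for a in hits])
```

Note that `chat:other` has a vector identical to the query and would rank first on similarity alone; only the metadata filter keeps it out of the result.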

2) The core picture: one query fanning across modalities and segments¶

   copilot query: "payment failure, customer 88213, recent"
            │ embed query (text encoder)
            ▼
   ┌──────────────────── VECTOR INDEX (shared space) ───────────────────┐
   │  metadata filter: customer=88213  AND  ts > now-30d                  │
   │                                                                      │
   │   text vectors        transcript vectors      image vectors          │
   │   • chat:90412 ●        • txn:90414 ●            • img:90413 ●         │
   │        ╲                    │                    ╱                    │
   │         ╲     ANN search over the filtered set  ╱                     │
   │          ●────────── nearest k by cosine ──────●                      │
   │                                                                      │
   │  SEGMENTS:  [ sealed seg A ][ sealed seg B ] ... [ GROWING seg N ]    │
   │             immutable, indexed (HNSW/IVF)        fresh upserts land    │
   │                                                  here, searched too    │
   └──────────────────────────────────────────────────────────────────────┘
            │ top-k chunks (mixed modality), each with raw_s3 back-pointer
            ▼
   copilot grounds answer on chat + transcript + screenshot

Two ideas in one picture. Cross-modal: chat, transcript, and image embeddings live in the same vector space (produced so a text query and an image of the same concept land near each other), so one ANN search returns all three modalities ranked together. Incremental freshness: the index is a set of segments — older ones sealed and fully indexed, a growing segment receiving fresh upserts — and the search covers both, so the screenshot upserted 8 seconds ago into the growing segment is already returned. No nightly rebuild stands between the stream and the query.

The 14:32 query, traced. By 14:32:10 the index holds the first two rows below (from chapter 04's idempotent upserts); the transcript row lands at 14:34:20, once ASR finishes:

id=chat:88213:90412  modality=text  vec=[...]  raw_s3=.../chat   ts=14:32:01
id=img:88213:90413   modality=image vec=[...]  raw_s3=.../shot.png ts=14:32:03
id=txn:88213:90414   modality=text  vec=[...]  raw_s3=.../call.wav ts=14:34:20  (lands later — ASR)

The copilot embeds its query and searches:

filter: customer=88213 AND ts > now-30d        → narrows to this customer's recent artifacts
ANN over filtered set, k=5, metric=cosine:
  1. chat:90412   sim 0.89  "payment failed again"           (text)
  2. img:90413    sim 0.81  [screenshot of E_CARD_DECLINED]   (image)  ← cross-modal hit
  3. older chat   sim 0.64  refund from 9 days ago            (text)
  (txn:90414 not yet present at 14:32:10 — transcript lands at 14:34)

The win over chapter 01: the chat from 9 seconds ago and the screenshot from 7 seconds ago are both returned. The freshness gap is closed at the read layer because the growing segment is searched. The cross-modal win: the text query surfaced the image at rank 2 without anyone running OCR or a separate image search; the screenshot's embedding sits near the text query in the shared space. And each result carries the raw_s3 back-pointer from chapter 03, so the copilot (or an auditor) can fetch the original bytes.

Note the asymmetry the trace makes visible: the transcript (txn:90414) is not there at query time because ASR hasn't finished. This is modality cost asymmetry at the retrieval layer: text and image are retrievable in seconds, while audio-derived text lags by minutes. The index is fresh; the audio modality is simply not ready, exactly as chapter 01 predicted.
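The timing in the trace can be checked with a few lines of arithmetic. This sketch assumes the `ts` column is the moment each vector became searchable and that the query runs at 14:32:10; the helper name `visible_ages` is hypothetical.

```python
from datetime import datetime

FMT = "%H:%M:%S"
query_time = datetime.strptime("14:32:10", FMT)

# (id, ts) pairs copied from the trace above.
landed = [
    ("chat:88213:90412", "14:32:01"),
    ("img:88213:90413", "14:32:03"),
    ("txn:88213:90414", "14:34:20"),  # ASR-derived transcript, lands later
]


def visible_ages(rows, now):
    """Return {id: age in seconds} for artifacts already in the index at `now`."""
    ages = {}
    for artifact_id, ts in rows:
        age = int((now - datetime.strptime(ts, FMT)).total_seconds())
        if age >= 0:  # a negative age means the artifact has not landed yet
            ages[artifact_id] = age
    return ages


ages = visible_ages(landed, query_time)
print(ages)
```

The chat and the screenshot are a few seconds old at query time, while the transcript is absent because its timestamp is still in the future.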

4) Rule: keep the index searchable while it absorbs writes — never block reads on a rebuild¶

The chapter's invariant: a fresh retrieval index serves reads continuously while accepting streaming upserts, by searching sealed segments plus a small growing segment — so a new vector is searchable within seconds and no full rebuild ever stands between an event and a query. The freshness gap's last leg, Δx, is small only if the index update is incremental. The moment freshness depends on a nightly rebuild, every second chapters 02–04 saved is thrown away at the finish line.

Incremental updates carry a cost the rule has to manage: ANN indexes like HNSW degrade under sustained insert and delete churn (deleted and overwritten vectors leave tombstones and weakened links, the graph fragments, recall drifts), so you cannot mutate one structure forever. The segment design resolves this: writes go to a small growing segment (cheap to keep current), segments seal and get a clean index built once, and background compaction merges and rebuilds sealed segments to restore recall. Reads always see all segments; the rebuild happens behind the live index, not in front of the query.

WRITE PATH (continuous)          READ PATH (continuous, never blocked)
 upsert ─▶ growing segment        query ─▶ search( sealed segs + growing seg )
           (fresh, small)                  ─▶ merge results, rank, return
 seal when full ─▶ build index             freshness = growing seg is current
 background: compact/rebuild               recall   = sealed segs kept clean by compaction
 sealed segs (restore recall)              the two are decoupled

Teacher voice. Freshness and recall pull in opposite directions inside an ANN index, and the segment design is how you stop them from fighting on the read path. Insert-heavy HNSW stays fresh but its recall drifts; a clean rebuild restores recall but is slow and would block reads if done in front of them. So you split the index in two: a tiny growing part absorbs writes and keeps freshness, sealed parts get rebuilt in the background to keep recall, and the query reads both. If you ever hear "we rebuild the index nightly so search is fresh in the morning," the freshness gap from chapter 01 has reappeared at the index — fresh data, stale index.
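A toy sketch of the segment idea, under loud assumptions: the class `SegmentedIndex`, its `seal_at` threshold, and the version counter are hypothetical, and a brute-force scan stands in for the HNSW/IVF index a real engine would build on each sealed segment. It only illustrates that reads cover sealed plus growing segments and that compaction runs behind them.

```python
import heapq
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


class SegmentedIndex:
    def __init__(self, seal_at=3):
        self.seal_at = seal_at
        self.sealed = []    # list of dicts: id -> (version, vec); not mutated once sealed
        self.growing = {}   # id -> (version, vec); receives fresh upserts
        self.latest = {}    # id -> newest version, used to ignore stale copies
        self._clock = 0

    def upsert(self, id_, vec):
        """Keyed write: a newer version shadows any older copy of the same id."""
        self._clock += 1
        self.latest[id_] = self._clock
        self.growing[id_] = (self._clock, vec)
        if len(self.growing) >= self.seal_at:
            self.sealed.append(self.growing)  # seal; a real engine builds an index here
            self.growing = {}

    def search(self, query, k=5):
        """Search every sealed segment plus the growing one, then merge top-k."""
        hits = []
        for segment in [*self.sealed, self.growing]:
            for id_, (version, vec) in segment.items():
                if self.latest.get(id_) == version:  # skip stale versions
                    hits.append((cosine(query, vec), id_))
        return heapq.nlargest(k, hits)

    def compact(self):
        """Merge sealed segments into one, dropping stale versions."""
        merged = {}
        for segment in self.sealed:
            for id_, (version, vec) in segment.items():
                if self.latest.get(id_) == version:
                    merged[id_] = (version, vec)
        self.sealed = [merged] if merged else []


index = SegmentedIndex(seal_at=2)
index.upsert("chat:90412", [0.9, 0.3, 0.1])
index.upsert("old-chat", [0.1, 0.9, 0.2])    # second write seals the first segment
index.upsert("img:90413", [0.8, 0.1, 0.4])   # fresh write sits in the growing segment
ids = [id_ for _, id_ in index.search([1.0, 0.2, 0.0], k=2)]
print(ids)
```

The screenshot is returned even though its segment has never been sealed or indexed, which is the property that keeps Δx small; `compact()` can be called at any time without changing what `search` returns.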

The deep design choice is how a text query reaches an image. Two approaches, and the choice changes the whole retrieval path.

Attempt A — separate index per modality, query each, merge¶

Embed text with a text model, images with an image model, into separate indexes. The copilot queries each and merges results.

Helps: each modality uses a model tuned for it; simple to reason about per modality.

Hurts: the embedding spaces are not comparable — a cosine score of 0.8 in the text index does not mean the same as 0.8 in the image index, so merging and ranking across them is guesswork. And a text query cannot search the image index at all unless you also embed the query as an image, which is nonsense. You end up needing a separate text-only path to images: caption every image, embed the caption as text, and retrieve the caption. That works (the "caption-and-index" pattern) but adds a captioning model call and loses detail the caption omits.

Attempt B — one shared multimodal embedding space¶

Use a multimodal model (voyage-multimodal-3.5, Cohere Embed 4) that embeds text and images into the same vector space, trained so a screenshot and a text description of it land near each other. Now a text query embedding searches one index containing all modalities, and cosine scores are comparable because everything shares the space.

Helps: one index, one query, directly comparable scores, and true cross-modal reach, so the text query finds the screenshot with no captioning step. 2026 multimodal models embed interleaved text+image (and voyage-multimodal-3.5 adds video frames) in a single call, with Matryoshka dimensions (for example 256/512/1024/1536 in Cohere Embed 4) so you can trim storage by truncating vectors without re-embedding.

Hurts: a general multimodal model may underperform a specialist text model on pure text-to-text relevance; you trade a little per-modality precision for cross-modal comparability and operational simplicity.
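The "truncate without re-embedding" step mentioned under Attempt B can be sketched as follows. This assumes the vectors come from a model trained with Matryoshka-style dimensions, where the leading components carry most of the signal; truncating an ordinary embedding this way would not preserve quality. The helper name `truncate_embedding` is hypothetical.

```python
import math


def truncate_embedding(vec, dims):
    """Keep the first `dims` components and L2-renormalize for cosine search."""
    if dims <= 0 or dims > len(vec):
        raise ValueError("dims must be between 1 and len(vec)")
    head = vec[:dims]
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head] if norm else list(head)


full = [3.0, 4.0, 12.0]            # toy stand-in for a 1024-d vector
short = truncate_embedding(full, 2)
print(short)
```

Query vectors have to be truncated to the same length as the stored vectors, otherwise the two are no longer comparable.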

So the real choice is not "which embedding model" but whether modalities must be compared in one ranked list. If the copilot needs one fused top-k spanning chat, transcript, and screenshot — which it does — a shared space (Attempt B) is the clean answer; separate indexes (Attempt A) force fragile cross-index score reconciliation or a captioning detour.

Mini-FAQ. "Why not caption every image into text and avoid multimodal embeddings entirely?" Caption-and-index is the simplest pattern and sometimes right — but a caption is lossy. "Error dialog with a red banner" loses the exact code E_CARD_DECLINED and the field the cursor was in. A shared multimodal embedding retains visual detail the caption drops, and it skips a model call. Use captions when images are simple and a text-only stack is a hard constraint; use shared multimodal embeddings when visual detail matters for retrieval — which it does for error screenshots.

6) The property that changes the design: ANN index type and the recall/freshness/cost trade¶

The dominant knob is the index structure, and each choice trades recall, query latency, memory, and how gracefully it absorbs streaming inserts.

| Index | Query latency | Recall | Streaming inserts | Memory | Fit on this platform |
| --- | --- | --- | --- | --- | --- |
| Flat (brute force) | high (scans all) | 1.0 (exact) | trivial (just append) | low | tiny per-customer subsets after a tight filter |
| HNSW | very low | high | inserts OK, degrades with churn, needs periodic rebuild | high (graph in RAM) | hot recent data, low-latency copilot reads |
| IVF / IVF-PQ | low | tunable (nprobe) | delta-store + periodic re-cluster | lower (PQ compresses) | large historical sets, cost-sensitive |

The reconciling pattern most 2026 vector DBs use: HNSW (or IVF) on sealed segments for fast recall, a small growing segment for fresh inserts, background compaction to merge and rebuild. Milvus separates storage and compute and organizes data into segments exactly for this; Qdrant keeps quantized vectors and the HNSW graph hot while streaming in. The metadata filter is what makes brute-force viable for the common copilot case: after customer=88213 AND ts>now-30d the candidate set may be a few hundred vectors, where exact search is microseconds and ANN's recall drift is irrelevant. Filter hard, and the ANN structure matters less for per-customer queries; it matters most for broad cross-customer searches (e.g., "find similar incidents across all customers").

Teacher voice. A tight metadata pre-filter is the cheapest retrieval optimization there is. Many "we need a fancier ANN index" problems vanish once you scope the search to one customer's recent data — the candidate set is small enough that even brute force is instant and exact. Reach for HNSW/IVF tuning when the unfiltered search space is large (cross-customer similarity, global pattern mining), not when a filter already shrank it.
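A back-of-envelope sketch of why the filter changes the picture. It counts one multiply-add per dimension per candidate for an exact scan; the `flat_limit` cutoff in `choose_index` is a hypothetical threshold that would have to be measured on real hardware.

```python
def scan_cost(candidates: int, dims: int = 1024) -> int:
    """Multiply-adds for one exact (flat) scan: one dot product per candidate."""
    return candidates * dims


def choose_index(candidates: int, flat_limit: int = 10_000) -> str:
    """Pick exact search for small filtered sets, ANN for large ones."""
    return "flat" if candidates <= flat_limit else "ann"


filtered = scan_cost(300)            # a few hundred vectors after the filter
unfiltered = scan_cost(7_600_000)    # one year of artifacts, no filter
print(filtered, unfiltered, round(unfiltered / filtered))
print(choose_index(300), choose_index(7_600_000))
```

The unfiltered scan does roughly 25,000 times more work than the filtered one, which is why index tuning only starts to matter once the filter is loose or absent.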

7) Cost and freshness table: update strategies under this workload¶

Order-of-magnitude for the running platform (~7.6M artifacts/year, 1024-d). Verify against your engine.

| Strategy | Index freshness (Δx) | Recall stability | Cost / ops | When to use |
| --- | --- | --- | --- | --- |
| Nightly full rebuild | up to 24 h | excellent (clean build) | cheap compute, terrible freshness | never for a live copilot; reopens chapter 01's gap |
| Periodic mini-rebuild (e.g., 15 min) | ~15 min | good | medium | loose-freshness search, not interactive |
| Streaming upsert into growing segment + background compaction | ~2–8 s | good (compaction restores) | higher (always-on index, compaction CPU) | the copilot: interactive, fresh |
| Streaming upsert, no compaction | ~2–8 s | degrades over time | low until recall rots | the trap: fresh but silently less accurate |

Row three is the right answer for an interactive copilot and the only one that keeps Δx in seconds. Its hidden cost is compaction: merging and rebuilding sealed segments consumes CPU/IO continuously, and if compaction falls behind, recall drifts (row four's failure mode creeping in). The freshness you bought in chapters 02–04 is preserved at the index only if you pay this compaction cost. Storage is modest: one year of artifacts as float32 is 7.6M × 1024 × 4 bytes ≈ 31 GB raw, less with PQ or Matryoshka truncation, so the cost is compute and the always-on index, not bytes. Pressure evolution: incremental indexing relieves the freshness gap (Δx → seconds) but creates compaction pressure (continuous CPU to hold recall), absorbed by the vector DB's background workers.
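The storage figure is simple arithmetic and can be reproduced directly. The sketch assumes float32 values (4 bytes each), decimal gigabytes, and raw vectors only; graph links, metadata, and replication are not counted.

```python
def index_bytes(vectors: int, dims: int, bytes_per_value: int = 4) -> int:
    """Raw vector storage only: no graph, metadata, or replica overhead."""
    return vectors * dims * bytes_per_value


N = 7_600_000  # roughly one year of artifacts on the running platform
for dims in (1024, 512, 256):  # full size and two truncated sizes
    gb = index_bytes(N, dims) / 1e9
    print(f"{dims:>4} dims: {gb:.1f} GB")
```

At 1024 dimensions this comes to about 31 GB, and each halving of the dimension halves the raw size.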

8) Operational signals: watching the retrieval layer¶

Healthy: query latency p99 low and flat (e.g., <50 ms after filter); index freshness (time from upsert to searchable) ~seconds; recall@k stable against a periodic ground-truth check; growing-segment size oscillating (sealing and compacting on schedule).
First metric to degrade: recall@k against ground truth. As inserts/deletes churn HNSW and compaction lags, recall drifts down silently — the copilot retrieves slightly worse chunks, but latency and freshness still look fine. Recall is the leading indicator of an under-compacted index; nothing else shows it.
Misleading metric people watch: query latency. It stays low even as recall rots, because returning fast wrong neighbors is still fast. Low latency reassures while retrieval quality quietly degrades — the most dangerous comfort metric in the stack.
First graph an expert opens: recall@k over time overlaid with compaction lag and growing-segment size. They look for recall trending down as compaction backs up, and for the growing segment never sealing (freshness fine, recall doomed) or sealing too often (compaction thrash).
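Recall@k against ground truth is straightforward to compute once you have, for a sample of queries, both the index's answer and the exact brute-force answer. A minimal sketch; the function names and the sample ids are hypothetical.

```python
def recall_at_k(approx_ids, exact_ids, k):
    """Fraction of the true top-k that the ANN index also returned in its top-k."""
    truth = set(exact_ids[:k])
    if not truth:
        return 1.0  # nothing to find, so nothing was missed
    return len(truth & set(approx_ids[:k])) / len(truth)


def mean_recall(samples, k):
    """Average recall@k over (approx_ids, exact_ids) pairs from sampled queries."""
    return sum(recall_at_k(a, e, k) for a, e in samples) / len(samples)


samples = [
    (["a", "b", "c"], ["a", "b", "c"]),  # index agrees with exact search
    (["a", "x", "c"], ["a", "b", "c"]),  # index missed "b"
]
score = mean_recall(samples, k=3)
print(round(score, 3))
```

Plotting this value over time next to compaction lag gives the graph described above; the exact answers can come from a periodic flat scan over the same filtered set.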

9) Boundary: where this index design fits, and where it doesn't¶

Strong fit: interactive, fresh, cross-modal retrieval over recent data with tight per-entity filters — exactly the copilot. Incremental upserts + segments keep Δx in seconds; the metadata filter keeps per-customer search exact and cheap.
Pathological: using a heavy ANN index with continuous churn and no compaction budget — recall rots invisibly. Or forcing cross-modal retrieval where modalities are genuinely incomparable (retrieving audio waveforms by text), where a shared space buys nothing and adds model cost.
Scale/workload limit that breaks intuition: at small filtered candidate sets, brute-force exact search beats every fancy ANN index — the index tuning is wasted effort. At very large unfiltered search (global cross-customer similarity over hundreds of millions of vectors), HNSW's RAM cost and compaction load dominate, and IVF-PQ or disk-resident indexes become necessary. The intuition "always use HNSW" breaks at both ends: too small (brute force wins) and too large (memory wall).

10) Wrong model to drop: "vector search makes metadata filtering obsolete"¶

The seductive idea is that since embeddings capture meaning, you can drop structured filters and let similarity do everything. It feels clean — one mechanism. The correct model: vector similarity finds semantically near things; it does not enforce correctness boundaries. Without customer=88213 the copilot can retrieve a different customer's semantically-similar payment complaint and ground its answer in someone else's data — a privacy and correctness failure no similarity threshold fixes. Filters enforce whose data, how recent, and what type; vectors find meaning within that boundary. They compose; neither replaces the other. (This is also why deletion in chapter 07 must hit both the vector and its metadata.)

11) Other retrieval-layer failure shapes¶

Recall rot — HNSW churn without compaction; retrieval quality drifts down while latency and freshness look healthy.
Stale-index gap — freshness depends on a nightly/periodic rebuild; chapter 01's gap reappears at the read layer.
Cross-modal score mismatch — separate per-modality indexes merged by raw scores; a 0.8 in one space outranks a truly better 0.75 in another.
Missing-filter leakage — vector search without a customer filter retrieves another customer's data; privacy + correctness failure.
Dimension/model drift — half the index built with the old embedding model, half with the new; vectors are no longer comparable, and recall collapses until the old half is re-embedded (chapter 04 skew biting here).
Tombstone debt — deletes mark vectors as removed but compaction lags, so search wastes work on tombstoned vectors and returns deleted-but-not-purged data.
Growing-segment bloat — segment never seals; fresh data fine but the in-memory growing index grows unbounded and search slows.
Filter-then-empty — a too-tight filter (new customer, no history) returns zero candidates; copilot over-relies on the live turn (cold-start, chapter 01).
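Several of these shapes can be caught by a cheap check on each result set. The sketch below assumes every result carries `customer` and `model` metadata fields and that a set of deleted ids is available; those field names and the `audit_results` helper are hypothetical.

```python
def audit_results(results, customer, expected_model, deleted_ids):
    """Return labels for the failure shapes visible in one result set."""
    if not results:
        return ["filter-then-empty"]
    problems = []
    for r in results:
        if r["customer"] != customer:
            problems.append(f"missing-filter leakage: {r['id']}")
        if r["model"] != expected_model:
            problems.append(f"model drift: {r['id']}")
        if r["id"] in deleted_ids:
            problems.append(f"tombstone debt: {r['id']}")
    return problems


results = [
    {"id": "chat:90412", "customer": 88213, "model": "v2"},
    {"id": "chat:555", "customer": 77001, "model": "v2"},   # wrong customer
    {"id": "img:90001", "customer": 88213, "model": "v1"},  # old embedding model
]
problems = audit_results(results, customer=88213, expected_model="v2",
                         deleted_ids={"img:90001"})
print(problems)
```

Recall rot and growing-segment bloat do not show up in a single result set; they need the recall and segment-size metrics from section 8.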

12) Pattern transfer¶

Index freshness = the freshness gap's last leg (chapter 01) — Δx is small only with incremental updates; a nightly rebuild is the same clock-triggered staleness chapter 01 fought, relocated to the index.
Segment-and-compact = LSM-tree — growing segment + sealed segments + background compaction is exactly the log-structured merge-tree shape (memtable + SSTables + compaction); the freshness/recall trade mirrors write-amplification vs read-amplification in storage engines.
Filter narrows, vectors rank = two-stage retrieval — the same shape as a SQL index scan (narrow) feeding a sort (rank), and the same as candidate-generation-then-ranking in recommender systems.
Idempotent upsert (chapter 04) — the keyed upsert that prevented duplicates in the transform is the same write the index relies on; a duplicate vector would distort similarity scores.

13) Design test¶

Does retrieval compose a metadata filter (whose data, how recent) with vector similarity (meaning), never relying on similarity alone for correctness boundaries?
Do modalities that must appear in one ranked list share a single embedding space, so scores are comparable and a text query reaches an image?
Is the index updated by streaming upserts into a growing segment that reads can see immediately — never gated by a full rebuild?
Is there a compaction budget keeping recall stable, and do you measure recall@k against ground truth (not just latency)?
Is every retrieved chunk carrying a raw_s3 back-pointer and a customer filter, so results are auditable and scoped?

Where this appears in production¶

Vector indexing engines and streaming ingest:
- Milvus: separates storage and compute, organizes data into sealed/growing segments with background compaction; built for streaming upserts at billions of vectors.
- Qdrant: keeps quantized vectors and the HNSW graph hot for streaming ingest with hybrid memory/disk storage.
- Pinecone: managed vector DB with live upserts and metadata filtering for fresh retrieval.
- Weaviate: multimodal vector DB with module-based embedding and live inserts.
- pgvector (Aurora/RDS): HNSW in Postgres for teams keeping vectors next to relational metadata; GPU-accelerated build options in 2026.
- OpenSearch / Vespa: combine vector ANN with rich metadata filtering and ranking at scale.

Cross-modal and multimodal retrieval in products:
- Voyage voyage-multimodal-3.5: single shared space for interleaved text+image+video frames with Matryoshka dimensions.
- Cohere Embed 4: text+image in one space, 128K context, Matryoshka dims for storage trimming.
- Pinterest visual search: image embeddings retrieved by image and text in a shared space.
- Google multimodal search: text queries retrieving images/video via shared embedding spaces.
- Amazon product search: image + text retrieval over catalog embeddings with metadata filters.
- Intercom Fin / support copilots: per-customer filtered retrieval over recent chat + transcript + screenshot embeddings, fresh within seconds.
- Notion / Glean enterprise search: vector + permission/metadata filters so retrieval respects access boundaries (the filter-enforces-correctness pattern).
- ColPali / ColQwen page-as-image retrieval: late-interaction multimodal retrieval over document pages without parsing.
- Spotify / podcast search: transcript embeddings retrieved by text query, the audio-derived-text retrieval path.
- GitHub code search: embedding + metadata filter (repo, language) two-stage retrieval.

Pause and recall¶

Why can't a metadata/keyword filter alone find the screenshot and the "card declined" transcript for a "payment failure" query?
How does a single text query return an image at rank 2 with no OCR or separate image search?
State the chapter's invariant. How do segments let the index stay fresh and keep recall?
Why do freshness and recall pull against each other inside an HNSW index, and what cost reconciles them?
Shared multimodal space vs separate per-modality indexes — what is the real deciding question?
Why is a tight metadata pre-filter often a cheaper win than tuning the ANN index?
Which metric degrades first as the index under-compacts, and which metric misleadingly stays healthy?
Why is dropping the customer filter a correctness and privacy failure that no similarity threshold fixes?

Interview Q&A¶

Q1. A text query needs to surface a screenshot of an error. How do you make that work? A. Embed text and images into one shared multimodal space (voyage-multimodal-3.5, Cohere Embed 4) so a text query embedding lands near the screenshot's embedding, and search one index containing both modalities — scores are comparable, one ranked list. The alternative (separate indexes) forces either fragile cross-index score reconciliation or a lossy caption-then-embed-as-text detour. Use shared space when modalities must appear in one fused top-k. Common wrong answer to avoid: "Run OCR on the image and keyword-search the text." OCR misses layout/visual cues and is exact-match; it won't connect "payment failure" to a declined-card dialog the way a shared embedding does.

Q2. Why not rebuild the vector index nightly — it's simpler and recall is great? A. A nightly rebuild reopens chapter 01's freshness gap at the read layer: the screenshot uploaded at 14:32 isn't searchable until tomorrow's build, so the copilot can't ground on it. For an interactive copilot, the index must absorb streaming upserts into a growing segment that reads see immediately, with background compaction keeping recall — Δx in seconds, not hours. Nightly rebuild is fine only for loose-freshness search. Common wrong answer to avoid: "Rebuild more often, like hourly." Still leaves an hour gap and wastes a full rebuild's compute; incremental upsert is the right shape for interactive freshness.

Q3. Latency is great but the copilot's answers feel slightly off lately. What do you check? A. Recall@k against ground truth. HNSW churn from continuous upserts/deletes degrades recall as compaction lags, so the index returns fast but slightly wrong neighbors — latency stays low (returning wrong neighbors is still fast) while quality drifts down. Check recall trend and compaction lag; latency is a misleading comfort metric here. Common wrong answer to avoid: "Latency is fine so retrieval is fine." Low latency coexists with rotting recall; quality needs its own ground-truth measurement.

Q4. When does a tight metadata filter make ANN tuning irrelevant? A. When the filter (e.g., customer=88213 AND ts>now-30d) shrinks the candidate set to a few hundred vectors, brute-force exact search is microseconds and exact, so HNSW recall drift doesn't matter. ANN tuning matters for broad unfiltered searches — cross-customer similarity over millions of vectors. Filter first; reach for index tuning only when the unfiltered space is large. Common wrong answer to avoid: "Always tune HNSW for speed." Over a small filtered set, brute force is faster to reason about and exact; the tuning is wasted.

Q5. Should vector similarity replace metadata filtering since embeddings capture meaning? A. No — similarity finds semantically near things but enforces no correctness boundary. Without a customer filter, the copilot can retrieve a different customer's similar complaint and ground on someone else's data: a privacy and correctness failure. Filters enforce whose/when/what-type; vectors rank meaning within that boundary. They compose. Common wrong answer to avoid: "Embeddings encode everything, drop the filters." That leaks cross-customer data and breaks scoping; no similarity threshold substitutes for an access/recency boundary.

Q6. (Cumulative) The copilot missed the screenshot from 10 seconds ago. Is this chapter-02 backpressure, chapter-04 transform, or chapter-05 indexing? A. Locate where the artifact is. If the image event is stuck in the log (consumer lag) → backpressure (02). If it was embedded but the upsert wasn't idempotent/failed → transform sink (04). If it was upserted but the index only refreshes on a rebuild, or compaction/segment issues hid it → indexing (05). For a 10-second miss with healthy lag, suspect the index update path: is the growing segment searched, or does freshness wait on a rebuild? Common wrong answer to avoid: "Re-embed the image." If it was already embedded and upserted, re-embedding doesn't help; the question is whether the index made it searchable, which is the segment/rebuild design.
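The triage in Q6 is a short decision chain, sketched here as a function. The three boolean inputs are hypothetical observations you would gather from consumer lag, the sink's write log, and the index configuration; the function only encodes the order of the questions.

```python
def locate_miss(consumed: bool, upserted: bool, growing_segment_searched: bool) -> str:
    """Walk the pipeline front to back and name the first stage that failed."""
    if not consumed:
        return "chapter 02: backpressure (event still waiting in the log)"
    if not upserted:
        return "chapter 04: transform sink (embedding or upsert failed)"
    if not growing_segment_searched:
        return "chapter 05: indexing (freshness waits on a rebuild)"
    return "chapter 05: indexing (check compaction and segment state)"


# Healthy consumer lag, upsert succeeded, but reads skip the growing segment.
print(locate_miss(consumed=True, upserted=True, growing_segment_searched=False))
```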

Design/debug exercise (10 min)¶

Step 1 — Modeled example. Retrieval design for the copilot's per-customer query:

Query:    "payment failure, customer 88213, recent"
Embed:    text query → shared multimodal space (same space as image/transcript embeddings)
Filter:   customer=88213 AND ts > now-30d     (boundary: scope + recency)
Search:   ANN over filtered set, k=5, cosine; search sealed segments + growing segment
Index:    HNSW on sealed segments; growing segment for fresh upserts; background compaction
Fresh:    upsert→searchable ~2–8 s (growing segment visible to reads)
Return:   top-k mixed modality, each with raw_s3 back-pointer
Watch:    recall@k vs ground truth + compaction lag (not just latency)

Step 2 — Your turn. Design retrieval for a cross-customer query the support-ops team needs: "find the 20 most similar incidents to this one across all customers in the last 90 days." Decide: what changes about the filter (no single-customer scope), why ANN index choice now matters more, what recall/latency trade you'd accept, and how you'd prevent leaking PII across customers in the results (hint: chapter 07).

Step 3 — Reproduce from memory. Redraw the section-2 diagram (query → embed → filter → ANN over sealed+growing segments → mixed-modality top-k), label where freshness comes from (growing segment) and where recall is maintained (compaction of sealed segments), and write one sentence connecting Δx here to chapter 01's freshness chain and one connecting the idempotent upsert to chapter 04.

Operational memory¶

This chapter explained how an embedding sitting in a sink becomes something the copilot can actually retrieve — across text, audio transcripts, and screenshots, freshly, within seconds. The important idea is two-fold: a shared multimodal embedding space lets one text query reach an image with comparable scores, and incremental segment-based indexing keeps the index searchable while it absorbs streaming upserts, so the freshness gap's last leg Δx stays in seconds instead of waiting for a rebuild.

You learned to compose a metadata filter (whose data, how recent — the correctness boundary) with vector similarity (meaning — the semantic reach), embed all retrievable modalities into one space so cross-modal queries fuse into one ranked top-k, and update the index by upserting into a growing segment that reads see immediately while background compaction rebuilds sealed segments to hold recall. That closes the freshness gap at the read layer and connects the chat, transcript, and screenshot into one retrievable story, each chunk carrying its raw_s3 back-pointer.

Carry this diagnostic forward: when answers degrade but latency is fine, measure recall@k against ground truth and check compaction lag — recall rots silently while latency stays comforting. When a recent artifact isn't retrieved, ask whether the growing segment is searched or whether freshness secretly waits on a rebuild. And never drop the customer filter: vectors find meaning, filters enforce whose data — they compose, neither replaces the other.

Remember:

Retrieve by composing a metadata filter (boundary) with vector similarity (meaning); similarity alone enforces no correctness/privacy boundary.
A shared multimodal space makes a text query reach an image with comparable scores — separate per-modality indexes can't be ranked together cleanly.
Keep the index fresh by upserting into a searchable growing segment; a nightly rebuild reopens chapter 01's gap at the read layer.
Freshness and recall fight inside HNSW; compaction is the cost that holds recall — measure recall@k, because latency stays low while recall rots.
A tight metadata pre-filter often beats ANN tuning; index tuning matters for large unfiltered searches, not small per-customer ones.

Bridge. We can now retrieve fresh, cross-modal context in seconds — the copilot finally sees the whole story. But every freshness mechanism in chapters 02–05 costs money continuously: always-on consumers, a transform layer that never sleeps, compaction burning CPU to hold recall. Some of this data is queried constantly; some is never queried at all and is being kept warm out of habit. So the real question shifts from "can we make it fresh?" to "how fresh does each path actually need to be, and what are we paying to over-deliver?" That question forces a decision about how many code paths you maintain — one always-on streaming path, or a fast path plus a cheap correct batch path. The next file confronts lambda vs kappa and the cost of always-on freshness. → 06-freshness-vs-cost-lambda-kappa.md