05. Long-Term Vector Memory — Search the cabinet by meaning

~18 min read. Once memory leaves the live prompt, we need a way to fetch the right piece back.

Built on the ELI5 in 00-eli5.md. The filing-cabinet — the big searchable store — becomes useful only when the librarian can find the right sheet fast.


1) What vector memory is trying to solve

A raw filing-cabinet can hold everything. That does not mean you can find anything. Keyword search helps for exact phrases. But users often ask indirectly.

"Use the style from last quarter's launch review." The stored memory may never contain those exact words. Vector memory solves this by storing meaning-like representations. Picture first.

memory text ──→ embedding ──→ vector index
query text  ──→ embedding ──→ nearest neighbours
                             top memories returned
The idea is simple. Texts with similar meaning land near each other. Then the librarian can ask, "What old memory sits close to this query?"

That is why the filing-cabinet becomes searchable by semantics. Simple, no? But note the danger. Near is not the same as correct.

We still need filters and ranking.
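The flow in the picture can be sketched in a few lines. This is only an illustration: the three-number vectors are hand-made stand-ins for real embeddings, the memory texts are invented for the example, and `nearest` is a hypothetical helper, not any library's API.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy "vector index": memory text -> hand-made vector (assumed values,
# standing in for what an embedding model would produce).
index = {
    "launch review used short bullets and a risk table": [0.9, 0.1, 0.2],
    "user asked for vegan dinner ideas": [0.1, 0.9, 0.1],
    "quarterly report style: headline first, numbers second": [0.7, 0.3, 0.4],
}

def nearest(query_vec, k=2):
    """Return the k (score, text) pairs closest to the query vector."""
    scored = [(cosine(query_vec, vec), text) for text, vec in index.items()]
    scored.sort(reverse=True)
    return scored[:k]

# Pretend embedding of "use the style from last quarter's launch review".
query_vec = [0.85, 0.15, 0.25]
top = nearest(query_vec)
for score, text in top:
    print(round(score, 3), text)
```

The two style-related memories come back and the dinner memory does not. A real system would swap the dictionary for a vector index and the hand-made numbers for model output, but the shape of the call stays the same.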

2) What actually gets stored

Teams sometimes store whole conversations as one vector. Usually that is poor practice. Better units are chunks with metadata.

For example:

  • one compressed summary-card
  • one tool result summary
  • one meeting decision
  • one stable preference with source

Each chunk should carry metadata. That may include:

  • user id
  • org id
  • timestamp
  • memory type
  • sensitivity level
  • source turn ids

Metadata is not optional decoration. It keeps the filing-cabinet clean. It lets the librarian filter by tenant, time, and type.

Without metadata, retrieval gets noisy fast.

┌─────────────── chunk ───────────────┐
│ text: prefers short Python examples │
│ user_id: u17                        │
│ type: semantic_fact                 │
│ source: turn_44                     │
│ ts: 2026-02-12                      │
└─────────────────────────────────────┘
Now the vector is useful. It is not floating alone. It is attached to identity and context.
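One possible way to hold such a chunk in code is a small record type. The sketch below mirrors the box above; the class name `MemoryChunk`, the field names, the default sensitivity value, and the `visible_to` helper are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Optional

@dataclass
class MemoryChunk:
    text: str                       # original text, kept beside the vector
    user_id: str
    memory_type: str                # e.g. "semantic_fact"
    source: str                     # source turn id
    ts: date
    org_id: Optional[str] = None
    sensitivity: str = "normal"     # assumed label set, not a standard
    vector: list = field(default_factory=list)

def visible_to(chunks, user_id, memory_type=None):
    """Keep only chunks owned by user_id, optionally of one memory type."""
    return [
        c for c in chunks
        if c.user_id == user_id
        and (memory_type is None or c.memory_type == memory_type)
    ]

chunks = [
    MemoryChunk("prefers short Python examples", "u17", "semantic_fact",
                "turn_44", date(2026, 2, 12)),
    MemoryChunk("prefers long Java examples", "u99", "semantic_fact",
                "turn_03", date(2026, 1, 5)),
]
mine = visible_to(chunks, "u17", "semantic_fact")
print([c.text for c in mine])
```

Keeping the metadata on the same record as the vector means a tenant or type filter can run before any similarity score is trusted.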


3) Retrieval is more than cosine similarity

After the picture, now the simple score. Suppose we use the dot-product score for intuition.

Query vector q = [0.8, 0.2].
Memory A = [0.9, 0.1].
Memory B = [0.4, 0.7].
Memory C = [0.1, 0.9].

Scores:
A = 0.8×0.9 + 0.2×0.1 = 0.74.
B = 0.8×0.4 + 0.2×0.7 = 0.46.
C = 0.8×0.1 + 0.2×0.9 = 0.26.

So A ranks first. Good.

But what if A belongs to another customer? Reject it with metadata filters. What if A is three years old? Maybe down-rank it.

What if B is marked high importance? Maybe boost it. See. Vector similarity gives a candidate set.

It should not make the final decision alone. That is why the librarian remains a policy engine, not only a nearest-neighbour call.
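To make the policy idea concrete, here is a minimal sketch that reuses the q, A, B, C vectors from the scores above. Everything else is assumed for illustration: the owner ids, the dates, the one-year half-life for recency, and the 1.5× importance boost are arbitrary choices, not recommended values.

```python
from datetime import date

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def rank(query, memories, user_id, today, half_life_days=365.0, boost=1.5):
    """Filter by owner, then combine similarity, recency, and importance."""
    ranked = []
    for m in memories:
        if m["user_id"] != user_id:
            continue  # hard filter: another customer's memory is never a candidate
        age_days = (today - m["ts"]).days
        recency = 0.5 ** (age_days / half_life_days)   # halves every half_life_days
        importance = boost if m["important"] else 1.0
        score = dot(query, m["vec"]) * recency * importance
        ranked.append((score, m["name"]))
    ranked.sort(reverse=True)
    return ranked

q = [0.8, 0.2]
today = date(2026, 3, 1)
memories = [
    # A has the best raw similarity but belongs to a different user (assumed).
    {"name": "A", "vec": [0.9, 0.1], "user_id": "u99",
     "ts": date(2026, 2, 1), "important": False},
    {"name": "B", "vec": [0.4, 0.7], "user_id": "u17",
     "ts": date(2026, 2, 1), "important": True},
    {"name": "C", "vec": [0.1, 0.9], "user_id": "u17",
     "ts": date(2026, 2, 1), "important": False},
]
result = rank(q, memories, "u17", today)
print([name for _, name in result])
```

With these assumed owners, A never reaches the ranking step even though its raw score is highest, and B comes out ahead of C. The similarity score proposes; the policy decides.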


4) Worked example: retrieving a preference

A user asks, "Draft this update like you usually do for me."

The filing-cabinet stores these memories.

  1. "Prefers short bullet summaries with a clear action list."
  2. "Works at Northstar Payments on internal platform tools."
  3. "Asked for vegan dinner ideas last weekend."
  4. "Uses Python more often than JavaScript."

The query embedding pulls back 1 and 4 as nearest. Metadata filtering keeps only communication preferences for drafting.

So the librarian injects memory 1. Now the answer becomes:

  • short bullets
  • action list
  • compact wording

If memory 3 were retrieved instead, that would be a funny but useless miss. This is why chunking and metadata matter.

The filing-cabinet should not be a junk drawer.
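The worked example can be replayed as a small sketch. The four memory texts come from the example; the type labels, the `filter_by_type` and `build_prompt` helpers, and the prompt layout are assumptions added for illustration. The nearest-neighbour step is taken as given (ids 1 and 4) rather than computed.

```python
memories = {
    1: {"text": "Prefers short bullet summaries with a clear action list.",
        "type": "communication_preference"},
    2: {"text": "Works at Northstar Payments on internal platform tools.",
        "type": "profile_fact"},
    3: {"text": "Asked for vegan dinner ideas last weekend.",
        "type": "episodic"},
    4: {"text": "Uses Python more often than JavaScript.",
        "type": "tool_preference"},
}

# Candidate ids from the vector search, as stated in the example.
nearest_ids = [1, 4]

def filter_by_type(ids, allowed_types):
    """Keep only candidates whose memory type suits the current task."""
    return [i for i in ids if memories[i]["type"] in allowed_types]

def build_prompt(user_request, ids):
    """Inject the surviving memories ahead of the user's request."""
    notes = "\n".join(f"- {memories[i]['text']}" for i in ids)
    return f"Known preferences:\n{notes}\n\nRequest: {user_request}"

kept = filter_by_type(nearest_ids, {"communication_preference"})
prompt = build_prompt("Draft this update like you usually do for me.", kept)
print(prompt)
```

Only memory 1 reaches the prompt. Memory 4 was close in vector space but carries the wrong type for a drafting task, so the filter drops it.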

5) Common failure modes

Failure one: chunks are too large. One vector represents many unrelated ideas. Retrieval becomes muddy.

Failure two: chunks are too tiny. You retrieve fragments with no usable context.

Failure three: poor metadata. The librarian cannot filter by tenant or task.

Failure four: stale embeddings. The text changed, but the vector did not.

Failure five: semantic drift. Queries about current work pull ancient but similar text.

So what to do? Use reasonable chunk sizes. Re-embed on content updates. Always filter by identity and sensitivity.

Combine similarity with recency and importance. And keep a plain-text fallback for audits. Vector memory is powerful. It is not magic.
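For the stale-embedding failure, one common guard is to store a hash of the text next to the vector and re-embed only when the hash changes. The sketch below assumes this approach; `MemoryStore`, `upsert`, and `fake_embed` are hypothetical names, and `fake_embed` is a placeholder that only stands in for a real embedding model call.

```python
import hashlib

def content_hash(text):
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def fake_embed(text):
    # Placeholder only: a real system would call an embedding model here.
    return [float(len(text)), float(text.count(" "))]

class MemoryStore:
    def __init__(self, embed):
        self.embed = embed
        self.records = {}  # key -> {"text", "hash", "vector"}

    def upsert(self, key, text):
        """Store text; re-embed only if the content changed. Returns True if embedded."""
        new_hash = content_hash(text)
        record = self.records.get(key)
        if record is not None and record["hash"] == new_hash:
            return False  # text unchanged, existing vector is still valid
        self.records[key] = {
            "text": text,          # plain text kept for audits and re-embedding
            "hash": new_hash,
            "vector": self.embed(text),
        }
        return True

store = MemoryStore(fake_embed)
first = store.upsert("pref_1", "prefers short Python examples")
repeat = store.upsert("pref_1", "prefers short Python examples")
changed = store.upsert("pref_1", "prefers short Rust examples")
print(first, repeat, changed)
```

The text and the vector are written together, so they cannot drift apart, and the stored plain text doubles as the audit fallback mentioned above.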


Where this lives in the wild

  • ChatGPT Memory — OpenAI uses durable memory retrieval so user preferences can reappear across sessions when relevant.
  • Glean Assistant — enterprise employee benefits from semantic retrieval over company knowledge and prior interactions when phrasing a new question differently.
  • Rewind AI — personal knowledge worker relies on semantic search so past notes and meetings can be recalled by meaning rather than exact wording.
  • HubSpot AI — account manager can retrieve prior customer context that matches a new deal question semantically, not only lexically.

Pause and recall

  1. Why is metadata as important as the vector itself?
  2. In the worked score example, why was top similarity not enough by itself?
  3. What is the risk of storing giant conversation chunks as single vectors?
  4. Which component is responsible for choosing among candidate memories?

Interview Q&A

Q: Why use vector memory instead of plain keyword search for long-term recall?

A: Keyword search misses paraphrases and conceptual matches. Vector memory broadens recall to semantically similar memories.

Common wrong answer to avoid: "Because keyword search cannot scale" — keyword search scales fine; the real gap is semantic matching.

Q: Why is top-k nearest neighbour retrieval not enough for production memory systems?

A: Similarity alone ignores identity, recency, importance, and safety. Production retrieval needs filters and reranking.

Common wrong answer to avoid: "Because cosine similarity is mathematically weak" — the issue is missing policy constraints, not just the metric.

Q: Why chunk memories instead of embedding whole transcripts?

A: Retrieval needs focused, reusable units. Whole transcripts mix unrelated concepts and produce noisy matches.

Common wrong answer to avoid: "Because embedding APIs have token limits" — even without limits, huge chunks are semantically messy.

Q: Why keep the original text beside embeddings?

A: Embeddings help find memories, but the original text is needed for injection, auditing, and re-embedding.

Common wrong answer to avoid: "Only because vectors are unreadable" — the deeper reason is provenance and downstream usefulness.

Apply now (5 min)

Exercise: Write four memory chunks from your own recent work. Give each a type and a timestamp. Then write one new query. Choose which two chunks you would retrieve and why.

Sketch from memory: Draw the flow from text to embedding to vector index to retrieved memories. Add metadata boxes around the chunks.


Bridge. Semantic search finds meaning, yes. But sometimes the user does not want "similar". They want "what happened last time" exactly. → 06-episodic-memory.md