14. Vectorless RAG: retrieve by reasoning over structure, not by similarity

~14 min read. Every pattern so far improved the vector pipeline. This one asks whether you needed the vectors at all.

Built on the ELI5 in 00-eli5.md. The contents map (reading a document's table of contents and reasoning about where to look) replaces the similarity machine when the document has structure a human would navigate.

1) The wall: when similarity is not relevance

Basic RAG gave us a useful baseline: embed the query, retrieve a few chunks, and generate from the result. The earlier advanced-RAG tools — the rewriter, the hypothesis, the multi-step plan, the cross-checker, and the confidence gate — add control points before generation. Every one of them still assumes the same substrate: chunk the document, embed the chunks, search by cosine similarity.

The concrete failure here is sharper than any single tuning knob: on a 200-page financial report, the chunk most similar to "what was operating margin in Q4" is often a different quarter's margin sentence, and the qualifier "all amounts in thousands" sits 180 pages away in a chunk that scores near zero. This page follows a navigation trace over a table-of-contents tree so you can see whether the system found the relevant section or merely the similar text.

The tempting repair is more of the same: better embeddings, smaller chunks, a stronger reranker. That helps, and on clean prose it may be enough. It fails on this case: similarity ranks by surface resemblance, but the answer needs the section a human expert would deliberately turn to.

Root cause: the retrieval substrate threw away document structure at chunk time, so nothing downstream can reason over it. Rule: when relevance requires structure and reasoning, retrieve by navigating the document, not by matching vectors.

Mini-FAQ. "What is the control point here?" The contents map is useful only when it creates a real decision: the LLM chooses which node to open next and whether the evidence so far is enough, rather than accepting a fixed top-k.

2) The core visual: read the map, then turn to the page

Consider how an analyst answers a question about a thick report.

They do not scan every paragraph for the one that sounds most like the question.

They open the table of contents.

They reason about which section holds the answer.

They turn to that section and read it.

If it is not enough, they follow a cross-reference and turn again.

Vectorless RAG encodes exactly that.

The document becomes a tree of sections, not a bag of chunks.

The LLM walks the tree by reasoning, not by distance.

question
   │
   ▼
table-of-contents tree
   ├── 1. Business Overview
   ├── 2. Risk Factors
   ├── 3. Financial Statements
   │     ├── 3.1 Income Statement   ← reason: "margin lives here"
   │     └── 3.2 Balance Sheet
   └── 4. Notes
         └── 4.2 Basis of Presentation  ← "amounts in thousands"

No embeddings. No chunking. No vector database.

3) What the contents map is really deciding

Retrieval becomes a sequence of reasoning steps, not a single similarity query.

The index is a hierarchical tree: each node has an id, a title, a short summary, and its children.
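A minimal sketch of that node shape, using the section titles from the diagram above. The class name `Node`, the helper `outline`, the root node, and every summary string are illustrative assumptions, not part of any published schema.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One section of the document tree (field names are illustrative)."""
    id: str
    title: str
    summary: str = ""
    children: list["Node"] = field(default_factory=list)


def outline(node: Node, depth: int = 0) -> list[str]:
    """Render the tree as indented lines: the compact view handed to the LLM."""
    lines = [f"{'  ' * depth}{node.id} {node.title}: {node.summary}"]
    for child in node.children:
        lines.extend(outline(child, depth + 1))
    return lines


# Summaries below are made up for illustration.
report = Node("0", "Annual Report", "whole document", [
    Node("1", "Business Overview", "what the company does"),
    Node("2", "Risk Factors", "known risks"),
    Node("3", "Financial Statements", "reported figures", [
        Node("3.1", "Income Statement", "revenue, costs, margins"),
        Node("3.2", "Balance Sheet", "assets and liabilities"),
    ]),
    Node("4", "Notes", "accounting notes", [
        Node("4.2", "Basis of Presentation", "units and conventions"),
    ]),
])

print("\n".join(outline(report)))
```

Only ids, titles, and summaries go into the outline, so the model can reason over the whole map without reading any section body.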

The LLM is given the tree and the question.

Its job is not to answer yet.

Its job is to choose a path: which section to open, what to extract, and whether to keep going.

PageIndex (Vectify AI, September 2025) named this "vectorless, reasoning-based RAG" and runs it as a five-step loop.

1. read the table of contents      (understand structure)
2. select the most relevant section (reason, don't match)
3. extract information from it
4. sufficiency check  ── not enough ──► back to step 2
5. answer with citations
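The loop above can be sketched as follows. This is not PageIndex's implementation: `navigate`, `stub_select`, and `stub_sufficient` are hypothetical names, and the two stubs replace the LLM calls with fixed keyword rules so the sketch runs offline. The section text is borrowed from the worked example in section 4.

```python
# Hypothetical in-memory document: section id -> (title, text).
SECTIONS = {
    "3.1": ("Income Statement", "Q4 operating margin: 14.0%"),
    "3.2": ("Balance Sheet", "(balance sheet text)"),
    "4.2": ("Basis of Presentation", "All amounts in thousands of USD"),
}
QUESTION = "What was the company's Q4 operating margin, and in what units?"


def stub_select(question, toc, opened):
    """Stand-in for the LLM's section choice: fixed keyword rules, no model call."""
    for keyword, section_id in [("margin", "3.1"), ("units", "4.2")]:
        if keyword in question.lower() and section_id in toc and section_id not in opened:
            return section_id
    return None


def stub_sufficient(question, evidence):
    """Stand-in for the LLM's sufficiency check: a margin figure plus a units note."""
    text = " ".join(span for _, span in evidence).lower()
    return "margin" in text and "amounts in" in text


def navigate(question, sections, select, sufficient, max_steps=5):
    toc = {sid: title for sid, (title, _) in sections.items()}   # step 1: read the map
    opened, evidence = [], []
    for _ in range(max_steps):                                   # hard bound on calls
        section_id = select(question, toc, opened)               # step 2: select
        if section_id is None:
            break
        opened.append(section_id)
        # step 3: extract (here the whole section text stands in for an extracted span)
        evidence.append((section_id, sections[section_id][1]))
        if sufficient(question, evidence):                       # step 4: check
            break
    # step 5: the caller answers from the evidence, citing the section ids
    return {"path": opened, "evidence": evidence,
            "complete": sufficient(question, evidence)}


result = navigate(QUESTION, SECTIONS, stub_select, stub_sufficient)
print(result["path"])
```

Passing `select` and `sufficient` in as functions keeps the control flow testable without a model, and `max_steps` caps the number of sections opened per query.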

Senior systems treat the tree as the durable artifact.

Build it once per document; reuse it for every query.

The cost moves from an embedding index to LLM navigation calls per query.
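Building the tree once and reusing it might look like the sketch below. It assumes the headings have already been extracted as (number, title) pairs with dotted numbering; `build_tree` and the dict layout are illustrative, and node summaries (normally written by a separate LLM pass) are left out.

```python
import json


def build_tree(headings):
    """Nest (number, title) pairs by their dotted numbers, e.g. "3.1" under "3"."""
    root = {"id": "root", "title": "document", "children": []}
    by_id = {"root": root}
    for number, title in headings:
        node = {"id": number, "title": title, "children": []}
        parent_id = number.rsplit(".", 1)[0] if "." in number else "root"
        # A heading whose parent is missing is attached to the root, not dropped.
        by_id.get(parent_id, root)["children"].append(node)
        by_id[number] = node
    return root


HEADINGS = [
    ("1", "Business Overview"),
    ("2", "Risk Factors"),
    ("3", "Financial Statements"),
    ("3.1", "Income Statement"),
    ("3.2", "Balance Sheet"),
    ("4", "Notes"),
    ("4.2", "Basis of Presentation"),
]

tree = build_tree(HEADINGS)
stored = json.dumps(tree)        # persist once at index time
reloaded = json.loads(stored)    # load and reuse for every query
```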

4) The worked example: trace the intermediate state

Question:

"What was the company's Q4 operating margin, and in what units?"

A vector pipeline scores chunks:

chunk: "Q3 operating margin was 18.2% ..."     sim = 0.91  ← top-1, WRONG quarter
chunk: "Q4 operating margin improved to ..."   sim = 0.88
chunk: "All amounts in thousands of USD ..."    sim = 0.07  ← never retrieved

Top-k returns the Q3 sentence and misses the units note entirely.
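The ranking step can be replayed with the three scores listed above. The function name `top_k` is illustrative, and a real index would hold many more chunks between the second and third rows.

```python
# (chunk text, similarity) pairs copied from the listing above.
scored = [
    ("Q3 operating margin was 18.2% ...", 0.91),
    ("Q4 operating margin improved to ...", 0.88),
    ("All amounts in thousands of USD ...", 0.07),
]


def top_k(scored_chunks, k):
    """Return the k chunks with the highest similarity, best first."""
    return sorted(scored_chunks, key=lambda pair: pair[1], reverse=True)[:k]


for text, sim in top_k(scored, 2):
    print(f"{sim:.2f}  {text}")
```

With `k=1` the wrong quarter wins outright. With `k=2` the right quarter appears, but the units note still does not, because nothing in the score reflects that it governs the number.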

A vectorless pass reasons over the tree instead:

ToC reasoning:
  "margin → Income Statement (3.1)"          open 3.1
  extract: "Q4 operating margin: 14.0%"
  sufficiency: units not stated here
  "units → Notes / Basis of Presentation"    open 4.2
  extract: "All amounts in thousands of USD"
  sufficiency: complete → answer

Compare the intermediate states.

The vector pipeline optimized resemblance and returned a confident wrong quarter.

The navigation pipeline optimized relevance and assembled two distant-but-correct sections.

The decision the model made — open 3.1, then 4.2 — is the retrieval trace you can inspect and replay.

5) Failure modes: how the mechanism breaks

Failure one. The tree is bad.

If the table of contents is shallow, mislabeled, or auto-generated wrong, navigation reasons over a broken map.

Failure two. Navigation cost explodes.

Each step is an LLM call; a deep tree plus a chatty sufficiency check can mean many calls per query.

Failure three. No structure to exploit.

A flat blog post, a chat log, or a pile of short tickets has no meaningful hierarchy to navigate.

So what to do?

Invest in tree quality at index time.

Bound the navigation depth and cache the document prefix so per-query calls stay cheap.

And reserve the pattern for documents whose structure is real.
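Investing in tree quality at index time can start with a cheap lint pass. The sketch below is one possible set of checks; `lint_tree`, the dict layout, and the depth threshold are assumptions chosen for illustration.

```python
def lint_tree(root, min_depth=2):
    """Return human-readable warnings about a section tree before it is indexed."""
    warnings = []
    seen_titles = {}   # normalized title -> first node id that used it
    max_depth = 0

    def walk(node, depth):
        nonlocal max_depth
        max_depth = max(max_depth, depth)
        if not node.get("summary", "").strip():
            warnings.append(f"{node['id']}: missing summary")
        title = node["title"].strip().lower()
        if title in seen_titles:
            warnings.append(f"{node['id']}: duplicate title of {seen_titles[title]}")
        seen_titles.setdefault(title, node["id"])
        for child in node.get("children", []):
            walk(child, depth + 1)

    walk(root, 0)
    if max_depth < min_depth:   # hypothetical threshold for "no real hierarchy"
        warnings.append("tree is shallow: little structure to navigate")
    return warnings


flat = {"id": "0", "title": "Report", "summary": "whole document", "children": [
    {"id": "1", "title": "Overview", "summary": "what the company does"},
    {"id": "2", "title": "Overview", "summary": ""},
]}
print(lint_tree(flat))
```

A tree that trips the shallow-tree warning is a hint that the document belongs on a different retrieval path.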

6) Production rules that hold up

Build the tree from the document's own structure (headings, sections, ToC), not from arbitrary splits.

Store a short LLM-written summary on each node so the model can decide without opening every child.

Bound navigation: max depth, max sections opened, a sufficiency check that can stop early.

Log the chosen path, the extracted spans, and the citations.

Cache the document at index time so that summarizing nodes and navigating do not re-pay for the whole file on each call.

And keep a fallback: if navigation finds nothing, drop to lexical or hybrid retrieval rather than answering empty-handed.

This is not separate from evaluation. A navigation path is good only if the section it opened actually contained the answer.
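The bound, the log, and the fallback can be combined in one wrapper. This is a sketch under assumptions: `retrieve` and `lexical_search` are hypothetical names, `navigate` is any callable that returns a list of opened section ids, and the keyword-overlap scorer is a crude stand-in for real lexical or hybrid retrieval.

```python
import json


def lexical_search(question, sections, k=2):
    """Crude keyword-overlap fallback (a stand-in for BM25 or hybrid retrieval)."""
    q = set(question.lower().split())
    scored = [(len(q & set(text.lower().split())), sid)
              for sid, text in sections.items()]
    return [sid for score, sid in sorted(scored, reverse=True)[:k] if score > 0]


def retrieve(question, sections, navigate, max_sections=3):
    path = navigate(question, max_sections)       # bound: cap on sections opened
    source = "navigation"
    if not path:                                  # navigation found nothing
        path, source = lexical_search(question, sections), "lexical-fallback"
    record = {"question": question, "source": source, "path": path,
              "spans": [sections[sid] for sid in path]}
    return record, json.dumps(record)             # second value is the log line


SECTIONS = {
    "3.1": "Q4 operating margin: 14.0%",
    "4.2": "All amounts in thousands of USD",
}
record, log_line = retrieve("units of amounts", SECTIONS,
                            navigate=lambda q, limit: [])
print(log_line)
```

Logging the record as JSON keeps the chosen path and spans replayable, which is what the evaluation step needs.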

7) Why not just keep tuning the vector pipeline under this workload

The plausible alternative is the whole rest of this module: better chunks, hybrid retrieval, a cross-encoder, confidence gates. It is attractive because the vector stack already exists and answers queries over millions of documents in milliseconds.

That tradeoff is correct for large, weakly-structured corpora where fuzzy semantic match is the point. It is wrong when the unit of truth is a long, highly-structured document and the answer needs reasoning over its hierarchy. Under that workload, navigation earns its cost by making retrieval explainable and by finding relevant — not merely similar — sections.

| Option | Works when | Fails when | Cost moves to |
| --- | --- | --- | --- |
| vector retrieval (+ rerank) | huge corpus, fuzzy semantic match, low latency per query | structure matters, exact section needed, retrieval must be auditable | embedding index, vector DB, opaque ranking |
| vectorless navigation | long structured docs, reasoning over hierarchy, explainability required | corpus is millions of flat fragments; no real structure; tight per-query latency | LLM navigation calls, tree quality, per-query reasoning |

Mini-FAQ. "Is this the end of vector databases?" No. It is a second substrate. Pick navigation when documents are structured and answers are reasoned; pick vectors when the corpus is vast and matching is fuzzy. Many systems route between both.

A healthy trace shows the model opening few, correct sections and stopping when evidence is sufficient. The first metric to watch is section-hit rate: did the opened node actually contain the answer? Top-1 similarity is irrelevant here because there is no similarity score — the decision is the path.
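One way to compute section-hit rate from logged traces is sketched below. The exact definition is an assumption: here a query counts as a hit when the opened path covers every labeled gold section, and a second number tracks how many opened sections were actually needed. The trace and gold data are made up.

```python
def section_hit_rate(traces, gold):
    """Fraction of queries whose opened path covers every gold section."""
    hits = sum(1 for q, path in traces.items() if gold[q] <= set(path))
    return hits / len(traces)


def open_precision(traces, gold):
    """Fraction of opened sections that were gold: low values mean wasted calls."""
    opened = sum(len(path) for path in traces.values())
    useful = sum(len(gold[q] & set(path)) for q, path in traces.items())
    return useful / opened if opened else 0.0


# Hypothetical logged paths and hand-labeled gold sections.
traces = {"q1": ["3.1", "4.2"], "q2": ["2"], "q3": ["3.1", "3.2", "4.2"]}
gold = {"q1": {"3.1", "4.2"}, "q2": {"3.2"}, "q3": {"3.1"}}

print(section_hit_rate(traces, gold), open_precision(traces, gold))
```

Tracking both numbers separates two problems: a low hit rate points at section selection, while a low precision with a high hit rate points at a sufficiency check that opens too much.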

The review loop starts with false greens: the model answered after opening a plausible-but-wrong section. Those cases reveal whether the tree summaries are good enough to guide selection.

user complaint
   -> navigation trace (which nodes opened, in what order)
   -> extracted spans + citations
   -> answer / open another section / abstain
   -> false-green review on section selection

Strong fit: long, structured documents — filings, contracts, manuals, standards, textbooks — where a human would use the table of contents. Weak fit: short documents that fit in context anyway, or corpora with no usable hierarchy. Pathology: the model keeps opening sections because it wants an answer, not because the last section revealed a new evidence need.

Scale limit: navigation is LLM calls per query. At millions of queries or millions of tiny documents, vector ANN is cheaper per request. Route navigation to the documents and questions that justify the reasoning cost.
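A router for that decision can be very small. The sketch below is a heuristic under stated assumptions: `DocProfile`, `route`, and both thresholds are hypothetical and would need tuning against real traffic.

```python
from dataclasses import dataclass


@dataclass
class DocProfile:
    pages: int
    heading_depth: int   # deepest heading level found at index time


def route(profile, min_pages=30, min_depth=2):
    """Send long, genuinely structured documents to navigation; the rest to vectors."""
    if profile.pages >= min_pages and profile.heading_depth >= min_depth:
        return "navigation"
    return "vector"


print(route(DocProfile(pages=200, heading_depth=3)))   # long filing with real sections
print(route(DocProfile(pages=2, heading_depth=0)))     # short flat ticket
```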

10) Wrong model: vectorless means embeddings are obsolete

The wrong model says reasoning over trees replaces similarity search everywhere, so you should rip out your vector DB.

The better model is narrower: vectorless retrieval relieves one named pressure — similarity is not relevance on structured documents — and creates one visible decision — which section to open next. Where the corpus is huge and matching is genuinely fuzzy, embeddings still win. The two are substrates to route between, not rivals to the death.

11) Failure taxonomy for vectorless RAG

The tree is auto-built badly, so every navigation reasons over a wrong map.
Node summaries are weak, so the model opens the wrong section confidently.
Navigation depth is unbounded, so cost and latency balloon per query.
The document has no real structure, so the tree is arbitrary and useless.
The sufficiency check never stops, looping over sections like uncalibrated retries.
Citations point to the node but not the exact span, so claims cannot be verified.
The benchmark number is vendor-reported and not reproduced on your own corpus.
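The citation failure in the list above has a simple guard: record character offsets with every citation and verify them against the section text. `Citation`, `cite`, and `verify` are illustrative names, and the section text is invented for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Citation:
    node_id: str
    start: int   # character offsets into the section text
    end: int


def cite(node_id: str, section_text: str, quote: str) -> Optional[Citation]:
    """Build a span-level citation, or None if the quote is not in the section."""
    start = section_text.find(quote)
    if start == -1:
        return None   # refuse to cite text the section does not contain
    return Citation(node_id, start, start + len(quote))


def verify(citation: Citation, section_text: str, quote: str) -> bool:
    """Check that the cited span still matches the quoted claim exactly."""
    return section_text[citation.start:citation.end] == quote


text = "Note 4.2. All amounts in thousands of USD unless stated."   # invented text
c = cite("4.2", text, "All amounts in thousands of USD")
print(c, verify(c, text, "All amounts in thousands of USD"))
```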

12) Pattern transfer: same pressure, different system

Parent-child retrieval has the same shape: structure carries context that flat chunks lose.
Routing has the same shape: choose the substrate that fits the query, do not force one path.
Agentic search has the same shape: a coding agent greps and reads files by reasoning, not by embedding the repo.
Confidence gates have the same shape: the sufficiency check is a gate deciding answer, continue, or abstain.

13) Design review checklist

What exact pressure forced navigation instead of similarity here?
What artifact proves retrieval changed — can you show the opened-node path?
Why is the tuned vector pipeline weaker on this specific workload?
Which metric should improve first — section-hit rate, not similarity?
Which cost rises first — LLM navigation calls per query?
When should the system open another section, fall back to vectors, or abstain?

Where this lives in the wild

Financial document QA — navigate 10-Ks and earnings reports by section instead of matching margin sentences across quarters; PageIndex reports 98.7% on FinanceBench (vendor-reported — reproduce before quoting).
Legal contract analysis — open the indemnity clause by reasoning, not by hoping a chunk embedding lands near the query.
Regulatory and compliance — standards and statutes are deeply structured; navigation mirrors how a compliance officer reads them.
Technical manuals and runbooks — turn to the right procedure by section, with a traceable path for audits.
Medical guidelines — clinicians navigate by structured headings; explainable retrieval matters for trust.
Academic papers and textbooks — answer by opening Methods or a specific chapter, following cross-references.
Tax and accounting research — defined terms and "basis of presentation" notes live far from the numbers they govern.
Insurance policy review — coverage, exclusions, and endorsements are distinct sections a reader navigates deliberately.
Government filings and tenders — long, sectioned, and reference-heavy; similarity scatters, structure holds.
Enterprise knowledge bases — when articles have real hierarchy, navigation beats chunk soup.
Coding assistants — grep-and-read agentic retrieval over a repo is vectorless retrieval by another name.
SQL / structured data copilots — text-to-query is "vectorless" too: the LLM reasons about where data lives, not which vector is near.
Hybrid routers — send structured long-doc questions to navigation and broad fuzzy questions to vector search.
Due-diligence platforms — assemble distant-but-correct sections (units, dates, parties) into one cited answer.
Audit and e-discovery: every retrieved claim must trace to an exact section path, which a logged navigation trace provides directly.

Recall checkpoint

Why does similarity rank the wrong quarter's margin above the units note?
In the worked example, which two sections did navigation assemble, and why?
What is the navigation cost model, and when does it beat or lose to vector ANN?
Which false-green case would you review first for vectorless RAG?
What replaces top-1 similarity as the health metric here?
When would the tuned vector pipeline be the right choice instead?

Interview Q&A

Q: What is vectorless RAG, in one breath? A: Retrieval-augmented generation that finds context by having an LLM reason over a document's structure — typically a table-of-contents tree — instead of embedding chunks and searching by cosine similarity. No embeddings, no chunking, no vector DB.

Common wrong answer to avoid: "It's RAG without a database." — It still retrieves; it just retrieves by navigation, and may still use a normal store for the tree.

Q: Why would you choose it over a tuned vector pipeline? A: When the unit of truth is a long, structured document and the answer needs the relevant section, not the most similar text — and when retrieval must be explainable and auditable.

Common wrong answer to avoid: "Because vectors are obsolete." — At huge scale with fuzzy matching, vector ANN still wins; the two are substrates to route between.

Q: What's the core critique of similarity search it's built on? A: Similarity is not relevance. The nearest embedding can be the wrong quarter, the wrong entity, or miss a governing qualifier sitting far away in the document.

Common wrong answer to avoid: "Embeddings are just inaccurate." — The point is structural: matching surface resemblance is a different objective from reasoning about where the answer lives.

Q: What's the main cost you're taking on? A: LLM navigation calls per query — reading the tree, selecting sections, checking sufficiency. That's slower and costlier per request than an ANN lookup, and it doesn't scale to millions of flat fragments the way vectors do.

Common wrong answer to avoid: "It's cheaper because there's no vector DB." — You trade index cost for per-query reasoning cost; on high QPS that can be more expensive.

Q: What artifact do you inspect when it fails? A: The navigation path — which nodes the model opened, in what order, and the spans it extracted — plus the tree itself. A wrong answer is usually a bad tree or a weak node summary, not a bad embedding.

Common wrong answer to avoid: "Check the similarity scores." — There are none; the decision is the opened-section path.

Q: How do you make it production-safe? A: Build the tree from real structure, store good node summaries, bound navigation depth, cache the document prefix, log the path and citations, and fall back to lexical/hybrid retrieval when navigation finds nothing.

Common wrong answer to avoid: "Just trust the 98.7% benchmark." — That's vendor-reported on FinanceBench; reproduce on your own corpus before relying on it.

Apply now (10 min)

Take one long structured document you know (a filing, a contract, a manual). Sketch its table-of-contents tree and write a question whose answer needs two distant sections.
Sketch from memory: draw the five-step navigation loop and mark where the sufficiency check sends control back.
Reproduce from memory: explain vectorless RAG in five sentences — the pressure, the mechanism, the alternative, the metric, and the boundary.

What you should remember

Vectorless RAG exists because similarity is not relevance on structured documents. Chunk-and-embed throws away the hierarchy a human expert would navigate, so no reranker downstream can reason over it. The mechanism replaces the similarity machine with the contents map: an LLM reads a table-of-contents tree and reasons about which section to open, then whether the evidence is enough.

The artifact to inspect is the navigation path: the opened-node sequence and extracted spans. If that path does not explain why one section was chosen over another, suspect the tree or its summaries before the model.

Remember:

Retrieve by navigating structure when relevance needs reasoning, not by matching vectors.
The decision is which section to open next and whether evidence is sufficient — log it.
The cost moves to LLM navigation calls per query; bound depth and cache the document.
Section-hit rate replaces top-1 similarity as the health metric.
It is a second substrate, not the death of embeddings — route between them by workload.

Bridge. Navigation, routing, and confidence gates all sharpen retrieval. But sharper control does not make missing truth, hard reasoning, or bad source data disappear. The last thing to admit is what even advanced RAG still cannot solve cleanly. → 15-honest-admission.md