
02. Open-book answering — what changes when the model sees the page

~10 min read. Same model. Different desk. Different answer.

Builds on the ELI5 in 00-eli5.md and the failure in 01-confident-wrong-answer.md. The librarian, the bookshelf, the reading desk, the answer brief — all four placeholders reappear here, this time wired together.


A new hire's question that lives in three different systems

A new hire, Priya, walks up to your support bot and asks one question: "Can contractors push code to the staging branch on a Saturday?" The closed-book model from the last chapter has no chance. Your contractor policy is a private wiki page, your branch protection rules live in a GitHub repo, and your on-call calendar lives in PagerDuty. None of that text is in the weights, so the model does what it did before: it invents a clean-sounding answer. We saw that trap in chapter 01.

Now imagine a different setup. Before the model speaks, the librarian fetches three documents: the contractor access doc, the branch-protection page, and the weekend-deploy policy. All three land on the reading desk, and only then does the model write the reply. Same model, same weights, same prompt template — but the answer is now grounded in pages that exist.

That is the entire shift you must internalize on this page. Not a smarter model. A better-prepared desk.


The open-book exam, made literal

Picture two students taking the same paper. Student A walks in with nothing and must answer from memory alone; if a fact is stale or private, she guesses, and the guess sounds confident because her writing is fluent. Fluency is not the same as correctness — that gap is the whole story.

Student B walks in with the relevant textbook open beside her, and the proctor has even bookmarked the right chapter. She still has to read, reason, and write, but she is no longer guessing what the rule says. She is paraphrasing what is in front of her.

The two students share the same brain. Only one has paper on her desk, and that single change rewrites her error profile.

A closed-book LLM is Student A on every question. A RAG system is the proctor — the librarian plus the bookshelf plus the reading desk — who places the right pages in front of Student B before she starts writing. This page is about that proctor's job.

Mini-FAQ. "Is open-book just a metaphor, or is it literal?" It is literal at the prompt level. The retrieved chunks are pasted into the prompt as visible text. The model reads them with the same attention it gives the user question. There is no special "evidence mode", just text on the desk.


What the open book actually changes

Three things shift the moment evidence enters the prompt.

Source of truth shifts. A closed-book model treats its weights as the world; an open-book model treats the retrieved chunks as the world. The instruction "answer only from the context below" is the contract that flips the source — without that line in the answer brief, the model drifts back to weights.

Failure mode shifts. Without RAG the failure is hallucination from missing knowledge; with RAG the failure becomes wrong retrieval or unfaithful generation. Different bug, different fix — you stop debugging the model and start debugging the desk.

Freshness shifts. Weights are frozen the day training stops, but the bookshelf can be reindexed tonight. A retrieved chunk written this morning beats a parameter learned last year, which is why teams stop chasing fine-tuning and start chasing better chunks.

            closed-book                       open-book (RAG)
         ┌──────────────┐                ┌──────────────────────┐
question │ model weights│       question │  the librarian       │
   ──▶   │   only       │   ──▶          │  fetches chunks      │
         └──────┬───────┘                └──────────┬───────────┘
                │                                   │
                ▼                                   ▼
       fluent continuation              chunks on the reading desk
                │                                   │
                ▼                                   ▼
       maybe correct                    the answer brief is built
       often invented                               │
                                                    ▼
                                          model answers from desk

The diagram has two columns for a reason. The left column is what the model can do alone. The right column is what the system does around the model. RAG is the right column.


The running example — Priya's contractor question

Let us stay with one query for the whole page.

"Can contractors push code to the staging branch on a Saturday?"

Three documents exist in your company.

  • Doc 1 — contractor_access.md: Contractors get repo write access only with a sponsoring full-time engineer. Access expires 90 days after contract end.
  • Doc 2 — branch_protection.md: The staging branch requires two reviewers from the platform team. No direct pushes.
  • Doc 3 — weekend_deploys.md: Weekend deploys to staging or production are blocked unless on-call approves in PagerDuty.

A closed-book model sees only the question. It might say "Yes, contractors can push if they have permission." That sentence is generic. It also misses two of three real constraints.

An open-book model sees the question plus the three chunks. It can now write a grounded answer. "No direct pushes are allowed. A contractor would need a sponsoring engineer, two platform reviewers, and on-call approval through PagerDuty for any Saturday change." Three facts. Three sources. Zero invention.

That is the lift open-book gives you. Hold this example. We will return to it.


The minimal pipeline, named once

Open-book answering needs only three verbs. Retrieve. Augment. Generate. You will hear about embeddings, reranking, hybrid search, and citation mapping later. Today, just the three verbs.

        user question
              │
              ▼
     ┌────────────────┐
     │  RETRIEVE      │  the librarian walks to the bookshelf
     │  find chunks   │  and pulls the most relevant ones
     └────────┬───────┘
              │
              ▼
     ┌────────────────┐
     │  AUGMENT       │  the chunks land on the reading desk
     │  build the     │  the answer brief is written:
     │  brief         │  question + chunks + rules
     └────────┬───────┘
              │
              ▼
     ┌────────────────┐
     │  GENERATE      │  the model writes the reply
     │  write answer  │  using only what is on the desk
     └────────────────┘

Each verb has a job. Retrieve brings evidence near. Augment arranges it on the desk in a useful shape. Generate turns that arrangement into prose. A failure in any verb produces a different bad answer. We will spend the rest of the module on each verb's internals.
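
The three verbs can be sketched as three small functions. This is a toy under stated assumptions, not a production design: the word-overlap scorer stands in for real search, the `lunch_menu.md` distractor and the `STOPWORDS` list are made up for illustration, and `llm` is any callable you supply.

```python
import re
from dataclasses import dataclass
from typing import Callable

STOPWORDS = {"a", "the", "to", "on", "in", "or", "can", "with", "from", "are"}


@dataclass
class Chunk:
    source: str
    text: str


# Toy bookshelf: the three docs from the running example plus one hypothetical distractor.
BOOKSHELF = [
    Chunk("contractor_access.md", "Contractors get repo write access only with a sponsoring full-time engineer."),
    Chunk("branch_protection.md", "The staging branch requires two reviewers from the platform team. No direct pushes."),
    Chunk("weekend_deploys.md", "Weekend deploys to staging or production are blocked unless on-call approves in PagerDuty."),
    Chunk("lunch_menu.md", "The cafeteria serves pasta every Friday."),
]


def words(text: str) -> set[str]:
    """Lowercased content words, with a few stopwords removed."""
    return set(re.findall(r"[a-z]+", text.lower())) - STOPWORDS


def retrieve(question: str, shelf: list[Chunk], k: int = 3) -> list[Chunk]:
    """RETRIEVE: rank by word overlap with the question, a stand-in for real search."""
    q = words(question)
    return sorted(shelf, key=lambda c: len(q & words(c.text)), reverse=True)[:k]


def augment(question: str, chunks: list[Chunk]) -> str:
    """AUGMENT: rule first, evidence next, question last."""
    rule = "Answer only from the context below. If it is insufficient, say so."
    context = "\n".join(f"[{c.source}] {c.text}" for c in chunks)
    return f"{rule}\n\nContext:\n{context}\n\nQuestion: {question}"


def generate(brief: str, llm: Callable[[str], str]) -> str:
    """GENERATE: hand the brief to whatever model client you use."""
    return llm(brief)


question = "Can contractors push code to the staging branch on a Saturday?"
desk = retrieve(question, BOOKSHELF)
brief = augment(question, desk)
reply = generate(brief, llm=lambda b: f"(stub model saw {len(b)} characters)")
print([c.source for c in desk])
```

Keeping the verbs as separate functions means each one can be logged and tested on its own, which matters once you start asking which verb produced a bad answer.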

Mini-FAQ. "Is augment just string concatenation?" No. Augment is structured. It decides the order of chunks, the system rule ("answer only from these"), where the question sits, and what metadata to attach for citation. The gap between a naive "\n".join(chunks) and a structured brief is often the gap between a toy demo and a system that stays grounded.
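
To make the contrast concrete, here is a minimal sketch of both styles. The exact rule wording, the `[n]` numbering, and the dictionary keys `source` and `text` are illustrative choices, not a standard.

```python
def naive_brief(question: str, chunks: list[dict]) -> str:
    # What the FAQ warns against: no rule, no labels, no clear boundaries.
    return "\n".join(c["text"] for c in chunks) + "\n" + question


def structured_brief(question: str, chunks: list[dict]) -> str:
    # Rule first, numbered and labeled evidence next, question last.
    lines = [
        "Answer only from the numbered context below.",
        "Cite the number of every chunk you rely on, like [1].",
        "If the context does not answer the question, say so.",
        "",
        "Context:",
    ]
    for i, c in enumerate(chunks, start=1):
        lines.append(f"[{i}] (source: {c['source']}) {c['text']}")
    lines += ["", f"Question: {question}"]
    return "\n".join(lines)


chunks = [
    {"source": "branch_protection.md", "text": "The staging branch requires two reviewers. No direct pushes."},
    {"source": "weekend_deploys.md", "text": "Weekend deploys are blocked unless on-call approves in PagerDuty."},
]
print(structured_brief("Can contractors push to staging on a Saturday?", chunks))
```

The numbered labels are what later let you map a claim in the answer back to a source page.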


What "evidence at query time" actually buys you

Closed-book systems share four limits. Open-book answering chips away at each of them.

Limit 1 — staleness. Weights freeze at the training cutoff. A retrieved chunk reflects the document as of this morning's index build. Update the source, reindex, the answer changes. No retraining.

Limit 2 — privacy. Public weights have never seen your wiki or your CRM. A retriever points at your private store. The model reads your data without ever being trained on it. That is also why permission-aware retrieval matters — the librarian must respect ACLs.

Limit 3 — verifiability. A closed-book answer cannot be checked against a source. An open-book answer can. Every claim should point back to a chunk. That is what makes RAG products like Perplexity feel trustworthy.

Limit 4 — abstention. A closed-book model rarely says "I don't know." An open-book model with the right system rule can. If no chunks score high enough, the brief says "insufficient evidence — refuse." That refusal is a feature, not a bug.

Hold this in your head. Open-book is the path where the system can say "I do not have that information" for a checkable reason: nothing relevant reached the desk. Weights alone give no such signal. Evidence does.
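
A refusal gate can be sketched in a few lines. Treat the threshold value, the `(score, text)` input shape, and the refusal wording as assumptions; a real system would tune the threshold on its own eval set.

```python
from typing import Callable

REFUSAL = "I do not have that information."
MIN_SCORE = 0.5  # hypothetical threshold, not a recommended value


def answer_or_abstain(
    question: str,
    scored_chunks: list[tuple[float, str]],
    llm: Callable[[str], str],
    min_score: float = MIN_SCORE,
) -> str:
    """Refuse when no chunk clears the relevance threshold; otherwise call the model."""
    evidence = [text for score, text in scored_chunks if score >= min_score]
    if not evidence:
        return REFUSAL  # empty desk: do not let the model improvise
    context = "\n".join(evidence)
    brief = f"Answer only from the context below.\n\nContext:\n{context}\n\nQuestion: {question}"
    return llm(brief)
```

Note that the model is never called when the desk is empty, so the refusal does not depend on the model noticing its own ignorance.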


Predict the new failures before you read them

You have seen the closed-book trap and the open-book fix. Before reading the failure-modes section, predict on paper.

  • Name the three verbs of the minimal pipeline.
  • Name one failure that exists in open-book systems but not in closed-book ones.
  • Name one failure that exists in both.

Now continue.


New failures the open book introduces

Open-book is not magic. It trades one set of failures for another. A serious engineer names both sets.

Wrong-chunk retrieval. The librarian fetches three chunks. None of them is the contractor doc. The model now has irrelevant evidence on the desk. Either it improvises (bad) or it abstains (better). Either way, the question is unanswered.

Right chunk, wrong rank. The contractor doc is retrieved but ranked sixth. Top-three selection drops it. The model writes from weaker chunks. This is why reranking exists. Foreshadowing — chapter 10.
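
One way to tell these first two failures apart is to log where the expected document landed in the ranking. A small sketch; the ranked list below is invented to match the "ranked sixth" scenario, and the file names other than the three running-example docs are hypothetical.

```python
def retrieval_verdict(expected: str, ranked_sources: list[str], k: int = 3) -> str:
    """Say whether the expected document was missed, cut off by top-k, or delivered."""
    if expected not in ranked_sources:
        return "missing: wrong-chunk retrieval"
    rank = ranked_sources.index(expected) + 1  # 1-based rank
    if rank > k:
        return f"rank {rank} > top-{k}: right chunk, wrong rank"
    return f"rank {rank}: on the desk"


# Hypothetical ranking in which the contractor doc lands sixth.
ranked = ["faq.md", "branch_protection.md", "weekend_deploys.md",
          "onboarding.md", "vpn_setup.md", "contractor_access.md"]
print(retrieval_verdict("contractor_access.md", ranked))
# prints: rank 6 > top-3: right chunk, wrong rank
```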

Right chunks, ignored by model. The chunks are perfect. The model still writes from its training memory because the brief did not say "only from this context." Style wins over evidence. Fix the brief, not the model.

Contradicting chunks. Doc 1 says contractors can push with sponsorship. Doc 2 says no direct pushes at all. Without a rule for handling conflict, the model picks one and sounds confident. Production systems either surface the conflict to the user or rank by source authority.
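
Ranking by source authority can be as simple as a lookup table plus one extra rule in the brief. The scores and the rule wording below are made up for illustration; which document outranks which is a policy decision for your organization.

```python
# Hypothetical authority scores: higher wins when documents disagree.
AUTHORITY = {"branch_protection.md": 3, "weekend_deploys.md": 2, "contractor_access.md": 1}

CONFLICT_RULE = (
    "Chunks are listed from most to least authoritative. "
    "If two chunks disagree, follow the earlier one and tell the user about the conflict."
)


def order_by_authority(chunks: list[tuple[str, str]]) -> list[tuple[str, str]]:
    """Sort (source, text) pairs so the most authoritative source comes first."""
    return sorted(chunks, key=lambda c: AUTHORITY.get(c[0], 0), reverse=True)


desk = order_by_authority([
    ("contractor_access.md", "Contractors get write access with a sponsoring engineer."),
    ("branch_protection.md", "No direct pushes to staging."),
])
print([source for source, _ in desk])
```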

Stale chunks. The bookshelf was last reindexed three months ago. The policy changed last week. The model answers with stale truth. This is freshness debt, and it shows up as a confident wrong answer that looks like a hallucination.
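
Freshness debt is cheap to measure if you record two timestamps per source. A minimal sketch, assuming you can obtain a last-indexed time and a last-modified time for each document; the dates in the example are invented.

```python
from datetime import datetime


def stale_sources(indexed_at: dict[str, datetime], modified_at: dict[str, datetime]) -> list[str]:
    """Return sources edited after their last index build, or never indexed at all."""
    return sorted(
        source
        for source, modified in modified_at.items()
        if source not in indexed_at or modified > indexed_at[source]
    )


indexed_at = {
    "weekend_deploys.md": datetime(2024, 1, 1),
    "branch_protection.md": datetime(2024, 1, 1),
}
modified_at = {
    "weekend_deploys.md": datetime(2024, 3, 25),    # edited after the index build
    "branch_protection.md": datetime(2023, 12, 1),  # unchanged since indexing
    "new_policy.md": datetime(2024, 3, 1),          # never indexed
}
print(stale_sources(indexed_at, modified_at))
```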

Over-stuffed desk. You put twenty chunks on the desk to be safe. The model gets lost in the middle. Quality drops. The reading desk has a budget for a reason.
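
The desk budget can be enforced when chunks are selected. The sketch below counts whitespace-separated words as a rough stand-in for tokens, which is an assumption; a real system would use the tokenizer of its model.

```python
def pack_desk(ranked_chunks: list[str], budget_words: int) -> list[str]:
    """Take chunks in rank order until the next one would exceed the budget."""
    desk: list[str] = []
    used = 0
    for chunk in ranked_chunks:
        cost = len(chunk.split())  # crude proxy for token count
        if used + cost > budget_words:
            break  # stop here so lower-ranked chunks never jump the queue
        desk.append(chunk)
        used += cost
    return desk
```

Stopping at the first chunk that does not fit keeps the rank order intact; skipping it and continuing would fill the desk more tightly but let weaker chunks in.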

See the pattern. Closed-book failures live inside the model. Open-book failures live in the chain feeding the model. That is why RAG is systems engineering and not prompt engineering.

Mini-FAQ. "If retrieval is wrong, will a smarter model save me?" Mostly no. A stronger model can write more fluent nonsense from the same wrong chunks. The cap on answer quality is set by what reaches the desk. Upgrading the writer does not change what is on the page in front of her.

Mini-FAQ. "Why not just stuff every document into the prompt and skip retrieval?" Two reasons. First, context windows are bounded, and even a million-token window costs more per call, responds more slowly, and tends to lose facts buried in the middle of a long context. Second, legal, compliance, and tenancy rules forbid showing the model documents the user is not allowed to see. Retrieval is also access control.
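
"Retrieval is also access control" can be sketched as a filter that runs before any ranking. The group names and the `allowed_groups` field are hypothetical; real ACL models vary by document store.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    source: str
    text: str
    allowed_groups: frozenset[str]


def visible_chunks(user_groups: set[str], shelf: list[Chunk]) -> list[Chunk]:
    """Keep only chunks the user may read; run this before ranking, not after."""
    return [c for c in shelf if c.allowed_groups & user_groups]


shelf = [
    Chunk("branch_protection.md", "No direct pushes.", frozenset({"engineering", "contractors"})),
    Chunk("salary_bands.md", "Confidential.", frozenset({"hr"})),
]
print([c.source for c in visible_chunks({"contractors"}, shelf)])
# prints: ['branch_protection.md']
```

Filtering first means a forbidden chunk can never reach the desk, no matter how well it matches the question.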


The diagnostic shift: from "why did the model say that?" to "why did those chunks reach the desk?"

Closed-book systems fail at the model; open-book systems fail at the desk. Good RAG engineers stop asking "why did the model say that?" and start asking "why did those chunks reach the desk?" If you internalize one sentence from this page, take that one.


A worked pass through Priya's question

Walk the minimal pipeline once, end to end.

question: "Can contractors push code to the staging branch on a Saturday?"
                                │
                                ▼
                       ┌──────────────────┐
                       │ the librarian    │
                       │ searches         │  RETRIEVE
                       └────────┬─────────┘
        ┌───────────────────────┼────────────────────────┐
        ▼                       ▼                        ▼
   contractor_access.md    branch_protection.md     weekend_deploys.md
   (sponsorship, 90d)      (2 reviewers required)   (PagerDuty approval)
        │                       │                        │
        └───────────────────────┼────────────────────────┘
                                ▼
                       ┌──────────────────┐
                       │ reading desk     │  AUGMENT
                       │ + system rule:   │
                       │ "answer only     │
                       │  from these"     │
                       └────────┬─────────┘
                                │
                                ▼
                       ┌──────────────────┐
                       │ model writes     │  GENERATE
                       │ a grounded       │
                       │ answer with      │
                       │ three citations  │
                       └──────────────────┘

The answer Priya receives names three rules, not one. Each rule points back to a doc. If she clicks the citation, she lands on the source page. She trusts the answer because she can verify it. That trust loop is what every grounded-answer product is selling.
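
The trust loop only holds if every citation points at something that was actually on the desk. A minimal check, assuming the answer marks citations as bracketed numbers like `[2]`; that marker format is an illustrative convention, not something the pipeline above mandates.

```python
import re


def check_citations(answer: str, desk: dict[int, str]) -> dict[str, list]:
    """Compare cited chunk numbers against the chunks that were on the desk."""
    cited = {int(n) for n in re.findall(r"\[(\d+)\]", answer)}
    on_desk = set(desk)
    return {
        "unknown": sorted(cited - on_desk),  # cited but never retrieved
        "uncited_sources": [desk[i] for i in sorted(on_desk - cited)],  # retrieved but unused
    }


desk = {1: "contractor_access.md", 2: "branch_protection.md", 3: "weekend_deploys.md"}
answer = (
    "No direct pushes are allowed [2]. A contractor would need a sponsoring "
    "engineer [1], two platform reviewers [2], and on-call approval through PagerDuty [3]."
)
print(check_citations(answer, desk))
# prints: {'unknown': [], 'uncited_sources': []}
```

This checks only that the references resolve, not that each cited chunk supports the claim next to it; that second check is a faithfulness evaluation and needs more than a regex.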


Open-book answering in shipped products

Open-book answering shows up across very different stacks. The shape is constant: pull evidence, build a brief, generate with citations.

  • Perplexity AI — web research with span-level citations on every claim.
  • Glean — enterprise search across SaaS apps with ACL-aware retrieval.
  • Notion AI Q&A — answers grounded in your workspace pages and databases.
  • Google NotebookLM — strict source-grounded QA over user-uploaded PDFs and notes.
  • GitHub Copilot Chat — repository context retrieved per query before code suggestions.
  • Cursor — codebase-aware coding agent with file-level retrieval into the prompt.
  • Windsurf (Codeium) — multi-repo retrieval for code completions and chat.
  • Anthropic Claude Projects — user-uploaded corpus indexed and retrieved per turn.
  • OpenAI ChatGPT with Connectors — third-party data (Drive, Box, Slack) retrieved into the answer.
  • Microsoft Copilot for Microsoft 365 — Graph-aware retrieval across Outlook, Word, Excel, Teams.
  • Slack AI — channel-history retrieval for thread summaries and search answers.
  • Intercom Fin — support bot grounded in help-center articles with citation back to source.
  • Zendesk AI agents — KB retrieval for ticket deflection with answer attribution.
  • Salesforce Einstein Copilot — CRM-grounded answers using customer records as evidence.
  • HubSpot Breeze — RAG over CRM, knowledge base, and marketing content.
  • Harvey — legal RAG over filings, contracts, and case law for law firms.
  • Casetext CoCounsel — legal research grounded in retrieved case law and statutes.
  • Hebbia — financial-document RAG for analysts at hedge funds and banks.
  • Cohere Coral / Command R — packaged retrieval and grounded generation for enterprises.
  • Amazon Q Business — AWS enterprise assistant retrieving across S3, SharePoint, Confluence.
  • Vectara — RAG-as-a-service with built-in faithfulness scoring.
  • You.com / Andi / Phind — consumer web-grounded chat with per-claim citations.
  • Mendable / Inkeep — docs-grounded chat embedded in developer portals.
  • Pinecone Assistant — managed end-to-end open-book stack over your corpus.
  • Elastic AI Assistant — observability and security questions answered from indexed logs and docs.

The names differ. The desk-and-brief structure does not.


Numbers, qualified

Some rough figures, only as orientation. Always qualify by stack.

  • Latency to retrieve a small chunk set is usually 10–80 ms on a vector store such as Pinecone (a managed service) or pgvector (a Postgres extension), longer on cold indexes.
  • Cost of retrieval is dominated by the embedding call and the vector read. At 1M queries/day, retrieval is typically pennies per thousand queries; generation dominates total spend.
  • Faithfulness lift from a good RAG pipeline over a closed-book baseline is usually large on private data. Vendors such as Glean, Vectara, and Cohere have reported accuracy gains on domain QA, but the size of the gain depends heavily on the eval set, so measure it on your own data.

Do not memorize these. Recompute for your stack before quoting them.


Recall — can you retell the desk-shift cold?

  1. What single change separates a closed-book answer from an open-book answer?
  2. Name the three verbs of the minimal pipeline.
  3. Why does fluent language not imply correct evidence?
  4. Name one failure mode that exists in open-book but not in closed-book systems.
  5. Why does upgrading the LLM not fix a broken retrieval stage?
  6. What does the system rule "answer only from the provided context" actually buy you?
  7. Why is abstention easier in open-book systems than in closed-book ones?
  8. In Priya's example, which three documents must reach the desk for a complete answer?

Interview Q&A

Q1. In one sentence, what does RAG change about how an LLM answers?
A. RAG supplies relevant evidence at query time, so the model writes from documents on the desk instead of guessing from frozen weights.
Common wrong answer to avoid: "RAG fine-tunes the model on your documents."

Q2. Why is fine-tuning not the same as open-book answering?
A. Fine-tuning bakes patterns into weights at training time; it does not fetch fresh documents at query time, cannot easily forget revoked content, and still hallucinates when unsure. RAG separates the corpus from the model.
Common wrong answer to avoid: "Fine-tuning and RAG are interchangeable approaches."

Q3. If retrieval returns the right chunks but the model still ignores them, where is the bug?
A. The augment stage: the brief failed to instruct the model to answer only from the provided context, or the chunks were ordered or formatted so the model under-weighted them.
Common wrong answer to avoid: "The model is too small; upgrade it."

Q4. Why does open-book answering make abstention practical?
A. The system can score retrieval relevance, and below a threshold it instructs the model to refuse. Without retrieval, the model has no objective signal that it lacks evidence, so it interpolates.
Common wrong answer to avoid: "The LLM naturally knows when it doesn't know."

Q5. What new failure modes does RAG introduce that a closed-book LLM does not have?
A. Wrong-chunk retrieval, right-chunk-wrong-rank, ignored-evidence generation, contradicting chunks, stale-index answers, and over-stuffed context lost-in-the-middle effects.
Common wrong answer to avoid: "RAG only adds upside, no new failures."

Q6. If a RAG system gives a confidently wrong answer, what is the first thing you inspect?
A. The chunks that reached the prompt. Read them before blaming the generator. If they are wrong or missing, the failure is upstream in retrieval, ranking, or selection. If they are correct, the failure is in the brief or the generation.
Common wrong answer to avoid: "Switch to a stronger LLM and rerun."

Q7. Why is the open-book setup also a privacy and compliance win, not only an accuracy win?
A. Private data stays in the retrieval store, not in the model weights. Documents can be permissioned per user, revoked, or deleted, and the next query reflects the change. Weights cannot do selective forgetting reliably.
Common wrong answer to avoid: "Open-book and closed-book have the same privacy profile."

Q8. A teammate says "we will just put the whole knowledge base in the prompt, no retrieval needed." How do you respond?
A. Even with million-token context windows, you pay per token, latency grows with length, attention degrades in the middle, and you cannot enforce per-user document permissions. Retrieval is also access control and cost control, not just a token trick.
Common wrong answer to avoid: "Yes, long-context models make RAG obsolete."


Apply now (10 min)

Step 1 — model the exercise. Here is how I would trace Priya's question through the minimal pipeline.

| Verb | Input | Output | One failure I would log |
|---|---|---|---|
| Retrieve | "Can contractors push to staging on Saturday?" | Top-5 chunks across access, branch, calendar docs | Did the contractor doc make the top-5? |
| Augment | 3 selected chunks + system rule | A brief with question last and rule first | Did the brief explicitly say "only from context"? |
| Generate | Final brief | Answer text + chunk citations | Did the answer cite all three relevant docs? |

Step 2 — your turn. Pick a real question from your product or workplace. Pick one that a closed-book LLM would mishandle. Write the three rows above for that question. Name the documents that should reach the desk. Name one failure mode per verb that would produce a wrong answer.

Step 3 — sketch from memory. Draw the three-verb pipeline. Mark where the librarian, the bookshelf, the reading desk, and the answer brief appear. If you can do this cold, you own the open-book mental model.


What you should remember

This chapter explained the single shift that turns a confident-wrong chatbot into a grounded one: place evidence on the desk before the model writes. Same weights, same model. What changes is the prompt the model sees — three retrieved chunks plus a system rule that says "answer only from these". The retrieved text becomes the new source of truth.

You learned the three verbs that make this work — retrieve, augment, generate — and saw Priya's contractor question move through each one with three sources naming three rules. You also learned that the open book introduces its own failures: wrong-chunk retrieval, right-chunk-wrong-rank, ignored evidence, contradicting chunks, stale indexes, over-stuffed desks. Each failure lives in the chain feeding the model, not in the model itself.

Carry this diagnostic forward: when a RAG answer looks wrong, read the chunks first. If the right chunks did not reach the desk, the bug is in retrieval or ranking. If they reached the desk but the model ignored them, the bug is in the brief. Only when both are clean is it worth blaming the generator.
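
That order of inspection can be written down as a small triage helper. This is a sketch of the reasoning only: it assumes you already know which documents should have reached the desk, and `brief_has_rule` is a flag you would compute yourself.

```python
def triage(expected: set[str], on_desk: set[str], brief_has_rule: bool) -> str:
    """Point at the first stage that looks broken, checking the desk before the model."""
    missing = expected - on_desk
    if missing:
        return "retrieval or ranking: missing " + ", ".join(sorted(missing))
    if not brief_has_rule:
        return "brief: grounding rule absent"
    return "generator: chunks and brief look clean"


expected = {"contractor_access.md", "branch_protection.md", "weekend_deploys.md"}
print(triage(expected, {"branch_protection.md", "weekend_deploys.md"}, brief_has_rule=True))
# prints: retrieval or ranking: missing contractor_access.md
```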

Remember:

  • The shift is structural: same model, prompt now contains evidence. The contract "answer only from this context" is what flips the source of truth.
  • Retrieve, augment, generate. Each verb has its own failure mode and own debug surface.
  • Open-book answering moves the failure from the model to the desk. RAG is systems engineering, not prompt engineering.
  • Abstention only works when retrieval scores are visible. Below threshold, instruct refusal — that refusal is a feature.
  • Long-context windows do not replace retrieval. Cost, attention degradation, and per-user permissions still demand a librarian.

Bridge. The bookshelf does not hold whole documents. It holds chunks. How big should each chunk be, and where should it break? Too big and the desk gets clogged. Too small and meaning falls apart between pieces. That tradeoff is the next thing to get right.

03-chunking-tradeoffs.md