11. Prompt augmentation — building the answer brief¶

~12 min read. Retrieval is done. The chunks are ranked. Now you must hand them to the model. Bad handoff, bad answer — even with perfect chunks.

Follows on from 10-reranking.md. The answer brief from the ELI5 is exactly what we are building here — the structured packet the writer reads before composing a word.

Hook — two prompts, one query, two answers¶

User asks: "When can an enterprise customer get a refund, and who approves it?"

The reranker has done its job. Three strong chunks are sitting on the reading desk. Now two engineers build the prompt in two different ways.

Engineer A — the bad brief.

Enterprise annual plans may request a refund within 30 days of renewal.
Refunds are not available after the first 5,000 API calls.
Enterprise support can verify eligibility but cannot approve refunds.
When can an enterprise customer get a refund, and who approves it?

Three chunks. One question. No rules. No source labels. No abstention clause. The model writes:

"Enterprise customers can request a refund any time within the first year. Refunds are processed by support agents within 5,000 business days."

Confidently wrong. Mixed-up numbers. Invented roles. No way to trace the lie.

Engineer B — the structured brief.

[SYSTEM]
You are a support analyst. Answer ONLY from the evidence below.
Cite chunk IDs in square brackets after every claim.
If evidence is missing or conflicting, reply: "I do not know from the provided sources."

[EVIDENCE]
[chunk_014 | refund_policy_q3] Enterprise annual plans may request a refund
within 30 days of renewal. Requests must be submitted by the billing admin.

[chunk_022 | refund_exceptions] Refunds are not available after the first
5,000 API calls in the renewal term.

[chunk_031 | enterprise_sla] Enterprise support can verify eligibility, but
final approval remains with billing operations.

[QUESTION]
When can an enterprise customer get a refund, and who approves it?

[RULES]
Answer in one paragraph. Cite every fact. Refuse if evidence is insufficient.

Same chunks. Different brief. The model writes:

"An enterprise customer may request a refund within 30 days of renewal [chunk_014], but only if they have not yet made 5,000 API calls in that term [chunk_022]. Final approval rests with billing operations, not support [chunk_031]."

Same evidence. Same model. Different answer. The brief did the work.

This chapter is about that brief.

The metaphor — the answer brief¶

Recall the ELI5. The librarian, the bookshelf, the reading desk, the answer brief. We have reached the last placeholder.

The answer brief is not a prompt. It is a working sheet. It tells the writer who they are. It lays out the evidence on the desk. It states the question. It names the rules of engagement. Then the writer writes.

A weak brief is like handing a journalist three news clippings and saying "write something." A strong brief is like handing them the same clippings with a slug, a deadline, a word count, a citation style, and a refusal clause for unverified claims.

Same clippings. Different newspaper.

The five blocks of a good brief¶

Every well-built augmented prompt has five blocks, in this order:

┌─────────────────────────────────────────────────────────┐
│ 1. SYSTEM ROLE                                          │  ← who the writer is
│    "You are a support analyst..."                       │
├─────────────────────────────────────────────────────────┤
│ 2. GROUND RULES                                         │  ← non-negotiables
│    "Answer only from evidence. Cite. Refuse if unsure." │
├─────────────────────────────────────────────────────────┤
│ 3. EVIDENCE BLOCKS                                      │  ← the desk contents
│    [chunk_014 | doc=refund_policy_q3] ...               │
│    [chunk_022 | doc=refund_exceptions] ...              │
│    [chunk_031 | doc=enterprise_sla] ...                 │
├─────────────────────────────────────────────────────────┤
│ 4. THE QUESTION                                         │  ← the task
│    "When can an enterprise customer get a refund?"      │
├─────────────────────────────────────────────────────────┤
│ 5. ANSWER FORMAT + REPEATED RULES                       │  ← end-of-prompt anchor
│    "One paragraph. Cite every claim. Refuse if needed." │
└─────────────────────────────────────────────────────────┘

Five blocks. Each block has one job. Drop a block and you create a failure mode. We will see them in the failure-modes section below.

Mini-FAQ. "Why include source IDs in the evidence block?" Two reasons. First, the model can echo them in the answer — that is how citations get traced back to chunks. Second, when something goes wrong, you can replay the exact prompt and see which chunk lied. No IDs, no traceability, no debugging.
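The five blocks can be assembled mechanically. The sketch below is one minimal way to do it, not a reference implementation: `Chunk`, `build_brief`, and the bracketed block labels are illustrative names chosen for this example, and the rule wording is borrowed from the hook.

```python
from dataclasses import dataclass

REFUSAL = "I do not know from the provided sources."


@dataclass(frozen=True)
class Chunk:
    chunk_id: str  # stable ID the model can echo, e.g. "chunk_014"
    doc: str       # source document name
    text: str


def build_brief(role: str, chunks: list[Chunk], question: str) -> str:
    """Assemble the five blocks in order: role, rules, evidence, question, format."""
    rules = "\n".join([
        "Answer ONLY from the evidence below.",
        "Cite chunk IDs in square brackets after every claim.",
        f'If evidence is missing or conflicting, reply exactly: "{REFUSAL}"',
    ])
    # Every evidence block carries its ID so the answer can cite it.
    evidence = "\n\n".join(f"[{c.chunk_id} | doc={c.doc}] {c.text}" for c in chunks)
    # The last block repeats the critical rule as an end-of-prompt anchor.
    answer_format = (
        "Answer in one paragraph. Cite every claim. "
        f'If the evidence is insufficient, reply exactly: "{REFUSAL}"'
    )
    blocks = [
        ("SYSTEM", role),
        ("RULES", rules),
        ("EVIDENCE", evidence),
        ("QUESTION", question),
        ("FORMAT", answer_format),
    ]
    return "\n\n".join(f"[{name}]\n{body}" for name, body in blocks)


demo_chunks = [
    Chunk("chunk_014", "refund_policy_q3",
          "Enterprise annual plans may request a refund within 30 days of renewal."),
    Chunk("chunk_022", "refund_exceptions",
          "Refunds are not available after the first 5,000 API calls in the renewal term."),
]
print(build_brief("You are a support analyst.", demo_chunks,
                  "When can an enterprise customer get a refund, and who approves it?"))
```

Keeping the refusal string in one constant means the rules block, the end-of-prompt repeat, and any downstream monitoring all match the same literal text.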

Order of the blocks — and lost-in-the-middle¶

Now the harder question. Why this order?

Transformer attention is not uniform. The model reads the whole prompt, but it does not pay equal attention to every token. A 2023 paper by Liu et al., Lost in the Middle, tested long-context models on multi-document QA. The finding: when the answer lived in document 1 or document 10, accuracy was high. When it lived in document 5 of 10, accuracy dropped by 20 percentage points or more for some models. The middle is a graveyard.

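Picture the effect as a smile. The curve below is a schematic, not measured data; the paper itself plots answer accuracy against document position: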

attention
  ▲
1.0│●                                                    ●
   │ ●                                                  ●
0.8│  ●                                                ●
   │   ●                                              ●
0.6│    ●                                            ●
   │     ●                                          ●
0.4│      ●●●                                    ●●●
   │         ●●●●●                          ●●●●●
0.2│              ●●●●●●●●●●●●●●●●●●●●●●●●●●
   │                  ↑ middle: under-read
   └──────────────────────────────────────────────────────►
   start                                              end

So the rule of thumb: put the most important content at the start and at the end, and put the lowest-value content in the middle.

For a RAG brief this means:

Start — system role and ground rules. The model anchors to these.
Middle — the evidence blocks themselves (yes, the middle).
End — the question, the format spec, and a re-statement of the most critical rule.

But wait — if the middle is the graveyard, why park evidence there? Because the evidence is many chunks, and something has to live in the middle. The trick is to rank within the evidence block too. Most important chunk goes first, second-most goes last, and the weakest chunk gets buried. We will come back to this.
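The "strongest first, second-strongest last, weakest buried" rule can be sketched as a small reordering step. This assumes the input list is already sorted best-first by the reranker; `order_for_boundaries` is a hypothetical helper name.

```python
def order_for_boundaries(ranked: list[str]) -> list[str]:
    """Place rank 1 first, rank 2 last, rank 3 second, and so on.

    `ranked` must be sorted best-first. The weakest items end up in the
    middle of the returned list, where they are most likely to be under-read.
    """
    front: list[str] = []
    back: list[str] = []
    for i, item in enumerate(ranked):
        # Even positions (ranks 1, 3, 5, ...) fill from the start,
        # odd positions (ranks 2, 4, 6, ...) fill from the end.
        (front if i % 2 == 0 else back).append(item)
    return front + back[::-1]


print(order_for_boundaries(["c1", "c2", "c3", "c4", "c5"]))
# prints: ['c1', 'c3', 'c5', 'c4', 'c2']
```

Here `c5`, the weakest chunk, lands in the exact middle, while `c1` and `c2` take the two boundary positions.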

Mini-FAQ. "Where should the question go — top or bottom?" Bottom, for a typical RAG prompt. The question at the end is the model's last instruction, and last instructions are weighted heavily. If you put the question at the top, the model can drift by the time it reaches the end. Some chat templates put a short version at the top and the full question at the bottom — that works too.

Mini-FAQ. "Should context come before or after instructions?" Instructions before context, then context, then the question, then a short repeat of the most critical instruction. The model needs to know the rules before it reads the evidence, so it knows what to do with it. And it needs to be reminded of the rules at the end, so they survive the long context in between.

Predict the five blocks before reading the inventory¶

Stop. Before reading on, predict three things:

What is the single line in the brief that most reduces hallucination?
If you have 8,000 tokens of context window and 4,000 tokens of retrieved chunks, what is the danger?
Why are citation markers placed inside the evidence block, not after the answer?

Hold those answers. We return to them.

The running example — fully built¶

Here is the brief from the hook, expanded (an extra ground rule, section tags, a word limit) and with every block annotated.

┌─ SYSTEM ROLE ──────────────────────────────────────────────────────────┐
│ You are a customer support analyst for an enterprise SaaS product.     │
│ Your job is to answer billing and refund questions accurately.         │
└────────────────────────────────────────────────────────────────────────┘

┌─ GROUND RULES ─────────────────────────────────────────────────────────┐
│ 1. Answer ONLY from the EVIDENCE block below.                          │
│ 2. Cite chunk IDs in square brackets after every factual claim.        │
│ 3. If evidence is missing, conflicting, or ambiguous, reply exactly:   │
│    "I do not know from the provided sources."                          │
│ 4. Do not infer policy that is not explicitly stated.                  │
└────────────────────────────────────────────────────────────────────────┘

┌─ EVIDENCE ─────────────────────────────────────────────────────────────┐
│ [chunk_014 | doc=refund_policy_q3 | section=2.1]                       │
│ Enterprise annual plans may request a refund within 30 days of         │
│ renewal. Requests must be submitted by the billing admin of record.    │
│                                                                        │
│ [chunk_022 | doc=refund_exceptions | section=4]                        │
│ Refunds are not available after the first 5,000 API calls in the       │
│ renewal term. Partial credits may be approved for duplicate invoices.  │
│                                                                        │
│ [chunk_031 | doc=enterprise_sla | section=7.3]                         │
│ Enterprise support can verify refund eligibility, but support agents   │
│ cannot approve refunds directly. Final approval remains with the       │
│ billing operations team.                                               │
└────────────────────────────────────────────────────────────────────────┘

┌─ QUESTION ─────────────────────────────────────────────────────────────┐
│ When can an enterprise customer get a refund, and who approves it?     │
└────────────────────────────────────────────────────────────────────────┘

┌─ ANSWER FORMAT ────────────────────────────────────────────────────────┐
│ Answer in one paragraph, under 80 words.                               │
│ Cite every factual claim with the chunk ID, e.g. [chunk_014].          │
│ If you cannot answer fully from the evidence, refuse as instructed.    │
└────────────────────────────────────────────────────────────────────────┘

Read it slowly. Five blocks, each doing one thing. The model now has no excuse to invent.

Token budget — building the brief backward¶

The prompt has a hard ceiling. Plan it the right way: reserve the answer first, the rules next, and only then fill evidence.

Worked example: Claude 3.5 Sonnet offers a 200,000-token window, but a real RAG service might cap each query at 8,000 tokens in total (prompt plus reserved answer) for cost reasons:

| Block | Tokens | Notes |
| --- | --- | --- |
| Answer reservation | 800 | What the model writes back |
| System role | 150 | Stable, cacheable |
| Ground rules | 250 | Stable, cacheable |
| Question | 80 | Per-query |
| Answer format | 120 | Stable, cacheable |
| Subtotal (non-evidence) | 1,400 | |
| Evidence budget | 6,600 | The rest |

If each chunk after reranking is ~400 tokens, you can fit ~16 chunks. But should you? Almost never. Four to eight strong chunks usually beat sixteen mediocre ones, because of lost-in-the-middle and signal dilution. More is not better — more is louder.
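The backward budgeting above is plain arithmetic, sketched here with the numbers from the table. `Budget` is an illustrative class, and the cap of eight chunks is one way to encode the four-to-eight guidance, not a fixed rule.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Budget:
    total: int          # hard cap per query, prompt plus answer
    answer: int
    system_role: int
    ground_rules: int
    question: int
    answer_format: int

    def evidence(self) -> int:
        """Tokens left for evidence after every fixed block is reserved."""
        fixed = (self.answer + self.system_role + self.ground_rules
                 + self.question + self.answer_format)
        remaining = self.total - fixed
        if remaining < 0:
            raise ValueError("fixed blocks alone exceed the cap")
        return remaining


budget = Budget(total=8_000, answer=800, system_role=150,
                ground_rules=250, question=80, answer_format=120)
max_chunks = budget.evidence() // 400   # ceiling at ~400 tokens per chunk
chosen = min(max_chunks, 8)             # cap per the four-to-eight guidance
print(budget.evidence(), max_chunks, chosen)  # prints: 6600 16 8
```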

Typical production splits look like:

| Model | Window | System+rules | Evidence | Question | Answer |
| --- | --- | --- | --- | --- | --- |
| GPT-4o (128k) | 128,000 | 500 | 6,000–20,000 | 100 | 1,500 |
| Claude 3.5 Sonnet | 200,000 | 500 | 8,000–30,000 | 100 | 2,000 |
| Llama 3 8B (8k) | 8,192 | 400 | 4,000–6,000 | 80 | 1,000 |
| Gemini 1.5 Pro (1M) | 1,000,000 | 500 | up to 200,000 | 100 | 2,000 |

The window has grown. The lost-in-the-middle problem has not gone away. Long windows are capacity, not quality.

Prompt compression — fewer tokens, same signal¶

When evidence overflows the budget, you have three choices: drop chunks, summarize chunks, or compress them token by token.

LLMLingua (Microsoft) compresses prompts at the token level. It removes low-information tokens (articles, fillers, redundant subclauses), using a small language model to score how much information each token carries. Reported numbers in the original LLMLingua paper: up to 20x compression with little loss in answer quality on its benchmarks, and roughly 1.7x–5.7x faster end-to-end latency.

LongLLMLingua extends this for very long contexts and pairs compression with importance-aware ordering — it pushes the most informative content to the start and end, fighting lost-in-the-middle directly.

Summarization-as-compression uses a small model (often the same provider's cheaper tier) to rewrite each chunk into a tighter abstract. Cost: one extra LLM call per chunk. Benefit: a 600-token chunk becomes a 150-token brief paragraph.

Practical recipe. If your retrieval returns 20 chunks of 400 tokens (8,000 tokens) and your evidence budget is 4,000:

Option A — keep top 10, drop the rest. Risk: drop something useful.
Option B — keep all 20, compress each by 50% with LLMLingua. Risk: compression artifacts.
Option C — keep top 8 uncompressed, plus a 200-token LLM summary of the remaining 12. Risk: extra latency.

Pick by workload. Measure faithfulness either way.
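Before choosing, it helps to check that each option actually fits the budget. The sketch below only does the token arithmetic for the three options; it does not compress or summarize anything, and `plan_options` with its parameter names is hypothetical.

```python
def plan_options(n_chunks: int, chunk_tokens: int, budget: int,
                 keep_uncompressed: int = 8, summary_tokens: int = 200,
                 compression_ratio: float = 0.5) -> dict[str, int]:
    """Return the evidence tokens each option would use."""
    # Option A: keep as many top chunks as fit, drop the rest.
    kept_a = min(n_chunks, budget // chunk_tokens)
    option_a = kept_a * chunk_tokens

    # Option B: keep every chunk, compressed to a fraction of its size.
    option_b = int(n_chunks * chunk_tokens * compression_ratio)

    # Option C: top chunks verbatim plus one summary of whatever is left over.
    kept_c = min(n_chunks, keep_uncompressed)
    has_leftover = n_chunks > kept_c
    option_c = kept_c * chunk_tokens + (summary_tokens if has_leftover else 0)

    return {"A_drop": option_a, "B_compress_all": option_b,
            "C_top_plus_summary": option_c}


plan = plan_options(n_chunks=20, chunk_tokens=400, budget=4_000)
print(plan)
# prints: {'A_drop': 4000, 'B_compress_all': 4000, 'C_top_plus_summary': 3400}
print(all(tokens <= 4_000 for tokens in plan.values()))  # prints: True
```

Options A and B land exactly on the budget, so any undercount in chunk size overflows it; option C leaves 600 tokens of slack.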

Citation markers — inside, not after¶

Many beginners put citations as a footnote at the end of the answer. "Sources: [1] [2] [3]." This is decoration. It does not tell you which sentence came from which source.

Real citation lives inside the brief and inside the answer, sentence by sentence.

Inside the brief. Every evidence block carries a chunk ID. The model can quote that ID.

[chunk_014 | refund_policy_q3] Enterprise annual plans may request...

Inside the answer. The model places the ID immediately after the claim it supports.

A refund may be requested within 30 days of renewal [chunk_014],
but only if 5,000 API calls have not yet been used [chunk_022].

This gives you claim-level attribution. Each fact, each citation. Now an auditor can verify each sentence against the chunk in seconds. This is what Perplexity, Glean, NotebookLM, and Vectara all do under the hood, just with prettier formatting.
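Claim-level attribution is also something you can check automatically. The sketch below assumes citations look like `[chunk_014]` and uses a naive sentence splitter, so treat it as a starting point; `audit_citations` is a hypothetical helper, not part of any library.

```python
import re

CITATION = re.compile(r"\[(chunk_\d+)\]")


def audit_citations(answer: str, evidence_ids: set[str]) -> dict[str, list[str]]:
    """Flag sentences with no citation and cited IDs that were never in the brief."""
    # Naive split on ., !, ? followed by whitespace; abbreviations will confuse it.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", answer.strip()) if s]
    uncited = [s for s in sentences if not CITATION.search(s)]
    cited_ids = set(CITATION.findall(answer))
    unknown = sorted(cited_ids - evidence_ids)
    return {"uncited_sentences": uncited, "unknown_ids": unknown}


answer = (
    "An enterprise customer may request a refund within 30 days of renewal "
    "[chunk_014], but only if they have not yet made 5,000 API calls in that "
    "term [chunk_022]. Final approval rests with billing operations, not "
    "support [chunk_031]."
)
print(audit_citations(answer, {"chunk_014", "chunk_022", "chunk_031"}))
# prints: {'uncited_sentences': [], 'unknown_ids': []}
```

An unknown ID means the model cited a chunk that was not in the evidence block, which is worth treating as a hard failure. The check says nothing about whether the cited chunk actually supports the claim.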

Mini-FAQ. "What is lost-in-the-middle and how do I mitigate?" It is the empirical finding that mid-prompt tokens get less attention. Mitigate it by (1) ranking evidence so the strongest chunk is first and the second-strongest is last, (2) repeating the most important instruction near the end of the prompt, (3) compressing or summarizing middle content, and (4) measuring — re-order chunks at random and see if accuracy moves. If it does, you have the effect.
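The fourth mitigation, measuring by random re-ordering, can be sketched as a small harness. Everything here is an assumption for illustration: `answer_fn` stands in for your real prompt-plus-model call, `is_correct` for your grader, and `edge_reader` is a toy "model" that only reads the first and last chunk.

```python
import random
from typing import Callable, Sequence


def shuffle_accuracy(chunks: Sequence[str],
                     answer_fn: Callable[[list[str]], str],
                     is_correct: Callable[[str], bool],
                     trials: int = 20, seed: int = 0) -> float:
    """Fraction of random chunk orderings for which the answer is still correct."""
    rng = random.Random(seed)  # seeded so runs are repeatable
    hits = 0
    for _ in range(trials):
        order = list(chunks)
        rng.shuffle(order)
        hits += is_correct(answer_fn(order))
    return hits / trials


def edge_reader(order: list[str]) -> str:
    # Toy stand-in for a model with an extreme lost-in-the-middle effect.
    return order[0] + " " + order[-1]


def full_reader(order: list[str]) -> str:
    # Toy stand-in for a model that reads every chunk equally.
    return " ".join(order)


chunks = ["refund within 30 days", "filler one", "filler two",
          "filler three", "filler four"]
has_fact = lambda answer: "30 days" in answer

print(shuffle_accuracy(chunks, full_reader, has_fact))  # prints: 1.0
print(shuffle_accuracy(chunks, edge_reader, has_fact))  # below 1.0 if order matters
```

If the score moves when only the order changes, the pipeline is position-sensitive and the boundary-ranking mitigations are worth applying.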

Refusal — the line that saves you¶

The single most important sentence in your brief is the refusal clause.

If the evidence is missing, conflicting, or insufficient,
reply exactly: "I do not know from the provided sources."

Five reasons to keep it word-for-word:

It gives the model permission to refuse. Without it, the default is to please the user — and pleasing means filling gaps.
It gives you a detectable signal. A literal string lets your evaluation framework count refusals.
It avoids hedged garbage like "I think the answer might be...", which is worse than a refusal because it still delivers an unsupported claim and is much harder to detect.
It calibrates user expectations. Users learn the model can say no, and they trust the yeses more.
It is measurable as a metric. Refusal rate becomes part of your eval dashboard alongside faithfulness.

Production teams tune the refusal threshold by domain. A medical chatbot refuses aggressively. A movie recommender almost never refuses. Pick a default per product.
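Because the refusal is a literal string, counting it takes a few lines. This is a minimal sketch: `is_refusal` and `refusal_rate` are hypothetical helpers, and the matching is deliberately strict (exact text after trimming whitespace and surrounding quotes).

```python
REFUSAL = "I do not know from the provided sources."


def is_refusal(answer: str) -> bool:
    """True only for the exact refusal line, ignoring outer whitespace and quotes."""
    return answer.strip().strip('"').strip() == REFUSAL


def refusal_rate(answers: list[str]) -> float:
    """Share of answers that are clean refusals; 0.0 for an empty batch."""
    if not answers:
        return 0.0
    return sum(is_refusal(a) for a in answers) / len(answers)


batch = [
    "A refund may be requested within 30 days of renewal [chunk_014].",
    "I do not know from the provided sources.",
    "I think the answer might be 60 days.",   # hedged, not a refusal
    "Final approval rests with billing operations [chunk_031].",
]
print(refusal_rate(batch))  # prints: 0.25
```

The hedged third answer is not counted, which is the point: only the literal line is detectable.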

Failure modes — five ways the brief breaks¶

| # | Failure | Symptom | Fix |
| --- | --- | --- | --- |
| 1 | No refusal clause | Confident hallucination on missing data | Add the literal refusal line |
| 2 | Instructions buried mid-prompt | Model ignores the rules | Move to start AND repeat at end |
| 3 | No citation markers in evidence | Answers cannot be traced | Tag every chunk with a stable ID |
| 4 | Context overflows window | Silent truncation, last chunks vanish | Token-count before sending; reserve budget |
| 5 | Redundant chunks fill the desk | Same fact, three voices, no diversity | De-duplicate at selection, before brief |

Every failure here has been seen in real production tickets. None are exotic. They show up the moment you ship.
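Several of these failures can be caught by a pre-send check. The sketch below covers failures 1, 3, 4, and 5 under loud assumptions: token counts use a crude four-characters-per-token estimate (swap in your model's tokenizer), duplicates are detected only by exact match after normalizing case and whitespace, and `lint_brief` is a hypothetical helper.

```python
import re

REFUSAL = "I do not know from the provided sources."


def estimate_tokens(text: str) -> int:
    # Crude assumption: roughly 4 characters per token. Not a real tokenizer.
    return max(1, len(text) // 4)


def lint_brief(brief: str, chunk_texts: list[str],
               window: int, answer_reserve: int) -> list[str]:
    """Return a list of problems found; an empty list means no known issue."""
    problems: list[str] = []
    if REFUSAL not in brief:                                 # failure 1
        problems.append("no refusal clause")
    if not re.search(r"\[chunk_\d+", brief):                 # failure 3
        problems.append("no chunk IDs in evidence")
    if estimate_tokens(brief) + answer_reserve > window:     # failure 4
        problems.append("brief overflows window")
    normalized = [" ".join(t.lower().split()) for t in chunk_texts]
    if len(set(normalized)) < len(normalized):               # failure 5
        problems.append("duplicate chunks")
    return problems


bad_brief = (
    "Enterprise annual plans may request a refund within 30 days of renewal.\n"
    "When can an enterprise customer get a refund, and who approves it?"
)
print(lint_brief(bad_brief, ["Enterprise annual plans may request a refund."],
                 window=8_000, answer_reserve=800))
# prints: ['no refusal clause', 'no chunk IDs in evidence']
```

Failure 2 (instructions buried mid-prompt) is about position, so it is better prevented by building the brief from a fixed template than by linting afterwards.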

Brief patterns across shipped grounded-answer products¶

Twenty-five real systems and patterns built around the same brief idea:

Anthropic Claude — recommends <documents><document><source> XML tags around evidence; the model is trained to anchor citations to these.
OpenAI structured outputs — JSON schema enforcement so the model returns fields like answer and sources[] in a parseable shape.
OpenAI function/tool calling — citation tools that the model must call with chunk IDs as arguments.
Perplexity AI — numbered citation markers [1] [2] woven into every sentence, mapped to fetched URLs.
Cohere RAG API — built-in documents parameter; the model returns citations with start/end character offsets.
Glean — enterprise answers carry source-document links per claim and respect ACLs at retrieval time.
Vectara — RAG-as-a-service with built-in factual consistency scoring and per-claim source attribution.
Pinecone Assistant — managed brief assembly with system rules and citation enforcement.
LlamaIndex response synthesizers — compact, refine, tree_summarize, accumulate modes are different brief-assembly strategies.
LangChain RetrievalQA — stuff, map_reduce, refine, map_rerank chains pack evidence in different ways.
LangGraph — agentic RAG where the brief is rebuilt across loop iterations.
Haystack — PromptBuilder and AnswerBuilder components separate brief assembly from generation.
llama.cpp grammar-constrained outputs — GBNF grammars force the model to produce only citation-shaped strings.
Instructor (Python) — Pydantic-typed responses, so refusal becomes a literal Refusal model.
BAML — schema-first prompts where the brief structure is compiled from a declarative spec.
DSPy — programmatic prompt construction; the brief is composed by Signature objects.
Microsoft Copilot for M365 — Graph-grounded brief with per-document citation back to SharePoint, Outlook, Teams.
GitHub Copilot Chat — code-context brief with file path and line range as citations.
Cursor / Windsurf — codebase-aware briefs; @file and @symbol are explicit evidence injections.
Notion AI Q&A — citations point users to the source note in the workspace.
NotebookLM — every answer paragraph carries clickable source markers.
Intercom Fin — support brief grounded in help-center articles; refusal triggers an escalation path.
Hebbia / Harvey — long-document briefs for finance and legal; citations include page and clause numbers.
Microsoft LLMLingua — token-level prompt compression library, up to 20x with retained quality.
Anthropic prompt caching — the stable parts of the brief (system + rules + format) are cached at 90% discount, only the per-query evidence and question are billed at full rate.

The shape stays. The wrapper, the tags, the citation format — all change.

Recall — brief construction cold¶

Name the five blocks of a well-built brief, in order.
Why does the question usually go at the bottom?
What does lost-in-the-middle mean, and what is one mitigation?
What is the literal refusal line you would use in a production brief?
If you have 6,600 tokens for evidence and chunks average 400 tokens, how many chunks fit — and why is that not always the right number?
Where should citation markers appear: inside the evidence, inside the answer, or both?
Name two prompt-compression strategies and roughly what they save.
List three failure modes from the failure-modes table above and the fix for each.

Interview Q&A¶

Q1. How do you prevent hallucination at the prompt level? A. Four levers, all stacked. (1) An explicit ground rule: answer only from the provided evidence. (2) A literal refusal clause: if evidence is insufficient, reply "I do not know from the provided sources." (3) Citation enforcement: every claim carries a chunk ID. (4) Repeat the most critical rule at the end of the prompt, where attention is strong. Prompt-level prevention does not replace retrieval quality, but it converts ambiguous failures into measurable refusals. Common wrong answer to avoid: "Just lower the temperature."

Q2. In what order should you arrange a RAG prompt and why? A. System role and ground rules first (anchor attention). Evidence next, with the strongest chunk ranked first and second-strongest last. Question near the bottom. Format and a repeat of the critical rule at the very end. This order respects the lost-in-the-middle attention curve and ensures the question is the last instruction the model reads. Common wrong answer to avoid: "Order doesn't matter; transformers attend equally."

Q3. What is lost-in-the-middle and how do you mitigate it? A. A finding from Liu et al. 2023 that mid-prompt content gets significantly less attention than start or end content, with accuracy drops of 20+ points on multi-document QA. Mitigations: rank evidence so the strongest is at the boundary, repeat key instructions at the end, compress or summarize middle content with tools like LLMLingua, and measure by shuffling chunk order to check answer stability. Common wrong answer to avoid: "Larger context windows solve it."

Q4. Why are citation markers placed inside the evidence block, not appended to the answer? A. So the model can echo them at the claim level. A footnote at the end tells you the sources existed but not which sentence came from which source. Per-claim citations enable auditable, verifiable answers — every fact traces back to a chunk ID in seconds. This is what Perplexity, Glean, Cohere, and Vectara all do. Common wrong answer to avoid: "Citations are a UI concern, not a prompt concern."

Q5. How do you plan a token budget for a RAG prompt? A. Backward. Reserve answer space first (e.g., 800 tokens). Subtract system, rules, and format (typically 500–700 tokens). Subtract the question (~100 tokens). The rest is the evidence budget. Then pick chunk count from that budget, but cap it: four to eight strong chunks usually beat sixteen mediocre ones due to dilution and lost-in-the-middle. Common wrong answer to avoid: "Use the full context window; bigger is better."

Q6. When would you use prompt compression and how much can you save? A. When retrieved evidence exceeds the input budget, or when latency and cost are tight. LLMLingua reports up to 20x token compression with near-original answer quality on QA benchmarks, and ~1.7x to 5.7x faster end-to-end latency. Alternatives: drop low-rank chunks, or summarize chunks with a cheaper model. Trade-off is compression artifacts versus information loss from dropping. Common wrong answer to avoid: "Just truncate from the end."

Q7. What should the refusal clause say, and why so literally? A. A literal exact-match string like "I do not know from the provided sources." Literal because (a) the model is more likely to comply with an explicit example, (b) it produces a detectable string for monitoring, (c) it gives a refusal rate metric for evals, and (d) it prevents hedged garbage that is worse than a clean refusal. Common wrong answer to avoid: "Just say 'be honest about uncertainty.'"

Q8. How does prompt prefix caching change brief design? A. Providers like Anthropic and OpenAI cache the static prefix of long prompts. The implication: put stable content (system role, ground rules) at the top, and put per-query content (evidence and question) at the bottom. Only the prefix is cached, so the short format-and-rule repeat at the very end is billed at the full rate along with the evidence and question. Done right, you pay full price only for the changing part; cached prefix reads are billed at ~10% of the input rate (Anthropic) and latency drops noticeably for long stable headers. Common wrong answer to avoid: "Caching is invisible to the developer."
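The caching arithmetic from Q8 can be sketched as follows. The price is a made-up placeholder, the 0.10 multiplier is the ~10% figure quoted above, and the model ignores any cache-write surcharge or minimum cacheable prefix length a provider may apply, so check your provider's pricing page before relying on it.

```python
def input_cost(stable_tokens: int, per_query_tokens: int, price_per_mtok: float,
               cached_multiplier: float = 0.10, cache_hit: bool = True) -> float:
    """Input cost of one query when the stable prefix may be read from cache."""
    # On a cache hit the stable prefix is billed at a fraction of the normal rate.
    stable_rate = cached_multiplier if cache_hit else 1.0
    billable = stable_tokens * stable_rate + per_query_tokens
    return billable * price_per_mtok / 1_000_000


# Worked-example numbers: role (150) + rules (250) form the cached prefix;
# evidence (6,600) + question (80) + end-of-prompt format (120) change or
# sit after changing content, so they are billed in full.
PRICE = 3.0  # hypothetical currency units per million input tokens
cold = input_cost(400, 6_800, PRICE, cache_hit=False)
warm = input_cost(400, 6_800, PRICE, cache_hit=True)
print(f"saving per query: {100 * (1 - warm / cold):.1f}%")
```

With a 400-token prefix against 6,800 per-query tokens the saving is small; it grows with the share of the prompt that is stable, which is why long fixed headers benefit most.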

Apply now (10 min)¶

Step 1 — model the exercise. Here is the brief I would write for our running query, as a checklist:

| Block | What it contains | Token estimate | Rationale |
| --- | --- | --- | --- |
| System role | "You are a support analyst" | 80 | Anchors attention; cacheable |
| Ground rules | Answer-only, cite, refuse | 200 | Hallucination prevention |
| Evidence chunks (3) | Tagged with chunk IDs | 1,200 | The actual desk |
| Question | The literal user query | 30 | Last full instruction |
| Format + critical rule repeat | One paragraph, cite, refuse | 100 | End-of-prompt anchor |
| Total input | | 1,610 | |
| Answer reservation | | 800 | What model returns |

Step 2 — your turn. Take one query from your product. Write each of the five blocks. Tag your evidence with stable IDs. Write the refusal clause as a literal string. Estimate the token count of each block. Where does your budget run out — and which chunks would you drop first?

Step 3 — sketch from memory. Redraw the five-block diagram. Label each block with its one job. Underneath, write the failure mode that triggers when that block is missing. If you can do this cold, you understand the answer brief.

What you should remember¶

This chapter explained why "stuff the chunks into the prompt" is the move that produces fluent hallucinations even on perfect retrieval. The answer brief has five blocks — system role, ground rules, evidence with stable chunk IDs, the literal question, and an end-of-prompt anchor that repeats the critical instruction — and the order matters because LLMs weight the last full instruction most heavily.

You also learned the structural defences against the common failures: stable chunk IDs let citations attach to evidence rather than to invented document names; the answer-only-from-context clause has to be repeated at the bottom because attention degrades in the middle; the refusal clause must be a literal string because vague "be honest" rules do not survive long prompts. Prefix caching only pays off when the stable content is at the top and the changing content is at the bottom.

Carry this diagnostic forward: when the model produces a fluent answer that ignores the chunks, read the brief end-to-end before blaming the model. The critical rule probably sits halfway down where attention does not protect it.

Remember:

A brief is five blocks, not a concat. Drop one and a known failure mode arrives.
The last instruction wins. Put the question last and repeat the critical rule at the end of the prompt.
Tag chunks with stable IDs so citations survive prompt rewrites.
Refusal must be a literal clause (reply exactly: "I do not know from the provided sources."), not a vague aspiration.
Order blocks for prefix caching: stable content top, per-query content bottom. Caching pays in latency and dollars.

Bridge. The brief is well-built. The model answers cleanly. But how do you know the retrieval feeding the brief is any good? A perfect brief with the wrong chunks still produces the wrong answer. To diagnose retrieval itself, you need numbers — recall@k, MRR, NDCG. Vibes are not enough.

→ 12-retrieval-metrics.md