
08. Prompt compression — shrink context without deleting the reason the answer is correct

~22 min read. A lead AI engineer does not optimize a single number; they make the tradeoff visible, measured, and reversible.

Builds on 00-eli5.md. The meter ticks and boot space both shrink when prompts shrink, but the fuel ledger must catch quality loss.


What previous chapters solved before this pressure appears

KV cache showed that long context spends memory while cost anatomy showed it spends money. What still breaks is the naive response: remove tokens until the bill looks good, then discover that evidence, safety instructions, or task constraints disappeared. Prompt compression is not shortening for its own sake; it is preserving behavior under a smaller context budget.

The accumulated lesson is already visible in the taxi fleet. Meter ticks expose money, the ETA call exposes perceived wait, the dispatch board exposes route choice, and the fuel ledger keeps those choices honest. This file adds the next constraint without forgetting the earlier ones: every optimization relieves one pressure and creates another that some subsystem must absorb.

What this file solves

This file starts from one failure: a 45% shorter prompt saves money, but citation accuracy drops from 92% to 71%. It shows the concrete design move a lead engineer makes, the artifact to inspect, and the signal that proves the optimization helped instead of merely moving pain elsewhere.

The opening failure shows up in a concrete artifact

The failure is not abstract: a 45% shorter prompt saves money but citation accuracy drops from 92% to 71%. Here is the early artifact a reviewer can inspect.

A compression review tags removed text by role before celebrating lower token counts.

Prompt diff summary:

Removed text                  Tokens cut   Quality effect              Decision
duplicate retrieval chunks    700          no loss                     keep cut
citation-bearing paragraph    420          citation accuracy -12 pts   restore
verbose style examples        350          answer shape stable         keep cut
safety exception list         180          unsafe pass rate +3 pts     restore

A smart team might try to fix the most visible line in that artifact. That is tempting, and it is incomplete. The root cause is unstructured deletion: the prompt lost load-bearing evidence while keeping decorative text. So how do we compress while protecting the tokens that make the answer correct and safe? This is the root-cause pivot: not a local metric problem, a boundary-and-pressure problem.
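The review in the prompt diff summary can be sketched as a small decision rule. This is a minimal sketch, not the team's actual tooling: the `RemovedSpan` type, the `harm_pts` sign convention (positive means the paired guardrail got worse), and the 1-point tolerance are all assumptions made for illustration. The token counts and effects are copied from the summary above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RemovedSpan:
    name: str
    role: str          # hypothetical tag: "duplicate", "evidence", "style", "policy"
    tokens_cut: int
    harm_pts: float    # guardrail regression in points; 0 means no measured loss


def decide(span: RemovedSpan, tolerance_pts: float = 1.0) -> str:
    """Restore any cut whose measured harm exceeds the tolerance (assumed value)."""
    return "restore" if span.harm_pts > tolerance_pts else "keep cut"


# Rows from the prompt diff summary. A 12-point citation accuracy drop and a
# 3-point rise in unsafe pass rate are both recorded as positive harm.
spans = [
    RemovedSpan("duplicate retrieval chunks", "duplicate", 700, 0.0),
    RemovedSpan("citation-bearing paragraph", "evidence", 420, 12.0),
    RemovedSpan("verbose style examples", "style", 350, 0.0),
    RemovedSpan("safety exception list", "policy", 180, 3.0),
]

decisions = {s.name: decide(s) for s in spans}
tokens_safely_cut = sum(s.tokens_cut for s in spans if decide(s) == "keep cut")

for name, decision in decisions.items():
    print(f"{name}: {decision}")
print("tokens cut after restores:", tokens_safely_cut)
```

The point of the sketch is the ordering: the decision is driven by the measured guardrail effect per removed span, and the token total is computed only after the restores.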

A tiny version exposes the whole mechanism

A toy request makes the pressure visible: one request looks fine in isolation, but once the route mix, retries, context, or output length change, the user-visible workflow changes too. The smallest example is enough to show why the lever must be measured at the workflow boundary.

Rule: Compress redundancy, not obligations: keep task, policy, evidence, and output contract before trimming style and duplicates.

Why this rule exists. Prompt compression is packing the taxi: remove empty boxes before removing the medicine or map. The boot space shrinks only if the right luggage leaves. The fuel ledger matters because it shows whether the new pressure landed in cost, latency, memory, quality, or operator attention.
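One way to read the rule as code: tag each prompt segment with a role, never drop the protected roles, and trim the rest in a fixed order until the prompt fits the budget. This is a sketch under stated assumptions. The role names, the trim order, the token counts, and the choice to fail loudly when the budget cannot be met are hypothetical, not taken from a specific library or from the source.

```python
from dataclasses import dataclass

# Roles the rule says to keep: task, policy, evidence, output contract.
PROTECTED_ROLES = {"task", "policy", "evidence", "output_contract"}
# Assumed trim order for everything else: cheapest to lose first.
TRIM_ORDER = ["duplicate", "style", "history"]


@dataclass(frozen=True)
class Segment:
    role: str
    tokens: int


def total_tokens(segments: list[Segment]) -> int:
    return sum(s.tokens for s in segments)


def compress_to_budget(segments: list[Segment], budget: int) -> list[Segment]:
    """Drop trimmable segments in TRIM_ORDER until the prompt fits the budget.

    Protected segments are never dropped. If trimming everything else is
    still not enough, raise instead of silently cutting obligations.
    """
    kept = list(segments)
    for role in TRIM_ORDER:
        for seg in [s for s in kept if s.role == role]:
            if total_tokens(kept) <= budget:
                return kept
            kept.remove(seg)
    if total_tokens(kept) > budget:
        raise ValueError("budget cannot be met without cutting protected segments")
    return kept


# Hypothetical prompt layout with made-up token counts.
prompt = [
    Segment("task", 150),
    Segment("policy", 300),
    Segment("evidence", 1800),
    Segment("output_contract", 100),
    Segment("duplicate", 700),
    Segment("style", 350),
    Segment("history", 500),
]

compressed = compress_to_budget(prompt, budget=3000)
print([s.role for s in compressed], total_tokens(compressed))
```

Raising on an impossible budget is a deliberate choice in this sketch: it turns "we cut the medicine to fit the boot" into a visible error that someone has to own.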


1) Identify load-bearing tokens before cutting

Start with the workflow, not the vendor feature. In Maya's review, the team takes one request and follows it from API ingress to model call, tool call, runtime behavior, and resource consequence. That cross-layer trace is the shortest path from symptom to lever. If the symptom is cost, the trace follows meter ticks. If the symptom is silence, it follows the ETA call. If the symptom is serving pressure, it follows the carpool lane and boot space.

user request
API gateway ── route/version ──► model/runtime
    │                              │
    │                              ├─ tokens / queue / KV / output
    │                              ▼
    └──────── outcome ◄──────── fuel ledger row

The counterintuitive part is that the most obvious metric can improve while the product gets worse. A smaller bill can hide more failed outcomes. Higher tokens/sec can hide longer queueing. A shorter prompt can hide missing evidence. The mechanism in this chapter is useful only when the trace keeps the relieved pressure and the newly created pressure in the same picture.

2) The prompt before and after the safe cut

Picture the chapter as a pressure transfer, not a free lunch.

Before optimization                    After optimization
┌──────────────────────┐              ┌──────────────────────┐
│ visible pain          │              │ relieved pain         │
│ cost / wait / memory  │──change──►   │ lower local metric    │
└──────────┬───────────┘              └──────────┬───────────┘
           │                                      │
           ▼                                      ▼
 hidden cause not named                 new pressure appears
 retries, route mix, context,           quality, queueing, cache,
 output, provider limits                memory, fallback, ops

The diagram is the reason this module keeps returning to the fuel ledger. The ledger is where the second box becomes visible instead of surfacing as an invoice surprise, p99 incident, or quality complaint weeks later.

3) Maya compresses a RAG answer prompt without losing citations

Maya threads one workload through the design review: a production assistant with real traffic, route versions, prompt versions, and outcome labels.

Attempt A — optimize the visible line

The first attempt changes the local knob associated with the opening symptom. The local dashboard improves. The team celebrates too early because the request boundary is still broken: retries, quality loss, queueing, cache misses, or memory pressure move elsewhere.

Attempt B — optimize with the pressure chain

The second attempt keeps the artifact, the rule, and the guardrail together. Maya writes the expected improvement, the pressure that may worsen, the owner of that pressure, and the rollback trigger. The dispatch board may change a route, the memorized route may change a prompt prefix, or the carpool lane may change scheduling, but the same request ID proves whether the user outcome survived.
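What Maya writes down in Attempt B can be captured as a small record with an explicit rollback check. This is an illustrative sketch: the `ChangePlan` fields mirror the four items named above, while the owner name and the 2-point threshold are hypothetical. The 92% baseline and 71% observation come from the opening failure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChangePlan:
    expected_improvement: str
    pressure_that_may_worsen: str
    owner: str
    guardrail_metric: str
    baseline_pct: float
    max_drop_pts: float   # rollback trigger: allowed drop in the guardrail

    def should_roll_back(self, observed_pct: float) -> bool:
        """True when the guardrail fell further than the plan allows."""
        return (self.baseline_pct - observed_pct) > self.max_drop_pts


plan = ChangePlan(
    expected_improvement="fewer fresh input tokens per request",
    pressure_that_may_worsen="citation accuracy",
    owner="rag-quality",            # hypothetical owning team
    guardrail_metric="citation_accuracy",
    baseline_pct=92.0,              # from the opening failure
    max_drop_pts=2.0,               # assumed threshold
)

print(plan.should_roll_back(90.0))  # selective compression case
print(plan.should_roll_back(71.0))  # naive cut from the opening failure
```

Because the trigger is written before the change ships, the canary result is compared against a number that nobody can quietly renegotiate afterwards.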

4) Why selective compression beats a smaller retrieval top-k alone

The plausible alternative is attractive because it is simpler to explain in a status update: change one knob, quote one percentage, and move on. That works for demos. It fails for lead-level ownership because it cannot answer which workload benefits and which workload pays.

Use this chapter's mechanism when the workload has the shape named in the opening artifact. Use the alternative when the product is small enough, stable enough, or low-risk enough that the extra machinery would cost more than it saves. The decision is not about elegance; it is about whether the signal-to-operator cost is worth it.

5) Evidence density matters more than raw token count

Concrete numbers make the tradeoff review honest. The sample prices and memory figures below are illustrative; replace them with the provider, hardware, and workload numbers in your own stack.

Scenario                           Fresh input (tokens)   Cached input (tokens)   Output (tokens)   Quality signal           Cost per request
Original prompt                    4200                   0                       520               92% citation accuracy    $0.0126
Naive 45% cut                      2300                   0                       520               71% citation accuracy    $0.0088
Dedup evidence + compact policy    2600                   0                       480               90% citation accuracy    $0.0089
Summary memory instead of full chat 2100                  0                       500               88% task success         $0.0082
Compression plus cache             900                    1700                    480               90% accuracy             $0.0063

The table teaches the design habit: every row says what improved and what might have worsened. If a row cannot name both, the proposal is not ready for production review.
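The cost column can be reproduced approximately with a three-bucket price model. The per-million-token prices below are assumptions chosen because they land close to the illustrative figures in the table; the source does not state them, and your provider's prices will differ. The token counts are taken from the table.

```python
# Assumed prices in USD per million tokens (not stated in the source).
PRICE_PER_MTOK = {"fresh": 2.00, "cached": 0.40, "output": 8.00}


def request_cost(fresh: int, cached: int, output: int) -> float:
    """Cost of one request in USD under the assumed three-bucket pricing."""
    return (
        fresh * PRICE_PER_MTOK["fresh"]
        + cached * PRICE_PER_MTOK["cached"]
        + output * PRICE_PER_MTOK["output"]
    ) / 1_000_000


# (fresh input, cached input, output) token counts from the table.
SCENARIOS = {
    "Original prompt": (4200, 0, 520),
    "Naive 45% cut": (2300, 0, 520),
    "Dedup evidence + compact policy": (2600, 0, 480),
    "Summary memory instead of full chat": (2100, 0, 500),
    "Compression plus cache": (900, 1700, 480),
}

baseline = request_cost(*SCENARIOS["Original prompt"])
for name, buckets in SCENARIOS.items():
    cost = request_cost(*buckets)
    saving = 1 - cost / baseline
    print(f"{name}: ${cost:.4f} per request, {saving:.0%} below original")
```

Keeping fresh, cached, and output tokens as separate buckets matters here: the last row is cheap mostly because 1700 tokens moved to the cached price, not because the prompt got shorter.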

Automated compressors — name the tools, then hold them to the same guardrail

Manual selective compression is the scalpel. When traffic is large, teams reach for automated compressors that score and drop low-information tokens at inference time. Name them in review so the choice is explicit, not folklore:

  • LLMLingua / LLMLingua-2 (Microsoft) — a small language model rates token informativeness and prunes the low-yield tokens, reporting 2–6x prompt reduction with little quality loss on many tasks. LLMLingua-2 is the faster, classifier-based successor trained directly for token classification.
  • LLMZip and related entropy coders — treat the LLM as a predictor and code text against its own probability model. More research lever than production default, but worth knowing as the theoretical floor for how compressible text is.

These are the vendor feature, not the workflow. The compressor is itself a model call: it adds its own meter ticks and ETA call, and it can hallucinate a summary (failure shape 6). So route compressed and uncompressed traffic through the same guardrail — citation precision, schema pass rate, safety pass rate — and keep the cut only where the fuel ledger shows quality held. A compressor that costs more latency than it saves in tokens has moved pressure, not removed it.
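The "moved pressure, not removed it" test can be written down as a verdict over one trial. This is a hedged sketch that deliberately does not call any real compressor library: the `CompressorTrial` fields, the example numbers, and the 1-point guardrail tolerance are hypothetical stand-ins for whatever your ledger actually records.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class CompressorTrial:
    tokens_saved: int             # input tokens removed per request
    input_price_per_mtok: float   # USD per million input tokens
    compressor_cost_usd: float    # cost of the compressor call itself
    prefill_ms_saved: float       # latency saved by the shorter prompt
    compressor_ms: float          # latency added by the compressor
    guardrail_drops_pts: dict     # metric name -> drop vs uncompressed traffic


def verdict(trial: CompressorTrial, max_drop_pts: float = 1.0) -> str:
    """Keep the cut only if quality held and both cost and latency net out."""
    if any(drop > max_drop_pts for drop in trial.guardrail_drops_pts.values()):
        return "reject: quality regressed"
    net_usd = (
        trial.tokens_saved * trial.input_price_per_mtok / 1_000_000
        - trial.compressor_cost_usd
    )
    net_ms = trial.prefill_ms_saved - trial.compressor_ms
    if net_usd <= 0 or net_ms < 0:
        return "reject: pressure moved, not removed"
    return "keep"


# Hypothetical numbers for one route.
good = CompressorTrial(
    tokens_saved=1600,
    input_price_per_mtok=2.0,
    compressor_cost_usd=0.0004,
    prefill_ms_saved=40.0,
    compressor_ms=25.0,
    guardrail_drops_pts={
        "citation_precision": 0.5,
        "schema_pass_rate": 0.0,
        "safety_pass_rate": 0.0,
    },
)
slow = replace(good, compressor_ms=80.0)
lossy = replace(good, guardrail_drops_pts={"citation_precision": 6.0})

for trial in (good, slow, lossy):
    print(verdict(trial))
```

Whether a latency regression should be a hard reject is a product decision; a back-office route might accept it for the cost saving, which is why the rule belongs per route rather than globally.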

6) The cheap prompt that hallucinated because evidence was gone

Walk the failure from top to bottom. The user action enters the API. The application builds a prompt or route. The runtime spends tokens, queue time, cache memory, or output steps. The dashboard records a local improvement. Then the user-visible metric moves the wrong way.

That failure is not bad luck. It is what happens when the optimization changes one layer and the observation stops one layer too early. In a review, Maya asks for the missing link: where did the pressure go after the local metric improved? If nobody can answer, the change ships behind a small canary or does not ship.

7) Signals that reveal whether prompt compression is healthy

  • Healthy behavior: tokens fall while grounded accuracy and safety pass rate stay stable.
  • First degrading metric: citation precision or schema pass rate drops after prompt edits.
  • Misleading beginner metric: prompt length alone, because shorter can be worse.
  • Expert graph: quality metrics by removed-token category: evidence, policy, examples, history, formatting.

Mini-FAQ. "Why not watch the simplest metric?" Because the simplest metric is often the one the optimization directly manipulates. You need a paired guardrail that shows whether the system merely moved pain into another layer.
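The expert graph in the list above is a group-by over eval results: each row says which category of text was removed and whether the answer still passed the guardrail. A minimal sketch, assuming eval rows arrive as `(removed_category, passed)` pairs; the sample rows are invented for illustration.

```python
from collections import defaultdict


def pass_rate_by_category(rows):
    """Return {removed_category: pass rate} from (category, passed) pairs."""
    counts = defaultdict(lambda: [0, 0])  # category -> [passed, total]
    for category, passed in rows:
        counts[category][0] += int(passed)
        counts[category][1] += 1
    return {cat: passed / total for cat, (passed, total) in counts.items()}


# Invented eval rows: which category was removed, and did the answer pass?
rows = [
    ("evidence", False),
    ("evidence", False),
    ("evidence", True),
    ("formatting", True),
    ("formatting", True),
    ("examples", True),
    ("examples", False),
]

rates = pass_rate_by_category(rows)
# List the categories whose removal hurt most first.
for category, rate in sorted(rates.items(), key=lambda item: item[1]):
    print(f"{category}: {rate:.0%} pass rate after removal")
```

A single aggregate pass rate would hide the pattern this slice exposes: removing formatting is free while removing evidence is not.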

8) Boundaries where the chapter's lever works and where it turns pathological

  • Strong fit: redundant retrieval, repeated instructions, long chat history, and verbose examples.
  • Pathology: legal, medical, or safety tasks where small omissions change obligations.
  • Scale or workload limit: when compression model cost or latency exceeds the savings.

This boundary is not a disclaimer. It is a routing rule for engineering attention. The best optimization in one endpoint can be the wrong default for another endpoint with different latency tolerance, risk, context length, or outcome value.

9) Wrong mental model to replace

The wrong model is “shorter prompt equals better prompt.” The better model is budgeted evidence: remove duplicate or low-yield text while preserving the clauses that control correctness.

The replacement model should change how you speak in design review. Do not say, "this reduces cost" or "this improves latency" without naming the request slice, expected magnitude, guardrail, and rollback trigger. Say which meter ticks, ETA call, carpool lane, or boot space pressure changed.

10) Other failure shapes you will recognize

  1. Retrieval failure: top-k cut without a recall measurement.
  2. Summary failure: summaries lose numbers or negations.
  3. Policy failure: policy blocks shortened until guardrails weaken.
  4. Examples failure: examples removed although they anchor output shape.
  5. History failure: history compressed without preserving user commitments.
  6. Compression failure: the compression model hallucinates a summary.
  7. Token failure: token savings reported without quality regression slices.
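Some of these shapes can be caught with cheap lexical checks before any model-graded eval runs. The sketch below targets the summary failure: it flags numbers and negation words that appear in the original but not in the summary. It is a crude heuristic under stated assumptions: the negation word list is hypothetical and incomplete, contractions such as "don't" are not handled, and a paraphrased number ("thirty") would be reported as lost.

```python
import re

# Assumed, incomplete list of words whose loss can flip an obligation.
NEGATIONS = {"not", "no", "never", "without", "except", "unless"}

_NUMBER = re.compile(r"\d+(?:\.\d+)?%?")
_WORD = re.compile(r"[a-z']+")


def lost_facts(original: str, summary: str) -> dict:
    """Return the numbers and negation words present in original but not summary."""
    numbers_lost = set(_NUMBER.findall(original)) - set(_NUMBER.findall(summary))
    negations_original = NEGATIONS & set(_WORD.findall(original.lower()))
    negations_summary = NEGATIONS & set(_WORD.findall(summary.lower()))
    return {
        "numbers": numbers_lost,
        "negations": negations_original - negations_summary,
    }


original = "Refunds are not available after 30 days unless the item is defective."
summary = "Refunds are available for defective items."

report = lost_facts(original, summary)
print(report)
```

A non-empty report does not prove the summary is wrong, only that it deserves a look before the compressed history replaces the full one.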

11) Cross-topic reinforcement — the same pressure shape returns

  • Cost anatomy values the token reduction only at workflow level.
  • Latency anatomy checks whether shorter prompts improve TTFT.
  • KV cache benefits because active context shrinks.
  • Output length control is the sibling lever on the generated side.

12) Design-review questions that catch shallow plans

  1. Which prompt tokens are load-bearing?
  2. Can you measure quality by removed-token category?
  3. Does compression save more than it costs?
  4. Can a reader reconstruct the answer evidence after compression?

Dynamic context assembly — the lever before compression

Compression shrinks the context you already decided to include. The lever before it decides what to include at all, per request. A fixed prompt that always ships the full system prompt plus the top-k retrieved chunks pays the maximum token bill on every query, including the easy ones that needed almost none of it.

Dynamic context assembly routes by query type and assembles only the context that query needs:

  • A simple FAQ-style query gets a short system prompt and one chunk.
  • A complex multi-hop query gets the full system prompt and five chunks.
  • Tool definitions, few-shot examples, and policy blocks load only when the route actually needs them.

The classifier that picks the route is the dispatch board from earlier chapters applied to context instead of to models. It carries the same risk: a wrong route starves a hard query of evidence and the answer degrades — the same failure as cutting a citation-bearing paragraph in the opening artifact, reached by a different path. So the guardrail is identical: measure quality by route, and keep a fallback that escalates to fuller context when the router's confidence is low.

Order matters: assemble first, compress second. Assembly removes whole sections the query never needed; compression trims redundancy from what remains. Run compression on a fixed prompt and you are tuning a number the router could have halved for free.
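The routing and fallback described above can be sketched as a lookup with a confidence floor. The two plans mirror the FAQ and multi-hop bullets; everything else is an assumption for illustration, including the route names, the 0.7 confidence threshold, and the choice to load tools only on the multi-hop route.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ContextPlan:
    system_prompt: str    # "short" or "full"
    chunks: int           # number of retrieved chunks to include
    include_tools: bool   # whether tool definitions are loaded


# Plans per route; the route names are hypothetical classifier labels.
PLANS = {
    "faq": ContextPlan(system_prompt="short", chunks=1, include_tools=False),
    "multi_hop": ContextPlan(system_prompt="full", chunks=5, include_tools=True),
}
FULL_CONTEXT = PLANS["multi_hop"]


def plan_context(route: str, confidence: float, min_confidence: float = 0.7) -> ContextPlan:
    """Pick the context plan for a route, escalating to full context when unsure."""
    if route not in PLANS or confidence < min_confidence:
        return FULL_CONTEXT
    return PLANS[route]


print(plan_context("faq", confidence=0.93))
print(plan_context("faq", confidence=0.41))       # low confidence escalates
print(plan_context("unknown", confidence=0.99))   # unknown route escalates
```

Escalating on doubt costs tokens on some easy queries, which is the intended trade: the expensive mistake is starving a hard query of evidence, not overfeeding an easy one.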

Where this shows up in production

  • Enterprise support bot — turns route, token, cache, retry, and outcome rows into cost per resolved ticket rather than model spend per message.
  • Coding assistant — separates inline completions from agentic edits because typing flow, repo context, and repair loops have different budgets.
  • Search answer product — pays for rewriting, retrieval, reranking, synthesis, citations, and judge calls as one user-visible answer.
  • Voice assistant — treats dead air, cancellation, and local fallback as product features because users notice 100 ms gaps.
  • Back-office summarizer — uses larger queues and batches because humans care about daily throughput more than first-token immediacy.
  • Commerce assistant — protects purchase-changing actions with stronger routes while letting read-only advice run cheaper.
  • Internal data copilot — attributes spend by tenant, dataset, prompt version, and tool path so one team cannot hide another team's budget.
  • Education tutor — spends tokens on safety and pedagogy rules, then watches whether shorter answers still teach well.
  • Legal review workflow — keeps evidence and citation context even when compression pressure is high because unsupported claims are worse than cost.
  • Healthcare intake helper — uses conservative routing and buffered streaming because safety checks are part of the latency path.
  • Marketing content tool — controls output length and variant count because creative generation can silently explode spend.
  • Incident-response copilot — prefers predictable latency and logs over clever savings during high-severity operations.

Recall — rebuild prompt compression from memory

  1. What concrete failure opened this chapter, and which artifact made it inspectable?
  2. What root cause made the naive fix insufficient?
  3. State the rule in one sentence without using vendor language.
  4. Which pressure does the mechanism relieve, and which new pressure can it create?
  5. Which operational signal degrades first when the mechanism is misapplied?
  6. Where is the boundary where this lever becomes pathological?
  7. How does this chapter reuse the fuel ledger or dispatch board from earlier chapters?
  8. What would you put in the rollback trigger for this optimization?

Interview Q&A

Q: What problem does prompt compression actually solve?

A: It relieves context cost and memory pressure without repeating the opening failure, where a 45% shorter prompt saved money but citation accuracy dropped from 92% to 71%. The important answer names the workflow metric, the moved pressure, and the guardrail that keeps the optimization honest.

Common wrong answer to avoid: It is just a generic way to make LLMs cheaper or faster.

Q: Why is the naive fix dangerous?

A: Because the naive fix attacks the visible symptom while the root cause is unstructured deletion. It can improve a local metric and damage outcome quality, tail latency, memory, or operational clarity.

Common wrong answer to avoid: If the local metric improves, the product improved.

Q: What artifact would you inspect first?

A: The first artifact is the joined request trace or table for this chapter: route/version, token buckets, latency stages, outcome, and the topic-specific signal. Without it, the team debates guesses.

Common wrong answer to avoid: I would start by changing model parameters.

Q: How do you know the optimization worked?

A: The relieved pressure improves while the guardrail holds: tokens fall while grounded accuracy and safety pass rate stay stable. You also watch the first degrading metric, citation precision or schema pass rate, for a drop after prompt edits.

Common wrong answer to avoid: The dashboard line for cost or latency went down.

Q: When should you avoid this lever?

A: Avoid it in the pathological case: legal, medical, or safety tasks where small omissions change obligations. The workload shape must match the mechanism, or the optimization moves cost into quality, memory, or user wait.

Common wrong answer to avoid: Use it everywhere because it is a best practice.

Q: What is the common wrong mental model?

A: The wrong model is “shorter prompt equals better prompt.” The better model is budgeted evidence: remove duplicate or low-yield text while preserving the clauses that control correctness.

Common wrong answer to avoid: The obvious intuition is good enough.

Q: How does this connect to earlier chapters?

A: It reuses the workflow boundary from cost anatomy, the stage trace from latency anatomy, and the fuel ledger discipline. The lever should be explained as pressure relief plus a new pressure, not as an isolated trick.

Common wrong answer to avoid: Earlier chapters are background only; this mechanism stands alone.

Apply now (10 min)

Step 1 — model the exercise. Use the modeled table in this chapter, then pick one feature you own or know. Fill in the same rows with rough numbers, change one assumption, and reproduce from memory which metric should improve, which guardrail could fail, and what rollback trigger you would set.

Step 2 — your turn. Pick a real LLM feature and write the same artifact with your own rough numbers. Name the pressure relieved, the pressure created, the owner, and the metric that would prove the change unsafe.

Step 3 — reproduce from memory. Close the file and redraw the two diagrams: request trace and pressure transfer. Then restate the rule and the first degrading metric without looking.

What you should remember

This chapter explained why the opening failure is not solved by changing one local knob. The useful move is to make the request boundary inspectable, apply the topic rule, and watch the paired guardrail so the optimization cannot hide its cost in another subsystem.

You learned to describe the lever as pressure movement: what it relieves, what it creates, and which team or resource absorbs the new cost. That is the difference between a trick and an operating practice.

Carry the diagnostic forward: if the dashboard cannot show the artifact, the route or version, the user outcome, and the first degrading signal in one place, the optimization is not yet reviewable.

Remember:

  • Compress redundancy, not obligations: keep task, policy, evidence, and output contract before trimming style and duplicates.
  • The first artifact to inspect is the trace or table that exposes the opening failure: a 45% shorter prompt saves money, but citation accuracy drops from 92% to 71%.
  • The first degrading signal is a drop in citation precision or schema pass rate after prompt edits.
  • The misleading beginner signal is: prompt length alone, because shorter can be worse.
  • Every optimization must name what pressure it relieves, what pressure it creates, and who owns the new cost.

Bridge. Compression controls what the model reads. The next pressure is what the model writes: output can dominate cost and total latency unless the answer has a clear stopping contract.

./09-output-length-control.md