Skip to content

08. Prompt and response caching

Cost determines whether a call should be made. Caching determines whether it actually has to be. Every cache hit is a cost not paid and a latency saved. The gateway is the right layer to operate the cache because it sees every call and every provider response in a unified shape.


A backend engineer at a Chennai retail platform looks at the gateway dashboard and sees that the summary endpoint — used to generate end-of-week reports for ten thousand stores — issues the same call hundreds of times per night. Each store's summary is computed from a query that returns identical inputs for stores that had no activity in the period. Without a cache, the provider serves the same response repeatedly at full cost. With an exact-match cache keyed on the normalised request, the second through ninety-eighth calls return in under twenty milliseconds at a fraction of a paisa. The engineer estimates the savings: thousands of dollars per month, more on bursty Mondays. The cache is added in a sprint; the savings are real; the cache later catches a downstream regression when the hit rate suddenly drops, which the team investigates as a separate signal entirely.

This chapter is the caching layer. Two kinds of cache; explicit eligibility; honest invalidation; and the security thinking that prevents the cache from becoming a leak.


Two kinds of cache

Cache Match Use case
Exact-match The unified request hashes identically Repeated identical calls; deterministic pipelines; idempotent generation
Semantic The request is semantically similar to a prior one Knowledge-base lookups; FAQ-shaped queries; chat with frequent paraphrases

Most platforms operate exact-match. Semantic caches are powerful when used carefully and dangerous when used carelessly. This chapter covers both with appropriate warnings on the second.


Exact-match cache

The simplest, safest, highest-confidence cache. Two identical requests produce the same cached response.

Key construction

The cache key is a hash over the unified request — specifically the fields that determine the model's output. A reasonable construction:

def cache_key(req):
    canonical = {
        "model_alias": req.model_alias,
        "model_version_pinned": route_resolve_concrete(req),  # the actual model version
        "messages": req.prompt.messages,
        "parameters": {
            "temperature": req.parameters.temperature,
            "max_output_tokens": req.parameters.max_output_tokens,
            "stop_sequences": sorted(req.parameters.stop_sequences),
            "tools": canonicalise_tools(req.parameters.tools),
            # temperature=0 or deterministic seeds are required for safe exact-match
        },
    }
    return sha256(canonical_json(canonical))

Three rules in the key:

  • Resolve the concrete model version, not just the alias. If the alias mapping changes (chapter 03), the cache should not serve responses from the old model.
  • Canonicalise. Sort keys; sort arrays where order is semantically insensitive; normalise whitespace; lowercase certain identifiers if appropriate.
  • Include parameters that affect output. Temperature, top_p, stop sequences, tool definitions. Exclude parameters that do not affect output (trace IDs, request IDs, idempotency keys).

Eligibility

Not every call is cache-eligible. Default to ineligible; the caller (or per-alias policy) opts in. Common ineligibility cases:

  • temperature > 0 — randomness defeats caching unless seeds are pinned and reproducible
  • Tool-calling calls — the agent's next step depends on tool outputs that change
  • Streaming calls — the response is consumed as it streams
  • Calls whose response is user-visible and the user expects a fresh response

The policy is per-alias and per-feature:

aliases:
  fast-summariser:
    cache:
      exact:
        eligible_when:
          temperature: 0
          stream: false
        ttl_seconds: 3600
        max_size_bytes: 50000

Storage

Exact-match caches are typically Redis or a similar key-value store. Per-call latency: 1–3 ms for a hit. Storage: typically gigabytes for active workloads; bounded by TTL and an LRU eviction policy.

TTL and invalidation

TTL is the simplest invalidation. Most prompts age out within hours to days. The TTL is per-alias policy; sensible defaults are 1 hour for interactive workloads (recent inputs likely to be reused), 24 hours for batch pipelines, 7 days for stable knowledge-base lookups.

Explicit invalidation is needed when:

  • The model version behind the alias changes (chapter 09's deprecation/promotion)
  • The system prompt embedded in the call changes
  • The underlying data the prompt references becomes stale (the cache cannot know this directly; the caller must invalidate)

A useful pattern: include a cache_version field in the unified request that the caller can bump to force a refresh. The gateway includes it in the cache key, so a bump invalidates all entries with the old version.

Hit-rate economics

Per call, the cache is cheap. Across a platform, hit rates of 20–60% are typical for production workloads with re-use patterns. Each hit saves the full provider cost and most of the latency. Tracking the hit rate per alias is a first-class metric:

Metric What it tells you
Hit rate per alias How much repeat structure exists for this workload
Hit rate drop A change in input distribution — investigate
Hit rate spike A bug producing identical calls — investigate
Cost saved by cache per period The dollar value of the cache layer

Semantic cache

A semantic cache returns a prior response when a similar request arrives, even if not byte-identical.

How it works

On a miss, the cache stores both the cached response and the embedding of the request. On a subsequent call, the new request's embedding is compared (cosine similarity) to stored embeddings; if the closest is above a similarity threshold, the cached response is returned.

def semantic_lookup(req, threshold=0.92):
    emb = embed(canonical_prompt(req))
    nearest, score = vector_store.nearest(emb, k=1)
    if score >= threshold and nearest.alias == req.model_alias:
        return nearest.response
    return None

Where it is appropriate

Semantic caches work well when:

  • Inputs cluster around a small number of intents (FAQ, common queries)
  • Slight variation in wording does not change the appropriate response
  • The response is not highly user-specific or context-dependent

They are dangerous when:

  • The response is user-specific (asking "show my balance" should not return another user's balance)
  • The response is context-dependent on details that may not be in the prompt
  • The cost of returning a slightly wrong answer is high

Required safeguards

A semantic cache that ignores these will produce wrong answers in production.

  • Strict tenancy partitioning. Cache entries are scoped per tenant. A semantic match across tenants is a breach.
  • Strict user partitioning for user-specific queries. Two users asking the same question should not share a response when the answer depends on their identity.
  • Strict alias partitioning. A fast-summariser query does not match a smart-reasoner cache entry.
  • Strict context partitioning. If the prompt depends on time-of-day, location, or other context, partition the cache by it.
  • Threshold tuned by eval. The similarity threshold is not a guess; it is validated against an eval that measures incorrect-cache-hit rate.

A reasonable starting partition key: (tenant_id, user_id (if relevant), feature_id, alias, model_version, context_signature). The cache lookup is over similar prompts within this partition.

Eligibility

Even narrower than exact-match. Default policy is to disable semantic caching; enable per-alias for workloads where it is verified safe.

Hit-rate trade

Semantic caches can lift hit rates substantially (sometimes 2–3x over exact-match) at the cost of complexity and the risk of incorrect hits. The decision is workload-specific; many platforms use exact-match only and accept lower hit rates as a fair trade.


Provider-side caching

Some providers (Anthropic, OpenAI) offer their own prompt caching — repeated portions of a prompt (e.g., a long system prompt) are cached server-side at a reduced per-token cost. The gateway can take advantage of this by:

  • Structuring the call so the cacheable portion is at the start
  • Signalling to the provider which portion to cache
  • Tracking cache_read_tokens and cache_write_tokens in the audit and cost calculation (chapter 07)

Provider-side caching is complementary to gateway-side caching. Provider-side reduces cost on repeated long contexts; gateway-side eliminates the call entirely on identical requests.


Cache poisoning concerns

A cached entry that contains incorrect content is more dangerous than a fresh wrong answer, because it serves the wrong answer repeatedly at near-zero latency.

Two classes of poisoning:

Inadvertent. A bug causes a wrong response to be cached. The wrong response is served to many callers before the cache expires. Mitigation: do not cache responses that the gateway recognises as errors or degraded. Cache only ok: true responses. Treat cached responses with the same postcondition checks (chapter 11, drift detection) as fresh responses.

Adversarial. An attacker manages to inject content that produces a cached response benefiting them on subsequent calls. Less common in well-isolated gateways but a real concern in semantic caches across users. Mitigation: strict user/tenant partitioning; never cache user-controlled content unless it is the output, not the cache key shape.


How the cache interacts with the other surfaces

  • Routing (chapter 03) — cache lookup occurs before routing decides anything heavy; a hit short-circuits the route resolution.
  • Fallback (chapter 04) — the cache is a chain step when same-quality candidates are exhausted; a cached degraded answer is sometimes preferable to a refusal.
  • Quota (chapter 05) — cache hits do not consume the per-provider bucket (no provider call); they may consume a small "cache RPM" tracked separately.
  • Cost (chapter 07) — cache hits are billed at the cache's own cost (storage + lookup), not the provider's price.
  • Audit (chapter 11)cache_status on every audit record: miss, exact_hit, semantic_hit. Dashboards aggregate.

How to recognise broken caching in the wild

  • The hit rate is unknown or unmonitored
  • Cache hits are not surfaced in audit
  • The cache crosses tenants without partitioning
  • Semantic caching is enabled with no threshold validation against an eval
  • Cache invalidation on model promotion is manual or skipped
  • Errors or degraded responses are cached

Interview Q&A

Q1. What goes into a cache key for an exact-match cache, and what stays out? In: the resolved concrete model version (not just the alias), the messages, the parameters that affect output (temperature, max_output_tokens, stop sequences, tools, top_p), and any cache version bump the caller specifies. Out: trace IDs, idempotency keys, request IDs, parameters that do not affect output, the tenant or caller identity unless cache is partitioned by them (in which case the partition is separate from the key). Canonicalise everything that goes in. Wrong-answer notes: including identity in the key for an exact-match cache is over-segmentation; including timing-sensitive fields prevents hits.

Q2. Should you enable semantic caching for a user-facing chat application? Only with strict per-user partitioning, threshold validated by eval, and a cache miss path that is fast. The risk is returning another user's response or returning a response that no longer reflects the user's current context. The reward is higher hit rate. The decision is per-feature; most chat applications start with exact-match only, then introduce semantic carefully if hit rates and evals support it. Wrong-answer notes: "yes, it saves cost" without the safeguards is the breach.

Q3. The cache hit rate for a high-traffic alias suddenly drops from 35% to 5%. What does that suggest? Some input distribution change. Possible causes: a product update changed the prompt template (the prompt portion of the key changed); the alias was rebound to a different concrete model (the model version in the key changed and invalidated the cache); the calling pattern shifted to more unique inputs; an upstream bug is making each call slightly different (e.g., a timestamp slipped into the prompt). Investigation: pull audit before/after the drop, diff cache keys to find which field is varying. Wrong-answer notes: "the cache is broken" before diagnosing the input is jumping to a fix.

Q4. Why never cache an error response? Because the cause may be transient and the cached error keeps the call from succeeding when the cause has cleared. Errors are also more likely to be sensitive in their content (stack traces, internal codes) than success responses. Caching errors converts a recoverable hiccup into an extended outage for any caller whose inputs hash to the same key. The discipline is: cache only ok: true responses; errors flow through every time. Wrong-answer notes: "to avoid hammering the provider" is what retry policy and backoff are for; the cache is the wrong mechanism.


What to do differently after reading this

  • Implement exact-match caching first; measure hit rates per alias.
  • Cache only successful responses. Confirm the implementation refuses to cache errors and degraded responses.
  • Surface cache_status on every audit record and dashboard.
  • Defer semantic caching until exact-match is established and an eval validates the threshold for the proposed workload.
  • Document and test cache invalidation on alias rebinding and model promotion.

Bridge. Caching is the friend you want. Drift is the enemy you must detect. Models retire; behaviour shifts; providers ship changes on their own cadence. The next chapter is the gateway's response to model and provider drift — pinning, deprecation calendars, dual-running, and the alarms that catch silent shifts. → 09-provider-drift-and-deprecation.md