09. Rate limiting and abuse control: the control tower manages traffic before the runway jams

~13 min read. Safety is not only about content; it is also about who is stressing the system and how fast.

Built on the ELI5 in 00-eli5.md. The control tower — the part watching the whole airport — decides when traffic is normal, abusive, or too expensive to keep clearing.

Abuse often looks like scale, not evil language

A request can be perfectly polite and still be abusive. Scrapers harvest outputs. Attackers flood retries. Curious users trigger huge tool chains. Broken clients spin in loops. Content filters alone will not catch this.

That is why the control tower matters. It sees patterns across time, user IDs, IPs, API keys, organizations, and workflows. A single completion request cannot see that full traffic picture.

single request view                control tower view
┌──────────────┐                   ┌─────────────────────┐
│ looks normal │                   │ 600 requests / min  │
│ small prompt │                   │ same key, same path │
└──────────────┘                   │ rising token burn   │
                                   └─────────────────────┘

See. Abuse is often a sequence problem. The individual message is not the whole story.
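The tower view can be sketched as a per-key sliding-window counter. This is a minimal in-memory illustration, not a production limiter. The class name `SlidingWindowCounter`, the key `"key-123"`, and the 60-second window are assumptions made for the example. A real deployment would likely keep these counters in a shared store.

```python
from collections import defaultdict, deque


class SlidingWindowCounter:
    """Counts events per key over the trailing `window_seconds`."""

    def __init__(self, window_seconds: float = 60.0):
        self.window_seconds = window_seconds
        self._events = defaultdict(deque)  # key -> timestamps, oldest first

    def record(self, key: str, now: float) -> int:
        """Record one event at time `now`; return the count inside the window."""
        timestamps = self._events[key]
        timestamps.append(now)
        # Drop timestamps that have fallen out of the trailing window.
        while timestamps and timestamps[0] <= now - self.window_seconds:
            timestamps.popleft()
        return len(timestamps)


counter = SlidingWindowCounter(window_seconds=60.0)

# 600 small, polite-looking requests from one key, spread over one minute.
counts = [counter.record("key-123", now=i * 0.1) for i in range(600)]
print(counts[0], counts[-1])  # first request sees 1, last request sees 600
```

Each request looks normal on its own. Only the running count per key shows the 600 requests per minute from the diagram.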

Rate limits should exist on several dimensions

Many teams start with requests per minute. Good start. Not enough.

Also limit tokens per minute, cost per day, tool calls per session, concurrent sessions per user, retrieval depth, file-upload size, and failed-validation retries. Each dimension closes a different loophole.

A user who sends ten enormous prompts can be more expensive than a user who sends one hundred tiny prompts. A user who repeatedly triggers browsing or code-execution tools may create outsized risk. The control tower needs multidimensional budgets.

Simple, no? Count what actually consumes risk or money.
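One way to picture a multidimensional budget is a single object that charges every dimension together and refuses the request if any one would overflow. This is a sketch under assumptions: the `Budget` class, the dimension names, and the limit values are hypothetical, and window resets (per minute, per day) are left out to keep it short.

```python
from dataclasses import dataclass, field


@dataclass
class Budget:
    limits: dict                      # dimension name -> maximum allowed
    used: dict = field(default_factory=dict)

    def try_consume(self, **amounts) -> list:
        """Charge all dimensions together.

        Returns the dimensions that would be exceeded. An empty list means
        the request was allowed and the usage was recorded.
        """
        exceeded = [
            name for name, amount in amounts.items()
            if self.used.get(name, 0) + amount > self.limits[name]
        ]
        if not exceeded:
            for name, amount in amounts.items():
                self.used[name] = self.used.get(name, 0) + amount
        return exceeded


# Hypothetical limits for one user.
budget = Budget(limits={"requests_per_min": 60, "tokens_per_min": 20_000})

# Ten enormous prompts: few requests, many tokens.
results = [
    budget.try_consume(requests_per_min=1, tokens_per_min=5_000)
    for _ in range(10)
]
print(results[4])  # the fifth prompt trips the token budget, not the request budget
```

A requests-per-minute limit alone would have let all ten prompts through. The token dimension is what stops them.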

Worked example: scraping through a helpful endpoint

Suppose you run a premium summarization API. One customer key starts making 50 requests per minute. Each request asks for "summarize this webpage" with a different URL. The prompts are short. The content is public. Toxicity filters see nothing suspicious.

But the cost pattern looks bad. Total fetched pages spike. Token spend triples. The caller never opens the UI; the traffic is pure automated extraction.

A strong control tower response might be this.

signals
├── requests per minute      → high
├── fetched pages per hour   → high
├── average session length   → near zero
├── token spend per day      → above plan
└── repeat pattern entropy   → very low

action
├── throttle to lower rate
├── require captcha or re-auth
├── disable expensive browse tool
└── alert abuse operations

Now what is nice here? The system did not need to prove malicious intent. It only needed to enforce safe operating budgets and investigate anomalous behavior.
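The signal-to-action mapping above can be written as a few plain rules. This is a hedged sketch: the function `choose_actions`, the signal keys, and every threshold in `plan` are invented for illustration. Note that it never tries to judge intent. It only compares measurements with budgets.

```python
def choose_actions(signals: dict, plan: dict) -> list:
    """Map budget breaches to responses, in the order listed in the text."""
    actions = []
    if signals["requests_per_min"] > plan["max_requests_per_min"]:
        actions.append("throttle")
    if signals["token_spend_per_day"] > plan["max_token_spend_per_day"]:
        actions.append("require_reauth")
    if signals["fetched_pages_per_hour"] > plan["max_fetched_pages_per_hour"]:
        actions.append("disable_browse_tool")
    # Very repetitive request shapes plus any breach is worth a human look.
    looks_scripted = signals["pattern_entropy_bits"] < plan["min_pattern_entropy_bits"]
    if actions and looks_scripted:
        actions.append("alert_abuse_ops")
    return actions


# Hypothetical plan limits and the scraper's observed signals.
plan = {
    "max_requests_per_min": 20,
    "max_token_spend_per_day": 1_000_000,
    "max_fetched_pages_per_hour": 500,
    "min_pattern_entropy_bits": 1.0,
}
scraper = {
    "requests_per_min": 50,
    "token_spend_per_day": 3_000_000,
    "fetched_pages_per_hour": 3_000,
    "pattern_entropy_bits": 0.2,
}
scraper_actions = choose_actions(scraper, plan)
print(scraper_actions)
```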

Budgeting is a guardrail for both cost and harm

Rate limits protect infrastructure. They also reduce abuse success. That dual role matters.

If a jailbreak campaign can only attempt five prompts per minute instead of five hundred, discovery slows. If a fake-account farm hits daily cost caps, the blast radius shrinks. If an agent loop is bounded to three tool retries, prompt injection has less room to explore.

The passport desk and control tower cooperate here. The desk caps single-request size. The tower caps aggregate behavior. One local. One global.
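The three-retry bound on an agent loop can be sketched as a small wrapper. Everything named here is an assumption for illustration: `ToolError`, `ToolRetryBudgetExceeded`, `call_tool_with_retry_cap`, and the failing `flaky_fetch` tool are hypothetical, not part of any particular agent framework.

```python
class ToolError(Exception):
    """Hypothetical error type raised by a tool when a call fails."""


class ToolRetryBudgetExceeded(Exception):
    """Raised when a tool keeps failing after the allowed retries."""


def call_tool_with_retry_cap(tool, max_retries: int = 3, **kwargs):
    """Call `tool` once, then retry at most `max_retries` times on ToolError."""
    last_error = None
    for _ in range(1 + max_retries):
        try:
            return tool(**kwargs)
        except ToolError as exc:
            last_error = exc
    raise ToolRetryBudgetExceeded(
        f"gave up after {1 + max_retries} attempts"
    ) from last_error


calls = []


def flaky_fetch(url: str):
    """A tool that always fails, standing in for a loop that will not converge."""
    calls.append(url)
    raise ToolError("upstream timeout")


try:
    call_tool_with_retry_cap(flaky_fetch, url="https://example.com")
    gave_up = False
except ToolRetryBudgetExceeded:
    gave_up = True

print(gave_up, len(calls))  # the loop stops after 1 call + 3 retries
```

The cap is a local rule. The tower still needs aggregate budgets, because a caller can start many bounded loops.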

Abuse detection needs patterns, not only thresholds

Hard thresholds are useful. Add them. But also look for suspicious patterns.

Repeated near-identical prompts across many accounts. Sudden geography shifts. High refusal rates from one tenant. Spikes in prompt-injection classifier scores. Long inputs with little user-visible value. Bursts of failed schema validations. These are abuse indicators.

A compact detector can mix rules and scores.

abuse score =
  0.25 * request_rate_zscore
+ 0.20 * token_spend_zscore
+ 0.20 * refusal_rate
+ 0.15 * validation_fail_rate
+ 0.20 * prompt_similarity_cluster_score

Picture before formula, yes? The point is not the exact numbers. The point is that many weak signals together can justify investigation faster than any single threshold.
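The formula translates directly into code. The weights below are the ones from the formula; the `zscore` helper, the sample history, and the sample signal values are assumptions for illustration. One caveat worth hedging: z-scores are unbounded while rates sit between 0 and 1, so a real detector would probably put the inputs on a comparable scale first.

```python
import statistics

WEIGHTS = {
    "request_rate_zscore": 0.25,
    "token_spend_zscore": 0.20,
    "refusal_rate": 0.20,
    "validation_fail_rate": 0.15,
    "prompt_similarity_cluster_score": 0.20,
}


def zscore(value: float, history: list) -> float:
    """How many standard deviations `value` sits above the historical mean."""
    std = statistics.pstdev(history)
    if std == 0:
        return 0.0  # flat history: no meaningful deviation to report
    return (value - statistics.mean(history)) / std


def abuse_score(signals: dict) -> float:
    """Weighted sum of weak signals, exactly as in the formula above."""
    return sum(weight * signals[name] for name, weight in WEIGHTS.items())


# Hypothetical tenant: traffic well above baseline, many refusals,
# and prompts that cluster tightly together.
signals = {
    "request_rate_zscore": 4.0,
    "token_spend_zscore": 3.0,
    "refusal_rate": 0.5,
    "validation_fail_rate": 0.2,
    "prompt_similarity_cluster_score": 0.9,
}
score = abuse_score(signals)
print(round(score, 2))  # 1.0 + 0.6 + 0.1 + 0.03 + 0.18 = 1.91
```

No single input here would necessarily cross a hard threshold. The combined score is what flags the tenant for investigation.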

Graceful degradation beats full outage

Now what is the practical response? Not every spike needs a hard block. Sometimes you degrade gracefully.

Lower model size. Reduce max tokens. Disable browsing. Require authentication. Queue low-priority jobs. Ask the caller to retry later. Move from agentic mode to simple answer mode. This keeps the airport open while protecting the runway.

The control tower should have playbooks, not improvised panic. Abuse control is part policy, part SRE, part product design.

Look. Rate limiting is safety engineering because it protects availability, cost discipline, and attack surface together.
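A playbook can be as simple as a table from a pressure score to a service mode. This is a sketch under assumptions: the `ServiceMode` fields, the model names `"large"` and `"small"`, the token caps, and the score thresholds are all hypothetical. The score could come from an abuse detector or from plain load metrics.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ServiceMode:
    model: str
    max_tokens: int
    browsing_enabled: bool
    agentic: bool


# (minimum score, mode) pairs, ordered from normal service to most degraded.
PLAYBOOK = [
    (0.0, ServiceMode("large", 4096, browsing_enabled=True, agentic=True)),
    (1.0, ServiceMode("large", 2048, browsing_enabled=False, agentic=True)),
    (2.0, ServiceMode("small", 1024, browsing_enabled=False, agentic=False)),
]


def select_mode(score: float) -> ServiceMode:
    """Pick the most degraded mode whose threshold the score has reached."""
    chosen = PLAYBOOK[0][1]
    for threshold, mode in PLAYBOOK:
        if score >= threshold:
            chosen = mode
    return chosen


for s in (0.3, 1.5, 2.5):
    print(s, select_mode(s))
```

The service stays up at every step. Each tier only removes the most expensive or riskiest capability first.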

Where this lives in the wild

OpenAI API platform (infrastructure engineer): enforces per-key and per-organization quotas to manage fairness, abuse, and spend.
Anthropic API tenants (platform operations lead): needs token and request budgets so one customer cannot exhaust shared capacity.
Perplexity-style browse products (abuse analyst): watches for scraping patterns that exploit expensive retrieval and browsing endpoints.
Enterprise copilots with actions (platform architect): caps tool invocations and retries so agent loops do not run away.
Consumer chat apps (trust and safety operator): uses account-level velocity checks to slow jailbreaking campaigns and spam account farms.

Pause and recall

Why can polite-looking traffic still be abusive?
Which dimensions besides requests per minute should a control tower limit?
How does rate limiting reduce both cost risk and safety risk?
Why is graceful degradation often better than a total hard block?

Interview Q&A

Q: Why limit tokens and tool calls in addition to request count?
A: Because cost and risk scale with prompt size and action depth, not just with how many HTTP requests were made.
Common wrong answer to avoid: "Because token limits only matter for billing dashboards."

Q: Why is abuse detection a cross-request problem rather than a single-request classifier problem?
A: Because many harmful behaviors emerge only from velocity, repetition, or aggregate cost patterns visible over time.
Common wrong answer to avoid: "Because single requests never matter for abuse."

Q: Why prefer graceful degradation to immediate global blocking during suspicious spikes?
A: Because it preserves service for legitimate users while still shrinking the blast radius and buying time for investigation.
Common wrong answer to avoid: "Because attackers stop once the model gets slightly smaller."

Q: Why should refusal-rate and validation-failure spikes feed abuse analytics?
A: Because concentrated failures often signal probing behavior, jailbreak exploration, or automated misuse rather than ordinary user confusion.
Common wrong answer to avoid: "Because refusals are unrelated to security once content filters exist."

Apply now (5 min)

Exercise. Write one simple budget policy for an AI endpoint. Include request rate, token rate, daily spend, and tool-call cap. Then invent one scraper pattern and decide which signal the control tower should alert on first.

Sketch from memory. Draw two views. One box for single-request checks at the passport desk. One tower for per-user and per-tenant trends across time. Add one graceful-degradation action.

Bridge. By now we have many checkpoints. The next question is practical: do we assemble these ourselves, or use guardrail frameworks that package part of the stack? → 10-guardrail-frameworks.md