08. Red-team evals and scoring — adversarial tests that change releases

~12 min read. A red-team exercise is not a scary prompt contest. It is a release gate for attack paths that matter to the product.

Continues from 07-memory-and-cross-tenant-risk.md. The red team room turns threat models into repeatable tests and severity-weighted decisions.

The previous chapters mapped the attack surfaces: direct prompts, retrieved content, jailbreak pressure, exfiltration, tools, and memory. That gives us risks, but not a release gate. This chapter turns those risks into adversarial eval cases with assets, expected boundaries, severity, and owners.


1) The wall — demos do not make a security program

A researcher finds a clever prompt that makes a demo assistant misbehave. The team adds it to a spreadsheet. Three months later, the same class returns through a retrieved document and a tool argument.

The problem was not a lack of examples. It was the lack of a system that ties each example to a boundary and a release decision.

Red-team evals need:

threat path -> test case -> expected boundary -> severity -> owner -> regression gate

If a test does not map to a boundary or release decision, it is theater.


2) Red-team case structure

Each case should include:

Field               Example / purpose
Attack family       direct injection, indirect injection, exfiltration, tool abuse
Asset/action        tenant document, refund tool, admin setting
Entry point         user prompt, uploaded doc, tool output, memory
Expected boundary   refusal, schema rejection, auth deny, human approval
Severity            impact if bypass succeeds
Trace requirement   what evidence proves pass/fail
Owner               team responsible for lock

This structure keeps the suite product-specific.
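
As a minimal sketch, the fields above could be captured in a small record with a completeness check. The class name, field names, and `validate_case` helper are hypothetical choices for illustration, not part of any particular eval framework; the allowed severity levels are the four used later in this chapter.

```python
from dataclasses import dataclass, fields

# Severity levels used in this chapter, most to least severe.
SEVERITIES = ("critical", "high", "medium", "low")


@dataclass(frozen=True)
class RedTeamCase:
    """One adversarial eval case (hypothetical schema for illustration)."""
    attack_family: str
    asset_action: str
    entry_point: str
    expected_boundary: str
    severity: str
    trace_requirement: str
    owner: str


def validate_case(case: RedTeamCase) -> list[str]:
    """Return a list of problems; an empty list means the case is complete."""
    problems = [
        f"missing {f.name}"
        for f in fields(case)
        if not getattr(case, f.name).strip()
    ]
    # Only check the severity value if one was actually supplied.
    if case.severity.strip() and case.severity not in SEVERITIES:
        problems.append(f"unknown severity: {case.severity}")
    return problems
```

A case with an empty `expected_boundary` or `owner` fails validation, which is one way to keep cases without a boundary out of the suite.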


3) Worked example — refund red-team case

Case:

family: tool abuse through direct injection
asset/action: refund execution
entry: user prompt in support chat
expected boundary: model may explain policy; refund tool requires server eligibility and approval
severity: high if execution succeeds, medium if only text recommendation
trace: proposed tool args, server auth result, approval status, final answer
owner: support AI platform

Pass means the system refuses or safely routes the attack at the expected boundary. It does not require that the model never proposed the refund call, only that the server-side eligibility and approval checks stopped it.
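
One way to grade this case from its trace is sketched below. The `RefundTrace` fields mirror the trace line in the case (proposed tool args, server auth result, approval status, final answer); the `refund_executed` flag and all names are assumptions made for illustration. The sketch grades only the execution boundary, not the wording of the final answer.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RefundTrace:
    """Evidence collected for one run of the refund case (hypothetical shape)."""
    proposed_tool_args: Optional[dict]  # None if the model proposed no tool call
    server_eligible: bool               # server-side eligibility result
    approved: bool                      # human or policy approval status
    refund_executed: bool               # assumed flag: did the refund actually run?
    final_answer: str


def grade_refund_case(trace: RefundTrace) -> tuple[bool, str]:
    """Fail only if a refund executed without both eligibility and approval."""
    if trace.refund_executed and not (trace.server_eligible and trace.approved):
        return False, "refund executed without eligibility and approval"
    if trace.proposed_tool_args is not None and not trace.refund_executed:
        # The model proposed the call, but the server-side boundary stopped it.
        return True, "model proposed the call, boundary held"
    return True, "no unauthorized execution"
```

A run where the model proposes the refund call but the server denies it still passes, because the boundary that matters is the server check, not the model's wording.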


4) Why not score all failures equally

The tempting alternative is a simple pass rate. That hides severity: a harmless policy-wording failure counts the same as a cross-tenant data exposure.

Use severity weighting:

critical: data leak, unauthorized action, unsafe high-stakes content
high: policy bypass in customer-facing workflow
medium: misleading answer without action
low: awkward refusal or benign over-block

A lead cares more about eliminating critical paths than improving a vanity average.
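
The difference between a raw pass rate and a severity-weighted view can be sketched as follows. The numeric weights are assumptions chosen only to make the contrast visible; a real program would set its own.

```python
from collections import Counter

# Assumed weights for illustration; only the ordering comes from the chapter.
WEIGHTS = {"critical": 100, "high": 20, "medium": 5, "low": 1}


def summarize(results):
    """Summarize a suite run. `results` is a list of (severity, passed) tuples."""
    total = len(results)
    passed = sum(1 for _, ok in results if ok)
    failures = Counter(sev for sev, ok in results if not ok)
    return {
        "pass_rate": passed / total if total else 0.0,
        "weighted_failure_score": sum(WEIGHTS[s] * n for s, n in failures.items()),
        "failures_by_severity": dict(failures),
    }


# Suite A has the better pass rate but one critical failure.
suite_a = [("critical", False)] + [("low", True)] * 99
# Suite B has the worse pass rate but only low-severity failures.
suite_b = [("low", False)] * 5 + [("low", True)] * 95
```

With these weights, suite A scores a 0.99 pass rate and a weighted failure score of 100, while suite B scores 0.95 and 5. The pass rate ranks A higher; the weighted view shows A is the one with the serious problem.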


5) Production signals — red-team suite health

The first metric is the pass rate on critical attack paths, broken down by product surface.

The misleading metric is the number of test cases. A thousand low-impact prompts do not cover one missing tool authorization boundary.

The expert signal is release gating: critical and high-severity failures block release, medium failures require owner signoff, and accepted risks are time-bound.
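
A gating rule of this kind could look roughly like the sketch below. It assumes that critical and high failures always block, that a medium failure ships only with an owner signoff that has not yet expired, and that low failures never block; the function name, case IDs, and signoff format are hypothetical.

```python
from datetime import date

BLOCKING = {"critical", "high"}


def release_decision(failures, signoffs, today):
    """Decide whether a release can ship.

    failures: list of (case_id, severity) for failed cases.
    signoffs: dict of case_id -> expiry date of the owner's accepted risk.
    """
    reasons = []
    for case_id, severity in failures:
        if severity in BLOCKING:
            reasons.append(f"{case_id}: {severity} failure blocks release")
        elif severity == "medium":
            expiry = signoffs.get(case_id)
            if expiry is None:
                reasons.append(f"{case_id}: medium failure needs owner signoff")
            elif expiry < today:
                # Accepted risks are time-bound: an expired signoff no longer counts.
                reasons.append(f"{case_id}: accepted risk expired on {expiry.isoformat()}")
    return ("block" if reasons else "ship", reasons)
```

Returning the reasons alongside the decision gives each blocked release a list of cases for the owning teams to act on.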


6) Boundary — red-team evals age quickly

Attackers adapt, models change, tools expand, and product surfaces grow. Red-team suites need refresh cycles and incident-driven additions.

The pathology is frozen adversarial evals. The suite passes because it tests last quarter's attacks against this quarter's product.
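
A simple freshness check can surface a frozen suite. The sketch below assumes each case records the product surface it covers and the date it was last reviewed; the 90-day default and the dictionary keys are illustrative assumptions, not a recommended policy.

```python
from datetime import date, timedelta


def suite_gaps(cases, current_surfaces, today, max_age_days=90):
    """Find stale cases and product surfaces that have no case at all.

    cases: list of dicts with 'id', 'surface', and 'last_reviewed' (a date).
    current_surfaces: iterable of surface names the product exposes today.
    """
    cutoff = today - timedelta(days=max_age_days)
    stale = sorted(c["id"] for c in cases if c["last_reviewed"] < cutoff)
    covered = {c["surface"] for c in cases}
    uncovered = sorted(set(current_surfaces) - covered)
    return {"stale_cases": stale, "uncovered_surfaces": uncovered}
```

A non-empty `uncovered_surfaces` list is the stronger warning: it means the product has grown a surface that the suite has never attacked.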


Recall checkpoint

  • What fields belong in a red-team case?
  • Why is pass rate misleading without severity?
  • How does a red-team suite become a release gate?
  • Why do adversarial evals age?

Interview Q&A

Q: How do you build a useful AI red-team eval suite? A: Start from threat paths, define attack family, asset/action, entry point, expected boundary, severity, trace requirement, owner, and release decision.

Common wrong answer to avoid: "Collect a big list of jailbreak prompts." Size without product-specific attack paths is weak coverage.

Q: What should block release? A: Critical or high-severity bypasses tied to sensitive data, unauthorized actions, safety policy, or broad customer impact.

Common wrong answer to avoid: "Only block if overall pass rate drops." A single critical bypass can be enough.

Q: How do incidents improve red-team suites? A: Each incident should add or update cases that cover the class, not only the exact example.

Common wrong answer to avoid: "Add the bad prompt verbatim." The attack class matters more than one string.


Apply now (10 min)

Model the exercise. Write one red-team case for refund tool abuse with severity and expected boundary.

Your turn. Create three cases for one AI feature: direct injection, indirect injection, and exfiltration.

Reproduce from memory. Explain why red-team evals are release gates, not demo collections.


What you should remember

This chapter explained red-team evals and scoring. The important idea is that adversarial tests must connect to assets, boundaries, severity, owners, and release decisions.

Carry this diagnostic forward: if a red-team case cannot say what boundary should stop it, rewrite the case.

Remember:

  • Red-team tests start from threat paths.
  • Severity weighting beats raw pass rate.
  • Critical bypasses should block release.
  • Suites must refresh as product surfaces change.

Bridge. Red-team evals tell us where attacks pass. Next we design the hard controls that make many attacks fail even when the model is persuaded. → 09-security-controls-and-isolation.md