09. Security controls and isolation — hard boundaries around a persuasive model

~12 min read. The model can be helpful, confused, or manipulated. Security controls decide what still cannot happen.

Continues from 08-red-team-evals-and-scoring.md. The red team room found attack paths. Now the guard rails must turn those findings into hard boundaries.

The previous chapter made attacks measurable through red-team cases and severity. That tells us where the system breaks, but tests alone do not stop production harm. This chapter turns the red-team findings into least privilege, isolation, schemas, approvals, sandboxing, and server-side authorization.


1) The wall — model alignment is not sandboxing

A well-aligned model can still be wrong under pressure. A secure system assumes that sometimes the model will produce a bad plan, unsafe argument, or overconfident answer.

Controls should answer:

if the model is persuaded,
what can it still not read,
not call,
not reveal,
not write,
not spend,
not change?

That question shifts security from model personality to system architecture.
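One way to make that question concrete is a deny-by-default capability set that lives in system code. The sketch below is illustrative only: the `Boundary` class, its field names, and the `support_agent` grants are assumptions made for this example, not part of any specific framework.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Boundary:
    """Hypothetical capability set, fixed by the system, never by model output."""
    readable: frozenset = frozenset()
    callable_tools: frozenset = frozenset()
    writable: frozenset = frozenset()
    spend_limit_cents: int = 0

    def allows(self, verb: str, target: str = "", amount_cents: int = 0) -> bool:
        # Deny by default: anything not explicitly granted is refused.
        if verb == "read":
            return target in self.readable
        if verb == "call":
            return target in self.callable_tools
        if verb == "write":
            return target in self.writable
        if verb == "spend":
            return 0 < amount_cents <= self.spend_limit_cents
        return False  # verbs with no grant (reveal, change, delete, ...) are refused


# Assumed grants for a support agent: two read-only tools, no writes, no spend.
support_agent = Boundary(
    readable=frozenset({"refund_policy"}),
    callable_tools=frozenset({"search_refund_policy", "check_eligibility"}),
)
```

However persuasive the conversation, `support_agent.allows("spend", amount_cents=1)` stays `False` because no spend limit was ever granted.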


2) The core control families

| Control          | What it protects                                          |
|------------------|-----------------------------------------------------------|
| Least privilege  | model/tool only gets needed access                        |
| Tenant isolation | one tenant cannot influence or read another               |
| Tool allowlists  | model cannot call arbitrary capabilities                  |
| Typed schemas    | free text cannot become unsafe arguments                  |
| Server-side auth | model output cannot grant permission                      |
| Approval gates   | high-risk actions require human or policy approval        |
| Sandboxing       | code/browsing/file work runs in constrained environment   |
| Secret isolation | credentials never enter prompt context                    |
| Output filtering | sensitive or unsafe content blocked before user           |
| Audit logging    | security review can reconstruct decisions                 |

The model remains useful inside these boundaries. It just stops being the final authority.
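Two of these families, tool allowlists and audit logging, fit in a few lines. This is a minimal sketch under stated assumptions: `dispatch`, `ALLOWED_TOOLS`, `audit_log`, and the stub tool are hypothetical names, and a real system would write the audit record to durable storage instead of a list.

```python
from typing import Any, Callable


def search_refund_policy(query: str) -> str:
    # Hypothetical stub; a real tool would query an internal policy index.
    return f"policy results for {query!r}"


# The allowlist is defined in code. The model can name a tool but cannot add one.
ALLOWED_TOOLS: dict[str, Callable[..., Any]] = {
    "search_refund_policy": search_refund_policy,
}

audit_log: list[dict[str, Any]] = []


def dispatch(tool_name: str, args: dict[str, Any]) -> Any:
    """Run a model-proposed tool call only if the tool is on the allowlist."""
    allowed = tool_name in ALLOWED_TOOLS
    # Record the proposal and the decision before anything executes.
    audit_log.append({"tool": tool_name, "args": args, "allowed": allowed})
    if not allowed:
        raise PermissionError(f"tool not allowlisted: {tool_name}")
    return ALLOWED_TOOLS[tool_name](**args)
```

A proposed call to an unlisted tool such as `delete_account` raises `PermissionError`, and the refused attempt still lands in `audit_log` for later review.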


3) Worked example — secure support agent design

For the refund agent:

least privilege: only refund-policy search and eligibility check
tenant isolation: account ID derived from authenticated session
tool allowlist: no arbitrary admin tool
typed schema: refund inquiry fields are enum/date/amount, not free-form command
server auth: eligibility computed server-side
approval gate: actual refund requires existing workflow
secret isolation: no API keys or private notes in prompt
audit: trace stores proposed args and server decision

This design still uses an LLM. It just refuses to let the LLM become the bank manager.
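Part of this design can be sketched in code: the typed schema, the session-derived account ID, and the server-side eligibility decision. Everything named here is an assumption for illustration, including the `RefundReason` values, the 30-day window, and the 20,000-cent cap. None of it comes from a real refund policy.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from enum import Enum


class RefundReason(Enum):
    DAMAGED = "damaged"
    NOT_DELIVERED = "not_delivered"
    CHANGED_MIND = "changed_mind"


@dataclass(frozen=True)
class RefundInquiry:
    reason: RefundReason
    purchase_date: date
    amount_cents: int


def parse_inquiry(raw: dict) -> RefundInquiry:
    """Validate model-proposed arguments; malformed input raises an exception."""
    unexpected = set(raw) - {"reason", "purchase_date", "amount_cents"}
    if unexpected:
        raise ValueError(f"unexpected fields: {sorted(unexpected)}")
    amount = raw["amount_cents"]
    if isinstance(amount, bool) or not isinstance(amount, int) or amount <= 0:
        raise ValueError("amount_cents must be a positive integer")
    return RefundInquiry(
        reason=RefundReason(raw["reason"]),  # ValueError if not an enum value
        purchase_date=date.fromisoformat(raw["purchase_date"]),
        amount_cents=amount,
    )


def check_eligibility(session_account_id: str, raw_args: dict, today: date) -> dict:
    # The account ID comes from the authenticated session, never from raw_args.
    inquiry = parse_inquiry(raw_args)
    age = today - inquiry.purchase_date
    within_window = timedelta(0) <= age <= timedelta(days=30)  # assumed policy
    eligible = within_window and inquiry.amount_cents <= 20_000  # assumed cap
    # The returned record doubles as the audit entry: proposed args plus decision.
    return {"account_id": session_account_id, "proposed": raw_args, "eligible": eligible}
```

Because `reason` must be an enum value and extra fields are rejected, a free-text instruction or a model-supplied `account_id` smuggled into the arguments fails validation before any eligibility logic runs.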


4) Why not put every control inside the prompt

The tempting alternative is prompt-based control because it is fast to add and easy to change.

It fails when the control must be reliable under adversarial pressure. Prompts are excellent for guiding model behavior. They are weak for enforcing authorization, isolation, spending, network access, deletion, or irreversible action.

Use prompts for behavior. Use code, infrastructure, and policy engines for boundaries.
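Network access is a convenient illustration of the split. A prompt can ask the model to browse only the documentation site, but a check like the following is what actually holds when the model is talked into something else. This is a sketch: `ALLOWED_HOSTS` and `egress_allowed` are hypothetical names, and `docs.example.com` is a placeholder host.

```python
from urllib.parse import urlparse

# Hypothetical egress allowlist, owned by infrastructure config, not the prompt.
ALLOWED_HOSTS = frozenset({"docs.example.com"})


def egress_allowed(url: str) -> bool:
    """Allow only HTTPS requests to an exact, allowlisted hostname."""
    parsed = urlparse(url)
    # Compare the parsed hostname, not a substring of the URL, so lookalikes
    # such as "docs.example.com.evil.test" or "docs.example.com@evil.test" fail.
    return parsed.scheme == "https" and parsed.hostname in ALLOWED_HOSTS
```

In practice the same rule would usually be enforced again at the network layer, so that a bug in application code does not silently remove the boundary.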


5) Production signals — control effectiveness

The first metric is boundary bypass rate: how often an attempt to cross a hard boundary succeeds, measured across both red-team and production traces.

The misleading metric is "model followed policy." The model may follow policy in observed cases while the system lacks a hard boundary.
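A boundary bypass rate can be computed from labeled traces. The definition used below (bypasses divided by boundary-crossing attempts) is one reasonable reading, stated here as an assumption, and the `TraceResult` fields are hypothetical labels that a red-team harness or trace reviewer would have to supply.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TraceResult:
    case_id: str
    attempted_boundary: bool  # the trace tried to cross a hard boundary
    boundary_held: bool       # the control blocked the attempt


def boundary_bypass_rate(results: list[TraceResult]) -> Optional[float]:
    """Share of boundary-crossing attempts that were not blocked."""
    attempts = [r for r in results if r.attempted_boundary]
    if not attempts:
        return None  # nothing was tested, which is different from a rate of zero
    bypassed = sum(1 for r in attempts if not r.boundary_held)
    return bypassed / len(attempts)
```

Counting only traces that actually attempted a crossing keeps the metric from being diluted by benign traffic, which is how "model followed policy" ends up looking better than the system really is.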

The expert artifact is a control matrix:

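| Asset/action     | Threat path        | Hard control                  | Test                        |
|------------------|--------------------|-------------------------------|-----------------------------|
| refund execution | prompt/tool abuse  | server eligibility + approval | red-team tool case          |
| tenant document  | indirect injection | tenant ACL + source label     | cross-tenant retrieval test |
| API key          | exfiltration       | never in prompt + vault       | secret scanning             |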
6) Boundary — controls add friction

Controls can make the product slower, less flexible, more expensive, and harder to build. That is normal. Security design chooses friction where risk justifies it.

The pathology is equal friction everywhere. Low-risk tasks become unusable while high-risk tool paths remain under-protected because nobody mapped the vault.
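Risk-proportional friction can be written down as an explicit policy function, so the choice is reviewable and not left implicit. This is a sketch only: the `Friction` tiers, the 1-to-3 scoring scale, and the thresholds are illustrative assumptions, and a real policy would likely weigh more factors, such as user harm and regulatory risk.

```python
from enum import Enum


class Friction(Enum):
    NONE = "none"
    POLICY_CHECK = "policy_check"
    HUMAN_APPROVAL = "human_approval"


def required_friction(reversible: bool, blast_radius: int, asset_value: int) -> Friction:
    """Pick a friction tier; scores are assumed to run from 1 (low) to 3 (high)."""
    # Irreversible actions that touch anything beyond low-risk need a human.
    if not reversible and (blast_radius >= 2 or asset_value >= 2):
        return Friction.HUMAN_APPROVAL
    # Moderate risk gets an automated policy check.
    if not reversible or blast_radius >= 2 or asset_value >= 3:
        return Friction.POLICY_CHECK
    # Low-risk, reversible actions stay frictionless.
    return Friction.NONE
```

Under these assumed thresholds, a reversible low-value lookup passes straight through, while an irreversible high-value action such as executing a refund always routes to human approval.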


Recall checkpoint

  • Why is model alignment not sandboxing?
  • Which controls should live outside the prompt?
  • What does a control matrix show?
  • Why is equal friction everywhere a problem?

Interview Q&A

Q: What hard controls should surround an AI agent?
A: Least privilege, tenant isolation, tool allowlists, typed schemas, server-side auth, approval gates, sandboxing, secret isolation, output filtering, and audit logging.

Common wrong answer to avoid: "Make the model stricter." Model behavior is one layer; boundaries need enforceable controls.

Q: How do you decide where to add friction?
A: Use asset value, action reversibility, user harm, regulatory risk, blast radius, and red-team severity.

Common wrong answer to avoid: "Add human review everywhere." Universal friction harms usability and may still miss technical boundaries.

Q: What is a control matrix?
A: A table mapping each asset/action to its threat path, hard control, and the test that proves the control works.

Common wrong answer to avoid: "A list of security features." The matrix must connect controls to attack paths.


Apply now (10 min)

Model the exercise. Build a control matrix for refund execution, tenant document access, and API key handling.

Your turn. Pick one agent and list what it must not read, call, reveal, write, spend, or change.

Reproduce from memory. Explain why prompts guide behavior while controls enforce boundaries.


What you should remember

This chapter explained security controls and isolation. The important idea is that a persuasive model needs hard boundaries around data and actions.

Carry this diagnostic forward: ask what still cannot happen if the model is persuaded.

Remember:

  • Alignment is not sandboxing.
  • Prompts guide; controls enforce.
  • Least privilege and server-side auth are central.
  • Control matrices connect threats to tests.

Bridge. Controls reduce risk, but attacks and regressions still happen. Next we monitor security behavior and hand real failures into incident response. → 10-security-monitoring-and-response.md