09. Security controls and isolation: hard boundaries around a persuasive model

~12 min read. The model can be helpful, confused, or manipulated. Security controls decide what still cannot happen.

Continues from 08-red-team-evals-and-scoring.md. The red team room found attack paths. Now the guard rails must turn those findings into hard boundaries.

The previous chapter made attacks measurable through red-team cases and severity scoring. That tells us where the system breaks, but tests alone do not stop harm in production. This chapter turns the red-team findings into enforceable controls: least privilege, isolation, schemas, approvals, sandboxing, and server-side authorization.

1) The wall: model alignment is not sandboxing

A well-aligned model can still be wrong under pressure. A secure system assumes that sometimes the model will produce a bad plan, unsafe argument, or overconfident answer.

Controls should answer:

if the model is persuaded,
what can it still not read,
not call,
not reveal,
not write,
not spend,
not change?

That question shifts security from model personality to system architecture.

2) The core control families

Control	What it guarantees
Least privilege	the model and each tool get only the access they need
Tenant isolation	one tenant cannot influence or read another tenant's data
Tool allowlists	the model cannot call arbitrary capabilities
Typed schemas	free text cannot become unsafe arguments
Server-side auth	model output cannot grant permission
Approval gates	high-risk actions require human or policy approval
Sandboxing	code, browsing, and file work run in a constrained environment
Secret isolation	credentials never enter the prompt context
Output filtering	sensitive or unsafe content is blocked before it reaches the user
Audit logging	a security review can reconstruct decisions

The model remains useful inside these boundaries. It just stops being the final authority.
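As one illustration of the least-privilege and tool-allowlist rows, a dispatcher can refuse any tool the agent's role was not granted, whatever the model asks for. This is a minimal sketch under stated assumptions: the role name, the tool names, and the `dispatch` function are hypothetical and stand in for whatever tool-calling layer a real system uses.

```python
from typing import Callable, Dict, FrozenSet


# Hypothetical tool implementations; real ones would call internal services.
def search_refund_policy(query: str) -> str:
    return f"policy results for: {query}"


def check_eligibility(order_id: str) -> str:
    return f"eligibility checked for order {order_id}"


def delete_account(account_id: str) -> str:
    return f"account {account_id} deleted"


# Everything the platform can do.
TOOLS: Dict[str, Callable[[str], str]] = {
    "search_refund_policy": search_refund_policy,
    "check_eligibility": check_eligibility,
    "delete_account": delete_account,
}

# Least privilege: each role is granted only the tools it needs.
ALLOWLIST: Dict[str, FrozenSet[str]] = {
    "refund_agent": frozenset({"search_refund_policy", "check_eligibility"}),
}


class ToolDenied(Exception):
    """Raised when a requested tool is outside the role's allowlist."""


def dispatch(role: str, tool_name: str, argument: str) -> str:
    """Run a tool only if the role's allowlist contains it."""
    allowed = ALLOWLIST.get(role, frozenset())  # unknown roles get nothing
    if tool_name not in allowed:
        raise ToolDenied(f"{role!r} may not call {tool_name!r}")
    return TOOLS[tool_name](argument)
```

The check runs in the dispatcher, so a model that has been talked into requesting `delete_account` still gets a `ToolDenied` error rather than a deleted account.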

3) Worked example: secure support agent design

For the refund agent:

least privilege: only refund-policy search and eligibility check
tenant isolation: account ID derived from authenticated session
tool allowlist: no arbitrary admin tool
typed schema: refund inquiry fields are enum/date/amount, not free-form command
server auth: eligibility computed server-side
approval gate: actual refund requires existing workflow
secret isolation: no API keys or private notes in prompt
audit: trace stores proposed args and server decision

This design still uses an LLM. It just refuses to let the LLM become the bank manager.
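The design above can be sketched in a few dozen lines. This is an illustrative sketch, not a reference implementation: the field names, the `RefundReason` values, the 30-day window, and the in-memory `AUDIT_LOG` are all assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import date
from decimal import Decimal, InvalidOperation
from enum import Enum
from typing import Any, Dict, List


class RefundReason(Enum):
    DAMAGED = "damaged"
    NOT_DELIVERED = "not_delivered"
    CHANGED_MIND = "changed_mind"


@dataclass(frozen=True)
class RefundInquiry:
    reason: RefundReason
    purchase_date: date
    amount: Decimal


def parse_inquiry(raw: Dict[str, Any]) -> RefundInquiry:
    """Typed schema: accept only enum/date/amount fields, nothing free-form."""
    if set(raw) != {"reason", "purchase_date", "amount"}:
        raise ValueError("unexpected or missing fields")
    try:
        amount = Decimal(str(raw["amount"]))
        reason = RefundReason(raw["reason"])
        purchase_date = date.fromisoformat(raw["purchase_date"])
    except (InvalidOperation, TypeError, ValueError) as exc:
        raise ValueError("invalid field value") from exc
    if not amount.is_finite() or amount <= 0:
        raise ValueError("amount must be a positive number")
    return RefundInquiry(reason, purchase_date, amount)


AUDIT_LOG: List[Dict[str, Any]] = []  # stand-in for a real audit sink
REFUND_WINDOW_DAYS = 30               # assumed policy value


def handle_refund_inquiry(session: Dict[str, str], model_args: Dict[str, Any],
                          today: date) -> Dict[str, Any]:
    # Tenant isolation: the account comes from the authenticated session,
    # never from anything the model proposed.
    account_id = session["account_id"]
    try:
        inquiry = parse_inquiry(model_args)
    except ValueError as exc:
        decision = {"account_id": account_id, "eligible": False, "reason": str(exc)}
    else:
        # Server-side auth: eligibility is computed here, not by the model.
        age_days = (today - inquiry.purchase_date).days
        eligible = 0 <= age_days <= REFUND_WINDOW_DAYS
        decision = {
            "account_id": account_id,
            "eligible": eligible,
            # Approval gate: this function never pays out; it only hands off.
            "next_step": "existing_refund_workflow" if eligible else "none",
        }
    # Audit: record what the model proposed and what the server decided.
    AUDIT_LOG.append({"proposed_args": dict(model_args), "decision": decision})
    return decision
```

If the model is persuaded to add an `account_id` field to its arguments, the schema rejects the call and the decision still carries the session's account, so a spoofed tenant never reaches the eligibility logic.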

4) Why not put every control inside the prompt

The tempting alternative is prompt-based control because it is fast to add and easy to change.

It fails when the control must be reliable under adversarial pressure. Prompts are excellent for guiding model behavior. They are weak for enforcing authorization, isolation, spending, network access, deletion, or irreversible action.

Use prompts for behavior. Use code, infrastructure, and policy engines for boundaries.
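To make the split concrete, compare a spending rule written as prompt text with the same rule enforced in code. The sketch below is hypothetical throughout: `SYSTEM_PROMPT`, `SpendGuard`, and the limit of 50 are illustrative names and values, not part of any real library.

```python
from decimal import Decimal

# Guidance only: the model usually follows this, but nothing enforces it.
SYSTEM_PROMPT = "Never spend more than 50 credits in one session."


class BudgetExceeded(Exception):
    """Raised when a charge would push the session past its hard limit."""


class SpendGuard:
    """Hard per-session spend cap that lives outside the model."""

    def __init__(self, limit: Decimal) -> None:
        self._limit = limit
        self._spent = Decimal("0")

    def charge(self, amount: Decimal) -> Decimal:
        """Record a charge and return the remaining budget, or refuse it."""
        if amount <= 0:
            raise ValueError("amount must be positive")
        if self._spent + amount > self._limit:
            raise BudgetExceeded(
                f"charge of {amount} exceeds remaining {self._limit - self._spent}"
            )
        self._spent += amount
        return self._limit - self._spent
```

Every paid tool call would go through `charge` before it runs. The prompt still shapes what the model tries to do, but the cap holds even when the prompt is ignored or overridden.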

5) Production signals: control effectiveness

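The first metric to track is the boundary bypass rate: how often an action that a hard control should have blocked gets through anyway, measured across red-team runs and production traces.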

The misleading metric is "model followed policy." The model may follow policy in observed cases while the system lacks a hard boundary.
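One way to compute a boundary bypass rate is sketched below. The chapter does not fix a formula, so the definition used here (forbidden actions that executed, divided by forbidden actions that were attempted) is an assumption, as are the `Trace` fields.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Trace:
    trace_id: str
    attempted_forbidden_action: bool  # the request targeted a protected boundary
    action_executed: bool             # the forbidden action actually happened


def boundary_bypass_rate(traces: Iterable[Trace]) -> float:
    """Share of boundary attempts in which the forbidden action executed."""
    attempts = [t for t in traces if t.attempted_forbidden_action]
    if not attempts:
        return 0.0  # no attempts observed, so no measured bypasses
    bypasses = sum(1 for t in attempts if t.action_executed)
    return bypasses / len(attempts)
```

Note what the function ignores: whether the model refused politely. A trace counts as safe only when the forbidden action did not execute, which is the difference between measuring the boundary and measuring the model's manners.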

The expert artifact is a control matrix:

Asset/action	Threat path	Hard control	Test
refund execution	prompt/tool abuse	server eligibility + approval	red-team tool case
tenant document	indirect injection	tenant ACL + source label	cross-tenant retrieval test
API key	exfiltration	never in prompt + vault	secret scanning
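A control matrix is easier to keep honest when it lives as data that a check can walk. The sketch below encodes the three rows above and flags any row that lacks a hard control or a test; the `ControlRow` type and `rows_missing_coverage` helper are hypothetical names chosen for the example.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ControlRow:
    asset_or_action: str
    threat_path: str
    hard_control: str
    test: str


CONTROL_MATRIX: List[ControlRow] = [
    ControlRow("refund execution", "prompt/tool abuse",
               "server eligibility + approval", "red-team tool case"),
    ControlRow("tenant document", "indirect injection",
               "tenant ACL + source label", "cross-tenant retrieval test"),
    ControlRow("API key", "exfiltration",
               "never in prompt + vault", "secret scanning"),
]


def rows_missing_coverage(matrix: List[ControlRow]) -> List[str]:
    """Return assets whose row has no hard control or no test."""
    return [
        row.asset_or_action
        for row in matrix
        if not row.hard_control.strip() or not row.test.strip()
    ]
```

Run in CI, a check like this could fail the build when someone adds a new asset or tool path without naming the control that protects it and the test that proves it.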

6) Boundary: controls add friction

Controls can make the product slower, less flexible, more expensive, and harder to build. That is normal. Security design chooses friction where risk justifies it.

The pathology is equal friction everywhere. Low-risk tasks become unusable while high-risk tool paths remain under-protected because nobody mapped the vault.
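Risk-proportional friction can be expressed as a small policy function. The sketch below is one possible tiering, not a recommended standard: the three `Friction` levels, the `Action` attributes, and the rules that combine them are assumptions chosen to show the shape of the idea.

```python
from dataclasses import dataclass
from enum import Enum


class Friction(Enum):
    NONE = "none"
    POLICY_CHECK = "policy_check"
    HUMAN_APPROVAL = "human_approval"


@dataclass(frozen=True)
class Action:
    name: str
    reversible: bool
    touches_money_or_secrets: bool
    affects_other_tenants: bool


def required_friction(action: Action) -> Friction:
    """Assign more friction as reversibility drops and blast radius grows."""
    if action.affects_other_tenants:
        return Friction.HUMAN_APPROVAL
    if action.touches_money_or_secrets and not action.reversible:
        return Friction.HUMAN_APPROVAL
    if action.touches_money_or_secrets or not action.reversible:
        return Friction.POLICY_CHECK
    return Friction.NONE
```

Under these assumed rules a policy search passes with no friction while an irreversible refund needs human approval, which is the opposite of equal friction everywhere.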

Recall checkpoint

Why is model alignment not sandboxing?
Which controls should live outside the prompt?
What does a control matrix show?
Why is equal friction everywhere a problem?

Interview Q&A

Q: What hard controls should surround an AI agent? A: Least privilege, tenant isolation, tool allowlists, typed schemas, server-side auth, approval gates, sandboxing, secret isolation, output filtering, and audit logging.

Common wrong answer to avoid: "Make the model stricter." Model behavior is one layer; boundaries need enforceable controls.

Q: How do you decide where to add friction? A: Use asset value, action reversibility, user harm, regulatory risk, blast radius, and red-team severity.

Common wrong answer to avoid: "Add human review everywhere." Universal friction harms usability and may still miss technical boundaries.

Q: What is a control matrix? A: A table mapping asset/action to threat path, hard control, and test that proves the control works.

Common wrong answer to avoid: "A list of security features." The matrix must connect controls to attack paths.

Apply now (10 min)

Model the exercise. Build a control matrix for refund execution, tenant document access, and API key handling.

Your turn. Pick one agent and list what it must not read, call, reveal, write, spend, or change.

Reproduce from memory. Explain why prompts guide behavior while controls enforce boundaries.

What you should remember

This chapter explained security controls and isolation. The important idea is that a persuasive model needs hard boundaries around data and actions.

Carry this diagnostic forward: ask what still cannot happen if the model is persuaded.

Remember:

Alignment is not sandboxing.
Prompts guide; controls enforce.
Least privilege and server-side auth are central.
Control matrices connect threats to tests.

Bridge. Controls reduce risk, but attacks and regressions still happen. Next we monitor security behavior and hand real failures into incident response. → 10-security-monitoring-and-response.md