00. AI Security and Red-Teaming: First-Principles Overview

AI security starts when untrusted text can influence data access, tool calls, memory, or irreversible actions.

An ordinary web service has familiar security boundaries. A request enters through an API, the server checks auth, validates input, runs business logic, writes logs, and returns a response. If the service is well designed, user-controlled text is data. It does not silently become authorization, routing logic, or an admin decision.

An AI product changes that shape. The model reads user prompts, retrieved documents, tool outputs, files, web pages, tickets, emails, memory entries, and sometimes screenshots. Some of that text is trusted. Much of it is not. The model then produces more text that the application may treat as an answer, a tool argument, a plan, a summary, a memory update, or a decision.

That is the security pressure: untrusted instructions can enter through content paths that look like data paths. A malicious user may type hostile instructions directly. A compromised document may hide them indirectly. A tool response may carry attacker-controlled text back into the next model step. A memory write may persist the attack into tomorrow's session. The attack is not only "make the model say a bad thing." The dangerous path is "make untrusted text influence something valuable."

So AI security is not solved by asking the model to be more loyal. The model is a reasoning component inside the system; it is not the security boundary. The application must decide which text is trusted, which assets are reachable, which tool calls are allowed, which outputs need validation, which memories can be written, and which events require evidence for incident response.
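
One way to make "the application decides which text is trusted" concrete is to label every context item at ingestion and let application code, not the prompt, gate side effects. The following is a minimal sketch under stated assumptions: the tool names, the `Trust` labels, and the rule "side-effecting calls need approval when untrusted text is in context" are illustrative choices, not something this module prescribes.

```python
from dataclasses import dataclass
from enum import Enum


class Trust(Enum):
    TRUSTED = "trusted"      # e.g. a system prompt written by the developers
    UNTRUSTED = "untrusted"  # e.g. user input, retrieved documents, tool output


@dataclass(frozen=True)
class ContextItem:
    source: str  # hypothetical label such as "system_prompt" or "retrieved_doc"
    trust: Trust
    text: str


# Hypothetical tool names, split by whether they cause side effects.
READ_ONLY_TOOLS = {"search_docs"}
SIDE_EFFECT_TOOLS = {"send_email", "write_memory"}


def context_is_tainted(items: list[ContextItem]) -> bool:
    """True if any model-readable item came from an untrusted source."""
    return any(item.trust is Trust.UNTRUSTED for item in items)


def tool_call_decision(tool: str, items: list[ContextItem]) -> str:
    """Decide in application code how to handle a model-proposed tool call."""
    if tool in READ_ONLY_TOOLS:
        return "allow"
    if tool not in SIDE_EFFECT_TOOLS:
        return "deny"  # unknown tools fail closed
    # A side-effecting call proposed while untrusted text was in context
    # goes to an approval step instead of executing directly.
    return "needs_approval" if context_is_tainted(items) else "allow"
```

The label travels with the text, so the decision does not depend on the model noticing or obeying anything. A real system would likely need finer-grained rules than a single taint bit.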

Red-teaming exists because these paths are compositional. A direct prompt injection may be harmless in a read-only chat. The same instruction becomes high risk when combined with retrieval, tenant data, tool calls, memory, and weak logging. The job of this module is to turn vague fear about "jailbreaks" into concrete attack paths, hard controls, regression tests, and production signals.

The recurring pressures and concepts

Pressure / concept	Module shorthand	Meaning
Untrusted instruction path	lobby text	Any route where attacker-controlled text can influence the model: prompt, retrieved document, web page, tool output, file, email, ticket, image caption, or memory.
Reachable asset map	vault map	The concrete map of data, tools, tenant boundaries, credentials, actions, and reputation surfaces that an attack could actually affect.
Pre-model policy check	security desk	Threat modeling, validation, source labeling, and policy checks before model output or tool execution is trusted.
Authority boundary	guard rails	Hard controls outside the model: auth, scopes, schemas, allowlists, approvals, tenant isolation, network limits, and write permissions.
Protected data under controlled use	sealed envelope	Data the system may use for reasoning, ranking, or decision support but must not reveal, copy, or transform into an unsafe action.
Adversarial regression process	red team room	The process that turns realistic attack paths into severity-weighted tests and release gates.
Evidence trail	audit camera	The prompts, retrieved chunks, tool arguments, policy decisions, traces, alerts, and ownership handoff needed to investigate a security event.
Action boundary	action boundary	The line between generating text and causing a side effect: sending an email, creating a ticket, changing an account, calling an API, writing memory, or exposing data.

The shorthand exists only to keep later references compact. The important object is always the engineering pressure: where untrusted instruction enters, what it can reach, which boundary should stop it, and what evidence proves the boundary held.

Top resources

OWASP Top 10 for LLM Applications — useful taxonomy for prompt injection, insecure output handling, excessive agency, sensitive information disclosure, and related application risks.
NIST AI Risk Management Framework — broader risk language for mapping, measuring, managing, and governing AI system risk.
MITRE ATLAS — adversarial tactics and techniques for AI systems, useful when turning incidents and attack paths into test cases.
Cloud and platform security docs for your stack — identity, secrets, logging, network isolation, sandboxing, KMS, and tenant boundaries still decide what the model can reach.
Your own traces and incidents — the best red-team backlog comes from real product surfaces, near misses, support tickets, abuse reports, and security reviews.

What's coming

01-threat-model-ai-system.md — map assets, actors, entry points, trust boundaries, and the paths from model-readable text to reachable harm.
02-direct-prompt-injection.md — understand direct hostile instructions as an authority-boundary problem, not merely a prompt phrasing problem.
03-indirect-prompt-injection.md — trace attacks hidden inside retrieved documents, web pages, tool outputs, and other context sources.
04-jailbreaks-and-policy-pressure.md — separate refusal behavior from system security and understand where policy pressure still matters.
05-data-exfiltration-and-secrets.md — reason about confidentiality when model-readable context can contain secrets or tenant data.
06-tool-abuse-and-action-boundaries.md — design tool execution so model-generated arguments cannot exceed user or system authority.
07-memory-and-cross-tenant-risk.md — treat persistence, personalization, and tenant mixing as security surfaces.
08-red-team-evals-and-scoring.md — build adversarial regression suites that measure reachable harm instead of prompt drama.
09-security-controls-and-isolation.md — apply least privilege, isolation, validation, allowlists, sandboxing, approvals, and data minimization.
10-security-monitoring-and-response.md — collect evidence, detect control failures, alert owners, and feed incidents back into tests.
11-honest-admission.md — name what remains uncertain when models, attacks, products, and context sources keep changing.

Memory map

Concept	Prerequisite	Pressure family	Recurs later as	Layer touched
Threat model	system design + auth basics	attack surface mapping	architecture review gate	product → API → infra
Direct prompt injection	prompt hierarchy	instruction conflict	authority-boundary failure	user text → model output
Indirect prompt injection	RAG + tool loops	untrusted context	context-source trust	document/tool → prompt → action
Jailbreak pressure	safety policy + evals	model persuasion	refusal regression	prompt → model behavior
Data exfiltration	tenancy + access control	confidentiality	leak path investigation	data store → context → output
Tool abuse	agents + orchestration	excessive agency	action-boundary design	model plan → API side effect
Memory risk	personalization + persistence	cross-session influence	tenant and retention control	memory write → future prompt
Red-team eval	evaluation design	adversarial regression	release gate	test suite → CI/CD
Isolation control	platform security	blast-radius reduction	least privilege and sandboxing	runtime → network → secrets
Security monitoring	observability + incident response	evidence and ownership	detection and response loop	trace → alert → incident

This map is the module's operating model. For every attack class, ask four questions: where did untrusted instruction enter, what reachable asset could it influence, which authority boundary should have stopped it, and what evidence would prove the path happened?
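
The four questions can be captured as one structured record per event, so the fourth question (what evidence would prove the path happened) has a concrete answer. This is a sketch only: the field names and the `record_event` helper are hypothetical. It stores a SHA-256 digest of the untrusted text rather than the text itself, on the assumption that the raw content lives in a separate, access-controlled store.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


def digest(text: str) -> str:
    """Return a SHA-256 hex digest so the log can reference content without copying it."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


@dataclass(frozen=True)
class PathEvent:
    trace_id: str
    entry_point: str      # where untrusted instruction entered
    content_sha256: str   # digest of the untrusted text, not the text itself
    reachable_asset: str  # what it could have influenced
    boundary: str         # which control was expected to stop it
    decision: str         # "allowed" or "blocked"


def record_event(trace_id: str, entry_point: str, text: str,
                 asset: str, boundary: str, decision: str) -> str:
    """Serialize one event as a JSON line suitable for an append-only log."""
    event = PathEvent(trace_id, entry_point, digest(text), asset, boundary, decision)
    return json.dumps(asdict(event), sort_keys=True)
```

A call such as `record_event("t-1", "retrieved_doc", doc_text, "billing_notes", "tenant_filter", "blocked")` yields one line that answers all four questions for that step.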

The engineering invariant

AI security is the discipline of keeping untrusted instruction paths from crossing authority boundaries into reachable assets, even when the model is helpful, fluent, and persuadable.

That invariant creates the design posture for the rest of the module:

Treat model-readable text as potentially adversarial unless the system can prove otherwise.
Keep secrets, tenant data, and privileged actions behind controls the model cannot override.
Validate tool arguments, memory writes, and output transformations with application logic.
Score red-team tests by reachable harm and control failure, not by how dramatic the prompt looks.
Preserve enough evidence to debug the path from input to retrieval to model step to tool call to output.

If a mitigation only changes what the model is asked to do, it may help behavior but it is not a complete security control. A real control changes what the model is able to reach, cause, remember, or expose.
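
As an illustration of a control that limits what the model can cause, the sketch below authorizes a model-proposed tool call against the calling user's authority. Everything named here is an assumption for the example: the tool names, the scope strings, the `INVOICE_TENANT` lookup standing in for a database query, and the refund limit.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class User:
    user_id: str
    tenant_id: str
    scopes: frozenset[str]


# Hypothetical allowlist: tool name -> (required scope, exact argument names).
TOOL_POLICY = {
    "read_invoice": ("billing:read", {"invoice_id"}),
    "issue_refund": ("billing:write", {"invoice_id", "amount_cents"}),
}

# Hypothetical ownership lookup standing in for a real database query.
INVOICE_TENANT = {"inv-1": "tenant-a", "inv-2": "tenant-b"}

MAX_REFUND_CENTS = 5_000  # assumed business limit, for illustration only


def authorize_tool_call(user: User, tool: str, args: dict) -> tuple[bool, str]:
    """Return (allowed, reason). The model proposes; this function decides."""
    policy = TOOL_POLICY.get(tool)
    if policy is None:
        return False, "tool not on allowlist"
    required_scope, allowed_args = policy
    if required_scope not in user.scopes:
        return False, "missing scope"
    if set(args) != allowed_args:
        return False, "unexpected or missing arguments"
    invoice_id = args["invoice_id"]
    if not isinstance(invoice_id, str):
        return False, "invoice_id must be a string"
    if INVOICE_TENANT.get(invoice_id) != user.tenant_id:
        return False, "invoice not visible to this tenant"
    if tool == "issue_refund":
        amount = args["amount_cents"]
        is_plain_int = isinstance(amount, int) and not isinstance(amount, bool)
        if not is_plain_int or not 0 < amount <= MAX_REFUND_CENTS:
            return False, "amount outside allowed range"
    return True, "ok"
```

None of these checks reads the prompt or the model's reasoning. A persuaded model can still propose a cross-tenant refund, but the call is rejected because the user lacks the authority, which is the property the invariant asks for.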

Where this appears in production

A support assistant retrieves an attacker-uploaded document that tells it to include private billing notes in a customer-facing summary.
A coding agent reads a repository issue that asks it to print environment variables or modify security configuration.
A sales assistant stores a malicious preference in memory and later applies it to another workflow.
A browser agent reads a web page that instructs it to ignore the user's task and submit a form with attacker-chosen values.
A ticket triage system lets model-generated priority, assignee, or escalation fields bypass server-side policy.
A data assistant summarizes a table it was allowed to read, but the final answer leaks rows the current user should not see.
A red-team suite passes famous jailbreak strings while missing a boring tool-argument injection that creates real side effects.
A monitoring system logs prompts and retrieved chunks, but the logs themselves become a sensitive data store without access control.

These examples share the same shape: text influence travels farther than the system intended. The fix is not one universal prompt. The fix is a threat model, hard boundaries, adversarial regression tests, and response loops that match the product's actual asset paths.
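
For the billing-notes and leaked-rows cases above, one hard boundary is to filter data before it becomes model-readable, so the model cannot reveal what it never saw. The sketch below assumes each row carries a tenant id and a visibility label. The `Row` type, the audience names, and the visibility values are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Row:
    tenant_id: str
    visibility: str  # assumed labels: "customer" or "internal"
    text: str


# Hypothetical mapping from output audience to the visibility levels it may see.
VISIBILITY_BY_AUDIENCE = {
    "customer": {"customer"},
    "support_agent": {"customer", "internal"},
}


def rows_for_context(rows: list[Row], tenant_id: str, audience: str) -> list[Row]:
    """Filter before prompt assembly so out-of-scope rows never reach the model."""
    allowed = VISIBILITY_BY_AUDIENCE.get(audience, set())  # unknown audience sees nothing
    return [r for r in rows if r.tenant_id == tenant_id and r.visibility in allowed]
```

Filtering on the way in is a design choice, not the only option. Some products need the model to reason over protected data, in which case the output path needs its own check.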

How to use this module in design review

When reviewing an AI feature, do not start with "Can the model be jailbroken?" Start with the product path:

untrusted input -> model-readable context -> model decision -> tool/output/memory -> reachable asset

Then ask where each boundary is enforced. If the boundary lives only in prompt text, treat it as behavioral guidance, not a security guarantee. If the boundary lives in authorization, tenant filters, schemas, allowlists, approvals, sandboxing, or network policy, it can hold even when the model is persuaded.

The strongest review output is a short attack-path table: entry point, reachable asset, expected boundary, regression test, and monitoring signal. That table gives product, security, infra, and ML engineers the same object to argue over.
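
The attack-path table can be kept as structured data so that an incomplete row is easy to spot in review. This is a sketch under assumptions: the five columns come from the paragraph above, while the `AttackPath` type and the `missing_fields` and `render_table` helpers are hypothetical names.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class AttackPath:
    entry_point: str
    reachable_asset: str
    expected_boundary: str
    regression_test: str
    monitoring_signal: str


def missing_fields(path: AttackPath) -> list[str]:
    """Return the names of columns left blank for this attack path."""
    return [f.name for f in fields(path) if not getattr(path, f.name).strip()]


def render_table(paths: list[AttackPath]) -> str:
    """Render the rows as tab-separated text with a header line."""
    header = [f.name for f in fields(AttackPath)]
    lines = ["\t".join(header)]
    for path in paths:
        lines.append("\t".join(getattr(path, name) for name in header))
    return "\n".join(lines)
```

A row whose `regression_test` or `monitoring_signal` is blank is a useful review finding in its own right: the boundary may exist, but nothing yet proves it holds.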

Use that table throughout the module. Direct prompt injection tests the first entry point. Indirect injection tests context sources. Exfiltration tests confidentiality. Tool abuse tests side effects. Memory risk tests persistence. Red-team evals turn the table into release gates. Monitoring turns production failures back into evidence.
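
A release gate built on that table can weight each failed test by the severity of the reachable harm, not by how the prompt reads. The sketch below is illustrative only: the severity weights, the score budget, and the rule that any high-severity boundary failure blocks release are assumptions, not values this module defines.

```python
from dataclasses import dataclass

# Assumed weights for illustration; a real program would calibrate these.
SEVERITY_WEIGHT = {"low": 1, "medium": 3, "high": 10}


@dataclass(frozen=True)
class RedTeamResult:
    test_id: str
    severity: str        # severity of the reachable harm, not of the prompt wording
    boundary_held: bool  # did the hard control stop the path?


def harm_score(results: list[RedTeamResult]) -> int:
    """Sum severity weights over tests where the boundary failed."""
    return sum(SEVERITY_WEIGHT[r.severity] for r in results if not r.boundary_held)


def release_allowed(results: list[RedTeamResult], max_score: int) -> bool:
    """Block on any high-severity failure, otherwise compare against a budget."""
    high_failure = any(r.severity == "high" and not r.boundary_held for r in results)
    return not high_failure and harm_score(results) <= max_score
```

Under this scheme a dramatic jailbreak string that only reaches a read-only chat scores low, while a plain tool-argument injection that reaches a write API blocks the release.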

Bridge. Before studying individual attacks, we need the map that makes risk concrete: which assets exist, who can influence the model, which context paths are untrusted, and which authority boundaries must hold. → 01-threat-model-ai-system.md