Skip to content

00. AI Security and Red-Teaming — First-Principles Overview

AI security starts when untrusted text can influence data access, tool calls, memory, or irreversible actions.


An ordinary web service has familiar security boundaries. A request enters through an API, the server checks auth, validates input, runs business logic, writes logs, and returns a response. If the service is well designed, user-controlled text is data. It does not silently become authorization, routing logic, or an admin decision.

An AI product changes that shape. The model reads user prompts, retrieved documents, tool outputs, files, web pages, tickets, emails, memory entries, and sometimes screenshots. Some of that text is trusted. Much of it is not. The model then produces more text that the application may treat as an answer, a tool argument, a plan, a summary, a memory update, or a decision.

That is the security pressure: untrusted instructions can enter through content paths that look like data paths. A malicious user may type hostile instructions directly. A compromised document may hide them indirectly. A tool response may carry attacker-controlled text back into the next model step. A memory write may persist the attack into tomorrow's session. The attack is not only "make the model say a bad thing." The dangerous path is "make untrusted text influence something valuable."

So AI security is not solved by asking the model to be more loyal. The model is a reasoning component inside the system; it is not the security boundary. The application must decide which text is trusted, which assets are reachable, which tool calls are allowed, which outputs need validation, which memories can be written, and which events require evidence for incident response.

Red-teaming exists because these paths are compositional. A direct prompt injection may be harmless in a read-only chat. The same instruction becomes high risk when combined with retrieval, tenant data, tool calls, memory, and weak logging. The job of this module is to turn vague fear about "jailbreaks" into concrete attack paths, hard controls, regression tests, and production signals.


The recurring pressures and concepts

Pressure / concept Module shorthand Meaning
Untrusted instruction path lobby text Any route where attacker-controlled text can influence the model: prompt, retrieved document, web page, tool output, file, email, ticket, image caption, or memory.
Reachable asset map vault map The concrete map of data, tools, tenant boundaries, credentials, actions, and reputation surfaces that an attack could actually affect.
Pre-model policy check security desk Threat modeling, validation, source labeling, and policy checks before model output or tool execution is trusted.
Authority boundary guard rails Hard controls outside the model: auth, scopes, schemas, allowlists, approvals, tenant isolation, network limits, and write permissions.
Protected data under controlled use sealed envelope Data the system may use for reasoning, ranking, or decision support but must not reveal, copy, or transform into an unsafe action.
Adversarial regression process red team room The process that turns realistic attack paths into severity-weighted tests and release gates.
Evidence trail audit camera The prompts, retrieved chunks, tool arguments, policy decisions, traces, alerts, and ownership handoff needed to investigate a security event.
Action boundary action boundary The line between generating text and causing a side effect: sending an email, creating a ticket, changing an account, calling an API, writing memory, or exposing data.

The shorthand exists only to make callbacks compact. The important object is always the engineering pressure: where untrusted instruction enters, what it can reach, which boundary should stop it, and what evidence proves the boundary held.


Top resources

  • OWASP Top 10 for LLM Applications — useful taxonomy for prompt injection, insecure output handling, excessive agency, sensitive information disclosure, and related application risks.
  • NIST AI Risk Management Framework — broader risk language for mapping, measuring, managing, and governing AI system risk.
  • MITRE ATLAS — adversarial tactics and techniques for AI systems, useful when turning incidents and attack paths into test cases.
  • Cloud and platform security docs for your stack — identity, secrets, logging, network isolation, sandboxing, KMS, and tenant boundaries still decide what the model can reach.
  • Your own traces and incidents — the best red-team backlog comes from real product surfaces, near misses, support tickets, abuse reports, and security reviews.

What's coming

  1. 01-threat-model-ai-system.md — map assets, actors, entry points, trust boundaries, and the paths from model-readable text to reachable harm.
  2. 02-direct-prompt-injection.md — understand direct hostile instructions as an authority-boundary problem, not merely a prompt phrasing problem.
  3. 03-indirect-prompt-injection.md — trace attacks hidden inside retrieved documents, web pages, tool outputs, and other context sources.
  4. 04-jailbreaks-and-policy-pressure.md — separate refusal behavior from system security and understand where policy pressure still matters.
  5. 05-data-exfiltration-and-secrets.md — reason about confidentiality when model-readable context can contain secrets or tenant data.
  6. 06-tool-abuse-and-action-boundaries.md — design tool execution so model-generated arguments cannot exceed user or system authority.
  7. 07-memory-and-cross-tenant-risk.md — treat persistence, personalization, and tenant mixing as security surfaces.
  8. 08-red-team-evals-and-scoring.md — build adversarial regression suites that measure reachable harm instead of prompt drama.
  9. 09-security-controls-and-isolation.md — apply least privilege, isolation, validation, allowlists, sandboxing, approvals, and data minimization.
  10. 10-security-monitoring-and-response.md — collect evidence, detect control failures, alert owners, and feed incidents back into tests.
  11. 11-honest-admission.md — name what remains uncertain when models, attacks, products, and context sources keep changing.

Memory map

Concept Prerequisite Pressure family Recurs later as Layer touched
Threat model system design + auth basics attack surface mapping architecture review gate product → API → infra
Direct prompt injection prompt hierarchy instruction conflict authority-boundary failure user text → model output
Indirect prompt injection RAG + tool loops untrusted context context-source trust document/tool → prompt → action
Jailbreak pressure safety policy + evals model persuasion refusal regression prompt → model behavior
Data exfiltration tenancy + access control confidentiality leak path investigation data store → context → output
Tool abuse agents + orchestration excessive agency action-boundary design model plan → API side effect
Memory risk personalization + persistence cross-session influence tenant and retention control memory write → future prompt
Red-team eval evaluation design adversarial regression release gate test suite → CI/CD
Isolation control platform security blast-radius reduction least privilege and sandboxing runtime → network → secrets
Security monitoring observability + incident response evidence and ownership detection and response loop trace → alert → incident

This map is the module's operating model. For every attack class, ask four questions: where did untrusted instruction enter, what reachable asset could it influence, which authority boundary should have stopped it, and what evidence would prove the path happened?


The engineering invariant

AI security is the discipline of keeping untrusted instruction paths from crossing authority boundaries into reachable assets, even when the model is helpful, fluent, and persuadable.

That invariant creates the design posture for the rest of the module:

  1. Treat model-readable text as potentially adversarial unless the system can prove otherwise.
  2. Keep secrets, tenant data, and privileged actions behind controls the model cannot override.
  3. Validate tool arguments, memory writes, and output transformations with application logic.
  4. Score red-team tests by reachable harm and control failure, not by how dramatic the prompt looks.
  5. Preserve enough evidence to debug the path from input to retrieval to model step to tool call to output.

If a mitigation only changes what the model is asked to do, it may help behavior but it is not a complete security control. A real control changes what the model is able to reach, cause, remember, or expose.


Where this appears in production

  • A support assistant retrieves an attacker-uploaded document that tells it to include private billing notes in a customer-facing summary.
  • A coding agent reads a repository issue that asks it to print environment variables or modify security configuration.
  • A sales assistant stores a malicious preference in memory and later applies it to another workflow.
  • A browser agent reads a web page that instructs it to ignore the user's task and submit a form with attacker-chosen values.
  • A ticket triage system lets model-generated priority, assignee, or escalation fields bypass server-side policy.
  • A data assistant summarizes a table it was allowed to read, but the final answer leaks rows the current user should not see.
  • A red-team suite passes famous jailbreak strings while missing a boring tool-argument injection that creates real side effects.
  • A monitoring system logs prompts and retrieved chunks, but the logs themselves become a sensitive data store without access control.

These examples share the same shape: text influence travels farther than the system intended. The fix is not one universal prompt. The fix is a threat model, hard boundaries, adversarial regression tests, and response loops that match the product's actual asset paths.


How to use this module in design review

When reviewing an AI feature, do not start with "Can the model be jailbroken?" Start with the product path:

untrusted input -> model-readable context -> model decision -> tool/output/memory -> reachable asset

Then ask where each boundary is enforced. If the boundary lives only in prompt text, treat it as behavioral guidance, not a security guarantee. If the boundary lives in authorization, tenant filters, schemas, allowlists, approvals, sandboxing, or network policy, it can hold even when the model is persuaded.

The strongest review output is a short attack-path table: entry point, reachable asset, expected boundary, regression test, and monitoring signal. That table gives product, security, infra, and ML engineers the same object to argue over.

Use that table throughout the module. Direct prompt injection tests the first entry point. Indirect injection tests context sources. Exfiltration tests confidentiality. Tool abuse tests side effects. Memory risk tests persistence. Red-team evals turn the table into release gates. Monitoring turns production failures back into evidence.


Bridge. Before studying individual attacks, we need the map that makes risk concrete: which assets exist, who can influence the model, which context paths are untrusted, and which authority boundaries must hold. → 01-threat-model-ai-system.md