10. Security monitoring and response — from suspicious trace to incident¶
~11 min read. AI security is not finished when the red-team suite passes. Production still needs alerts, traces, review loops, and incident handoff.
Continues from 09-security-controls-and-isolation.md. The audit camera watches whether the guard rails hold under real traffic.
The previous chapter placed hard controls around the model so persuasion cannot directly cross sensitive boundaries. That reduces reachable harm, but controls still fail, regress, or get probed in production. This chapter connects security traces to monitoring and incident response.
1) The wall — security failures often look like normal conversations¶
A cross-tenant retrieval leak may appear as a normal answer. A tool-abuse attempt may appear as a valid tool call that failed authorization. A jailbreak attempt may look like a long support conversation. A data-exfiltration attempt may be hidden inside a harmless-looking request for "debug context."
Security monitoring must track attack paths, not only errors.
signal
-> suspicious prompt / retrieved span / tool rejection / output leak
-> trace review
-> severity classification
-> containment or red-team case
The audit camera needs enough context to explain why a trace was suspicious.
2) Signals worth monitoring¶
Track these by workflow, tenant, model route, and tool:
- tool-call rejections by auth/schema/policy
- repeated prompt-injection patterns
- indirect-injection markers in retrieved content
- sensitive-field output blocks
- cross-tenant retrieval candidates
- unusual memory writes
- sudden refusal or unsafe-completion shifts
- high-severity red-team regression failures
- token/cost spikes tied to attack-like loops
- user reports of suspicious behavior
No single signal is enough. Security monitoring is correlation.
3) Worked example — suspicious refund trace¶
The refund agent produces three signals in one hour:
tool auth rejections ↑ for refund_customer
retrieved content includes instruction-like text from uploaded docs
output filter blocks two attempts to reveal internal notes
Individually, each might be noise. Together, they suggest an adversarial campaign or compromised document source.
The response:
- Open security investigation.
- Snapshot representative traces.
- Disable risky refund tool path if needed.
- Add red-team cases from observed traces.
- Review document source and tenant scope.
This is where Module 26 incident response connects directly to AI security.
4) Why not alert on every suspicious phrase¶
The tempting alternative is to alert every time a prompt contains suspicious wording. That creates noise fast.
Alert on impact paths instead:
- suspicious text plus tool access
- suspicious text plus sensitive retrieval
- repeated attempts by same actor or tenant
- policy bypass near high-risk workflow
- blocked exfiltration of known sensitive field
The goal is actionable security signal, not a panic feed.
5) Production signals — monitoring quality¶
The first metric is high-severity detection precision: how often alerts map to real attack paths or control failures.
The misleading metric is alert count. More alerts can mean better visibility or worse tuning.
The expert artifact is a security trace bundle: source, prompt segment, retrieval candidates, model output, tool proposal, control decision, user/tenant, and mitigation.
6) Boundary — monitoring is not prevention¶
Monitoring detects and explains. It does not replace hard controls. If an alert fires after data has already leaked, it is useful for response but not sufficient for prevention.
The pathology is dashboard security. The team can see the attack but cannot block it because no firebreak or control exists.
Recall checkpoint¶
- Why do AI security failures look normal?
- Which signals indicate attack paths?
- Why is suspicious phrase alerting noisy?
- What belongs in a security trace bundle?
Interview Q&A¶
Q: What should you monitor for AI security? A: Tool rejections, injection patterns, sensitive-output blocks, cross-tenant retrieval, unusual memory writes, unsafe-completion shifts, red-team regressions, cost loops, and suspicious user reports.
Common wrong answer to avoid: "Only monitor jailbreak attempts." Security attacks appear through tools, data, memory, retrieval, and output paths.
Q: When does a security alert become an incident? A: When there is credible user harm, data exposure, unauthorized action, active attack, broad blast radius, or control failure requiring containment.
Common wrong answer to avoid: "Only after confirmed exploit." Waiting for certainty can expand blast radius.
Q: How do you reduce alert noise? A: Alert on suspicious text plus reachable impact path, repeated behavior, high-risk workflow, sensitive field block, or control failure.
Common wrong answer to avoid: "Alert on every suspicious phrase." That creates fatigue without prioritizing risk.
Apply now (10 min)¶
Model the exercise. Define three alerts for the refund agent: tool rejection spike, sensitive-output block, and suspicious retrieved content.
Your turn. Pick one AI product and write a security trace bundle schema.
Reproduce from memory. Explain why monitoring must connect suspicious behavior to reachable impact.
What you should remember¶
This chapter explained AI security monitoring and response. The important idea is that production traces must reveal attack paths and control decisions, not only model text.
Carry this diagnostic forward: alert on suspicious behavior when it approaches an asset, tool, tenant boundary, or sensitive output.
Remember:
- AI security failures can look like normal conversations.
- Monitor attack paths, not isolated phrases.
- Security trace bundles support incident response.
- Monitoring does not replace prevention.
Bridge. Monitoring closes the loop from red-team to incident response, but AI security remains an arms race. The final chapter names the limits honestly. → 11-honest-admission.md