
01. Why unsandboxed execution fails

~10 min read. Tools that run with the AI service's privileges, credentials, and host access amplify any wrong call into a production-scale incident. This chapter walks through the failure shapes the sandbox must absorb.

Continues from 00-first-principles.md. The six sandbox surfaces — isolation, resource, policy, credential, approval, observability — exist because each has a known failure when omitted. Until we feel those failures, we will treat the sandbox as ceremony.

The first-principles overview promised a sandbox. This chapter earns that promise by walking through six real failure shapes that occur when tool execution has no boundary. Each is something a smart team has shipped at least once.


The six shapes unsandboxed execution produces

A classic API server runs trusted code on trusted inputs. An AI agent runs trusted code on outputs from a non-deterministic, jailbreakable, and sometimes adversarial model. Treating tool execution as if it were classic API execution produces six recognisable shapes.

| Family | What goes wrong | Why unsandboxed execution amplifies it |
| --- | --- | --- |
| Destructive call | An rm -rf, DROP TABLE, or mass-delete API call executes | No filesystem/network policy to block it |
| Resource exhaustion | A tool consumes CPU, memory, or time without bound | No resource caps; one call starves the host |
| Credential exfiltration | A tool reads or transmits the agent's secrets | Tool runs with the agent's full credentials |
| Cross-tenant leak | One tenant's data appears in another tenant's tool output | No multi-tenant isolation in the execution layer |
| Irreversible side-effect | A tool sends an email, transfers money, or deletes data without approval | No approval gate for irreversible actions |
| Sandbox escape | The sandbox exists but is bypassed via a known vector | Hardening, monitoring, or escape-vector awareness missing |

Each family has a different shape; each needs its own surface in the sandbox design. This chapter's claim is that any team running tool execution without addressing all six is shipping at least three known failure modes.


A worked example — the data assistant incident

Recall the Chennai SaaS incident from the first-principles chapter. The data assistant agent had a Python execution tool. A user asked the assistant to clean up temporary files. The model wrote and the tool executed:

import os
import shutil
shutil.rmtree("/data/workspace/temp")

The /data/workspace/temp path resolved, via a misconfigured base directory mount, to /mnt/customer-storage/temp, which was a mount of a production cloud storage bucket. The tool ran with the AI service's IAM role, which had broad write access to the bucket. Five terabytes of data, deleted.

Walking through the six families against this incident:

  • Destructive call. Yes: shutil.rmtree. No filesystem policy blocked it.
  • Resource exhaustion. No, but a tighter loop could have exhausted the disk's I/O before anyone noticed.
  • Credential exfiltration. Not in this incident, but the tool had the credentials to do it.
  • Cross-tenant leak. Yes: the storage bucket served multiple tenants, so one tenant's request affected other tenants' data.
  • Irreversible side-effect. Yes: no approval gate before mass deletion.
  • Sandbox escape. Not applicable: there was no sandbox to escape.

Three of the six families converged in one incident. The cause is not the model, not the prompt, not the user. The cause is unsandboxed execution.


Shape 1 — Destructive call

The agent executes a tool that performs an irreversible operation: file deletion, table drop, mass email send, payment transfer, account deletion. The unsandboxed shape: the tool has the credentials and the network access to do this, so it does.

The sandbox's defence has three layers. The policy layer blocks destructive operations the tool was not designed for. The credential layer ensures even if the tool tries, its scoped credentials cannot authorise destruction at scale. The approval layer requires a human in the loop for any operation classified as irreversible.
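The policy layer is the simplest of the three to sketch. The following is a minimal illustration, assuming a hypothetical `ToolPolicy` type and made-up tool and operation names (none of them come from a real system). The point it shows is deny-by-default: an operation the tool was not designed for is refused without anyone having to anticipate it.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ToolPolicy:
    """Hypothetical deny-by-default policy for one tool."""
    tool: str
    allowed_operations: frozenset = field(default_factory=frozenset)

    def permits(self, operation: str) -> bool:
        # Anything not explicitly listed is denied, including unknown operations.
        return operation in self.allowed_operations


def check_call(policies: dict, tool: str, operation: str) -> bool:
    """Return True only if the tool has a policy and that policy lists the operation."""
    policy = policies.get(tool)
    return policy is not None and policy.permits(operation)


# Illustrative policy table; the names are assumptions for this sketch.
POLICIES = {
    "file_cleanup": ToolPolicy("file_cleanup", frozenset({"list", "read", "delete_file"})),
}
```

Under this sketch, `check_call(POLICIES, "file_cleanup", "delete_tree")` is refused because a recursive delete was never listed, and a tool with no policy at all is refused for every operation.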

A team that protects only at the model layer ("the prompt says do not delete production data") is one jailbreak away from the failure.


Shape 2 — Resource exhaustion

The agent calls a tool that consumes resources without bound. Examples:

  • Code execution that loops without termination.
  • Memory allocation in a tight loop.
  • Network calls that fan out to many endpoints.
  • Database queries that scan unbounded tables.

The unsandboxed shape: the tool runs with the host's full resource access. One call exhausts CPU, memory, or I/O; concurrent calls compound; other agents on the host degrade.

The resource layer caps each call: CPU time, wall time, memory, file handle count, network connection count, I/O bytes. The caps are enforced by the runtime, not by trust in the tool author. A tool that hits a cap is terminated; the agent sees the termination as a tool error and can decide what to do.
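One way runtime-enforced caps could look, as a rough sketch rather than a production design: run the tool's code in a child process, apply kernel limits with the standard-library `resource` module before it starts, and bound wall time with a `subprocess` timeout. This assumes a Unix host; `run_capped`, `ToolResult`, and the default values are illustrative names and numbers, and only CPU, memory, and wall time are covered here.

```python
import resource
import subprocess
import sys
from dataclasses import dataclass


@dataclass
class ToolResult:
    ok: bool
    output: str
    error: str


def run_capped(code: str, cpu_seconds: int = 10, wall_seconds: int = 30,
               memory_bytes: int = 256 * 1024 * 1024) -> ToolResult:
    """Run Python code in a child process under CPU, memory, and wall-time caps."""

    def apply_limits() -> None:
        # Runs in the child before exec, so the limits bind the tool, not the agent.
        resource.setrlimit(resource.RLIMIT_CPU, (cpu_seconds, cpu_seconds))
        resource.setrlimit(resource.RLIMIT_AS, (memory_bytes, memory_bytes))

    try:
        proc = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True,
            text=True,
            timeout=wall_seconds,      # wall-time cap; the child is killed on expiry
            preexec_fn=apply_limits,   # Unix only
        )
    except subprocess.TimeoutExpired:
        return ToolResult(False, "", "wall-time cap exceeded")
    if proc.returncode != 0:
        # A negative return code means the child was killed by a signal.
        return ToolResult(False, proc.stdout, f"terminated with code {proc.returncode}")
    return ToolResult(True, proc.stdout, "")
```

The agent only ever sees a `ToolResult`; a runaway loop comes back as an ordinary tool error instead of a degraded host. File handles, network connections, and I/O bytes would need further limits (or cgroups) that this sketch leaves out.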


Shape 3 — Credential exfiltration

The agent's environment has credentials — cloud provider IAM, database connection strings, third-party API tokens. An unsandboxed tool inherits all of them via environment variables, file mounts, or instance metadata service access.

The exfiltration shape: a tool — under adversarial prompt or compromised code — reads the credentials and writes them somewhere reachable (an outbound HTTP call, a writable log, a returned tool result).

The credential layer's defence: tools run with scoped credentials specific to their purpose, broker-issued at call time, never the agent's full set. Even if a tool is compromised, the leaked credentials only authorise that tool's scope. The host's instance metadata service is unreachable from the sandbox.
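The broker idea can be sketched in a few lines. This is an in-memory illustration only, assuming a hypothetical `CredentialBroker` class and invented scope strings such as `storage:read`; a real broker would sit in front of a secrets service or the cloud provider's token service. It shows the two properties named above: the credential is issued per call with a narrow scope, and it expires quickly.

```python
import secrets
import time
from dataclasses import dataclass


@dataclass(frozen=True)
class ScopedCredential:
    token: str
    tool: str
    scopes: frozenset
    expires_at: float


class CredentialBroker:
    """Hypothetical in-memory broker issuing short-lived, narrowly scoped tokens."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock      # injectable so expiry can be tested
        self._issued = {}

    def issue(self, tool: str, scopes: set, ttl_seconds: float = 60.0) -> ScopedCredential:
        cred = ScopedCredential(
            token=secrets.token_urlsafe(16),
            tool=tool,
            scopes=frozenset(scopes),
            expires_at=self._clock() + ttl_seconds,
        )
        self._issued[cred.token] = cred
        return cred

    def authorizes(self, token: str, scope: str) -> bool:
        cred = self._issued.get(token)
        # Unknown or expired tokens authorise nothing.
        if cred is None or self._clock() >= cred.expires_at:
            return False
        return scope in cred.scopes
```

If a tool holding a `storage:read` token is compromised, the leaked token still cannot authorise `storage:delete`, and it stops working once its lifetime runs out.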


Shape 4 — Cross-tenant leak

The agent serves multiple tenants. A tool's execution context is shared across tenants — same process, same temp directory, same network namespace, same cached state. One tenant's request reads or affects another tenant's data.

Examples:

  • Tool A writes a temp file while serving one tenant. Tool A's next call, for a different tenant, reads the file.
  • A network connection is reused across tenants; one tenant's response is delivered to another.
  • An in-memory cache populated by tenant A's call is served to tenant B's call.

The defence: per-tenant isolation. Process-level or container-level boundaries per tenant; per-call temp directories that are wiped; no shared state that crosses tenants. The default should be "no state crosses tenants" enforced by the sandbox, not by tool author discipline.
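The per-call temp directory is the easiest part of this defence to show. The sketch below uses the standard-library `tempfile.TemporaryDirectory`; the `call_workspace` helper is a hypothetical name, and it assumes `tenant_id` is already a safe identifier with no path separators. It covers only the temp-file leak, not shared processes, connections, or caches.

```python
import os
import tempfile
from contextlib import contextmanager


@contextmanager
def call_workspace(tenant_id: str):
    """Yield a fresh private directory for one tool call and delete it afterwards."""
    # TemporaryDirectory creates a uniquely named directory readable only by the
    # creating user, and removes it with its contents when the block exits.
    with tempfile.TemporaryDirectory(prefix=f"tenant-{tenant_id}-") as path:
        yield path
```

A tool body would then run inside `with call_workspace(tenant_id) as workdir:` and write only under `workdir`. The next call, for any tenant, starts with an empty directory at a different path.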


Shape 5 — Irreversible side-effect

Some tool actions cannot be undone in software: payment transfers, email sends to external parties, social-media posts, infrastructure deletes, hardware actuations. Even an "undo" is often a fresh action with separate consequences.

The unsandboxed shape: the model decides to call the tool; the tool executes; the action lands in the world. There is no chance to catch a wrong decision after the call returns.

The approval layer's design: actions classified as irreversible require an explicit human approval — a confirmation step in the UI, an out-of-band approval channel, or a delayed-execution queue with a cancel window. The classification is conservative; when in doubt, require approval.
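A minimal sketch of such a gate, assuming hypothetical action names and an `approve` callback that stands in for whichever human approval channel is used. The conservative classification is expressed by listing the actions known to be reversible and treating everything else as needing approval.

```python
from typing import Callable

# Illustrative list of actions considered safe to run unattended (an assumption).
REVERSIBLE_ACTIONS = {"read_record", "list_files", "draft_email"}


def requires_approval(action: str) -> bool:
    # When in doubt, require approval: unknown actions are treated as irreversible.
    return action not in REVERSIBLE_ACTIONS


def execute_with_gate(action: str, run: Callable[[], str],
                      approve: Callable[[str], bool]) -> str:
    """Run the action only if it is reversible or a human has approved it."""
    if requires_approval(action) and not approve(action):
        return f"blocked: {action} was not approved"
    return run()
```

The gate sits between the model's decision and the tool's execution, so a denied `send_payment` never reaches `run` at all.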


Shape 6 — Sandbox escape

A sandbox exists; an attacker (a determined adversarial input, a tool author bug, a misconfiguration) escapes it. The escape vector reaches resources beyond the sandbox's intended scope.

Common vectors:

  • Container escape via a kernel vulnerability or misconfigured capability.
  • Filesystem path traversal (../../../etc/passwd) past a misconfigured chroot.
  • Network reach to the instance metadata service via a DNS or routing trick.
  • Side-channel reads via shared caches or timing.

The defence: depth and monitoring. Multiple isolation layers (process + container + VM for high-stakes tools); syscall monitoring that detects abnormal patterns; regular security review of the sandbox itself.
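One narrow layer of that depth, the path traversal check, can be sketched as follows. `resolve_inside` is a hypothetical helper and it assumes a POSIX filesystem. It catches `..` segments, absolute paths, and symlinks that point outside the root; it does not detect a misconfigured mount underneath the root, and it does not remove the race between checking a path and using it, which is why it is one layer among several.

```python
import os


def resolve_inside(root: str, user_path: str) -> str:
    """Return the real path of user_path if it stays under root, else raise."""
    real_root = os.path.realpath(root)
    # realpath collapses ".." segments and follows symlinks before the comparison.
    # An absolute user_path replaces the root in join() and is then rejected below.
    candidate = os.path.realpath(os.path.join(real_root, user_path))
    if os.path.commonpath([real_root, candidate]) != real_root:
        raise PermissionError(f"path escapes sandbox root: {user_path}")
    return candidate
```

A file tool would call `resolve_inside(workspace_root, requested_path)` and open only the returned path, never the raw string supplied by the model.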

A sandbox without escape-vector awareness is a sandbox that will be escaped by the team's least disciplined integration.


The "not X, Y" diagnosis

The temptation is to fix the model. The real problem is not that the model made a wrong call; it is that the system was designed to amplify any wrong call into a production-scale incident. Better models reduce the rate of wrong calls; sandboxes reduce the blast radius of each wrong call. The two are complementary, not substitutable.

The natural question: so what does the sandbox have to do, surface by surface? The next chapter answers it with the apparatus anatomy.


Operational signals

Healthy. Every tool is run under a sandbox with the six surfaces present. Incidents from tool execution are bounded in blast radius (one tenant, one workspace, one resource budget). Sandbox escape vectors are reviewed quarterly.

First degrading metric. A new tool is integrated without sandbox review. The team's discipline has lapsed; the next incident exposes the gap.

Misleading metric. "Tools work fine." Tools working fine in steady state says nothing about behaviour under adversarial input. The metric to watch is sandbox coverage per tool, not tool success rate.

Expert graph. The matrix of tools × surfaces, with cell colour reflecting whether the surface is enforced for that tool. Red cells are known incident shapes waiting to happen.
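The matrix and the coverage metric are simple enough to compute directly. The sketch below assumes the matrix is kept as a mapping from tool name to the set of surfaces enforced for it; the function names and the sample tools in the usage note are illustrative.

```python
SURFACES = ("isolation", "resource", "policy", "credential", "approval", "observability")


def coverage(matrix: dict) -> dict:
    """Map each tool to the fraction of the six surfaces enforced for it."""
    return {
        tool: sum(1 for surface in SURFACES if surface in enforced) / len(SURFACES)
        for tool, enforced in matrix.items()
    }


def red_cells(matrix: dict) -> list:
    """List every (tool, surface) pair where the surface is not enforced."""
    return [
        (tool, surface)
        for tool, enforced in matrix.items()
        for surface in SURFACES
        if surface not in enforced
    ]
```

For a matrix such as `{"python_exec": {"isolation", "resource"}, "search": set(SURFACES)}`, the first tool has four red cells and the second has none, which is the picture the expert graph is meant to give at a glance.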


Boundary of applicability

Strong fit. AI agents executing tools — code, API calls, database queries, infrastructure actions — on behalf of users at scale. The full sandbox is justified.

Pathology. Tiny prototypes with no users. The sandbox is overkill; a single-purpose, read-only tool may need only basic isolation. The pathology is to ship the prototype to users without then upgrading the sandbox.

Scale limit. Very large platforms have many tools; the sandbox becomes a platform service consumed by every agent. The pattern is shared sandbox infrastructure with per-tool policy.


Failure-prone assumption

The seductive wrong belief: a well-prompted model will not make destructive calls. Even if true on average, the exception (one in a million, one per million users, one per jailbreak) is what produces incidents. The correct belief: the model is treated as an untrusted input to the tool layer; the sandbox enforces what the model is not relied on to do.


Where this appears in production

  • A data-engineering SaaS loses 5 TB of customer data to an unsandboxed Python tool's recursive delete (shutil.rmtree).
  • A devops AI assistant executes a Terraform plan; the plan destroys staging infrastructure because the workspace was misconfigured.
  • A coding assistant runs untrusted user code in the agent's process; a user's code reads the agent's environment variables and posts them to a public paste site.
  • A customer-support AI sends a refund-confirmation email to the wrong user because no approval gate caught the swap.
  • An internal AI tool has a Python execution layer that loops forever; the host's CPU is exhausted; other AI services degrade.
  • A legal-tech AI has a database tool that runs DELETE queries the model generated; one query has no WHERE clause; a customer's data is wiped.
  • A fintech AI has a payments tool with no idempotency or approval; a retry storm sends the same transfer 14 times.
  • A retail AI runs tool code with the agent's full IAM role; a tool reads a sensitive bucket the agent should not have accessed.
  • A travel-booking AI has a tool that books flights; a jailbreak prompts the model to book a flight for someone other than the user.
  • A B2B AI agent writes temp files in a shared directory; another tenant's call reads them.
  • A healthtech AI has a tool that updates patient records; the tool runs with broad write access; one update affects the wrong record.
  • A telecom AI has a tool that disables phone lines; an adversarial prompt disables the wrong line.
  • A government AI has tools with no audit; the postmortem cannot determine what the tool actually did.
  • A logistics AI has a tool that reroutes shipments; a misclassified destination diverts thousands of packages.
  • A search-ops AI has a Python tool to "investigate index issues"; the tool drops the index, taking down search.
  • An e-commerce AI has a price-update tool with no approval; a model hallucinates and lowers all prices to ₹1 for 8 minutes.
  • A media AI has a tool that publishes to social media; an adversarial input publishes a wrong message to thousands of followers.
  • A devtool AI has a shell tool that downloads and runs untrusted code; the code mines cryptocurrency for an hour before detection.
  • A consumer AI has a tool that orders products; an adversarial prompt orders a high-cost item to an attacker's address.
  • A medical AI has a tool that orders lab tests; a jailbreak orders inappropriate tests for a patient.

Recall / checkpoint

  1. Name the six failure families that unsandboxed tool execution produces.
  2. For the worked example, identify which families converged.
  3. Distinguish destructive call from irreversible side-effect — what is the relationship?
  4. What is the cross-tenant leak shape, and what causes it in practice?
  5. Why is "fix the model" the incomplete frame?
  6. What signals a degrading sandbox discipline?
  7. Why is sandbox coverage per tool more useful than tool success rate?

Interview Q&A

Q1. A team is shipping an agent with a code-execution tool. The lead says they trust the model not to write destructive code. Walk through the pushback.

The model's average behaviour is one thing; the exceptions are the incidents. Adversarial prompts, jailbreaks, hallucinations, and edge cases will produce destructive calls at some rate. The sandbox is the structural defence: the tool runs with scoped credentials, a policy that blocks destructive operations the tool was not designed for, resource limits, and approval gates for irreversible actions. The model is treated as untrusted; the sandbox enforces what trust would otherwise cover. Without the sandbox, the team is one bad call away from a public incident.

Common wrong answer to avoid: "we'll prompt-engineer it out". Prompt-only defences fail on the cases that matter most.

Q2. The worked example (the Chennai SaaS): walk through which sandbox surfaces would have prevented the incident.

The policy layer would have blocked a recursive delete (here, shutil.rmtree) on production paths. The credential layer would have ensured the tool's scoped credential did not have delete access to the customer storage bucket. The approval layer would have required a human confirmation for mass deletion. The isolation layer would have prevented the tool from reaching the bucket mount at all. Any one of these would have stopped the incident; defence-in-depth means all four are in place. The cause was that all four were missing.

Common wrong answer to avoid: "the prompt should have been clearer". Model trust does not substitute for structural defence.

Q3. The resource exhaustion family: how does the sandbox prevent it?

The runtime enforces caps on each tool call: CPU time (e.g., 10 seconds), wall time (e.g., 30 seconds), memory (e.g., 256 MB), file handles, network connections, I/O bytes. Caps are enforced by the runtime, not by trust. A tool that would exceed a cap is terminated; the agent sees the error and can decide whether to retry, escalate, or report. The caps are tool-specific; an expected-long-running data analysis tool may have higher caps than an expected-short utility tool.

Common wrong answer to avoid: "the host's overall limits will catch it". By then, the host is degrading.

Q4. How does cross-tenant leak happen even with per-call isolation, and how is it prevented?

Per-call isolation can still leave state shared across calls: file descriptors, network connections, in-memory caches, temp directories. Tool A writes a temp file in call 1 (tenant X); call 2 (tenant Y) reads it. Defence: per-tenant boundaries enforced at the isolation layer, per-call temp directories that are deleted after the call, no shared in-memory state. The default should be "no state crosses tenants" enforced by the runtime, not by tool author discipline.

Common wrong answer to avoid: "the tool author will be careful". Careful is not enforcement.

Q5. What is the difference between sandbox depth and sandbox monitoring?

Depth is multiple layers of isolation: process + container + VM, language-level + OS-level + network-level. Monitoring is observing the sandbox in operation for abnormal patterns: unexpected syscalls, unusual resource use, network calls to unexpected endpoints. Depth raises the cost of escape; monitoring catches escapes that succeed. A sandbox with depth but no monitoring is hardened but blind; a sandbox with monitoring but shallow depth catches escapes after damage. Both are needed.

Common wrong answer to avoid: "depth is enough". Even hardened systems are escaped; the question is whether the team sees the escape.

Q6. The team has unsandboxed tools today and limited engineering capacity. How do you prioritise?

Prioritise by blast radius. Tools that touch production data, customer-facing resources, money, or external parties are first. Tools that read public APIs, internal-only resources, or read-only data are later. Within each tier, prioritise by call frequency (high-traffic tools have more incident opportunity per unit time). The sandbox migration is sprint-scale work; the cost of deferring it is incident-scale.

Common wrong answer to avoid: "do them all at once". Phased migration is faster than big-bang.
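The ordering rule in Q6 amounts to a two-key sort. A small sketch, assuming a hypothetical `Tool` record with a boolean blast-radius tier and a daily call count; real inventories would likely use more tiers and measured traffic.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tool:
    name: str
    high_blast_radius: bool   # touches production data, money, or external parties
    calls_per_day: int


def migration_order(tools: list) -> list:
    """Order tools for sandbox migration: blast radius first, then call frequency."""
    # False sorts before True, so "not high_blast_radius" puts the risky tier first;
    # negating calls_per_day puts the busiest tools first within each tier.
    return sorted(tools, key=lambda t: (not t.high_blast_radius, -t.calls_per_day))
```

With this ordering, a low-traffic payments tool is still migrated before a high-traffic read-only search tool, which matches the priority on blast radius over frequency.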


Design / debug exercise (10 minutes)

Modelled example. Take the worked example (the Chennai data assistant). For each of the six families, walk through which sandbox surface would have prevented or bounded the failure. Identify the minimum-viable surface set for this tool.

Your turn. Pick one tool your team's agent calls. For each of the six families, walk through (a) how the failure could occur with this tool, (b) which sandbox surface defends, (c) which surfaces you have today. The gaps are your next sprint of sandbox work.

Reproduce from memory. Draw the six families table from this chapter without looking. The signal of internalisation is that the families and their pressures land in under three minutes.


Operational memory

This chapter explained why unsandboxed tool execution amplifies any wrong call into a production-scale incident. The important idea is that the model's average behaviour does not determine the system's worst-case behaviour — the exceptions do — and the sandbox is the structural defence that bounds the worst case.

You learned to enumerate the six failure families, recognise each against your team's current tool coverage, and identify the sandbox surfaces that defend each family. That diagnosis is the first step towards closing the opening failure; the rest of the module builds each surface in turn.

Carry this diagnostic forward: when a team says "our agent calls tools safely," ask for the sandbox matrix — tools × surfaces. The matrix tells the truth; the assertion is the appearance.

Remember:

  • Six families: destructive, exhaustion, exfiltration, cross-tenant, irreversible, escape.
  • The model is treated as untrusted input to the tool layer.
  • Sandboxes bound blast radius; better models reduce wrong-call rate; both are needed.
  • Defence-in-depth: multiple surfaces, multiple layers, monitoring.
  • Sandbox coverage per tool is the metric; tool success rate is misleading.

Bridge. The diagnosis is complete. The next chapter is the prescription — the six surfaces of the sandbox as one architecture, so the rest of the module can develop each surface in detail. → 02-the-sandbox-surfaces.md