11. Observability and audit

~8 min read. Five chapters built the sandbox; multi-tenant isolation made it cross-tenant safe. Observability is what tells you the sandbox is holding — per-call audit, syscall tracing, resource use measurement, and the dashboards that surface anomalies.

Continues from 10-multi-tenant-isolation.md. This chapter develops the observability layer. Recurring concepts in bold: audit record, per-call trail, syscall trace, resource-use metric, anomaly signal, retention policy.

A sandbox without observability is opaque. Incidents are debugged by guessing; trends are invisible; security review has nothing to inspect. The observability layer is the apparatus's lens on the sandbox.

What every call produces

Per-call audit record:

{
  call_id, parent_request_id, agent_id, tenant_id, user_id,
  tool_name, tool_version,
  parameters_hash, parameters_redacted,
  start_time, end_time, duration_ms,
  isolation_model, resource_caps, resource_actual,
  policy_envelope_id, policy_denials,
  credential_id, credential_scope,
  approval_state, approval_chain,
  outcome (success | error | terminated),
  error_class, error_message_redacted,
  output_hash, output_size_bytes,
  syscall_summary, network_summary,
  cost_estimate
}

The record is structured. Free-form logging supplements but does not replace the structured record. The record's schema is part of the platform; tools opt in to a stable contract.
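A minimal sketch of how a subset of this record might be expressed as a typed structure. The class name, the millisecond time unit, the `schema_version` field, and the validation rules are assumptions made for illustration; the chapter itself only fixes the field names.

```python
import json
from dataclasses import dataclass, field, asdict
from typing import Optional

SCHEMA_VERSION = "1.0"  # hypothetical version tag, not defined in the chapter


@dataclass(frozen=True)
class AuditRecord:
    # Identity of the call and of who made it.
    call_id: str
    tenant_id: str
    agent_id: str
    tool_name: str
    tool_version: str
    # Inputs are stored as a hash plus a redacted copy, never raw.
    parameters_hash: str
    parameters_redacted: dict
    # Timing in epoch milliseconds (an assumed unit).
    start_time_ms: int
    end_time_ms: int
    # One of "success", "error", "terminated".
    outcome: str
    policy_denials: list = field(default_factory=list)
    error_class: Optional[str] = None
    schema_version: str = SCHEMA_VERSION

    def __post_init__(self):
        # Reject records that would break queries downstream.
        if self.outcome not in ("success", "error", "terminated"):
            raise ValueError(f"unknown outcome: {self.outcome!r}")
        if self.end_time_ms < self.start_time_ms:
            raise ValueError("end_time_ms precedes start_time_ms")

    @property
    def duration_ms(self) -> int:
        return self.end_time_ms - self.start_time_ms

    def to_json(self) -> str:
        # Sorted keys keep the serialised form stable across writers.
        payload = asdict(self)
        payload["duration_ms"] = self.duration_ms
        return json.dumps(payload, sort_keys=True)
```

Validating at construction time is one way to keep the schema a contract rather than a convention: a tool that emits a malformed record fails at the point of emission, not when an investigator runs a query.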

Per-tenant audit query

The audit must be queryable per tenant. Auditors, on-call engineers, and security responders all need:

All calls for tenant X in time range.
Failed calls by error class.
Calls that hit a policy denial.
Calls with unusual resource use.
Calls with approval rejections.
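The first three queries in the list can be sketched against a small relational store. This example uses Python's built-in `sqlite3` purely for illustration; the table name, column names, and sample rows are hypothetical, and a production audit store would likely be a different engine.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE audit (
        call_id TEXT PRIMARY KEY,
        tenant_id TEXT NOT NULL,
        start_time INTEGER NOT NULL,  -- epoch seconds (assumed unit)
        outcome TEXT NOT NULL,
        error_class TEXT,
        policy_denials INTEGER NOT NULL DEFAULT 0
    )"""
)
# Index leads with tenant_id so per-tenant time-range scans stay cheap.
conn.execute("CREATE INDEX audit_tenant_time ON audit (tenant_id, start_time)")

conn.executemany(
    "INSERT INTO audit VALUES (?, ?, ?, ?, ?, ?)",
    [
        ("c1", "tenant_x", 100, "success", None, 0),
        ("c2", "tenant_x", 200, "error", "Timeout", 0),
        ("c3", "tenant_x", 300, "error", "Timeout", 1),
        ("c4", "tenant_y", 150, "success", None, 0),
    ],
)


def calls_in_range(tenant_id, start, end):
    """All calls for one tenant in the half-open range [start, end)."""
    cur = conn.execute(
        "SELECT call_id FROM audit "
        "WHERE tenant_id = ? AND start_time >= ? AND start_time < ? "
        "ORDER BY start_time",
        (tenant_id, start, end),
    )
    return [row[0] for row in cur]


def failures_by_error_class(tenant_id):
    """Failed calls for one tenant, counted per error class."""
    cur = conn.execute(
        "SELECT error_class, COUNT(*) FROM audit "
        "WHERE tenant_id = ? AND outcome = 'error' GROUP BY error_class",
        (tenant_id,),
    )
    return dict(cur.fetchall())


def policy_denied_calls(tenant_id):
    """Calls for one tenant that hit at least one policy denial."""
    cur = conn.execute(
        "SELECT call_id FROM audit "
        "WHERE tenant_id = ? AND policy_denials > 0 ORDER BY start_time",
        (tenant_id,),
    )
    return [row[0] for row in cur]
```

Every query takes `tenant_id` as a bound parameter, so a caller cannot forget the tenant filter; that is a small design choice that supports the per-tenant requirement.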

The retention policy depends on regulatory and operational needs: 30-90 days for operational queries, longer for regulated tenants.

Syscall tracing

For high-stakes tools, syscall tracing captures what the tool actually did at the kernel boundary:

Which syscalls were attempted (allowed and denied).
The arguments to security-relevant syscalls (execve, open, connect).
The frequency of each syscall.

The trace is voluminous; the pattern is to sample (e.g., 1% of calls) and to capture full traces on demand for incident response.

Trace storage is expensive; the discipline is to record the syscall summary in every call's audit and the full trace only when needed.
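One way to sketch the summary-plus-sampling discipline. The event tuple shape, the function names, and the hash-based sampling scheme are assumptions for illustration; the 1% rate is the example rate given above.

```python
import hashlib
from collections import Counter

SECURITY_RELEVANT = {"execve", "open", "connect"}
SAMPLE_RATE = 0.01  # the 1% example rate from the text


def summarise_syscalls(events):
    """Condense a trace into the summary stored in every audit record.

    events: iterable of (syscall_name, allowed, args) tuples (assumed shape).
    """
    counts = Counter()
    denied = Counter()
    relevant = []
    for name, allowed, args in events:
        counts[name] += 1
        if not allowed:
            denied[name] += 1
        if name in SECURITY_RELEVANT:
            # Keep arguments only for the security-relevant syscalls.
            relevant.append({"syscall": name, "args": args, "allowed": allowed})
    return {
        "counts": dict(counts),
        "denied": dict(denied),
        "security_relevant": relevant,
    }


def keep_full_trace(call_id, on_demand=False, rate=SAMPLE_RATE):
    """Decide whether to store the full trace for this call."""
    if on_demand:
        return True  # incident response overrides sampling
    # Hash the call ID so the decision is deterministic and reproducible.
    digest = hashlib.sha256(call_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")
    return bucket < int(rate * 2**64)
```

Sampling on a hash of the call ID, rather than on a random draw, means any component that sees the same `call_id` reaches the same keep-or-drop decision without coordination.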

Resource use measurement

For every call, record:

CPU time used vs. cap.
Wall time vs. cap.
Peak memory vs. cap.
Bytes read/written.
Sockets opened.
Subprocesses spawned.

The measurement informs cap tuning (chapter 04). Tools that consistently use < 10% of their cap have headroom; tools that consistently use > 90% need investigation (legitimate growth or abuse).
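The 10% and 90% rule can be sketched as a classifier over recent measurements. Reading "consistently" as "at least 95% of samples" is an assumption made here, as are the function and label names.

```python
LOW_WATERMARK = 0.10   # from the text: below this, the tool has headroom
HIGH_WATERMARK = 0.90  # from the text: above this, investigate
CONSISTENCY = 0.95     # assumed reading of "consistently"


def classify_cap_usage(samples, consistency=CONSISTENCY):
    """Classify one resource of one tool from (actual, cap) samples."""
    ratios = []
    for actual, cap in samples:
        if cap <= 0:
            raise ValueError("cap must be positive")
        ratios.append(actual / cap)
    if not ratios:
        return "no-data"
    low_share = sum(r < LOW_WATERMARK for r in ratios) / len(ratios)
    high_share = sum(r > HIGH_WATERMARK for r in ratios) / len(ratios)
    if low_share >= consistency:
        return "headroom"
    if high_share >= consistency:
        return "investigate"
    return "ok"
```

The classifier only says that investigation is needed; distinguishing legitimate growth from abuse still requires looking at the calls themselves.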

Anomaly detection

The observability layer produces signals; some of those signals are paging conditions for the on-call apparatus.

Signal	Anomaly
Policy denial rate	Sudden climb signals adversarial probing or tool regression
Resource cap hits	Sudden climb signals abuse or tool regression
Approval rejection rate	Sudden climb signals model misalignment or attack
Cross-tenant attempts	Any occurrence signals a serious isolation issue
Sandbox escape indicators	Unexpected syscalls, unexpected endpoints, unusual processes

Each signal has a threshold and a payload. The on-call apparatus (chapter 03 of module 06) consumes these as paging conditions; the sandbox's observability layer produces them.
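A sketch of what "a threshold and a payload" might look like in code. The `Signal` type, the baseline-times-factor definition of a "sudden climb", and the default factor of 3 are all assumptions; the chapter does not prescribe how thresholds are derived.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Signal:
    name: str
    value: float
    threshold: float
    payload: dict = field(default_factory=dict)


def rate_signal(name, hits, total, baseline_rate,
                climb_factor=3.0, payload=None) -> Optional[Signal]:
    """Emit a signal when the observed rate exceeds baseline * climb_factor."""
    if total <= 0:
        return None  # no traffic in the window, nothing to measure
    rate = hits / total
    threshold = baseline_rate * climb_factor
    if rate > threshold:
        return Signal(name, rate, threshold, payload or {})
    return None


def cross_tenant_signal(call_ids) -> Optional[Signal]:
    """Any cross-tenant attempt is a signal; the threshold is zero."""
    ids = list(call_ids)
    if not ids:
        return None
    return Signal("cross_tenant_attempts", float(len(ids)), 0.0,
                  {"call_ids": ids})
```

Returning a `Signal` object rather than paging directly keeps production of the signal separate from the on-call apparatus that consumes it.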

A worked example: the audit query that catches the incident

The Hyderabad fintech's data assistant is suspected of cross-tenant leakage based on a customer complaint. The auditor queries:

All calls for tenant X in the last 7 days. → 1,243 calls returned.
Of those, calls that accessed paths outside /workspace/<tenant_X>/. → 0 calls (correct).
Calls that resolved network endpoints outside tenant X's allowlist. → 0 calls (correct).
Calls where the credential scope included tenant Y's resources. → 0 calls (correct).
Calls with unusual output size or syscall pattern. → 3 calls flagged for review.

The three flagged calls are reviewed manually. None shows cross-tenant access. The customer's concern turned out to be a different issue (a misconfigured downstream dashboard), traced through other audit data. The sandbox's observability layer ruled out a cross-tenant leak in 12 minutes; without it, the investigation would have taken days and still ended in uncertainty.
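The three zero-result checks in the query sequence above can be sketched as filters over audit records. The record keys (`paths_accessed`, `endpoints`, `credential_scope`) and the `tenant:<id>:<permission>` scope format are hypothetical; in practice they would be derived from the syscall and network summaries.

```python
import posixpath


def paths_outside_workspace(paths, tenant_id):
    """Return paths that fall outside /workspace/<tenant_id>/."""
    root = f"/workspace/{tenant_id}"
    outside = []
    for path in paths:
        # normpath collapses ".." segments; it does not resolve symlinks.
        norm = posixpath.normpath(path)
        if norm != root and not norm.startswith(root + "/"):
            outside.append(path)
    return outside


def endpoints_outside_allowlist(endpoints, allowlist):
    return sorted(set(endpoints) - set(allowlist))


def foreign_scopes(scopes, tenant_id):
    """Return tenant scopes that belong to a different tenant."""
    own_prefix = f"tenant:{tenant_id}:"
    return [s for s in scopes
            if s.startswith("tenant:") and not s.startswith(own_prefix)]


def rule_out_leak(records, tenant_id, allowlist):
    """Run all three checks; empty lists mean no evidence of leakage."""
    findings = {"paths": [], "endpoints": [], "scopes": []}
    for rec in records:
        if paths_outside_workspace(rec["paths_accessed"], tenant_id):
            findings["paths"].append(rec["call_id"])
        if endpoints_outside_allowlist(rec["endpoints"], allowlist):
            findings["endpoints"].append(rec["call_id"])
        if foreign_scopes(rec["credential_scope"], tenant_id):
            findings["scopes"].append(rec["call_id"])
    return findings
```

Comparing against `root + "/"` rather than `root` alone avoids treating a sibling such as `/workspace/tenant_x2/` as inside tenant X's workspace.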

What the audit must capture about the tool's inputs

A common gap: audit captures what the tool returned but not what the model asked it to do.

The fix: include the parameters in the audit, redacted appropriately. The model's full prompt may be too long; the tool's call parameters (the structured arguments) are the key data. With parameters in audit, an investigator can answer "what did the model ask the tool to do?" without re-running the model.
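A sketch of how the `parameters_hash` and `parameters_redacted` fields might be produced. The sensitive-key list and the `[REDACTED]` marker are hypothetical, and the parameters are assumed to be JSON-serialisable.

```python
import hashlib
import json

SENSITIVE_KEYS = {"password", "token", "ssn", "email"}  # hypothetical list


def parameters_hash(params):
    """Hash a canonical JSON form so key order does not change the hash."""
    canonical = json.dumps(params, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def redact(params):
    """Replace sensitive values while preserving the argument structure."""
    out = {}
    for key, value in params.items():
        if key.lower() in SENSITIVE_KEYS:
            out[key] = "[REDACTED]"
        elif isinstance(value, dict):
            out[key] = redact(value)
        elif isinstance(value, list):
            out[key] = [redact(v) if isinstance(v, dict) else v for v in value]
        else:
            out[key] = value
    return out
```

The redacted copy answers "what did the model ask the tool to do?"; the hash lets an investigator confirm that two calls carried identical arguments without storing the raw values. A plain hash of a low-entropy value can be guessed by enumeration, so a keyed hash may be preferable where that matters.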

Retention and privacy

The audit contains sensitive data: user IDs, parameters (which may contain user content), output hashes, error messages. The policy therefore covers retention, privacy, and access:

Operational retention: 30-90 days for general debugging.
Compliance retention: longer, as the applicable regime requires (PCI DSS, HIPAA, and GDPR each set different expectations).
Privacy: parameters and output content are hashed or redacted; raw content stored only when explicitly required.
Access: audit is access-controlled; not every engineer can query every audit record.

The audit's own access pattern is itself audited. A team querying audit beyond their authorisation is a security signal.
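A minimal sketch of meta-audit: every query against the audit store is itself recorded, whether or not it is allowed. The function name, the in-memory `ACCESS_LOG`, and the shape of the authorisation map are assumptions for illustration.

```python
import time

ACCESS_LOG = []  # stand-in for a separate, append-only access-audit store


class AuditAccessDenied(Exception):
    pass


def query_audit(requester, tenant_id, authorised_tenants, run_query,
                now=time.time):
    """Run an audit query on behalf of a requester, recording the attempt.

    authorised_tenants: dict mapping requester -> set of tenant IDs.
    run_query: callable taking tenant_id and returning the query result.
    """
    allowed = tenant_id in authorised_tenants.get(requester, set())
    # Record before enforcing, so denied attempts are also visible.
    ACCESS_LOG.append({
        "requester": requester,
        "tenant_id": tenant_id,
        "allowed": allowed,
        "at": now(),
    })
    if not allowed:
        raise AuditAccessDenied(f"{requester} may not query {tenant_id}")
    return run_query(tenant_id)
```

Logging the attempt before the authorisation check is the point of the sketch: the denied queries are exactly the ones a reviewer wants to see.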

Operational signals

Healthy. Every call produces a structured audit record. Audit is queryable per tenant in under a second. Anomaly signals fire and are reviewed. Retention policy is enforced.

First degrading metric. Audit ingestion lag climbing. Records are produced but not searchable in time; investigation latency grows.

Misleading metric. Audit volume. High volume is normal for active platforms; the metrics to watch are audit completeness (every call captured) and query latency.

Expert graph. Per-tool audit completeness; query latency p50/p99; anomaly signal rate vs. genuine incident rate.
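Two of these graphs reduce to short calculations. The sketch below assumes the platform can list executed call IDs independently of the audit store, and uses the nearest-rank percentile definition; both are assumptions made for illustration.

```python
import math


def completeness(executed_call_ids, audited_call_ids):
    """Return (fraction of executed calls with an audit record, missing IDs)."""
    executed = set(executed_call_ids)
    if not executed:
        return 1.0, set()
    missing = executed - set(audited_call_ids)
    return 1 - len(missing) / len(executed), missing


def percentile(values, p):
    """Nearest-rank percentile for 0 < p <= 100."""
    if not values:
        raise ValueError("no values")
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]
```

Returning the missing call IDs alongside the ratio matters in practice: a completeness figure below 100% is only actionable if it names the calls and tools that went unrecorded.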

Boundary of applicability

Strong fit. Production sandboxes with regulatory, security, or operational audit needs. Almost all production AI agents.

Pathology. Audit treated as a write-only log. Without query infrastructure and retention policy, the audit is dead storage. The discipline is to design audit as a queryable, retained, access-controlled artefact.

Scale limit. Very large platforms produce massive audit volume. Pattern: hot storage for recent (30 days), warm for retention period, cold for compliance retention. Per-tenant partitioning helps query performance.
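The hot, warm, and cold pattern can be sketched as a function of record age. The 30-day hot window follows the text; the 90-day retention boundary, the six-year compliance boundary, and the `expired` tier are placeholder assumptions that a real policy would replace.

```python
from datetime import datetime, timedelta, timezone

HOT_DAYS = 30          # from the text
WARM_DAYS = 90         # assumed end of the operational retention period
COLD_DAYS = 365 * 6    # placeholder compliance retention


def storage_tier(record_time, now, hot_days=HOT_DAYS,
                 warm_days=WARM_DAYS, cold_days=COLD_DAYS):
    """Pick the storage tier for a record from its age."""
    age = now - record_time
    if age < timedelta(0):
        raise ValueError("record_time is in the future")
    if age < timedelta(days=hot_days):
        return "hot"
    if age < timedelta(days=warm_days):
        return "warm"
    if age < timedelta(days=cold_days):
        return "cold"
    return "expired"  # past every retention window; eligible for deletion
```

For example, `storage_tier(datetime(2024, 11, 20, tzinfo=timezone.utc), datetime(2025, 1, 1, tzinfo=timezone.utc))` falls in the warm tier under these defaults, since the record is 42 days old.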

Failure-prone assumption

The seductive wrong belief: logging is observability. Logging is unstructured text; observability is structured data plus query plus anomaly detection. A platform with comprehensive logs and no structured audit cannot answer the questions that matter — "did tenant X's tool ever access tenant Y's data?" — in operational time.

Where this appears in production

A fintech queries audit to rule out cross-tenant leakage in minutes.
A telecom AI has audit ingestion lag at < 10 seconds; near-real-time queries.
A retail AI has anomaly signals from audit feeding the on-call apparatus.
A consumer chatbot had logs but no structured audit; an incident investigation took 8 days.
A healthcare AI retains audit for at least the six years HIPAA requires for compliance documentation; cold storage policy.
A coding assistant captures syscall summaries in every call's audit; full traces on demand for high-stakes tools.
A government AI has audit access logged separately; auditor queries are themselves audited.
A legal AI redacts privileged-context parameters in audit; the audit captures structure, not privileged content.
A B2B SaaS has per-tenant audit indices; cross-tenant query is structurally impossible.
A travel platform treats audit completeness as a release gate; tools without audit do not ship.
A media AI caught a sandbox escape attempt via syscall anomaly in audit.
A document AI has audit query SLO < 1 second p99.
A staffing AI has audit reviewed monthly for unusual patterns.
A logistics AI has approval-chain audit; the audit records who approved what when.
A search-ops AI has audit feeding security analytics; correlations across calls.
An ad-tech AI has audit on a separate cluster from production; isolation of forensic data.
A real-estate AI has audit-driven cost attribution; per-tenant cost is queryable.
A medical AI has audit as a regulatory artefact; auditors validate completeness.
A small SaaS has logs but not structured audit; incident investigation depends on grep.
A platform team treats audit schema as a stable contract; tools opt in to versioning.

Recall / checkpoint

What does every call's structured audit record contain?
What is the difference between logging and observability?
Why is syscall tracing typically sampled rather than full?
What anomaly signals does the observability layer produce?
How is the audit's own access controlled?
What is the retention pattern across hot, warm, and cold storage?
Why is "logs are enough" a failure-prone assumption?

Interview Q&A

Q1. A team has comprehensive logs but no structured audit. Walk through the gap. Logs are unstructured text; useful for debugging-by-grep but not for answering questions like "did tenant X's tool access tenant Y's data?" or "which calls hit the policy denial in the last hour?" Structured audit captures the call-level events in a queryable format with consistent schema; operational queries return in seconds. The gap: investigations that require structured queries take days against logs; structured audit makes them minutes. The fix is to design audit as a structured artefact, not a log enhancement. Common wrong answer to avoid: "logs are exhaustive" — exhaustive does not mean queryable.

Q2. Walk through what a per-call audit record contains. Call ID, parent request ID, agent ID, tenant ID, user ID, tool name, tool version, parameters (hashed or redacted), timing, isolation model, resource caps vs. actual use, policy envelope ID and denials, credential ID and scope, approval state and chain, outcome and error class, output size and hash, syscall summary, network summary, cost estimate. The record is structured; the schema is a stable contract. Together these fields are sufficient to reconstruct what happened without re-running the call. Common wrong answer to avoid: "we log everything"; without a schema, queries are slow and inconsistent.

Q3. The team retains full syscall traces for every call. Walk through why that is impractical. Volume. A single call may produce thousands of syscalls; multiplied by call rate and retention, the storage is enormous. The pattern is to record the syscall summary (counts per syscall, denials, security-relevant calls) in every audit record, and the full trace only when needed — sampled at 1% normally, or full-trace on demand for incident response. The summary catches anomalies; the full trace supports deep investigation when triggered. Common wrong answer to avoid: "store everything" — cost and operational pain outweigh benefit.

Q4. The audit's own access is itself audited. Why? Queries against the audit can themselves be a security signal. An engineer querying tenants outside their normal scope, querying audit at unusual times, or searching for parameters in a way that suggests reconnaissance: each is worth detecting. The pattern is meta-audit: queries against audit produce their own audit records, which are reviewed for unusual patterns. The cost is small; the value is catching insider threats and over-broad access. Common wrong answer to avoid: "we trust engineers"; trust is layered with verification.

Q5. What is the structural difference between an anomaly signal in the observability layer and a paging condition in the on-call apparatus? Same data, different consumers. The observability layer produces signals (policy denial rate, cap hit rate, cross-tenant attempts, escape indicators). The on-call apparatus (chapter 03 of module 06) consumes those signals as paging conditions when they cross thresholds. The signal exists in the observability layer regardless of paging; the paging is the apparatus's response. Decoupling them means signals can be inspected and tuned without changing paging behaviour. Common wrong answer to avoid: "they're the same thing" — they share data; their purposes and tunings differ.

Q6. How does audit interact with retention and privacy? Audit contains sensitive data: parameters, output hashes, user IDs. Retention is tiered: hot (30-90 days, queryable in seconds), warm (retention period, slower queries), cold (compliance retention, batch retrieval). Privacy: parameters and outputs are hashed or redacted; raw content stored only when explicitly required. Access control: not every engineer can query every audit record; access is itself audited. The audit's design is shaped by the regulatory and privacy regime as well as the operational need. Common wrong answer to avoid: "retain forever for compliance" — operational queries do not need cold storage; privacy demands aging.

Design / debug exercise (10 minutes)

Modelled example. Walk through the worked example (the cross-tenant audit query). Verify the audit contains enough structure to rule out cross-tenant leakage in minutes; identify any field whose absence would have made the investigation harder.

Your turn. Pick one tool. Sketch its audit record. Identify any field that is missing or insufficiently structured. Estimate the operational impact (investigation time) of each gap.

Reproduce from memory. Write the per-call audit record schema (the main fields). The signal of internalisation is that you can design audit for a hypothetical new tool quickly.

Operational memory

This chapter explained the observability layer: structured per-call audit, syscall tracing, resource measurement, and anomaly signals that feed the on-call apparatus. The important idea is that observability is structured data plus query plus anomaly detection — not unstructured logs.

You learned to design the audit schema, retain at multiple tiers, control audit access, sample syscall traces, and produce anomaly signals for the apparatus. That solves the opening failure because the sandbox is now inspectable in operational time; incidents are diagnosed in minutes, not days.

Carry this diagnostic forward: when an investigation requires grepping logs, you have found the audit gap. Structured audit pays back the moment the next investigation runs.

Remember:

Structured audit beats unstructured logs.
Every call produces a record; schema is a stable contract.
Syscall traces are sampled; full traces on demand.
Audit is access-controlled and meta-audited.
Retention is tiered; privacy demands aging of raw content.

Bridge. Eleven chapters built the sandbox. The architect checklist condenses the module into the items a lead runs through on any AI feature with tool execution. The next chapter is that checklist. → 12-architect-checklist.md