11. Observability and audit

Contract testing catches mistakes before merge. Observability catches what happens after. Every tool call leaves a record; the question is whether the record is good enough to reconstruct an incident, defend a decision, and feed back into the contract's own evolution.

A security engineer at a Mumbai marketplace receives a regulator's request: prove that no agent acting on behalf of consumer X ever wrote consumer Y's data, between the dates D1 and D2. The team has logs. The agent platform logs requests, the orchestrator logs decisions, the tool wrappers log SDK calls, and the downstream systems log their own writes. The engineer spends three days correlating them. The result is "we believe nothing crossed tenants, based on partial coverage of tool wrappers and the orchestrator's session boundaries, but a fraction of the calls do not appear in any single system's logs with both consumer and call payload together." The regulator accepts the response under pressure. The engineering team writes a postmortem with one conclusion: the audit log was a side effect of debugging logs, not a designed artefact. Six weeks later the team has rebuilt it from scratch.

The audit log is the substrate every other discipline in this module sits on. Drift detection (chapter 09) investigates it. Contract testing (chapter 10) uses it for fixture capture. Idempotency (chapter 04) records hits and misses in it. Errors (chapter 05) and scopes (chapter 06) record denials in it. If the audit log is missing, weak, or scattered, every downstream discipline weakens.

What the audit log is for

Five concrete uses; if any of these is impossible against your current logs, the audit is inadequate.

Incident reconstruction. Given an alarm, an on-call must be able to find every tool call involved, with full request and response, in the conversation/workflow context.
Compliance reporting. Given a question about who accessed what, the audit must answer with enough fidelity for regulators or internal auditors.
Drift investigation. Given an unmapped error, the on-call must be able to find the raw downstream payload that produced it.
Replay. Given a contract change, the on-call must be able to replay an arbitrary historical call against the new contract to see what would have happened.
Feedback to the contract. New error codes seen in audit go into the translator; new shapes seen go into the schema; new failure patterns seen become contract pacts.

The audit log is not a debug log. Debug logs answer "what happened in this process?" Audit logs answer "who did what to which resource, with whose authority, under what contract version, and what was the outcome?"

The per-call record

Every tool call produces one audit record. The record is structured (not free text) and contains the fields below.

{
  "audit_id": "aud_01HNF9KZP6V8B0Z3GWQ2X5J7H4",
  "ts_started":  "2026-05-25T11:14:02.371Z",
  "ts_completed": "2026-05-25T11:14:03.118Z",
  "duration_ms": 747,

  "tool": "issue_refund",
  "contract_version": "2.1.0",

  "caller": {
    "agent_identity": "agent.support.v3",
    "conversation_id": "conv_01HNF9...",
    "workflow_step_id": "step_06",
    "trace_id": "trc_01HNF8..."
  },

  "tenant_id": "acme-corp",
  "acting_on_behalf_of": {
    "user_id": "u_8412",
    "session_id": "sess_01HNF8..."
  },

  "scope": {
    "required_scope": "payments:refund:write",
    "credential_id": "cred_01HNF9...",
    "tenant_binding": "acme-corp",
    "issued_at": "2026-05-25T11:14:02.310Z",
    "expires_at": "2026-05-25T11:19:02.310Z"
  },

  "idempotency": {
    "key": "idmp_01HNF9...",
    "result": "executed"        // "executed" | "dedup_hit" | "pending"
  },

  "request": {
    "arguments": { "payment_id": "pay_...", "amount_minor": 50000, ... },
    "redactions": ["arguments.note"]   // redacted before logging
  },

  "upstream": {
    "system": "payments-svc",
    "endpoint": "POST /v3/refunds",
    "request_payload_hash": "sha256:...",
    "request_payload":  { ... raw payload sent, with redactions ... },
    "response_status": 200,
    "response_payload": { ... raw response received, with redactions ... },
    "latency_ms": 631
  },

  "result": {
    "ok": true,
    "value": { "refund_id": "ref_...", "status": "succeeded", ... }
  },

  "preconditions": [
    { "name": "payment_exists",      "passed": true },
    { "name": "amount_within_limit", "passed": true },
    { "name": "sanctions_clear",     "passed": true }
  ],

  "postconditions": [
    { "name": "shape",     "passed": true },
    { "name": "amount_eq", "passed": true }
  ],

  "cost": {
    "upstream_call_count": 1,
    "model_tokens": 0,        // for tool calls; the agent's own token cost is elsewhere
    "estimated_usd": 0.0008
  }
}

Each section is load-bearing for at least one of the five uses, or for an adjacent need such as security review or cost attribution.

Identity and timing — incident reconstruction needs every event timestamped, with a trace ID linking the call to its parent conversation/workflow.
Tool and version — drift investigation needs to know which contract version was running.
Caller and tenant — compliance needs to know on whose authority the call ran.
Scope — security review needs to know which credential was used.
Idempotency — investigations need to know whether this was a fresh call or a dedup replay.
Request and upstream payloads — drift investigation needs the raw payloads, not just the structured contract view.
Result, preconditions, postconditions — outcome reconstruction.
Cost — chargeback, capacity planning, and the cost surface from module 01 chapter 08.

A platform with all of these fields can answer any of the five questions in minutes. A platform missing several has to reconstruct the answer from scattered logs and never quite catches up.
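A completeness check makes the "missing several" case measurable. The sketch below is one possible shape, not a prescribed implementation: `REQUIRED_SECTIONS` is a hypothetical list copied from the example record above, and the function names are illustrative.

```python
from typing import Any

# Hypothetical list of required top-level fields, copied from the example record.
REQUIRED_SECTIONS = (
    "audit_id", "ts_started", "ts_completed", "duration_ms", "tool",
    "contract_version", "caller", "tenant_id", "acting_on_behalf_of", "scope",
    "idempotency", "request", "upstream", "result", "preconditions",
    "postconditions", "cost",
)


def missing_sections(record: dict[str, Any]) -> list[str]:
    """Return the required fields that are absent or null in one audit record."""
    return [name for name in REQUIRED_SECTIONS if record.get(name) is None]


def completeness_ratio(records: list[dict[str, Any]]) -> float:
    """Fraction of records with every required field present (1.0 for an empty batch)."""
    if not records:
        return 1.0
    complete = sum(1 for record in records if not missing_sections(record))
    return complete / len(records)
```

Emitting `1 - completeness_ratio(batch)` per tool gives a "records missing fields" signal that can be alerted on.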

What to redact, and what not to

The audit log contains payloads. Payloads contain user data. Some of that data is sensitive — PII, financial details, health information, secrets. Some is not. The redaction policy decides what is stripped before the record is stored.

A simple rule:

Redact values that have no investigation value beyond their presence. Keep the field name, the shape, and a marker that a value existed.

Examples:

email: redact the value to [redacted-email]; keep that the field was present.
aadhaar_number: redact the value entirely; for some audits, keep a one-way hash so two records about the same person can be correlated.
password, api_key, auth_token: redact the entire value; do not keep a hash.
address: depending on jurisdiction, redact or keep.
amount_minor: do not redact — investigations need the value.
payment_id, customer_id, refund_id: do not redact — these are the keys that make logs queryable.

The redaction policy is part of the contract:

operational:
  observability:
    pii_fields: [email, aadhaar_number, address.street]
    secrets:    [api_key, oauth_token]
    keep_hashed: [aadhaar_number]  # for correlation
    audit_retention: 7y

Two operational rules:

Redaction happens before the audit record is written, not at query time. Storing the raw value and redacting on read is the failure mode that produces breach reports.
Redacted fields keep their field name and a marker. The shape of the call is preserved for investigation; only the sensitive value is gone.
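A minimal sketch of redact-before-write, under stated assumptions: the three sets mirror the policy fragment above, the marker format follows the `[redacted-email]` example, and the correlation hash is a keyed HMAC rather than a bare hash. The keyed hash is an assumption added here, because a low-entropy identifier such as a 12-digit number can be brute-forced from an unkeyed digest. The function names and `hash_key` parameter are hypothetical.

```python
import hashlib
import hmac

# Mirrors the policy fragment above; paths are dotted from the arguments root.
PII_FIELDS = {"email", "aadhaar_number", "address.street"}
SECRETS = {"api_key", "oauth_token"}
KEEP_HASHED = {"aadhaar_number"}


def _marker(path: str, value: object, hash_key: bytes) -> str:
    """Build the placeholder that replaces a sensitive value."""
    leaf = path.split(".")[-1]
    if path in SECRETS:
        return "[redacted-secret]"  # secrets never keep a hash
    if path in KEEP_HASHED:
        digest = hmac.new(hash_key, str(value).encode("utf-8"), hashlib.sha256).hexdigest()
        return f"[redacted-{leaf}:hmac-sha256:{digest}]"
    return f"[redacted-{leaf}]"


def redact(arguments: dict, hash_key: bytes) -> tuple[dict, list[str]]:
    """Return (redacted copy, redacted paths). The input dict is not modified.

    Only nested dicts are walked; lists and scalars pass through unchanged.
    """
    redacted_paths: list[str] = []

    def walk(node, prefix):
        if not isinstance(node, dict):
            return node
        out = {}
        for key, value in node.items():
            path = f"{prefix}.{key}" if prefix else key
            if path in PII_FIELDS or path in SECRETS:
                out[key] = _marker(path, value, hash_key)
                redacted_paths.append(path)
            else:
                out[key] = walk(value, path)
        return out

    return walk(arguments, ""), redacted_paths
```

The returned path list is what would populate the record's `redactions` field. The audit writer only ever receives the redacted copy, so the raw value never reaches storage.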

Special case: errors with sensitive content

Sometimes the downstream returns an error message that contains sensitive data (e.g., "rejected: customer 'foo@bar.com' already exists"). The translator (chapter 05) is responsible for stripping or hashing such content before it lands in the structured error returned to the model, and before the upstream payload is logged. The translator and the redactor are two layers of the same defence.
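For free-text error messages there is no field name to key on, so the scrub has to be pattern-based. A rough sketch, assuming a deliberately simple email pattern (it will not match every valid address) and a hypothetical `scrub_message` helper that both the translator and the redactor could call:

```python
import re

# Simplified pattern for illustration; real address syntax is broader than this.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def scrub_message(message: str) -> str:
    """Replace anything that looks like an email address with a marker."""
    return EMAIL_RE.sub("[redacted-email]", message)
```

Applied to the example above, `scrub_message("rejected: customer 'foo@bar.com' already exists")` keeps the sentence structure and drops only the address.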

Audit retention and storage

Retention windows depend on class and on regulation.

| Class | Typical retention |
| --- | --- |
| read | 30d – 90d |
| write-idempotent | 1y |
| write-non-idempotent | 7y (often regulatory) |
| irreversible | 10y, sometimes signed |

Storage rules:

Append-only. No update; no delete except through the retention process. A mutable audit log is not an audit log.
Tamper-evidence for sensitive classes. Irreversible-tool audits should be cryptographically signed or chained (each record references the hash of the previous), so any retroactive modification is detectable.
Separate from application storage. The audit log is on its own storage with its own access controls. Read access to the audit does not imply write access, and engineers who can write to application storage can never write to the audit.
Queryable. Investigation requires fast lookup by tenant, by user, by tool, by time range, by trace_id. Plan for the query patterns at design time, not after.
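The chaining rule can be illustrated in a few lines. This is a sketch only: an in-memory list stands in for append-only storage, canonical JSON is an assumed serialisation, and the function names are hypothetical. A chain on its own only detects modification if the latest hash is also anchored somewhere the writer cannot change.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder "previous hash" for the first entry


def record_hash(record: dict, prev_hash: str) -> str:
    """Hash the previous entry's hash together with a canonical form of the record."""
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + canonical).encode("utf-8")).hexdigest()


def append(chain: list, record: dict) -> None:
    """Append a record, linking it to the hash of the entry before it."""
    prev = chain[-1]["hash"] if chain else GENESIS
    chain.append({"record": record, "prev_hash": prev, "hash": record_hash(record, prev)})


def verify(chain: list):
    """Return the index of the first entry that fails verification, or None."""
    prev = GENESIS
    for index, entry in enumerate(chain):
        expected = record_hash(entry["record"], prev)
        if entry["prev_hash"] != prev or entry["hash"] != expected:
            return index
        prev = entry["hash"]
    return None
```

Editing any stored record changes its hash, which no longer matches the value recorded in the chain, so `verify` reports the position of the edit.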

A reasonable architecture: a per-call event bus writes audits to durable storage (object storage for cold, OLAP store for warm queries). Tools that touch money have their audits replicated to a signed log; everything else uses regular append-only writes. A retention job runs on a schedule, removing records beyond their tool's retention window.
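The retention decision itself is small enough to test in isolation. A sketch, assuming the upper bound of each window from the table above, 365-day years as an approximation, and a hypothetical `is_expired` helper that the retention job would call per record:

```python
from datetime import datetime, timedelta, timezone

# Assumed windows: upper bounds from the retention table, with 365-day years.
RETENTION = {
    "read": timedelta(days=90),
    "write-idempotent": timedelta(days=365),
    "write-non-idempotent": timedelta(days=7 * 365),
    "irreversible": timedelta(days=10 * 365),
}


def is_expired(record: dict, tool_class: str, now: datetime) -> bool:
    """True if the record completed longer ago than its class's retention window."""
    # Audit timestamps end in "Z"; swap it for an explicit UTC offset before parsing.
    completed = datetime.fromisoformat(record["ts_completed"].replace("Z", "+00:00"))
    return now - completed > RETENTION[tool_class]
```

Passing `now` in, rather than reading the clock inside the function, is what makes the job testable against fixed dates.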

Tracing — connecting audit to the broader system

A single tool call is part of a longer story: the user message that triggered it, the agent's planning step that chose it, the workflow step that called it, the downstream system's own log of the operation. The audit record's trace_id and caller fields are how the story is reassembled.

Standard practice: W3C trace context propagation. Every layer (orchestrator, agent runtime, contract layer, downstream system) participates in the same trace. A query for trace_id = trc_X reassembles the full sequence.

The contract layer's audit emits a span. Each precondition is a sub-span. The upstream call is a sub-span. The postconditions are sub-spans. The model platform's call to the tool is the parent span. The orchestrator's step is the grandparent.
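The parent and child relationship is carried by the W3C `traceparent` header, whose version-00 form is `00-<32 hex trace id>-<16 hex parent id>-<2 hex flags>`. The sketch below shows how a layer could derive a child context from the header it received. It handles only the version-00 shape, the helper names are hypothetical, and how the `trc_...` identifiers in the audit record map onto W3C trace IDs is not specified in this chapter.

```python
import re
import secrets

TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def parse_traceparent(header: str) -> dict:
    """Parse a version-00 traceparent header; raise ValueError if it is malformed."""
    match = TRACEPARENT_RE.match(header)
    if not match:
        raise ValueError(f"malformed traceparent: {header!r}")
    trace_id, parent_id, flags = match.groups()
    # All-zero trace and parent IDs are invalid under the specification.
    if trace_id == "0" * 32 or parent_id == "0" * 16:
        raise ValueError("all-zero trace or parent id")
    return {"trace_id": trace_id, "parent_id": parent_id, "flags": flags}


def child_traceparent(header: str) -> str:
    """Keep the trace ID and flags, and mint a fresh span ID for the child span."""
    context = parse_traceparent(header)
    span_id = secrets.token_hex(8)
    while span_id == "0" * 16:  # all-zero span IDs are not allowed
        span_id = secrets.token_hex(8)
    return f"00-{context['trace_id']}-{span_id}-{context['flags']}"
```

In practice a tracing SDK would do this propagation; the point here is only that every sub-span shares the incoming trace ID, which is what makes a single-trace query possible.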

Module 03 (agent observability and debugging) covers tracing from the agent side; this chapter establishes the contract layer's contribution. The contract layer must emit traces that the agent-side observability can consume.

Replay from audit

A useful diagnostic: given a historical call, replay it. Two flavours.

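Read-only replay. Reconstruct the call's request shape from the audit record, evaluate it against the contract version under review using the recorded upstream response rather than a live call, and present the result for review. The downstream is not contacted. Useful for "what would v3 of this contract have returned, given v2's input?"; chapter 08 dual-run windows lean on this for analysis.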

Dry-run replay. Replay the call with dry_run: true against the current downstream. The downstream's current state may differ from the historical state; the replay is a snapshot-against-now, not a time-travel. Useful for verifying that contract or downstream changes preserve historical behaviour for representative inputs.

A platform with replay tooling can answer questions like "if we ship this contract change, how would last week's calls have behaved?" — which is the strongest form of pre-merge confidence.
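A read-only replay can be sketched as a pure function over one audit record. The assumptions are significant and worth stating: the contract's result is treated as a function of the recorded upstream status and payload only, the translator signature is hypothetical, and `translate_v3_example` is an invented stand-in for a candidate contract version. Because payloads were redacted at write time, the replay sees markers rather than original values.

```python
from typing import Callable

# Hypothetical signature: (upstream status, upstream payload) -> contract result.
Translator = Callable[[int, dict], dict]


def replay_read_only(record: dict, translate: Translator) -> dict:
    """Re-run a candidate translator on the recorded upstream response.

    Nothing is sent downstream; the comparison is against the recorded result.
    """
    upstream = record["upstream"]
    replayed = translate(upstream["response_status"], upstream["response_payload"])
    recorded = record["result"]
    changed = sorted(
        key for key in set(recorded) | set(replayed)
        if recorded.get(key) != replayed.get(key)
    )
    return {
        "audit_id": record["audit_id"],
        "matches": not changed,
        "changed_keys": changed,
        "replayed": replayed,
    }


def translate_v3_example(status: int, payload: dict) -> dict:
    """Invented candidate translator, for illustration only."""
    if status != 200:
        return {"ok": False, "error": {"code": "UPSTREAM_UNCLASSIFIED"}}
    return {"ok": True, "value": {"refund_id": payload["id"], "status": "settled"}}
```

Running this over a week of records and counting the ones where `matches` is false gives a concrete answer to the "how would last week's calls have behaved" question.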

What to monitor live, and what to query on demand

Not every audit field needs a live dashboard. The triage rule:

| Signal | Live dashboard | On-demand query |
| --- | --- | --- |
| Tool call rate per tenant | ✓ | |
| Error rate per tool, by code | ✓ | |
| Latency p50/p95/p99 per tool | ✓ | |
| `UPSTREAM_UNCLASSIFIED` rate | ✓ | |
| Postcondition violation rate | ✓ | |
| Scope denial rate | ✓ | |
| Idempotency dedup hit rate | ✓ | (anomalies are interesting) |
| Cost per tenant per day | ✓ | |
| Audit record completeness (records missing fields) | ✓ | |
| Individual call details | | ✓ |
| Cross-tenant access checks | | ✓ (compliance) |
| Per-user activity | | ✓ (privacy review) |

A live dashboard with these signals is the on-call's first stop on any incident. The on-demand queries serve investigation and compliance work.
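Two of the live signals, error counts by tool and code and a latency percentile per tool, can be derived directly from audit records. The sketch assumes a failed call carries `result.error.code`; the example record earlier only shows the success shape, so that field path is an assumption. The percentile uses the nearest-rank method.

```python
import math
from collections import Counter, defaultdict


def error_counts(records: list) -> Counter:
    """Count failed calls keyed by (tool, error code)."""
    counts: Counter = Counter()
    for record in records:
        result = record["result"]
        if not result.get("ok", False):
            # Assumed error shape; falls back to "UNKNOWN" if no code is present.
            code = result.get("error", {}).get("code", "UNKNOWN")
            counts[(record["tool"], code)] += 1
    return counts


def latency_percentile(records: list, p: int) -> dict:
    """Nearest-rank p-th percentile of duration_ms, per tool."""
    by_tool = defaultdict(list)
    for record in records:
        by_tool[record["tool"]].append(record["duration_ms"])
    percentiles = {}
    for tool, durations in by_tool.items():
        ordered = sorted(durations)
        rank = max(1, math.ceil(p * len(ordered) / 100))
        percentiles[tool] = ordered[rank - 1]
    return percentiles
```

A production dashboard would compute these in the OLAP store rather than in application code; the functions only pin down what each signal means.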

How observability interacts with the other surfaces

Idempotency (chapter 04). Dedup hits, pending states, and key collisions are all audited. The dedup store is itself audited.
Error (chapter 05). Every structured error returned is audited with the code and the upstream cause. Translator updates are driven by audit queries for unmapped errors.
Scope (chapter 06). Every credential issuance and every scope denial is audited. This is the primary feed for security review.
Validation (chapter 07). Every precondition and postcondition outcome is audited. Violations feed drift investigation.
Versioning (chapter 08). The contract version is on every audit record. The dual-run window's traffic split is read out of the audit.
Drift (chapter 09). Investigations begin and end in the audit log.
Testing (chapter 10). Audit records are the source for recorded-fixture pacts.

The audit log is the substrate; everything else writes to it and reads from it.

How to recognise broken observability in the wild

Incident reconstruction requires more than one log source to be cross-referenced
The compliance team has to ask engineering for ad-hoc data pulls every time
The audit log is the same store as application logs, with the same access pattern
Sensitive values are visible in audit records without redaction
Idempotency dedup hits are not recorded
Scope denials are not recorded
Raw upstream payloads are not captured (only the contract's structured view)
Tracing stops at the agent runtime; the contract layer's calls do not appear in the trace
Retention is "indefinite" because nobody designed the policy

Interview Q&A

Q1. The team logs every tool call to the same application log as everything else. What is the problem?

Several. The application log's access pattern (read by engineers debugging recent issues) is different from the audit's access pattern (read by compliance or security under specific queries). The application log's retention is short; the audit's is long, often regulatory. Application logs are mutable in practice (rotate, delete on incident, retroactive redaction); audit logs must be append-only. Application logs are noisy; audit needs a clean per-call structured record. Mixing them produces incident reconstructions that miss data and compliance queries that cannot be answered. The fix is a separate audit log with its own storage, access controls, retention, and append-only semantics.

Wrong-answer notes: "we have ELK already" misses that the audit's properties are not what application log infrastructure provides.

Q2. Why must redaction happen before write, not at query time?

Because a stored sensitive value is a breach waiting to happen. An engineer with read access to the storage can read the raw value. A backup of the storage carries the raw value. A leak of the storage exposes the raw value. Redacting at query time gives the appearance of safety while keeping the underlying data; the moment any of the above fails, the data is exposed. Redacting at write means the storage never sees the value, so there is nothing to leak.

Wrong-answer notes: "we control read access" overestimates control; defence in depth means storage does not have the value at all.

Q3. Walk through how you would investigate an incident where the agent appears to have charged a customer twice.

Pull the conversation_id or trace_id. Query the audit log for all tool calls on that trace. Look at the issue_refund/charge_card calls and count distinct idempotency keys. If two different keys, the issue is upstream of the contract (workflow re-decided, agent re-called, or key not propagated correctly through retries). If one key, the dedup record should show as "executed" once and "dedup_hit" the rest; if it shows "executed" twice on the same key, the dedup store had a failure. Check the upstream system's own audit for the same payment_id; cross-reference. The reconstruction is fast if the audit is well-designed; impossible if it isn't.

Wrong-answer notes: "ask the model what it did" is not investigation.
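The key-counting step in Q3 can be expressed as a small triage function over the records for one trace. This is a sketch with hypothetical names and verdict labels; it covers only the contract-layer half of the investigation, not the cross-reference against the upstream system's own audit.

```python
from collections import Counter


def diagnose_duplicate(records: list, trace_id: str, tool: str) -> dict:
    """Classify a suspected duplicate from idempotency keys and results on one trace."""
    executed: Counter = Counter()
    keys = set()
    for record in records:
        if record["caller"]["trace_id"] != trace_id or record["tool"] != tool:
            continue
        key = record["idempotency"]["key"]
        keys.add(key)
        if record["idempotency"]["result"] == "executed":
            executed[key] += 1

    repeated = sorted(key for key, count in executed.items() if count > 1)
    if repeated:
        verdict = "dedup_store_failure"  # same key executed more than once
    elif len(keys) > 1:
        verdict = "distinct_keys_upstream_of_contract"  # caller minted a second key
    else:
        verdict = "no_duplicate_at_contract_layer"
    return {
        "distinct_keys": sorted(keys),
        "executed_more_than_once": repeated,
        "verdict": verdict,
    }
```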

Q4. The contract layer is supposed to redact email addresses. Where would a bad implementation leak them anyway?

Several leak paths to check. (a) The structured error returned to the model may contain the email if the translator pulled it from the upstream message. (b) The raw upstream request_payload may contain the email if redaction was only applied to the contract's arguments field. (c) The downstream system's own audit (if your audit pipeline ingests it) may contain the email. (d) Tracing spans may include the email in a tag. The audit pipeline must redact at every entry point, not just the obvious one. A regular audit of "search audit storage for @ characters in fields that should not have emails" catches leaks.

Wrong-answer notes: "we strip emails in the contract layer" without verifying all the paths is exactly the failure pattern.
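The periodic leak check in Q4 amounts to walking every stored record and reporting the paths whose values still look like email addresses. A minimal sketch, assuming a simplified email pattern and a hypothetical `find_email_leaks` helper:

```python
import re

# Simplified pattern for illustration; it favours catching leaks over strict validity.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def find_email_leaks(node, path: str = "") -> list:
    """Return dotted paths of every string value that contains an email-like token."""
    leaks = []
    if isinstance(node, dict):
        for key, value in node.items():
            leaks.extend(find_email_leaks(value, f"{path}.{key}" if path else key))
    elif isinstance(node, list):
        for index, value in enumerate(node):
            leaks.extend(find_email_leaks(value, f"{path}[{index}]"))
    elif isinstance(node, str) and EMAIL_RE.search(node):
        leaks.append(path)
    return leaks
```

Because the walk covers the whole record, it reaches the upstream payloads and error messages as well as the contract's arguments, which are leak paths (a) and (b) above.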

What to do differently after reading this

Stand up the audit log as its own pipeline, separate from application logs. Define its schema. Make it append-only.
Per-call records should include all fields from the example above. Audit the audit: are records complete? Set a "records-missing-fields" metric.
Redact at write, not at read. Audit the redaction.
Build the on-call dashboard with the signals from the live-monitoring table. The on-call should not have to write a query to see drift, denials, or latency regressions.
Implement replay tooling — at minimum, read-only replay; ideally dry-run replay against current state.
Set retention windows per class. Make the retention job a tested, audited process.

Bridge. Eleven chapters have built the discipline surface by surface. The last two synthesise it. The next chapter is the architect's checklist — twenty items, in design / build / launch / operate order, that distinguish a tool you can defend from a tool you cannot. → 12-architect-checklist.md