06. Observability retrofitted¶

Prompts are in the registry. Now you need to see what the system is doing. Most inherited AI systems have minimal observability — application logs, maybe a request log, no model-call audit, no eval dashboards. This chapter is the retrofit: instrumentation that goes in without breaking what works.

A platform engineer at a Hyderabad legal-tech company is asked to investigate why the contract-review agent occasionally takes thirty seconds to respond when it usually takes five. The application log says "request started" and "request completed thirty-two seconds later." There is no record of which model was called, how many tokens were used, whether tool calls happened, or whether the model retried. The engineer has nothing to investigate with. After three days of trying, she concludes that she cannot diagnose latency outliers from the data available, and the retrofit becomes the priority. Two weeks later, with per-call traces in place, the same investigation takes twenty minutes: the latency outliers correlate with a specific model fallback path triggered by a downstream tool's timeout.

The cost of operating without observability is real. The retrofit is straightforward; the discipline is doing it on a running system without disrupting users.

What observability is for, in this context¶

Three concrete uses:

Incident reconstruction. Given a complaint, find the call and reproduce the path.
Drift detection. Notice when the system's distributional behaviour shifts (latency, output length, error rate).
Eval feedback. Sample production calls into the eval set so coverage grows from real traffic.

A retrofit that produces these three is enough. More ambitious goals (full distributed tracing, per-tenant dashboards) come later, after the basics are in place.

What to add, in order¶

Four layers, each cheap, each adding capability.

Layer 1 — Per-call audit record¶

The single highest-leverage addition. Every model call produces a structured record:

{
  "audit_id": "aud_...",
  "ts": "2026-05-25T11:14:02.371Z",
  "duration_ms": 712,
  "model_alias": "smart-summariser",
  "model_used": { "provider": "anthropic", "version": "claude-sonnet-4-6" },
  "tenant_id": "acme-corp",
  "feature": "contract-review",
  "prompt_name": "review_clause",
  "prompt_version": "1.0.0",
  "usage": { "input_tokens": 1200, "output_tokens": 312 },
  "result": { "ok": true, "stop_reason": "end_turn" },
  "trace_id": "trc_..."
}

Wherever the system calls the model, the call goes through a thin wrapper that emits this record. The record lands in append-only storage with a queryable index.

The retrofit: identify every call site (the audit from chapter 02 enumerated them), introduce a wrapper, route call sites through the wrapper. The wrapper emits the audit asynchronously so it does not add latency to the call.

Layer 2 — Trace correlation¶

Adopt W3C Trace Context: every incoming request carries a trace_id in its traceparent header, or is assigned one at the edge if the header is absent. The trace propagates through the request handler, the model call, any tool calls, and the response. The audit record from layer 1 includes the trace_id.

The retrofit: add trace propagation middleware. Most tracing stacks provide it (OpenTelemetry SDKs, Datadog APM, and similar). The retrofit is configuring it, not writing it from scratch.

Once traces work, the incident-reconstruction workflow becomes: "what was trace_id X?" → query traces → see the full path of the request including the model call and its surrounding context.

Layer 3 — Aggregated metrics¶

Per-call audits aggregate into metrics:

Calls per minute by feature
Latency p50/p95/p99 by model and feature
Error rate by code
Token usage by tenant and feature
Cost per call

A small streaming pipeline (or a daily batch) rolls audit records into time-series for dashboards. The metrics drive drift detection — when p95 latency or output-token distribution shifts, the dashboard surfaces it.
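As a rough illustration of the batch variant, the sketch below rolls a list of audit records into per-feature headline numbers. It assumes records shaped like the layer 1 example; `rollup` and `percentile` are hypothetical names, and the nearest-rank percentile is a simplification that suits a dashboard more than precise reporting.

```python
import math
from collections import defaultdict


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list (pct is an integer 1..100)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def rollup(records):
    """Group audit records by feature and compute headline metrics."""
    by_feature = defaultdict(list)
    for record in records:
        by_feature[record["feature"]].append(record)

    summary = {}
    for feature, rows in by_feature.items():
        durations = [r["duration_ms"] for r in rows]
        errors = sum(1 for r in rows if not r["result"]["ok"])
        summary[feature] = {
            "calls": len(rows),
            "p50_ms": percentile(durations, 50),
            "p95_ms": percentile(durations, 95),
            "error_rate": errors / len(rows),
            "output_tokens": sum(r["usage"]["output_tokens"] for r in rows),
        }
    return summary
```

Running this over consecutive time windows and comparing the results is the simplest form of the drift check described above.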

Layer 4 — Sample-for-eval¶

A small fraction of calls (1–5%) is captured fully (including the prompt as it was rendered) into a sample store. The sample feeds:

Eval-set expansion (new cases from real traffic)
Production-traffic eval (running the eval rubric against real calls)
Manual review (engineers reading samples to understand the system's behaviour)

Privacy applies: redact user-identifying fields before storage; the sample is for behavioural analysis, not personal data.
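One way to pick the sample is to hash the trace ID, so that a given trace is always in or always out of the sample regardless of which process handles it. The sketch below assumes that approach; `in_sample`, `redact`, and `capture_sample` are hypothetical names, the 2% rate is an arbitrary point inside the 1–5% range, and the email-only redaction is a placeholder for a real redaction policy.

```python
import hashlib
import re

SAMPLE_RATE = 0.02  # assumption: any value in the 1-5% range from the text

# Deliberately simple pattern; a production redactor needs far more than this.
_EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def in_sample(trace_id: str, rate: float = SAMPLE_RATE) -> bool:
    """Deterministic sampling: hash the trace ID into [0, 1) and compare."""
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < rate


def redact(text: str) -> str:
    """Illustrative only: masks email addresses and nothing else."""
    return _EMAIL.sub("[EMAIL]", text)


def capture_sample(trace_id, rendered_prompt, response_text, store,
                   rate=SAMPLE_RATE):
    """Append a redacted full-content sample to `store` if the trace is selected."""
    if not in_sample(trace_id, rate):
        return False
    store.append({
        "trace_id": trace_id,
        "prompt": redact(rendered_prompt),
        "response": redact(response_text),
    })
    return True
```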

Without breaking the running system¶

The retrofit changes the code in every call site. The safety properties:

Wrapper introduces no behaviour change. The wrapper around each model call must call the model identically to before, with the same parameters. The audit emission is additive.

Async emission. The audit record is queued and emitted asynchronously. The model call's latency is unaffected. If the audit pipeline is down, calls continue; emission resumes when the pipeline returns, and any records lost in the gap should be counted rather than silently dropped.

Feature-flag the rollout. Wrap one call site, then ramp the wrapper's flag 1% → 10% → 50% → 100%. Watch the eval (chapter 03) and the latency metrics at each step. Roll back on any regression.

One call site at a time. Even within a feature, instrumenting one wrapper at a time prevents wide-blast regressions.

Run the eval throughout. The eval set is the safety net. If scores drift after instrumentation, something is wrong with the wrapper. Investigate before continuing.
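The ramp-and-check loop can be made mechanical. The sketch below assumes eval scores are per-metric numbers where higher is better, and that a drop of more than a fixed tolerance counts as a regression; `regressions`, `next_stage`, and the 0.02 tolerance are hypothetical choices, not values from the chapter.

```python
ROLLOUT_STAGES = [1, 10, 50, 100]  # percent of traffic, as in the ramp above


def regressions(baseline: dict, current: dict, tolerance: float = 0.02) -> list:
    """Names of eval metrics that fell more than `tolerance` below baseline."""
    return sorted(
        name for name, base in baseline.items()
        if current.get(name, 0.0) < base - tolerance
    )


def next_stage(current_pct: int, baseline: dict, current: dict) -> int:
    """Advance the flag one stage, or return 0 (roll back) on any regression."""
    if regressions(baseline, current):
        return 0
    later = [stage for stage in ROLLOUT_STAGES if stage > current_pct]
    return later[0] if later else current_pct
```

A metric missing from the current run is treated as a score of zero, so a broken eval run blocks the rollout rather than passing it.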

What to capture, and what not to¶

A reasonable capture list:

Model alias and concrete model_used
Timestamps, duration, latency breakdown if available
Token usage (input, output, cache reads/writes)
Result shape (ok/error, stop reason)
Caller identity (tenant, feature, agent identity)
Prompt name and version
Trace and conversation IDs

What to be careful about:

Prompt body. Capturing the rendered prompt is high-value for incident reconstruction and high-risk if the prompt contains user data. Sample-only (layer 4) rather than every-call (layer 1) is the typical compromise.
Response content. Same trade-off. The sample store keeps it; the every-call audit captures only metadata.
User identifiers. Hashed or pseudonymised in the audit; never raw email/phone in audit fields.
Tool call arguments. May contain sensitive data; treat them like response content.

The redaction policy is part of the audit pipeline, applied at write time. Chapter 11 of module 19 covers the discipline; this chapter is the retrofit application of it.
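For the user-identifier rule, a keyed hash is one common way to pseudonymise: the same user always maps to the same token, so audit records can still be grouped per user, but the token cannot be recomputed from a list of known emails without the key. The sketch below assumes HMAC-SHA256 with a secret held outside the audit store; `pseudonymise` and the `usr_` prefix are hypothetical.

```python
import hashlib
import hmac


def pseudonymise(user_identifier: str, key: bytes) -> str:
    """Map an email or phone number to a stable, non-reversible token."""
    normalised = user_identifier.strip().lower().encode("utf-8")
    digest = hmac.new(key, normalised, hashlib.sha256).hexdigest()
    return f"usr_{digest[:16]}"  # truncated for readability in audit records
```

A plain unkeyed hash would be weaker here, because anyone holding a list of candidate emails could hash them and match the results against the audit.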

The first dashboard¶

After layers 1 and 3 are in place, the first dashboard surfaces:

+---------------------------------------------------+
|  Contract-review feature — last 24h               |
+---------------------------------------------------+
|  Calls:                12,847                     |
|  p50 latency:          4.2s                       |
|  p95 latency:          7.1s    (alarm > 10s)      |
|  Error rate:           0.8%                       |
|  Cost / 1k calls:      $3.21                      |
|  Top failure modes:    UPSTREAM_TIMEOUT (62%)     |
|                        PARSING_ERROR (18%)        |
|                        VALIDATION_FAIL (12%)      |
|  Prompt distribution:  review_clause:1.0.0 (98%)  |
|                        review_clause:0.9.0 (2%)   |
+---------------------------------------------------+

A handful of rows is enough. Each row tells a story; deviations from baseline are interesting. The dashboard is what the on-call reads first when something feels off.
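Two of the rows above are simple to derive from the audit records. The sketch below shows a failure-mode breakdown and a threshold check; it assumes error codes are available as a flat list and that metrics and thresholds share key names. `failure_breakdown`, `alarms`, and the key `p95_latency_s` are hypothetical.

```python
from collections import Counter


def failure_breakdown(error_codes):
    """Return (code, whole-number percent) pairs, most frequent first."""
    counts = Counter(error_codes)
    total = sum(counts.values())
    return [(code, round(100 * n / total)) for code, n in counts.most_common()]


def alarms(metrics: dict, thresholds: dict) -> list:
    """Names of metrics whose current value exceeds its alarm threshold."""
    return [name for name, limit in thresholds.items()
            if metrics.get(name, 0) > limit]
```

With the figures from the dashboard, `alarms({"p95_latency_s": 7.1}, {"p95_latency_s": 10.0})` returns an empty list, since 7.1s is under the 10s alarm line.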

What to do when the team objects¶

Some teams resist observability retrofits on the grounds that "the system is fine, we don't need to add complexity." Three responses.

The retrofit is additive. It does not change the system's behaviour. The eval (chapter 03) verifies this. The cost is engineering time, not production risk.

Incidents will happen; recovery is faster with observability. The thirty-second-latency example from the chapter's opening is real: without instrumentation, three days of investigation ended with no diagnosis; with it, the same question took twenty minutes. The retrofit pays for itself the first time the on-call uses it.

The next change requires observability to verify. If you cannot see the system's behaviour, you cannot verify that a change preserved it. The eval is one verification layer; observability is another. Both are needed for the structural changes coming in chapter 07.

Common mistakes¶

Trying to instrument everything at once. A big-bang instrumentation patch is a wide-blast change. One call site, one wrapper, one rollout at a time.

Capturing too much. A capture policy that includes raw user data, raw responses, and full prompt bodies in every audit record produces a privacy and storage problem. Sample for high-detail capture; record only metadata for everything else.

Adding synchronous audit emission. Sync emission adds latency. Use async pipelines (queue or batch); accept eventual consistency on the audit reads.

Reinventing tracing. Adopt W3C trace context and an existing tracing tool. Build the audit layer on top. Custom tracing is a maintenance burden.

Skipping the dashboard. Audits without dashboards are debug logs. The dashboard is what makes the audit actionable.

What this enables for the next chapters¶

Once layers 1–3 are in place:

The strangler migration (chapter 07) can verify that new components produce the same audit shape as old ones.
The model migration (chapter 08) reads from the audit to find every call site using the retiring model.
The 30-60-90 plan (chapter 11) reports progress using dashboard metrics.
The stakeholder communications (chapter 10) cite the dashboard for "the system is healthy" claims.
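The model-migration lookup, for example, reduces to a filter over audit records. The sketch below assumes records shaped like the layer 1 example and treats a (feature, prompt name) pair as a stand-in for a call site; `call_sites_using` is a hypothetical name.

```python
from collections import Counter


def call_sites_using(records, provider: str, version: str):
    """Count calls per (feature, prompt_name) that hit the given model."""
    sites = Counter()
    for record in records:
        used = record.get("model_used", {})
        if used.get("provider") == provider and used.get("version") == version:
            sites[(record["feature"], record["prompt_name"])] += 1
    return sites.most_common()  # busiest call sites first
```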

Observability is the connective tissue across the rest of the modernisation.

Interview Q&A¶

Q1. The system has minimal observability. What is the first thing you add? The per-call audit record (layer 1). Every model call wrapped to emit a structured record with timestamps, model used, tokens, result, trace_id. Async emission so it does not add latency. This single addition unlocks incident reconstruction, drift detection later, and the eval-sample feed. Trace correlation and dashboards come next; they sit on top of the audit. Wrong-answer notes: "tracing" without per-call audit gives request-path visibility but not behaviour visibility; "metrics dashboards" without raw audit is summary without detail.

Q2. How do you retrofit observability without changing the system's behaviour? The wrapper around each model call must pass the model byte-identical input and return byte-identical output to the caller. The audit emission is additive and asynchronous. The eval (chapter 03) verifies behavioural equivalence: scores match the pre-instrumentation baseline within tolerance. Feature-flag the rollout, watch the metrics, roll back on any regression. The discipline is the same as any other retrofit: additive, verifiable, reversible. Wrong-answer notes: "we'll add tracing inline" can change latency or error behaviour subtly; the async discipline is load-bearing.

Q3. What's the privacy concern with capturing full prompt and response in every audit record? The prompt likely contains user data — the user's question, the document being processed, the customer's account details. Storing that on every call multiplies the privacy surface enormously. The compromise: every-call audit captures metadata (tokens, model, timing); a small sample (1–5%) captures the full content with redaction, in a separate store with stricter access controls. The metadata supports drift detection and cost attribution; the sample supports eval and review. Wrong-answer notes: "capture everything for completeness" produces a breach waiting to happen.

Q4. The on-call complains they cannot diagnose AI-module incidents from the audit alone. What does that suggest? Either the audit is missing fields (e.g., it does not capture which prompt version was used, or the tool calls inside the request), or the dashboard does not surface the relevant signals. The diagnostic question to walk: "what fact do you wish you had when investigating?" That fact's source is the gap. Add it to the audit or the dashboard. The retrofit is iterative — the first version is enough to start, but it evolves with the on-call's actual needs. Wrong-answer notes: "build a better dashboard" without diagnosing what is missing is shooting in the dark.

What to do differently after reading this¶

Wrap every model call. The wrapper emits a structured audit asynchronously.
Adopt W3C trace context. Use an existing tracing tool; do not roll your own.
Build the first dashboard from the audit aggregations. A handful of rows is enough to start.
Sample 1–5% of calls fully for the eval feed and for manual review.
Add observability gates to your structural-work CI: changes that drop coverage of any audit field fail review.
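The CI gate in the last item could be as small as the check below, run against audit records produced by a test suite. The field list is taken from the layer 1 example record; `missing_fields` and the idea of failing the build when the result is non-empty are assumptions about how a team might wire it up.

```python
REQUIRED_FIELDS = (
    "audit_id", "ts", "duration_ms", "model_alias", "model_used",
    "tenant_id", "feature", "prompt_name", "prompt_version",
    "usage", "result", "trace_id",
)


def missing_fields(records) -> dict:
    """Map each required field to the number of records that lack it."""
    gaps = {}
    for record in records:
        for field in REQUIRED_FIELDS:
            if record.get(field) is None:
                gaps[field] = gaps.get(field, 0) + 1
    return gaps
```

A CI step would collect the records emitted during the test run and fail if `missing_fields(records)` is not empty.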

Bridge. With observability in place, the system is operable — you can run it, investigate incidents, and verify changes. The structural work now scales up. The next chapter is the largest move of the modernisation: the strangler migration that replaces the system one boundary at a time, with old and new running in parallel. → 07-strangler-migration.md