19. Data privacy and retention — observe enough to debug, not enough to betray users¶

~15 min read. The hardest observability question is often not what to capture, but what to keep out.

Built on the ELI5 in 00-eli5.md. The evidence tag — the useful label on a clue — must be designed carefully so the case file stays helpful without leaking private information.

Traces love detail, privacy hates excess¶

Observability rewards rich context; privacy punishes unnecessary context. That tension is permanent. An LLM trace may include raw prompts, retrieved documents, tool outputs, user identifiers, and billing data — very useful for debugging, very risky to store blindly.

helpful to engineers                    risky to store blindly
┌─────────────────────────┐             ┌─────────────────────────┐
│ prompt version          │             │ raw email address       │
│ model name              │             │ full customer transcript│
│ token counts            │             │ payment details         │
│ retrieval doc ids       │             │ medical notes           │
└─────────────────────────┘             └─────────────────────────┘

Not everything observable should be retained. The case file needs principle, not greed.

Decide what belongs in traces¶

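Keep structural facts by default: trace ID, span name, duration, status, model version, prompt version, token counts, cost estimate, tool name, and document IDs. These are usually enough to answer most operational questions: where the time went, what the call cost, which versions ran, and which documents were retrieved.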

Treat raw content separately: raw prompt text, raw retrieved chunks, raw tool payloads, and raw user messages. These may need redaction, encryption, shorter retention, or complete omission. The case board rarely needs raw content. Individual investigators sometimes do. Design for that difference.
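One way to make that difference concrete is an allowlist that splits a span's attributes in two before anything is stored. The sketch below is illustrative only: the field names in STRUCTURAL_FIELDS and the SplitSpan type are assumptions, not a standard schema.

```python
from dataclasses import dataclass, field
from typing import Any

# Hypothetical allowlist of structural fields that are safe to keep by default.
STRUCTURAL_FIELDS = {
    "trace_id", "span_name", "duration_ms", "status",
    "model_version", "prompt_version", "input_tokens",
    "cost_estimate_usd", "tool_name", "retrieved_doc_ids",
}


@dataclass
class SplitSpan:
    # Structural facts: fine for the general tracing backend.
    metadata: dict[str, Any] = field(default_factory=dict)
    # Everything else: needs redaction, encryption, short retention, or omission.
    raw_content: dict[str, Any] = field(default_factory=dict)


def split_span(attributes: dict[str, Any]) -> SplitSpan:
    """Route allowlisted fields to metadata; treat every other field as raw content."""
    result = SplitSpan()
    for key, value in attributes.items():
        target = result.metadata if key in STRUCTURAL_FIELDS else result.raw_content
        target[key] = value
    return result


span = split_span({
    "trace_id": "t_001",
    "span_name": "llm.generate",
    "duration_ms": 840,
    "prompt_version": "billing-v5",
    "raw_prompt": "Hi, my email is jane@example.com and I was charged twice.",
})
print(sorted(span.metadata))
print(sorted(span.raw_content))
```

The allowlist is the design choice that matters here: a field nobody classified lands on the raw-content side, so forgetting to classify something fails safe.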

Worked example: redacting a support transcript¶

Suppose a support copilot answers a billing question. The original prompt contains the customer name, email address, invoice number, credit-card last four digits, and a problem description.

Blind tracing would store all of it. Bad move. Instead the system records:

user_hash = u_9a81
invoice_id = inv_4412
prompt_version = billing-v5
input_tokens = 1804
retrieved_doc_ids = [policy_17, refund_03]
pii_redacted = true

For deeper debugging, a privileged reviewer may access a short-lived secure payload store. That store expires quickly and is audited. The main case file still remains useful. You can see the path, cost, versions, and retrieved docs. You do not need the raw email in every observability backend. See the tradeoff?
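The recorded fields above could be produced by something like the following. It is a minimal sketch, assuming a keyed hash (HMAC-SHA256) for the user identifier and a request dictionary with hypothetical keys. The functions pseudonymize and build_span_attributes are illustrative names, and the hash will not match the u_9a81 value shown above.

```python
import hashlib
import hmac

# Assumption: in production this key comes from a secret manager, not source code.
HASH_KEY = b"replace-with-a-managed-secret"


def pseudonymize(identifier: str) -> str:
    """Return a stable keyed hash so one user maps to one token without storing the email."""
    normalized = identifier.strip().lower().encode("utf-8")
    digest = hmac.new(HASH_KEY, normalized, hashlib.sha256).hexdigest()
    return "u_" + digest[:12]


def build_span_attributes(request: dict) -> dict:
    """Copy only the evidence tags; name, email, card digits, and free text are left out."""
    return {
        "user_hash": pseudonymize(request["email"]),
        "invoice_id": request["invoice_id"],
        "prompt_version": request["prompt_version"],
        "input_tokens": request["input_tokens"],
        "retrieved_doc_ids": list(request["retrieved_doc_ids"]),
        "pii_redacted": True,
    }


# Hypothetical incoming request for the billing copilot.
request = {
    "customer_name": "Jane Doe",
    "email": "jane@example.com",
    "invoice_id": "inv_4412",
    "card_last4": "4242",
    "problem": "I was charged twice this month.",
    "prompt_version": "billing-v5",
    "input_tokens": 1804,
    "retrieved_doc_ids": ["policy_17", "refund_03"],
}
attrs = build_span_attributes(request)
print(attrs)
```

A keyed hash is used rather than a plain one because email addresses are guessable: without the key, anyone holding the traces could hash candidate addresses and compare.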

Redaction should happen before export¶

Now what is the common mistake? Teams send raw payloads to the tracing vendor. Later they promise to clean them. This is backwards. Redact before export when possible. Hash identifiers. Mask email addresses. Drop payment fields. Summarize large tool outputs. Then emit the span.

Why so strict? Because once copied into many sinks, data becomes hard to control. Dashboards, traces, logs, support tools, warehouses. One careless raw field spreads everywhere. The evidence tag should travel safely. The dangerous payload should travel rarely.
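The redact-then-emit order can be sketched as a single function that every span passes through before it reaches an exporter. This is an illustration under stated assumptions: the payment field names and the tool_output key are hypothetical, and the email regex is deliberately simple. A real pipeline would more likely use a dedicated PII detector than one regex.

```python
import re

# Simple email pattern for illustration; it will miss unusual addresses.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PAYMENT_FIELDS = {"card_number", "card_last4", "cvv"}  # hypothetical field names
MAX_TOOL_OUTPUT_CHARS = 200


def redact_before_export(attributes: dict) -> dict:
    """Return a copy that is safer to hand to an exporter; the input is not modified."""
    safe = {}
    for key, value in attributes.items():
        if key in PAYMENT_FIELDS:
            continue  # drop payment fields entirely
        if isinstance(value, str):
            value = EMAIL_RE.sub("[EMAIL]", value)  # mask email addresses
            if key == "tool_output" and len(value) > MAX_TOOL_OUTPUT_CHARS:
                # Replace a large payload with a size summary instead of its content.
                value = f"[truncated tool output, {len(value)} chars]"
        safe[key] = value
    safe["pii_redacted"] = True
    return safe


raw = {
    "note": "contact jane@example.com about the refund",
    "cvv": "123",
    "tool_output": "x" * 500,
    "input_tokens": 1804,
}
print(redact_before_export(raw))
```

The function returns a new dictionary so the unredacted original never becomes the object that gets emitted by accident.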

Retention should match value¶

Not all telemetry deserves the same lifetime. Hot traces for incident response may need 7 to 14 days. Aggregated metrics may need months. Complaint-linked trace summaries may need longer if support and quality teams rely on them. Raw content might need only hours or days. This is retention design.

A simple tiering model works well.

tier 1: metrics summaries     keep 12 months
tier 2: trace metadata        keep 30 days
tier 3: raw prompt payloads   keep 3 days, tightly restricted

Retention should follow debugging value and privacy risk, not convenience alone.
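The tiering model above translates almost directly into a retention table and a purge pass. The sketch below assumes each record carries a kind and a created_at timestamp, and it approximates 12 months as 365 days. All names are illustrative.

```python
from datetime import datetime, timedelta, timezone

# Retention windows from the tiering model; 12 months is approximated as 365 days.
RETENTION = {
    "metrics_summary": timedelta(days=365),
    "trace_metadata": timedelta(days=30),
    "raw_prompt_payload": timedelta(days=3),
}


def is_expired(kind: str, created_at: datetime, now: datetime) -> bool:
    """A record expires once its age exceeds the window for its kind."""
    return now - created_at > RETENTION[kind]


def purge(records: list[dict], now: datetime) -> list[dict]:
    """Return only the records still inside their retention window."""
    return [r for r in records if not is_expired(r["kind"], r["created_at"], now)]


now = datetime.now(timezone.utc)
records = [
    {"kind": "raw_prompt_payload", "created_at": now - timedelta(days=10)},
    {"kind": "trace_metadata", "created_at": now - timedelta(days=10)},
    {"kind": "metrics_summary", "created_at": now - timedelta(days=10)},
]
print([r["kind"] for r in purge(records, now)])
```

An unknown kind raises a KeyError rather than defaulting to some window, so a new record type cannot quietly inherit indefinite retention.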

Sampling is part of privacy and cost control¶

Tracing everything forever is expensive and risky. Sample smartly. Keep all error traces. Keep all complaint-linked traces. Sample a small percent of healthy traffic. Sample more for new rollouts. Sample less for stable paths. This balances visibility with cost and exposure.

The senior move is conditional sampling. If error=true, keep full metadata. If complaint=true, preserve the case file. If high_cost=true, keep the trace. Else maybe keep 5 percent. Now the case board stays representative without over-collecting.
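Those rules fit in a few lines. The sketch below is one possible reading of them: the 5 percent rate comes from the text, while the 50 percent rate for new rollouts and the flag names on the trace dictionary are assumptions made for illustration.

```python
import random

HEALTHY_SAMPLE_RATE = 0.05  # "maybe keep 5 percent" of routine healthy traffic


def should_keep(trace: dict, rng: random.Random, new_rollout_rate: float = 0.5) -> bool:
    """Always keep error, complaint, and high-cost traces; sample everything else."""
    if trace.get("error") or trace.get("complaint") or trace.get("high_cost"):
        return True
    # Assumed policy: sample new rollouts more heavily than stable paths.
    rate = new_rollout_rate if trace.get("new_rollout") else HEALTHY_SAMPLE_RATE
    return rng.random() < rate


rng = random.Random(42)  # seeded only so the demo is repeatable
traces = [{"error": True}, {"complaint": True}, {"high_cost": True}] + [{} for _ in range(1000)]
kept = [t for t in traces if should_keep(t, rng)]
print(f"kept {len(kept)} of {len(traces)} traces")
```

In a distributed system the decision is usually derived from the trace ID rather than a random draw, so every service makes the same keep-or-drop choice for the same trace. The random generator here keeps the example short.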

Access control matters as much as redaction¶

Even safe-ish traces can reveal behavior patterns. So who can open which case file? Support may need complaint-linked summaries. ML may need prompt versions and output formats. Only a small group should access raw content stores. Every access should be auditable.

Observability data is production data in disguise. Treat it with the same seriousness — that is the real maturity marker.
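A field-level view per role, with every read written to an audit log, might look like the sketch below. The roles, field names, and in-memory audit_log list are hypothetical; a real system would enforce this in the storage layer and write audit events somewhere tamper-resistant.

```python
from datetime import datetime, timezone

# Hypothetical policy: which fields each role may see.
ROLE_FIELDS = {
    "support": {"trace_id", "status", "complaint_summary"},
    "ml": {"trace_id", "status", "prompt_version", "output_format"},
    "privacy_reviewer": {"trace_id", "status", "prompt_version", "raw_prompt"},
}
RAW_FIELDS = {"raw_prompt"}
audit_log: list[dict] = []  # stand-in for a durable audit sink


def read_trace(trace: dict, role: str, reader: str) -> dict:
    """Return only the fields the role may see, and record the access."""
    allowed = ROLE_FIELDS.get(role, set())  # unknown roles see nothing
    visible = {k: v for k, v in trace.items() if k in allowed}
    audit_log.append({
        "reader": reader,
        "role": role,
        "trace_id": trace.get("trace_id"),
        "fields": sorted(visible),
        "raw_access": any(k in RAW_FIELDS for k in visible),
        "at": datetime.now(timezone.utc).isoformat(),
    })
    return visible


trace = {
    "trace_id": "t_001",
    "status": "ok",
    "prompt_version": "billing-v5",
    "complaint_summary": "charged twice",
    "raw_prompt": "full customer text",
}
print(read_trace(trace, "support", "alice"))
print(audit_log[-1]["raw_access"])
```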

Privacy and retention patterns for trace data¶

Klarna assistant — stores trace metadata broadly but gates raw payment-related tool payloads behind audited access; the role is splitting metadata-retention from content-retention.
Intercom Fin — redacts customer emails and phone numbers before exporting traces; the role is enforcing redaction at instrumentation time, before export rather than after it.
Notion AI — keeps workspace-level trace metadata longer than raw prompt snippets used for short-term debugging; the role is tiered retention by debugging value.
GitHub Copilot Enterprise — samples healthy code-completion traces but preserves complaint-linked and error traces aggressively; the role is conditional sampling as a privacy-and-cost lever.
Healthcare LLM assistant teams — hash patient identifiers and limit raw transcript retention to tightly controlled stores; the role is mapping HIPAA requirements onto trace schema.
GDPR Article 17 (right to erasure) on LLM traces — explicit user-deletion workflows on retained traces; the role is forcing every observability schema to support per-user deletion.
HIPAA-compliant LLM logging via AWS Bedrock BAA — Business Associate Agreements covering trace storage; the role is making compliant LLM logging a procurement question, not a coding question.
Microsoft Presidio — open-source PII detection and redaction; the role is the canonical PII-redaction library for trace pipelines.
Lakera Guard PII filter — runtime PII detection on prompts and outputs; the role is making redaction a perimeter defense.
Anthropic's data-retention policy — 30-day default retention with zero-data-retention enterprise tier; the role is exposing how vendor retention couples with customer compliance.
OpenAI zero-data-retention (ZDR) tier — opt-out of training and retention for enterprise customers; the role is making vendor retention an explicit contract knob.
LangSmith PII filters — built-in redaction at ingest; the role is making redaction the default, not an opt-in.
Helicone redaction modes — content-redaction toggle per request; the role is giving developers per-request retention control.
LangFuse self-hosted retention — full control over retention windows on-prem; the role is enabling deployments where vendor-managed retention is forbidden.
GDPR DSAR (Data Subject Access Request) workflows — formal user-data-export pipelines; the role is forcing trace schemas to be queryable by user identity.
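OpenTelemetry GenAI privacy attributes — span attributes for redaction status (for example a custom attribute such as gen_ai.redacted=true; verify the name against the current semantic conventions before relying on it); the role is standardising redaction signaling across vendors.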
Snowflake/BigQuery long-term trace warehouse — analytics on aggregated metadata only; the role is enabling long-horizon analysis without holding raw content.
Vault-style secret stores for raw transcripts — encrypted at rest with audited access; the role is separating visible metadata from gated content stores.
AWS KMS-encrypted CloudTrail for Bedrock — audited access to LLM trace data; the role is exposing every read of sensitive trace content.
Datadog Sensitive Data Scanner — automatic PII detection in logs and traces; the role is catching unredacted PII at the platform layer.
Sentry data-scrubbing rules — server-side redaction before storage; the role is preventing accidental PII storage at ingest.
EU AI Act trace requirements — emerging regulatory minimums for high-risk AI logging; the role is making trace design a compliance-driven exercise, not just an engineering one.
Anthropic Constitutional AI red-team logs — adversarial trace handling with hashing; the role is showing how research labs structure even sensitive trace data.
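Several of the patterns above (right to erasure, DSAR workflows) share one schema requirement: retained traces must be findable by user. A minimal sketch follows, assuming traces carry a user_hash field and using an in-memory class, TraceStore, as a stand-in for a real backend. It shows the shape of the interface only; an actual erasure workflow must also reach every sink the data was copied to.

```python
from collections import defaultdict


class TraceStore:
    """In-memory stand-in for a trace backend indexed by pseudonymous user id."""

    def __init__(self) -> None:
        self._by_user: dict[str, list[dict]] = defaultdict(list)

    def add(self, trace: dict) -> None:
        self._by_user[trace["user_hash"]].append(trace)

    def export_user(self, user_hash: str) -> list[dict]:
        """Access-request path: return copies of everything held for one user."""
        return [dict(t) for t in self._by_user.get(user_hash, [])]

    def erase_user(self, user_hash: str) -> int:
        """Erasure path: delete the user's traces and report how many were removed."""
        return len(self._by_user.pop(user_hash, []))


store = TraceStore()
store.add({"user_hash": "u_a", "trace_id": "t1"})
store.add({"user_hash": "u_a", "trace_id": "t2"})
store.add({"user_hash": "u_b", "trace_id": "t3"})
print(store.erase_user("u_a"), store.export_user("u_a"))
```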

Recall — the privacy-vs-debug tradeoff and the conditional sample¶

Why is the richest possible trace often the wrong trace to keep?
Which kinds of fields are usually safe as default metadata?
In the worked example, how did redaction preserve debugging value?
Why should sampling strategy depend on errors, complaints, and rollout risk?

Interview Q&A¶

Q: Why should privacy controls be designed at instrumentation time and not after telemetry export? A: Once raw sensitive data spreads across sinks, cleanup becomes difficult and access risk multiplies quickly. Common wrong answer to avoid: "Because redaction after export is computationally slower."

Q: Why keep trace metadata longer than raw prompt payloads in many systems? A: Metadata often preserves enough operational value for trend analysis and incident review while carrying much lower privacy risk. Common wrong answer to avoid: "Because raw prompts are never useful for debugging."

Q: Why is conditional sampling better than uniform sampling for LLM observability? A: Error, complaint, and rollout traces contain far more debugging value than routine healthy traffic, so keeping them disproportionately improves signal. Common wrong answer to avoid: "Uniform sampling is always statistically pure, so it is always operationally best."

Q: Why are access controls part of observability design, not just security policy? A: The usefulness and safety of telemetry depend on who can read which fields, so access shapes how instrumentation should be structured. Common wrong answer to avoid: "If data is inside the observability tool, engineers should all see it by default."

Apply now (10 min)¶

Step 1 — model the exercise. Here is the three-tier retention ladder I would build for the refund chatbot:

| Field | Keep in trace metadata (90d) | Store briefly in secure payload store (7d) | Do not store |
|---|---|---|---|
| trace_id, span_id, latency, status | ✓ | | |
| model version, prompt version, tool name | ✓ | | |
| token counts, cost estimate | ✓ | | |
| customer tier, intent, language | ✓ | | |
| retrieval doc_ids (not contents) | ✓ | | |
| redacted prompt (PII removed) | ✓ | | |
| raw prompt with PII | | ✓ (encrypted, audited access) | |
| raw tool output (full customer transcript) | | ✓ | |
| raw payment card number, CVV | | | ✓ |
| raw medical notes | | | ✓ |
| user email, phone | hashed only | full only with audit | |

Tier 1 is the case file every engineer can read. Tier 2 is the gated content store. Tier 3 never enters the system.
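The ladder can also be enforced in code, so that a field with no declared tier cannot ship. The sketch below encodes a subset of the table with hypothetical field names; the Tier enum and route function are illustrative, and the hashed-email row is left out for brevity.

```python
from enum import Enum


class Tier(Enum):
    METADATA = "trace metadata (90d)"
    SECURE_STORE = "secure payload store (7d)"
    DO_NOT_STORE = "do not store"


# Hypothetical field names mirroring a subset of the table rows.
FIELD_TIERS = {
    "trace_id": Tier.METADATA,
    "status": Tier.METADATA,
    "prompt_version": Tier.METADATA,
    "token_count": Tier.METADATA,
    "retrieval_doc_ids": Tier.METADATA,
    "redacted_prompt": Tier.METADATA,
    "raw_prompt": Tier.SECURE_STORE,
    "raw_tool_output": Tier.SECURE_STORE,
    "card_number": Tier.DO_NOT_STORE,
    "cvv": Tier.DO_NOT_STORE,
    "medical_notes": Tier.DO_NOT_STORE,
}


def route(record: dict) -> dict:
    """Split a record by tier, drop tier-3 fields, and reject undeclared fields."""
    routed = {Tier.METADATA: {}, Tier.SECURE_STORE: {}}
    for key, value in record.items():
        tier = FIELD_TIERS.get(key)
        if tier is None:
            raise KeyError(f"field {key!r} has no declared tier")
        if tier is not Tier.DO_NOT_STORE:
            routed[tier][key] = value
    return routed


routed = route({"trace_id": "t1", "raw_prompt": "hello", "cvv": "123"})
print(routed[Tier.METADATA], routed[Tier.SECURE_STORE])
```

Raising on an undeclared field is the point of the exercise: the retention decision has to be made when the field is introduced.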

Step 2 — your turn. Make the same three columns. Place ten candidate fields from your AI product into them. For each tier-2 entry, write one line on who is allowed to open it and what triggers an audit log.

Step 3 — reproduce from memory. Draw the three-tier retention ladder. Label which part of the case file is long-lived and which part expires quickly. Write one sentence on why the evidence tag is safer than the raw payload.

What you should remember¶

This chapter explained why the richest possible trace is rarely the right trace to keep. Observability rewards detail; privacy punishes excess; retention sits in the middle and must follow debugging value, not convenience. The discipline is three layers: tier-1 metadata that engineers see freely, tier-2 raw content gated behind audited access, tier-3 fields that never enter the system at all.

You also learned that redaction must happen at instrumentation time, before export rather than after it. Once raw PII has spread across sinks, cleanup becomes a multi-quarter program. The hash-or-token pattern preserves debugging value (you can still reconstruct sessions, follow user trajectories, count complaints) while removing the field that would otherwise force the case file into a vault.

Carry this diagnostic forward: when somebody proposes capturing a new field, ask which tier it belongs to before it ships. If the answer is "we'll figure out retention later", the field will end up in tier 1 forever and become a compliance problem the day the regulator asks.

Remember:

Observability data is production data in disguise. Apply production seriousness to it.
Three-tier retention: metadata long-lived, raw content briefly gated, sensitive fields never stored.
Redact at instrumentation, before export rather than after. Once raw data has spread across sinks, the spread is effectively irreversible.
Conditional sampling — keep all errors and complaints, sample healthy traffic — preserves the case file value at a fraction of the storage cost.
Access control matters as much as redaction. The case file is only safe if a small audited group can open the deep folders.

Bridge. We now know how to instrument responsibly without leaking user data. Last, we must be honest about the parts of agent debugging that no trace, no eval, and no postmortem template can fully solve — yet. → 20-honest-admission.md