08. Monitoring and Observability — Traces, Dashboards, Alerting on Quality Drift¶

~11 min read. Production AI systems degrade quietly; the only defence is instrumentation that tells you before users do.

Builds on the ELI5 in 00-eli5.md. The inspection (the evaluation suite) runs before deployment. After deployment, monitoring is that same inspection running continuously. Silent degradation is the production risk that pre-deployment evals cannot catch on their own.

The three failure modes of production AI systems¶

See. AI systems fail in production in three distinct ways. Each requires different instrumentation.

Failure mode 1: Sudden crash. The LLM API returns a 500. The retriever times out. Something throws an exception. This is easy to catch. Your existing error-monitoring catches it. Use Sentry, Datadog, or CloudWatch.

Failure mode 2: Quality drift. The model still responds. No exception is thrown. But answers slowly get worse. The causes vary: the knowledge base was updated, the provider changed the embedding model, user query patterns shifted, or the model's output format started drifting. No crash. No alert. Users complain in week three.

Failure mode 3: Silent wrong answers. The system responds, the format is valid, and the answer is confidently wrong. The chunk retrieved was stale or irrelevant. The model hallucinated a fact. Users acted on wrong information before anyone noticed.

Monitoring must catch all three. Most teams only instrument Failure Mode 1.

Observability: three pillars for AI systems¶

Classic observability has three pillars: logs, metrics, traces. For AI systems, we extend each.

┌─────────────────────────────────────────────────────────────────┐
│  Logs                                                           │
│  - Full prompt sent to model (redacted of PII)                  │
│  - Full model response                                          │
│  - Retriever results: top-k chunks and their scores             │
│  - LLM judge score if running async eval                        │
├─────────────────────────────────────────────────────────────────┤
│  Metrics                                                        │
│  - Latency: retrieval, embedding, LLM call, total pipeline      │
│  - Cost: tokens in, tokens out, cost per call                   │
│  - Quality: running LLM judge score (7-day rolling average)     │
│  - Retrieval: mean retrieval score, zero-result rate            │
├─────────────────────────────────────────────────────────────────┤
│  Traces                                                         │
│  - Span for each component: retrieve → assemble → generate      │
│  - Parent trace ID links all spans for one user request         │
│  - Error rates and p95 latency per span                         │
└─────────────────────────────────────────────────────────────────┘

Look. Every user request should produce a trace that you can replay. If a user reports a wrong answer, you pull the trace and see exactly what was retrieved, what was assembled, and what the model produced.
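A replayable trace can be sketched in a few lines. This is a minimal illustration using only the standard library, not the API of any tracing product. The `Trace` and `Span` classes, the `handle_request` function, and the stubbed retriever and model outputs are all hypothetical stand-ins.

```python
import time
import uuid
from contextlib import contextmanager
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Span:
    trace_id: str                 # parent trace ID shared by all spans of one request
    name: str                     # component name: retrieve, assemble, generate
    started_at: float
    duration_ms: float = 0.0
    attributes: dict = field(default_factory=dict)
    error: Optional[str] = None


class Trace:
    """Collects the spans for one user request under a shared trace ID."""

    def __init__(self) -> None:
        self.trace_id = uuid.uuid4().hex
        self.spans: list = []

    @contextmanager
    def span(self, name: str):
        span = Span(trace_id=self.trace_id, name=name, started_at=time.time())
        start = time.perf_counter()
        try:
            yield span
        except Exception as exc:
            span.error = repr(exc)   # record the failure, then let it propagate
            raise
        finally:
            span.duration_ms = (time.perf_counter() - start) * 1000
            self.spans.append(span)


def handle_request(query: str) -> Trace:
    trace = Trace()
    with trace.span("retrieve") as s:
        chunks = [("kb-article-12", 0.83)]   # stand-in for a real retriever call
        s.attributes["chunks"] = chunks
    with trace.span("assemble") as s:
        s.attributes["prompt"] = f"Context: {chunks[0][0]}\nQuestion: {query}"
    with trace.span("generate") as s:
        s.attributes["response"] = "stub answer"   # stand-in for the LLM call
    return trace
```

Storing `trace.spans` keyed by `trace_id` is what makes the later replay possible: each span carries what that component saw and produced.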

The quality dashboard: what to display¶

A minimal quality dashboard for an AI system has exactly six panels.

Panel 1:  Pipeline latency p50 / p95 / p99 (line chart, 7-day)
Panel 2:  Cost per request, total daily cost (line chart, 7-day)
Panel 3:  LLM judge quality score (rolling 7-day average)
Panel 4:  Retrieval zero-result rate (% of queries with no good chunks)
Panel 5:  Format error rate (% of responses that failed schema validation)
Panel 6:  Error rate (% of requests that threw an exception)

Simple, no? Six panels. No vanity metrics. Each panel has a threshold line showing the acceptable range. Any panel crossing a threshold fires an alert.
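As a rough sketch, most of these panel values can be computed from per-request records. The `RequestRecord` fields and the `panel_values` function below are assumptions made for illustration. Panel 3 (the judge score) is left out because it depends on a separate judge pipeline.

```python
from dataclasses import dataclass
from statistics import quantiles


@dataclass
class RequestRecord:
    latency_ms: float
    cost_usd: float
    retrieved_chunks: int      # chunks that passed the relevance cutoff
    format_ok: bool            # response passed schema validation
    raised_exception: bool


def panel_values(records: list) -> dict:
    """Compute dashboard values for one time window (needs at least 2 records)."""
    cuts = quantiles([r.latency_ms for r in records], n=100)  # 99 percentile cut points
    n = len(records)
    return {
        "latency_p50_ms": cuts[49],
        "latency_p95_ms": cuts[94],
        "latency_p99_ms": cuts[98],
        "cost_per_request_usd": sum(r.cost_usd for r in records) / n,
        "zero_result_rate": sum(r.retrieved_chunks == 0 for r in records) / n,
        "format_error_rate": sum(not r.format_ok for r in records) / n,
        "error_rate": sum(r.raised_exception for r in records) / n,
    }
```

In practice a metrics backend would compute these aggregations; the sketch only shows what each panel measures.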

Worked example: detecting quality drift¶

It is week four of production. Your system has served 4 200 requests. You look at Panel 3: LLM judge quality score.

Week 1: 0.88
Week 2: 0.87  ←  minor drop, within noise
Week 3: 0.81  ←  notable drop, investigate
Week 4: 0.74  ←  alert fires (threshold: 0.80)

Threshold breach on week 4. You investigate.
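The weekly readings above can be classified with a small check. This is a sketch under assumptions: the 0.80 alert threshold comes from the example, while the 0.05 week-over-week drop used to flag "investigate" is a hypothetical value chosen for illustration.

```python
WEEKLY_JUDGE_SCORE = {1: 0.88, 2: 0.87, 3: 0.81, 4: 0.74}
ALERT_THRESHOLD = 0.80      # from the worked example
INVESTIGATE_DROP = 0.05     # hypothetical: week-over-week drop worth a look


def classify_weeks(scores: dict, threshold: float, investigate_drop: float) -> dict:
    """Label each week as ok, investigate, or alert."""
    statuses = {}
    previous = None
    for week in sorted(scores):
        score = scores[week]
        if score < threshold:
            statuses[week] = "alert"
        elif previous is not None and previous - score >= investigate_drop:
            statuses[week] = "investigate"
        else:
            statuses[week] = "ok"
        previous = score
    return statuses


print(classify_weeks(WEEKLY_JUDGE_SCORE, ALERT_THRESHOLD, INVESTIGATE_DROP))
# {1: 'ok', 2: 'ok', 3: 'investigate', 4: 'alert'}
```

The drop check is what turns week 3 into an early warning instead of waiting for the threshold breach in week 4.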

Step 1: Check Panel 4 (retrieval zero-result rate). It jumped from 3% to 14% in week 3.

Step 2: Pull traces from the quality-dropping queries. Many queries are about a new product line launched in week 3. The KB has no articles about the new product line.

Step 3: Root cause: knowledge base lag. The new product launched, and KB articles were written but not yet ingested. The ingestion pipeline was a nightly batch job, and the articles sat in draft for three days.

Fix: add a trigger to ingest on KB article publication, not just on schedule. Outcome: retrieval zero-result rate drops to 4%. Quality score recovers to 0.86 by week 5.

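See. This failure was not a crash. It was a data pipeline timing issue (broken plumbing). Only monitoring caught it. Users noticed in week 3, but the quality alert did not fire until week 4. The retrieval zero-result rate had already jumped to 14% in week 3, so an alert on that metric could have fired a week earlier.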

Alerting strategy: alert on symptoms, not components¶

Many teams alert on infrastructure (CPU, memory, disk). For AI systems, alert on quality symptoms.

Alert 1: LLM judge quality score 7-day average < 0.80  →  Page on-call
Alert 2: Pipeline p95 latency > 1 200 ms (150% of SLA)  →  Page on-call
Alert 3: Cost per request > $0.004 (2× budget)          →  Slack warning
Alert 4: Retrieval zero-result rate > 10%               →  Slack warning
Alert 5: Format error rate > 5%                         →  Slack warning
Alert 6: Exception rate > 2%                            →  Page on-call

Alerts 1, 2, 6: wake someone up. These are service-level failures. Alerts 3, 4, 5: notify the team. These are quality risks that need investigation, not emergencies.
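The six alerts can be written down as data and evaluated against a metrics snapshot. A minimal sketch follows. The thresholds and routing come from the list above, while the `AlertRule` type, the metric key names, and the `evaluate` function are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class AlertRule:
    name: str
    metric: str                         # key in the metrics snapshot (assumed naming)
    breached: Callable[[float], bool]
    route: str                          # "page" wakes on-call, "slack" notifies the team


RULES = [
    AlertRule("judge score", "judge_score_7d_avg", lambda v: v < 0.80, "page"),
    AlertRule("p95 latency", "latency_p95_ms", lambda v: v > 1200, "page"),
    AlertRule("cost per request", "cost_per_request_usd", lambda v: v > 0.004, "slack"),
    AlertRule("zero-result rate", "zero_result_rate", lambda v: v > 0.10, "slack"),
    AlertRule("format error rate", "format_error_rate", lambda v: v > 0.05, "slack"),
    AlertRule("exception rate", "exception_rate", lambda v: v > 0.02, "page"),
]


def evaluate(metrics: dict) -> list:
    """Return (route, rule name) for every rule whose metric is present and breached."""
    return [
        (rule.route, rule.name)
        for rule in RULES
        if rule.metric in metrics and rule.breached(metrics[rule.metric])
    ]


# Week 3 of the drift example: judge score 0.81, zero-result rate 14%.
print(evaluate({"judge_score_7d_avg": 0.81, "zero_result_rate": 0.14}))
# [('slack', 'zero-result rate')]
```

Keeping the rules as data makes the page-versus-Slack decision reviewable in one place.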

Sampling strategy for large-scale logging¶

At high volume, logging every request is expensive. Use stratified sampling.

All requests:           log latency and cost metrics (100%)
1% of requests:         log full prompt + response + judge score
Requests with errors:   log everything (100%)
Requests with judge < 0.70: log everything (100%)

This gives you:
- Complete coverage of the performance metrics.
- Representative sample of the quality distribution.
- Full forensics on every failure and near-failure.
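The sampling rules can be expressed as one decision function. This is a sketch, not a prescribed implementation. The function name and parameters are hypothetical, and it assumes a judge score is available for the request when the low-score rule applies (it is treated as unknown otherwise).

```python
import random
from typing import Optional

FULL_LOG_SAMPLE_RATE = 0.01   # 1% of ordinary requests
JUDGE_FLOOR = 0.70            # below this, always keep the full record


def should_log_full(had_error: bool, judge_score: Optional[float], rng: random.Random) -> bool:
    """Decide whether to store full prompt + response for this request.

    Latency and cost metrics are assumed to be logged for every request elsewhere.
    """
    if had_error:
        return True
    if judge_score is not None and judge_score < JUDGE_FLOOR:
        return True
    return rng.random() < FULL_LOG_SAMPLE_RATE
```

Passing the random generator in explicitly keeps the function deterministic under test; production code would typically share one `random.Random()` instance.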

Where this lives in the wild¶

LangSmith (LangChain) — trace every LLM call in a chain; replay traces for debugging failed queries.
Weights & Biases (W&B) — tracks quality metrics over time; alerts on model quality regression after updates.
Datadog LLM Observability — spans per component, cost tracking, quality score over time.
Arize AI — production ML monitoring with LLM-specific quality drift detection.
Honeycomb — distributed tracing applied to LLM pipelines; enables "what happened to this specific request?" queries.

Pause and recall¶

Name the three failure modes of production AI systems. Which is the hardest to detect?
What are the six panels of a minimal quality dashboard?
In the drift example, what was the root cause of the week-3 quality drop?
What is stratified sampling in the context of logging?

Interview Q&A¶

Q: "How do you monitor an LLM application after deployment?"

A: I monitor at three levels. Infrastructure: latency, error rate, cost. Quality: rolling LLM judge score, retrieval zero-result rate, format error rate. Business: user task success rate if measurable. I set alert thresholds on all quality metrics and page on critical ones.

Common wrong answer to avoid: "I watch the API response codes." API success codes do not tell you anything about answer quality. A 200 response can contain a wrong answer.

Q: "What is quality drift and how do you detect it early?"

A: Quality drift is gradual degradation of output quality without any crashes or errors. Detection: run an LLM judge on a sample of live traffic and track the score as a 7-day rolling average. Watching for a downward trend before the score crosses the alert threshold gives you early warning.

Common wrong answer to avoid: "I rely on user complaints." User complaints lag the actual degradation by days or weeks. By the time complaints arrive, many users have already been harmed.

Q: "How do you debug a production AI system giving wrong answers?"

A: I pull the trace for the failing request. I inspect each span: what did the retriever return, what was assembled into the context, what did the model produce. I check whether the retrieved chunks were actually relevant to the query. Most wrong answers trace to a retrieval failure, not a model failure.

Common wrong answer to avoid: "I re-run the query in a Jupyter notebook." The notebook environment is not the same as production; it does not reproduce retrieval state.
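The first triage step on a pulled trace can be sketched as a function. The trace layout (a dict with a `retrieved` list of chunks carrying a `score`) and the 0.5 relevance cutoff are assumptions for illustration only.

```python
def diagnose(trace: dict, min_score: float = 0.5) -> str:
    """First-pass triage: was the wrong answer likely a retrieval failure?"""
    chunks = trace.get("retrieved", [])
    if not chunks:
        return "retrieval failure: nothing retrieved"
    best = max(chunk["score"] for chunk in chunks)
    if best < min_score:
        return "retrieval failure: low relevance"
    return "retrieval looks fine: inspect assembly and generation"
```

A score above the cutoff does not prove the chunk was relevant or fresh, so a human still reads the retrieved text before blaming the model.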

Q: "How do you avoid logging sensitive user data while still enabling debugging?"

A: I log with redaction. I apply PII detection before storing any logged text, replacing sensitive fields (names, emails, account numbers) with placeholder tokens. I log structured metadata (query category, retrieval score, latency) at 100%, and log redacted full text at a 1% sample rate.

Common wrong answer to avoid: "I don't log anything to be safe." No logging means no debugging capability. The risk is not in logging — it is in logging unredacted PII.
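Redaction before storage can be sketched with two regular expressions. This is a deliberately narrow illustration: the account-number pattern (8 to 16 digits) is a hypothetical format, and regexes like these do not catch names, which generally need a dedicated PII detection step.

```python
import re

EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
ACCOUNT = re.compile(r"\b\d{8,16}\b")   # hypothetical account-number format


def redact(text: str) -> str:
    """Replace emails and account-like digit runs with placeholder tokens."""
    text = EMAIL.sub("<EMAIL>", text)
    text = ACCOUNT.sub("<ACCOUNT>", text)
    return text


print(redact("Contact jane.doe@example.com about account 12345678"))
# Contact <EMAIL> about account <ACCOUNT>
```

The redaction runs before the text is written anywhere, so the unredacted prompt never reaches the log store.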

Apply now (5 min)¶

Design the monitoring plan for your capstone. List six metrics you will track. Set a threshold for each. Decide: which failures page someone, and which send a Slack notification? Design your logging sampling strategy.

Sketch from memory: Draw the three-pillar observability table (Logs, Metrics, Traces). Fill in two items per pillar.

Bridge. Monitoring is running. But monitoring only tells you when something is wrong; it does not fix it. The blueprint told you the cost SLA. Now we engineer the system to meet it. → 09-cost-latency-management.md