
12. Alerting & dashboards — turning ten thousand traces into one glance and one page

~18 min read. A trace tells you what happened to one request. A dashboard tells you what happened to ten thousand. An alert tells you which of those ten thousand should wake somebody up.

Builds on the ELI5 in 00-eli5.md. The kitchen log keeps every dish's history. Dashboards hang the gauges where the kitchen manager can see them. Alerts are the smoke detector that screams when the oil hits the pan.


What chapter 11 gave us and what it cannot answer alone

Chapter 11 instrumented the refund chatbot so every request leaves a trace — span IDs, retrieval hops, tool calls, latencies, costs, judge verdicts, all linked back to one request ID. Per-request visibility is now solved. You can pull any conversation by ID and reconstruct exactly what the model saw and what it did.

Per-request visibility is necessary and entirely insufficient. At a thousand chats an hour, no human will open traces one at a time to find the cluster that is suddenly burning money or returning policy-incorrect refunds. The pressure that remains is aggregation — we need to see ten thousand requests as one shape and have software wake the right person when that shape deforms in a way humans must repair. Chapter 1 made the rule: a quality claim covers only the population it sampled. This chapter turns that rule into rolling, automated samples that page on collapse. Chapter 10 gave us A/B comparisons across versions; here we monitor the currently shipped version against itself over time.

What this file solves

A refund chatbot can stay healthy by every spot check on Tuesday and silently rot on Wednesday — a model swap upstream, a slow retrieval index, a policy-doc change nobody mentioned, a flag flip in feature config. This file turns the kitchen log into two specific dashboards and an alert hierarchy. You will leave with the five-signal default, an SLO sentence that names the workload, a paging policy that distinguishes a 3am wake-up from a Monday ticket, and one inspectable Prometheus alert rule for rolling pass-rate that you can paste into a real cluster.

Why a dashboard has to exist before an alert does

The temptation is to skip dashboards and go straight to alerts. We will just page when something is wrong. It feels lean. It is wrong, because "something is wrong" is an inference that requires a baseline, and a baseline lives on a dashboard. Without one, every alert threshold is a guess, and every page is either fatigue (too tight) or silence (too loose).

A dashboard is the place where a human builds the model of "normal" that an alert later automates. Pass rate dipping from 84% to 81% on Tuesday afternoon is meaningless until you have seen what Tuesday afternoons usually look like. P95 latency at 3.2 seconds is meaningless until you have seen the diurnal shape. The dashboard is the curve. The alert is the threshold drawn on the curve. Build the curve first.
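One way to make "what Tuesday afternoons usually look like" concrete is to bucket past samples by weekday and hour before judging a new reading. The sketch below is only an illustration: the function names and the sample pass rates are hypothetical, and a real dashboard would read from its metrics store rather than a list.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean, pstdev


def seasonal_baseline(history):
    """Group past samples by (weekday, hour) and summarise each bucket.

    history: iterable of (datetime, value) pairs.
    Returns {(weekday, hour): (mean, population_stdev)}.
    """
    buckets = defaultdict(list)
    for ts, value in history:
        buckets[(ts.weekday(), ts.hour)].append(value)
    return {key: (mean(vals), pstdev(vals)) for key, vals in buckets.items()}


def compare_to_baseline(baseline, ts, value):
    """Return (delta_from_usual, usual_spread) for one new sample."""
    usual, spread = baseline[(ts.weekday(), ts.hour)]
    return value - usual, spread


# Four earlier Tuesdays at 15:00 (hypothetical pass rates).
history = [
    (datetime(2024, 1, 2, 15), 0.84),
    (datetime(2024, 1, 9, 15), 0.83),
    (datetime(2024, 1, 16, 15), 0.85),
    (datetime(2024, 1, 23, 15), 0.84),
]
baseline = seasonal_baseline(history)
delta, spread = compare_to_baseline(baseline, datetime(2024, 1, 30, 15), 0.81)
print(f"delta={delta:+.3f}, usual spread={spread:.4f}")
```

With these made-up numbers the 81% reading sits about three points below the usual Tuesday value while the usual spread is under one point, which is the kind of comparison a threshold later automates.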

The naive repair, the visible break, the diagnosis

The first instinct after an incident is "add an alert for this exact thing." Six months later the team has 140 alerts and acknowledges none of them. The on-call engineer auto-mutes the channel. The next real incident arrives in a wall of green-yellow noise and nobody notices for forty minutes.

The second instinct is "build one big dashboard with everything on it." The big dashboard gets bookmarked, looked at once, never opened again, and quietly diverges from reality as panels break and nobody fixes them. Six months later half the queries return no data.

Not an alert-count problem. Not a dashboard-completeness problem. An attention-budget problem. Operators have a finite amount of attention per shift, and every panel and every page spends a slice of it. So how do we design a small number of gauges that earn their attention, and an even smaller number of alerts that earn an interruption?

The five-signal default before anything bespoke

Before you invent custom panels, the refund chatbot — and almost every production LLM app — needs the same five signals on the first screen. They cover the four kinds of pain users feel and the one cost finance feels.

┌────────────────────────────────────────────────────────────────┐
│ refund-chatbot  ·  prod  ·  last 24h  ·  refresh 30s           │
├────────────────────────────────────────────────────────────────┤
│ p95 latency        │  error rate       │  tool success         │
│ 2.1s ▁▂▂▃▂▂▂       │  0.4% ▁▁▁▁▂▁▁     │  97.8% ▇▇▇▇▆▇▇        │
│ slo: <3.5s         │  slo: <1%         │  slo: >95%            │
├────────────────────┴───────────────────┴───────────────────────┤
│ cost / request     │  rolling quality (judge pass-rate, 7d)    │
│ $0.018 ▁▂▂▂▂▂▃     │  84% ▆▇▇▇▇▆▆ ◀── enterprise slice: 71% ⚠ │
│ slo: <$0.03        │  slo: ≥80% all, ≥75% enterprise           │
└────────────────────────────────────────────────────────────────┘

These are the gauges, not the dashboard. Latency catches infrastructure pain. Error rate catches stack pain — timeouts, 5xx, schema validation rejects. Tool success catches the integration layer — refund API, CRM lookup, policy retrieval. Cost catches finance pain and also the silent shape of "the model is suddenly retrying three times per request." Rolling quality is the one signal that no traditional APM tool gives you for free: it is the kitchen log scored by a judge or by sampled human labelling, summarised as a rolling pass rate over a window.

The fifth signal is the chapter's whole point. The first four can all stay green while quality collapses. That is the shape of every AI-specific outage that surprises a team using only conventional SRE tooling.
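As a rough sketch of how the five gauges could be derived from per-request records, the snippet below computes each one from a list of hypothetical `RequestRecord` objects. The field names are assumptions made for illustration, not the trace schema from chapter 11, and the percentile uses a simple nearest-rank method.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RequestRecord:
    latency_s: float            # end-to-end latency at the API edge
    ok: bool                    # False for a non-2xx or schema-rejected call
    tool_calls: int             # refund-API calls attempted
    tool_successes: int         # calls that succeeded on first attempt
    cost_usd: float
    judge_pass: Optional[bool]  # None when the request was not judged


def p95(values):
    """Nearest-rank 95th percentile, using integer arithmetic for the rank."""
    ordered = sorted(values)
    rank = (95 * len(ordered) + 99) // 100  # ceil(0.95 * n)
    return ordered[rank - 1]


def ratio(numerator, denominator):
    """Return None instead of dividing by zero when there is no data."""
    return numerator / denominator if denominator else None


def five_signals(records):
    judged = [r for r in records if r.judge_pass is not None]
    return {
        "p95_latency_s": p95([r.latency_s for r in records]),
        "error_rate": ratio(sum(not r.ok for r in records), len(records)),
        "tool_success": ratio(sum(r.tool_successes for r in records),
                              sum(r.tool_calls for r in records)),
        "cost_per_request": ratio(sum(r.cost_usd for r in records), len(records)),
        "judge_pass_rate": ratio(sum(r.judge_pass for r in judged), len(judged)),
    }


# 20 hypothetical requests: infrastructure looks fine, half the judged ones fail.
records = [
    RequestRecord(latency_s=1.0 + 0.05 * i, ok=True, tool_calls=1,
                  tool_successes=1, cost_usd=0.018,
                  judge_pass=(i % 2 == 0) if i < 10 else None)
    for i in range(20)
]
signals = five_signals(records)
print(signals)
```

In this invented sample, latency, errors, tool success, and cost all sit inside their targets while the judged pass rate is 50%, which is the shape the fifth gauge exists to expose.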

Teacher voice. The first four signals are what your platform team has been monitoring for a decade. The fifth signal is why your team exists. If your dashboard is missing it, you are running an AI product on infrastructure metrics.

SLOs in one sentence each

Every signal earns its place by mapping to a sentence a stranger could grade against. The refund chatbot's four user-facing signals:

  • Latency. 95% of /refund chats answer in under 3.5 seconds end-to-end (the p95 gauge on the first screen), measured at the API edge, over a rolling 30-day window.
  • Error rate. Fewer than 1% of /refund chats end in a non-2xx response or a schema-rejected tool call, over a rolling 7-day window.
  • Tool success. At least 95% of refund-API calls return success on first attempt, over a rolling 24-hour window.
  • Quality. At least 80% of judged samples are policy-correct overall, and at least 75% on the enterprise slice, over a rolling 7-day window.

The grammar is consistent: what percentage of which population in what time window? If a stakeholder cannot tell you what window your number covers, the number is decoration. Notice the windows differ. Latency and quality move slowly — long windows. Tool success can collapse in minutes — short window. Picking the window is half the SLO design.
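The "what percentage of which population in what time window" grammar can be held as data so that no SLO is written down without its window. This is a minimal sketch; the `SLO` class and its field names are hypothetical, not part of any monitoring product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SLO:
    name: str
    target: float      # required fraction of good events, e.g. 0.95
    population: str    # which events are counted
    window: str        # rolling window the fraction is measured over

    def sentence(self) -> str:
        return (f"{self.name}: at least {self.target:.0%} of {self.population} "
                f"are good over a rolling {self.window} window.")

    def is_met(self, good: int, total: int) -> bool:
        if total == 0:
            raise ValueError("no events in window; the SLO cannot be graded")
        return good / total >= self.target


tool_success = SLO(
    name="Tool success",
    target=0.95,
    population="refund-API calls (first attempt)",
    window="24-hour",
)
print(tool_success.sentence())
print(tool_success.is_met(good=978, total=1000))
```

Making `window` a required field is the design point: an SLO object that cannot be constructed without a window cannot become decoration.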

Mini-FAQ. "Why not one window for everything?" Because failure shapes have different characteristic timescales. A retrieval-index outage shows up in tool success in five minutes. A model-swap drift in quality takes two days to be statistically visible. One window either flaps on the fast signal or buries the slow one.

The three-level alert hierarchy

Not every breach is a wake-up. Three explicit levels keep the on-call rotation alive.

   trigger condition                level        what happens
   ─────────────────────            ──────       ─────────────────────────────
   tool success < 90% for 10m       PAGE         PagerDuty → on-call phone
   p95 > 5.0s for 15m               PAGE         PagerDuty → on-call phone
   rolling 7d pass-rate < 80%       TICKET       Jira ticket, next business day
   cost/request > $0.04 for 1h      TICKET       Slack #refund-bot-ops
   enterprise pass-rate < 78%       DASHBOARD    visible, no notification
   any slice drops >3pp wk-over-wk  DASHBOARD    visible, weekly review

A page interrupts a human's sleep. It must imply an action that cannot wait until morning — money is being lost, customers are being harmed, an SLO is burning through its error budget at a rate that will exhaust it in hours. A ticket is for things that are wrong but tolerable until tomorrow's standup. A dashboard signal is for trends that inform priorities without demanding immediate action.

The implicit rule: every page must have a runbook. If a page fires and the responder does not know what to do, the page should not have existed. Either write the runbook or demote the alert. This is the single most effective discipline against alert fatigue, and the one teams skip first.
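The runbook rule is easy to enforce mechanically. The sketch below assumes alerts have already been loaded into plain dictionaries shaped like the `labels` and `annotations` of a Prometheus rule; the `lint_alerts` helper is hypothetical, and the second alert name is made up for the example.

```python
def lint_alerts(alerts):
    """Split alerts into (kept, demoted).

    A page-severity alert without a runbook annotation is copied with its
    severity lowered to "ticket"; everything else is kept unchanged.
    """
    kept, demoted = [], []
    for alert in alerts:
        is_page = alert["labels"].get("severity") == "page"
        has_runbook = bool(alert.get("annotations", {}).get("runbook"))
        if is_page and not has_runbook:
            lowered = {**alert, "labels": {**alert["labels"], "severity": "ticket"}}
            demoted.append(lowered)
        else:
            kept.append(alert)
    return kept, demoted


alerts = [
    {"alert": "RefundChatbotEnterpriseQualityCollapse",
     "labels": {"severity": "page"},
     "annotations": {"runbook": "https://wiki/runbooks/refund-bot-enterprise-collapse"}},
    {"alert": "SomePageWithoutRunbook",  # hypothetical alert name
     "labels": {"severity": "page"},
     "annotations": {}},
]
kept, demoted = lint_alerts(alerts)
print([a["alert"] for a in kept], [a["alert"] for a in demoted])
```

Run as a check in the pipeline that ships alert rules, something like this would block a page from existing before its runbook does.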

Teacher voice. "Not more alerts, but stronger ones." The page list should be small enough that every engineer on rotation has memorised it. If your page list is longer than your home address, it is a notification list, not an alert list.

A concrete rolling pass-rate alert — the chapter's signature artifact

The fifth signal is the one most teams cannot express in their monitoring tool because it depends on the kitchen log scored by a judge, not on infrastructure counters. Here is the rule, written for Prometheus with metrics emitted by a LangSmith-shaped judge sidecar.

# /etc/prometheus/rules/refund_chatbot_quality.yml
groups:
  - name: refund_chatbot_quality
    interval: 5m
    rules:
      # Numerator and denominator come from the judge sidecar
      # llm_judge_verdict_total{verdict="pass"|"fail", slice="..."}
      - record: refund:judge_pass_rate_7d
        expr: |
          sum by (slice) (
            rate(llm_judge_verdict_total{
              app="refund-chatbot",
              verdict="pass"
            }[7d])
          )
          /
          sum by (slice) (
            rate(llm_judge_verdict_total{
              app="refund-chatbot"
            }[7d])
          )

      - alert: RefundChatbotQualityBelowSLO
        expr: refund:judge_pass_rate_7d{slice="all"} < 0.80
        for: 30m
        labels:
          severity: ticket
          team: refund-bot
        annotations:
          summary: "Refund-bot 7d pass-rate {{ $value | humanizePercentage }} < 80% SLO"
          runbook: "https://wiki/runbooks/refund-bot-quality"
          dashboard: "https://grafana/d/refund-bot/overview"
          description: |
            Judge pass-rate on the all-traffic slice has been below the 80% SLO
            for 30 minutes. This is a ticket, not a page — quality drifts on
            day-scales. Open the linked dashboard, inspect the slice panel
            to find which slice collapsed, then pull 20 failing trace IDs
            from the linked LangSmith view and triage by failure shape.

      - alert: RefundChatbotEnterpriseQualityCollapse
        expr: refund:judge_pass_rate_7d{slice="enterprise"} < 0.65
        for: 15m
        labels:
          severity: page
          team: refund-bot
        annotations:
          summary: "Enterprise pass-rate {{ $value | humanizePercentage }} < 65% paging floor"
          runbook: "https://wiki/runbooks/refund-bot-enterprise-collapse"
          description: |
            Enterprise slice has dropped well below the 75% SLO floor. This
            slice carries revenue and contractual penalties. Treat as P1.
            First action: roll back to last known-good prompt+model pair
            (see runbook step 1). Do not debug in production traffic.

Three details worth pausing on. The recording rule does the expensive rolling-window math once and caches it; the alert rule evaluates a cheap threshold against the cached value. The for: 30m clause prevents flapping on the noisy edges of a 7-day window. The two alerts on the same metric have different severities because the same number means different things on different slices — the enterprise slice collapsing is a paging event even when the overall number still looks fine. This is the slice-aware paging chapter 1 promised.
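To see what the recording rule and the `for:` clause do, here is a simplified Python model of both. It is a sketch only: it counts verdicts directly instead of using counter rates, and the pending/firing logic is an approximation of how a rule with a hold duration behaves, not Prometheus code.

```python
from collections import Counter


def pass_rate_by_slice(verdicts):
    """verdicts: iterable of (slice_name, "pass" | "fail") inside the window."""
    passed, total = Counter(), Counter()
    for slice_name, verdict in verdicts:
        total[slice_name] += 1
        if verdict == "pass":
            passed[slice_name] += 1
    return {name: passed[name] / total[name] for name in total}


def evaluate_with_for(samples, threshold, for_minutes):
    """samples: time-ordered (minute, value) pairs, one per rule evaluation.

    The alert is "pending" while the value is below the threshold and only
    becomes "firing" once that has held for at least for_minutes.
    Any evaluation above the threshold resets the clock.
    """
    pending_since = None
    states = []
    for minute, value in samples:
        if value < threshold:
            if pending_since is None:
                pending_since = minute
            held = minute - pending_since
            states.append((minute, "firing" if held >= for_minutes else "pending"))
        else:
            pending_since = None
            states.append((minute, "inactive"))
    return states


# One brief dip that recovers, then a sustained breach (hypothetical values).
values = [0.81, 0.79, 0.79, 0.81] + [0.79] * 8
samples = [(5 * i, v) for i, v in enumerate(values)]
states = evaluate_with_for(samples, threshold=0.80, for_minutes=30)
first_firing = next(minute for minute, state in states if state == "firing")
print(first_firing)
```

In this trace the dip at minutes 5 to 10 never fires because the value recovers at minute 15; the sustained breach starting at minute 20 fires at minute 50, thirty minutes after it began.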

For the refund chatbot, this is the alert that catches the failure no infrastructure dashboard can see — a prompt-template change that silently makes the bot more confident and more wrong on edge cases. Latency unchanged. Errors unchanged. Tool success unchanged. Pass rate slides from 84% to 71% over 36 hours. Without this rule, the team finds out from a customer escalation. With it, the ticket fires at 79.9% and someone is already looking before 78%.

The two dashboards every team needs

A common confusion is to build one giant dashboard for both launches and on-call. They want opposite things. Launch wants width — every slice, every cohort, every comparison against baseline. On-call wants depth on a tiny number of signals with one-click drill-down.

   LAUNCH-DAY DASHBOARD                   ONCALL DASHBOARD
   ─────────────────────                   ─────────────────────
   purpose: is the new version OK?         purpose: is the system OK NOW?
   audience: PM, eng lead, ML, ops         audience: on-call engineer at 3am
   refresh: 1 min                          refresh: 30s, autoplay
   layout: 20+ panels, slice tables        layout: 5 big signals top, drill-downs below
   time window: last 4h vs baseline        time window: last 1h, last 24h
   shows: per-slice pass-rate deltas       shows: SLO burn, active alerts, recent errors
   key affordance: A/B comparison toggle   key affordance: one click → failing traces
   active for: 24h after a release         active for: every shift, always

The on-call dashboard is the one the alert annotation links to. Five big numbers, sparklines, and one click to the failing traces. The launch-day dashboard exists to answer "did the v2.3 prompt rollout regress anything?" and gets retired after release stabilises. They share data sources, not layout.

The single-dashboard pathology is what kills teams. The launch dashboard's 20 panels become the on-call view. At 3am, the on-call engineer scans 20 panels looking for the one that matters and misses it for ten minutes. Separation is not bureaucracy; it is reaction time.

Static thresholds, anomaly detection, and SLO burn-rate alerts

Three ways to decide when a number has gone bad. They are not interchangeable.

Static thresholds. "Alert when p95 > 5s for 15 min." Cheapest to set up, easiest to reason about, easiest to explain in an incident review. Wrong when the metric has strong diurnal or weekly seasonality — Sunday midnight is not Monday noon. Right when the SLO is a hard contract ("3.5s ceiling") and seasonality is mild.

Anomaly detection. "Alert when latency is more than 3σ from a moving baseline." Adapts to seasonality automatically. Catches change, which is often what you actually care about. Wrong when the baseline itself has been drifting bad — anomaly detection on a slowly rotting metric will never fire because the rot is "normal." Also fires on benign seasonality you forgot to teach the model. Right for cost and throughput where absolute thresholds make no sense across customer growth.

SLO burn-rate alerts. "Alert when the error budget for the next 30 days will be exhausted in under 6 hours at the current rate." The Google SRE pattern. It collapses the two questions ("how bad" and "how fast") into one. A small breach that lasts a long time and a big breach that lasts five minutes can both fire, because both threaten the budget. Right for hard-SLO services. Wrong when you do not actually have an error budget you are willing to enforce — burn-rate alerts on a fictional SLO produce real fatigue.
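The burn-rate arithmetic is short enough to sketch. The example assumes a hypothetical 99%-over-30-days SLO chosen only to keep the numbers round; the function names are illustrative and the calculation assumes the full budget is still available unless told otherwise.

```python
WINDOW_HOURS = 30 * 24  # 30-day SLO window


def burn_rate(bad, total, slo_target):
    """How many times faster than "exactly on budget" bad events arrive."""
    budget = 1.0 - slo_target
    return (bad / total) / budget


def hours_to_exhaustion(rate, window_hours, budget_remaining=1.0):
    """Hours until the remaining budget is gone if the current rate holds."""
    if rate <= 0:
        return float("inf")
    return budget_remaining * window_hours / rate


def budget_consumed(rate, duration_hours, window_hours):
    """Fraction of the whole window's budget used by one incident."""
    return rate * duration_hours / window_hours


# Sustained mild breach: 2% bad for three days.
mild = burn_rate(bad=20, total=1000, slo_target=0.99)
# Short hard breach: everything bad for five minutes.
hard = burn_rate(bad=1000, total=1000, slo_target=0.99)

print(round(mild, 2), round(budget_consumed(mild, 72, WINDOW_HOURS), 3))
print(round(hard, 2), round(budget_consumed(hard, 5 / 60, WINDOW_HOURS), 4))
print(round(hours_to_exhaustion(hard, WINDOW_HOURS), 1))
```

Both incidents are expressed in the same unit, fraction of budget spent, which is why one rule can cover the slow breach and the fast one.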

For the refund chatbot, the right mix is static thresholds on tool success and error rate (they have hard contracts and almost no seasonality), burn-rate on the latency SLO (long-window, want to catch sustained mild breaches and short hard breaches with one rule), and a hybrid for quality: a static 80% ticket threshold on the all-traffic slice, a static 65% paging floor on the enterprise slice, plus a rate-of-change ticket if the 7-day rolling number drops more than 4 percentage points week-over-week.
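The hybrid quality rule can be sketched as two independent checks on the same number. The `quality_actions` helper and the floor value used below are illustrative assumptions; substitute the real paging floor for the slice being monitored.

```python
def quality_actions(current, week_ago, floor, max_weekly_drop_pp=4.0):
    """Return the actions a rolling pass-rate reading should trigger.

    current, week_ago: pass rates as fractions (0.84 means 84%).
    floor: static paging floor for this slice.
    max_weekly_drop_pp: week-over-week drop, in percentage points, that
        opens a ticket.
    """
    actions = []
    if current < floor:
        actions.append("page")
    if (week_ago - current) * 100 > max_weekly_drop_pp:
        actions.append("ticket")
    return actions


FLOOR = 0.75  # illustrative floor, not a recommendation

# Slow rot: one point lost every week for twelve weeks.
weekly = [round(0.84 - 0.01 * i, 2) for i in range(12)]
history = [quality_actions(cur, prev, FLOOR) for prev, cur in zip(weekly, weekly[1:])]

# Sharp drop: five points lost in one week, still above the floor.
sharp = quality_actions(current=0.79, week_ago=0.84, floor=FLOOR)
print(history)
print(sharp)
```

The slow rot never trips the rate-of-change check because each weekly step is small, so only the static floor eventually catches it; the sharp drop trips the rate-of-change check long before the floor. That is the case for running both.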

A cost table the on-call engineer feels

The cost of a loud pager is not theoretical. Here is the working math an engineering manager should be able to recite.

   Choice                                            Pages / week / engineer  After-hours fraction  Annual cost / engineer                           What it buys
   ────────────────────────────────────────────────  ───────────────────────  ────────────────────  ───────────────────────────────────────────────  ────────────────────────────────────────────────
   Page on every threshold breach                    12-20                    60%                   $20K–$30K (lost sleep, attrition, mistake risk)  catches everything, signals nothing
   Page only on customer-facing burn-rate            1-3                      40%                   $5K–$8K                                          catches most real incidents, on-call sustainable
   Page only on hard P1 (revenue, contract, safety)  <1                       30%                   $1K–$3K                                          catches few real incidents, may miss some
   No paging, daytime tickets only                   0                        0%                    $0 direct, $50K–$200K from undetected outages    unsuitable for revenue-bearing systems

The numbers are rough estimates. They fold in senior-engineer time, next-day productivity loss after a 3am page (taken here as a 20-40% output drop), and attrition probability. The middle row is where almost every mature SRE team lands. The top row is where every team starts.

The refund chatbot, at $2M/year revenue contribution, can absorb the middle-row policy. A 50-engineer org rotating on the top-row policy is burning roughly $1M to $1.5M/year (50 × $20K–$30K) in pager-loaded engineering time and producing worse incident response. That is a real budget that nobody puts on a spreadsheet.
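The org-level figure is just the table's per-engineer range multiplied by headcount. A small sketch of that arithmetic, using the table's own rough estimates as inputs and a hypothetical helper name:

```python
def org_pager_cost(engineers, per_engineer_low, per_engineer_high):
    """Scale a per-engineer annual cost range to the whole rotation."""
    return engineers * per_engineer_low, engineers * per_engineer_high


top_row = org_pager_cost(50, 20_000, 30_000)     # page on every breach
middle_row = org_pager_cost(50, 5_000, 8_000)    # page on burn-rate only

# Most conservative and most optimistic saving from moving top -> middle.
saving_low = top_row[0] - middle_row[1]
saving_high = top_row[1] - middle_row[0]

print(f"top row:    ${top_row[0]:,} to ${top_row[1]:,}")
print(f"middle row: ${middle_row[0]:,} to ${middle_row[1]:,}")
print(f"saving:     ${saving_low:,} to ${saving_high:,}")
```

Since the inputs are rough, the output is an order-of-magnitude argument for putting pager load on a budget line, not a forecast.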

Operational signals — healthy, first-degrading, misleading, expert

Healthy looks like this: pass rate steady, slice rates within 5pp of overall, alert acknowledgement time under 5 minutes during business hours, no auto-mute on any channel, runbook revisions every 4-6 weeks driven by incident postmortems.

The first metric to degrade is almost never the one you watch. It is alert acknowledgement time. When responders start ignoring alerts for 20 minutes, the issue is not the alert config — it is that the team has lost trust in the signals. The fix is to demote noisy alerts immediately, not to escalate the ones that are being ignored.

The misleading beginner metric is alert count. "We have 200 alerts configured" sounds rigorous. It is the symptom of fatigue, not the cure for it. The same shape appears in eval-rubric design from chapter 7: more dimensions ≠ better measurement.

The graph an experienced engineer opens first during an incident is not any of the five top signals. It is the slice panel on rolling quality: which slice broke, when. If enterprise dropped at 14:32 and overall dropped at 14:34, the enterprise change drove the overall — start there. If everything moved together, suspect an infrastructure-layer change. Slice-correlation timing is the highest-leverage debugging signal in the chapter.
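Slice-correlation timing can be approximated by finding when each slice first fell a set distance below its own baseline and sorting by that time. The helpers, the 3-point drop size, and the sample series below are all assumptions for illustration; timestamps are zero-padded HH:MM strings so that plain sorting orders them correctly.

```python
def first_drop(series, baseline, drop_pp=3.0):
    """Return the first timestamp where the value is more than drop_pp
    percentage points below baseline, or None if it never is."""
    for timestamp, value in series:
        if (baseline - value) * 100 > drop_pp:
            return timestamp
    return None


def drop_order(slices, drop_pp=3.0):
    """slices: {name: (baseline, [(timestamp, value), ...])}.

    Returns [(timestamp, name), ...] sorted so the earliest drop comes first.
    Slices that never dropped are left out.
    """
    drops = []
    for name, (baseline, series) in slices.items():
        timestamp = first_drop(series, baseline, drop_pp)
        if timestamp is not None:
            drops.append((timestamp, name))
    return sorted(drops)


slices = {
    "all": (0.84, [("14:30", 0.84), ("14:32", 0.83), ("14:34", 0.80)]),
    "enterprise": (0.78, [("14:30", 0.78), ("14:32", 0.71), ("14:34", 0.66)]),
    "consumer": (0.86, [("14:30", 0.86), ("14:32", 0.86), ("14:34", 0.85)]),
}
order = drop_order(slices)
print(order)
```

Here enterprise leads the overall number by two minutes and consumer never drops, which points the investigation at whatever changed for enterprise traffic.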

Where one dashboard is enough, and where you must separate

One unified dashboard is honestly sufficient when three conditions hold: traffic is below roughly 1,000 requests per day so on-call is rare, there is no release process distinct from "the engineer who wrote the code pushed it," and quality variation week-over-week is below the natural noise floor of your eval set. Internal tools and very early-stage products often qualify.

Two dashboards become mandatory when any one of: revenue or contractual SLOs exist, releases happen on a cadence distinct from individual commits (canary, blue-green, weekly trains), or the team is large enough that the launch reviewer and the on-call engineer are different people. The refund chatbot crosses all three thresholds. Most production AI systems past their first quarter do.

The pathological case is the team between these two regimes that builds one dashboard "for now" and never separates. Three months in, the launch dashboard has accreted 40 panels, the on-call view is unusable, and incidents take 3-4x longer to triage than they should. The fix is rarely incremental — it is a half-day to start a fresh on-call dashboard with five panels and a hard rule that nothing else goes on it.

Wrong mental model — "more alerts means better coverage"

The seductive belief is that an alert for every failure mode adds up to comprehensive protection. It does not. Alerts are a shared attention budget. Past a small number, every new alert lowers the signal-to-noise ratio of every existing alert. Coverage and noise are not independent axes; they trade against each other on the same resource — the on-call engineer's belief that an interruption means something.

Replace the wrong model with: alerts compete for attention; only alerts that pay rent should stay. Rent is paid when the page leads to an action that would not have happened otherwise. Every quarter, audit every alert: how many times did it fire, how many of those were actioned, how many caught a real incident. Alerts with low action rates get demoted to tickets. Tickets with low triage rates get demoted to dashboard panels. Dashboard panels with no views in 90 days get deleted.
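The quarterly audit can be written as a small rule that maps each signal's statistics to its next level. This is a sketch under stated assumptions: the `Signal` fields are hypothetical, the 50% action-rate cutoff is a placeholder to be agreed by the team, and a signal that never fired is left where it is because there is no evidence either way.

```python
from dataclasses import dataclass

DEMOTE_TO = {"page": "ticket", "ticket": "dashboard"}


@dataclass
class Signal:
    name: str
    level: str          # "page", "ticket", or "dashboard"
    fired: int = 0      # times it fired in the audit period
    actioned: int = 0   # times a human took an action because of it
    views_90d: int = 0  # only meaningful for dashboard panels


def audit(signal, min_action_rate=0.5):
    """Return the level the signal should have after the audit, or "delete"."""
    if signal.level == "dashboard":
        return "delete" if signal.views_90d == 0 else "dashboard"
    if signal.fired == 0:
        return signal.level  # no evidence this quarter; leave as is
    if signal.actioned / signal.fired < min_action_rate:
        return DEMOTE_TO[signal.level]
    return signal.level


signals = [
    Signal("tool_success_page", "page", fired=4, actioned=4),
    Signal("latency_spike_page", "page", fired=12, actioned=1),
    Signal("cost_ticket", "ticket", fired=6, actioned=1),
    Signal("legacy_panel", "dashboard", views_90d=0),
]
result = {s.name: audit(s) for s in signals}
print(result)
```

The rule only ever moves a signal down one level per audit, so a noisy page has to fail as a ticket too before it becomes a panel.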

The same anti-pattern appears in chapter 11's "log everything" temptation and in chapter 7's "every dimension on the rubric" temptation. Three different layers, one shared invariant: operator attention is the scarcest resource in any operational system, and an abundance of signal degrades response rather than improving it.

Mini-FAQ. "What about silent failures — won't we miss them without an alert?" Silent failures are caught by the dashboard, not the alert. The trend is visible on a daily review even if no threshold ever fires. Alerts are for things that need response in minutes. Trends are for things that need response in days. Confusing these is what causes the alert sprawl.

Six failure shapes the chapter protects against

  • Alert fatigue. 140 alerts, 5 ever actioned, on-call channel muted. The system has no alerts in practice.
  • Aggregate hides slice. Overall pass-rate 82%, enterprise 65%, no slice-level alert. Customer escalation arrives before the dashboard does.
  • Threshold drift. Threshold set a year ago at 95th percentile of "normal." Normal has degraded 8pp since. Alert never fires.
  • Page-on-noise. Threshold tuned to fire on a 2-sigma event. P95 latency naturally crosses 2-sigma several times a week from benign causes. Page becomes wallpaper.
  • No runbook. Page fires at 3am. Responder spends 45 minutes figuring out what the alert means before any action. Time to mitigation balloons.
  • Single-pane fantasy. One dashboard for everything. On-call uses it at 3am, scans 30 panels, misses the one that matters, finds it at 14 minutes instead of 2.

Each disappears with explicit hierarchy, slice-aware alerts, regular threshold review, runbooks-mandatory-for-pages discipline, and two-dashboard separation.

Cross-topic references

  • Same pressure, different layer. Chapter 11's tracing pressure (attention budget per request) is this chapter's pressure (attention budget per shift) one layer up — both are operator-attention scarcity expressed differently.
  • Recurring invariant. Chapter 1's "a quality claim covers only the sample that generated it" directly powers the rolling-window judge alert: the 7-day pass-rate is exactly such a sample, automated.
  • Failure geometry echo. The aggregate-hides-slice failure first named in chapter 1 reappears here as the slice-aware paging rule — same shape, now at production-monitoring scale.
  • Forward dependency. Chapter 13 will use these dashboards as the inner loop of development; without the panels and alerts here, the feedback loop there has no signal source.

A fast self-test for an on-call dashboard

  • Can a stranger glance at the first screen and tell whether users are hurting right now?
  • Does every alert have a linked runbook and a linked dashboard view?
  • Has every page in the last quarter resulted in an action, or are some pure noise?
  • Is the quality signal on the dashboard at the same prominence as latency?
  • Can you click from any panel to a sample of the underlying traces in one hop?

Five yeses means the dashboard is doing operational work. One or more nos means the next 3am page will go badly.

Where this lives in the wild

  • Grafana + Prometheus — the most common stack for the five-signal dashboard and YAML alert rules; metric-shaped data, Loki for log drill-down, Tempo for trace drill-down.
  • Datadog APM + LLM Observability — unified panels for infra metrics plus LLM-specific signals like prompt drift and judge pass-rate.
  • Honeycomb — high-cardinality slicing for trace-level drill-down; built for "click from aggregate to the failing 12 events" workflows.
  • PagerDuty — the paging policy layer; routing, escalation, on-call rotations, and the place runbooks should be linked from.
  • Opsgenie / Splunk On-Call (formerly VictorOps) / incident.io — direct PagerDuty competitors; the paging layer is commoditised, the runbook discipline is not.
  • LangSmith dashboards — judged-quality rolling metrics with one-click trace drill-down; the source-of-truth for the fifth signal in many production LLM stacks.
  • Arize Phoenix — open-source tracing plus eval dashboards; combines drift detection (chapter 9) with monitoring.
  • LangFuse — open-source LLM observability; cost-per-request and judge-pass-rate panels out of the box.
  • Helicone — proxy-level monitoring; cost-per-request and provider error rates land here without app instrumentation.
  • AWS CloudWatch + CloudWatch Alarms — static thresholds on metrics, plus anomaly-detection alarms; the lowest-friction option inside AWS.
  • Azure Monitor + Application Insights — same pattern, plus Smart Detection for some anomaly cases.
  • GCP Cloud Monitoring + SLO objects — first-class SLO and burn-rate alert primitives, modelled directly after the Google SRE book.
  • Anthropic internal release dashboards — model-card-style signal sets gating production model swaps; the discipline that lets a 200B-param swap ship on a Tuesday.
  • OpenAI status page — the public surface of an internal multi-signal dashboard; outages declared from threshold breaches, not from Twitter.
  • Stripe Sigma + dashboards — the canonical example of revenue-aware paging: payment success rate alerts wake people up, dashboard wallpaper does not.
  • Cloudflare Radar / status — burn-rate alerts on edge SLOs; public dashboards as a trust signal to enterprise customers.
  • Datadog SLO product — managed error budgets with burn-rate alert templates; the productised version of the Google pattern.
  • Sentry — error-rate spike detection with release-correlation; pairs naturally with the error-rate signal in the default five.
  • Better Stack / Uptime Kuma — synthetic-probe monitoring for the public endpoint; a different layer than judged-quality, both needed.
  • Slack #ops channels with throttled webhooks — the social layer of alerting; channel hygiene is a real reliability practice.

Recall — can you reconstruct the chapter cold?

  1. Name the five default signals every production LLM app should put on its first dashboard.
  2. Write the grammar of a well-formed SLO sentence.
  3. What distinguishes a page from a ticket from a dashboard-only signal?
  4. Why does the rolling pass-rate alert use for: 30m instead of firing immediately?
  5. Under what three conditions can a single dashboard serve both launch and on-call?
  6. State the rule about runbooks and pages.
  7. What is the first metric to degrade when alert quality is rotting, and why isn't it any of the alerts themselves?

Interview Q&A

Q1. Your team has 80 alerts configured and the on-call channel was muted last week. What is your first move?

A. Not tuning thresholds, not building a new dashboard: deleting alerts. Run a 90-day audit: every alert that has fired more than 3 times without leading to an action gets demoted to a ticket. Every ticket without triage gets demoted to a dashboard panel. The page list should fit on a sticky note when you are done. Then enforce the runbook rule: any surviving page without a linked runbook either gets one or gets demoted. Only after the noise floor is real do you tune thresholds. Common wrong answer to avoid: "Tighten the thresholds so they only fire on real incidents." Thresholds are downstream of the deletion. You cannot tune your way out of fatigue.

Q2. You want to alert on a judged quality drop. Should you page on it?

A. Almost never directly. Judged quality moves on day-scales; a real drop takes hours to days of judge samples to be statistically visible. Page-grade signals must reflect minute-scale customer harm. Use quality drops as tickets (next-business-day action) with a single page exception for a high-stakes slice: for the refund bot, enterprise slice collapse below a hard floor. Pair the ticket with a for: clause long enough that noise on the rolling window does not make the alert flap. Common wrong answer to avoid: "Any quality drop is a P1." Quality drops are real but slow; treating them as pages destroys the pager.

Q3. Static thresholds vs anomaly detection vs burn-rate — pick one for p95 latency on a customer-facing API.

A. Burn-rate. Latency has a hard customer-facing contract (the SLO sentence) and you care about both "big spike, briefly" and "small mild breach, sustained" — burn-rate collapses both into one rule against the error budget. Static thresholds force you to pick one shape and miss the other. Anomaly detection ignores the absolute contract, which is exactly what the SLO is encoding. Common wrong answer to avoid: "Anomaly detection because it adapts to seasonality." It also adapts to slow rot, which is the failure mode you most need to catch.

Q4. The on-call dashboard and the launch dashboard — why not unify them and save effort?

A. They optimise for opposite goals. Launch wants breadth: 20+ panels, slice tables, baseline-vs-candidate toggles. On-call wants speed: 5 panels, big numbers, one-click drill-down. A unified dashboard makes on-call slower at 3am, which directly costs incident response time. The maintenance saving is illusory because the unified dashboard ends up doing neither job well and gets bookmarked-then-ignored. Common wrong answer to avoid: "One dashboard is simpler to maintain." It is also worse at both jobs; simplicity in the wrong dimension.

Q5. Aggregate pass rate is 82%, your SLO is 80%. Ship?

A. Same question chapter 1 forced — what is the slice table? Pull the slice panel; if enterprise or any contract-bearing slice is below its slice-SLO, do not ship. The aggregate-hides-slice failure is exactly the shape this chapter protects against. The dashboard must show slice-level rolling pass-rate next to the aggregate, and the alert must be slice-aware. Common wrong answer to avoid: "Aggregate is above SLO, ship." Aggregates flatten the very shape that matters for revenue-bearing slices.

Q6. Cumulative — your trace from chapter 11 shows a slow retrieval span, your dashboard from this chapter shows tool-success steady at 97%, your judge from chapter 6 shows pass rate falling. Where do you look?

A. The judge is reporting the symptom the user feels; the tool-success counter is too coarse to catch a slow-but-not-failing retrieval. Open the slice-by-intent panel on quality to find which intent is collapsing, then pull a sample of failing trace IDs from the linked traces and check whether the retrieval span timing changed even when the call still succeeded. Tool success is binary — succeeded or not. Quality captures "succeeded but returned stale or sparse context." This is a layer-mismatch debugging path: the dashboard layer says fine, the trace+judge layer says broken, and the trace is the truth. Common wrong answer to avoid: "Tool success is green, so it isn't a retrieval problem." Tool success and retrieval quality are different layers; success counters miss soft failures.

Q7. A new page fires every Sunday at 02:14. Investigation shows it is a backup job spiking latency briefly. What do you do?

A. Demote, not tune. The backup is benign; the page is wallpaper that trains the on-call to ignore Sunday-night pages. Either move the threshold above the known benign spike, exclude the maintenance window from the alert, or remove the alert entirely if the failure mode it was meant to catch is already covered by burn-rate. The general rule: any recurring page with a known benign cause should not be a page. Common wrong answer to avoid: "Tighten the threshold so it only fires when the spike is even larger." You are now training the alert to ignore the very seasonality you should encode.

Q8. You inherit a system with no quality alerting, only infra alerts. The team says infra has been clean for months. Reaction?

A. Clean infra is not evidence of healthy quality; it is evidence of nothing about quality. The first move is to wire the kitchen log through a judge, emit an llm_judge_verdict_total counter, and put a rolling pass-rate panel on the dashboard before configuring any new alerts. Once a baseline exists for a week or two, tune slice-aware tickets, then a paging rule on the highest-stakes slice. The order matters: dashboard before alert before page. Common wrong answer to avoid: "Add a page for any quality drop." You have no baseline; thresholds set without a baseline produce fatigue or silence on day one.
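To make the "rolling pass-rate panel" concrete, here is roughly the computation such a panel performs over judge verdicts. It is a standard-library stand-in for illustration only: a real deployment would derive this from the counter in the metrics backend, and `RollingPassRate` is a hypothetical helper.

```python
from collections import deque


class RollingPassRate:
    """Pass rate over the verdicts recorded in the trailing window_s seconds."""

    def __init__(self, window_s):
        self.window_s = window_s
        self.events = deque()  # (timestamp_s, passed) in arrival order

    def record(self, ts, passed):
        self.events.append((ts, passed))

    def rate(self, now):
        # Evict verdicts that have aged out of the window.
        while self.events and self.events[0][0] <= now - self.window_s:
            self.events.popleft()
        if not self.events:
            return None  # no data is different from a 0% pass rate
        return sum(1 for _, passed in self.events if passed) / len(self.events)


panel = RollingPassRate(window_s=3600)
for ts, passed in [(0, True), (1800, False), (3000, True), (4000, True)]:
    panel.record(ts, passed)
current = panel.rate(now=4000)
print(current)
```

Returning `None` for an empty window is a deliberate choice in this sketch: a judge pipeline that has stopped emitting verdicts should read as missing data, not as a healthy or collapsed pass rate.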

Apply now (10 min)

Step 1 — model the exercise. Here is the refund chatbot's complete first-screen on-call dashboard plus paging policy you should be able to defend in a review:

PANELS (5 boxes, top of screen):
  p95 latency 24h  │  error rate 1h  │  tool success 1h  │  cost/req 24h  │  quality 7d (sliced)

PAGES (3 total, all with runbooks):
  1. tool_success < 90% for 10m
  2. p95_latency burns 6h of 30d budget in <1h
  3. enterprise_pass_rate_7d < 0.65 for 15m

TICKETS (4 total):
  1. quality_pass_rate_7d < 0.80 for 30m
  2. cost_per_request > $0.04 for 1h
  3. error_rate_1h > 1% for 30m
  4. any slice drops >4pp week-over-week

DASHBOARD-ONLY (4 trends):
  - per-slice pass-rate, retrieval-hit rate, token-volume drift, p50 latency

Notice how few pages there are. Three. Every page has a slice or a hard SLO contract. The runbooks are linked from each alert annotation.
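Page 2 is the only entry in the sketch phrased as a burn rate, so the arithmetic is worth spelling out. The reading below is one interpretation, under loud assumptions: the latency SLO is taken to mean "95% of requests finish under the threshold", which leaves a 5% budget of slow requests, and "6h of budget" is taken to mean six hours' worth of on-budget spending. Both are assumptions, not definitions from this chapter.

```python
def burn_rate(bad_fraction, budget_fraction):
    """How many times faster than exactly-on-budget the error budget is being spent."""
    return bad_fraction / budget_fraction


def budget_hours_consumed(bad_fraction, budget_fraction, elapsed_hours):
    """Hours of on-budget spending used up during elapsed_hours of wall-clock time."""
    return burn_rate(bad_fraction, budget_fraction) * elapsed_hours


BUDGET = 0.05               # assumed: 5% of requests may exceed the latency threshold
PAGE_THRESHOLD_HOURS = 6.0  # page when 6h of budget is gone within 1h
WINDOW_HOURS = 30 * 24      # the 30-day SLO window, in hours

observed_slow = 0.33        # invented: 33% of the last hour's requests were slow
consumed = budget_hours_consumed(observed_slow, BUDGET, elapsed_hours=1.0)
should_page = consumed >= PAGE_THRESHOLD_HOURS
share_of_budget = PAGE_THRESHOLD_HOURS / WINDOW_HOURS

print(f"consumed {consumed:.1f}h of budget in 1h, page={should_page}")
print(f"the page threshold is {share_of_budget:.2%} of the 30d budget")
```

Under these assumptions the page corresponds to a burn rate of 6: slow requests arriving six times faster than the SLO can sustain, which exhausts the whole 30-day budget in five days if nothing changes.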

Step 2 — your turn. Take one AI feature you own. Write its five-signal default. Write the SLO sentence for each, with explicit window. Write three pages, four tickets, three dashboard-only trends. For each page, write one sentence of runbook — the first action the responder takes. If you cannot, the page is not ready to exist yet.
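The "not ready to exist yet" test in Step 2 is mechanical enough to lint. A minimal sketch, assuming the policy is kept as data: the `Alert` shape and `pages_not_ready` are hypothetical, the conditions come from the model policy in Step 1, and the runbook sentences are illustrative placeholders.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    name: str
    level: str                      # "page", "ticket", or "dashboard"
    condition: str
    runbook_first_action: str = ""  # one sentence: the responder's first move


def pages_not_ready(alerts):
    """Names of pages that have no first runbook action written down."""
    return [
        a.name
        for a in alerts
        if a.level == "page" and not a.runbook_first_action.strip()
    ]


policy = [
    Alert("tool_success_low", "page", "tool_success < 90% for 10m",
          "Check the tool provider's status and recent deploys."),  # placeholder text
    Alert("p95_latency_burn", "page", "burns 6h of 30d budget in <1h",
          "Open the latency panel and find the slowest span type."),  # placeholder text
    Alert("enterprise_pass_rate_low", "page", "enterprise_pass_rate_7d < 0.65 for 15m"),
    Alert("cost_per_request_high", "ticket", "cost_per_request > $0.04 for 1h"),
]
missing = pages_not_ready(policy)
print(missing)
```

Tickets and dashboard-only trends are deliberately exempt here; only a page makes the promise that a responder knows what to do on arrival.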

Step 3 — reproduce from memory. Without scrolling up, recreate the on-call dashboard sketch and the Prometheus YAML alert for rolling pass-rate, including the recording rule, the for: clause, and the slice-aware paging variant. Then write one sentence connecting this to chapter 1's load-bearing rule about samples. If you can do this cold, you carry the chapter into production.

What you should remember

This chapter explained why per-request traces from chapter 11, however complete, cannot keep a production LLM system healthy without aggregation, and how to do that aggregation in a way humans can actually act on. The refund chatbot's five-signal default — p95 latency, error rate, tool success, cost-per-request, and rolling judged quality — covers the four kinds of pain users feel plus the one cost finance feels. The fifth signal is the one your platform team cannot give you for free, and it is the one that catches every AI-specific outage that surprises infrastructure-only monitoring.

You learned the three-level alert hierarchy and why it exists. A page is a $5K-per-engineer-per-year promise that interrupts sleep; it must imply an action that cannot wait until morning and must carry a runbook. A ticket is for things that are wrong but tolerable. A dashboard-only signal is a trend, not an interruption. You also learned the two-dashboard rule: launch and on-call optimise for opposite goals, and a unified dashboard serves neither well. The concrete artifact is the Prometheus rolling pass-rate alert, slice-aware, with for: clauses sized to the metric's natural noise.

Carry this diagnostic forward: when a new alert is proposed, ask three questions. What action does it imply? What runbook does it link to? What signal-to-noise ratio does it have after one quarter of running? If the answers are "unclear, none, untested," the alert should not exist yet. Page fatigue is not a tuning problem; it is a deletion problem.

Remember:

  • The kitchen log earns its weight only when summarised into gauges humans actually watch and thresholds tied to actions humans actually take.
  • Five signals on the first screen: latency, errors, tool success, cost, quality. The fifth is the one your APM tool cannot give you for free.
  • Page → ticket → dashboard-only is an attention hierarchy, not a severity list. Every page needs a runbook; no exceptions.
  • SLO sentences need a window. Latency windows are long; tool-success windows are minutes. Pick the window to match the failure shape's natural timescale.
  • Aggregate hides slice. Slice-aware alerts are the production-scale version of chapter 1's rule about samples.
  • More alerts ≠ better coverage. Alerts compete for the same attention budget; past a small number, each new one lowers the value of every existing one.

Bridge. Dashboards and alerts close the loop after code ships — they catch what production reveals. But waiting until production to discover regressions is still expensive: every fix is at minimum a ticket, often a page, sometimes a customer escalation. The next chapter pulls evals leftward into the developer's inner loop, so the same signals that wake the on-call at 3am also fail the pre-merge check at 3pm — and the same rolling pass-rate that triggers the slice-aware page becomes the unit test that blocks the regressing PR before any of this monitoring ever sees it.

13-eval-driven-development.md