03. Alert design for AI systems¶

~11 min read. The apparatus's eyes are the alert plane. AI failures need paging conditions that classic alerts do not produce — sliced, anomaly-based, anchored to deploys, and pre-loaded with the context an on-call needs in the first ten minutes.

Continues from 02-the-oncall-apparatus.md. This chapter develops the alert plane. Recurring concepts in bold: quality alert, prompt-version page, provider-drift watch, cost-spike page, safety violation page, payload schema, deploy-anchored window.

The apparatus has six surfaces; the alert plane is the first to develop because every other surface fires from a page. A team can have a perfect rotation, a clean runbook, and a tested escalation graph and still miss the incident if the alert never fired. This chapter is the discipline of designing alerts that catch the five failure families.

What an AI alert is, and is not¶

A classic alert is a threshold on a symptom: error rate > 1%, latency p99 > 2 seconds, queue depth > 500. The threshold is static; the symptom is observable; the alert is a "this thing is now bad" notification.

AI alerts share the shape but not the substrate. The "thing" is rarely a single number above a fixed threshold. It is a comparison — eval-on-traffic score against a rolling baseline, refusal rate against the last week, tenant cost against the rolling mean, slice-level regression after a deploy, classifier hit on a sampled output. The alert is an anomaly or a deploy-anchored regression, not a static threshold breach.

Three properties an AI alert must have that a classic alert often does not:

Sliced. Eval scores are tracked by intent, tenant, cohort, or product surface — not just in aggregate. Regressions hide in slices.
Baseline-relative. The threshold is a deviation from rolling history, not an absolute number. A 3% refusal rate is fine for some features and a crisis for others; only the deviation from the feature's own baseline tells you which.
Deploy-anchored. When a prompt or model is deployed, a tight anomaly window opens. Anomalies inside that window are nearly always caused by the deploy; alerts in the window page louder.
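To make the "regressions hide in slices" point concrete, here is a minimal sketch with made-up numbers. The `SliceScore` type, the helper names, and the scores are all illustrative assumptions, not measurements from any system described here.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SliceScore:
    name: str
    traffic_share: float  # fraction of sampled calls in this slice
    baseline: float       # rolling-baseline eval score for the slice
    current: float        # eval score in the current window

def weighted(slices, attr):
    """Traffic-weighted aggregate of one score attribute."""
    return sum(s.traffic_share * getattr(s, attr) for s in slices)

def regressed(slices, max_drop):
    """Names of slices whose score fell more than max_drop below baseline."""
    return [s.name for s in slices if s.baseline - s.current > max_drop]

# Illustrative numbers only.
slices = [
    SliceScore("policy_question", 0.60, 0.80, 0.86),
    SliceScore("claim_status", 0.30, 0.82, 0.86),
    SliceScore("policy_comparison", 0.10, 0.85, 0.66),
]

# The aggregate goes up while one slice falls sharply.
aggregate_change = weighted(slices, "current") - weighted(slices, "baseline")
hidden = regressed(slices, max_drop=0.05)
```

With these numbers the aggregate improves while `policy_comparison` drops by 0.19, which is the shape an aggregate-only alert cannot see.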

The five paging conditions¶

The apparatus needs at least five paging conditions, one per failure family. Each has a name in the apparatus vocabulary.

The quality alert¶

What it watches. Eval-on-production-traffic score, sliced by intent, tenant tier, and cohort. A rolling-window comparison surfaces regressions before the user-feedback signal lags catch up.

When it fires. A slice's score drops below its rolling baseline by a defined threshold (commonly 5% absolute or 1σ, whichever is tighter), sustained for a minimum window (15-30 minutes for high-traffic features; longer for sparse).

Severity. P1 if the slice represents > 5% of traffic or contains a high-tier tenant; P2 otherwise.

Payload required. Slice name, score before/after, sample of affected calls (with trace IDs), the model version, the prompt version, the time of the last deploy of either.

The quality alert is the one a classic on-call apparatus most often lacks. The single most consequential failure shape for AI products is silent quality regression, and the quality alert is the only paging condition that catches it.
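The firing and severity rules above can be sketched as follows. This is a minimal sketch under stated assumptions: scores arrive at a fixed cadence (say one per 10 minutes, so three points approximate a 30-minute sustain window), the baseline is the mean of the history list, and all function names are hypothetical.

```python
from statistics import mean, stdev

def drop_threshold(history, absolute=0.05, sigmas=1.0):
    """Tighter (smaller) of the absolute drop and the sigma-based drop."""
    return min(absolute, sigmas * stdev(history))

def quality_alert_fires(history, recent, sustain_points=3):
    """True if the last sustain_points scores all sit below the limit.

    history: slice scores forming the rolling baseline (assumed cadence).
    recent:  most recent slice scores, oldest first.
    """
    if len(recent) < sustain_points:
        return False
    limit = mean(history) - drop_threshold(history)
    return all(score < limit for score in recent[-sustain_points:])

def quality_severity(traffic_share, has_high_tier_tenant):
    """P1 if the slice is over 5% of traffic or holds a high-tier tenant."""
    return "P1" if traffic_share > 0.05 or has_high_tier_tenant else "P2"
```

Requiring every point in the sustain window to breach the limit is one simple reading of "sustained"; a team might instead require a majority of points, which trades some precision for faster detection.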

The prompt-version page¶

What it watches. Anomalies in any production signal within a deploy-anchored window after a prompt or model change. It combines the sliced quality alert with anomaly windows for refusal rate, token usage, latency distribution, and feedback signal.

When it fires. A signal moves outside its rolling baseline by a configured threshold during the deploy window (typically 30-90 minutes post-deploy).

Severity. Always at least P2; P1 if the regression is large or affects high-tier tenants.

Payload required. Deploy ID, prompt/model version before and after, signal moving, magnitude, suggested rollback command.

The prompt-version page lowers the threshold for the post-deploy window because the probability of cause-and-effect attribution is high. The same signal during quiet hours might be noise; immediately after a deploy, it is the deploy until proven otherwise.

The provider-drift watch¶

What it watches. Provider behaviour shift detected at the model gateway — refusal rate by provider, response-shape anomaly, latency distribution change, error class distribution change. See 01_model_gateway_provider_ops chapter 09.

When it fires. Drift signal exceeds threshold; the drift is sustained beyond a noise window.

Severity. P2 by default; P1 if a workload class is heavily affected or if business-critical workflows are degraded.

Payload required. Provider name, model name, drift type, magnitude, traffic share affected, candidate fallback options.
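One way to express "exceeds threshold, sustained beyond a noise window" for a single drift signal such as per-provider refusal rate is sketched below. The `min_delta` and `noise_windows` defaults are assumptions chosen for illustration, and the check is two-sided because a drop in refusal rate is also a behaviour shift.

```python
def drift_sustained(baseline_rate, window_rates, min_delta=0.02, noise_windows=3):
    """True if the last noise_windows rates all deviate from baseline by more than min_delta.

    baseline_rate: rolling refusal rate for this provider and model.
    window_rates:  per-window refusal rates, oldest first.
    """
    if len(window_rates) < noise_windows:
        return False
    tail = window_rates[-noise_windows:]
    return all(abs(rate - baseline_rate) > min_delta for rate in tail)
```

A single outlying window does not fire; only a run of consecutive deviating windows does, which is what keeps the watch from paging on transient provider hiccups.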

The cost-spike page¶

What it watches. Per-tenant and per-feature cost telemetry, compared against rolling baseline (per-day or per-hour, depending on volume).

When it fires. Cost exceeds rolling baseline by a defined deviation (commonly 3σ or a configured percentage), sustained beyond a noise window.

Severity. P2 by default; P1 if absolute spend crosses budget thresholds.

Payload required. Tenant or feature ID, current rate, baseline rate, magnitude of deviation, expected spend if not contained, kill paths (feature flag, tenant quota, agent rate limit).
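A minimal sketch of the cost-spike rule follows, assuming hourly cost rates per tenant or feature and a 3σ deviation. The function names, the two-point sustain window, and the linear projection are illustrative assumptions.

```python
from statistics import mean, stdev

def cost_spike_fires(hourly_history, recent_rates, sigmas=3.0, sustain_points=2):
    """True if the last sustain_points rates all exceed baseline + sigmas * stdev."""
    if len(recent_rates) < sustain_points:
        return False
    limit = mean(hourly_history) + sigmas * stdev(hourly_history)
    return all(rate > limit for rate in recent_rates[-sustain_points:])

def projected_overspend(current_rate, baseline_rate, hours_remaining):
    """Extra spend if the current rate holds for hours_remaining (linear assumption)."""
    return (current_rate - baseline_rate) * hours_remaining

def cost_severity(spend_to_date, budget):
    """P1 once absolute spend crosses the budget threshold, else P2."""
    return "P1" if spend_to_date >= budget else "P2"
```

The projection is what fills the "expected spend if not contained" payload field; it assumes the rate stays flat, so a runaway that is still accelerating will be underestimated.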

The safety violation page¶

What it watches. Safety classifier output on production responses, complaint volume on safety-flagged categories, regulator or trust-and-safety escalations.

When it fires. On a single confirmed violation. An unconfirmed event that the safety classifier flags with high confidence can also page, depending on severity.

Severity. Always P1.

Payload required. Affected response sample (redacted), input class, model/prompt version, feature involved, suggested freeze paths.
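The paging decision for safety events can be sketched as a small rule. The confidence cut-off and the severity labels below are hypothetical placeholders; the text only fixes that a confirmed violation always pages and that the page is always P1.

```python
def safety_page(confirmed, classifier_confidence, category_severity,
                min_confidence=0.9, paging_severities=("high", "critical")):
    """Return "P1" if the event should page, else None (route to review).

    min_confidence and paging_severities are assumed values for illustration.
    """
    if confirmed:
        return "P1"  # a single confirmed violation is a paging condition
    if classifier_confidence >= min_confidence and category_severity in paging_severities:
        return "P1"  # unconfirmed, but high confidence in a severe category
    return None
```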

Anchoring alerts to deploys¶

A common pattern that improves precision dramatically: anchor the alert threshold to deploy events. During the 30-90 minutes after a prompt or model deploy, the alert thresholds are tightened (for example, to roughly double the steady-state sensitivity), and the alert payload includes the deploy ID and rollback command.

The intuition: when a signal moves immediately after a deploy, the probability that the deploy caused the move is very high. The alert is justified at a sensitivity that would be too noisy in steady state. After the deploy window closes, thresholds relax.

This pattern requires:

A deploy event stream the alert system subscribes to. The release management module (03_ai_release_management) provides the schema.
An alert engine capable of dynamic thresholds based on the deploy state.
Rollback hooks accessible from the on-call's tooling.

When all three are in place, roughly 60-80% of the regressions the alert plane ever catches are caught inside the post-deploy window (the percentages vary by team; the principle holds).
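The dynamic-threshold requirement can be sketched as a lookup against the deploy event stream. The event shape (`id`, `at`, `rollback`) is an assumption for illustration and is not the schema from the release management module; halving the sigma threshold is one reading of "doubling sensitivity".

```python
from datetime import datetime, timedelta

def active_deploy(deploys, now, window=timedelta(minutes=60)):
    """Most recent deploy whose post-deploy window contains `now`, else None."""
    in_window = [d for d in deploys if timedelta(0) <= now - d["at"] <= window]
    return max(in_window, key=lambda d: d["at"]) if in_window else None

def threshold_for(deploys, now, steady_sigma=1.0, tighten_factor=0.5):
    """Return (sigma_threshold, payload_extras) for the current deploy state."""
    deploy = active_deploy(deploys, now)
    if deploy is None:
        return steady_sigma, {}
    extras = {"deploy_id": deploy["id"], "rollback_command": deploy["rollback"]}
    return steady_sigma * tighten_factor, extras

# Hypothetical deploy event; the rollback string is a placeholder.
deploys = [{"id": "d-1", "at": datetime(2024, 1, 1, 12, 0), "rollback": "<rollback command>"}]
inside = threshold_for(deploys, datetime(2024, 1, 1, 12, 30))
outside = threshold_for(deploys, datetime(2024, 1, 1, 14, 0))
```

Returning the payload extras from the same call keeps the threshold and the deploy context in step: an alert evaluated at the tightened threshold always carries the deploy ID that justified it.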

A worked example — wiring the quality alert¶

The Bengaluru insurance SaaS team from the first-principles chapter wires its first quality alert after the silent-regression postmortem. The design:

Signal source. Eval-on-production-traffic, sampled at 2% of calls, scored by an LLM-as-judge fixture against a fixed rubric per intent. Cost: ~$80/day for the feature's traffic level.
Slicing. By intent (policy_question, policy_comparison, claim_status, claim_initiation, agent_handoff_request), by tenant tier (free, paid, enterprise), and by region.
Baseline. Rolling 7-day score per slice, recomputed daily.
Threshold. 1σ regression below baseline sustained for 30 minutes triggers P2; 2σ triggers P1.
Deploy-anchored. During the 60-minute window after a prompt or model deploy, the threshold tightens to 0.7σ.
Payload. Slice name, score before/after, ten sample call trace IDs, model version, prompt version, deploy ID if within window, link to the runbook card, link to the rollback command.

In its first month, the alert fires after a prompt deploy that improved aggregate scores by 4% but regressed policy_comparison by 19%. The on-call sees the alert, opens the runbook, and executes the rollback command. Time to contain: 12 minutes. The same incident pre-apparatus took 14 hours.
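The threshold rules in this design can be sketched as a single classification function. One detail is an assumption: the design does not say which tier the 0.7σ deploy-window threshold applies to, so the sketch applies it to the P2 trigger and leaves P1 at 2σ. The 30-minute sustain check is omitted for brevity.

```python
def classify(baseline, sigma, current, in_deploy_window):
    """Map a slice score to "P1", "P2", or None using the worked-example thresholds.

    baseline: rolling 7-day score for the slice.
    sigma:    standard deviation of that rolling score.
    """
    drop_in_sigmas = (baseline - current) / sigma
    p2_trigger = 0.7 if in_deploy_window else 1.0  # assumed: tightening applies to P2
    if drop_in_sigmas >= 2.0:
        return "P1"
    if drop_in_sigmas >= p2_trigger:
        return "P2"
    return None
```

With an illustrative baseline of 0.85 and sigma of 0.05, a score of 0.81 is a drop of about 0.8σ: silent in steady state, a P2 inside the deploy window.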

The payload schema¶

The most common alert failure is not "the alert did not fire"; it is "the alert fired and the on-call wasted ten minutes looking up context." The payload schema is the apparatus's discipline against this.

Required fields:

- severity (P1 | P2 | P3)
- failure_family (quality | drift | cost | safety | other)
- summary (one line: what is wrong, where, how bad)
- runbook_link (versioned URL to the runbook card)
- rollback_command (if applicable — copy-paste ready)
- affected_scope (feature, tenant, slice — explicit)
- versions (model, prompt, retrieval, gateway — all four)
- deploy_id (if within deploy-anchored window)
- sample_traces (3-10 trace IDs the on-call can open)
- baseline_vs_current (the number that justifies the page)
- escalation_path (the next hop if the on-call cannot act)

The payload is the difference between a 12-minute time-to-contain and a 90-minute one. The runbook chapter develops what the on-call does with the payload; this chapter ensures the payload exists.
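One way to make the schema enforceable is a typed record with a validation method that the alert engine runs before paging. The sketch below mirrors the required fields listed above; the class name, the `problems` method, and the dictionary shapes are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

SEVERITIES = {"P1", "P2", "P3"}
FAMILIES = {"quality", "drift", "cost", "safety", "other"}
VERSION_KEYS = {"model", "prompt", "retrieval", "gateway"}

@dataclass
class AlertPayload:
    severity: str
    failure_family: str
    summary: str
    runbook_link: str
    affected_scope: Dict[str, str]
    versions: Dict[str, str]
    sample_traces: List[str]
    baseline_vs_current: Dict[str, float]
    escalation_path: str
    rollback_command: Optional[str] = None  # only if applicable
    deploy_id: Optional[str] = None         # only inside the deploy window

    def problems(self) -> List[str]:
        """Return human-readable schema violations; an empty list means valid."""
        found = []
        if self.severity not in SEVERITIES:
            found.append(f"unknown severity {self.severity!r}")
        if self.failure_family not in FAMILIES:
            found.append(f"unknown failure_family {self.failure_family!r}")
        if not self.summary.strip() or "\n" in self.summary:
            found.append("summary must be one non-empty line")
        missing = VERSION_KEYS - set(self.versions)
        if missing:
            found.append(f"versions missing {sorted(missing)}")
        if not 3 <= len(self.sample_traces) <= 10:
            found.append("sample_traces needs 3-10 trace IDs")
        if not self.runbook_link or not self.escalation_path:
            found.append("runbook_link and escalation_path are required")
        return found
```

Returning a list of problems instead of raising lets the engine still send a degraded page while logging the schema gap, which is usually preferable to dropping the alert.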

Operational signals¶

Healthy. Each AI feature has all five paging conditions wired. False-positive rate is below 20% (above this, on-call begins to mute). Mean time from page to first action is under 5 minutes (the payload pre-loads context).

First degrading metric. False-positive rate climbing. The on-call is starting to second-guess pages; if not addressed, they will mute.

Misleading metric. Total alert volume. A team with too few alerts has the same low volume as a team with too many that are muted; the metric does not distinguish health from blindness.

Expert graph. Per-alert precision and recall against postmortems: for each alert, what fraction of its fires were real incidents (precision), and what fraction of real incidents had the alert fire (recall). Of the two, recall below 80% is the more dangerous failure, because a missed incident stays invisible until someone outside the apparatus reports it.
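The expert graph reduces to two ratios per alert. A minimal sketch, assuming each alert fire has been labelled during postmortem review with the incident it corresponded to (or `None` for a false positive); the data shapes are hypothetical.

```python
def precision_recall(fires, incidents):
    """Compute (precision, recall) for one alert.

    fires:     list of (fire_id, incident_id or None) from postmortem review.
    incidents: set of incident IDs this alert was supposed to catch.
    Returns None for a ratio whose denominator is zero.
    """
    true_fires = [incident for _, incident in fires if incident is not None]
    precision = len(true_fires) / len(fires) if fires else None
    caught = set(true_fires) & incidents
    recall = len(caught) / len(incidents) if incidents else None
    return precision, recall

fires = [("f1", "inc-1"), ("f2", None), ("f3", "inc-1"), ("f4", "inc-2"), ("f5", None)]
incidents = {"inc-1", "inc-2", "inc-3"}
precision, recall = precision_recall(fires, incidents)
```

Here three of five fires were real (precision 0.6) and two of three incidents were caught (recall about 0.67), so this alert would fall below the 80% recall line.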

Boundary of applicability¶

Strong fit. Production AI features with measurable signal volume. The full five paging conditions are justified.

Pathology. Low-traffic features where the signal is too sparse to drive anomaly detection. The alert plane should adapt to volume — longer windows, simpler thresholds — rather than fire constant noise.

Scale limit. Multi-tenant platforms where per-tenant alerts can produce hundreds of pages a day if not capped. The pattern is to aggregate at the tenant cohort level for paging and surface per-tenant detail in the payload.
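The cohort-level aggregation pattern can be sketched as a grouping step between anomaly detection and paging. The anomaly record shape and the `cohort_of` mapping are assumptions for illustration.

```python
from collections import defaultdict

def pages_by_cohort(anomalies, cohort_of):
    """Collapse per-tenant anomalies into one page per cohort.

    anomalies: list of {"tenant": str, "deviation_sigma": float}.
    cohort_of: mapping from tenant ID to cohort name.
    """
    grouped = defaultdict(list)
    for anomaly in anomalies:
        grouped[cohort_of[anomaly["tenant"]]].append(anomaly)
    pages = []
    for cohort, items in sorted(grouped.items()):
        pages.append({
            "cohort": cohort,
            "tenant_count": len(items),
            "worst_deviation_sigma": max(a["deviation_sigma"] for a in items),
            # Per-tenant detail travels in the payload, worst first.
            "tenants": sorted(items, key=lambda a: -a["deviation_sigma"]),
        })
    return pages
```

The page count is bounded by the number of cohorts, not the number of tenants, while the on-call still sees which tenants are affected and which is worst.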

Failure-prone assumption¶

The seductive wrong belief: more alerts produce more safety. They do not. Beyond a threshold, more alerts produce alert fatigue, the on-call mutes or rationalises, and the apparatus degrades through use. The correct belief: alerts are designed for precision and recall, with the on-call's attention as the constrained resource. A few high-precision, high-recall alerts beat many noisy ones.

Where this appears in production¶

A fintech wires quality alerts sliced by intent; catches a refund-policy regression 22 minutes after a prompt deploy.
An e-commerce SaaS anchors alert thresholds to deploys; post-deploy precision improves from 40% to 78%.
A telecom AI has no slice-level quality alert; an enterprise tenant regresses; the team learns from the customer.
A consumer chatbot wires cost-spike alerts at 3σ; catches a misconfigured retry loop in 18 minutes; saves an estimated $6,000.
A legal AI wires safety violation as P1 single-event; the apparatus catches three violations in six months; each is contained within an hour.
A coding assistant has provider-drift watch wired to the gateway; catches a refusal-posture change overnight; the team pins the prior model version while the candidate model is dual-run.
A travel platform muted its quality alert after false positives; the next real regression is caught by customer support, not the apparatus.
A medical AI wires safety as P1 always; the on-call accepts the higher burden because the user impact is regulated.
A payments AI has alert payloads missing the deploy ID; on-call wastes time finding it; payload schema is updated.
A retail AI wires per-tenant cost alerts on enterprise tenants only; smaller tenants are aggregated; alert volume stays manageable.
A search-rerank service has alerts per workload class — interactive, batch, embeddings — each with its own thresholds.
A staffing AI has alerts but no payload schema; each on-call writes their own context-gathering script; mean time to context is 18 minutes.
A logistics AI wires alerts to a paging system but not to the runbook tooling; the runbook link is absent; on-call searches manually.
A document AI uses a 7-day rolling baseline; a feature with weekly seasonality produces false positives every Monday; baseline window adjusted.
A government AI wires alerts to ticket creation, not to paging; the apparatus catches issues but resolution lags by hours.
An ad-tech AI tracks alert precision and recall against postmortems quarterly; alert tuning is a standing activity.
A healthcare AI wires safety violation alerts with auto-freeze hooks; the on-call's first action is sometimes "confirm the freeze was correct."
A B2B SaaS wires the prompt-version page across all features; the post-deploy window is consistent at 60 minutes.
A real-estate AI has alerts wired but no drill that exercises them; the first real fire is the first validation; one of three alerts had a misconfigured payload.
A media AI has provider-drift watches per provider; catches an API-shape change three hours before the provider's status page acknowledged it.

Recall / checkpoint¶

Name the five paging conditions and the signal each watches.
Why is a sliced eval score essential, and not just an aggregate?
What is a deploy-anchored alert window and why does it improve precision?
List the required fields in an alert payload.
What is the alert plane's most common degradation pattern, and how does the apparatus detect it?
When is a low alert volume a sign of health versus blindness?
How does the alert plane interact with the rollback hooks the deploys provide?

Interview Q&A¶

Q1. A team has alerts firing constantly and the on-call is muting them. What is the apparatus failure, and what is the remediation? Low alert precision. The on-call's behaviour is a signal about the alerts, not a discipline problem. The remediation is to measure precision per alert against postmortems, retune the lowest-precision alerts (tighten thresholds, add slice confirmation, lengthen sustain windows), and split severity tiers so noisy alerts create tickets rather than page. The principle is that the on-call's attention is the constrained resource; alert design optimises for that resource. Common wrong answer to avoid: "tell the on-call to take alerts seriously" — the on-call is responding correctly to a noisy stream.

Q2. Walk through wiring a quality alert for a new AI feature. Source: eval-on-production-traffic, sampled, scored against a per-intent rubric. Slicing: intent, tenant tier, cohort if relevant. Baseline: rolling window (7-14 days), recomputed daily. Threshold: 1σ regression sustained 30 minutes for P2, 2σ for P1, tightened during the 60-minute post-deploy window. Payload: slice, score before/after, sample traces, versions, deploy ID, runbook link, rollback command. Verification: run the alert against the last quarter's incident set and confirm it would have fired. Common wrong answer to avoid: "an aggregate eval score is enough" — aggregate hides slice regressions.

Q3. Why is the deploy-anchored window load-bearing for alert precision? Because the probability of cause-and-effect attribution is high. A signal moving immediately after a deploy is almost certainly caused by the deploy; the alert is justified at a sensitivity that would be too noisy in steady state. The window also allows the payload to carry the deploy ID and a copy-paste rollback command, which collapses the mean time to action. Without the deploy anchor, the same alert fires constantly on baseline noise and the on-call mutes it. Common wrong answer to avoid: "tighter thresholds always work". They only work when paired with the deploy anchor; otherwise they explode false positives.

Q4. The cost-spike alert is wired at an absolute threshold and misses a slow runaway. How would you redesign it? Move from absolute threshold to baseline-relative anomaly. The signal is per-tenant or per-feature cost rate compared to a rolling baseline (hourly or daily depending on volume). The alert fires on deviation (3σ or a configured percentage) sustained beyond a noise window. The absolute threshold remains as a backstop for catastrophic runaways but is not the primary signal. The redesigned alert catches the slow climb that an absolute threshold misses entirely. Common wrong answer to avoid: "lower the absolute threshold" — that produces false positives at lower volumes without catching the climb at higher volumes.

Q5. The alert payload is missing the deploy ID. How serious is this, and what is the fix? Serious. The on-call cannot connect the alert to the change that caused it; they spend the first 5-10 minutes of the incident looking up recent deploys. The fix is to enrich the payload at alert-fire time — the alert engine subscribes to the deploy event stream and includes the deploy ID if the alert fires inside the post-deploy window. The payload schema enforces the field as required for the prompt-version page. Common wrong answer to avoid: "the on-call can look it up" — they can, but the apparatus has just imposed avoidable latency on the response.

Q6. How do you tune alerts on a feature with low traffic volume? Longer windows, simpler thresholds, lower per-event sensitivity. The signal is sparse; anomaly detection on sparse data produces noise. Pattern: aggregate the feature with siblings if their failure modes are similar; lengthen the rolling baseline window; accept that alert recall on low-traffic features will be lower than on high-traffic features. Document the choice explicitly so the apparatus's known coverage gaps are visible. Common wrong answer to avoid: "make the thresholds tighter" — produces noise without improving recall.

Design / debug exercise (10 minutes)¶

Modelled example. Take the worked example (the Bengaluru insurance SaaS feature). Walk through the five paging conditions and verify each is wired with six elements: source, slicing, baseline, threshold, deploy anchor, and payload schema (including the escalation path). Identify any condition that is missing one of these elements.

Your turn. Pick one AI feature. For each of the five paging conditions, fill in the same six elements. Flag any condition you cannot fill in — that is your next apparatus work.

Reproduce from memory. Write the alert payload schema from memory. The signal that you have internalised this chapter is that the schema lands in under three minutes with the required fields present.

Operational memory¶

This chapter explained AI alert design: sliced eval scores, baseline-relative anomaly thresholds, deploy-anchored windows, and structured payloads. The important idea is that AI alerts must be designed for precision, recall, and on-call attention: five paging conditions tuned per feature, with a payload that pre-loads context.

You learned to wire each of the five paging conditions with the six structural elements (source, slicing, baseline, threshold, deploy anchor, payload). That solves the opening failure because the alert plane is the apparatus's eyes; without it, the other surfaces have no input to act on.

Carry this diagnostic forward: when an on-call complains about alert quality, measure precision and recall per alert against postmortems. The on-call's behaviour is a signal; the apparatus's tuning is the response.

Remember:

Five paging conditions: quality, prompt-version, provider-drift, cost-spike, safety violation.
Slice eval scores by intent, tenant, cohort.
Anchor alerts to deploys for precision and rollback context.
The payload pre-loads context; the schema enforces it.
Alert volume is misleading; precision and recall against postmortems are the truth.

Bridge. Alerts produce pages. Pages reach people. The next chapter is the rotation plane — who is on call for which AI surface, with the right context and training, so the page lands somewhere productive. → 04-rotation-and-ownership.md