10. Alarm panel at build time — Observability and the eval gates that decide ship-or-no-ship¶

~18 min read. Two sides of the same question: "how do you know the agent is working correctly?" Observability answers at runtime. Eval gates answer before deployment.

The refund bot that passed every test and still broke¶

Recovery saves the agent after a crash. But how do you know it crashed in the first place?

Tuesday, 2:47 AM. A customer-service agent starts telling users their refunds are "processing" when the refund service is returning 500s. No alarm fires. No eval caught it. The prompt changed last Thursday — a minor wording tweak — and the model stopped retrying on tool errors. It just... made something up.

The on-call wakes to a Slack flood at 7 AM. More than four hours of wrong answers. She opens the dashboard and finds... request count. Average latency. Error rate at the HTTP layer: green, because the agent itself never errored. It fabricated answers successfully.

She has no span showing what the model decided at each step. No tag linking the prompt version to the failure. No pre-launch eval that tested "what does the agent do when get_refund_status returns 500?" The alarm panel was never wired. The eval gate never existed.

This file is about building both — before the first user complains.

What we know so far¶

From earlier files we have the pieces:

The agent loop (think → act → observe) defines the unit of work.
The toolbelt is the set of actions an agent can take.
The budget caps cost per conversation.
The kill switch halts traffic when things go wrong.
Recovery restores agent state after failure.

But none of those tell you whether the agent is misbehaving right now or whether it was safe to ship in the first place. That requires two mechanisms:

Observability — the runtime alarm panel. Spans, traces, metrics, alerts.
Eval gates — the pre-launch yardstick. Capability, safety, regression, cost, latency, drift.

Same design surface. Different time horizons. Both answer: "is this agent working correctly?"

The tension: visibility vs overhead¶

Every span costs storage. Every eval costs compute. But flying blind costs incidents.

   more visibility                    less visibility
   ──────────────                     ────────────────
   50M spans/day = $$$                no tags = 5-hour debug
   6 eval suites = 20 min CI         no gates = ship and pray
   full PII in traces = breach risk   no context = can't reproduce

The design question is not "should we observe?" It is: what minimum signal set catches misbehavior before users complain?

Part I — The runtime alarm panel¶

Where to cut: span boundaries for agent loops¶

A trace is a tree of spans. One trace per request. Where do you draw span boundaries? If every Python function is a span, you drown in noise. If only the top-level call is a span, you have no detail. Both useless.

Simple rule:

one span per agent step       ← always
one span per LLM call         ← always
one span per tool call        ← always
one span per retry attempt    ← always
one span per HITL approval    ← always

everything else               ← off by default

The agent step is the think → act → observe unit. LLM call and tool call are children of it. Three levels. That is your minimum schema.

The 3 AM rubric for required tags¶

When the agent breaks at 3 AM, the on-call needs to filter fast. Filter by what? By the tags you put on every span at design time.

Ask one question: What would I need to debug this if it broke at 3 AM? The answer becomes a required tag.

required on EVERY span:
  ├── trace_id          (global identifier)
  ├── parent_span_id    (tree structure)
  ├── tenant_id         (which customer?)
  ├── user_id           (which end user?)
  ├── session_id        (which conversation?)
  ├── step_index        (which step in the loop?)
  └── timestamp_ms      (when?)

required on LLM spans:
  ├── model_id, model_version
  ├── prompt_version    (hash or semver)
  ├── input_tokens, output_tokens
  ├── temperature
  └── cost_usd

required on TOOL spans:
  ├── tool_name, tool_version
  ├── args_redacted     (after redaction)
  ├── result_status     (success / error / timeout)
  └── latency_ms

Skip any and the 3 AM debug takes hours longer. Write them once into your span helper. Every span inherits them by default.
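The "write them once into your span helper" advice could look like the following minimal sketch. `SpanContext`, `make_span`, and `REQUIRED_BY_KIND` are hypothetical names chosen for illustration, not the API of any tracing library; the tag names come from the lists above.

```python
import time
import uuid
from dataclasses import dataclass, asdict

# Hypothetical helper; tag names follow the required-tag lists above.
@dataclass(frozen=True)
class SpanContext:
    trace_id: str
    tenant_id: str
    user_id: str
    session_id: str

# Extra tags each span kind must supply on top of the shared ones.
REQUIRED_BY_KIND = {
    "agent_step": set(),
    "llm": {"model_id", "model_version", "prompt_version",
            "input_tokens", "output_tokens", "temperature", "cost_usd"},
    "tool": {"tool_name", "tool_version", "args_redacted",
             "result_status", "latency_ms"},
}

def make_span(ctx, kind, step_index, parent_span_id=None, **tags):
    """Build a span dict and fail loudly if a required tag is missing."""
    missing = REQUIRED_BY_KIND[kind] - set(tags)
    if missing:
        raise ValueError(f"{kind} span missing tags: {sorted(missing)}")
    return {
        **asdict(ctx),                      # shared tags inherited by default
        "span_id": uuid.uuid4().hex,
        "parent_span_id": parent_span_id,
        "kind": kind,
        "step_index": step_index,
        "timestamp_ms": int(time.time() * 1000),
        **tags,
    }

ctx = SpanContext("abc-123", "acme", "u42", "s1")
step = make_span(ctx, "agent_step", step_index=0)
tool = make_span(ctx, "tool", 0, parent_span_id=step["span_id"],
                 tool_name="get_order", tool_version="1",
                 args_redacted={"order_id": "o99"},
                 result_status="success", latency_ms=120)
```

Raising on a missing tag is one design choice among several; a team that prefers not to fail requests over telemetry could log the gap instead.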

Span tree for one trajectory¶

Customer-service agent answering "where is my refund?":

trace_id = abc-123
├── step[0]  agent_step                 [tenant=acme, user=u42]
│   ├── llm[0.0]  model_call            [model=sonnet-4-7, cost=$0.004]
│   │             prompt_version=v17, decision: call get_order
│   └── tool[0.1]  get_order            [status=ok, latency=120ms]
│                  args: {order_id=o99} → {status=shipped, refund=null}
├── step[1]  agent_step
│   ├── llm[1.0]  model_call            [cost=$0.005, decision: get_refund_status]
│   └── tool[1.1]  get_refund_status    [status=ok, latency=80ms]
│                  → {state=pending, eta=2d}
└── step[2]  agent_step
    ├── llm[2.0]  model_call            [decision: respond]
    └── final_response                  [latency=4.2s, cost_total=$0.013]

Three steps. Two tool calls. Three LLM calls. Every span carries tenant=acme, user=u42, trace_id=abc-123. The on-call can now query: "Show me all traces where get_refund_status errored on tenant=acme last week." One query. Two seconds.

Metric baselines: the numbers that define "normal"¶

Before you can alert, you must know what normal looks like. Three baselines, measured on the first week of production traffic:

metric              what it measures             alert threshold
──────────────────  ──────────────────────────   ────────────────────
error_rate          tool calls returning error   > 2× baseline for 5 min
p95_latency         end-to-end wall clock        > 2× baseline for 10 min
cost_per_request    sum of cost_usd per trace    > 3× baseline for 15 min

These three catch most incidents. Error rate catches broken tools. Latency catches model slowdowns and retry storms. Cost catches prompt regressions that balloon token usage.

The thresholds are multiples of baseline, not absolute numbers. A refund bot with baseline p95 of 4s alerts at 8s. A coding agent with baseline p95 of 90s alerts at 180s. Same rule, different numbers.
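The "multiple of baseline, sustained for N minutes" rule can be sketched as below. `BaselineAlert` is a hypothetical name, and the sketch assumes one metric sample per minute, which the text does not specify.

```python
from dataclasses import dataclass

@dataclass
class BaselineAlert:
    baseline: float   # value measured during the first week of traffic
    multiple: float   # 2.0 means "alert above 2x baseline"
    window: int       # consecutive samples required (assumed one per minute)

    def should_fire(self, samples):
        """True if the last `window` samples all exceed multiple * baseline."""
        if len(samples) < self.window:
            return False
        limit = self.multiple * self.baseline
        return all(s > limit for s in samples[-self.window:])

# Same rule, different numbers: p95 latency in seconds.
refund_bot_p95 = BaselineAlert(baseline=4.0, multiple=2.0, window=10)
coding_agent_p95 = BaselineAlert(baseline=90.0, multiple=2.0, window=10)

refund_bot_p95.should_fire([9.0] * 10)      # sustained above 8s
coding_agent_p95.should_fire([100.0] * 10)  # below 180s, stays quiet
```

Requiring every sample in the window to breach the limit trades a slower alert for fewer false alarms on a single spike.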

Cost attribution from day one¶

The budget is not just a runtime cap — it is an accounting axis.

   no tags                       with tags
   ─────────                     ─────────
   total: $48,000/mo             total: $48,000/mo
                                 ├── tenant=acme:    $18,200
   "where did this go?"          ├── tenant=globex:  $11,400
                                 └── feature=refund_bot: $26,000
   no answer                       → refund_bot is 54% of spend; acme is 38%

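Add cost_usd to every LLM span. Compute it: input_tokens × input_price + output_tokens × output_price, with prices expressed per token. Roll up by tenant and feature. Now you know which customer and which feature drive the bill, and comparing that against revenue shows which customer is unprofitable.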
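A minimal sketch of the cost formula and the roll-up follows. The price table, the model name `example-model`, and the `feature` tag are placeholders for illustration; real prices vary by model and provider.

```python
from collections import defaultdict

# Hypothetical prices in USD per million tokens (placeholder values).
PRICES_PER_MTOK = {"example-model": {"input": 3.00, "output": 15.00}}

def cost_usd(model_id, input_tokens, output_tokens):
    """input_tokens x input_price + output_tokens x output_price."""
    p = PRICES_PER_MTOK[model_id]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000

def roll_up(spans, key):
    """Sum cost_usd across spans, grouped by one tag (tenant_id, feature, ...)."""
    totals = defaultdict(float)
    for span in spans:
        totals[span[key]] += span["cost_usd"]
    return dict(totals)

spans = [
    {"tenant_id": "acme", "feature": "refund_bot",
     "cost_usd": cost_usd("example-model", 1000, 200)},
    {"tenant_id": "acme", "feature": "order_lookup",
     "cost_usd": cost_usd("example-model", 500, 100)},
    {"tenant_id": "globex", "feature": "refund_bot",
     "cost_usd": cost_usd("example-model", 1000, 200)},
]
by_tenant = roll_up(spans, "tenant_id")
by_feature = roll_up(spans, "feature")
```

Because the cost is stamped on the span at write time, a later price change does not silently rewrite historical spend.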

Sampling: completeness vs cost¶

Two pressures. Completeness — rare bugs live in rare traces. Cost — 50M spans/day is real money.

low-volume  (< 1K traces/day)   → sample 100%
high-volume (> 100K traces/day) → head-based 5-10%
                                  PLUS keep 100% of:
                                  ├── error traces (status=error)
                                  ├── high-cost (> $0.50)
                                  ├── HITL-flagged
                                  └── complaint traces

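Head-based sampling decides at trace start, so the whole tree is kept or dropped together. Tail-based sampling decides after the trace finishes (better coverage, needs buffering). The keep-100% rules above depend on how the trace turned out, so they need a tail-style check at trace end, or unconditional export of error spans at the cost of partial trees. Most teams start with head-based sampling plus error-always-kept, which preserves most of the debug value at a fraction of the storage cost.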
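One way to combine the two decisions is sketched below, assuming spans are buffered until the trace finishes. `head_keep`, `keep_trace`, and the `hitl_flagged` tag are hypothetical names; complaint traces are left out because a complaint usually arrives after the trace has ended.

```python
import hashlib

HEAD_RATE = 0.10       # keep about 10% of traces regardless of outcome
HIGH_COST_USD = 0.50   # always keep traces costing more than this

def head_keep(trace_id, rate=HEAD_RATE):
    """Deterministic head decision: the same trace_id always gets the same answer."""
    digest = hashlib.sha256(trace_id.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64   # roughly uniform in [0, 1)
    return bucket < rate

def keep_trace(trace_id, spans):
    """Tail-style check, called once the trace has finished and spans are buffered."""
    always = (
        any(s.get("result_status") == "error" for s in spans)
        or sum(s.get("cost_usd", 0.0) for s in spans) > HIGH_COST_USD
        or any(s.get("hitl_flagged") for s in spans)
    )
    return always or head_keep(trace_id)

keep_trace("abc-123", [{"result_status": "error", "cost_usd": 0.004}])
```

Hashing the trace_id rather than calling a random generator means every service that sees the trace reaches the same keep-or-drop decision without coordination.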

Redaction: protect users, preserve debuggability¶

User data flows through the agent. PII, payment info, health data.

   redact at source                redact at query time
   ─────────────────               ─────────────────────
   agent ──strip PII──→ store      agent ──raw──→ store ──mask──→ dashboard

   pro: GDPR-safe by default       pro: full data for debug
   con: cannot un-redact            con: one bug = data breach

Redact at source is the safe default. The blast radius of a PII leak is huge. Keep a tightly-scoped "raw" store with 24-hour retention only if your security posture allows it.
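Redact-at-source means the scrubbing runs before a span is written. The sketch below is illustrative only: the two regular expressions are simplistic assumptions, and real PII detection needs a vetted library and a reviewed pattern list.

```python
import re

# Illustrative patterns only; they will miss many PII formats.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
# 13 to 16 digits, optionally separated by spaces or hyphens.
CARD = re.compile(r"\b\d(?:[ -]?\d){12,15}\b")

def redact(value):
    """Recursively mask emails and card-like numbers in strings, dicts, and lists."""
    if isinstance(value, str):
        value = EMAIL.sub("[EMAIL]", value)
        return CARD.sub("[CARD]", value)
    if isinstance(value, dict):
        return {k: redact(v) for k, v in value.items()}
    if isinstance(value, list):
        return [redact(v) for v in value]
    return value   # numbers, booleans, None pass through unchanged

args_redacted = redact({
    "order_id": "o99",
    "contact": "jo@example.com",
    "note": "card 4111 1111 1111 1111 please",
})
```

Only the redacted value is handed to the span helper, so the raw string never reaches the trace store.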

The alarm panel in action — 5-minute diagnosis¶

Three weeks after launch. Tenant acme reports slow, wrong refund queries. The on-call opens the alarm panel:

q1: tenant=acme last 7d, sort by latency
    → 1,243 traces, p95=12.4s (baseline 4s)  ← alert fired
q2: filter tool=get_refund_status status=error
    → 287 traces with errors
q3: group by model_version
    → v2026-04-10: 12 errors
    → v2026-05-01: 275 errors  ← model upgrade broke it
q4: group by prompt_version
    → v17: all 287 errors

Four queries. Five minutes. Prompt v17 plus model v2026-05-01 is the bad pair on tenant acme. Roll back prompt to v16. Incident over. None of that works without tags wired at design time.
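The group-by queries reduce to a filter plus a count once each trace carries its tags. The sketch below assumes a hypothetical per-trace summary record (flattened from the span tree) with a `failed_tools` list; the sample rows are made up for illustration.

```python
from collections import Counter

# Hypothetical per-trace summaries; values are illustrative.
traces = [
    {"tenant_id": "acme", "model_version": "v2026-05-01",
     "prompt_version": "v17", "failed_tools": ["get_refund_status"]},
    {"tenant_id": "acme", "model_version": "v2026-04-10",
     "prompt_version": "v17", "failed_tools": ["get_refund_status"]},
    {"tenant_id": "acme", "model_version": "v2026-05-01",
     "prompt_version": "v17", "failed_tools": []},
    {"tenant_id": "globex", "model_version": "v2026-05-01",
     "prompt_version": "v16", "failed_tools": ["get_refund_status"]},
]

def group_errors(traces, tenant, tool, key):
    """Count traces for one tenant where `tool` failed, grouped by tag `key`."""
    return Counter(
        t[key] for t in traces
        if t["tenant_id"] == tenant and tool in t["failed_tools"]
    )

by_model = group_errors(traces, "acme", "get_refund_status", "model_version")
by_prompt = group_errors(traces, "acme", "get_refund_status", "prompt_version")
```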

Eval-data emission: every output is a candidate eval¶

The alarm panel is runtime. But it also feeds the pre-launch system:

when the agent finishes a trajectory:
  ├── log the input
  ├── log the final output
  ├── log every tool call sequence
  ├── log user feedback (thumbs up/down)
  └── log human override (if HITL stepped in)

  → push to eval-data store
  → fraction gets labeled by humans
  → labeled set becomes regression eval
  → regression eval gates the next launch

If you do not emit eval data from day one, you have no labeled data when you need it. Building the yardstick from scratch six months in is painful. This is where observability feeds eval gates — the runtime alarm panel generates the data that the pre-launch gates consume.
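An eval-shaped event might look like the record below. `EvalRecord` and `emit` are hypothetical names, the fields mirror the list above, and the JSON-lines sink stands in for whatever eval-data store is actually used.

```python
import io
import json
from dataclasses import dataclass, field, asdict
from typing import Optional

@dataclass
class EvalRecord:
    trace_id: str
    input_redacted: str                   # input after redaction at source
    final_output: str
    tool_calls: list = field(default_factory=list)   # ordered tool names
    user_feedback: Optional[str] = None   # "up", "down", or None
    human_override: bool = False          # True if HITL stepped in
    label: Optional[str] = None           # filled in later by a human labeler

def emit(record, sink):
    """Append one record as a JSON line to the eval-data sink."""
    sink.write(json.dumps(asdict(record)) + "\n")

sink = io.StringIO()   # stand-in for a file or queue
emit(EvalRecord("abc-123", "where is my refund?", "Your refund is pending.",
                tool_calls=["get_order", "get_refund_status"],
                user_feedback="up"), sink)
first = json.loads(sink.getvalue().splitlines()[0])
```

Leaving `label` empty at emission time keeps the runtime path cheap; labeling happens offline on the sampled fraction.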

Part II — The pre-launch eval gates¶

Six gates. All green. Or the agent does not ship.¶

The yardstick is a hiring panel. Six interviewers. Each asks one question. Each has veto power. Pass all six, the agent ships. Fail one, no launch.

new agent build
      │
      ▼
┌───────────────────┐
│ 1. Capability     │  ≥ 80% on golden set?
└────────┬──────────┘
         │ pass
         ▼
┌───────────────────┐
│ 2. Safety         │  ≥ 99% refusal on red-team set?
└────────┬──────────┘
         │ pass
         ▼
┌───────────────────┐
│ 3. Regression     │  100% on locked bug set?
└────────┬──────────┘
         │ pass
         ▼
┌───────────────────┐
│ 4. Cost           │  p50/p95 within budget?
└────────┬──────────┘
         │ pass
         ▼
┌───────────────────┐
│ 5. Latency        │  p50/p95 within SLA?
└────────┬──────────┘
         │ pass
         ▼
┌───────────────────┐
│ 6. Drift baseline │  output distribution captured?
└────────┬──────────┘
         │ pass
         ▼
   ship to canary

Any red light, the kill switch never even gets armed. The launch does not happen.
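The six-gate rule can be expressed as a list of checks where any single failure blocks the launch. This is a sketch under assumptions: the `Gate` tuple and the keys of the `results` dict are hypothetical, and unlike the flowchart, which stops at the first red light, it evaluates every gate so the report names all failures at once.

```python
from typing import Callable, NamedTuple

class Gate(NamedTuple):
    name: str
    owner: str                       # who signs off
    check: Callable[[dict], bool]    # True means green

GATES = [
    Gate("capability", "product manager", lambda r: r["capability_pass_rate"] >= 0.80),
    Gate("safety", "security lead", lambda r: r["refusal_rate"] >= 0.99),
    Gate("regression", "CI (automated)", lambda r: r["regression_failures"] == 0),
    Gate("cost", "eng manager", lambda r: r["cost_within_budget"]),
    Gate("latency", "SRE / platform", lambda r: r["latency_within_sla"]),
    Gate("drift baseline", "ML platform", lambda r: r["drift_baseline_captured"]),
]

def ship_decision(results):
    """Return the decision and the names of any failed gates."""
    failed = [g.name for g in GATES if not g.check(results)]
    return ("no launch", failed) if failed else ("ship to canary", [])

results = {
    "capability_pass_rate": 0.86, "refusal_rate": 0.985,
    "regression_failures": 0, "cost_within_budget": True,
    "latency_within_sla": True, "drift_baseline_captured": True,
}
decision, failed = ship_decision(results)   # safety is below 99%, so no launch
```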

Gate 1 — Capability: does it solve the intended tasks?¶

A golden set of 100–500 representative inputs with expected behaviors.

Who builds: product owner + agent engineer.
Min size: 100 for narrow agents, 500+ for broad ones.
Threshold: ≥ 80% pass rate.
Sign-off: product manager — they own "intended task."

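Why 80% and not 95%? LLM evals are noisy. At an 80% pass rate, a 100-case set carries about ±8 points of sampling error at 95% confidence (standard error √(0.8 × 0.2 / 100) = 0.04), before any run-to-run model variance. Chasing 95% on 100 cases is chasing noise. Either accept 80%, or grow to 1000+ cases, where the margin shrinks to roughly ±1 to ±3 points and 95% becomes meaningful.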
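The size of the statistical wiggle can be estimated with the normal approximation to the binomial. This is a sketch that covers sampling error only; it assumes independent cases and ignores run-to-run variance of the model itself.

```python
import math

def pass_rate_margin(pass_rate, n, z=1.96):
    """Half-width of the normal-approximation 95% interval for a pass rate."""
    return z * math.sqrt(pass_rate * (1 - pass_rate) / n)

margin_100 = pass_rate_margin(0.80, 100)     # about 0.078, i.e. +/- 8 points
margin_1000 = pass_rate_margin(0.95, 1000)   # about 0.014, i.e. +/- 1.4 points
```

A measured 80% on 100 cases is therefore compatible with a true rate anywhere from roughly 72% to 88%, which is why small differences between builds on a small golden set should not decide a launch.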

Gate 2 — Safety: does it refuse what it should refuse?¶

A red-team set — prompts trying to leak secrets, abuse tools, ignore policy, exfiltrate data.

Who builds: security team + red-teamers.
Min size: 200 adversarial cases (injection, jailbreak, prompt extraction, tool abuse, data exfil).
Threshold: ≥ 99% refusal.
Sign-off: security lead. Not the engineer. Not the PM.

Why not 100%? Because 100% is unachievable with current LLMs on any non-trivial adversarial set. Claiming 100% means you are not testing hard enough. The 99% bar says "at most 2 failures in 200 cases." Production has additional layers — rate limits, kill switch, audit logs.

Gate 3 — Regression: does the new build still pass everything we already fixed?¶

Every production bug becomes a locked case. Frozen. Cannot be edited. Cannot be deleted.

Who builds: every engineer shipping a bug fix adds the case.
Min size: zero at start. Mature agents reach 200–2000.
Threshold: 100%. No exceptions.
Sign-off: automated — CI blocks the merge if any case fails.

Why so strict? Every case represents a real customer who got burned once. Letting it through twice is unforgivable. The regression set only grows.
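"Frozen, cannot be edited, cannot be deleted" can be enforced mechanically. The sketch below assumes a hypothetical lockfile that maps each case id to a content hash recorded when the case was added; CI would fail if `verify_locked` returns anything.

```python
import hashlib
import json

def case_digest(case):
    """Stable content hash of one regression case."""
    payload = json.dumps(case, sort_keys=True).encode()
    return hashlib.sha256(payload).hexdigest()

def verify_locked(cases, lockfile):
    """Report locked cases that were edited or deleted; new cases are allowed."""
    current = {c["id"]: case_digest(c) for c in cases}
    problems = []
    for case_id, digest in lockfile.items():
        if case_id not in current:
            problems.append(f"deleted: {case_id}")
        elif current[case_id] != digest:
            problems.append(f"edited: {case_id}")
    return problems

cases = [{"id": "bug-001", "input": "where is my refund?",
          "expect": "retries when get_refund_status returns 500"}]
lockfile = {c["id"]: case_digest(c) for c in cases}   # recorded at add time
problems = verify_locked(cases, lockfile)             # empty list means intact
```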

Gate 4 — Cost: is p50/p95 within budget?¶

Run the golden set. Measure dollars per conversation.

Threshold: p50 ≤ budget, p95 ≤ 2× budget. Hard per-conversation cap set above p99.
Sign-off: engineering manager — they own burn rate.

cost distribution for an internal coding agent:
   p50: $0.04  ← typical session
   p95: $0.18  ← long debug session
   p99: $0.40  ← deep refactor
   cap: $1.00  ← circuit breaker fires

If p95 already exceeds its threshold at 100 users, what happens at 10K? The same per-conversation overrun, multiplied by 100× the traffic. You go broke. Enforce the budget before traffic, not after the invoice.
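A cost gate over the golden-set run can be sketched as follows. The nearest-rank percentile and the rule that no single eval run may exceed the hard cap are assumptions of this sketch; the sample costs echo the coding-agent distribution above.

```python
import math

def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of data at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]

def cost_gate(costs, budget, hard_cap):
    """Green if p50 <= budget, p95 <= 2x budget, and nothing breaches the cap."""
    p50 = percentile(costs, 50)
    p95 = percentile(costs, 95)
    return p50 <= budget and p95 <= 2 * budget and max(costs) <= hard_cap

# 100 golden-set conversations: mostly typical, a few long, one deep refactor.
costs = [0.04] * 90 + [0.18] * 9 + [0.40]
passes = cost_gate(costs, budget=0.10, hard_cap=1.00)   # p50=0.04, p95=0.18
fails = cost_gate(costs, budget=0.05, hard_cap=1.00)    # p95 0.18 > 0.10
```

The latency gate has the same shape with seconds in place of dollars and the SLA in place of the budget.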

Gate 5 — Latency: is p50/p95 within SLA?¶

Same golden set, now measuring wall-clock end to end. Includes tool calls, retries, model thinking time.

Threshold: chat agent — p50 ≤ 3s, p95 ≤ 10s. Background agent — p50 ≤ 60s, p95 ≤ 300s.
Sign-off: SRE or platform owner.

A snappy demo lies. Demo runs one query on warm cache. Production runs thousands with cold caches and rate limits. The latency gate must use realistic conditions — not ideal ones.

Gate 6 — Drift baseline: what does "normal" look like today?¶

This gate does not pass or fail at launch. It captures. Today's outputs on a fixed 200-case probe set become the baseline. Tomorrow's outputs get compared.

Maintains: automated. Runs daily on the fixed probe set.
Threshold at launch: none. The baseline IS the pass.
Threshold post-launch: drift > 10% from baseline fires the alarm panel.
Sign-off: ML platform owner.

Without a baseline, you cannot tell the agent slowly got worse. Drift is invisible without a t=0 reference point. The drift gate is your future self's gift.
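The text does not pin down how drift is measured, so the sketch below picks one simple definition as an assumption: the share of probe cases whose categorical outcome (for example the chosen tool or a refuse/answer label) differs from the launch baseline. `drift_fraction` and the sample labels are hypothetical.

```python
DRIFT_ALERT = 0.10   # alarm when more than 10% of probe cases changed

def drift_fraction(baseline, today):
    """Share of probe cases whose outcome label differs from the baseline."""
    if baseline.keys() != today.keys():
        raise ValueError("probe set must be identical between runs")
    changed = sum(1 for case_id in baseline if baseline[case_id] != today[case_id])
    return changed / len(baseline)

# Outcome label per probe case, captured at launch (t=0) and on a later day.
baseline = {"p1": "get_refund_status", "p2": "refuse",
            "p3": "escalate", "p4": "get_order"}
today = {"p1": "get_refund_status", "p2": "refuse",
         "p3": "respond", "p4": "get_order"}

drift = drift_fraction(baseline, today)   # one of four cases changed
alarm = drift > DRIFT_ALERT
```

Free-text outputs would need a similarity measure instead of exact label comparison; that choice is outside this sketch.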

The ship / no-ship decision matrix¶

gate              must hit     who signs off       blocking?
────────────────  ───────────  ──────────────────  ─────────
capability        ≥ 80%        product manager     yes
safety            ≥ 99%        security lead       yes
regression        = 100%       CI (automated)      yes
cost              ≤ budget     eng manager         yes
latency           ≤ SLA        SRE / platform      yes
drift baseline    captured     ML platform         yes

All six green. No "we will fix it post-launch." That phrase is how teams end up debugging at 3 AM without an alarm panel.

Keeping the yardstick fresh¶

The gates themselves go stale. After 6 months, the golden set covers tasks nobody asks for anymore. The red-team set misses new jailbreak styles. Cost targets reflect old token prices.

trigger                          action
────────────────────────────     ──────────────────────────
production bug found             add to regression set (immediate)
new jailbreak in the wild        add to safety set (within a week)
golden set pass rate drifts      refresh 20% of cases (quarterly)
model upgrade                    re-measure cost + latency (immediate)
new feature shipped              add capability cases (per feature)

Rule: regression set only grows. Safety set grows fast. Capability set refreshes 20% per quarter. Cost and latency thresholds get reviewed every model upgrade.

Never silently lower a threshold to make a build pass. That is how the yardstick becomes theatre.

How the two sides connect¶

Observability and eval gates are not independent. They form a feedback loop:

┌──────────────────────────────────────────────────────────────┐
│                                                              │
│  RUNTIME (observability)              PRE-LAUNCH (eval gates)│
│  ─────────────────────                ───────────────────────│
│  spans + traces + metrics ──────→ eval-data store            │
│                                       │                      │
│  alarm fires ──→ bug found ──────→ regression set grows      │
│                                       │                      │
│  drift detected ──→ baseline ────→ drift gate threshold      │
│                                       │                      │
│  cost spike ──→ budget review ───→ cost gate re-calibrated   │
│                                       │                      │
│  eval gates pass ──→ ship ───────→ alarm panel monitors      │
│                                                              │
└──────────────────────────────────────────────────────────────┘

Every production incident enriches the eval gates. Every eval gate that passes enables production monitoring. The alarm panel generates the data. The eval gates consume it. They are the same design surface at different time horizons.

Where this lives in the wild¶

LangSmith: span schemas for LLM/tool/agent-step spans with metadata tags; used at Replit and Klarna for production debugging AND eval pipelines that convert traces into regression sets.
Braintrust: eval-driven observability tying every production trace to a potential eval example; capability and regression gates run in CI.
Datadog LLM Observability: span-based view with per-tenant cost attribution, prompt-version drilldowns, and threshold-based alerting.
OpenTelemetry GenAI semantic conventions: standard operation names (gen_ai.operation.name values such as chat and execute_tool) and attributes (gen_ai.request.model, gen_ai.usage.input_tokens) for cross-vendor instrumentation.
Vercel AI SDK eval harness: ai-sdk/evals defines capability and regression suites running on every PR as CI gates.
OpenAI / Anthropic launch criteria: internal capability suites plus safety bars (Anthropic's Responsible Scaling Policy, OpenAI's Preparedness Framework) that gate every model release.

Recognition: spotting the pattern in unfamiliar systems¶

You are reviewing a new agent system. How do you tell if it has the alarm panel wired?

Healthy signs:
- Every LLM call carries model_version and prompt_version tags.
- There is a regression eval set that grows after every bug fix.
- CI blocks the merge if any eval gate fails.
- The on-call can query "show me all failed traces for tenant X in the last hour."
- Cost per tenant is a dashboard metric, not a monthly surprise.

Red flags:
- "We log to stdout and grep when something breaks."
- "We test manually before launch."
- "The eval set hasn't been updated in three months."
- "We lowered the threshold last sprint because the build kept failing."
- "We don't know our p95 latency."

Interview Q&A¶

Q: Why must observability be designed before the agent ships? A: Spans require parent-child links and tags baked into instrumentation at write time. Adding a tag later means redeploying, and traces recorded before the change never get it. You discover missing tags during the 3 AM incident, when there is no time to add them. The alarm panel must be wired at architecture time so the first incident is debuggable.

Wrong model to avoid: "We can log everything to a file and grep later." Files are flat, lack parent-child links, lack consistent tags. You cannot filter "all traces where tool X failed on tenant Y" from raw logs.

Q: Why six separate eval gates instead of one "quality score"? A: They fail differently and have different owners. A build can be fast and expensive (parallel tool calls eating tokens) or cheap and slow (a small model that needs many sequential steps or retries). It can be capable but unsafe. Combining gates into one number hides which dimension regressed and who should fix it.

Wrong model to avoid: "Latency and cost correlate, so one gate is enough." They can move in opposite directions. Parallel tool calls cut latency while raising cost. Batch inference tiers cut cost while raising latency.

Q: Why is the regression gate 100% but capability only 80%? A: They test different things. Capability measures general task performance on a sampled, refreshing set — noisy by nature. Regression pins down specific bugs that already burned real customers. Each is a permanent locked case. A build scoring 90% on capability and reintroducing one known bug is unshippable.

Wrong model to avoid: "Regression is just a subset of capability." Capability refreshes; regression only grows. Capability threshold is 80%; regression is 100%. Mixing them lets old bugs slip through statistical noise.

Q: When would you choose tail-based sampling over head-based? A: When the keep decision depends on how the trace turned out, for example keeping all error traces plus a small fraction of successes. Head-based decides at trace start (random 5–10%), before the outcome is known. Tail-based decides after the trace finishes, based on outcome. Tail-based gives better debug coverage but requires buffering every span until the trace ends, which is expensive for long-running agents. Most teams start head-based and add one narrow tail rule: always keep errors.

Apply now (10 min)¶

Wire the alarm panel. For an agent you have built or imagined, write the required tag list for one LLM span and one tool span (six tags each). Apply the 3 AM rubric — would each tag help an on-call debug?

Define the gates. For the same agent, fill in a six-gate card:

gate           | dataset size | threshold | sign-off
───────────────|──────────────|───────────|──────────
capability     |              |           |
safety         |              |           |
regression     |              |           |
cost           |              |           |
latency        |              |           |
drift baseline |              |           |

Sketch from memory. Draw the span tree for a 2-step agent (1 tool call per step, final answer). Label every span with 4+ required tags. Then draw the six-gate flowchart with thresholds. These two diagrams are the alarm panel and the yardstick — one picture for runtime, one for pre-launch.

Operational memory¶

This file explained that observability and eval gates are the same design surface — both answer "how do you know the agent is working correctly?" one at runtime, one before deployment. The minimum signal set that catches misbehavior before users complain is the central design question, balanced against the overhead cost of visibility.

For the runtime alarm panel: spans are designed at architecture time (agent step, LLM call, tool call: three levels); every span carries the 3 AM rubric tags (trace_id, parent_span_id, tenant_id, user_id, session_id, step_index, timestamp_ms), and LLM spans add model_version and prompt_version; metric baselines (error rate, p95 latency, cost/request) define "normal" and alerts fire at 2–3× baseline; head-based sampling plus error-always-kept preserves most of the debug value at a fraction of the storage cost; redact PII at source; emit eval-shaped events from day one so the yardstick has labeled data when needed.

For the pre-launch eval gates: six gates with six veto holders — capability (≥80%, product manager), safety (≥99%, security lead), regression (100%, CI), cost (≤budget, eng manager), latency (≤SLA, SRE), drift baseline (captured, ML platform). Regression set only grows. Safety set grows fast. Capability refreshes 20% per quarter. Never lower a threshold to make a build pass.

The two sides feed each other: production incidents become regression cases; drift baselines become alert thresholds; eval-data emission in runtime fuels the pre-launch gates.

Remember:

Observability is designed before the agent ships; tags cannot be added retroactively to historical traces.
The 3 AM rubric: "what would I need to debug this at 3 AM?" → that becomes a required span tag.
Three metric baselines (error rate, p95, cost/request) catch most incidents; alert at 2–3× baseline.
Six eval gates, six veto holders; combining them into one score hides which dimension regressed.
Regression = 100% (real customers burned once), safety = 99% (honest about LLM limits), capability = 80% (noisy below 1K cases).
Every production bug becomes a locked regression case. The set only grows.
Never silently lower a threshold to make a build pass — that is how the yardstick becomes theatre.
Runtime observability generates eval data; eval gates consume it. Same surface, different time horizons.

Bridge. The alarm panel tells you when something is wrong. The eval gates tell you when it is safe to ship. But how do you actually get the agent into production — and how do you yank it back? Shadow mode, canary ramp, kill switch choreography. Next: the deployment lifecycle. → 11-rollout-versioning-kill-switch.md