06. Observability & Error Budgets: instruments plus rules for sailing

~12 min read. Seeing problems is useful only when it changes decisions.

Builds on the ELI5 in 00-eli5.md. The weather check, our reminder to assess risk, turns raw signals into better operating discipline.

1) Observability means seeing the whole voyage

Look. A captain does not stare at one wave and call it navigation. The crew needs a full view of the voyage. Where are we? What changed? What slowed us down? What cost more than expected? Which source did we trust? That is observability.

In AI systems, observability is more than logs. You need traces, model version, prompt version, retrieval sources, latency, cost, feedback, and error classes. Simple, no? Without that picture, the course becomes guesswork. Decisions lose evidence. Handovers become arguments from memory. The ship's log fills with vague stories. And the weather check becomes theatre.

See the shape of a useful trace.

┌──────────┐   ┌──────────┐   ┌─────────────┐   ┌──────────┐
│ Request  │──▶│ Retrieval│──▶│ Model call  │──▶│ Response │
└──────────┘   └──────────┘   └─────────────┘   └──────────┘
      │              │               │               │
      ├─ trace id    ├─ source ids   ├─ model ver    ├─ feedback
      ├─ user tier   ├─ hit counts   ├─ latency      ├─ outcome
      └─ route       └─ tenant       └─ cost         └─ error tag

This picture matters because AI failures are layered. Sometimes retrieval failed. Sometimes the model changed. Sometimes latency came from tool fan-out. Sometimes user feedback fell because answers grew vague. Without a joined trace, you only see symptoms.
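A minimal sketch of one joined trace record, mirroring the four stages in the diagram. The field names, the example tag values, and the `likely_failure_layer` helper are illustrative assumptions, not a standard schema.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class TraceRecord:
    """One joined record per request (hypothetical schema)."""
    # Request stage
    trace_id: str
    user_tier: str
    route: str
    # Retrieval stage
    source_ids: list[str] = field(default_factory=list)
    hit_count: int = 0
    tenant: Optional[str] = None
    # Model call stage
    model_version: str = "unknown"
    latency_ms: float = 0.0
    cost_usd: float = 0.0
    # Response stage
    feedback: Optional[str] = None   # e.g. "thumbs_down", "escalated"
    outcome: Optional[str] = None    # e.g. "resolved", "abandoned"
    error_tag: Optional[str] = None  # e.g. "timeout"


def likely_failure_layer(t: TraceRecord) -> str:
    """Rough first guess at where to look; a heuristic, not a diagnosis."""
    if t.error_tag:
        return f"tagged:{t.error_tag}"
    if t.hit_count == 0:
        return "retrieval"
    if t.feedback in {"thumbs_down", "escalated"}:
        return "model_or_prompt"
    return "none"
```

Because every stage writes into the same record, a bad answer can be walked back layer by layer instead of guessed at.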

2) The signals that actually matter

So what to do? Log the signals that change decisions. Start with traces because one trace ties a request to every internal step. That lets the crew replay a bad voyage.

Next, capture model version and prompt or workflow version. If a quality drop starts after a version change, that clue matters. If a rollback fixes it, that clue matters too. You cannot defend the course without version history.
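One way to use version history is to group quality checks by version pair. This is a small sketch; the event shape `(model_version, prompt_version, passed)` and the function name are assumptions made for illustration.

```python
from collections import defaultdict


def quality_by_version(events):
    """Return the pass rate per (model_version, prompt_version) pair.

    events: iterable of (model_version, prompt_version, passed) tuples,
    where passed is a bool from whatever quality check you already run.
    """
    totals = defaultdict(lambda: [0, 0])  # key -> [passed_count, total_count]
    for model_v, prompt_v, passed in events:
        bucket = totals[(model_v, prompt_v)]
        bucket[0] += int(passed)
        bucket[1] += 1
    return {key: p / n for key, (p, n) in totals.items()}
```

If the pass rate falls only for the newest pair, the version change is the first suspect.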

Then capture retrieval sources. Record which documents were fetched, which tenant filter ran, and whether the sources were stale. Faithfulness debates become shorter when evidence exists. Yes?

Track end-to-end and per-step latency with percentiles. Averages hide pain. Users feel tails, not means.
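A short sketch of why the mean misleads, using only the Python standard library. The sample numbers are made up for illustration.

```python
import statistics


def latency_summary(latencies_ms):
    """Mean plus tail percentiles for a list of latencies in milliseconds."""
    # quantiles(n=100) returns 99 cut points: index 49 is p50, 94 is p95, 98 is p99.
    cuts = statistics.quantiles(latencies_ms, n=100, method="inclusive")
    return {
        "mean": statistics.fmean(latencies_ms),
        "p50": cuts[49],
        "p95": cuts[94],
        "p99": cuts[98],
    }


# Made-up sample: 90 fast requests and 10 very slow ones.
sample = [100] * 90 + [5000] * 10
print(latency_summary(sample))
```

In this sample the mean is 590 ms, yet the median user sees 100 ms and the slowest tenth waits 5000 ms. The mean describes nobody.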

Cost also belongs on the bridge. Track token usage, tool-call costs, retry costs, and cost per successful task. A cheap failure is still a failure. An expensive success may still break the business.
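Cost per successful task can be sketched as below. The dictionary keys are hypothetical; map them to whatever your billing and trace data actually record.

```python
def cost_per_successful_task(tasks):
    """Total spend across all attempts divided by the number of successes.

    tasks: list of dicts with keys token_cost, tool_cost, retry_cost (floats)
    and succeeded (bool). Failed tasks still count toward the cost.
    """
    total = sum(t["token_cost"] + t["tool_cost"] + t["retry_cost"] for t in tasks)
    successes = sum(1 for t in tasks if t["succeeded"])
    if successes == 0:
        return float("inf")  # money spent, nothing delivered
    return total / successes
```

Dividing by successes rather than by requests is the design choice here: failures and retries push the number up, which is exactly the pain it should show.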

Feedback closes the loop. Thumbs, escalations, corrections, abandonment, and support tickets all help. Human signals keep the compass grounded.

3) Error budgets force honesty

Observability tells you what is happening. Error budgets tell you what to do about it. That is the forcing function. Without a budget, teams promise reliability emotionally. With a budget, they manage reliability explicitly.

Think like a captain. You can tolerate only so much unreliability before the voyage becomes reckless. That tolerated amount is the budget. If the sea stays calm and the budget is healthy, ship features. If the budget burns fast, freeze changes and fix reliability.

See. A team says, "Only a small quality dip happened." But the budget says 80 percent is already gone. Now the decision is clearer. The weather check wins over wishful thinking. Nobody needs a dramatic speech. Simple, no?
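The arithmetic behind a statement like "80 percent is already gone" is small. A sketch, assuming a request-based target; the 99% target and the counts are illustrative numbers, not values from this lesson.

```python
def budget_consumed(slo_target: float, total_events: int, bad_events: int) -> float:
    """Fraction of the error budget already spent in this window."""
    allowed_bad = (1.0 - slo_target) * total_events  # the budget, in events
    if allowed_bad <= 0:
        # A 100% target leaves no budget: any bad event overspends it.
        return float("inf") if bad_events else 0.0
    return bad_events / allowed_bad


# A 99% target over 10,000 requests allows about 100 bad ones; 80 have occurred.
spent = budget_consumed(slo_target=0.99, total_events=10_000, bad_events=80)
print(f"{spent:.0%} of the budget is gone")
```

Eighty bad requests out of ten thousand sounds like a small dip. Against a budget of one hundred, it is most of the allowance.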

Error budgets can track availability, latency, unsafe answers, or runaway cost. Choose budgets that match the user promise. Do not hide behind one convenient metric.
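A sketch of keeping several budgets side by side and letting the worst one speak. The `Budget` class, the budget names, and the counts are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Budget:
    name: str          # which user promise this budget protects
    allowed_bad: int   # bad events tolerated in the window
    observed_bad: int  # bad events seen so far

    @property
    def consumed(self) -> float:
        if self.allowed_bad == 0:
            # Zero tolerance: treat the budget as overspent (a deliberate simplification).
            return float("inf")
        return self.observed_bad / self.allowed_bad


def most_burned(budgets):
    """Return the budget closest to, or furthest past, exhaustion."""
    return max(budgets, key=lambda b: b.consumed)


budgets = [
    Budget("availability", allowed_bad=100, observed_bad=10),
    Budget("unsafe_answers", allowed_bad=5, observed_bad=4),
    Budget("latency_over_target", allowed_bad=200, observed_bad=50),
]
print(most_burned(budgets).name)  # unsafe_answers
```

Availability looks comfortable in this example, but the safety budget is the one that should drive the decision.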

A healthy budget means you can take controlled risk. An exhausted budget calls for discipline: pause risky launches, fix the burn source, and update the ship's log. That is engineering maturity.

4) Running the loop with discipline

A healthy operating loop is simple. Observe. Compare against budget. Decide. Act. Then record what happened. That loop sounds basic, but many teams skip the middle. They observe and then rationalize. That is how incidents repeat.

Look at the loop visually.

┌──────────────┐
│ Observe      │ traces, versions, sources, latency, cost, feedback
└──────┬───────┘
       ▼
┌──────────────┐
│ Compare      │ error budget healthy or burning?
└──────┬───────┘
       ▼
┌──────────────┐
│ Decide       │ ship, slow-roll, freeze, rollback, or fix
└──────┬───────┘
       ▼
┌──────────────┐
│ Record       │ update docs for handoff
└──────────────┘

The compass matters at every step. Signals without thresholds create noise. Thresholds without ownership create blame. Ownership without clear docs creates repeat confusion. Yes?
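The Compare, Decide, and Record steps can be sketched as one small function. The thresholds, the `regression_after_release` flag, and the log format are assumptions for illustration; the sketch covers ship, slow-roll, freeze, and rollback, and leaves "fix" to the humans.

```python
from datetime import datetime, timezone

# Hypothetical thresholds; tune them to your own user promise.
SLOW_ROLL_AT = 0.5   # half the budget spent: reduce rollout speed
FREEZE_AT = 1.0      # budget exhausted: stop risky launches


def decide(consumed: float, regression_after_release: bool) -> str:
    """Map budget state to an action."""
    if regression_after_release:
        return "rollback"
    if consumed >= FREEZE_AT:
        return "freeze"
    if consumed >= SLOW_ROLL_AT:
        return "slow-roll"
    return "ship"


def run_loop_once(consumed, regression_after_release, ships_log, owner):
    """Decide, then record the decision so the next handoff has evidence."""
    decision = decide(consumed, regression_after_release)
    ships_log.append({
        "at": datetime.now(timezone.utc).isoformat(),
        "consumed": consumed,
        "decision": decision,
        "owner": owner,
    })
    return decision
```

Writing the thresholds down before the incident is the point. The rule then decides, not the mood in the room.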

So what to do? Pick a few critical signals first. Attach each signal to a user promise. Set a budget and burn rule. Define who acts when the burn is high. Document rollback paths in the ship's log. Review this weekly, not only during incidents.
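The checklist above can be held as data and checked automatically. This is a sketch; the `SignalPolicy` fields and the validation rules are hypothetical names chosen to match the steps in the paragraph.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SignalPolicy:
    signal: str           # what is measured
    user_promise: str     # the promise it protects
    budget: float         # tolerated bad fraction per window, e.g. 0.01
    burn_alert_at: float  # fraction of budget consumed that triggers action
    owner: str            # who acts first when the burn is high
    rollback_path: str    # where the rollback steps are documented


def validate(policies):
    """Return a list of gaps; an empty list means every signal is actionable."""
    problems = []
    for p in policies:
        if not p.user_promise.strip():
            problems.append(f"{p.signal}: no user promise")
        if not 0 < p.budget < 1:
            problems.append(f"{p.signal}: budget must be between 0 and 1")
        if not p.owner.strip():
            problems.append(f"{p.signal}: no owner")
        if not p.rollback_path.strip():
            problems.append(f"{p.signal}: no rollback path")
    return problems
```

Running a check like this in the weekly review catches a signal that has a dashboard but no owner.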

This is how observability becomes operational discipline. Not more dashboards. Better decisions. The course stays visible because evidence flows into action. The weather check becomes a routine, not a panic ritual.

5) Common traps to avoid

The first trap is vanity instrumentation. Teams log everything and explain nothing. They have giant dashboards but weak decisions. That is noise, not observability.

The second trap is missing lineage. You see a bad answer, but not the model version. You see latency, but not the slow step. You see user complaints, but not the retrieval source. Without lineage, the written record becomes guesswork.
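A lineage check can run at write time so the gap is caught before the incident. A minimal sketch; the required field names are assumptions and should match your own trace schema.

```python
# Hypothetical lineage fields: every stored trace should carry these.
REQUIRED_LINEAGE = (
    "trace_id",
    "model_version",
    "prompt_version",
    "source_ids",
    "step_latencies_ms",
)


def missing_lineage(trace: dict) -> list[str]:
    """Return the lineage fields that are absent or empty in a trace."""
    missing = []
    for key in REQUIRED_LINEAGE:
        value = trace.get(key)
        if value is None or value == "" or value == [] or value == {}:
            missing.append(key)
    return missing
```

A trace that fails this check can still be stored, but counting how often it fails gives an early warning that explanations will be missing later.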

The third trap is budget theatre. Leaders declare a budget but never freeze launches. Then everyone learns the rule is fake. Trust in the process collapses. That is dangerous under pressure.

The fourth trap is measuring only uptime. AI systems can stay up and still be wrong. They can respond fast and still be unsafe. They can answer cheaply and still destroy trust. Quality, latency, and cost must be read together. See. That is why honest risk review matters.
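Reading the dimensions together can be as simple as one report that lists every failing dimension. A sketch with hypothetical metric names and limits.

```python
def failing_dimensions(metrics: dict, limits: dict) -> list[str]:
    """List every dimension outside its limit; uptime is only one of them."""
    failing = []
    if metrics["availability"] < limits["min_availability"]:
        failing.append("availability")
    if metrics["faithfulness"] < limits["min_faithfulness"]:
        failing.append("quality")
    if metrics["unsafe_rate"] > limits["max_unsafe_rate"]:
        failing.append("safety")
    if metrics["p95_latency_ms"] > limits["max_p95_latency_ms"]:
        failing.append("latency")
    if metrics["cost_per_success_usd"] > limits["max_cost_per_success_usd"]:
        failing.append("cost")
    return failing


# Illustrative numbers: the service is up, fast, and cheap, yet often wrong.
limits = {
    "min_availability": 0.999,
    "min_faithfulness": 0.9,
    "max_unsafe_rate": 0.001,
    "max_p95_latency_ms": 3000,
    "max_cost_per_success_usd": 0.05,
}
metrics = {
    "availability": 0.9999,
    "faithfulness": 0.7,
    "unsafe_rate": 0.0,
    "p95_latency_ms": 1200,
    "cost_per_success_usd": 0.02,
}
print(failing_dimensions(metrics, limits))  # ['quality']
```

An uptime-only check would call this system healthy.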

Where this lives in the wild

Loan-assistant API — SRE tracks traces, model versions, and error-budget burn before scaling traffic.
Enterprise search copilot — platform engineer records retrieval sources, latency percentiles, and user feedback.
Sales-call summarizer — ML engineer watches cost per summary, transcript source quality, and rollback signals.
Legal drafting assistant — staff engineer gates launches on safety budget and citation-faithfulness burn.
Voice support bot — product engineer freezes experiments when abandonment and escalation budgets are exhausted.

Pause and recall

Which observability signals help explain a sudden faithfulness drop?
Why is an error budget stronger than a vague reliability goal?
What should happen when the budget is exhausted?
How do shared ownership and written notes make observability actionable?

Interview Q&A

Q1. What observability signals are essential for an AI workflow?
A. Traces, model and prompt versions, retrieval sources, latency, cost, and user feedback are core. Together they explain behavior across the full path.
Common wrong answer to avoid: "Just log request IDs and stack traces."

Q2. What is an error budget in simple terms?
A. It is the amount of unreliability you are willing to tolerate before slowing down delivery. It turns reliability from opinion into an operating rule.
Common wrong answer to avoid: "It is just a dashboard target for the SRE team."

Q3. What should a team do when the budget is burning quickly?
A. Freeze risky launches, investigate the burn source, and fix reliability first. Then record the decision clearly for the next handoff.
Common wrong answer to avoid: "Ship faster so the new version can average things out."

Q4. Why is uptime alone a weak AI health metric?
A. A system can stay available while giving wrong, unsafe, slow, or expensive answers. AI reliability needs richer signals than simple uptime.
Common wrong answer to avoid: "If the service is up, the AI system is healthy."

Apply now (5 min)

Exercise: Choose one AI workflow you know. Write six signals you would log from request to response. Then write one error budget metric that would change shipping behavior. Add one sentence saying who owns the first response.

Sketch from memory: Draw the observability loop with Observe, Compare, Decide, and Record. Label where risk review happens. Add the notes box on the side and connect it back to the operating plan.

Bridge. Even with good monitoring, hidden mess keeps accumulating below the surface. Next, see how AI-specific technical debt grows silently until scale or failure exposes it. → 07-technical-debt-ai.md