03. SLOs, Error Budgets, and Alerting

⏱️ Estimated time: 22 min | Level: advanced

ELI5 callback: In the hospital analogy, the monitor alarm should wake people for patient danger, the thermometer should show trend, and the playbook should guide response.

1) SLI, SLO, and SLA are not the same thing

Teams mix these terms and then build messy alerts. A thermometer without a target is just trivia.

An SLI is the measured indicator.

An SLO is the target for that indicator.

An SLA is the customer-facing commitment with consequences.

See. Measure, target, promise. Three layers.

Your SLI might be the fraction of requests that succeed within a latency threshold.

Your SLO might be 99.9 percent of requests meeting that bar over a rolling 28 days.

Your SLA is usually looser than the SLO, because legal promises need a safety margin.

┌────────────┐
│ SLI        │ what you measure
├────────────┤
│ SLO        │ target you operate to
├────────────┤
│ SLA        │ external commitment
└────────────┘

Use the X-ray to validate whether slow paths match the SLI.

- Choose indicators from the user perspective first.
- Avoid vanity measures that look busy but miss pain.
- Keep the math explainable to product and leadership.
- Separate internal goals from contractual promises.
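
To make the three layers concrete, here is a minimal Python sketch of a request-based SLI checked against an SLO and a looser SLA. The `Request` type, the 300 ms threshold, the rule that only 5xx responses count as failures, and the 99.5 percent SLA are all assumptions chosen for illustration, not values from this lesson.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    status: int        # HTTP status code returned to the user
    latency_ms: float  # end-to-end latency seen by the user


def is_good(req: Request, latency_threshold_ms: float) -> bool:
    """A request is 'good' if it did not fail server-side and was fast enough.

    Assumption: only 5xx responses count as failures.
    """
    return req.status < 500 and req.latency_ms <= latency_threshold_ms


def compute_sli(requests: list[Request], latency_threshold_ms: float) -> float:
    """SLI = good requests / total requests, as a fraction in [0, 1]."""
    if not requests:
        return 1.0  # assumption: no traffic counts as meeting the target
    good = sum(is_good(r, latency_threshold_ms) for r in requests)
    return good / len(requests)


SLO_TARGET = 0.999  # internal operating target
SLA_TARGET = 0.995  # hypothetical looser external commitment

# 997 fast successes, 2 server errors, 1 success that was too slow.
requests = (
    [Request(200, 120.0)] * 997
    + [Request(500, 80.0)] * 2
    + [Request(200, 900.0)]
)
sli = compute_sli(requests, latency_threshold_ms=300.0)
print(f"SLI={sli:.4f} meets SLO={sli >= SLO_TARGET} meets SLA={sli >= SLA_TARGET}")
```

In this sample the slow-but-successful request counts against the SLI, so the service misses its internal SLO while still sitting inside the looser SLA.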

2) Good SLIs map to user happiness

A service can have green CPU and still fail users badly.

That is why SLIs should reflect successful work.

Availability alone is often too weak.

Latency, freshness, correctness, and durability may matter more.

So what to do?

Pick the narrow slice that users truly feel. Another X-ray then helps explain which dependency burns the budget.

For a search API, success plus p95 latency may be enough.

For payments, correctness and idempotency can be mission critical.

- Use request-based indicators for online user paths.
- Use window-based indicators carefully for intermittent workloads.
- Segment critical routes instead of mixing cheap and expensive endpoints.
- Review whether retries hide user-visible failures.
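
One way to check the last two points is to compute the SLI per route, once on the first attempt and once after retries. The sketch below assumes a hypothetical record shape of `(route, request_id, attempt, ok)`; real telemetry will look different.

```python
from collections import defaultdict


def per_route_sli(attempts):
    """Return {route: (first_try_rate, after_retry_rate)}.

    attempts: iterable of (route, request_id, attempt_number, ok) tuples.
    """
    first_ok = defaultdict(dict)  # route -> {request_id: first attempt succeeded}
    final_ok = defaultdict(dict)  # route -> {request_id: any attempt succeeded}
    for route, request_id, attempt, ok in attempts:
        if attempt == 1:
            first_ok[route][request_id] = ok
        previous = final_ok[route].get(request_id, False)
        final_ok[route][request_id] = previous or ok

    result = {}
    for route, outcomes in final_ok.items():
        total = len(outcomes)
        first = sum(first_ok[route].get(rid, False) for rid in outcomes)
        final = sum(outcomes.values())
        result[route] = (first / total, final / total)
    return result


attempts = [
    ("/checkout", "a", 1, False), ("/checkout", "a", 2, True),
    ("/checkout", "b", 1, True),
    ("/search", "c", 1, True), ("/search", "d", 1, True),
]
slis = per_route_sli(attempts)
for route, (first_try, after_retry) in sorted(slis.items()):
    print(f"{route}: first try {first_try:.0%}, after retries {after_retry:.0%}")
```

Here `/checkout` looks perfect after retries, yet half of its requests failed on the first try. A blended number across both routes would hide that gap.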

3) Error budgets turn reliability into a pacing tool

The error budget is the unreliability that the SLO target allows.

If your SLO is 99.9 percent, your budget is 0.1 percent: roughly 1 failed request in every 1,000, or about 40 minutes of total outage across 28 days.

That small number creates a useful management lever. The medical chart confirms whether retries hid real failures.

Spend budget when shipping riskier change.

Slow down when the budget burns too fast.

Simple, no? Reliability becomes a rate of spend.

This avoids religious fights between feature speed and platform caution.

Everyone can look at the same remaining budget.
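
The "remaining budget" everyone looks at is plain arithmetic. A minimal sketch, assuming a 99.9 percent SLO over 28 days and made-up traffic numbers:

```python
WINDOW_DAYS = 28
SLO_TARGET = 0.999


def allowed_bad_events(slo_target: float, total_events: int) -> float:
    """How many bad events the SLO tolerates for this much traffic."""
    return (1.0 - slo_target) * total_events


def budget_remaining(slo_target: float, total_events: int, bad_events: int) -> float:
    """Fraction of the error budget still unspent.

    A negative value means the SLO is already missed for the window.
    """
    allowed = allowed_bad_events(slo_target, total_events)
    if allowed == 0:
        return 0.0  # no traffic or a 100% target leaves nothing to spend
    return 1.0 - bad_events / allowed


# Time view: how much total outage fits inside the window.
minutes_in_window = WINDOW_DAYS * 24 * 60
downtime_budget_min = (1.0 - SLO_TARGET) * minutes_in_window

# Request view: hypothetical traffic of 2,000,000 requests with 500 failures.
remaining = budget_remaining(SLO_TARGET, total_events=2_000_000, bad_events=500)

print(f"Outage budget: {downtime_budget_min:.2f} minutes per {WINDOW_DAYS} days")
print(f"Budget remaining: {remaining:.0%}")
```

With these numbers the window tolerates about 40.32 minutes of full outage, and 500 failures out of an allowed 2,000 leave three quarters of the budget unspent.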

- Use rolling windows so the budget reflects recent reality.
- Burn-rate alerts detect budget loss before the window ends.
- Tie release gates to budget health for critical systems.
- Budget policies should be written before the incident, not during it.
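
Burn rate and a release gate can be sketched in a few lines. Burn rate here means the observed error rate divided by the error rate the SLO allows, so a value of 1 spends the budget exactly over the window. The gate thresholds (`min_remaining`, `max_rate`) are hypothetical policy values, not recommendations.

```python
def burn_rate(error_rate: float, slo_target: float) -> float:
    """How many times faster than 'sustainable' the budget is being spent."""
    return error_rate / (1.0 - slo_target)


def days_until_exhausted(rate: float, window_days: float = 28.0,
                         remaining_fraction: float = 1.0) -> float:
    """Days until the budget hits zero if the current burn rate holds."""
    if rate <= 0:
        return float("inf")
    return window_days * remaining_fraction / rate


def release_allowed(remaining_fraction: float, rate: float,
                    min_remaining: float = 0.25, max_rate: float = 1.0) -> bool:
    """Hypothetical gate: ship risky change only with budget left and a calm burn."""
    return remaining_fraction >= min_remaining and rate <= max_rate


rate = burn_rate(error_rate=0.005, slo_target=0.999)  # 0.5% errors vs 0.1% allowed
print(f"burn rate: {rate:.1f}x")
print(f"days until exhausted from a full budget: {days_until_exhausted(rate):.1f}")
print(f"release allowed: {release_allowed(remaining_fraction=0.6, rate=rate)}")
```

At 0.5 percent errors against a 99.9 percent SLO the burn rate is about 5, so a full 28-day budget would be gone in under six days. Writing the gate as code before an incident is one way to keep the policy from being renegotiated during it.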

Another medical chart query separates noisy errors from harmful ones.

4) Alerting should interrupt only for action

An alert without action is just noise with a badge.

Page only when a human can reduce user harm now.

Ticket or email the rest.

This one rule cuts noise more than fancy tooling.

Now watch. Threshold alerts alone often page too late or too often.

Burn-rate alerts tie pages to customer impact velocity.

Symptom alerts usually beat cause alerts for first detection.

But cause alerts still help route the incident to the right owner faster. A monitor alarm should fire on burn rate, not vanity metrics.

- Page on user-impacting symptoms, not every host wobble.
- Add runbook links to the alert payload itself.
- De-duplicate related alerts so one incident does not page twenty people.
- Tune by reviewing false positives and false negatives every week.
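
A multi-window burn-rate check is one way to page on impact velocity and ticket the rest. In the sketch below, the window pairs and thresholds (14.4, 6, 1) are commonly cited starting points treated here as assumptions, and the runbook URL, service name, and payload fields are hypothetical. Each rule needs both its long and short window above the threshold, so a burst that has already stopped does not page anyone.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class BurnRule:
    long_window: str
    short_window: str
    threshold: float
    action: str  # "page" interrupts a human now, "ticket" waits for work hours


# Assumed thresholds; tune them against your own SLO window and traffic.
RULES = [
    BurnRule("1h", "5m", 14.4, "page"),
    BurnRule("6h", "30m", 6.0, "page"),
    BurnRule("3d", "6h", 1.0, "ticket"),
]


def evaluate(service: str, burn_by_window: dict, runbook_url: str) -> Optional[dict]:
    """Return an alert payload for the first matching rule, or None."""
    for rule in RULES:
        long_burn = burn_by_window.get(rule.long_window, 0.0)
        short_burn = burn_by_window.get(rule.short_window, 0.0)
        if long_burn >= rule.threshold and short_burn >= rule.threshold:
            return {
                "service": service,
                "action": rule.action,
                "summary": f"burn rate {long_burn:.1f}x over {rule.long_window}",
                "runbook": runbook_url,  # link travels with the alert itself
                "dedup_key": f"{service}:slo-burn",  # one incident, one page
            }
    return None


RUNBOOK = "https://runbooks.example.com/checkout-slo-burn"  # hypothetical URL
fast = evaluate("checkout", {"1h": 20.0, "5m": 16.0}, RUNBOOK)
stopped = evaluate("checkout", {"1h": 20.0, "5m": 0.5}, RUNBOOK)
slow = evaluate("checkout", {"3d": 1.2, "6h": 1.5, "30m": 2.0}, RUNBOOK)
print(fast["action"], stopped, slow["action"])
```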

5) Fighting alert fatigue takes deliberate design

Alert fatigue is not a personal weakness.

It is a system design failure.

Too many pages train people to distrust the channel.

Too few pages train teams to miss early damage.

See. The sweet spot needs iteration.

Group alerts by service, symptom, severity, and owner.

Suppress downstream duplicates during known primary failures.

Retire alerts that never led to useful action.
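
Grouping and suppression can be sketched as two small functions. The alert dictionaries and the `depends_on` map below are hypothetical shapes; most alerting tools offer their own grouping and inhibition settings, which this does not try to reproduce.

```python
from collections import defaultdict


def group_alerts(alerts):
    """Collapse alerts sharing service, symptom, severity, and owner."""
    groups = defaultdict(list)
    for alert in alerts:
        key = (alert["service"], alert["symptom"], alert["severity"], alert["owner"])
        groups[key].append(alert)
    return dict(groups)


def suppress_downstream(alerts, depends_on):
    """Drop alerts for services whose direct dependency is already alerting."""
    firing = {alert["service"] for alert in alerts}
    kept = []
    for alert in alerts:
        upstream = depends_on.get(alert["service"], set())
        if upstream & firing:
            continue  # a primary failure upstream already explains this one
        kept.append(alert)
    return kept


def make_alert(service, owner):
    return {"service": service, "symptom": "errors", "severity": "page", "owner": owner}


alerts = [
    make_alert("db", "storage"),
    make_alert("db", "storage"),  # same incident seen from a second host
    make_alert("api", "backend"),
    make_alert("web", "frontend"),
]
depends_on = {"api": {"db"}, "web": {"api"}}  # hypothetical dependency map

primary = suppress_downstream(alerts, depends_on)
incidents = group_alerts(primary)
print(f"{len(alerts)} raw alerts -> {len(incidents)} incident(s)")
```

Four raw alerts shrink to one incident owned by the storage team, which is the only group that can act on the primary failure.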

- Keep a weekly alert review with examples, not opinions.
- Track page volume, acknowledgement delay, and actionability.
- Re-check SLOs after major architecture or traffic changes.
- Teach product partners what budget burn means for release pace.

The playbook should define pager steps once the budget is burning.
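
The weekly review numbers can come from a short script over the page log. This is a sketch under assumptions: the page records, alert names, and field names are invented for the example.

```python
import statistics
from collections import defaultdict


def review_stats(pages):
    """Summarize a week of pages: volume, ack delay, and actionability."""
    if not pages:
        return {"page_volume": 0, "actionable_ratio": None,
                "median_ack_delay_s": None, "retire_candidates": []}

    actionable_by_alert = defaultdict(int)
    for page in pages:
        actionable_by_alert[page["alert"]] += 1 if page["actionable"] else 0

    actionable = sum(1 for page in pages if page["actionable"])
    return {
        "page_volume": len(pages),
        "actionable_ratio": actionable / len(pages),
        "median_ack_delay_s": statistics.median(p["ack_delay_s"] for p in pages),
        # Alerts that paged but never led to useful action this week.
        "retire_candidates": sorted(
            name for name, count in actionable_by_alert.items() if count == 0
        ),
    }


pages = [
    {"alert": "checkout-burn", "ack_delay_s": 30, "actionable": True},
    {"alert": "checkout-burn", "ack_delay_s": 45, "actionable": True},
    {"alert": "host-cpu-high", "ack_delay_s": 600, "actionable": False},
    {"alert": "host-cpu-high", "ack_delay_s": 1200, "actionable": False},
]
stats = review_stats(pages)
print(stats)
```

In this sample only half the pages were actionable and `host-cpu-high` never led to action, so it is the first candidate to downgrade to a ticket or retire.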

Where this lives in the wild

- Consumer apps define separate SLOs for login, checkout, and feed freshness.
- Payment platforms gate risky releases when error budget burn accelerates.
- Infrastructure teams use multi-window burn-rate alerts for core APIs.
- SaaS admin products keep email alerts for low urgency and pages for user-visible breakage.
- Platform leaders use SLO reviews to balance roadmap speed with reliability work.

Pause and recall

- What is the difference between an SLI, an SLO, and an SLA?
- Why should an alert page only when action is possible now?
- How does an error budget help settle release-versus-reliability debates?
- Why can availability alone be a weak reliability indicator?

Interview Q&A

Q: Why are SLOs better operational targets than raw uptime goals?
A: They tie reliability to specific user journeys and measurable thresholds, so teams know what good service actually means.
Common wrong answer to avoid: "Because uptime is outdated" - uptime is still useful, but it is often too coarse for user experience.

Q: Why use error budgets in release decisions?
A: They quantify how much unreliability has already been spent, turning vague risk arguments into visible pacing rules.
Common wrong answer to avoid: "Because budgets punish developers" - the goal is shared trade-off clarity, not blame.

Q: Why do burn-rate alerts often beat static thresholds?
A: They measure how fast the budget is being consumed, so they react to meaningful customer harm earlier and more proportionally.
Common wrong answer to avoid: "Because thresholds are mathematically wrong" - thresholds can still help; burn rate is usually better for paging impact.

Q: How do you reduce alert fatigue without hiding real issues?
A: Page on actionable symptoms, de-duplicate noisy chains, and review which alerts actually drove useful intervention.
Common wrong answer to avoid: "Just raise every threshold" - that cuts noise, but it can also hide genuine damage.

Apply now (5 min)

Choose one customer-facing endpoint. Write one SLI, one 28-day SLO target, and one page-worthy burn-rate condition. Then write one non-page alert that still matters for backlog or capacity. If the wording feels vague, refine the user impact first.

Bridge. SLOs defined. But how do we visualize all this data? → 04