00. Observability, Reliability, and Incidents — The Five-Year-Old Version
You can't fix what you can't see. Observability is giving your system X-ray vision.
Imagine you're a doctor in a hospital. Patients come in sick. You can't just look at them and guess. You need instruments.
A thermometer tells you the temperature — one number that says "something is wrong" but not what. That is a metric. CPU at 95%. Latency at 500 ms. Error rate at 5%. The thermometer tells you there's a fever.
An X-ray shows the inside. You see which bone is broken, which organ is swollen. That is a trace — the full path of a request through your system, showing where it slowed down or failed. The X-ray tells you WHERE the problem is.
A medical chart records everything: when symptoms started, what treatment was given, how the patient responded. That is a log — timestamped events that tell the full story. The medical chart tells you WHAT happened and WHEN.
When a patient is critical, you set a monitor alarm — if heart rate drops below 60, beep. If blood pressure exceeds 180, beep. That is an alert tied to an SLO. The monitor alarm wakes you at 3 AM when something is dying.
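The monitor alarm can be sketched as a plain threshold check. This is a minimal illustration rather than a real alerting system: the `ThresholdAlert` class, its fields, and the metric names are hypothetical, and the thresholds are the ones from the hospital example above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ThresholdAlert:
    """Fires when a value leaves the allowed range (hypothetical helper)."""
    name: str
    low: Optional[float] = None   # fire if the value drops below this
    high: Optional[float] = None  # fire if the value exceeds this

    def fires(self, value: float) -> bool:
        if self.low is not None and value < self.low:
            return True
        if self.high is not None and value > self.high:
            return True
        return False


# The two alarms from the analogy.
heart_rate = ThresholdAlert("heart_rate_bpm", low=60)
blood_pressure = ThresholdAlert("systolic_mmhg", high=180)

for alert, reading in [(heart_rate, 55), (blood_pressure, 150)]:
    status = "BEEP" if alert.fires(reading) else "ok"
    print(f"{alert.name}={reading}: {status}")
```

In a software system the same shape applies to numbers such as error rate or latency; the hard part is choosing thresholds that reflect what users actually experience.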
Finally, every hospital has a playbook for emergencies. "If cardiac arrest: do CPR, call code blue, administer epinephrine." That is a runbook. You don't think during a crisis. You follow the playbook. Incident response is practiced, not improvised.
Observability is equipping your system with thermometers, X-rays, medical charts, monitor alarms, and playbooks. Reliability is making sure the patient (your system) stays healthy. Incident response is what you do when it doesn't.
Why does this matter for AI systems specifically? Because ML models fail silently. A web server crashes — you get a 500 error. A model starts giving bad predictions — no error, just quietly wrong answers. You need the thermometer to track prediction quality, not just CPU usage.
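One way to put a thermometer on prediction quality is to track accuracy over a recent window of predictions whose true labels arrive later. The sketch below assumes such delayed labels exist; `RollingAccuracy`, the window size, and the 0.9 floor are hypothetical choices for illustration.

```python
from collections import deque
from typing import Optional


class RollingAccuracy:
    """Tracks accuracy over the most recent `window` labeled predictions."""

    def __init__(self, window: int = 100, min_accuracy: float = 0.9):
        # deque with maxlen drops the oldest outcome automatically.
        self.outcomes = deque(maxlen=window)
        self.min_accuracy = min_accuracy

    def record(self, predicted, actual) -> None:
        self.outcomes.append(predicted == actual)

    def accuracy(self) -> Optional[float]:
        if not self.outcomes:
            return None  # no labeled data yet
        return sum(self.outcomes) / len(self.outcomes)

    def degraded(self) -> bool:
        acc = self.accuracy()
        return acc is not None and acc < self.min_accuracy


monitor = RollingAccuracy(window=4, min_accuracy=0.75)
for predicted, actual in [("cat", "cat"), ("cat", "dog"), ("dog", "cat"), ("dog", "dog")]:
    monitor.record(predicted, actual)

# No exception was raised anywhere, yet the model is wrong half the time.
print(monitor.accuracy(), monitor.degraded())
```

The model never raised an error here; only the quality metric reveals the problem.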
Also, AI inference is expensive. A model serving 1000 requests per second on GPUs costs real money. If latency creeps up, each request holds a GPU for longer, so the same hardware serves fewer requests and the cost per request rises. The X-ray shows you where the time goes: is it the network? The model? The pre-processing?
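To make "where does the time go" concrete, here is a toy stand-in for a trace: a timer that records how long each stage of a request takes. It is a sketch only. `StageTimer` and the stage names are hypothetical, and the `time.sleep` calls stand in for real work; a production system would use a tracing library instead.

```python
import time
from contextlib import contextmanager


class StageTimer:
    """Accumulates wall-clock time per named stage of a request."""

    def __init__(self):
        self.durations = {}

    @contextmanager
    def span(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            elapsed = time.perf_counter() - start
            self.durations[name] = self.durations.get(name, 0.0) + elapsed

    def slowest(self):
        return max(self.durations, key=self.durations.get)


timer = StageTimer()
with timer.span("preprocess"):
    time.sleep(0.01)   # stand-in for tokenizing or resizing input
with timer.span("model"):
    time.sleep(0.05)   # stand-in for the GPU forward pass
with timer.span("postprocess"):
    time.sleep(0.005)  # stand-in for formatting the response

for stage, seconds in timer.durations.items():
    print(f"{stage}: {seconds * 1000:.1f} ms")
print("slowest stage:", timer.slowest())
```

A real distributed trace adds what this toy lacks: a shared request ID and parent-child links, so that spans recorded in different services can be stitched into one picture.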
Site reliability engineering (SRE) gives you a framework: define what "healthy" means (SLOs), measure it (SLIs), set a budget for how much unreliability you will tolerate (error budget), and when the budget runs low, stop shipping features and focus on reliability. Simple math, hard discipline.
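The "simple math" can be sketched in a few lines. The function names, the 30-day window, and the request counts below are assumptions for illustration; the arithmetic itself is just one minus the SLO target, applied to either time or requests.

```python
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Downtime allowed by an availability SLO over the window, in minutes."""
    total_minutes = window_days * 24 * 60
    return (1.0 - slo_target) * total_minutes


def budget_remaining(slo_target: float, total_requests: int, failed_requests: int) -> float:
    """Fraction of the error budget still unspent; negative means overspent."""
    allowed_failures = (1.0 - slo_target) * total_requests
    if allowed_failures == 0:
        # A 100% target leaves no budget at all.
        return 0.0 if failed_requests == 0 else float("-inf")
    return 1.0 - failed_requests / allowed_failures


# A 99.9% SLO over 30 days allows about 43.2 minutes of downtime.
print(round(error_budget_minutes(0.999), 1))

# 250 failures out of 1,000,000 requests spends a quarter of the budget.
print(round(budget_remaining(0.999, 1_000_000, 250), 2))
```

When `budget_remaining` approaches zero, the framework says to pause feature work; that decision is the "hard discipline" part.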
The hospital analogy holds all the way through. Prevention (chaos engineering) is better than cure (incident response). Regular checkups (monitor alarms firing before users notice) catch problems early. And postmortems — reviewing what went wrong without blaming — make the whole system healthier over time.
The placeholders you will see called back
| Placeholder | Meaning |
|---|---|
| thermometer | metrics — numeric measurements (latency, error rate, CPU, QPS) |
| X-ray | traces — the full path of a request through multiple services |
| medical chart | logs — timestamped text records of events and errors |
| monitor alarm | alerts and SLOs — thresholds that trigger notifications |
| playbook | runbooks and incident response — predefined steps for known failures |
Top resources
- Google SRE Book — the foundational text on reliability engineering; free online
- Observability Engineering by Charity Majors, Liz Fong-Jones, and George Miranda — modern observability beyond the three pillars
- OpenTelemetry Documentation — the emerging standard for traces, metrics, and logs collection
- PagerDuty Incident Response Guide — free; practical incident management procedures
- Datadog Learning Center — hands-on observability with real dashboards and alerting
What's coming
- 01-metrics-logs-traces.md — the three pillars and how they complement each other
- 02-opentelemetry-instrumentation.md — standardized collection across languages and services
- 03-slos-error-budgets-alerting.md — defining "good enough" and alerting when you're not
- 04-dashboards-and-queries.md — building dashboards that answer real questions
- 05-distributed-tracing.md — following a request across 20 services
- 06-incident-response.md — detection, triage, mitigation, resolution, and postmortem
- 07-safe-rollbacks-kill-switches.md — feature flags, rollback procedures, and circuit breakers
- 08-chaos-engineering.md — breaking things on purpose to find weaknesses
- 09-deployment-strategies.md — canary, blue-green, rolling, and shadow deployments
- 10-disaster-recovery.md — multi-region, RTO/RPO, failover, and backup strategies
- 11-honest-admission.md — what we don't fully understand about reliability
Bridge. The doctor's toolkit starts with three instruments. Metrics, logs, and traces — each reveals different things about system health. → 01-metrics-logs-traces.md