00. Observability, Reliability, and Incidents — The Five-Year-Old Version

You can't fix what you can't see. Observability is giving your system X-ray vision.

Imagine you're a doctor in a hospital. Patients come in sick. You can't just look at them and guess. You need instruments.

A thermometer tells you the temperature — one number that says "something is wrong" but not what. That is a metric. CPU at 95%. Latency at 500 ms. Error rate at 5%. The thermometer tells you there's a fever.

An X-ray shows the inside. You see which bone is broken, which organ is swollen. That is a trace — the full path of a request through your system, showing where it slowed down or failed. The X-ray tells you WHERE the problem is.

A medical chart records everything: when symptoms started, what treatment was given, how the patient responded. That is a log — timestamped events that tell the full story. The medical chart tells you WHAT happened and WHEN.
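To make the three instruments concrete, here is a minimal sketch that takes one request and produces a thermometer reading (metric), an X-ray (trace spans), and medical chart entries (logs). It uses only the Python standard library. The in-memory stores and the names `record_metric`, `log_event`, `run_span`, and `handle_request` are hypothetical and exist only for illustration; a real system would send this data to an observability backend.

```python
import time
import uuid
from collections import defaultdict

# Hypothetical in-memory stores; a real system would export to a backend.
metrics = defaultdict(list)  # metric name -> list of numeric samples
logs = []                    # timestamped event records
spans = []                   # timed steps of a request, grouped by trace_id


def record_metric(name, value):
    """Thermometer: store one numeric measurement."""
    metrics[name].append(value)


def log_event(trace_id, message):
    """Medical chart: store a timestamped event."""
    logs.append({"ts": time.time(), "trace_id": trace_id, "message": message})


def run_span(trace_id, name, func):
    """X-ray: time one step of the request and record it as a span."""
    start = time.perf_counter()
    try:
        return func()
    finally:
        duration_ms = (time.perf_counter() - start) * 1000
        spans.append({"trace_id": trace_id, "name": name, "duration_ms": duration_ms})


def handle_request(payload):
    trace_id = uuid.uuid4().hex  # ties logs and spans to the same request
    log_event(trace_id, "request received")
    start = time.perf_counter()
    cleaned = run_span(trace_id, "preprocess", lambda: payload.strip().lower())
    result = run_span(trace_id, "model", lambda: len(cleaned))  # stand-in for inference
    record_metric("latency_ms", (time.perf_counter() - start) * 1000)
    log_event(trace_id, f"request finished, result={result}")
    return trace_id, result


trace_id, result = handle_request("  Hello  ")
```

The shared `trace_id` is the design choice worth noticing: it lets you jump from a slow span to the log entries for the same request.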

When a patient is critical, you set a monitor alarm — if heart rate drops below 60, beep. If blood pressure exceeds 180, beep. That is an alert tied to an SLO. The monitor alarm wakes you at 3 AM when something is dying.
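A monitor alarm is, at its simplest, a threshold check on a measurement. The sketch below shows that idea in plain Python. The `AlertRule` class, the rule names, and the thresholds (5% error rate, 500 ms latency) are assumptions chosen to echo the thermometer examples above; they are not recommended values, and production alerting tools add features such as durations and deduplication that are left out here.

```python
from dataclasses import dataclass


@dataclass
class AlertRule:
    name: str
    threshold: float
    direction: str  # "above" or "below"

    def fires(self, value: float) -> bool:
        """Return True when the value crosses the threshold."""
        if self.direction == "above":
            return value > self.threshold
        return value < self.threshold


# Hypothetical rules; the thresholds are illustrative only.
rules = {
    "error_rate": AlertRule("error rate too high", 0.05, "above"),
    "p99_latency_ms": AlertRule("p99 latency too high", 500.0, "above"),
}


def evaluate(samples: dict) -> list:
    """Return the names of all rules that fire for the given measurements."""
    return [
        rule.name
        for key, rule in rules.items()
        if key in samples and rule.fires(samples[key])
    ]


fired = evaluate({"error_rate": 0.08, "p99_latency_ms": 320.0})
```

Here only the error-rate rule fires, because 8% is above the 5% threshold while 320 ms is under the latency limit.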

Finally, every hospital has a playbook for emergencies. "If cardiac arrest: do CPR, call code blue, administer epinephrine." That is a runbook. You don't think during a crisis. You follow the playbook. Incident response is practiced, not improvised.

Observability is equipping your system with thermometers, X-rays, medical charts, monitor alarms, and playbooks. Reliability is making sure the patient (your system) stays healthy. Incident response is what you do when it doesn't.

Why does this matter for AI systems specifically? Because ML models fail silently. A web server crashes — you get a 500 error. A model starts giving bad predictions — no error, just quietly wrong answers. You need the thermometer to track prediction quality, not just CPU usage.
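One possible thermometer for prediction quality is a rolling accuracy over the most recent predictions that have received a ground-truth label. The sketch below assumes such labels eventually arrive, which is not true for every system. The `RollingAccuracy` class, the window size, and the 0.75 floor are hypothetical choices for illustration.

```python
from collections import deque


class RollingAccuracy:
    """Track accuracy over the most recent `window` labelled predictions."""

    def __init__(self, window: int, min_accuracy: float):
        self.outcomes = deque(maxlen=window)  # old outcomes drop off automatically
        self.min_accuracy = min_accuracy

    def record(self, predicted, actual) -> None:
        self.outcomes.append(predicted == actual)

    def accuracy(self) -> float:
        if not self.outcomes:
            return 1.0  # no evidence yet; treated as healthy in this sketch
        return sum(self.outcomes) / len(self.outcomes)

    def degraded(self) -> bool:
        """True when recent accuracy falls below the configured floor."""
        return self.accuracy() < self.min_accuracy


monitor = RollingAccuracy(window=4, min_accuracy=0.75)
for predicted, actual in [("cat", "cat"), ("dog", "dog"), ("cat", "dog"), ("dog", "cat")]:
    monitor.record(predicted, actual)
```

Nothing in this loop raises an error, yet `monitor.degraded()` is true after the last two wrong answers. That is the silent failure the paragraph above describes, made visible by a metric.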

Also, AI inference is expensive. A model serving 1000 requests per second on GPUs costs real money. If latency creeps up, you might be wasting GPU cycles. The X-ray shows you where time goes — is it the network? The model? The pre-processing?

Site reliability engineering (SRE) gives you a framework: pick a measurement that reflects health (the SLI), set a target for it (the SLO), and treat the gap between that target and 100% as your allowance for failure (the error budget). When the budget runs low, stop shipping features and focus on reliability. Simple math, hard discipline.
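The "simple math" can be sketched in a few lines. This example assumes a request-based SLO, where the SLI is the fraction of successful requests in a window. The function name `error_budget_report` and the `FREEZE_AT` policy (freeze features once 80% of the budget is spent) are hypothetical; teams choose their own policy.

```python
FREEZE_AT = 0.8  # hypothetical policy: freeze features at 80% budget spent


def error_budget_report(slo_target, total_requests, failed_requests):
    """Summarise SLI, allowed failures, and budget consumption for one window."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    sli = 1 - failed_requests / total_requests            # measured success ratio
    allowed_failures = (1 - slo_target) * total_requests  # the error budget
    if allowed_failures > 0:
        budget_used = failed_requests / allowed_failures
    else:
        budget_used = float("inf")  # a 100% target leaves no budget at all
    return {
        "sli": sli,
        "slo_met": sli >= slo_target,
        "allowed_failures": allowed_failures,
        "budget_used": budget_used,
        "freeze_features": budget_used >= FREEZE_AT,
    }


report = error_budget_report(
    slo_target=0.999, total_requests=1_000_000, failed_requests=400
)
```

With a 99.9% target over one million requests, about 1,000 failures are allowed. The 400 observed failures spend roughly 40% of that budget, so shipping can continue under this policy.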

The hospital analogy holds all the way through. Prevention (chaos engineering) is better than cure (incident response). Regular checkups (monitor alarms firing before users notice) catch problems early. And postmortems — reviewing what went wrong without blaming — make the whole system healthier over time.

The placeholders you will see called back

Placeholder	Meaning
thermometer	metrics — numeric measurements (latency, error rate, CPU, QPS)
X-ray	traces — the full path of a request through multiple services
medical chart	logs — timestamped text records of events and errors
monitor alarm	alerts and SLOs — thresholds that trigger notifications
playbook	runbooks and incident response — predefined steps for known failures

Top resources

Google SRE Book — the foundational text on reliability engineering; free online
Observability Engineering by Charity Majors, Liz Fong-Jones, and George Miranda — modern observability beyond the three pillars
OpenTelemetry Documentation — the emerging standard for collecting traces, metrics, and logs
PagerDuty Incident Response Guide — free; practical incident management procedures
Datadog Learning Center — hands-on observability with real dashboards and alerting

What's coming

01-metrics-logs-traces.md — the three pillars and how they complement each other
02-opentelemetry-instrumentation.md — standardized collection across languages and services
03-slos-error-budgets-alerting.md — defining "good enough" and alerting when you're not
04-dashboards-and-queries.md — building dashboards that answer real questions
05-distributed-tracing.md — following a request across 20 services
06-incident-response.md — detection, triage, mitigation, resolution, and postmortem
07-safe-rollbacks-kill-switches.md — feature flags, rollback procedures, and circuit breakers
08-chaos-engineering.md — breaking things on purpose to find weaknesses
09-deployment-strategies.md — canary, blue-green, rolling, and shadow deployments
10-disaster-recovery.md — multi-region, RTO/RPO, failover, and backup strategies
11-honest-admission.md — what we don't fully understand about reliability

Bridge. The doctor's toolkit starts with three instruments. Metrics, logs, and traces — each reveals different things about system health. → 01-metrics-logs-traces.md