Observability, Reliability, and Incidents

The chapters in this module, in reading order.

# Chapter
00 Observability, Reliability, and Incidents — The Five-Year-Old Version
01 Metrics, Logs, and Traces
02 OpenTelemetry Instrumentation
03 SLOs, Error Budgets, and Alerting
04 Dashboards and Queries
05 Distributed Tracing
06 Incident Response
07 Safe Rollbacks and Kill Switches
08 Chaos Engineering
09 Deployment Strategies
10 Disaster Recovery
11 Honest Admission