10. Disaster Recovery

⏱️ Estimated time: 23 min | Level: advanced

ELI5 callback: In the hospital analogy, the playbook must cover the whole hospital outage, the monitor alarm must tell you recovery status, and the medical chart must prove what data was preserved.

1) RTO and RPO define the business promise

Disaster recovery starts with two hard questions.

How long can we be down?

How much data can we afford to lose?

RTO (Recovery Time Objective) answers the first.

RPO (Recovery Point Objective) answers the second.

The thermometer must show health for both primary and backup paths.

See. These are business commitments disguised as engineering numbers.

A premium payment system may tolerate almost no downtime or data loss.

A reporting dashboard may accept much more loss and delay.

┌──────────────┬──────────────────────────────────────┐
│ RTO          │ maximum acceptable recovery time     │
├──────────────┼──────────────────────────────────────┤
│ RPO          │ maximum acceptable data loss window  │
└──────────────┴──────────────────────────────────────┘

Another thermometer should track replication lag against recovery promises.

- Set RTO and RPO per critical workflow, not only per company.
- Validate the numbers with product and business owners.
- Different tiers of service can justify different DR targets.
- Cost rises sharply as tolerance approaches zero.
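
One way to make per-workflow targets concrete is to keep them as data and compare live replication lag against the RPO budget. The sketch below is a minimal illustration in Python. The `DRTarget` class, the workflow names, the target values, and the 50% warning threshold are all assumptions made for the example, not recommendations.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class DRTarget:
    """Recovery promise for one workflow (all values here are illustrative)."""
    workflow: str
    rto: timedelta  # maximum acceptable recovery time
    rpo: timedelta  # maximum acceptable data loss window


# Hypothetical per-workflow targets; real values come from business owners.
TARGETS = {
    "payments": DRTarget("payments", rto=timedelta(minutes=5), rpo=timedelta(seconds=30)),
    "reporting": DRTarget("reporting", rto=timedelta(hours=4), rpo=timedelta(hours=1)),
}


def rpo_headroom(target: DRTarget, replication_lag: timedelta) -> float:
    """Fraction of the RPO budget still unused; negative means the RPO is breached."""
    return 1.0 - replication_lag / target.rpo


def lag_status(target: DRTarget, replication_lag: timedelta, warn_below: float = 0.5) -> str:
    """Classify current replication lag as 'ok', 'warn', or 'breach'."""
    headroom = rpo_headroom(target, replication_lag)
    if headroom < 0:
        return "breach"
    if headroom < warn_below:
        return "warn"
    return "ok"
```

With these sample numbers, 20 seconds of lag on `payments` already reports `warn`, while the same lag on `reporting` is comfortably `ok`. That is the point of setting targets per workflow.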

2) Multi-region patterns come with real trade-offs

Multi-region is not one design.

Active-passive keeps one primary and one standby.

Active-active serves from more than one region.

Warm standby sits between cold backup and full hot redundancy.

Simple, no? More readiness usually means more cost and complexity.

Cross-region latency, consistency, and failover routing all matter. Use the X-ray to compare failover paths before a real outage.

The hard part is not only serving traffic.

The hard part is preserving correctness during stress.

- Active-passive simplifies writes but can slow failover.
- Active-active improves availability but complicates consistency.
- Warm standby reduces cost while keeping faster recovery than cold backups.
- Region routing must be tested with realistic client behavior.
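
The write-path difference between active-passive and active-active can be sketched in a few lines. This is a simplified illustration, not a routing implementation. The `Region` class, the `write_targets` function, and the region names used with it are hypothetical, and a real system also needs replication, conflict handling, and client-side routing.

```python
from dataclasses import dataclass


@dataclass
class Region:
    name: str
    healthy: bool
    role: str  # "primary" or "standby"; only meaningful for active-passive


def write_targets(pattern: str, regions: list[Region]) -> list[str]:
    """Return the names of regions that should accept writes under a pattern."""
    healthy = [r for r in regions if r.healthy]
    if pattern == "active-active":
        # Every healthy region takes writes; resolving conflicts is out of scope here.
        return [r.name for r in healthy]
    if pattern == "active-passive":
        primaries = [r.name for r in healthy if r.role == "primary"]
        if primaries:
            return primaries[:1]
        # Primary is unhealthy: fall back to the first healthy standby.
        return [r.name for r in healthy if r.role == "standby"][:1]
    raise ValueError(f"unknown pattern: {pattern}")
```

Active-passive always returns at most one writer, which is why writes stay simple. Active-active can return several, which is where the consistency work begins.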

3) Backups are only useful when restore is real

Many teams celebrate successful backups and ignore restore practice.

That is backward.

Recovery depends on restore speed, integrity, and order.

Snapshots, WAL (write-ahead log) shipping, object backups, and config backups may all matter. Another X-ray helps expose DNS, auth, or cache dependencies during drills.

So what to do?

Define restore runbooks for data, infrastructure, and secrets together.

Verify backup age and restore time against RPO and RTO.

Backup without restore drills is theater.

- Track backup freshness, completeness, and encryption status.
- Restore to isolated environments regularly for validation.
- Include secrets, certificates, and configuration in recovery scope.
- Check application compatibility with restored state before declaring success.
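
Checking backup age against RPO and restore time against RTO can be automated. The sketch below assumes a hypothetical `BackupRecord` that your backup tooling would populate, with the duration of the most recent restore drill stored alongside it. The field names and finding messages are illustrative only.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class BackupRecord:
    taken_at: datetime
    encrypted: bool
    # Duration of the most recent restore drill; None means never restored.
    last_restore_duration: Optional[timedelta]


def backup_findings(record: BackupRecord, now: datetime,
                    rpo: timedelta, rto: timedelta) -> list[str]:
    """Return a list of problems; an empty list means the backup passes these checks."""
    findings = []
    if now - record.taken_at > rpo:
        findings.append("backup older than RPO")
    if not record.encrypted:
        findings.append("backup not encrypted")
    if record.last_restore_duration is None:
        findings.append("restore never drilled")
    elif record.last_restore_duration > rto:
        findings.append("restore slower than RTO")
    return findings
```

Note that a fresh, encrypted backup still fails here if nobody has ever restored it. These checks cover only age, encryption, and restore timing; integrity and application compatibility still need a real restore into an isolated environment.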

4) Failover needs orchestration, not hope

Failover touches DNS, load balancers, data stores, and queues.

Each layer can lag or disagree. The medical chart should preserve exact failover timestamps and decisions.

Some clients cache aggressively.

Some background jobs restart in the wrong place.

Now watch. One-button failover is rare without serious preparation.

Decide who authorizes failover and which checks must pass.

Plan how to drain, freeze, or re-route writes safely.

Also plan the harder step: failback after stability returns.

A monitor alarm should fire when RPO or RTO drift looks dangerous.

- Keep authority and command order explicit in the runbook.
- Verify queue semantics and idempotency during region transitions.
- Watch replication lag before making irreversible traffic moves.
- Treat failback as a separate risk event with its own checks.
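
An explicit command order with gating checks can be expressed as a small runner. The sketch below is an assumption-laden illustration: `run_failover`, the check names, and the step names are hypothetical, and the checks and steps are plain callables standing in for real tooling. It stops at the first failed check and records a timestamp for every decision. It does not handle timeouts, step failures, or rollback, which a real orchestrator would need.

```python
from datetime import datetime, timezone
from typing import Callable

Check = tuple[str, Callable[[], bool]]   # (name, returns True when safe to proceed)
Step = tuple[str, Callable[[], None]]    # (name, performs one failover action)


def run_failover(checks: list[Check], steps: list[Step], authorized_by: str) -> list[dict]:
    """Run preflight checks, then ordered steps, and return a timestamped log."""
    log: list[dict] = []

    def record(event: str, detail: str) -> None:
        log.append({
            "at": datetime.now(timezone.utc).isoformat(),
            "event": event,
            "detail": detail,
        })

    record("authorized", authorized_by)
    for name, check in checks:
        if not check():
            # Abort before any step runs; no traffic has moved yet.
            record("aborted", f"check failed: {name}")
            return log
        record("check_passed", name)
    for name, step in steps:
        step()
        record("step_done", name)
    record("completed", "failover finished")
    return log
```

A call might pass checks such as "replication lag within RPO" and steps such as "freeze writes", "promote standby", and "switch DNS", in that order. Failback would get its own list of checks and steps rather than reusing this one in reverse.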

5) DR testing exposes the truth

A DR plan that has never been tested is only a story.

Run tabletop exercises for coordination gaps.

Run technical drills for routing, restore, and promotion gaps.

Measure actual recovery time instead of estimated recovery time.

See. Real numbers often hurt, and that is useful.

Start with narrow scenarios and expand toward full-region assumptions.

Document surprises, missing access, and hidden dependencies.

Then feed those into platform and product roadmaps.

- Test data restore and traffic routing as one end-to-end workflow.
- Track actual RTO and RPO achieved in drills.
- Include third-party dependencies in scenario planning.
- Review whether on-call documentation is sufficient for fresh operators.

The playbook should define failover order, fallback, and communication.
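
Measuring what a drill actually achieved comes down to three timestamps. The sketch below is one possible way to compute it. The `DrillTimeline` class and its field names are assumptions for illustration, and it treats the achieved data-loss window as the gap between the newest write that survived recovery and the start of the outage.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class DrillTimeline:
    outage_start: datetime           # when the simulated failure began
    last_recovered_write: datetime   # newest write present in the recovered data
    service_restored: datetime       # when end-to-end checks passed again


def achieved_rto(t: DrillTimeline) -> timedelta:
    """Actual downtime observed in the drill."""
    return t.service_restored - t.outage_start


def achieved_rpo(t: DrillTimeline) -> timedelta:
    """Actual data-loss window; never negative."""
    return max(t.outage_start - t.last_recovered_write, timedelta(0))


def drill_report(t: DrillTimeline, target_rto: timedelta, target_rpo: timedelta) -> dict:
    """Compare measured recovery against the promised targets."""
    rto, rpo = achieved_rto(t), achieved_rpo(t)
    return {
        "achieved_rto": rto,
        "achieved_rpo": rpo,
        "rto_met": rto <= target_rto,
        "rpo_met": rpo <= target_rpo,
    }
```

A drill can meet one promise and miss the other, so the report keeps the two results separate instead of collapsing them into a single pass or fail.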

Where this lives in the wild

- Global SaaS platforms define different DR targets for control plane and data plane services.
- Payment processors invest heavily in low RPO and controlled failover procedures.
- Cloud-native products use warm standby to balance cost and acceptable recovery time.
- Database teams validate restore paths because backup success alone is not enough.
- Executive risk reviews often depend on measured DR drill results, not slideware claims.

Pause and recall

- What business questions do RTO and RPO answer?
- Why is active-active harder than simply adding another region?
- Why is restore testing more valuable than backup success alone?
- What makes failback a distinct risk event?

Interview Q&A

Q: Why must RTO and RPO be agreed with the business, not only engineering?
A: Because they encode acceptable downtime and data loss, which are business impact choices before they are technical designs.
Common wrong answer to avoid: "Because product people like meetings" - alignment is needed because cost and tolerance are business trade-offs.

Q: When is active-passive a better DR choice than active-active?
A: When simpler write patterns, lower operational complexity, and acceptable failover delay matter more than maximum regional parallelism.
Common wrong answer to avoid: "Active-active is always superior" - it improves some resilience properties but complicates correctness and operations.

Q: Why are backups insufficient without restore drills?
A: Because real recovery depends on integrity, ordering, tooling, permissions, and timing, none of which backup completion alone proves.
Common wrong answer to avoid: "Because backups often fail silently" - that can happen, but even valid backups may still restore too slowly or incompletely.

Q: What should a DR exercise measure?
A: Actual recovery time, actual data-loss window, coordination friction, and any hidden dependencies that block declared recovery.
Common wrong answer to avoid: "Only whether traffic came back" - partial recovery can still violate business promises badly.

Apply now (5 min)

Take one critical workflow in your system. Write its target RTO, target RPO, primary region, standby pattern, and restore source. Then list the first three failover steps and one thing that could make failback dangerous. If any answer is fuzzy, your DR plan needs tightening.

Bridge. DR covered. What don't we fully understand about reliability? → 11