07. Saga pattern — coordinate many local commits without one global transaction¶

⏱️ Estimated time: 17 min | Level: intermediate

ELI5 callback: The town crier pins one notice on the bulletin board. The board rules decide delivery, and the town directory helps readers find it.

Why sagas exist in the first place¶

One user action often touches payment, inventory, shipping, and notifications. A single database transaction cannot safely span all of those services, because each one owns its own database. Two-phase commit looks neat, but it adds latency and couples the services tightly. So teams use sagas to break one big transaction into local steps. Each town crier emits the next signal after its local commit. The goal is business consistency, not one magical global lock. Diagram:

+--------+   +---------+   +-----------+   +----------+
| order  |-> | payment |-> | inventory |-> | shipping |
+--------+   +---------+   +-----------+   +----------+
    fail here? compensation must walk backward
  local commits happen in separate services
no single rollback can rewind every database instantly

The customer clicks checkout once.
Order service creates a pending order locally.
Payment service charges the card locally.
Inventory service reserves stock locally.
Any later failure needs compensation, not a global rollback.
Each service owns its own database.
Cross-service consistency arrives as a workflow.
Local commits should stay small and clear.
Compensation design belongs in the first draft. Sagas trade instant certainty for controlled recovery.
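The checkout walk-through above can be sketched in a few lines of Python. This is a minimal illustration, not a production design: the `Step` and `run_saga` names are hypothetical, the lambdas stand in for real service calls, and everything runs in one process so the backward walk is easy to see.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Step:
    name: str
    action: Callable[[], None]      # the local commit inside one service
    compensate: Callable[[], None]  # the business undo for that commit


def run_saga(steps: List[Step]) -> str:
    """Run steps in order; on failure, compensate finished steps in reverse."""
    done: List[Step] = []
    for step in steps:
        try:
            step.action()
            done.append(step)
        except Exception:
            # No global rollback exists, so walk backward through local undos.
            for finished in reversed(done):
                finished.compensate()
            return "compensated"
    return "completed"


log: List[str] = []


def fail_inventory() -> None:
    raise RuntimeError("out of stock")


checkout = [
    Step("order", lambda: log.append("order pending"),
         lambda: log.append("order cancelled")),
    Step("payment", lambda: log.append("card charged"),
         lambda: log.append("payment refunded")),
    Step("inventory", fail_inventory,
         lambda: log.append("stock released")),
]

result = run_saga(checkout)
```

Here the inventory step fails, so the payment is refunded first and the order is cancelled second. The failed step itself is never compensated, because it never committed.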

Choreography lets services react without a central conductor¶

Choreography means services react to events without a central coordinator. The bulletin board carries the chain from one service to another. One notice says payment succeeded, and inventory reacts automatically. This style keeps services loosely coupled at runtime. It also makes the overall flow harder to see. A missed event or ambiguous ownership can create hidden bugs. Diagram:

+---------+   event    +-----------+   event    +-----------+
| payment | ---------> | inventory | ---------> | shipping  |
+---------+            +-----------+            +-----------+
      ^                        |                        |
      +--------- refund <------+                        v
                 on failure                   send confirmation

OrderPlaced wakes payment.
PaymentSucceeded wakes inventory.
InventoryReserved wakes shipping.
InventoryFailed wakes payment refund.
Everyone reacts to events instead of direct calls.
Choreography reduces central dependencies.
It can hide the full workflow across many repos.
Ownership boundaries must stay explicit.
Observability becomes essential, not optional. Loose coupling feels nice until nobody sees the whole dance.
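The event chain above might look like this with a toy in-memory bulletin board. Treat it as a sketch under stated assumptions: the `Board` class stands in for a real message broker, and the `ShipmentScheduled` and `PaymentRefunded` event names are hypothetical additions to the four events listed above.

```python
from collections import defaultdict, deque
from typing import Callable, Deque, Dict, List, Tuple

Handler = Callable[[dict], None]


class Board:
    """Toy in-memory event bus standing in for a real broker."""

    def __init__(self) -> None:
        self.handlers: Dict[str, List[Handler]] = defaultdict(list)
        self.queue: Deque[Tuple[str, dict]] = deque()
        self.history: List[str] = []

    def subscribe(self, event: str, handler: Handler) -> None:
        self.handlers[event].append(handler)

    def publish(self, event: str, payload: dict) -> None:
        self.queue.append((event, payload))

    def drain(self) -> None:
        # Deliver queued events in order until no service has more to say.
        while self.queue:
            event, payload = self.queue.popleft()
            self.history.append(event)
            for handler in self.handlers[event]:
                handler(payload)


def wire(board: Board, stock: int) -> None:
    """Each service subscribes to the event it cares about; no coordinator."""

    def payment(order: dict) -> None:
        board.publish("PaymentSucceeded", order)

    def inventory(order: dict) -> None:
        ok = stock >= order["qty"]
        board.publish("InventoryReserved" if ok else "InventoryFailed", order)

    def shipping(order: dict) -> None:
        board.publish("ShipmentScheduled", order)  # hypothetical event name

    def refund(order: dict) -> None:
        board.publish("PaymentRefunded", order)    # hypothetical event name

    board.subscribe("OrderPlaced", payment)
    board.subscribe("PaymentSucceeded", inventory)
    board.subscribe("InventoryReserved", shipping)
    board.subscribe("InventoryFailed", refund)


def run(stock: int) -> List[str]:
    board = Board()
    wire(board, stock)
    board.publish("OrderPlaced", {"order_id": "o-1", "qty": 2})
    board.drain()
    return board.history


happy = run(stock=5)
sad = run(stock=0)
```

Notice that the full workflow only becomes visible in `board.history`. No single handler describes the whole dance, which is exactly the tracing problem this section warns about.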

Orchestration puts the workflow in one visible place¶

Orchestration puts one coordinator in charge of the saga. The coordinator sends commands and waits for replies. That makes the flow easier to reason about. It also creates one more component to build carefully. A strong town directory helps the orchestrator route commands correctly. Timeouts, retries, and compensations live in one place. Diagram:

+--------------+   cmd    +---------+
| orchestrator | -------> | payment |
+--------------+          +---------+
       |
       |   cmd            +-----------+
       +----------------> | inventory |
                          +-----------+

Orchestrator starts the saga with a durable state record.
It asks payment to charge the card.
On success, it asks inventory to reserve stock.
On inventory failure, it issues RefundPayment.
The workflow stays visible in one timeline.
Orchestration improves auditability.
It centralizes retry and timeout logic.
It can become a bottleneck if overloaded.
Keep saga state durable, not only in memory. Visibility is often worth the extra component.
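One way to picture the orchestrator steps above is a small class that writes every state transition to a database before moving on. This is a sketch, not a reference design: the `Orchestrator` class, the state names, and the `ChargeCard` and `ReserveStock` command names are assumptions made for illustration (only `RefundPayment` comes from the steps above). It uses an in-memory SQLite database so it runs anywhere; a real deployment would need storage that survives a restart.

```python
import sqlite3
from typing import Callable, Dict, List

Command = Callable[[str], bool]  # takes a saga id, returns success or failure


class Orchestrator:
    def __init__(self, db: sqlite3.Connection, services: Dict[str, Command]) -> None:
        self.db = db
        self.services = services
        db.execute("CREATE TABLE IF NOT EXISTS saga (id TEXT PRIMARY KEY, state TEXT)")

    def _set(self, saga_id: str, state: str) -> None:
        # Persist each transition so a restarted coordinator can see progress.
        self.db.execute(
            "INSERT OR REPLACE INTO saga (id, state) VALUES (?, ?)",
            (saga_id, state),
        )
        self.db.commit()

    def state(self, saga_id: str) -> str:
        row = self.db.execute(
            "SELECT state FROM saga WHERE id = ?", (saga_id,)
        ).fetchone()
        return row[0]

    def run(self, saga_id: str) -> str:
        self._set(saga_id, "STARTED")
        if not self.services["ChargeCard"](saga_id):
            self._set(saga_id, "FAILED")
            return self.state(saga_id)
        self._set(saga_id, "PAYMENT_CHARGED")
        if not self.services["ReserveStock"](saga_id):
            # Inventory failed after payment committed: compensate, then record it.
            self.services["RefundPayment"](saga_id)
            self._set(saga_id, "COMPENSATED")
            return self.state(saga_id)
        self._set(saga_id, "COMPLETED")
        return self.state(saga_id)


sent: List[str] = []


def command(name: str, ok: bool) -> Command:
    """Build a fake service that records the command and returns a fixed reply."""
    def handler(saga_id: str) -> bool:
        sent.append(name)
        return ok
    return handler


db = sqlite3.connect(":memory:")
services = {
    "ChargeCard": command("ChargeCard", True),
    "ReserveStock": command("ReserveStock", False),
    "RefundPayment": command("RefundPayment", True),
}
final_state = Orchestrator(db, services).run("saga-1")
```

The whole flow reads top to bottom in `run`, which is the visibility benefit of orchestration. Retries and timeouts are left out here to keep the sketch short.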

Compensation is business repair, not time travel¶

Sagas succeed by undoing business intent, not by rolling back bytes. Compensation must be explicit for every step that has already committed. Refunding a payment is different from pretending the payment never happened. Timeouts, retry caps, and compensation triggers become board rules. Some actions cannot be compensated fully, like sending an email. For those, design human remediation paths. The system should surface partial completion clearly. Diagram:

+--------- forward path ---------+
order -> payment -> inventory -> ship
+------ compensation path -------+
ship fail -> release stock -> refund
timeout -> mark saga waiting or failed
humans may handle the final messy step

Charge succeeds but stock reservation times out.
The orchestrator waits for a bounded period.
It retries if the step is safely retryable.
If stock still fails, it refunds payment.
The saga ends in a clear compensated state.
Compensation handlers must be idempotent too.
Keep timeout values aligned with business patience.
Record every step transition durably.
Expose stuck sagas through alerts and dashboards. A hidden partial failure is worse than an honest visible one.
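The timeout scenario above combines two ideas: a bounded retry and an idempotent compensation handler. A minimal sketch follows, assuming hypothetical names (`TransientError`, `retry`, `PaymentService`) and an in-memory set as the idempotency record. Real code would add backoff between attempts and store the refunded ids durably.

```python
from typing import Callable, Dict, Set


class TransientError(Exception):
    """A failure that is assumed safe to retry, such as a timeout."""


def retry(step: Callable[[], None], max_attempts: int) -> bool:
    """Try a retryable step up to max_attempts times; report success."""
    for _ in range(max_attempts):
        try:
            step()
            return True
        except TransientError:
            continue
    return False


class PaymentService:
    def __init__(self) -> None:
        self.total_refunded = 0
        self.refunded: Set[str] = set()

    def refund(self, payment_id: str, amount: int) -> None:
        # Idempotency: a repeated refund for the same payment is a no-op.
        if payment_id in self.refunded:
            return
        self.refunded.add(payment_id)
        self.total_refunded += amount


calls: Dict[str, int] = {"reserve": 0}


def reserve_stock() -> None:
    calls["reserve"] += 1
    raise TransientError("inventory timed out")


payments = PaymentService()
if retry(reserve_stock, max_attempts=3):
    saga_state = "completed"
else:
    payments.refund("pay-1", 500)
    payments.refund("pay-1", 500)  # duplicate delivery of the same compensation
    saga_state = "compensated"
```

The retry cap keeps the saga from waiting forever, and the duplicate refund call changes nothing. Without the idempotency check, a redelivered compensation message would pay the customer twice.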

Choose choreography or orchestration by debugging cost¶

Choose choreography when steps are simple and boundaries are strong. Choose orchestration when branching, timeouts, or visibility matter more. Both styles still need idempotent handlers and durable state. Teams often start with orchestration for clarity. Later, they move simple side effects to event reactions. The smart choice is the one operators can debug at 2 AM. Diagram:

+---------------+     compare     +----------------+
| choreography  |  <---------->  | orchestration  |
+---------------+                +----------------+
simple flows                        complex flows
fewer central parts                 clearer control
harder tracing                      easier tracing

Count how many services take part.
List every compensation step explicitly.
Mark where timeouts and human actions appear.
Decide who owns end-to-end visibility.
Pick the style your tooling can support well.
There is no universal winner.
Workflow complexity should drive the choice.
Small systems can stay simple for a long time.
Operational clarity beats architectural fashion. Good saga design feels boring in production, and that is success.

Where this lives in the wild¶

Uber-style trip flows coordinating payments, driver state, and rider updates.
Swiggy or Zomato order lifecycles spanning restaurants, delivery, and refunds.
Temporal workflows, where orchestration state stays durable and inspectable.
Camunda process engines running long business transactions with compensation.
AWS Step Functions coordinating cloud service steps with retries and branches.

Pause and recall¶

Why can sagas not depend on one database rollback across services?
When does choreography become hard to reason about?
Why is compensation different from reversing a SQL transaction?
What signals tell you orchestration may be a better fit?

Interview Q&A¶

Q: What problem does the saga pattern solve? A: It coordinates many local commits across services when one ACID transaction is impractical or too costly. Common wrong answer to avoid: "It gives microservices a free global database transaction."

Q: When is choreography risky? A: When many services react indirectly and nobody can easily trace ownership, branching, or timeout behavior. Common wrong answer to avoid: "Choreography is always simpler because there is no coordinator."

Q: What makes compensation hard? A: Business undo steps may be partial, delayed, or impossible, so they need domain-specific design and idempotency. Common wrong answer to avoid: "Just delete the row you inserted and everything is rolled back."

Q: Why keep saga state durable? A: Because coordinators crash too, and the workflow must resume without forgetting progress or compensation decisions. Common wrong answer to avoid: "The coordinator can rebuild everything from memory after restart."

Apply now (5 min)¶

Take one workflow, like checkout or ride booking. List the local commit made by each service. Write the compensation for every step. Mark which failures are retryable and which need human help. Choose choreography or orchestration and defend that choice. Add one timeout and one alert you would implement first. Now watch the vague workflow become an explicit contract.

Bridge. Sagas coordinate writes across services. But what about separating reads from writes entirely? → 08