08. State Machines and Workflows: wire the legal moves before features multiply

~13 min read. Hidden state feels harmless until one illegal jump ruins an order flow.

Built on the ELI5 in 00-eli5.md. The wiring — the behavioral circuit between rooms — decides which move is legal now and which move must wait.

1) Name the business state explicitly, otherwise bugs hide in flags

Many systems begin with one innocent status field. Then more conditions arrive, and soon the logic spreads across booleans, timestamps, and queue side effects. Nobody can tell the real state quickly. For an order, start with named states. The basic story is simple: placed becomes paid, paid becomes shipped, shipped becomes delivered. Cancellation and failure branches are separate decisions.

┌────────┐   pay    ┌──────┐   ship   ┌─────────┐ deliver  ┌───────────┐
│ PLACED │ ───────▶ │ PAID │ ───────▶ │ SHIPPED │ ───────▶ │ DELIVERED │
└────────┘          └──────┘          └─────────┘          └───────────┘
     │ cancel           │ refund before ship
     └──────────────────┴──────────────────────▶ CANCELLED

That picture already answers many arguments. Can PLACED jump straight to SHIPPED? Usually no. Can DELIVERED jump back to PAID? Definitely no. Simple, no?

Implicit state example:

if (paymentReceived && courierAssigned && shippedAt == null) {
    order.markShipped();
}

What is wrong here? The business rule is hidden inside three separate signals. A reviewer cannot inspect all legal moves in one place. Testing every combination becomes tiring and incomplete.

Explicit state example:

enum OrderState {
    PLACED, PAID, SHIPPED, DELIVERED, CANCELLED
}

Worked example. Suppose 10,000 orders arrive today. 8,900 reach PAID, 8,300 reach SHIPPED, 8,050 reach DELIVERED. With explicit state, each drop is visible and measurable: 1,100 orders never got paid, 600 paid orders never shipped, and 250 shipped orders are not delivered yet. With scattered booleans, reporting becomes guesswork. So what to do? Name the state first, then code the moves.
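The drop between stages in the worked example can be computed directly once every order carries a named state. Below is a minimal sketch, assuming the per-state counts have already been collected; `OrderFunnel`, `HAPPY_PATH`, and `dropBefore` are hypothetical names used only for illustration.

```java
import java.util.EnumMap;
import java.util.List;
import java.util.Map;

enum OrderState { PLACED, PAID, SHIPPED, DELIVERED, CANCELLED }

// Hypothetical helper, not part of any library.
final class OrderFunnel {
    // The happy path from the state diagram, in order.
    static final List<OrderState> HAPPY_PATH = List.of(
            OrderState.PLACED, OrderState.PAID, OrderState.SHIPPED, OrderState.DELIVERED);

    // For each state after PLACED: orders that reached the previous state but not this one.
    static Map<OrderState, Integer> dropBefore(Map<OrderState, Integer> reached) {
        Map<OrderState, Integer> drops = new EnumMap<>(OrderState.class);
        for (int i = 1; i < HAPPY_PATH.size(); i++) {
            int previous = reached.get(HAPPY_PATH.get(i - 1));
            int current = reached.get(HAPPY_PATH.get(i));
            drops.put(HAPPY_PATH.get(i), previous - current);
        }
        return drops;
    }

    public static void main(String[] args) {
        // Counts taken from the worked example.
        Map<OrderState, Integer> reached = new EnumMap<>(OrderState.class);
        reached.put(OrderState.PLACED, 10_000);
        reached.put(OrderState.PAID, 8_900);
        reached.put(OrderState.SHIPPED, 8_300);
        reached.put(OrderState.DELIVERED, 8_050);
        System.out.println(dropBefore(reached)); // {PAID=1100, SHIPPED=600, DELIVERED=250}
    }
}
```

The same report built from booleans and timestamps would need a separate rule for what "reached PAID" means in each query.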

2) A transition is event plus guard plus action, not just an arrow

A state machine is not only boxes and arrows. Each move needs three details: what event occurred, which guard allowed it, and what action followed. That is proper wiring.

┌───────────┬──────────────┬──────────────────────┬────────────┐
│ From      │ Event        │ Guard                │ To         │
├───────────┼──────────────┼──────────────────────┼────────────┤
│ PLACED    │ payment_ok   │ amount matches       │ PAID       │
│ PAID      │ ship_request │ address confirmed    │ SHIPPED    │
│ SHIPPED   │ delivery_ok  │ otp verified         │ DELIVERED  │
│ PLACED    │ cancel       │ not paid yet         │ CANCELLED  │
└───────────┴──────────────┴──────────────────────┴────────────┘

See the benefit. When finance changes a rule, you update one guard. When operations asks for audit logs, you attach them to actions. The transition stays readable.

Concrete Java code:

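// Named transition PLACED -> PAID, triggered by the payment_ok event.
void markPaid(Money received, Money expected) {
    // Legal only from PLACED; any other state is an illegal jump.
    if (state != OrderState.PLACED) {
        throw new IllegalStateException("cannot mark paid from " + state);
    }
    // Guard: the received amount must match the expected amount.
    if (!received.equals(expected)) throw new DomainException("amount mismatch");
    state = OrderState.PAID;   // action: apply the move
    events.add("payment_ok");  // action: record the event for the audit trail
}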

This is much safer than setStatus("PAID"). A generic setter cannot defend invariants. A named transition can.

Worked example with numbers. Assume 500 payment webhooks arrive in one minute. 495 are valid, 3 have an amount mismatch, and 2 are duplicates. If every transition logs from, event, and to, support can classify every case quickly. If not, everyone stares at raw logs nervously. A tiny transition log looks like this.

order_id | from    | event        | to         | at
----------------------------------------------------
501      | PLACED  | payment_ok   | PAID       | 10:03
501      | PAID    | ship_request | SHIPPED    | 11:20
501      | SHIPPED | delivery_ok  | DELIVERED  | 18:05

So what to do? Store the legal transition table in code, not in team memory. That one decision makes debugging much calmer.
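One way to keep the legal transition table in code is a lookup keyed by state and event. The sketch below mirrors the four rows of the table in this section and leaves the guards out for brevity; `OrderEvent`, `OrderTransitions`, and `next` are hypothetical names, not an existing API.

```java
import java.util.EnumMap;
import java.util.Map;
import java.util.Optional;

enum OrderState { PLACED, PAID, SHIPPED, DELIVERED, CANCELLED }

enum OrderEvent { PAYMENT_OK, SHIP_REQUEST, DELIVERY_OK, CANCEL }

// Hypothetical lookup: (from, event) -> to. Guards are intentionally omitted.
final class OrderTransitions {
    private static final Map<OrderState, Map<OrderEvent, OrderState>> TABLE =
            new EnumMap<>(OrderState.class);

    static {
        allow(OrderState.PLACED, OrderEvent.PAYMENT_OK, OrderState.PAID);
        allow(OrderState.PAID, OrderEvent.SHIP_REQUEST, OrderState.SHIPPED);
        allow(OrderState.SHIPPED, OrderEvent.DELIVERY_OK, OrderState.DELIVERED);
        allow(OrderState.PLACED, OrderEvent.CANCEL, OrderState.CANCELLED);
    }

    private static void allow(OrderState from, OrderEvent event, OrderState to) {
        TABLE.computeIfAbsent(from, s -> new EnumMap<>(OrderEvent.class)).put(event, to);
    }

    // An empty result means the move is not in the table, so it is illegal.
    static Optional<OrderState> next(OrderState from, OrderEvent event) {
        Map<OrderEvent, OrderState> row = TABLE.get(from);
        return row == null ? Optional.empty() : Optional.ofNullable(row.get(event));
    }
}
```

With this shape, `OrderTransitions.next(OrderState.PLACED, OrderEvent.SHIP_REQUEST)` comes back empty, so the PLACED to SHIPPED jump is rejected in one place instead of in scattered if-conditions.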

3) Use the state pattern when behavior differs by state, not only labels

Sometimes an enum plus transition methods is enough. Sometimes every operation behaves differently in each state. Then the state pattern becomes useful. Do not use it for fashion. Use it when switches start spreading everywhere.

Naive approach:

switch (ticket.state) {
  case 'OPEN':
    return assign(agentId);
  case 'RESOLVED':
    throw new Error('cannot assign');
  case 'CLOSED':
    throw new Error('cannot assign');
}

One switch is fine. Fifteen switches in six files are not fine. That means behavior has leaked out of its room.

State pattern sketch:

interface TicketState {
  assign(agentId: string): TicketState;
  resolve(): TicketState;
  reopen(): TicketState;
}
class OpenState implements TicketState { /* ... */ }
class ResolvedState implements TicketState { /* ... */ }
class ClosedState implements TicketState { /* ... */ }

Now each state owns its own rules. The caller uses one stable hallway. Illegal operations fail close to the state definition. Reviewing behavior becomes much easier. Simple, no?

Worked example. A support ticket allows 6 actions in OPEN. It allows only 2 actions in RESOLVED. It allows 0 write actions in CLOSED. State classes express that difference directly. One giant switch usually hides it.

A quick rule of thumb:

- few states and few behaviors: enum is enough;
- repeated branching across files: state pattern helps;
- many external effects: add transition audit logs too.
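To make the pattern concrete, here is a minimal Java rendering of the same three-method interface with the bodies filled in. The specific moves (resolve takes OPEN to RESOLVED, reopen takes RESOLVED back to OPEN) and the `Ticket` wrapper are assumptions for illustration; only "assign fails outside OPEN" and "CLOSED allows no write actions" come from the text above.

```java
// Each state decides what every operation means and which state comes next.
interface TicketState {
    TicketState assign(String agentId);
    TicketState resolve();
    TicketState reopen();
}

final class OpenState implements TicketState {
    // Assignment keeps the ticket OPEN; storing the agent is omitted in this sketch.
    @Override public TicketState assign(String agentId) { return this; }
    @Override public TicketState resolve() { return new ResolvedState(); }
    @Override public TicketState reopen() { throw new IllegalStateException("ticket is already open"); }
}

final class ResolvedState implements TicketState {
    @Override public TicketState assign(String agentId) {
        throw new IllegalStateException("cannot assign a resolved ticket");
    }
    @Override public TicketState resolve() { throw new IllegalStateException("ticket is already resolved"); }
    @Override public TicketState reopen() { return new OpenState(); }
}

// CLOSED allows no write actions, so every operation is rejected.
final class ClosedState implements TicketState {
    @Override public TicketState assign(String agentId) {
        throw new IllegalStateException("cannot assign a closed ticket");
    }
    @Override public TicketState resolve() { throw new IllegalStateException("cannot resolve a closed ticket"); }
    @Override public TicketState reopen() { throw new IllegalStateException("cannot reopen a closed ticket"); }
}

// The caller's "one stable hallway": it delegates and never switches on state.
final class Ticket {
    private TicketState state;

    Ticket(TicketState initial) { this.state = initial; }

    void assign(String agentId) { state = state.assign(agentId); }
    void resolve() { state = state.resolve(); }
    void reopen() { state = state.reopen(); }
    TicketState state() { return state; }
}
```

A caller writes `ticket.assign("agent-7")` and gets either a new state or an `IllegalStateException` thrown from the class that owns the rule.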

4) Workflow engines coordinate long-running steps across many systems

A state machine can live inside one service. A workflow spans time, retries, external callbacks, and compensation. That is the difference. One is local behavior. The other is cross-system progress. Take an order workflow. Payment, inventory, shipping, and notification may involve four services. One step may fail after ten seconds. Another may wait thirty minutes for a courier callback. A plain HTTP handler is the wrong place to hold all that.

OrderPlaced
   │
   ▼
ChargePayment ──fail──▶ MarkPaymentFailed
   │ success
   ▼
ReserveInventory ──fail──▶ RefundPayment
   │ success
   ▼
CreateShipment ──fail──▶ ReleaseInventory + RefundPayment
   │ success
   ▼
SendConfirmation
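The failure branches in the diagram follow one rule: when a step fails, undo the steps that already succeeded, newest first. Here is a minimal in-memory sketch of that rule. It is not durable and is not how any particular engine implements it; `OrderSaga`, `Step`, and `run` are hypothetical names, and the MarkPaymentFailed branch is left out because nothing has succeeded yet at that point.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;

// Hypothetical in-memory sketch of compensation; a real workflow engine also persists progress.
final class OrderSaga {
    // One forward action plus the action that undoes it.
    static final class Step {
        final String name;
        final Runnable action;
        final Runnable undo;

        Step(String name, Runnable action, Runnable undo) {
            this.name = name;
            this.action = action;
            this.undo = undo;
        }
    }

    // Runs steps in order. On the first failure, undoes completed steps newest-first
    // and returns false. Returns true when every step succeeded.
    static boolean run(List<Step> steps) {
        Deque<Step> completed = new ArrayDeque<>();
        for (Step step : steps) {
            try {
                step.action.run();
                completed.push(step);
            } catch (RuntimeException failure) {
                while (!completed.isEmpty()) {
                    completed.pop().undo.run();
                }
                return false;
            }
        }
        return true;
    }
}
```

If ChargePayment and ReserveInventory succeed and CreateShipment throws, the undo actions run as ReleaseInventory and then RefundPayment, which matches the last failure branch in the diagram.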

A workflow must remember what already happened. Otherwise a crash between payment and shipment can double-charge the customer or lose progress. Durable execution matters here.

Pseudo-code idea:

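// Pseudo-code: `workflow` stands for a durable-execution engine, not a specific API.
workflow.run(() -> {
    payment.charge(orderId);      // ChargePayment
    inventory.reserve(orderId);   // ReserveInventory
    shipment.create(orderId);     // CreateShipment
    notifier.send(orderId);       // SendConfirmation
});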

The nice syntax is not the real value. The real value is persisted progress, retries, timers, and compensation. Temporal, Camunda, and AWS Step Functions exist for this exact pain.

Worked example. Suppose shipment booking succeeds immediately 97% of the time. The remaining 3% need a partner retry after 20 minutes. A workflow engine can park the process, resume later, and keep one trace. A single synchronous controller cannot do that safely.

One more code-level lesson. Make transitions idempotent. If payment_ok arrives twice, the second delivery should not create a second invoice. Explicit state plus idempotency keys is a calm combination.

So what to do? Use local state machines for service rules, and workflows for long-running orchestration.
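The idempotency lesson can be sketched in a few lines: remember the key of every processed event and treat a repeat as a no-op. This is a minimal in-memory sketch. It assumes each webhook delivery carries a stable key (the `idempotencyKey` parameter is hypothetical), and a real service would persist the key together with the state change.

```java
import java.util.HashSet;
import java.util.Set;

enum OrderState { PLACED, PAID, SHIPPED, DELIVERED, CANCELLED }

// Hypothetical order aggregate showing an idempotent payment_ok transition.
final class Order {
    private OrderState state = OrderState.PLACED;
    private final Set<String> processedKeys = new HashSet<>();
    private int invoiceCount = 0;

    // Returns true if this delivery changed the order, false if it was a duplicate.
    boolean onPaymentOk(String idempotencyKey) {
        if (processedKeys.contains(idempotencyKey)) {
            return false; // same event delivered again: do nothing
        }
        if (state != OrderState.PLACED) {
            throw new IllegalStateException("cannot mark paid from " + state);
        }
        state = OrderState.PAID;
        invoiceCount++;                    // stands in for "create invoice"
        processedKeys.add(idempotencyKey); // recorded only after the move succeeded
        return true;
    }

    OrderState state() { return state; }
    int invoiceCount() { return invoiceCount; }
}
```

Calling `onPaymentOk("evt-1")` twice leaves the order in PAID with one invoice. A different key arriving for an already paid order still fails the state check, so the two defenses cover different mistakes.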

Where this lives in the wild

A Flipkart supply-chain engineer models order states so warehouse and delivery systems agree on parcel progress.
An Uber backend engineer controls trip states like requested, matched, started, and completed to block illegal jumps.
A Zepto operations-platform engineer orchestrates payment, picker assignment, and rider dispatch as one durable workflow.
A Jira platform engineer at Atlassian uses explicit issue workflows so project boards show only legal transitions.
A Netflix media-workflow engineer coordinates long-running processing jobs where retries and compensation are central.

Pause and recall

Why is explicit state safer than many booleans and timestamps?
What three pieces make a transition complete?
When is the state pattern more useful than an enum?
What extra problem does a workflow engine solve?

Interview Q&A

Why explicit state rather than derived flags for an order flow?

Because one named state makes legal moves inspectable, testable, and measurable. Derived flags spread meaning across unrelated fields and side effects. Common wrong answer to avoid: "Booleans are simpler, so they are always the better design."

Why named transition methods rather than a generic `setStatus()`?

Because transition methods can enforce guards, emit events, and protect invariants. A generic setter allows illegal jumps with no business language. Common wrong answer to avoid: "A setter is more flexible, so domain rules can come later."

Why the state pattern rather than one giant `switch` everywhere?

Because repeated switches duplicate behavior rules and scatter them across files. State objects keep each state's behavior close to its definition. Common wrong answer to avoid: "Switches are faster, so structure does not matter."

Why a workflow engine rather than one synchronous controller method?

Because long waits, retries, and compensation need durable progress beyond one request thread. Workflows survive crashes and resume from the right step. Common wrong answer to avoid: "Clean code inside one method is enough for long-running orchestration."

Apply now (5 min)

Exercise. Take a food delivery order and write 6 states, 4 events, 2 guards, and 2 illegal transitions. Then add one compensation step for payment success followed by inventory failure.

Sketch from memory. Draw the order line from PLACED to DELIVERED, add a cancellation branch, and mark one idempotent event. Say aloud which parts belong to a local state machine and which belong to a workflow engine.

Bridge. Once the wiring is explicit, failures still travel through the service, and careless handling can flood every room. → 09-error-handling-and-resilience.md