08. State Machines and Workflows: wire the legal moves before features multiply
~13 min read. Hidden state feels harmless, then one illegal jump ruins an order flow.
Built on the ELI5 in 00-eli5.md. The wiring — the behavioral circuit between rooms — decides which move is legal now and which move must wait.
1) Name the business state explicitly, otherwise bugs hide in flags
Many systems begin with one innocent status field.
Then more conditions arrive.
Soon logic spreads across booleans, timestamps, and queue side effects.
Nobody can tell the real state quickly.
For an order, start with named states.
The basic story is simple.
Placed becomes paid, paid becomes shipped, shipped becomes delivered.
Cancellation and failure branches are separate decisions.
┌────────┐   pay    ┌──────┐   ship   ┌─────────┐ deliver  ┌───────────┐
│ PLACED │ ───────▶ │ PAID │ ───────▶ │ SHIPPED │ ───────▶ │ DELIVERED │
└────────┘          └──────┘          └─────────┘          └───────────┘
    │ cancel            │ refund before ship
    └───────────────────┴──────────────────────▶ CANCELLED
Can PLACED jump straight to SHIPPED?
Usually no.
Can DELIVERED jump back to PAID?
Definitely no.
Simple, no?
Implicit state example:
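A minimal sketch of what implicit state can look like, assuming a hypothetical order record with three made-up fields (`isPaid`, `shippedAt`, `inDeliveryQueue`). The names are illustrative only; the point is that the state has to be guessed from several signals.

```typescript
// Hypothetical order record: the real state is scattered across three signals.
interface ImplicitOrder {
  isPaid: boolean;          // signal 1: a boolean flag
  shippedAt: Date | null;   // signal 2: a timestamp
  inDeliveryQueue: boolean; // signal 3: a queue side effect
}

// The state must be reverse-engineered, and contradictory combinations can exist.
function guessState(o: ImplicitOrder): string {
  if (o.shippedAt !== null && !o.isPaid) return "UNKNOWN (shipped but not paid?)";
  if (o.inDeliveryQueue && o.shippedAt === null) return "UNKNOWN (queued but not shipped?)";
  if (o.shippedAt !== null) return "SHIPPED";
  if (o.isPaid) return "PAID";
  return "PLACED";
}

const odd = guessState({ isPaid: false, shippedAt: new Date(0), inDeliveryQueue: false });
const fine = guessState({ isPaid: true, shippedAt: null, inDeliveryQueue: false });
```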
What is wrong here?
The business rule is hidden inside three separate signals.
A reviewer cannot inspect all legal moves in one place.
Testing every combination becomes tiring and incomplete.
Explicit state example:
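A minimal sketch of the explicit version, assuming the five states from the diagram above. The `LEGAL_MOVES` table and `canMove` helper are illustrative names, not part of any library.

```typescript
type OrderState = "PLACED" | "PAID" | "SHIPPED" | "DELIVERED" | "CANCELLED";

// Every legal move lives in one table that a reviewer can read at a glance.
const LEGAL_MOVES: Record<OrderState, readonly OrderState[]> = {
  PLACED: ["PAID", "CANCELLED"],
  PAID: ["SHIPPED", "CANCELLED"], // CANCELLED here means refund before ship
  SHIPPED: ["DELIVERED"],
  DELIVERED: [],
  CANCELLED: [],
};

function canMove(from: OrderState, to: OrderState): boolean {
  return LEGAL_MOVES[from].indexOf(to) !== -1;
}
```

Here `canMove("PLACED", "SHIPPED")` and `canMove("DELIVERED", "PAID")` are both false, which matches the two questions asked earlier.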
Worked example.
Suppose 10,000 orders arrive today.
8,900 reach PAID, 8,300 reach SHIPPED, 8,050 reach DELIVERED.
With explicit state, each drop is visible and measurable.
With scattered booleans, reporting becomes guesswork.
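The drop at each move in this example can be computed directly from the per-state counts. A small sketch, assuming the counts are already available per state (the `Stage` shape and `dropOffs` name are illustrative):

```typescript
interface Stage { state: string; reached: number; }

const funnel: Stage[] = [
  { state: "PLACED", reached: 10000 },
  { state: "PAID", reached: 8900 },
  { state: "SHIPPED", reached: 8300 },
  { state: "DELIVERED", reached: 8050 },
];

// For each adjacent pair of states, count the orders that did not make the move.
function dropOffs(stages: Stage[]): { move: string; lost: number }[] {
  const drops: { move: string; lost: number }[] = [];
  for (let i = 1; i < stages.length; i++) {
    drops.push({
      move: `${stages[i - 1].state} -> ${stages[i].state}`,
      lost: stages[i - 1].reached - stages[i].reached,
    });
  }
  return drops;
}

const drops = dropOffs(funnel);
```

The result shows 1,100 orders lost between PLACED and PAID, 600 between PAID and SHIPPED, and 250 between SHIPPED and DELIVERED.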
So what to do?
Name the state first, then code the moves.
2) A transition is event plus guard plus action, not just an arrow
A state machine is not only boxes and arrows. Each move needs three details: what event occurred, which guard allowed it, and what action followed. That is proper wiring.
┌───────────┬──────────────┬──────────────────────┬────────────┐
│ From      │ Event        │ Guard                │ To         │
├───────────┼──────────────┼──────────────────────┼────────────┤
│ PLACED    │ payment_ok   │ amount matches       │ PAID       │
│ PAID      │ ship_request │ address confirmed    │ SHIPPED    │
│ SHIPPED   │ delivery_ok  │ otp verified         │ DELIVERED  │
│ PLACED    │ cancel       │ not paid yet         │ CANCELLED  │
└───────────┴──────────────┴──────────────────────┴────────────┘
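void markPaid(Money received, Money expected) {
    // Guard 1: only a PLACED order may be paid.
    if (state != OrderState.PLACED) throw new IllegalStateException("expected PLACED, was " + state);
    // Guard 2: the received amount must match the expected amount.
    if (!received.equals(expected)) throw new DomainException("amount mismatch");
    state = OrderState.PAID;   // the transition itself
    events.add("payment_ok");  // action: record the event
}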
Avoid a generic setStatus("PAID").
A generic setter cannot defend invariants.
A named transition can.
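The transition table above can also be written as data, so that illegal moves and failed guards are rejected in one place. This is a sketch under assumptions: the `Ctx` fields (`amountMatches`, `addressConfirmed`, `otpVerified`, `alreadyPaid`) and the `transition` function are hypothetical names chosen to mirror the guard column.

```typescript
type State = "PLACED" | "PAID" | "SHIPPED" | "DELIVERED" | "CANCELLED";
type EventName = "payment_ok" | "ship_request" | "delivery_ok" | "cancel";

// Hypothetical facts that the guards inspect.
interface Ctx {
  amountMatches: boolean;
  addressConfirmed: boolean;
  otpVerified: boolean;
  alreadyPaid: boolean;
}

interface Transition {
  from: State;
  event: EventName;
  guard: (c: Ctx) => boolean;
  to: State;
}

// One row per line of the table.
const TRANSITIONS: Transition[] = [
  { from: "PLACED", event: "payment_ok", guard: (c) => c.amountMatches, to: "PAID" },
  { from: "PAID", event: "ship_request", guard: (c) => c.addressConfirmed, to: "SHIPPED" },
  { from: "SHIPPED", event: "delivery_ok", guard: (c) => c.otpVerified, to: "DELIVERED" },
  { from: "PLACED", event: "cancel", guard: (c) => !c.alreadyPaid, to: "CANCELLED" },
];

function transition(state: State, event: EventName, ctx: Ctx): State {
  const row = TRANSITIONS.find((t) => t.from === state && t.event === event);
  if (!row) throw new Error(`illegal move: ${event} from ${state}`);
  if (!row.guard(ctx)) throw new Error(`guard failed: ${event} from ${state}`);
  return row.to;
}

const okCtx: Ctx = { amountMatches: true, addressConfirmed: true, otpVerified: true, alreadyPaid: false };
const afterPayment = transition("PLACED", "payment_ok", okCtx);
```

Actions (emitting events, writing the audit log) are left out here to keep the sketch short; they would run after the guard passes.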
Worked example with numbers.
Assume 500 payment webhooks arrive in one minute.
495 are valid, 3 have amount mismatch, 2 are duplicates.
If transitions log from, event, and to, support can classify every case quickly.
If not, everyone stares at raw logs nervously.
A tiny transition log looks like this.
order_id | from    | event        | to        | at
----------------------------------------------------
501      | PLACED  | payment_ok   | PAID      | 10:03
501      | PAID    | ship_request | SHIPPED   | 11:20
501      | SHIPPED | delivery_ok  | DELIVERED | 18:05
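To see how such a log lets support classify the 500 webhooks from the worked example, here is a rough sketch. It assumes each webhook carries a unique `eventId` and amounts in integer minor units; the `Webhook` shape and `processWebhooks` function are hypothetical, and the batch is synthetic data built to match the 495 / 3 / 2 split.

```typescript
interface Webhook { eventId: string; orderId: number; received: number; expected: number; }
type Outcome = "applied" | "amount_mismatch" | "duplicate";
interface LogEntry { orderId: number; from: string; event: string; to: string; }

function processWebhooks(webhooks: Webhook[]) {
  const seen = new Set<string>();
  const log: LogEntry[] = [];
  const counts: Record<Outcome, number> = { applied: 0, amount_mismatch: 0, duplicate: 0 };
  for (const w of webhooks) {
    if (seen.has(w.eventId)) { counts.duplicate++; continue; } // redelivery
    seen.add(w.eventId);
    if (w.received !== w.expected) { counts.amount_mismatch++; continue; } // guard failed
    log.push({ orderId: w.orderId, from: "PLACED", event: "payment_ok", to: "PAID" });
    counts.applied++;
  }
  return { counts, log };
}

// Synthetic batch: 495 valid, 3 mismatched, 2 redeliveries of already-seen events.
const batch: Webhook[] = [];
for (let i = 0; i < 495; i++) batch.push({ eventId: `evt-${i}`, orderId: i, received: 100, expected: 100 });
for (let i = 0; i < 3; i++) batch.push({ eventId: `bad-${i}`, orderId: 1000 + i, received: 90, expected: 100 });
batch.push(batch[0], batch[1]);

const summary = processWebhooks(batch);
```

Only applied transitions reach the log, so its length equals the number of orders that actually moved to PAID.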
3) Use the state pattern when behavior differs by state, not only labels
Sometimes an enum plus transition methods is enough. Sometimes every operation behaves differently in each state. Then the state pattern becomes useful. Do not use it for fashion. Use it when switches start spreading everywhere.
Naive approach:
switch (ticket.state) {
  case 'OPEN':
    return assign(agentId);
  case 'RESOLVED':
    throw new Error('cannot assign');
  case 'CLOSED':
    throw new Error('cannot assign');
}
State pattern approach:
interface TicketState {
  assign(agentId: string): TicketState;
  resolve(): TicketState;
  reopen(): TicketState;
}
class OpenState implements TicketState { /* ... */ }
class ResolvedState implements TicketState { /* ... */ }
class ClosedState implements TicketState { /* ... */ }
A ticket accepts its widest set of actions in OPEN.
It allows only 2 actions in RESOLVED.
It allows 0 write actions in CLOSED.
State classes express that difference directly.
One giant switch usually hides it.
A quick rule of thumb:
- few states and few behaviors: enum is enough;
- repeated branching across files: state pattern helps;
- many external effects: add transition audit logs too.
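One way the state classes might be filled in is sketched below. This is an assumption-heavy illustration, not the original class bodies: it adds a hypothetical `close()` action and a `name` field so that OPEN allows three actions, RESOLVED two, and CLOSED none.

```typescript
// Hypothetical sketch: close() and name are added for illustration only.
interface TicketState {
  readonly name: string;
  assign(agentId: string): TicketState;
  resolve(): TicketState;
  reopen(): TicketState;
  close(): TicketState;
}

function reject(action: string, state: string): never {
  throw new Error(`cannot ${action} in ${state}`);
}

class OpenState implements TicketState {
  readonly name = "OPEN";
  assign(_agentId: string): TicketState { return this; } // stays OPEN, now with an owner
  resolve(): TicketState { return new ResolvedState(); }
  reopen(): TicketState { return reject("reopen", this.name); }
  close(): TicketState { return new ClosedState(); }
}

class ResolvedState implements TicketState {
  readonly name = "RESOLVED";
  assign(_agentId: string): TicketState { return reject("assign", this.name); }
  resolve(): TicketState { return reject("resolve", this.name); }
  reopen(): TicketState { return new OpenState(); }
  close(): TicketState { return new ClosedState(); }
}

class ClosedState implements TicketState {
  readonly name = "CLOSED";
  assign(_agentId: string): TicketState { return reject("assign", this.name); }
  resolve(): TicketState { return reject("resolve", this.name); }
  reopen(): TicketState { return reject("reopen", this.name); }
  close(): TicketState { return reject("close", this.name); }
}

let ticket: TicketState = new OpenState();
ticket = ticket.assign("agent-7").resolve().close();
```

Each class lists what its state permits, so a reader can see the rules for RESOLVED without hunting through switches in other files.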
4) Workflow engines coordinate long-running steps across many systems
A state machine can live inside one service. A workflow spans time, retries, external callbacks, and compensation. That is the difference. One is local behavior. The other is cross-system progress.
Take an order workflow. Payment, inventory, shipping, and notification may involve four services. One step may fail after ten seconds. Another may wait thirty minutes for a courier callback. A plain HTTP handler is the wrong place to hold all that.
OrderPlaced
    │
    ▼
ChargePayment ──fail──▶ MarkPaymentFailed
    │ success
    ▼
ReserveInventory ──fail──▶ RefundPayment
    │ success
    ▼
CreateShipment ──fail──▶ ReleaseInventory + RefundPayment
    │ success
    ▼
SendConfirmation
workflow.run(() -> {
    payment.charge(orderId);
    inventory.reserve(orderId);
    shipment.create(orderId);
    notifier.send(orderId);
});
If payment_ok arrives twice, the second processing should not create two invoices.
Explicit state plus idempotency keys is a calm combination.
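A minimal sketch of the idempotency-key idea, assuming each `payment_ok` delivery carries a stable key. `InvoiceService` and its in-memory `Set` are hypothetical; production code would persist seen keys so that the check survives restarts.

```typescript
// Hypothetical in-memory store; a real service would persist the seen keys.
class InvoiceService {
  private readonly seen = new Set<string>();
  readonly invoices: string[] = [];

  // Returns true when the event was applied, false when it was a duplicate.
  handlePaymentOk(idempotencyKey: string, orderId: number): boolean {
    if (this.seen.has(idempotencyKey)) return false;
    this.seen.add(idempotencyKey);
    this.invoices.push(`invoice-for-order-${orderId}`);
    return true;
  }
}

const service = new InvoiceService();
const first = service.handlePaymentOk("evt-501-payment", 501);
const second = service.handlePaymentOk("evt-501-payment", 501); // redelivery
```

The second call is recognised by its key and skipped, so only one invoice exists for order 501.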
So what to do?
Use local state machines for service rules, and workflows for long-running orchestration.
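The compensation branches in the order workflow diagram can be sketched as steps that each know how to undo themselves. This is an in-memory illustration only: `Step` and `runWorkflow` are hypothetical names, and a real workflow engine would also persist progress, retry, and wait for callbacks.

```typescript
// In-memory sketch only: no durability, retries, or timers.
interface Step {
  name: string;
  run(): void;         // may throw to signal failure
  compensate?(): void; // undoes this step if a later step fails
}

function runWorkflow(steps: Step[]): { ok: boolean; log: string[] } {
  const log: string[] = [];
  const done: Step[] = [];
  for (const step of steps) {
    try {
      step.run();
      log.push(`${step.name}: success`);
      done.push(step);
    } catch (_err) {
      log.push(`${step.name}: fail`);
      // Undo the completed steps in reverse order.
      for (const prev of done.reverse()) {
        if (prev.compensate) {
          prev.compensate();
          log.push(`${prev.name}: compensated`);
        }
      }
      return { ok: false, log };
    }
  }
  return { ok: true, log };
}

const effects: string[] = [];
const result = runWorkflow([
  { name: "ChargePayment", run: () => { effects.push("charged"); }, compensate: () => { effects.push("RefundPayment"); } },
  { name: "ReserveInventory", run: () => { effects.push("reserved"); }, compensate: () => { effects.push("ReleaseInventory"); } },
  { name: "CreateShipment", run: () => { throw new Error("courier unavailable"); } },
  { name: "SendConfirmation", run: () => { effects.push("notified"); } },
]);
```

When CreateShipment fails, the inventory is released and then the payment is refunded, and SendConfirmation never runs.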
Where this lives in the wild
- A Flipkart supply-chain engineer models order states so warehouse and delivery systems agree on parcel progress.
- An Uber backend engineer controls trip states like requested, matched, started, and completed to block illegal jumps.
- A Zepto operations-platform engineer orchestrates payment, picker assignment, and rider dispatch as one durable workflow.
- A Jira platform engineer at Atlassian uses explicit issue workflows so project boards show only legal transitions.
- A Netflix media-workflow engineer coordinates long-running processing jobs where retries and compensation are central.
Pause and recall
- Why is explicit state safer than many booleans and timestamps?
- What three pieces make a transition complete?
- When is the state pattern more useful than an enum?
- What extra problem does a workflow engine solve?
Interview Q&A
Why explicit state, not derived flags, for order flow?
Because one named state makes legal moves inspectable, testable, and measurable. Derived flags spread meaning across unrelated fields and side effects.
Common wrong answer to avoid: "Booleans are simpler, so they are always the better design."
Why named transition methods, not a generic setStatus()?
Because transition methods can enforce guards, emit events, and protect invariants. A generic setter allows illegal jumps with no business language.
Common wrong answer to avoid: "A setter is more flexible, so domain rules can come later."
Why the state pattern, not one giant switch everywhere?
Because repeated switches duplicate behavior rules and scatter them across files. State objects keep each state's behavior close to its definition.
Common wrong answer to avoid: "Switches are faster, so structure does not matter."
Why a workflow engine, not one synchronous controller method?
Because long waits, retries, and compensation need durable progress beyond one request thread. Workflows survive crashes and resume from the right step.
Common wrong answer to avoid: "Clean code inside one method is enough for long-running orchestration."
Apply now (5 min)
Exercise.
Take a food delivery order and write 6 states, 4 events, 2 guards, and 2 illegal transitions.
Then add one compensation step for payment success followed by inventory failure.
Sketch from memory.
Draw the order line from PLACED to DELIVERED, add a cancellation branch, and mark one idempotent event.
Say aloud which parts belong to a local state machine and which belong to a workflow engine.
Bridge. Once the wiring is explicit, failures still travel through the service, and careless handling can flood every room. → 09-error-handling-and-resilience.md