08. Shared State vs Messages — the architecture fork

~10 min read. One choice shapes debugging, scaling, and cost. Pick wrong and every handoff suffers.

Built on the ELI5 in 00-eli5.md. The handoff — what passes between departments — can flow through a shared bulletin board or through direct memos. Each path has real consequences.

1) Two models — picture first

See the picture first.

  SHARED STATE                    MESSAGE PASSING
  ┌─────────────┐               ┌──────┐    ┌──────┐
  │  shared DB   │               │ Agent│──→ │Agent │
  │  or store    │               │  A   │    │  B   │
  └──┬──┬──┬────┘               └──────┘    └──┬───┘
     │  │  │                                    │
     ▼  ▼  ▼                                    ▼
    A   B   C                              ┌──────┐
  (all read/write)                         │Agent │
                                           │  C   │
                                           └──────┘
                                     (explicit payloads)

Now name the two ideas clearly. Shared state means many agents read and write one common store. That store may be a database, cache, blackboard, or context object. Message passing means agents exchange explicit payloads directly. One agent sends a packet, another receives it, uses it, and replies.

Simple, no? In shared state, everybody can look at the same board. In message passing, every move must fit the memo format. That sounds cleaner, but it also creates new work.
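The two models can be sketched in a few lines of Python. This is a minimal illustration, assuming a plain dictionary stands in for the shared store and frozen dataclasses stand in for the memo format. The names `board`, `ResearchMemo`, and `DraftMemo` are hypothetical and do not come from any framework.

```python
from dataclasses import dataclass

# Shared state: every agent reads and writes the same board.
board = {}


def research_shared(board):
    board["claims"] = ["claim A", "claim B"]


def write_shared(board):
    # The writer depends on "claims" existing, but nothing declares that.
    board["draft"] = "Draft based on " + ", ".join(board["claims"])


# Message passing: each agent takes an explicit memo and returns one.
@dataclass(frozen=True)
class ResearchMemo:
    claims: tuple


@dataclass(frozen=True)
class DraftMemo:
    draft: str


def research_msg() -> ResearchMemo:
    return ResearchMemo(claims=("claim A", "claim B"))


def write_msg(memo: ResearchMemo) -> DraftMemo:
    # The dependency on claims is visible in the function signature.
    return DraftMemo(draft="Draft based on " + ", ".join(memo.claims))


research_shared(board)
write_shared(board)
memo_draft = write_msg(research_msg())
```

Both paths produce the same draft. The difference is where the dependency lives: in a dictionary key that anyone may touch, or in a typed argument.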

Now what is the problem? Architectures do not fail only from bad models. They fail from bad handoff design. If the board is messy, every department reads confusion. If the memo format is vague, the next department still guesses. So the real question is not "Which is modern?" The real question is "Which failure are we choosing to manage?"

Shared state optimizes visibility first. Message passing optimizes interfaces first. Most teams eventually mix both, but you should see the fork clearly before mixing.

2) Shared state — strengths and risks

Start with the attractive part. A shared store gives easy global visibility. One dashboard can show the whole run. One checkpoint can capture progress, errors, and pending work. Long-running workflows like this pattern. An agent can pause, and another can resume later from the same state. That is useful for retries and audits. If a manager agent wakes up late, it can still read history.

See. This is why workflow engines love shared state. This is why many prototypes feel fast in LangGraph-like systems. You sketch one graph. Each node reads and writes the common object. The result feels convenient.

Now what is the problem? Hidden coupling enters quietly. Agent B starts depending on a field Agent A wrote last week, and nobody declared that dependency clearly. A field name changes. A default value changes. Suddenly the downstream agent behaves strangely. Race conditions can also appear. Two agents write near the same time, one update wins, and one update disappears. Or one agent reads stale data and acts on old facts.
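The lost update described above can be reproduced without threads by interleaving two read-then-write sequences by hand. The sketch below first shows the naive store losing Agent A's note, then shows one common mitigation: a version check, often called optimistic concurrency. `VersionedField` and `VersionConflict` are hypothetical names for illustration, and a truly concurrent system would also need the check and the write to happen atomically.

```python
class VersionConflict(Exception):
    """Raised when a writer's snapshot is older than the stored value."""


class VersionedField:
    def __init__(self, value):
        self.value = value
        self.version = 0

    def read(self):
        # Return a copy so callers cannot mutate the stored list in place.
        return self.version, list(self.value)

    def write(self, expected_version, new_value):
        if expected_version != self.version:
            raise VersionConflict(
                f"expected v{expected_version}, store is at v{self.version}"
            )
        self.value = new_value
        self.version += 1


# 1) Naive store: both agents snapshot, then write back. A's note is lost.
naive = {"notes": ["initial"]}
snap_a = list(naive["notes"])
snap_b = list(naive["notes"])
naive["notes"] = snap_a + ["A: premium customer"]
naive["notes"] = snap_b + ["B: refund issued"]  # silently replaces A's write

# 2) Versioned store: B's stale write is rejected instead of winning.
notes = VersionedField(["initial"])
ver_a, val_a = notes.read()
ver_b, val_b = notes.read()
notes.write(ver_a, val_a + ["A: premium customer"])
try:
    notes.write(ver_b, val_b + ["B: refund issued"])
    conflict = False
except VersionConflict:
    conflict = True
    # B re-reads and retries on top of A's update.
    ver_b, val_b = notes.read()
    notes.write(ver_b, val_b + ["B: refund issued"])
```

The version check does not prevent contention. It turns a silent loss into a loud, retryable error.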

Debugging then becomes painful. You do not ask only, "Why is the answer bad?" You ask, "Who wrote this garbage?" That question costs hours.

Shared state is powerful when the schema is disciplined. It is dangerous when the schema is casual. So what to do? Use shared state for durable workflow records, checkpoints, and monitoring. Do not treat it like a magical dumping ground. Give ownership to fields. Version important structures. Track who wrote what and when. Otherwise the shared board becomes a rumor wall.
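Field ownership and write tracking can be sketched as a thin wrapper around the store. This is one possible shape, not a standard API: `OwnedState` is a hypothetical class, and the owner map is an assumption about how a team might declare who may write each field. Versioning is left out to keep the example short.

```python
from datetime import datetime, timezone


class OwnedState:
    def __init__(self, owners):
        self._owners = dict(owners)  # field name -> the one agent allowed to write it
        self._data = {}
        self.write_log = []          # who wrote what and when

    def write(self, agent, field, value):
        owner = self._owners.get(field)
        if owner != agent:
            raise PermissionError(
                f"{agent!r} cannot write {field!r}; owner is {owner!r}"
            )
        self._data[field] = value
        self.write_log.append(
            {
                "agent": agent,
                "field": field,
                "at": datetime.now(timezone.utc).isoformat(),
            }
        )

    def read(self, field):
        return self._data[field]


state = OwnedState({"claims": "research", "draft": "writer"})
state.write("research", "claims", ["claim 1", "claim 2"])
state.write("writer", "draft", "Article text")

try:
    state.write("writer", "claims", "short summary")  # not the owner
    blocked = False
except PermissionError:
    blocked = True
```

With this in place, "Who wrote this garbage?" becomes a lookup in `write_log` rather than an investigation.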

3) Message passing — strengths and risks

Now flip the picture. Message passing forces explicit interfaces. Agent A must say what it is sending, and Agent B must say what it expects. That discipline is healthy. A message can be replayed later, audited, and tested in isolation. If a bug appears, you inspect the payload trail. That is much easier than scanning one giant mutable object.

See. Good message passing makes the handoff concrete. It turns hidden assumptions into named fields. It also improves replacement. You can swap one department for another if the payload contract stays stable. This is why AutoGen-style systems feel natural for conversation-first collaboration. This is also why Kafka-backed agent systems can replay event streams cleanly. Every handoff is a durable event.
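Replay and isolated testing can be shown with an in-memory log. The sketch below assumes a Python list stands in for a durable stream. It does not use the Kafka or AutoGen APIs, and `send`, `event_log`, and `reviewer` are hypothetical names.

```python
import json

event_log = []  # in-memory stand-in for a durable stream


def send(sender, receiver, payload):
    # Serialize on the way out so the log holds exactly what the receiver saw.
    record = json.dumps(
        {"from": sender, "to": receiver, "payload": payload}, sort_keys=True
    )
    event_log.append(record)
    return json.loads(record)["payload"]


def reviewer(payload):
    issues = [] if payload["sources_used"] else ["no sources cited"]
    return {"verdict": "fail" if issues else "pass", "issues": issues}


delivered = send(
    "writer", "reviewer", {"draft": "Article text", "sources_used": ["s1"]}
)
first_result = reviewer(delivered)

# Later: replay the reviewer against the logged payload, in isolation.
replayed_result = reviewer(json.loads(event_log[0])["payload"])
```

Because the reviewer depends only on its payload, the replay needs no other agent and no shared store.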

Simple, no? Now what is the problem? Serialization adds work. You must pack the context, and you must unpack the context. If you are lazy, you overstuff the message with whole transcripts. Then token cost rises, latency rises, and clarity still does not improve. Duplication is another issue. The same facts may appear in several messages. Global visibility can also become harder. A human operator may ask, "Show me the entire workflow state now." The answer is not one neat object anymore. It is a chain of events and local views.

So what to do? Keep messages self-contained for the next decision, not for all future history. Summarize hard. Pass links or IDs when full payloads are unnecessary. Add tracing so the event trail stays searchable. Message passing rewards discipline. Without discipline, it becomes verbose chaos.
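Passing an ID instead of a full payload might look like the sketch below. It assumes a dictionary stands in for an artifact store, and the message fields (`task`, `summary`, `transcript_id`) are illustrative choices, not a standard format.

```python
import hashlib
import json

artifact_store = {}  # hypothetical store; a real system might use a database


def put_artifact(text):
    # Key the artifact by a short content hash so the same text maps to one ID.
    artifact_id = hashlib.sha256(text.encode("utf-8")).hexdigest()[:12]
    artifact_store[artifact_id] = text
    return artifact_id


transcript = "user: where is my refund?\n" * 200  # pretend this is a long chat log

message = {
    "task": "review_refund_decision",
    "summary": "Customer asks about a pending refund; policy check passed.",
    "transcript_id": put_artifact(transcript),  # a pointer, not the transcript
}

message_size = len(json.dumps(message))
transcript_size = len(transcript)
```

The receiver gets enough to make the next decision and can fetch the transcript by ID only if the summary turns out to be insufficient.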

4) Worked example — same workflow, two architectures

Take one concrete workflow. Research. Write. Review. Publish. Same business goal. Two different architectures.

Look at the shared state version first.

Research ─┐
Write    ─┼─> workflow_state
Review   ─┤
Publish  ─┘

All agents read and write one shared workflow_state object.
- Research writes state.claims = [...]
- Writer reads state.claims and writes state.draft = "..."
- Reviewer reads state.draft and writes state.review = { verdict: "pass" }
- Publisher reads state.draft plus state.review

Very convenient. One place to inspect. One place to resume. Now the failure. The writer accidentally overwrites state.claims with a short summary. No exception fires. No contract screams. Reviewer later uses the summary, not the original claims. The published article looks polished but loses evidence. That is silent data corruption.
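The failure is easy to reproduce. The sketch below assumes a plain dictionary for workflow_state and deliberately includes the overwrite bug. The `checked_against` field is added only to make visible what the reviewer actually read.

```python
state = {}


def research(state):
    state["claims"] = [
        {"text": "Claim 1", "source": "https://example.com/a"},
        {"text": "Claim 2", "source": "https://example.com/b"},
    ]


def writer(state):
    state["draft"] = "Polished article text."
    # The bug: the writer "tidies up" the claims into a short summary.
    state["claims"] = "Two claims about the topic."


def reviewer(state):
    # Nothing forces "claims" to keep its shape, so the reviewer passes the
    # draft against whatever it finds there.
    state["review"] = {"verdict": "pass", "checked_against": state["claims"]}


research(state)
writer(state)
reviewer(state)
```

Every function ran without error, the verdict is "pass", and the source URLs are gone from the run.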

Now see the message-passing version.

Research -> { claims: [...] } -> Writer
Writer   -> { draft: "...", sources_used: [...] } -> Reviewer
Reviewer -> { verdict: "pass", issues: [] } -> Publisher

Each agent receives explicit input and returns explicit output. That is cleaner for audit. You can inspect the memo format at each step. You can replay Writer with the same research packet. You can replay Reviewer with the same draft packet.

Now what is the problem? The reviewer may need original claims, not only the draft. But the draft packet does not include them. So Reviewer must re-request the claims or Writer must include them. That feels more annoying, but the missing need is visible. In shared state, the dependency stayed implicit. In message passing, the missing field becomes a design decision.
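One way to resolve that decision is to have Writer include the claims in its packet. The sketch below assumes frozen dataclasses as the memo format, and the packet names are hypothetical. Once `claims` is a required field, forgetting it fails at construction time instead of surfacing later as a bad review.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ResearchPacket:
    claims: tuple


@dataclass(frozen=True)
class DraftPacket:
    draft: str
    sources_used: tuple
    claims: tuple  # added because Reviewer needs the original claims


@dataclass(frozen=True)
class ReviewPacket:
    verdict: str
    issues: tuple


def writer(packet: ResearchPacket) -> DraftPacket:
    draft = "Article covering: " + "; ".join(packet.claims)
    return DraftPacket(draft=draft, sources_used=("source-1",), claims=packet.claims)


def reviewer(packet: DraftPacket) -> ReviewPacket:
    # Toy check: every original claim must appear somewhere in the draft.
    unsupported = tuple(c for c in packet.claims if c not in packet.draft)
    return ReviewPacket(verdict="fail" if unsupported else "pass", issues=unsupported)


review = reviewer(writer(ResearchPacket(claims=("claim 1", "claim 2"))))

# A writer that forgets the claims cannot even build the packet.
try:
    DraftPacket(draft="Article text", sources_used=())
    rejected = False
except TypeError:
    rejected = True
```

The packaging cost is real: the claims now travel twice. In exchange, the reviewer's need is written down in the contract.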

See the trade-off. Shared state hides some needs until runtime surprises you. Messages surface some needs earlier, but at packaging cost. So what do most real systems do? They mix both. Use shared state for durable workflow records, retries, status, and dashboards. Use explicit messages for precise handoffs between steps. That hybrid model is common because products need both memory and clarity.
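A hybrid can be sketched as a small runner: steps talk to each other only through explicit payloads, while the runner keeps a shared run record for status and history. This is an illustrative shape under simple assumptions (in-memory record, no retries), and `run_record` and `run_step` are hypothetical names.

```python
run_record = {"run_id": "run-001", "status": {}, "history": []}


def run_step(name, handler, payload):
    # Only the runner writes the shared record; handlers never touch it.
    run_record["status"][name] = "running"
    result = handler(payload)
    run_record["status"][name] = "done"
    run_record["history"].append({"step": name, "input": payload, "output": result})
    return result


def research(_payload):
    return {"claims": ["claim 1", "claim 2"]}


def write(payload):
    return {"draft": " / ".join(payload["claims"]), "claims": payload["claims"]}


research_out = run_step("research", research, {})
write_out = run_step("write", write, research_out)
```

An operator can read `run_record` for the whole run, while each handler stays testable with nothing but its input payload.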

5) Memory layers in multi-agent systems

One more layer is easy to miss. State is not only one thing. There are three memory layers.
1. Working memory — current step context
2. Workflow memory — current run state
3. Long-term memory — persistent facts across runs

Working memory should stay small. It is what the active agent needs right now. Workflow memory is where shared state often lives: task status, artifacts, and intermediate outputs. Long-term memory is different: user preferences, trusted facts, and past outcomes worth reusing.
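The three layers can be sketched as two stores plus a function that assembles working memory for one step. The field names and the `STEP_NEEDS` table are assumptions made for illustration. The point is only that working memory is selected from the other layers, not copied wholesale.

```python
# Long-term memory: persists across runs.
long_term_memory = {"user_prefs": {"tone": "formal"}}

# Workflow memory: state of the current run.
workflow_memory = {
    "status": {"research": "done", "write": "done", "review": "pending"},
    "claims": ["claim 1", "claim 2"],
    "draft": "Article text",
    "tool_logs": ["log line"] * 200,  # bulky, rarely needed by the next agent
}

# Which workflow fields each step actually needs (hypothetical table).
STEP_NEEDS = {"review": ("draft", "claims")}


def build_working_memory(step):
    # Working memory: only what the active agent needs right now.
    context = {field: workflow_memory[field] for field in STEP_NEEDS[step]}
    context["tone"] = long_term_memory["user_prefs"]["tone"]
    return context


working_memory = build_working_memory("review")
```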

See. Teams confuse these layers and create expensive systems. If every agent reads full history every time, token cost explodes. Latency also grows. And the agent still misses the important part. Why? Because more context is not the same as better context. Memory management is summarization management. You decide what survives each handoff. You decide which facts stay local, which facts become workflow state, and which facts deserve long-term storage.
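A rough arithmetic sketch shows why full-history reads get expensive. It assumes every step adds the same number of tokens and that the summarized variant reads a fixed-size summary. The numbers (10 steps, 1,000 tokens per step, 300-token summary) are made up for illustration and are not measurements.

```python
def full_history_tokens(steps, tokens_per_step):
    # Step k reads its own input plus everything from the k-1 steps before it.
    return sum(k * tokens_per_step for k in range(1, steps + 1))


def summarized_tokens(steps, tokens_per_step, summary_tokens):
    # Each step reads a fixed-size summary plus its own input.
    return steps * (summary_tokens + tokens_per_step)


full = full_history_tokens(steps=10, tokens_per_step=1000)
summarized = summarized_tokens(steps=10, tokens_per_step=1000, summary_tokens=300)
```

Under these assumptions the full-history run reads 55,000 tokens and the summarized run reads 13,000. The first total grows with the square of the step count, the second grows linearly.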

Simple, no? So what to do? Keep working memory narrow and task-shaped. Keep workflow memory structured and inspectable. Keep long-term memory selective and verified. Do not let every agent drag the whole past behind it. A good multi-agent system is not one with maximum memory. It is one with the right forgetting.

Where this lives in the wild

LangGraph customer-support workflow — AI engineer: nodes often read and write one shared state object, which makes prototyping and checkpointing fast.
AutoGen research assistant — applied AI engineer: conversable agents mostly exchange explicit messages, so handoffs stay visible in the chat history.
Kafka-based investigation pipeline — platform engineer: agents publish events to durable streams, which makes replay, retries, and audit easier.
Enterprise claims-processing engine — workflow architect: agents read and write database status tables, so operators can inspect one durable run record.
Slack incident bot — developer productivity engineer: the system mixes shared channel history with direct component-to-component messages for targeted actions.

Pause and recall

Why does shared state feel easy early, then painful later?
Why does message passing improve auditability but still add cost?
In the Research → Write → Review → Publish flow, what exactly broke in the shared-state version?
Why do strong systems often mix durable shared state with explicit messages?

Interview Q&A

Q: Why choose message passing instead of shared state for critical handoffs?
A: Because explicit payloads create clearer contracts, replayable traces, and easier auditing when failures matter.
Common wrong answer to avoid: "Because message passing is always more scalable" — scale depends on workload, payload size, and coordination design.

Q: Why not put everything in one shared workflow object if all agents need visibility?
A: Because global visibility can become hidden coupling, stale reads, and unclear ownership unless the schema is tightly governed.
Common wrong answer to avoid: "Because shared state is bad practice" — it is useful for checkpoints, dashboards, and durable run records.

Q: Why do many senior systems use both shared state and messages instead of choosing one?
A: Because shared state is great for durable workflow memory, while messages are better for precise step-to-step contracts.
Common wrong answer to avoid: "Because architects like complexity" — the mix exists to separate durability from handoff clarity.

Q: Why is memory design really a cost and latency question too?
A: Because every extra field read, every oversized payload, and every repeated history fetch increases tokens, time, and debugging load.
Common wrong answer to avoid: "Because larger context always improves quality" — beyond a point, it only increases noise and cost.

Apply now (5 min)

Exercise: Take one workflow you know. Mark which facts belong in working memory, workflow memory, and long-term memory. Then choose one handoff and write its payload fields explicitly.

Sketch from memory: Draw the same workflow twice. First as a shared board with agents reading and writing it. Then as direct messages between agents. Circle where hidden coupling may appear. Circle where duplicated context may appear.

Bridge. State and messages are clear. But every extra agent, every handoff, every retry costs tokens and time. Multi-agent systems multiply cost. Next: how to keep the whole thing affordable and fast enough for real products. → 09-cost-latency-multiagent.md