06. LangGraph deep dive: turning the control plane into code

~20 min read. The previous five files built vocabulary for workflows: graphs, steps, executors, patterns, state layers. LangGraph is a widely adopted framework that turns those abstractions into running Python. This file maps every concept we've discussed to concrete LangGraph primitives (StateGraph, nodes, edges, reducers, TypedDict schemas, and checkpointers such as MemorySaver), then stress-tests those primitives on the loan-approval workflow.

Built on the first-principles overview in 00-first-principles.md. The workflow graph becomes a StateGraph. The durable checkpoint becomes a checkpointer backend. The handoff contract becomes a TypedDict state schema. The pressure is coordination cost: every abstraction LangGraph adds (conditional edges, reducer functions, checkpoint serialisation) costs runtime and cognitive overhead — worth it only when you need explicit control-flow visibility, crash recovery, or multi-actor coordination.

What files 01–05 established and what remains

We have a language for orchestration: a control plane dispatches steps along a workflow graph; each step has a typed input contract; state is layered and scoped; patterns (sequential, parallel, DAG, conditional) shape execution. All of that was framework-agnostic. The gap: we haven't yet shown how these ideas map to a concrete execution engine. LangGraph is that mapping for Python-first teams.

The specific question this file answers: When I define a StateGraph, add nodes, add edges, and call .compile(), what is happening under the hood — and where does each piece of our vocabulary land?

The loan-approval workflow in 40 lines of LangGraph

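from typing import Literal, Optional, TypedDict

from langgraph.checkpoint.memory import MemorySaver
from langgraph.graph import StateGraph, END

# The node functions and route_after_compliance are defined later in this file.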

class LoanState(TypedDict):
    applicant_id: str
    identity_verified: bool
    credit_score: Optional[int]
    compliance_flag: Optional[Literal["pass", "review", "fail"]]
    decision: Optional[str]
    human_override: Optional[str]

graph = StateGraph(LoanState)

graph.add_node("verify_identity", verify_identity_node)
graph.add_node("pull_credit", pull_credit_node)
graph.add_node("compliance_check", compliance_check_node)
graph.add_node("human_review", human_review_node)
graph.add_node("issue_decision", issue_decision_node)

graph.set_entry_point("verify_identity")
graph.add_edge("verify_identity", "pull_credit")
graph.add_edge("pull_credit", "compliance_check")
graph.add_conditional_edges(
    "compliance_check",
    route_after_compliance,  # returns "human_review" or "issue_decision"
    {"human_review": "human_review", "issue_decision": "issue_decision"},
)
graph.add_edge("human_review", "issue_decision")
graph.add_edge("issue_decision", END)

app = graph.compile(checkpointer=MemorySaver())

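Every concept from files 01–05 has a line of code here: the graph is StateGraph, the handoff contract is LoanState, the conditional edge is the routing function, and the checkpointer persists state between steps. MemorySaver keeps that state in process memory, so it suits testing; swap in a database-backed saver when state must survive a crash. The rest of this file unpacks each piece.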

Teacher voice. LangGraph doesn't invent new orchestration ideas. It gives our existing vocabulary — graphs, typed state, conditional edges, durable checkpoints — a standard Python API. The design quality still depends on decomposition, routing, and state management choices made before you write the graph code.

Concept mapping: our vocabulary → LangGraph primitives

Module vocabulary	LangGraph primitive	What it does
the workflow graph	`StateGraph`	Declares nodes and edges as a directed graph (cycles are allowed)
step	Node (a function)	Receives state, returns partial update
handoff contract	`TypedDict` schema	Types the state every node reads/writes
conditional edge	`add_conditional_edges`	Routing function chooses the next node
durable checkpoint	Checkpointer (`MemorySaver`, `PostgresSaver`)	Persists full state after each node
approval gate	`interrupt_before` / `interrupt_after`	Pauses graph, waits for external input
dispatch loop	`.invoke()` / `.stream()`	Runs the graph to completion or interruption
state layer	Reducer + annotation	Controls how concurrent writes merge

This table is the Rosetta Stone. When the rest of this file says "node," it means a Python function registered with add_node. When it says "checkpoint," it means the serialised state blob written by the checkpointer after that node returns.

Nodes: bounded units with typed contracts

A LangGraph node is a function that takes the current state and returns a partial state update. "Partial" is critical — you return only the fields you changed, not the entire schema.

def verify_identity_node(state: LoanState) -> dict:
    result = identity_service.verify(state["applicant_id"])
    return {"identity_verified": result.success}

This is the handoff contract in action. The node reads applicant_id from state, calls an external service, and writes back one field. It doesn't touch credit_score or compliance_flag. That field-level isolation prevents the "dump everything" anti-pattern from file 05.

Design constraints for good nodes:
- One bounded unit of work (single responsibility)
- Declare which fields are read (even if Python doesn't enforce this at the type level)
- Return only mutated fields
- Idempotent where possible: if the graph resumes and re-executes this node, will it double-charge, double-email, or double-create?

┌──────────────────────────────────────────┐
│ Node contract                            │
│                                          │
│  input:  reads applicant_id              │
│  effect: calls identity_service          │
│  output: writes identity_verified        │
│  idempotent: yes (verify is read-only)   │
└──────────────────────────────────────────┘
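The idempotency question in the constraints above can be made concrete with a guard keyed on a stable identifier. This is a framework-free sketch, and every name in it (send_decision_email_node, sent_keys, outbox) is hypothetical; a real deployment would keep the key set in a durable store rather than in process memory.

```python
from typing import TypedDict


class LoanState(TypedDict, total=False):
    applicant_id: str
    decision: str
    email_sent: bool


# Hypothetical stand-ins: a real system would use a durable store and a mail API.
sent_keys: set[str] = set()
outbox: list[str] = []


def send_decision_email_node(state: LoanState) -> dict:
    """Send the decision email at most once per (applicant, decision) pair."""
    key = f"{state['applicant_id']}:{state['decision']}"
    if key not in sent_keys:
        outbox.append(key)   # the side effect
        sent_keys.add(key)   # record it so a replay becomes a no-op
    return {"email_sent": True}


state: LoanState = {"applicant_id": "A-17", "decision": "approved"}
first = send_decision_email_node(state)
replay = send_decision_email_node(state)  # simulates re-execution after a resume
```

Both calls return the same partial update, but only the first one performs the side effect, which is the property a resumed graph relies on.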

Mini-FAQ. "What if my node needs to write many fields?" That's fine — return them all in the dict. The concern isn't output count; it's whether each field is intentional. The anti-pattern is returning state | {"new_field": x} — passing the entire state through unchanged — because it defeats diff-based merging.
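To see why passing the whole state through is harmful, here is a simplified model of field-level merging. It is not LangGraph's implementation: apply_update and the REDUCERS table are hypothetical names, and the only assumption carried over from the framework is that a field with an append-style reducer combines the old value with the returned one.

```python
from operator import add
from typing import Any, Callable

# Hypothetical reducer table: field name -> merge function. Fields without an
# entry are simply overwritten by the update.
REDUCERS: dict[str, Callable[[Any, Any], Any]] = {"sources": add}


def apply_update(state: dict, update: dict) -> dict:
    """Merge a node's returned dict into state, field by field."""
    merged = dict(state)
    for key, value in update.items():
        if key in REDUCERS and key in merged:
            merged[key] = REDUCERS[key](merged[key], value)
        else:
            merged[key] = value
    return merged


state = {"sources": ["doc1"], "final_summary": None}

# Partial update: only the new item is returned, so it is appended once.
good = apply_update(state, {"sources": ["doc2"]})

# Pass-through anti-pattern: the node returns the whole state plus its change,
# so the existing list is appended to itself.
bad = apply_update(state, state | {"final_summary": "done"})
```

In this model the pass-through call leaves sources holding "doc1" twice, even though the node only meant to set final_summary.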

Edges and conditional routing

Fixed edges are simple: add_edge("A", "B") means "after A, always go to B." Conditional edges are where orchestration logic lives.

def route_after_compliance(state: LoanState) -> str:
    if state["compliance_flag"] == "review":
        return "human_review"
    return "issue_decision"

This is the routing function from file 03, made concrete. The function receives the full state and returns a string naming the next node. LangGraph matches that string against the mapping you provide.

The critical design rule: routing functions must be deterministic given the same state. If you call an LLM inside a routing function, you've pushed stochastic decisions into the control plane — making the graph unreplayable and hard to test.

deterministic routing:
  state["compliance_flag"] == "review"  →  human_review
  state["compliance_flag"] != "review"  →  issue_decision

stochastic routing (dangerous):
  llm("should we review?", context=state)  →  unpredictable

Routing approach	Testable?	Replayable?	Auditable?
Deterministic function on state fields	✅	✅	✅
LLM call in routing function	❌ mostly	❌	❌ mostly
Hardcoded edge (no condition)	✅ trivially	✅	✅
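Because a deterministic routing function is a pure function of state, it can be tested without building or running a graph. The sketch below repeats route_after_compliance from this section and checks it with a small table; the trimmed LoanState and the test table are illustrative assumptions.

```python
from typing import Literal, Optional, TypedDict


class LoanState(TypedDict, total=False):
    compliance_flag: Optional[Literal["pass", "review", "fail"]]


def route_after_compliance(state: LoanState) -> str:
    if state["compliance_flag"] == "review":
        return "human_review"
    return "issue_decision"


# Table-driven check: each flag value maps to exactly one next node.
cases = {
    "pass": "issue_decision",
    "review": "human_review",
    "fail": "issue_decision",
}
results = {flag: route_after_compliance({"compliance_flag": flag}) for flag in cases}

# Determinism check: repeated calls on the same state give the same answer.
repeat = {route_after_compliance({"compliance_flag": "review"}) for _ in range(100)}
```

The table also makes visible that "fail" flows straight to issue_decision, which is what the routing function above does; whether that is the intended policy is a design question the test puts in front of a reviewer.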

State schemas and reducers: controlling concurrent writes

When a graph has parallel branches (file 04's parallel pattern), two nodes may write to the same state field simultaneously. LangGraph handles this through reducers — annotation functions that define how concurrent writes merge.

from typing import Annotated
from operator import add

class ResearchState(TypedDict):
    sources: Annotated[list[str], add]  # append-merge
    final_summary: Optional[str]         # last-write-wins (default)

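The add reducer means: if branch A writes sources: ["doc1"] and branch B writes sources: ["doc2"], the merged state is sources: ["doc1", "doc2"]. Without a reducer, a field holds only one write, so the other branch's output cannot survive: the lost-update problem from file 05. LangGraph surfaces the parallel case loudly: when two nodes write a reducer-less field in the same step, the run fails with InvalidUpdateError instead of silently keeping one value. Writes in different steps simply overwrite.

Without reducer (one value per field):
  branch A writes sources: ["doc1"]
  branch B writes sources: ["doc2"]
  same step: InvalidUpdateError; later step: sources: ["doc2"]  ← doc1 lost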

With add reducer:
  branch A writes sources: ["doc1"]
  branch B writes sources: ["doc2"]
  final state: sources: ["doc1", "doc2"]  ← both preserved

Choosing the wrong reducer is a silent corruption bug. The graph compiles, runs, and produces output — but merged state is wrong. This is why file 05's invariant ("each step receives only the state it declared as input") matters even more in LangGraph: reducers must match the semantic intent of the field.
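A small framework-free illustration of that silent corruption, assuming two parallel branches that each report a value for the same field. The merge helper and the sample values are hypothetical; the point is only that both reducers run without error and just one of them matches the field's meaning.

```python
from functools import reduce
from operator import add


def merge(writes: list, reducer):
    """Fold concurrent branch writes into one value with the given reducer."""
    return reduce(reducer, writes)


# Two parallel branches each report a risk score for the same applicant.
branch_writes = [70, 40]

summed = merge(branch_writes, add)    # a valid int, but not a meaningful score
highest = merge(branch_writes, max)   # "worst score seen" semantics

# For list fields the same choice appears as append versus de-duplicated union.
source_writes = [["doc1", "doc2"], ["doc2", "doc3"]]
appended = merge(source_writes, add)
unioned = merge(source_writes, lambda a, b: a + [x for x in b if x not in a])
```

Nothing fails in either variant, which is why the reducer for each field is worth stating next to the field's definition and covering with a test.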

Checkpointing: crash recovery as a framework feature

Every time a node completes, LangGraph's checkpointer serialises the state to a backend store, keyed by a thread_id. If the process crashes, re-invoking the graph with the same thread_id against a durable backend resumes from the last checkpoint; completed nodes are not re-executed.

Node execution timeline with checkpointing:

verify_identity → ✓ checkpoint 1 written
pull_credit     → ✓ checkpoint 2 written
compliance_check→ ✓ checkpoint 3 written
human_review    → ⏸ interrupt (waiting for human)  ← checkpoint 4 written
                        ...6 hours pass...
                  → ✓ human approves, graph resumes from checkpoint 4
issue_decision  → ✓ checkpoint 5 written → END
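The resume behaviour in the timeline can be modelled without the framework. The runner below is a deliberately small sketch, not how LangGraph is implemented: run, checkpoints, calls, and the fake bureau flag are all hypothetical, and the store is an in-memory dict standing in for a durable backend.

```python
import json
from typing import Callable

# Hypothetical checkpoint store: thread_id -> (next step index, JSON state).
checkpoints: dict[str, tuple[int, str]] = {}
calls: list[str] = []          # records which steps actually executed
bureau = {"up": False}         # toggles the simulated credit bureau outage

Step = tuple[str, Callable[[dict], dict]]


def run(thread_id: str, steps: list[Step], initial: dict) -> dict:
    """Run steps in order, checkpointing after each; resume if a checkpoint exists."""
    if thread_id in checkpoints:
        index, blob = checkpoints[thread_id]
        state = json.loads(blob)
    else:
        index, state = 0, dict(initial)
    for i in range(index, len(steps)):
        name, fn = steps[i]
        calls.append(name)
        state = {**state, **fn(state)}  # merge the partial update
        checkpoints[thread_id] = (i + 1, json.dumps(state))
    return state


def verify_identity(state: dict) -> dict:
    return {"identity_verified": True}


def pull_credit(state: dict) -> dict:
    if not bureau["up"]:
        raise RuntimeError("503 from credit bureau")
    return {"credit_score": 712}


steps: list[Step] = [("verify_identity", verify_identity), ("pull_credit", pull_credit)]

try:
    run("loan-1", steps, {"applicant_id": "A-17"})
except RuntimeError:
    pass  # the run dies; the checkpoint after verify_identity is the last good state

bureau["up"] = True
final = run("loan-1", steps, {"applicant_id": "A-17"})  # resumes at pull_credit
```

The calls list shows verify_identity executing once and pull_credit twice, which is the same shape as the timeline: completed steps are skipped, the failed step is retried from the checkpointed state.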

The checkpointer backends:
- MemorySaver: in-memory dict, good for testing, lost on crash
- SqliteSaver: local file, survives process restart
- PostgresSaver: production-grade, supports multi-instance deployments
- Custom: any backend implementing the BaseCheckpointSaver interface

Cost awareness: Each checkpoint serialises the entire TypedDict state. If your state contains 50KB of raw documents (violating file 05's compression guidance), every checkpoint writes 50KB. Ten nodes = 500KB per workflow run. At 10,000 concurrent workflows = 5GB of checkpoint storage. State compression isn't optional at scale.

Checkpointer	Durability	Multi-instance	Latency per write
MemorySaver	None (testing only)	No	<1ms
SqliteSaver	Process restart	No	2–5ms
PostgresSaver	Full	Yes	5–15ms
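The cost-awareness arithmetic can be checked with a few lines. This is an upper-bound estimate under stated assumptions: every checkpoint stores the full state, state size stays constant across nodes, every checkpoint is retained, and units are decimal (1 KB = 1,000 bytes). The function name is hypothetical.

```python
def checkpoint_storage_bytes(state_bytes: int, nodes: int, workflows: int) -> int:
    """Upper-bound estimate: one full-state checkpoint per node per workflow."""
    return state_bytes * nodes * workflows


KB = 1_000
GB = 1_000_000_000

per_run = checkpoint_storage_bytes(50 * KB, nodes=10, workflows=1)
fleet = checkpoint_storage_bytes(50 * KB, nodes=10, workflows=10_000)

# Same fleet after compressing state to 5 KB per checkpoint.
compressed = checkpoint_storage_bytes(5 * KB, nodes=10, workflows=10_000)
```

Under these assumptions, shrinking state by a factor of ten shrinks checkpoint storage by the same factor, since the other two terms are fixed by the workflow shape and the load.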

Interrupts: the approval gate made concrete

File 08 will cover human-in-the-loop design in depth. Here's the LangGraph mechanism: interrupt_before and interrupt_after pause execution and persist state.

app = graph.compile(
    checkpointer=PostgresSaver(conn),
    interrupt_before=["human_review"],  # pause BEFORE this node runs
)

When the graph reaches human_review, it writes a checkpoint and returns control to the caller. The workflow is now paused. Hours or days later, a human provides input and the caller resumes:

# Resume with human input: update_state takes the thread's config, not a bare id
config = {"configurable": {"thread_id": thread_id}}
app.update_state(config, {"human_override": "approved"})
result = app.invoke(None, config=config)

The key insight: the interrupt is not an error. It's a designed pause point — the approval gate from our vocabulary. The checkpoint captures the exact pre-pause state, making resume deterministic regardless of how long the human takes.

Threaded example: loan-approval under failure

Return to the loan-approval graph. Suppose pull_credit calls an external bureau API that returns HTTP 503. What happens?

Without retry logic in the node:

verify_identity → ✓ checkpoint 1
pull_credit     → ✗ raises exception
                → graph fails, checkpoint 1 is the last good state

The graph can be resumed from checkpoint 1. verify_identity won't re-execute (already checkpointed). pull_credit retries from the same state.

With retry logic inside the node:

import time

# credit_bureau and ServiceUnavailable are placeholders for your bureau client.
def pull_credit_node(state: LoanState) -> dict:
    for attempt in range(3):
        try:
            score = credit_bureau.pull(state["applicant_id"])
            return {"credit_score": score}
        except ServiceUnavailable:
            if attempt == 2:
                raise  # let graph-level recovery handle it
            time.sleep(2 ** attempt)

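With a fallback edge (this replaces the fixed pull_credit → compliance_check edge, and it only fires if the node returns credit_score: None when retries are exhausted instead of re-raising):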

graph.add_conditional_edges(
    "pull_credit",
    lambda s: "fallback_scoring" if s.get("credit_score") is None else "compliance_check",
    {"fallback_scoring": "fallback_scoring", "compliance_check": "compliance_check"},
)

This is the pattern hierarchy from file 04 made concrete: retry inside the node for transient failures, fallback edges for persistent failures, graph-level resume for crashes.
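The first two levels of that hierarchy can be exercised together without a graph. The sketch below is an assumption-heavy variant of pull_credit_node: make_pull_credit_node, route_after_credit, always_down, and the local ServiceUnavailable class are hypothetical names, the sleep function is injected so the example runs instantly, and the node returns credit_score: None after the last attempt so the routing function can choose the fallback.

```python
from typing import Callable


class ServiceUnavailable(Exception):
    """Hypothetical error type standing in for an HTTP 503 from the bureau."""


def make_pull_credit_node(pull: Callable[[str], int],
                          sleep: Callable[[float], None],
                          attempts: int = 3) -> Callable[[dict], dict]:
    """Build a node that retries with backoff, then reports None instead of raising."""
    def node(state: dict) -> dict:
        for attempt in range(attempts):
            try:
                return {"credit_score": pull(state["applicant_id"])}
            except ServiceUnavailable:
                if attempt < attempts - 1:
                    sleep(2 ** attempt)   # back off before the next attempt
        return {"credit_score": None}     # persistent failure: let the edge decide
    return node


def route_after_credit(state: dict) -> str:
    return "fallback_scoring" if state.get("credit_score") is None else "compliance_check"


def always_down(applicant_id: str) -> int:
    raise ServiceUnavailable(applicant_id)


delays: list[float] = []
node = make_pull_credit_node(always_down, sleep=delays.append)
update = node({"applicant_id": "A-17"})
next_node = route_after_credit(update)
```

Returning None rather than raising is a design choice: it turns a persistent outage into ordinary state that a deterministic edge can route on, while crashes are still left to graph-level resume.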

When LangGraph is the wrong tool

LangGraph adds real overhead: state serialisation on every step, schema coupling across all nodes, reducer complexity for parallel branches, checkpoint storage cost. That overhead pays for itself only when you need:

Explicit control-flow visibility (audit, compliance)
Crash recovery for long-running workflows (minutes to hours)
Human-in-the-loop pause and resume
Multi-step workflows with conditional branching
Replay and debugging from intermediate state

When you don't need these: a simple function chain, a single LLM call, or a linear pipeline of prompt → tool → format is lighter and faster.

Use LangGraph when:                    Use plain functions when:
├── steps > 3                          ├── steps ≤ 3
├── any step can fail and must resume  ├── stateless transformations
├── human gates exist                  ├── no pause/resume needed
├── audit trail required               ├── no audit requirement
└── parallel branches with merge       └── linear, no branching

Teacher voice. "Should I use LangGraph?" is a control-plane question, not a model question. If your workflow needs explicit edges, durable state, and resume semantics, yes. If it's a single-shot generation, the framework is overhead without value.

Operational signals: healthy graph, degrading graph, broken graph

Healthy behaviour:
- Checkpoint writes complete in <20ms (PostgresSaver)
- State size per checkpoint stays under 10KB
- Conditional edges resolve deterministically on replay
- Human interrupts resume within expected SLA

First degrading signals:
- Checkpoint write latency climbing above 50ms → state bloat or backend pressure
- Graph invocations timing out → likely a node with an unbounded external call
- Reducer conflicts appearing in logs → parallel branches writing the same field without a proper merge

Misleading metrics:
- "Graph compile time": teams optimise for faster .compile() when the real cost is per-node execution and checkpoint serialisation
- "Number of nodes": more nodes ≠ worse. The issue is node granularity (too coarse = opaque, too fine = overhead)

Expert signals:
- Checkpoint size growth rate over time: correlates with state management quality
- Resume success rate: percentage of crashed workflows that resume cleanly without duplicate side effects

Boundary of applicability

Works unusually well:
- Approval-heavy enterprise workflows (legal, compliance, finance) where pause-resume is the primary value
- Research workflows with intermediate state that survives hours of human review
- Multi-step agents where replay from intermediate state accelerates debugging

Becomes pathological:
- High-throughput, low-latency pipelines (checkpoint serialisation per step adds 5–15ms × N)
- Workflows where state is > 100KB per step (checkpoint storage explodes)
- Simple chains with no branching and no failure modes (overhead without value)

Scale that invalidates naive intuition:
- At 100K concurrent workflows, PostgresSaver becomes the bottleneck unless you shard by tenant or workflow-type
- At graph depth > 50 nodes, state schema evolution becomes the main maintenance cost (every node must handle schema changes)

Failure-prone assumption: "LangGraph makes my workflow reliable"

The seductive wrong idea: "If I put my workflow in LangGraph, checkpointing handles all failures automatically."

The correction: LangGraph makes execution resumable, not correct. A node that sends duplicate emails, a routing function that hallucinates, a state schema that drops fields — all still break inside LangGraph. The framework provides structure; design quality still determines reliability.

The framework's responsibility: durable state, resume from checkpoint, typed edges, interrupt semantics.

Your responsibility: node idempotency, state compression, routing determinism, checkpoint placement relative to side effects, schema evolution strategy.

Real-world implementations

LangGraph Platform (LangChain Inc.) — managed hosting for compiled graphs with built-in PostgresSaver, thread management, and interrupt/resume APIs. Used by teams building customer-facing agent products that need SLA-grade durability.
Elastic Security agent workflows — security investigation graphs with conditional branching: triage → enrich → score → escalate/auto-resolve. Checkpointing preserves evidence chain across analyst review pauses.
Replit Agent — code-generation workflows use graph-structured execution for plan → implement → test → fix loops where each iteration checkpoints, enabling resume after user feedback.
Klarna support automation — customer service workflows branch on ticket complexity, pause for human review on refunds above threshold, and resume with full conversation context after agent handoff.
Weights & Biases eval pipelines — model evaluation workflows run parallel scoring branches, merge results through reducers, and checkpoint intermediate scores for debugging failed evaluations.
Thomson Reuters legal research — document analysis graphs pause at attorney review gates and resume without re-running expensive extraction nodes.
Notion AI — workspace automation uses graph-structured flows for multi-step document operations (summarize → translate → format) with interrupt points for user approval.
GitLab Duo workflows — code review and merge request automation structured as graphs with conditional edges for CI status, approval gates, and rollback branches.

Recall checkpoint

What LangGraph primitive corresponds to "the workflow graph" from our vocabulary?
Why must routing functions be deterministic?
What problem do reducers solve in parallel branches?
Why does checkpoint state size matter at scale?
When is LangGraph overhead without value?
What's the difference between interrupt_before and node-level retry?
How does a checkpointer enable resume after process crash?

Interview Q&A

Q: Why does LangGraph use TypedDict state schemas rather than passing messages between nodes? A: A typed schema makes the handoff contract explicit. Every node reads and writes fields declared in one shared schema, which enables static type checking (TypedDict is not validated at runtime), deterministic routing on named fields, and safe parallel merges through reducers. Common wrong answer to avoid: "Because TypedDict is Pythonic." The real value is an explicit contract across nodes, not language convention.

Q: Why should routing functions avoid calling LLMs? A: Routing functions are control-plane decisions. They must be deterministic for the graph to be replayable, testable, and auditable. An LLM call makes the same state produce different paths on different runs. Common wrong answer to avoid: "Because LLM calls are slow." Latency matters, but the deeper issue is non-determinism destroying replay safety.

Q: Why does LangGraph checkpoint after every node rather than only at the end? A: Long workflows can fail mid-run. Per-node checkpoints ensure resume granularity — you restart from the last successful step, not from the beginning. This also enables interrupt semantics for human-in-the-loop. Common wrong answer to avoid: "For debugging convenience." Debugging benefits, but crash recovery and interrupt support are the primary motivators.

Q: When would you choose a custom orchestrator over LangGraph? A: When the workflow requires sub-millisecond latency per step (checkpoint overhead is unacceptable), when you need a non-Python runtime, when your team already has a battle-tested workflow engine (Temporal, Airflow), or when the graph abstraction doesn't match your execution model (event-driven, streaming). Common wrong answer to avoid: "When the workflow is complex." Complexity is exactly when LangGraph helps most. The reasons to avoid it are performance constraints, language constraints, or existing infrastructure.

Q: How do reducers prevent data loss in parallel branches? A: Without reducers, the last branch to write a field wins — earlier branch outputs are silently overwritten. Reducers define merge semantics (append, union, max, custom) so all branches contribute to the final state. Common wrong answer to avoid: "Reducers just concatenate lists." Concatenation is one option; the design choice is which merge semantic matches the field's intent.

Q: Why is state size a production concern with checkpointers? A: Every checkpoint serialises the full state. Bloated state (raw documents, conversation history) multiplied by nodes per workflow multiplied by concurrent workflows creates storage pressure and write latency. The fix is the state compression strategy from file 05. Common wrong answer to avoid: "Because storage costs money." Cost is one factor, but write latency affecting workflow throughput is often the binding constraint.

Design/debug exercise (10 min)

Modeled: Take the loan-approval graph above. The pull_credit node fails with HTTP 503. Trace what happens: which checkpoint is the last valid one, what state it contains, and what happens when the graph resumes. Answer: checkpoint 1 (after verify_identity) is the last valid state; it contains {applicant_id: "...", identity_verified: True, ...}; resume re-executes pull_credit with that state.

Your turn: Add a parallel branch to the loan-approval graph: after verify_identity, run pull_credit and sanctions_check in parallel. Both must complete before compliance_check. Define a reducer for a risk_signals: list[str] field that both branches write to. Write the add_edge calls for the fan-out and for the merge point.

From memory: Close this file and sketch: the StateGraph definition, three nodes, one conditional edge, the checkpointer configuration, and one interrupt point. Label which concept from files 01–05 each piece implements.

Operational memory

LangGraph is a framework, not a solution. It maps our workflow vocabulary — graphs, typed state, conditional edges, durable checkpoints — to Python primitives you can compile, run, and resume. The mapping is precise: StateGraph is the workflow graph, TypedDict is the handoff contract, reducers handle parallel merge, and checkpointers make execution durable. What the framework gives you is structure and resume semantics. What it doesn't give you is decomposition quality, routing correctness, state compression, or node idempotency — those remain design decisions from files 01–05.

The practical value is clearest in approval-heavy workflows where a graph may pause for hours waiting for human input and must resume without re-executing completed steps. The practical cost is clearest in high-throughput pipelines where checkpoint serialisation adds latency that simple function chains avoid.

Remember:
- StateGraph + TypedDict = your workflow graph with typed handoff contracts
- Reducers prevent lost updates in parallel branches; choose merge semantics per field
- Checkpointers write full state after every node; keep state small or pay the storage/latency tax
- Routing functions must be deterministic for replay and testing to work
- interrupt_before / interrupt_after = the approval gate mechanism
- LangGraph doesn't fix bad decomposition: a bloated node in a graph is still a bloated node
- Use LangGraph when you need durability, branching, and resume; use plain functions when you don't

Bridge. A graph engine gives us executable structure. But who decides the plan before the graph runs — and who watches execution to detect when the plan itself is wrong? That job belongs to the plan-execution manager, which sits above the graph engine and governs the route. → 07-plan-execution-manager.md