Frameworks and patterns — same ideas, different wrappers

~16 min read. Pick a framework before understanding the loop and you have bought a packaging choice that will outlive your design. Understand the loop first, and frameworks become small decisions about ergonomics, observability, and lock-in — not commitments to a philosophy. Every agent framework is a wrapper around the same five primitives. This page is the architect's map of what each wrapper costs and what it actually buys you.

Built on the first-principles overview in 00-first-principles.md. The toolbelt, the leash, the give-up rule, the loop with judgement between steps — every framework provides some packaging of these. The difference is not whether the agent has them. The difference is who writes them: you, or the framework's authors.

1) Same task, three frameworks, three different cognitive loads

A refund-eligibility agent. The job: given a customer ID and an order ID, decide whether the refund qualifies under current policy, and either issue the refund or generate an explanation. Three implementations of the same loop.

Implementation A — raw Anthropic tool use. ~150 lines of Python. A while loop. A list of tool definitions. A messages.create call inside the loop. A dispatcher that runs the tool the model picked and appends the result. A break condition on stop_reason == "end_turn" or a hard iteration cap. Every decision the loop makes is visible in the code. Adding a feature means adding code; debugging means reading the loop.

Implementation B — LangGraph. ~200 lines plus a state schema and a graph definition. Nodes for each step (fetch_order, check_policy, decide, human_review, send_reply). Edges expressing transitions. Conditional edges expressing branches. The graph runs; the framework handles persistence, checkpointing, and resume. Visualising the agent's state machine is one method call on the compiled graph. Adding a feature means adding a node and wiring its edges; debugging means reading the graph plus the framework's internal state log.

Implementation C — CrewAI. ~80 lines of declarative role definitions. An investigator agent with the order-and-policy tools. An analyst agent that decides eligibility. A communicator agent that drafts the message. A Crew that orchestrates the three. Each role has its own prompt, its own toolbelt, its own goals. Adding a feature means adding (or refining) a role; debugging means reading the conversation between roles.

Three implementations. Same business logic. Same model. Same tools. The architect's question is not "which is right?" — all three ship in production today. The question is what each costs:

Dimension	Raw tool use	LangGraph	CrewAI
Lines of code (approx.)	~150	~200	~80
Time to first prototype	1 day	2 days	4 hours
Cognitive load per debug session	low (all visible)	medium (graph + framework)	medium-high (multi-agent transcripts)
Persistence / checkpointing	DIY	built-in	partial
Visualisation / observability	DIY	built-in	partial
Lock-in to framework patterns	none	medium	high
Per-turn token overhead (vs raw)	0%	~5-10%	~30-60% (role coordination)

The fast prototype is not always the right ship. The framework that wraps the loop also wraps the cost surface, the lock-in surface, and the future debugging surface.

Teacher voice. A framework is not magic — it is a set of decisions the maintainers have made on your behalf. If you cannot name those decisions, you cannot judge whether they fit your problem.

2) The five primitives every framework provides

Every agent framework, no matter how it markets itself, ships some form of the same five primitives. Knowing the primitives is what lets you compare frameworks honestly.

┌───────────────────────────────────────────────────────────────┐
│  1. LOOP CONTROL       — who calls the model again?           │
│                                                                │
│  2. TOOL DISPATCH      — how does a tool call become a result?│
│                                                                │
│  3. STATE              — what survives between iterations?    │
│                                                                │
│  4. STOPPING RULES     — when does the loop end?              │
│                                                                │
│  5. OBSERVABILITY      — what gets logged, when?              │
└───────────────────────────────────────────────────────────────┘

Loop control. In a raw API setup, your while loop calls the model. In LangGraph, the graph runtime calls nodes (each of which may call the model). In CrewAI, the Crew orchestrator decides when roles take turns. Same primitive, three packagings. The cost of misunderstanding the packaging is reaching for the wrong abstraction to hot-fix a bug.

Tool dispatch. Raw setups call your own dispatcher function. In LangGraph, a tool call is executed by a node: either the prebuilt ToolNode or a node function you write. CrewAI has roles with their own tool lists. Same primitive (the function that runs when the model picks a tool), three packagings. Frameworks that abstract this well make adding a tool easy; frameworks that abstract it badly make debugging a tool failure painful.

State. Raw: a dict you manage. LangGraph: a typed State schema with reducers. CrewAI: implicit conversation history plus role memory. Same primitive — what survives turn N into turn N+1 — three packagings. The framework's state model is what determines whether memory is auditable (see 12-memory-architecture.md).

Stopping rules. Raw: your break condition. LangGraph: END nodes and conditional edges, plus a configurable recursion limit. CrewAI: a max-iterations parameter plus role-internal signalling. Same primitive (the give-up rule that makes the agent finite), three packagings. A cap that exists only as prose in the prompt, not as a check in the orchestration code, cannot be enforced; that is the runaway risk from 09-stopping-rules-budgets.md.

Observability. Raw: whatever you log. LangGraph: built-in trace API plus LangSmith integration. CrewAI: conversation transcripts plus optional callbacks. Same primitive — what the on-call engineer can see when something fires wrong — three packagings. Framework choice is partly a bet on what observability you will need at 3 AM during an incident.

A framework that does not give you all five primitives is not really an agent framework — it is a chat-completion helper.
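
To make the five primitives concrete, here is a minimal framework-free sketch in which each primitive is one labelled line. The scripted `fake_model`, the `lookup` tool, and the trace event fields are hypothetical stand-ins so the sketch runs without an API key; a real loop would call a model SDK in their place.

```python
import json
from typing import Callable

# Hypothetical stand-in for a model call: returns a scripted action
# based on how many tool results are already in the state.
def fake_model(state: dict) -> dict:
    if not state["results"]:
        return {"type": "tool", "name": "lookup", "args": {"key": "order-1"}}
    return {"type": "final", "text": f"found {state['results'][-1]}"}

TOOLS: dict[str, Callable[..., str]] = {
    "lookup": lambda key: f"record for {key}",  # hypothetical tool
}

def run(max_iters: int = 4) -> tuple[str, list[dict]]:
    state = {"results": []}            # 3. STATE: survives between iterations
    trace: list[dict] = []             # 5. OBSERVABILITY: one event per step
    for i in range(max_iters):         # 1. LOOP CONTROL: we call the model again
        action = fake_model(state)
        trace.append({"iter": i, "action": action["type"]})
        if action["type"] == "final":  # 4. STOPPING RULE: the model says it is done
            return action["text"], trace
        result = TOOLS[action["name"]](**action["args"])  # 2. TOOL DISPATCH
        state["results"].append(result)
    raise RuntimeError("iteration cap hit")  # 4. STOPPING RULE: hard cap

answer, trace = run()
print(answer)
print(json.dumps(trace))
```

Every framework discussed on this page moves some of these five lines out of your file and into its own runtime; comparing frameworks is largely a matter of finding where each line went.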

3) The running example — Karthik's refund eligibility check

One concrete task threads through the rest of this chapter. Read it twice.

Task: "A fintech support agent. Customer Karthik (account_id: 33-9182, tier: standard, region: IN) writes: 'Please refund my purchase of ₹6,400 from 22-Apr — it never arrived.' The agent should: (1) look up the order; (2) check the refund policy for the customer's region and tier; (3) decide eligibility based on amount, delivery status, and time elapsed; (4) if eligible and amount ≤ ₹50,000, issue the refund; if eligible and amount > ₹50,000, route to a human (per 13-approval-gates.md); (5) draft a customer-facing reply. Available tools: get_order(order_id), get_refund_policy(region, tier), issue_refund(account_id, amount), escalate_to_human(reason, packet), compose_reply(content)."

The next three sections implement this same task in raw API, LangGraph, and CrewAI form. Same business logic, same tools, same outcome — different framework cognitive loads.

Implementation A — raw Anthropic tool use

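import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
MAX_ITERS = 6

class IterationCapExceeded(RuntimeError):
    """The loop hit MAX_ITERS without the model ending its turn."""

# The *_TOOL definitions, SYSTEM_PROMPT and dispatch_tool are defined elsewhere.
def run_refund_agent(customer_message: str, account_id: str) -> str:
    tools = [GET_ORDER_TOOL, GET_POLICY_TOOL, ISSUE_REFUND_TOOL,
             ESCALATE_TOOL, COMPOSE_REPLY_TOOL]
    # Include account_id so the model can pass it to issue_refund.
    messages = [{"role": "user",
                 "content": f"Account {account_id}: {customer_message}"}]
    for iteration in range(MAX_ITERS):  # hard cap enforced in Python
        response = client.messages.create(
            model="claude-sonnet-4-6", tools=tools, messages=messages,
            system=SYSTEM_PROMPT, max_tokens=2000,
        )
        if response.stop_reason == "end_turn":
            # Join all text blocks; content[0] is not guaranteed to be text.
            return "".join(b.text for b in response.content if b.type == "text")
        if response.stop_reason != "tool_use":
            # e.g. max_tokens: fail loudly instead of re-sending the same request
            raise RuntimeError(f"unexpected stop_reason: {response.stop_reason}")
        # Answer every tool_use block; the API rejects a follow-up message
        # that leaves any tool_use id without a matching tool_result.
        messages.append({"role": "assistant", "content": response.content})
        messages.append({"role": "user", "content": [
            {"type": "tool_result", "tool_use_id": b.id,
             "content": str(dispatch_tool(b.name, b.input))}
            for b in response.content if b.type == "tool_use"
        ]})
    raise IterationCapExceeded("hit MAX_ITERS without end_turn")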

The whole loop fits on one screen, plus the tool definitions and the dispatcher. Iteration cap is enforced at the Python level. Stop condition is explicit. State is the messages list. Logging is whatever you add (typically a structured log per iteration). Persistence is whatever you build (typically: serialise messages after each iteration, resume by re-instantiating).
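
The loop above calls a `dispatch_tool` function that this page does not show. A minimal sketch of one possible shape follows; the registry, the stub tool bodies, and the convention of returning errors as data are assumptions for illustration, not part of the original implementation.

```python
import json
from typing import Any, Callable

# Hypothetical stub tools; real ones would call the order and policy services.
def get_order(order_id: str) -> dict:
    return {"order_id": order_id, "amount": 6400, "delivered": False}

def get_refund_policy(region: str, tier: str) -> dict:
    return {"region": region, "tier": tier, "window_days": 30}

# Registry mapping the tool name the model emits to the function that runs it.
TOOL_REGISTRY: dict[str, Callable[..., Any]] = {
    "get_order": get_order,
    "get_refund_policy": get_refund_policy,
}

def dispatch_tool(name: str, tool_input: dict) -> str:
    """Run the named tool and return a JSON string for the tool_result block."""
    handler = TOOL_REGISTRY.get(name)
    if handler is None:
        # Return the error as data so the model can recover on the next turn.
        return json.dumps({"error": f"unknown tool: {name}"})
    try:
        return json.dumps(handler(**tool_input))
    except Exception as exc:  # bad arguments or a failing backend
        return json.dumps({"error": f"{type(exc).__name__}: {exc}"})

print(dispatch_tool("get_order", {"order_id": "A-1"}))
print(dispatch_tool("delete_everything", {}))
```

Returning failures as a tool result, instead of raising, keeps the decision about what to do next inside the loop where the model and the iteration cap can both see it.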

The reward is total visibility. The cost is that every feature beyond the basic loop — checkpointing, parallel tool calls, retries, human-in-the-loop, observability — is code you write. For a 5-tool agent that ships once and stays simple, that cost is reasonable. For a 25-tool agent with three deployment environments and an evolving toolbelt, it accumulates.
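
As one example of that do-it-yourself cost, the "serialise messages after each iteration" approach to persistence might look like the sketch below. The file layout and function names are hypothetical, and it assumes the messages have already been converted to plain dicts (SDK response objects need converting before `json.dumps` will accept them).

```python
import json
import os
import tempfile
from pathlib import Path

def save_checkpoint(path: Path, messages: list[dict], iteration: int) -> None:
    """Write the transcript atomically so a crash never leaves a half-written file."""
    payload = json.dumps({"iteration": iteration, "messages": messages})
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as tmp:
        tmp.write(payload)
    os.replace(tmp_name, path)  # atomic rename within the same directory

def load_checkpoint(path: Path) -> tuple[list[dict], int]:
    """Return (messages, next_iteration); start fresh if no checkpoint exists."""
    if not path.exists():
        return [], 0
    data = json.loads(path.read_text(encoding="utf-8"))
    return data["messages"], data["iteration"] + 1

# Usage: checkpoint after iteration 0, then "restart" and resume at iteration 1.
ckpt = Path(tempfile.mkdtemp()) / "session-123.json"
save_checkpoint(ckpt, [{"role": "user", "content": "refund please"}], iteration=0)
messages, start_at = load_checkpoint(ckpt)
print(start_at, messages[0]["content"])
```

Even this small version has to make decisions a framework would otherwise make for you: where the file lives, what counts as a completed iteration, and what happens if a tool had side effects before the crash.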

Implementation B — LangGraph

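from typing import Annotated, TypedDict

# PostgresSaver ships in the separate langgraph-checkpoint-postgres package.
from langgraph.checkpoint.postgres import PostgresSaver
from langgraph.graph import END, START, StateGraph
from langgraph.graph.message import add_messages

class AgentState(TypedDict):
    messages: Annotated[list, add_messages]  # reducer: new messages are merged in
    order: dict | None                       # no reducer: each write overwrites
    policy: dict | None
    decision: str | None

# Node bodies are elided; each takes the state and returns a partial update.
def fetch_order(state): ...
def fetch_policy(state): ...
def decide(state): ...

def route_after_decision(state) -> str:
    """Return the name of the node to run after decide."""
    if state["decision"] == "ineligible":
        return "compose_reply"
    if state["order"]["amount"] > 50_000:
        return "human_gate"
    return "issue_refund"

# human_gate_node, issue_refund_node and compose_reply_node are defined elsewhere.
graph = StateGraph(AgentState)
graph.add_node("fetch_order", fetch_order)
graph.add_node("fetch_policy", fetch_policy)
graph.add_node("decide", decide)
graph.add_node("human_gate", human_gate_node)
graph.add_node("issue_refund", issue_refund_node)
graph.add_node("compose_reply", compose_reply_node)
graph.add_edge(START, "fetch_order")
graph.add_edge("fetch_order", "fetch_policy")
graph.add_edge("fetch_policy", "decide")
graph.add_conditional_edges("decide", route_after_decision)
graph.add_edge("issue_refund", "compose_reply")
graph.add_edge("human_gate", "compose_reply")
graph.add_edge("compose_reply", END)
app = graph.compile(checkpointer=PostgresSaver(...))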

The shape of the agent is now a graph that an engineer can render with app.get_graph().draw_mermaid_png() and put in a Notion doc. The state schema is typed. Checkpointing is one constructor argument. Human-in-the-loop is a built-in interrupt() primitive at the human_gate node. Resume after a crash reads from the checkpointer.

The cost is the abstraction tax. Adding a tool means adding a node and editing two edges. Adding a branch means adding a conditional edge. Debugging means understanding both your code and LangGraph's internal state-machine semantics. The first time the state schema's reducer behaves unexpectedly, the engineer learns the framework whether they want to or not.
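
The reducer surprise mentioned above is easier to anticipate with a toy model of the idea. The sketch below is plain Python, not LangGraph's implementation: `apply_update`, `REDUCERS`, and `append_messages` are hypothetical names that mimic the rule that a field with a reducer is merged while a field without one is overwritten.

```python
from typing import Any, Callable

def append_messages(old: list, new: list) -> list:
    """Toy reducer: concatenate instead of replace."""
    return old + new

# Fields listed here are merged with their reducer; all others are overwritten.
REDUCERS: dict[str, Callable[[Any, Any], Any]] = {"messages": append_messages}

def apply_update(state: dict, update: dict) -> dict:
    """Return a new state with one node's partial update applied."""
    merged = dict(state)
    for key, value in update.items():
        if key in REDUCERS and key in state:
            merged[key] = REDUCERS[key](state[key], value)
        else:
            merged[key] = value
    return merged

state = {"messages": ["user: refund please"], "decision": None}
state = apply_update(state, {"messages": ["tool: order found"]})
state = apply_update(state, {"decision": "eligible"})
state = apply_update(state, {"decision": "ineligible"})  # overwrites, no history kept
print(state)
```

The typical surprise runs in one of two directions: a node returns the full message list and the reducer appends it again, or a node writes to a field without a reducer and silently discards what an earlier node stored there.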

LangGraph earns its tax on agents that are genuinely graph-shaped: explicit branches, checkpoint-resume requirements, multi-step approval flows, long-running tasks. For an agent that is just a tool-using loop, the graph adds friction the raw loop did not have.

Implementation C — CrewAI

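from crewai import Agent, Crew, Process, Task

# The *_tool objects are assumed to be defined elsewhere.
investigator = Agent(
    role="Order and policy investigator",
    goal="Gather order details and refund policy",
    tools=[get_order_tool, get_refund_policy_tool],
    backstory="Senior support specialist focused on facts",
)
analyst = Agent(  # no tools: decides from the investigator's findings
    role="Refund eligibility analyst",
    goal="Decide if the refund qualifies under policy",
    backstory="Risk-and-policy analyst with finance background",
)
communicator = Agent(
    role="Customer communicator",
    goal="Draft a clear customer-facing reply",
    tools=[compose_reply_tool],
    backstory="Customer-experience writer",
)
# Note: no role here holds issue_refund or escalate_to_human, so step (4)
# of the task still needs to be assigned to a role.

# Task requires expected_output; the wording below is illustrative.
tasks = [
    Task(description="Investigate order {order_id} for account {account_id}",
         expected_output="Order details and the applicable refund policy",
         agent=investigator),
    Task(description="Decide eligibility given the investigator's facts",
         expected_output="An eligibility decision with a one-line reason",
         agent=analyst),
    Task(description="Draft the reply for the customer",
         expected_output="A customer-facing reply message",
         agent=communicator),
]

crew = Crew(agents=[investigator, analyst, communicator], tasks=tasks,
            process=Process.sequential)
result = crew.kickoff(inputs={"order_id": "...", "account_id": "..."})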

The cognitive frame is dramatically different. The agent is now three "people" with distinct roles, each with their own prompt and tools. Reading the code, the architect can almost imagine a small team handling the request. Adding a reviewer role is one more Agent block plus a Task.

The cost shows up in two places.

Token overhead. Each role has its own system prompt and goal. The investigator's findings get passed to the analyst as a message; the analyst's decision gets passed to the communicator. Each handoff is a fresh model call with the prior context. A task that was 3 large-model turns in implementation A becomes 5-7 turns split across three roles. Cost can be 30-60% higher for the same business outcome.

Lock-in. The role abstraction is CrewAI-specific. Moving to a different framework later means rewriting the agent's mental model, not just renaming a few imports.

CrewAI earns its overhead when role separation reflects real boundaries: distinct expertise, distinct tool authorities, distinct review responsibilities (e.g., investigator has no write access, analyst has read-only policy access, communicator has the customer-message tool only). When the roles are just narrative flavour over a single agent's reasoning, you have added cost without buying anything.
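
The handoff overhead described above is simple arithmetic once you note that every call re-sends a system prompt plus whatever context it inherits. The sketch below works one example; every number in it is made up for illustration, not measured, and the resulting percentage depends entirely on those assumed inputs.

```python
def input_tokens(system: int, context_per_call: list[int]) -> int:
    """Total input tokens billed: the system prompt is re-sent on every call."""
    return sum(system + ctx for ctx in context_per_call)

# Hypothetical single-loop agent: one 1200-token system prompt, 3 calls,
# with the transcript growing on each call.
single = input_tokens(system=1200, context_per_call=[300, 900, 1500])

# Hypothetical three-role crew: 5 calls in total, each role with its own
# system prompt, and each handoff carrying the prior role's findings.
crew = (
    input_tokens(system=800, context_per_call=[300, 700])      # investigator
    + input_tokens(system=700, context_per_call=[1000, 1200])  # analyst
    + input_tokens(system=900, context_per_call=[1400])        # communicator
)

overhead = (crew - single) / single
print(single, crew, f"{overhead:.0%}")
```

Running the same arithmetic with your own prompt sizes and turn counts is a quick way to check whether a proposed role split is likely to sit at the low or high end of the range.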

Mini-FAQ. "If multi-agent overhead is 30-60%, why does anyone use it?" Because some problems genuinely are multi-agent — the team structure mirrors a real division of labour, the roles need different tool authorities (a security review agent that cannot write code; a compliance review agent that cannot send messages), or the prompts are different enough that one-agent-many-hats fails on quality. The honest framing: use multi-agent when role boundaries are enforced, not when they are decorative.

4) When raw tool use is the right choice

Three signals that say "stay with raw API":

The agent is the product. Your team is building an agent that ships to customers as part of the core product. Owning the loop end-to-end is competitive surface area. Framework abstractions that simplify the easy parts also obscure the hard parts where your differentiation lives. Cursor, Claude Code, and Cognition Devin all chose this path; their loops are the product, not a wrapper over a framework's loop.

You will run this in 5+ environments. A framework's dependency tree, version constraints, and runtime requirements are a tax paid in every environment you deploy to. For an agent that runs in lambda + container + edge + on-prem + air-gapped, the framework can become the blocker for adoption in the awkward environment. Raw tool use travels lighter.

You need exotic observability. Your on-call wants per-tool latency histograms keyed by tool argument shape, your security team wants every model input logged to an immutable store with a 7-year retention, your finance team wants per-turn cost attributed to a specific customer line of business. Frameworks ship their own observability story, which may or may not extend to your exotic requirements. With raw tool use, the observability is whatever you write.
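
With a raw loop, a requirement like "per-tool latency keyed by argument shape" is a small wrapper around the dispatcher. The sketch below is one possible shape, assuming tools are called with keyword arguments; the `EVENTS` list, the `traced` decorator, and the event fields are hypothetical, and a real system would send events to its own log pipeline.

```python
import json
import time
from functools import wraps
from typing import Any, Callable

EVENTS: list[dict] = []  # hypothetical sink; swap for your log pipeline

def arg_shape(kwargs: dict) -> str:
    """Stable key describing which arguments were passed and their types."""
    return ",".join(f"{k}:{type(v).__name__}" for k, v in sorted(kwargs.items()))

def traced(tool_name: str) -> Callable:
    """Decorator that records one structured event per tool call."""
    def decorator(fn: Callable[..., Any]) -> Callable[..., Any]:
        @wraps(fn)
        def wrapper(**kwargs: Any) -> Any:
            start = time.perf_counter()
            status = "ok"
            try:
                return fn(**kwargs)
            except Exception:
                status = "error"
                raise
            finally:
                # Runs on success and on failure, so no call goes unrecorded.
                EVENTS.append({
                    "tool": tool_name,
                    "arg_shape": arg_shape(kwargs),
                    "status": status,
                    "latency_ms": (time.perf_counter() - start) * 1000,
                })
        return wrapper
    return decorator

@traced("get_order")
def get_order(order_id: str) -> dict:  # hypothetical stub tool
    return {"order_id": order_id, "amount": 6400}

get_order(order_id="A-1")
print(json.dumps(EVENTS[-1]))
```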

The downside of raw is durable: more code to write, more bugs to hit yourself (the framework authors have probably hit most of them already), no community of users debugging the same abstraction you debug. The choice is whether you would rather pay in your own engineering time or in the framework's quirks.

5) When the graph-state abstraction earns its overhead

LangGraph and similar graph-state frameworks (LangChain LCEL, Mastra workflows, Pydantic AI graphs) earn their tax on agents with three properties.

Persistence and resume are first-class. Your agent might run for hours, get interrupted by a deploy, and need to resume from where it left off. Graph-state frameworks ship checkpointing as a constructor argument; raw tool use makes you build serialisation, recovery, and the resume contract yourself.

Human-in-the-loop is structural, not occasional. Your agent has named pause points where a human reviews before continuing — and you might have ten of these across the flow. Graph-state frameworks expose interrupt() as a primitive; raw tool use makes you build the wait-and-resume mechanism yourself. For agents with one or two pause points, the raw approach is fine; for agents with many, the framework's primitive is genuinely useful.
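
For a sense of what "build the wait-and-resume mechanism yourself" involves, here is a minimal sketch of a single pause point in a raw setup. The `Session` record, the status names, and the choice to issue the refund once a reviewer approves are assumptions for illustration; the ₹50,000 threshold comes from the running example.

```python
from dataclasses import dataclass, field

APPROVAL_THRESHOLD = 50_000  # from the running example's policy

@dataclass
class Session:
    """Hypothetical session record; persist it wherever the agent's state lives."""
    account_id: str
    amount: int
    status: str = "running"  # running | awaiting_human | done
    log: list[str] = field(default_factory=list)

def step(session: Session) -> Session:
    """Advance until the refund is issued or a human decision is needed."""
    if session.amount > APPROVAL_THRESHOLD:
        session.status = "awaiting_human"  # pause point: hand control back
        session.log.append("escalated for approval")
    else:
        session.status = "done"
        session.log.append(f"refund issued: {session.amount}")
    return session

def resume(session: Session, approved: bool) -> Session:
    """Called later, possibly from another process, once a reviewer decides."""
    if session.status != "awaiting_human":
        raise ValueError("session is not waiting on a human")
    session.log.append("approved by reviewer" if approved else "rejected by reviewer")
    if approved:
        session.log.append(f"refund issued: {session.amount}")
    session.status = "done"
    return session

paused = step(Session(account_id="33-9182", amount=72_000))
print(paused.status)
finished = resume(paused, approved=True)
print(finished.status, finished.log)
```

One pause point is manageable this way. With ten, each needing its own status, its own resume function, and its own persistence, the hand-rolled version starts to look like a framework you are maintaining by accident.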

Branches and retries are the bulk of the loop. Your agent is less "a loop with judgement" and more "a state machine with retries, fallbacks, and conditional next-steps." The state-machine framing is a better mental model than the loop framing, and graph frameworks express it natively. Raw tool use makes you express the state machine in if statements scattered through the loop — possible, but harder to reason about as the branch count grows.
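
The state-machine framing can be sketched in plain Python as a transition table, which is roughly the structure a graph framework packages for you. The handler bodies, the retry policy, and the outcome names below are hypothetical; the node names follow the running example.

```python
from typing import Callable

MAX_RETRIES = 2

def fetch_order(ctx: dict) -> str:
    # Hypothetical flaky step: fails on the first attempt, then succeeds.
    ctx["attempts"] = ctx.get("attempts", 0) + 1
    if ctx["attempts"] == 1:
        return "retry"
    ctx["order"] = {"amount": 6400}
    return "ok"

def decide(ctx: dict) -> str:
    return "needs_human" if ctx["order"]["amount"] > 50_000 else "auto"

def finish(ctx: dict) -> str:
    return "ok"

# (state, outcome) -> next state; every branch is visible in one table.
TRANSITIONS: dict[tuple[str, str], str] = {
    ("fetch_order", "ok"): "decide",
    ("fetch_order", "retry"): "fetch_order",
    ("decide", "auto"): "issue_refund",
    ("decide", "needs_human"): "human_gate",
    ("issue_refund", "ok"): "END",
    ("human_gate", "ok"): "END",
}
HANDLERS: dict[str, Callable[[dict], str]] = {
    "fetch_order": fetch_order, "decide": decide,
    "issue_refund": finish, "human_gate": finish,
}

def run(start: str = "fetch_order") -> list[str]:
    """Walk the table from start to END and return the states visited."""
    ctx: dict = {}
    state, path, retries = start, [], 0
    while state != "END":
        path.append(state)
        outcome = HANDLERS[state](ctx)
        if outcome == "retry":
            retries += 1
            if retries > MAX_RETRIES:
                raise RuntimeError(f"{state} exceeded {MAX_RETRIES} retries")
        state = TRANSITIONS[(state, outcome)]
    return path

print(run())
```

Compare this with the same logic written as nested if statements inside a while loop: the table grows by one row per branch, while the if-statement version grows by one more level of nesting.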

The cost stays real even when these properties hold. The framework's abstractions become the surface you debug. The framework's version upgrades become a maintenance commitment. The framework's choices (state schema reducers, edge semantics, persistence backends) become decisions you live with. None of this is a reason to avoid the framework — it is the price you pay for the primitives, and the price is often worth it.

Teacher voice. A graph framework is not "more advanced than" raw tool use. It is a different shape of problem. Pick the shape that matches the agent you are building, not the framework that sounds more sophisticated.

6) Multi-agent / role frameworks — real boundary vs theatre

The hardest framework decision is multi-agent. Frameworks like CrewAI, AutoGen, and OpenAI's experimental Swarm library make role-based decomposition easy, which is a mixed blessing: they make it easy whether or not the decomposition reflects reality.

The test for real boundary, not theatre, is whether the roles need distinct authorities:

Distinct tools. The investigator has read-only tools; the communicator has the message-sending tool; the analyst has neither. Cross-role tool calls are blocked at the framework layer. Real boundary.
Distinct prompts. The investigator's system prompt is 800 tokens of "focus on facts, do not opine." The communicator's is 900 tokens of "match the customer's tone, be concise." Combining them into one prompt produces a worse agent because the instructions interfere. Real boundary.
Distinct review. The analyst's output is reviewed before the communicator runs. The communicator's output is reviewed before sending. Roles are check-and-balance, not just a sequence. Real boundary.

If two roles share tools, share most of their prompt, and have no review between them, they are theatre. The same agent could do the work in one loop at lower cost. The "two roles working together" narrative is for the architect, not for the agent.
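
An enforced tool boundary, as opposed to a decorative one, can be as small as an allowlist checked at dispatch time. The sketch below assumes a hypothetical `dispatch_for_role` function and stub tools; the role names and their authorities follow the list above.

```python
from typing import Any, Callable

# Hypothetical stub tools standing in for the running example's toolbelt.
TOOLS: dict[str, Callable[..., Any]] = {
    "get_order": lambda order_id: {"order_id": order_id, "amount": 6400},
    "get_refund_policy": lambda region, tier: {"window_days": 30},
    "compose_reply": lambda content: f"Reply: {content}",
}

# Each role's authority is an explicit allowlist; the analyst holds no tools.
ROLE_TOOLS: dict[str, frozenset[str]] = {
    "investigator": frozenset({"get_order", "get_refund_policy"}),
    "analyst": frozenset(),
    "communicator": frozenset({"compose_reply"}),
}

def dispatch_for_role(role: str, tool: str, args: dict) -> Any:
    """Run a tool only if the calling role is allowed to use it."""
    if tool not in ROLE_TOOLS.get(role, frozenset()):
        raise PermissionError(f"role {role!r} may not call {tool!r}")
    return TOOLS[tool](**args)

print(dispatch_for_role("investigator", "get_order", {"order_id": "A-1"}))
try:
    dispatch_for_role("analyst", "compose_reply", {"content": "hi"})
except PermissionError as exc:
    print(exc)
```

If a check like this would never fire in your design because every role holds every tool, that is a sign the roles are theatre.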

Done well, multi-agent is the right architecture for genuinely team-structured problems: a code-review agent with a separate security-review sub-agent that has different policies; a sales-research agent with a separate compliance-check sub-agent that has read-only access to the regulatory KB. Done badly, multi-agent triples the cost of a problem one good agent could solve.

7) Lock-in, observability, and the framework upgrade tax

A framework is not just a way to write the agent — it is a relationship. The relationship has costs that show up after the initial enthusiasm.

Lock-in. Migrating from CrewAI to LangGraph is not "rename some imports" — it is rebuilding the mental model. The role definitions, the task descriptions, the implicit coordination between roles all need to become typed state, nodes, and edges. Estimate the rebuild as 50-100% of the original implementation cost, depending on how deeply the framework's abstractions wormed into the design. Lock-in is the implicit tax you pay for the easy ergonomics of the framework you picked.

Observability. The framework ships its trace format, its visualisation, its integration with the framework's hosted observability service (LangSmith for LangGraph, AgentOps for many, native for OpenAI Agents SDK). Adopting the framework's observability story is the path of least resistance. Adopting your own (Datadog, Honeycomb, in-house) means writing adapters. Most teams pick the framework's tooling because it is there — and then discover six months later that the framework's tooling does not answer the question their on-call has at 3 AM.

Upgrade tax. Frameworks at the agent layer in 2026 ship fast — major version bumps every 3-6 months, breaking changes in minor versions. Each upgrade is a chunk of engineering work, and each upgrade can land at an awkward time (right before a launch, during a freeze). Raw tool use upgrades when the model SDK upgrades, which is a much shorter dependency chain. Framework agents take the model SDK upgrade plus the framework upgrade plus the framework's transitive deps — three sources of breaking changes instead of one.

None of this kills frameworks as a category. The right question is what you are buying with the lock-in. For a team building one agent that will ship for a year and then be retired, the framework's ergonomics easily justify the relationship cost. For a team building agents as core product surface, owning the loop is often the better long-term call.

8) A fast design test for framework choice

When the team is debating framework choice, ask three questions. Five minutes. The answers tell you whether the choice is well-grounded.

Name the five primitives the framework provides, and explain how each one differs from raw tool use. If "the framework just makes it easier," the team has not actually looked at what the framework does. If they can name loop control, tool dispatch, state, stopping rules, and observability as concrete things the framework handles, the choice is informed.
What is your agent's branch count, persistence requirement, and role-boundary count? If branches < 5, persistence not required, role boundaries are imaginary — raw tool use. If branches ≥ 5 and persistence required — graph framework. If role boundaries are real (distinct tools / prompts / authorities) — multi-agent framework. The match is mechanical once the agent's actual shape is known.
What is your migration cost if this framework is the wrong choice in 12 months? If "we would rewrite from scratch," accept that and move on — every choice has this cost. If "we have not thought about it," you are committing to a relationship without understanding its escape cost.

Three questions. A framework choice that cannot answer all three is being made by the framework's marketing, not the team's design.
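
The second question's matching rule can be written down as a small function. This is a sketch of the rule as stated above, with two assumptions that the text does not settle: real role boundaries are checked first, and combinations the rule does not cover return an explicit "no clear match" instead of a guess. The function name and return strings are hypothetical.

```python
def suggest_shape(branches: int, needs_persistence: bool,
                  real_role_boundaries: bool) -> str:
    """Map an agent's shape to a starting point for the framework discussion."""
    if real_role_boundaries:  # assumption: role boundaries take precedence
        return "multi-agent framework"
    if branches >= 5 and needs_persistence:
        return "graph framework"
    if branches < 5 and not needs_persistence:
        return "raw tool use"
    # The stated rule does not cover mixed cases; flag them for discussion.
    return "no clear match: weigh graph framework against raw tool use"

# The running example: a few branches, no resume requirement, decorative roles.
print(suggest_shape(branches=3, needs_persistence=False, real_role_boundaries=False))
print(suggest_shape(branches=8, needs_persistence=True, real_role_boundaries=False))
print(suggest_shape(branches=8, needs_persistence=False, real_role_boundaries=False))
```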

Where this lives in the wild

The split below is the cleanest contrast: production agent products that own their loop versus production agent products that rely on a framework for the orchestration layer.

Own-the-loop / raw tool use (or thin internal wrapper):

Claude Code — Anthropic's own agent, raw tool use over the Anthropic API; the loop is the product.
Cursor (agent mode) — proprietary loop over multiple LLM providers; no public framework dependency.
Cognition Devin — proprietary long-horizon planner-executor with custom state and resume.
OpenAI ChatGPT agent mode / Operator — built on OpenAI's own internal primitives, not on a third-party agent framework.
GitHub Copilot coding agent — own loop, integrates with GitHub-native primitives (issues, PRs, checks).
Replit Agent — own loop over a custom execution sandbox.
Vercel v0 — own loop with critique-and-revise tied to a live preview.
Bolt.new — own loop for full-stack app generation.
Aider — minimal Python loop; the simplicity is the design.
Goose by Block — own loop with MCP for tool integration; open-source.
Harvey (legal) — proprietary loop with strong audit and citation guarantees.
Hebbia — proprietary loop tuned for financial-document research.

Framework-mediated agents (graph-state, multi-agent, or hosted SDK):

OpenAI Agents SDK — first-party OpenAI agent framework; explicit Agent, Runner, Handoff primitives; growing adoption since launch.
LangGraph — model-provider-agnostic graph-state framework; production examples include Klarna's customer support stack and Elastic's AI Assistant.
LangChain — older chain-style framework; still common for prototype-to-production paths.
CrewAI — popular role-based multi-agent framework; production deployments in sales-ops and research-assistant verticals.
AutoGen (Microsoft) — conversational multi-agent framework; research-leaning with production adopters.
Pydantic AI — typed agent framework leaning on Pydantic schemas for structure.
Mastra — TypeScript-first framework with workflows and agent primitives.
LlamaIndex agents — graph and workflow primitives bundled with LlamaIndex retrieval.
DSPy — prompt-and-program compiler; not a framework in the traditional sense, but lives in the same decision space.
Bedrock Agents (AWS) — AWS-hosted agent service with action groups, knowledge bases, and IAM integration.
Vertex AI Agent Builder — Google's hosted agent platform with built-in observability and ADK integration.
Azure AI Agent Service — Microsoft's hosted agent service tied to Azure Functions and AI Foundry.
Letta (MemGPT) — agent framework with strong memory architecture as the differentiator.
Pipecat / LiveKit Agents — voice-realtime agent frameworks for streaming use cases.

The pattern across the own-the-loop column: products where the agent is the differentiating surface, where the engineering investment in the loop is worth the maintenance cost, and where exotic observability or environmental constraints would have made framework adoption painful. The pattern across the framework-mediated column: products where the loop is plumbing under a higher-value product surface (the CRM, the support workflow, the security console), where the framework's primitives buy real time, and where the team is comfortable with the lock-in tax.

Pause and recall

Name the five primitives every agent framework must provide.
In the Karthik task implementations, what was identical across raw, LangGraph, and CrewAI versions?
What is the typical token overhead of a multi-agent role framework compared to a single-loop raw implementation, and what causes it?
Under what three conditions does a graph-state framework earn its abstraction tax?
What is the test for "real role boundary" vs "theatre" in a multi-agent design?
Name three signals that say "stay with raw API instead of adopting a framework."
What is the framework upgrade tax, and how is it different from the model SDK upgrade tax?
What three questions does the fast design test in section 8 ask, and what does each one diagnose?

Interview Q&A

Q1. Why is framework choice a design decision rather than a tooling decision? A. Because the framework decides the shape of every primitive your agent uses — loop control, tool dispatch, state, stopping rules, observability. Once you adopt it, every later design decision lives inside its abstractions. The migration cost out of the framework is typically 50-100% of the original implementation, which means the choice is approximately irreversible on normal product timescales. Treating it as a tooling decision ("we'll pick whichever has the nicest README") commits the team to a relationship without examining what is being committed to.

Common wrong answer to avoid: "Pick the most popular framework so you can hire for it." Hiring matters, but hiring for the wrong abstraction produces engineers who can ship the wrong agent fast. The agent's shape determines the right abstraction, not the talent pool.

Q2. Walk me through when raw tool use is the right choice and when LangGraph is. A. Raw when: the agent is the product (your differentiation is the loop), you need to deploy in unusual environments where framework dependencies are awkward, or you need exotic observability the framework cannot provide. LangGraph when: persistence and resume are first-class requirements, human-in-the-loop is structural with many pause points, or the agent is more state-machine than loop (5+ branches, conditional next-steps as the bulk of the design). The test is mechanical once the agent's actual shape is known — branch count, persistence requirement, observability needs. Raw is not "less advanced," it is a different shape of problem.

Common wrong answer to avoid: "LangGraph for serious agents, raw for prototypes." Cursor, Claude Code, and Cognition Devin are serious agents on raw API. The choice has nothing to do with seriousness.

Q3. When does multi-agent (CrewAI, AutoGen) earn its 30-60% token overhead? A. When role boundaries reflect real distinct authorities: different tools per role, different prompts that would interfere if combined, structural review between roles. A code-review agent with a separate security-review sub-agent that has read-only access to the security KB is a real boundary. An "investigator role" that hands findings to an "analyst role" that hands to a "communicator role" — all with the same tools and similar prompts — is theatre. The token overhead is the same in both cases; the value differs by an order of magnitude. The honest test is whether removing the role decomposition would degrade quality. If not, the decomposition is decorative and the cost is paid for nothing.

Common wrong answer to avoid: "Multi-agent is the future, single-agent is the past." The opposite is also wrong. Multi-agent is the right architecture for problems with real role boundaries — and adds cost otherwise.

Q4. What is the framework upgrade tax, and why does it matter for agent products? A. Agent frameworks in 2026 ship fast — major version bumps every 3-6 months, breaking changes in minor versions. Each upgrade is engineering work: read the migration guide, update the call sites, fix the tests that broke, re-validate the eval suite. Raw tool use upgrades when the model SDK upgrades — a much shorter dependency chain. Framework agents take the model SDK upgrade plus the framework upgrade plus the framework's transitive deps. For a product running for years, that compounding tax can dominate the engineering effort spent on the agent. Knowing the tax exists is the architect's job; budgeting for it is the engineering manager's.

Common wrong answer to avoid: "Just pin the framework version and never upgrade." That works for 12 months and then breaks badly when the framework drops support for the model version you depend on. Pinning is deferring the tax, not avoiding it.

Q5. How would you evaluate a framework before adopting it for a new agent? A. Five things. First, can the framework provide all five primitives (loop, dispatch, state, stopping, observability)? If it ships fewer, you are still writing the missing ones. Second, what is its persistence story — does it match your durability requirements? Third, what is its observability story — does it integrate with your existing telemetry, or does it expect you to use its hosted offering? Fourth, what is the lock-in surface — how much of your design would have to be rewritten to migrate off? Fifth, who maintains it and at what cadence — abandoned frameworks become a liability. The evaluation is concrete and answerable; teams that skip it are picking based on README quality.

Common wrong answer to avoid: "Pick whatever the eng team already knows." Familiarity matters less than fit. A team that knows CrewAI well will still ship a worse agent than a team that picks raw tool use for the right reasons.

Q6. Your team is migrating from CrewAI to LangGraph. What is the realistic cost? A. Roughly 50-100% of the original implementation cost. CrewAI's mental model is conversational role-based; LangGraph's is typed state-machine. The migration is not "rename imports" — it is rebuilding the state model (CrewAI's implicit conversation history becomes LangGraph's typed State), rebuilding the coordination (CrewAI's Process.sequential becomes explicit edges), reworking the prompts (role-specific prompts merge or split based on graph node design), and re-validating the eval suite under the new orchestration. Schedule the migration as a real project with eval-gate checkpoints, not as a "quick refactor." The reason the cost is high is precisely the reason CrewAI felt easy to start with — the abstractions are quite different.

Common wrong answer to avoid: "It's just a different framework, should be fast." Different frameworks express different mental models. Migrating the mental model is the cost.
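To make "implicit history becomes typed state, sequential process becomes explicit edges" concrete, here is a framework-free sketch in plain Python. It deliberately does not use the LangGraph or CrewAI APIs. `RefundState`, the stub nodes, the `EDGES` table, and the 30-day policy window are all assumptions made for illustration.

```python
from typing import Optional, TypedDict


class RefundState(TypedDict):
    """Everything the agent knows is a named, typed field (hypothetical schema)."""
    customer_id: str
    order_id: str
    order: Optional[dict]
    eligible: Optional[bool]
    reply: Optional[str]


def fetch_order(state: RefundState) -> RefundState:
    # Stub: a real node would call the order service.
    return {**state, "order": {"days_since_delivery": 12}}


def check_policy(state: RefundState) -> RefundState:
    # Assumed policy for the sketch: refunds allowed within 30 days.
    return {**state, "eligible": state["order"]["days_since_delivery"] <= 30}


def send_reply(state: RefundState) -> RefundState:
    text = "refund issued" if state["eligible"] else "refund declined with explanation"
    return {**state, "reply": text}


NODES = {"fetch_order": fetch_order, "check_policy": check_policy, "send_reply": send_reply}

# What a sequential process leaves implicit is spelled out here as edges.
EDGES = {"fetch_order": "check_policy", "check_policy": "send_reply", "send_reply": None}


def run(state: RefundState, start: str = "fetch_order") -> RefundState:
    node = start
    while node is not None:
        state = NODES[node](state)
        node = EDGES[node]
    return state
```

Every field a role used to pass along in conversation now needs a name and a type, and every hand-off needs an edge. That inventory is where most of the migration effort goes.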

Q7. Why are observability and audit considerations often the deciding factor in framework choice for regulated industries? A. Because the framework's trace format and integration points determine what an auditor can reconstruct six months after a session. A framework with a strong trace API and a stable schema (every node entry/exit logged with inputs, outputs, and version) supports audit reconstruction cleanly. A framework that logs conversation transcripts without structured event types makes audit reconstruction a parsing exercise. For agents in finance, healthcare, legal, or regulated SaaS, the audit cost over a product's life can exceed the engineering cost of the loop. The framework that owns the audit story owns a disproportionate share of the long-term value.

Common wrong answer to avoid: "Just bolt on Datadog later." You can — and the bolt-on will be incomplete unless the framework gives you the structured events to instrument against. Observability has to be in the framework's primitives, not just available downstream.
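The "structured events" that Q7 asks for can be sketched as one record per node entry and exit. This is an illustrative shape rather than any framework's trace schema: `TraceEvent`, `traced`, and the field names are hypothetical, and the sink is a plain list standing in for a real log pipeline.

```python
import json
import time
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class TraceEvent:
    """One structured event: enough for an auditor to replay a step."""
    session_id: str
    node: str
    phase: str          # "entry" or "exit"
    payload: dict       # inputs on entry, outputs on exit
    agent_version: str
    ts: float


def traced(node_name, fn, session_id, agent_version, sink):
    """Wrap a node so every call emits an entry and an exit event as JSON."""
    def wrapper(inputs: dict) -> dict:
        entry = TraceEvent(session_id, node_name, "entry", inputs, agent_version, time.time())
        sink.append(json.dumps(asdict(entry)))
        outputs = fn(inputs)
        exit_event = TraceEvent(session_id, node_name, "exit", outputs, agent_version, time.time())
        sink.append(json.dumps(asdict(exit_event)))
        return outputs
    return wrapper
```

Because each event has a fixed set of typed fields, reconstructing a session months later is a query over events rather than a parse of free-form transcripts.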

Q8. Your CTO asks "should we just write our own agent framework?" — how do you respond? A. With three questions. One, what is your agent's loop differentiation that no framework provides? If the answer is "we have unusual observability and environmental requirements" — that justifies raw tool use, not "write our own framework." Two, how many agents are you planning to ship in the next 18 months? If the answer is one or two, framework-building is over-engineering. Three, what is your team's experience with maintaining open-source-quality abstractions over years? Framework-building is a long-tail commitment — the easy part is shipping v1; the hard part is the next four years of upgrades, deprecations, and community support. The honest answer is usually "use raw tool use plus thin internal helpers for your specific patterns." Real framework-building only makes sense when you ship many agents and your common patterns are unique to your domain.

Common wrong answer to avoid: "Yes, we'll do it better than the open-source frameworks." You probably won't, because the maintainers of the open-source frameworks have been at it for years. The right question is whether you need to be in the framework-building business at all.
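As a rough picture of "raw tool use plus thin internal helpers," the sketch below wraps the loop, the dispatcher, and the hard iteration cap in one small function. The reply dictionary is a simplified, hypothetical shape and not a real SDK response object. `call_model` is a stand-in you would replace with your actual model call.

```python
from typing import Callable, Dict, List


def run_agent_loop(
    call_model: Callable[[List[dict]], dict],
    tools: Dict[str, Callable[..., str]],
    messages: List[dict],
    max_iterations: int = 10,
) -> str:
    """Run model/tool turns until the model ends its turn or the cap is hit."""
    for _ in range(max_iterations):
        reply = call_model(messages)
        if reply["stop_reason"] == "end_turn":
            return reply["text"]
        # Dispatch: run the tool the model picked and append its result.
        tool = tools[reply["tool_name"]]
        result = tool(**reply["tool_input"])
        messages.append({"role": "tool", "name": reply["tool_name"], "content": result})
    # Give-up rule: never loop forever.
    raise RuntimeError("iteration cap reached without end_turn")
```

A helper of this size is easy to read, test, and delete. That is the difference between an internal helper and a framework the team has to maintain for years.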

Apply now (10 min)¶

Step 1 — model the exercise. Take the Karthik refund task from section 3. Here is a four-row framework-fit audit comparing the three implementation choices. Copy the shape.

Audit question	Raw tool use	LangGraph	CrewAI
Branch count in the agent	3 (eligible/ineligible/escalate)	3	3
Persistence required?	no (single-session)	no	no
Real role boundaries?	n/a	n/a	weak — same tools, similar prompts
Recommended choice	✅ best fit	overkill (no persistence need)	overhead without payoff

For this specific task, the fit audit lands on raw tool use. Different agents land differently.
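The audit can also be encoded as a small rule of thumb. The sketch below is one possible reading of this page's guidance, not a definitive decision procedure: `FitAudit`, `recommend`, the order of the checks, and the `branch_threshold` default are all assumptions to tune for your own agents.

```python
from dataclasses import dataclass


@dataclass
class FitAudit:
    branch_count: int
    persistence_required: bool
    real_role_boundaries: bool  # distinct tools, distinct prompts, structural review


def recommend(audit: FitAudit, branch_threshold: int = 5) -> str:
    """Map the audit rows to a recommended packaging (assumed precedence)."""
    if audit.real_role_boundaries:
        return "multi-agent framework"
    if audit.persistence_required or audit.branch_count > branch_threshold:
        return "graph-state framework"
    return "raw tool use"


# The refund task from the table: 3 branches, no persistence, weak role boundaries.
refund_task = FitAudit(branch_count=3, persistence_required=False, real_role_boundaries=False)
print(recommend(refund_task))
```

For the refund task this returns the same verdict as the table. The value of writing it down is that a disagreement about the recommendation becomes a disagreement about one named input.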

Step 2 — your turn. Pick any agent workflow you have built or used. Fill the same four rows (branch count, persistence, role boundaries, recommended choice) for the three columns. Mark which row was hardest to answer — that is the dimension your team has not actually decided.

Step 3 — sketch from memory. Redraw, from memory, the five-primitives diagram from section 2 — loop control, tool dispatch, state, stopping rules, observability — and label, for each primitive, who handles it in raw tool use vs LangGraph vs CrewAI. If you can do this cold, framework choice will never feel like guesswork again.

Operational memory¶

This chapter explained that picking a framework before understanding the loop buys a packaging choice that outlives the design, and that the right question is which of the five primitives — loop control, tool dispatch, state, stopping rules, observability — the framework handles versus which the team owns. The important idea is that the agent's shape determines the right abstraction, not the framework's marketing: raw tool use wins when the loop is the product or environments are exotic; graph-state frameworks win when persistence and branches are first-class; multi-agent frameworks win when role boundaries are enforced rather than decorative.

You learned the five primitives every framework provides in some form, the cost split across implementations (raw at 150 lines but DIY everything, LangGraph at 200 lines plus framework abstractions, CrewAI at 80 lines plus 30–60% token overhead from role coordination), the test for real role boundaries (distinct tools, distinct prompts, structural review between roles), and the framework upgrade tax (major version every 3–6 months for many agent frameworks). That solves the Karthik refund implementation comparison because the fit audit lands deterministically once branch count, persistence requirement, and role boundaries are named honestly.

Carry this diagnostic forward: when a team debates framework choice, demand the five-primitives audit before any commitment. Without it, the choice is being made by README quality, not by the design.

Remember:

Five primitives (loop, dispatch, state, stopping, observability): every framework provides them in some form; you write whatever it leaves out.
Raw tool use wins when the loop is the product, environments are unusual, or observability is exotic.
Graph-state earns its tax when persistence is first-class, HITL is structural, and branches dominate the design.
Multi-agent overhead (30–60%) is earned only when role boundaries are enforced (distinct tools, distinct prompts, structural review) — otherwise decorative.
Framework upgrade tax compounds; migration from one to another typically costs 50–100% of the original implementation.

Bridge. The framework decision is one of many to settle before a new agent ships. Loop type, tool design, authority, stopping rule, memory, approval gates, cost budget, multi-tenancy, recovery, observability, eval gates, and rollout strategy are among the eighteen architectural decisions, all linked. The next file is the synthesis: the architect's punch list every new agent must clear before it goes anywhere near a real user. → 23-architect-checklist.md