01. Build vs buy — the boundary you agree to before you feel it¶

~18 min read. The bank's support agent shipped in four days on a SaaS platform. Everyone celebrated. The wall arrived in month eight, and by then the decision was load-bearing. This file is about making that decision with eyes open.

This file builds on the first-principles overview in 00-first-principles.md. The dominant pressure is fit-before-lock-in, and the first move is choosing among the three families: framework, hyperscaler runtime, SaaS vertical. This file turns that three-way choice into a decision lens and shows the wall that forms when the choice is made by demo instead of by need.

What single-agent design left unanswered¶

Module 01 taught you to pick an agent's shape — single call, ReAct loop, orchestrator, multi-agent — and to wire its loop, tools, memory, and leash. That work assumed you already owned the runtime: that there was a process somewhere executing the loop, a place to register tools, a store for state, a tracer watching it. Module 16 taught multi-agent coordination; module 19 taught tool contracts. All of it assumed the substrate existed.

This file asks the prior question. Who builds and runs the substrate? You can write the loop yourself on a framework, rent a managed runtime that runs the loop for you, or buy a finished agent where the loop is invisible and pre-built. The agent's shape is the same in all three. What changes is the boundary — how much of the system you own versus how much the vendor owns — and that boundary determines what you can change later when the use case moves.

What this file solves¶

A team scopes an agent, sees a great demo from one vendor, and ships on it. Months later a real requirement arrives — a private network call, a model swap, an audit signature, a residency constraint — and the platform structurally cannot serve it. The team is not missing a feature; it agreed to a boundary it never named. This file gives you a decision lens that names the boundary on day one: what capability you need, what control you must retain, what the exit costs, and which of the three families fits before the choice hardens.

Why "just pick the best platform" is the wrong frame¶

There is no best platform, the same way there is no best database. The instinct to rank platforms on a single axis ("which is most capable?") fails because the four levers (control, cost, ops, lock-in) move against each other. A SaaS vertical agent is the most capable for its domain on day one and the least capable the moment your need leaves that domain. A framework is the most capable eventually and the least capable this week, because you have to build everything the others gave you for free.

So the question is never "which platform is best." It is "which boundary can this use case live behind for the next two years, given that the use case will move?" That reframing is the whole skill.

When a four-day demo hides a two-year boundary¶

The bank's support agent on a SaaS vertical platform handled card-balance and statement questions perfectly in week one. The visible artifact looked like this:

Customer (WhatsApp): "what's my credit card balance?"
SaaS agent:          [retrieves from connected core-banking API]
                     "Your current balance is ₹42,310, due 18 June."
Latency: 1.4s   Handled end-to-end, no human.   CSAT on pilot: 4.6/5

Nobody could see the boundary in that trace. It became fully visible only by month eight, when the requirements that had piled up looked like this:

New requirement: route loan-eligibility questions to the bank's
internal fraud-scoring service (inside the bank VPC, no public endpoint),
sign every model call with the bank's KMS key for the regulator's audit,
and serve the 70% FAQ traffic on a cheaper model to cut cost.

SaaS platform response:
  - private VPC tool calls: not supported (tools must be public HTTPS or our connectors)
  - per-call signing with your key: not exposed; we own the model calls
  - model routing by intent: not configurable; we choose the model

The root cause is not that the SaaS platform is bad; it is excellent at what it does. The real problem is not a missing feature but a boundary agreed to without a name. The bank bought a closed orchestration layer, and closed orchestration is exactly what made the four-day demo possible. The speed and the wall are the same property seen from two sides.

So how do you choose so the boundary is named before it becomes load-bearing? You score the use case against the three families on four levers, and you compute the exit cost before you sign, not after you are trapped.

The boundary rule. Every platform choice is a boundary between what you control and what the vendor controls. The boundary is cheap to draw on day one and expensive to redraw later, because code, data, and habits accumulate on the vendor's side. Name the boundary before you commit, or you will discover it during an incident.

Why this rule exists. The primitive is ownership: someone must own the orchestration, the model calls, the state, and the identity. A platform's value comes from owning some of these for you. The constraint is that whatever the vendor owns, you cannot change without their roadmap or without leaving. The naive approach — pick by demo — optimizes day-one capability and silently maximizes the vendor's ownership, which is the lock-in you feel later.

1) The three families as three boundaries — how the choice actually works¶

The three families are not three products. They are three positions of the boundary. Read each as "what do I own, what does the vendor own."

Hyperscaler runtime (AWS Bedrock AgentCore, Google Vertex AI Agent Builder / Agent Engine, Microsoft Foundry Agent Service). You own the agent logic — your orchestration code or framework runs on their runtime. The vendor owns the infrastructure: scaling, the secure runtime sandbox, managed memory, identity integration, observability plumbing. The boundary sits between your code and the machine. You keep your logic; you rent the operations and the cloud-native security. As of 2026, all three explicitly run open frameworks (LangGraph, CrewAI, LlamaIndex, Strands) on their managed runtime, which moves the boundary in your favor — your orchestration stays portable, their runtime carries the ops.

AI-native framework (LangGraph, CrewAI, LlamaIndex, Microsoft Agent Framework, OpenAI Agents SDK). You own everything: the loop, the state store, the tracer, the servers, the on-call. The vendor owns only a library. The boundary sits at the edge of your own infrastructure. Maximum control and portability, maximum operational burden. This is "build."

SaaS vertical agent (Sierra, Decagon, Salesforce Agentforce, NiCE Cognigy, Glean). You own configuration: knowledge bases, connected tools, prompts within their templates, branding. The vendor owns the orchestration, the model choice, the runtime, the state, often the data. The boundary sits at the configuration surface. Maximum speed and packaged domain expertise, minimum control. This is "buy."

            you own ◀─────────────────── boundary ───────────────────▶ vendor owns

FRAMEWORK   loop, state, tools, tracing, servers, on-call            │ a library
            ───────────────────────────────────────────────────────┤
HYPERSCALER agent logic / orchestration code                        │ runtime, scaling, memory,
            ────────────────────────────────────┤                     identity, security plumbing
SAAS        config: KB, connectors, prompts-in-template, branding   │
            ──────┤                                                    orchestration, model, runtime,
                                                                       state, often the data

The diagram is the whole file in one picture. As you move down, the boundary slides left: you own less, you build less, and you can change less. Speed goes up; control and portability go down. The bank chose the bottom row and felt the left-most boundary in month eight.
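The diagram can also be read as data. The sketch below is a minimal illustration, not a vendor model: the component names (`orchestration`, `model_calls`, and so on) and the requirement-to-component mapping are assumptions chosen to mirror the rows above and the bank's three blocked requirements.

```python
from enum import Enum


class Family(Enum):
    FRAMEWORK = "framework"
    HYPERSCALER = "hyperscaler runtime"
    SAAS = "saas vertical"


# What *you* own in each family, transcribed loosely from the diagram.
# Component names are illustrative labels, not any vendor's terminology.
YOU_OWN = {
    Family.FRAMEWORK: {"config", "orchestration", "model_calls", "tools", "state", "runtime"},
    Family.HYPERSCALER: {"config", "orchestration", "model_calls", "tools"},
    Family.SAAS: {"config"},
}

# Hypothetical mapping: the component each requirement needs you to control.
REQUIREMENT_NEEDS = {
    "private VPC tool call": "tools",
    "per-call signing with own key": "model_calls",
    "model routing by intent": "orchestration",
}


def blocked_requirements(family, requirements):
    """Return the requirements that cross the vendor boundary for this family."""
    owned = YOU_OWN[family]
    return [r for r in requirements if REQUIREMENT_NEEDS[r] not in owned]


if __name__ == "__main__":
    for family in Family:
        print(family.value, "->", blocked_requirements(family, list(REQUIREMENT_NEEDS)))
```

Under these assumptions the SaaS row blocks all three requirements and the other two rows block none, which is the month-eight wall restated as a set difference.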

2) Picture first — the decision as a leash on a slider¶

flowchart TD
    A[Use case scoped] --> B{Is the domain a solved<br/>SaaS category, and will it<br/>stay inside that domain?}
    B -->|Yes, and unlikely to drift| C[SaaS vertical agent<br/>fastest value, least control]
    B -->|No, or will drift| D{Do you need control over<br/>orchestration, model routing,<br/>private tools, or audit?}
    D -->|No, ops burden is the enemy| E[Hyperscaler runtime<br/>rent ops, keep logic portable]
    D -->|Yes, deep control is the point| F{Do you have the team to<br/>run servers, tracing, on-call?}
    F -->|Yes| G[AI-native framework<br/>max control, you own ops]
    F -->|No| E

Read this top to bottom and stop at the first leaf that fits. The first gate is the most decisive: if a SaaS category solves your exact problem and the problem will stay in that category, buying is almost always right, because you cannot out-build a vendor who has spent years on that one domain. The danger is the second clause of that gate, "and will it stay inside that domain?" The bank's support use case looked like a solved SaaS category, but it was destined to drift into private-network calls and custom audit, which no support-SaaS vendor owns.
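The three gates translate directly into a small function. This is a sketch of the flowchart only; the parameter names are made up for illustration, and a real decision would weigh evidence rather than four booleans.

```python
def choose_family(solved_saas_domain: bool, will_drift: bool,
                  needs_deep_control: bool, team_can_run_ops: bool) -> str:
    """Walk the three gates in order and stop at the first leaf that fits."""
    # Gate 1: a solved SaaS category that the use case will stay inside.
    if solved_saas_domain and not will_drift:
        return "saas vertical"
    # Gate 2: no need for deep control, so ops burden is the enemy.
    if not needs_deep_control:
        return "hyperscaler runtime"
    # Gate 3: deep control is the point; can the team run the ops?
    return "ai-native framework" if team_can_run_ops else "hyperscaler runtime"


if __name__ == "__main__":
    # The bank's support agent: looks like a SaaS category, but will drift,
    # needs control, and the platform team cannot carry on-call.
    print(choose_family(True, True, True, False))
```

Note that two different paths end at the hyperscaler runtime: one because control is not needed, the other because control is needed but the team cannot run the infrastructure.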

3) The bank's two agents through the lens — one running example¶

The bank has two agents to place. They look similar (both are "customer-facing-ish LLM agents"), and the lens ends up placing both in the same family, but for different reasons. This is the example carried through the whole module.

Agent 1 — customer-support agent (card / loan / account questions, web + WhatsApp)¶

Attempt A — the tempting move: buy the SaaS vertical¶

A support-CX SaaS like Sierra or Decagon is purpose-built for exactly this: transactional support workflows, omnichannel, deflection metrics, packaged guardrails. Time to first working bot: days. The demo is the product.

Where it breaks: the bank's support agent must eventually call the internal fraud-scoring service in the VPC, sign model calls for the audit, and route FAQ traffic to a cheaper model. None of these live inside the support-SaaS boundary. The four-day win becomes the month-eight wall.

Attempt B — the right move: hyperscaler runtime with a portable framework on top¶

Run a LangGraph orchestration (which you own and can move) on a hyperscaler runtime — say Bedrock AgentCore or Vertex Agent Engine. You get managed scaling, managed memory, cloud-native identity, and a secure runtime sandbox, and your orchestration code stays portable because the runtimes explicitly host open frameworks. Private VPC tool calls work because the runtime sits inside your cloud network. Per-call signing works because you own the model-invocation code. Intent-based model routing works because you wrote the router.
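To make "you own the model-invocation code" concrete, here is a minimal sketch of intent-based routing plus per-call signing. Everything named here is an assumption: the model identifiers and intent labels are hypothetical, and HMAC-SHA256 with a local key stands in for a signature produced by the bank's KMS. A real deployment would call the KMS and the runtime's model API instead.

```python
import hashlib
import hmac
import json

# Hypothetical model identifiers and intent labels; not real product names.
MODEL_BY_INTENT = {"faq": "cheap-model", "loan_eligibility": "premium-model"}
DEFAULT_MODEL = "premium-model"


def route_model(intent: str) -> str:
    """Pick a model by intent; unknown intents fall back to the stronger model."""
    return MODEL_BY_INTENT.get(intent, DEFAULT_MODEL)


def sign_call(key: bytes, model: str, prompt: str) -> str:
    """Return a hex signature over the call record.

    HMAC-SHA256 with a local key is a stand-in for signing with a KMS key.
    """
    record = json.dumps({"model": model, "prompt": prompt}, sort_keys=True)
    return hmac.new(key, record.encode("utf-8"), hashlib.sha256).hexdigest()


def build_call(intent: str, prompt: str, key: bytes) -> dict:
    """Build the signed envelope that would accompany a model call."""
    model = route_model(intent)
    return {"model": model, "prompt": prompt, "signature": sign_call(key, model, prompt)}


if __name__ == "__main__":
    print(build_call("faq", "what's my credit card balance?", b"demo-key"))
```

Both functions sit in code the team owns, which is the point: on a closed SaaS platform there is no place to put either of them.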

Cost is higher than SaaS on day one (you build the orchestration) but the boundary sits at "your code vs their machine," which is exactly the boundary the month-eight requirements need to cross.

Verdict for Agent 1: hyperscaler runtime + owned framework. The use case looks like a SaaS fit but is destined to drift out of the SaaS boundary. The drift is the deciding factor.

Agent 2 — internal-ops agent (RM tool: pull KYC docs, draft compliance memos)¶

Low volume (a few thousand calls a day across relationship managers), deep integration with internal document stores and the bank's own identity system, sensitive data that must never leave the bank's region, and bespoke logic that no SaaS vendor sells. There is no "compliance-memo-for-Indian-retail-banks" SaaS category.

SaaS: no product exists for this. Eliminated by the first gate.
Hyperscaler runtime: strong fit — managed ops, in-VPC tool calls, cloud identity, residency controls.
Framework self-hosted: viable, but the bank's platform team is six people and does not want to run agent infrastructure and on-call for a low-volume internal tool.

Verdict for Agent 2: hyperscaler runtime. Same family as Agent 1, which is a feature — one operational model, one observability stack, one identity story for both agents. We will keep both agents on the hyperscaler family and spend the rest of the module deciding which hyperscaler and whether any piece should drop to a framework or rise to a SaaS component.

Teacher voice. Notice what the lens did. Both agents looked like buy candidates because the demos are seductive. The lens placed both on the hyperscaler runtime — not because runtimes are "best," but because each use case had a need that crossed the SaaS boundary: drift for Agent 1, no-category-exists for Agent 2. The family follows the boundary, not the demo.

4) Why the runtime, not the framework, under this workload¶

The plausible alternative to "hyperscaler runtime" is "self-host a framework." Both keep your orchestration portable. So why rent the runtime for a bank with a six-person platform team?

Because the workload is two agents, modest volume, high compliance, small team. The framework path means the team builds and operates: autoscaling, the session/memory store, the trace pipeline, secrets and identity wiring, region pinning, patching, and 24×7 on-call — for a system whose business value is support deflection and RM productivity, not infrastructure. The runtime gives all of that as managed services. The team's scarce resource is engineer-attention, and the runtime spends money to buy attention back.

If the workload were different (say a high-volume, latency-critical, cost-sensitive product agent at 50M calls a month where the per-call platform fee dominates), the math could flip toward self-hosting the framework on raw compute to kill the platform markup. That is a cost decision, and file 05 walks exactly that curve. For this workload, the runtime wins because the team is small and the compliance bar is high.
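The flip point can be sketched as a break-even calculation. This is a deliberately simplified model with assumed inputs: the runtime charges a per-call platform fee on top of tokens, and self-hosting replaces that fee with a fixed monthly cost for infrastructure and on-call. The dollar figures in the example are hypothetical, and the model ignores the marginal compute cost of self-hosting.

```python
def monthly_cost_runtime(calls, token_cost_per_call, platform_fee_per_call):
    """Managed runtime: every call pays tokens plus a platform fee."""
    return calls * (token_cost_per_call + platform_fee_per_call)


def monthly_cost_self_host(calls, token_cost_per_call, fixed_ops_cost):
    """Self-hosted framework: tokens per call plus a fixed infra and on-call cost."""
    return calls * token_cost_per_call + fixed_ops_cost


def break_even_calls(platform_fee_per_call, fixed_ops_cost):
    """Monthly volume above which self-hosting is cheaper under this model."""
    return fixed_ops_cost / platform_fee_per_call


if __name__ == "__main__":
    # Hypothetical figures: $0.002 platform fee per call, $60,000/month fixed ops.
    print(round(break_even_calls(0.002, 60_000)))
    for calls in (100_000, 50_000_000):
        print(calls,
              monthly_cost_runtime(calls, 0.01, 0.002),
              monthly_cost_self_host(calls, 0.01, 60_000))
```

With these made-up inputs the crossover lands at roughly 30 million calls a month, so a low-volume internal tool stays on the runtime while a 50M-call product agent does not. The useful output is the shape of the comparison, not the specific numbers.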

5) The property that flips the decision: drift¶

The single dimension that most changes the build-vs-buy answer is drift — how far the use case will move from where it starts. A use case that will stay inside one SaaS domain forever should be bought. A use case that will grow tentacles into your private systems, your audit, your cost controls, and your model choices should sit on a boundary you control.

Drift over 2 years:        LOW ───────────────────────────────▶ HIGH

Right family:        SaaS vertical    Hyperscaler runtime    Framework
                     (buy the domain)  (rent ops, keep logic) (own everything)

The bank's error was estimating drift at zero ("it's just a support bot") when the real drift was high (private calls, audit, routing). Drift is hard to estimate because day-one requirements are always the simplest the use case will ever be. The safe bias: assume more drift than the demo implies, especially in a regulated domain where compliance requirements arrive late and non-negotiably.

6) The wall, walked through deeply — one failure end to end¶

Replay the felt failure as a timeline, because the shape of this failure recurs across every bad platform choice.

Week 1   SaaS support agent ships. CSAT 4.6. Leadership thrilled. Boundary invisible.
Month 3  Volume grows; FAQ is now 70% of traffic. Cost per conversation flat at ~$2.
         Nobody can route cheap traffic to a cheap model — the vendor owns the model.
Month 6  Regulator asks: "show us a signed audit trail of every automated decision."
         The vendor owns the model call; per-call signing with the bank's key is not exposed.
Month 8  Product wants loan-eligibility answers via the internal fraud service in the VPC.
         The vendor's tools must be public HTTPS or vendor connectors. No VPC path.
Month 9  Migration project opens. Exit cost estimated at 5–7 engineer-months:
         re-implement orchestration, re-connect tools, re-home conversation data,
         re-train the support team on a new console. Data export is partial.

The failure is not a bug. Every step is the platform working exactly as sold. The damage is that each new requirement hits the same wall — the vendor-owned boundary — and the requirements were always going to arrive because the domain is regulated banking. Not a vendor-quality problem; a boundary-placement problem. The fix is upstream: place the boundary where the foreseeable requirements can cross it.
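The month-nine estimate is just a sum of ranges over the migration work items, and it can be computed at selection time as easily as at exit time. In the sketch below only the work-item names come from the timeline; the per-item split is hypothetical, chosen so the totals match the 5 to 7 engineer-month range quoted there.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WorkItem:
    name: str
    low_months: float   # optimistic engineer-months
    high_months: float  # pessimistic engineer-months


def exit_cost_range(items):
    """Sum the optimistic and pessimistic estimates across all work items."""
    low = sum(item.low_months for item in items)
    high = sum(item.high_months for item in items)
    return low, high


# Hypothetical breakdown; only the item names are taken from the timeline.
MIGRATION = [
    WorkItem("re-implement orchestration", 2.0, 3.0),
    WorkItem("re-connect tools", 1.0, 1.5),
    WorkItem("re-home conversation data", 1.5, 2.0),
    WorkItem("re-train support team", 0.5, 0.5),
]

if __name__ == "__main__":
    print(exit_cost_range(MIGRATION))
```

Keeping the estimate as itemized ranges makes it easy to rerun as data accumulates on the vendor side, since the data re-homing item is the one most likely to grow.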

Mini-FAQ. "If we'd just asked the vendor for a roadmap commitment, wouldn't that have solved it?" Rarely. A roadmap commitment moves a feature, not the boundary. Private VPC calls, your-key signing, and free model routing are not features a closed-orchestration SaaS adds without becoming a different product. The boundary is structural, not a backlog item.

7) Cost movement — what each family fixes and what it bills you for¶

| Family | Day-1 build effort | Ongoing ops | Per-task cost | Control retained | Exit cost |
|---|---|---|---|---|---|
| SaaS vertical | days | vendor runs it | high & fixed (~$2/conversation, Agentforce list) | configuration only | high: re-home data, rebuild flows |
| Hyperscaler runtime | weeks (build orchestration) | managed services | medium (tokens + runtime fees) | logic, routing, tools, audit | medium: logic is portable, identity/memory less so |
| Framework self-host | weeks–months | you run it all | lowest at scale (tokens + your infra) | everything | low: you own it, but you also owe the ops forever |

Read the cost columns as a movement, not a ranking. SaaS moves build cost and ops cost to near zero and pays for it with per-task price and exit cost. Framework moves per-task cost to the floor and pays for it with build effort and permanent ops burden. The hyperscaler runtime sits in the middle on every axis, which is exactly why it fits a small team with a high compliance bar and uncertain drift. The price of that middle position is the cloud bill plus a modest build investment, paid in exchange for keeping the boundary movable.

For the bank: SaaS would have been ~$2/conversation with near-zero build, but the exit cost (the wall) made that the most expensive path in the end. The runtime costs more upfront (weeks of orchestration build instead of days of configuration) but keeps the month-eight requirements possible, which is the only cost that mattered.

8) Operational signals — is the build-vs-buy choice still right?¶

Healthy. The platform serves every new requirement without a "the vendor can't do that" conversation. New tools, new model routing, and audit changes land inside your team's control. Cost per task is flat or falling as volume grows.

First metric that degrades. The count of requirements blocked by "the platform doesn't support it" per quarter. When this rises above zero and the items are structural (not backlog features), the boundary is in the wrong place. For the bank, the running total went 1 → 2 → 3 at months 3, 6, and 8: model routing, audit signing, then the private VPC call.
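A sketch of how this counter could be tracked, assuming blocked requirements are logged as simple tuples. The event list is taken from the bank timeline in section 6; the tuple layout and the function names are illustrative, not part of any tool.

```python
from collections import Counter

# (month, requirement, is_structural), taken from the bank timeline in section 6.
BLOCKED = [
    (3, "model routing by intent", True),
    (6, "per-call signing with bank key", True),
    (8, "private VPC tool call", True),
]


def structural_blocks_per_quarter(events):
    """Count structural blocks per quarter, where quarter 1 covers months 1-3."""
    counts = Counter()
    for month, _requirement, structural in events:
        if structural:
            counts[(month - 1) // 3 + 1] += 1
    return dict(counts)


def boundary_misplaced(events):
    """True as soon as any blocked requirement is structural, not a backlog item."""
    return any(structural for _month, _requirement, structural in events)


if __name__ == "__main__":
    print(structural_blocks_per_quarter(BLOCKED))
    print(boundary_misplaced(BLOCKED))
```

The `is_structural` flag is the judgment call that matters: a backlog feature the vendor can ship does not count, while a requirement that needs the boundary itself to move does.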

Misleading metric people watch. Demo success rate and pilot CSAT. These measure day-one capability inside the current boundary, which is exactly what every family aces and exactly what hides the wall. A 4.6 CSAT told the bank nothing about month eight.

The signal an experienced lead checks first. "How many of next year's foreseeable requirements cross the vendor boundary?" — asked at selection time, not after. A lead reads the regulatory roadmap and the product roadmap, lists the requirements, and checks each against the boundary diagram. That list, not the demo, is the decision input.

9) Boundary of applicability — when buying really is right¶

Strong fit for buying (SaaS vertical). A well-defined, stable domain where a vendor has years of specialized work you cannot match — front-line customer support deflection, enterprise knowledge search, contact-center voice. If the use case will live inside that domain and your differentiation is elsewhere, buy. Building your own support-CX engine to compete with Sierra or Decagon on their turf wastes years.

Where buying becomes pathological. When the bought agent becomes a dependency for systems it was never scoped to touch, and data gravity has made leaving a multi-quarter project. The agent quietly becomes core infrastructure with a vendor's boundary running through the middle of it.

Scale / regime that invalidates the naive intuition. "Buy is always faster" holds at the pilot and inverts at the wall. "Build is always more flexible" holds in capability and inverts in operations — a framework you cannot keep running is less flexible than a runtime you can. The intuition that breaks most often is "this is just a simple bot," which underestimates drift in regulated and integration-heavy domains.

10) The wrong model to drop: "buy to move fast, build for control" as a clean binary¶

The seductive wrong idea is that build-vs-buy is a single slider between speed and control, and you pick a point. Reality has three positions and the middle one (hyperscaler runtime) is the one most teams skip, because it is less exciting than "we built it ourselves" and less easy than "we bought it." The runtime lets you buy operations while keeping logic — speed and a movable boundary. The bank's mistake was treating the choice as binary (fast SaaS vs slow build) and never seeing the middle position that fit both its agents.

Replace the binary with the boundary diagram from section 1: three positions, each owning a different slice. Pick by what your foreseeable requirements must cross, not by where you land on a two-ended slider.

11) Other failure shapes around the build-vs-buy line¶

Demo-driven selection — choosing on a polished demo that exercises only day-one, in-boundary capability.
Drift denial — scoping the use case as it is today and assuming it never grows tentacles into private systems or audit.
Resume-driven building — self-hosting a framework because the team wants to, then drowning in on-call for a low-value internal tool.
Buy-then-customize trap — buying a SaaS and trying to bend it into a platform via configuration until the config is harder to maintain than owned code would have been.
Hidden mandatory add-ons — the SaaS price excludes a required data platform (e.g., Agentforce needing Data Cloud), tripling real Year-1 cost.
Two families for one job — running a SaaS agent and a framework agent that must share state, paying integration tax forever because the boundaries don't meet.
Roadmap faith — committing on a promised feature that, even if shipped, doesn't move the structural boundary you actually need crossed.

12) Pattern transfer — where this same decision recurs¶

Model vendor strategy (module 12) — same fit-before-lock-in pressure one layer down: choosing a model provider is also a boundary and an exit-cost calculation, and it nests inside the platform choice.
Legacy AI modernization (module 14) — the wall here is the same failure geometry as a legacy system you cannot patch: a boundary agreed to long ago that the current requirement cannot cross.
Build-vs-buy in classic infra — managed database vs self-hosted vs DBaaS is the identical three-family boundary problem; the levers (control, cost, ops, lock-in) are the same.
Tool integration contracts (module 19) — the private-VPC-call requirement is a tool-contract problem the SaaS boundary forbids; owning the boundary is what lets you honor the contract.

13) The build-vs-buy audit — five yes/no questions¶

Have you listed next year's foreseeable requirements and checked each against the vendor boundary, not just the demo?
Is the use case a stable, solved SaaS domain that will not drift into your private systems, audit, or cost controls?
If you self-host a framework, does the team actually have capacity to run servers, tracing, and on-call for this agent's value tier?
Have you computed exit cost in engineer-months before signing, including data export and team retraining?
Is the boundary you are choosing placed so the requirements in question 1 can cross it without leaving?

If question 1 or 4 is "no," you are choosing by demo, and the wall is in your future.
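The audit is small enough to encode as a checklist. The sketch below assumes answers arrive as five booleans in question order; the shortened question labels and the return shape are illustrative choices.

```python
# Shortened labels for the five audit questions, in order.
AUDIT_QUESTIONS = (
    "Foreseeable requirements listed and checked against the vendor boundary",
    "Use case is a stable, solved SaaS domain that will not drift",
    "Team has capacity to run servers, tracing, and on-call if self-hosting",
    "Exit cost computed in engineer-months before signing",
    "Boundary placed so the listed requirements can cross it",
)


def run_audit(answers):
    """Return (questions answered 'no', whether the choice is being made by demo).

    `answers` is a sequence of five booleans, one per question, in order.
    """
    if len(answers) != len(AUDIT_QUESTIONS):
        raise ValueError("expected one answer per audit question")
    no_answers = [i + 1 for i, yes in enumerate(answers) if not yes]
    # The rule from the text: a 'no' on question 1 or 4 means choosing by demo.
    choosing_by_demo = not answers[0] or not answers[3]
    return no_answers, choosing_by_demo


if __name__ == "__main__":
    print(run_audit([False, True, True, False, True]))
```

A "no" on questions 2 or 3 is not a failure by itself; it steers the family choice. Only questions 1 and 4 trigger the by-demo warning.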

Where this appears in production¶

Hyperscaler runtimes (rent ops, keep logic portable):
- AWS Bedrock AgentCore: runs open frameworks (LangGraph, CrewAI, LlamaIndex, Strands) on a managed runtime, so orchestration stays portable while AWS owns scaling, memory, and identity.
- Google Vertex AI Agent Builder / Agent Engine: managed runtime, GA Sessions and Memory Bank, and ADK for portable orchestration; the boundary sits at your code vs Google's runtime.
- Microsoft Foundry Agent Service: hosted agents let you run Microsoft Agent Framework or LangGraph on Foundry's managed runtime with built-in scaling, observability, and governance.
- A regulated bank: places both a support agent and an internal-ops agent on one hyperscaler family to get one identity, one trace stack, one residency story.

SaaS vertical agents (buy the domain):
- Sierra: packaged customer-service agent for transactional CX (refunds, billing, account changes); the right buy when the domain stays put.
- Decagon: omnichannel support automation; bought when speed and consistency in CX outrank deep control.
- Salesforce Agentforce: buy-the-domain CX/sales agent priced per conversation or per action; the hidden cost is a mandatory data platform.
- NiCE Cognigy: contact-center voice and messaging agents; bought by enterprises whose center is already on that stack.
- Glean: enterprise knowledge search agent; bought when the problem is information fragmentation, not workflow control.

AI-native frameworks (own everything):
- LangGraph: chosen by teams that need cyclic, controllable orchestration and will run the ops themselves.
- CrewAI: role-based crews for teams that want readable multi-agent structure and accept less low-level control.
- LlamaIndex: built upon when the agent is RAG-first and retrieval is the core competency.
- OpenAI Agents SDK: lightweight primitives (agents, handoffs, guardrails) for teams standardizing on OpenAI and owning their runtime.
- Microsoft Agent Framework: the open framework that also runs hosted on Foundry, bridging build and rent.

Where the wall shows up:
- A fintech that bought a closed support SaaS: blocked from in-VPC fraud-service calls and per-call audit signing exactly like the running example.
- A healthcare provider on a vertical agent: hit data-residency and BAA boundaries the vendor's shared cloud could not satisfy.
- A retailer that self-hosted a framework: shipped fast, then spent more on agent on-call than the deflection saved, and migrated to a runtime.

Pause and recall¶

Name the three families and, for each, where the boundary sits (what you own vs the vendor).
Why is "pick the best platform" the wrong frame? What is the right question?
What property most flips the build-vs-buy answer, and why is it hard to estimate?
Walk the bank's wall: which requirement hit the SaaS boundary first, and why was it structural?
Why did the lens place both bank agents on a hyperscaler runtime rather than SaaS or self-hosted framework?
When is buying a SaaS vertical clearly right?
What is the "misleading metric" that hides the wall at selection time?
What is the middle position teams skip, and what does it buy you?

Interview Q&A¶

Q1. A PM wants to ship a support agent on Sierra because the demo was perfect. The domain is a regulated lender. What do you ask before agreeing? A. I list next year's foreseeable requirements against the vendor boundary: private-network tool calls, per-call audit signing, model routing for cost, data residency. In regulated lending these arrive non-negotiably and late. If any cross the closed-orchestration boundary, the demo's speed is buying a future wall. I'd likely place it on a hyperscaler runtime with a portable framework instead. Common wrong answer to avoid: "Ship on Sierra — fastest time to value wins." That optimizes the demo and ignores structural drift in a regulated domain.

Q2. Why not just self-host LangGraph for full control and lowest cost? A. Control and per-task cost favor self-hosting, but the scarce resource is the team. A six-person platform team running autoscaling, memory stores, trace pipelines, identity wiring, and 24×7 on-call for a modest-volume agent spends its attention on infrastructure instead of the product. A hyperscaler runtime rents that ops while keeping the orchestration portable. Self-hosting wins when volume is huge and the per-call platform fee dominates the bill. Common wrong answer to avoid: "Self-hosting is always cheaper." It's cheaper per token at scale and far more expensive in engineer-months for a small team.

Q3. What does "the boundary" mean in a build-vs-buy decision, and why does it matter more than features? A. The boundary is the line between what you control and what the vendor controls — orchestration, model calls, state, identity. Features are inside-the-boundary capabilities every family aces at demo time. The boundary is structural: whatever the vendor owns, you cannot change without their roadmap or without leaving. Walls form at the boundary, not at missing features. Common wrong answer to avoid: "Pick by feature checklist." Checklists score day-one capability and miss the structural line that causes the wall.

Q4. The bank has two agents that look similar. Why might they belong to different families — and why did they end up in the same one? A. You place by boundary, not by appearance. Both looked like SaaS buys. Agent 1 (support) was destined to drift past the SaaS boundary into private calls and audit; Agent 2 (internal compliance memos) has no SaaS category at all. Both needed a controllable boundary, so both landed on the hyperscaler runtime — and keeping them in one family gives one identity, one trace stack, one residency story. Common wrong answer to avoid: "Customer-facing buys, internal builds." The customer-facing/internal split doesn't track the boundary; drift and category existence do.

Q5. How do you compute exit cost before signing, and why before? A. Estimate engineer-months to re-implement orchestration, re-connect tools, re-home conversation and memory data (checking what's exportable), and retrain the operating team — plus the calendar time the data gravity adds. Compute it before signing because after signing the data has accumulated, the team has habits, and the number only grows. Exit cost is the price of the boundary; you should know it when you draw the line. Common wrong answer to avoid: "We'll figure out migration if we ever need to." That's exactly when it's most expensive and least optional.

Q6. A vendor promises your blocking requirement on their Q3 roadmap. Do you commit? A. Only if the requirement is a feature inside their existing boundary. If it requires moving the boundary — exposing model calls for your-key signing, allowing private-VPC tools, opening model routing — a roadmap promise rarely lands because it would make them a different product. I distinguish backlog features from structural boundary changes and only trust the former. Common wrong answer to avoid: "Yes, roadmap commitment de-risks it." Roadmaps move features, not boundaries; the structural need stays blocked.

Q7. Is this a build-vs-buy mistake, a topology mistake (module 01), or a cost mistake (file 05)? A. Diagnose by symptom. If the agent's shape is wrong (single-call for multi-step work), it's a topology bug. If the per-task cost is fine but new requirements keep hitting "vendor can't do that," it's a build-vs-buy boundary mistake. If everything works but the bill explodes at scale, it's a cost-curve mistake. The bank's wall — structural "can't do that" items — is squarely build-vs-buy. Common wrong answer to avoid: "It's a model problem, upgrade the model." Model quality doesn't move a vendor boundary or a cost curve.
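The symptom-based diagnosis in Q7 can be sketched as a short triage function. The boolean inputs and the order of the checks are assumptions made for illustration; in practice more than one diagnosis can apply at once.

```python
def diagnose(shape_fits_task: bool, blocked_by_vendor: bool,
             bill_explodes_at_scale: bool) -> str:
    """Map observed symptoms to the kind of mistake, checking shape first."""
    if not shape_fits_task:
        # Example: a single-call agent asked to do multi-step work.
        return "topology mistake"
    if blocked_by_vendor:
        # New requirements keep hitting "the vendor can't do that".
        return "build-vs-buy boundary mistake"
    if bill_explodes_at_scale:
        return "cost-curve mistake"
    return "no platform-level mistake identified"


if __name__ == "__main__":
    # The bank: the agent shape was fine, but structural requirements were blocked.
    print(diagnose(shape_fits_task=True, blocked_by_vendor=True,
                   bill_explodes_at_scale=False))
```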

Design/debug exercise (10 min)¶

Step 1 — Model it. Place the bank's support agent with the lens:

Use case: support agent, regulated lender, web + WhatsApp
Foreseeable requirements (next 24 mo): private fraud-service call, per-call audit
  signing, model routing for cost, data residency in-region
First gate (stable SaaS domain that won't drift?): NO — drift is high, regulated
Second gate (need control over orchestration/model/audit?): YES
Third gate (team can run servers + on-call?): NO (6-person team)
→ Family: hyperscaler runtime + portable framework. Boundary: my code vs their machine.
Exit cost if wrong: medium — orchestration portable, memory/identity less so.

Step 2 — Your turn. Take the bank's internal-ops agent (KYC pull + compliance memos) and run the same steps: foreseeable requirements, three gates, chosen family, and exit cost. Then take one agent from your own backlog and do it a third time. The win is naming the boundary and the exit cost before any demo.

Step 3 — Reproduce from memory. Redraw the boundary diagram from section 1 (you own ◀ boundary ▶ vendor owns, three rows) cold, and connect it to module 01: the agent shape lives inside whichever row you pick, and a closed SaaS row hides the shape entirely. If you can draw the three rows and explain which requirement crosses each boundary, you own this chapter.

Operational memory¶

This chapter explained the felt failure where a team ships fast on a platform and then hits a wall the platform structurally cannot patch. The important idea is that every platform choice places a boundary between what you control and what the vendor owns, and the wall forms at that boundary — not at a missing feature. Demos test only inside-the-boundary capability, which is why a 4.6 CSAT pilot tells you nothing about month eight.

You learned to make the decision with a lens: list next year's foreseeable requirements, check each against the vendor boundary, weigh the four levers, and compute exit cost before signing. That solves the opening failure because the bank's blocking requirements (private calls, audit signing, model routing) were all foreseeable and all crossed the SaaS boundary — visible on day one if anyone had drawn the line.

Carry this diagnostic forward: when a platform "can't do that," ask whether it's a backlog feature or a structural boundary. If structural, no roadmap fixes it; the boundary was placed wrong. Bias toward more drift than the demo implies, especially in regulated domains.

Remember:

Three families = three boundary positions: framework (own all), hyperscaler runtime (own logic, rent ops), SaaS vertical (own config only).
The wall forms at the vendor boundary, not at a missing feature; demos hide it because they test only inside the boundary.
Drift is the property that flips the decision; assume more of it than the pilot shows, especially when regulated.
The middle position (hyperscaler runtime) is the one teams skip — it buys ops while keeping logic portable.
Compute exit cost in engineer-months before signing; data gravity only makes it grow.
A roadmap promise moves a feature, never the boundary.

Bridge. We chose the family by boundary — both bank agents land on a hyperscaler runtime. But "hyperscaler runtime" is three very different products, and "framework" and "SaaS" are a dozen more, each optimizing for something different. Before we can score one against another, we need the map: who the real players are in each family and what each one is actually built to do. The next file draws that landscape. → 02-platform-landscape-map.md