
04. Lock-in and portability — scoring the cost of leaving, not the cost of staying

~18 min read. The rubric cleared AgentCore and gravity picked it. But the rubric never asked the one question that decides whether a good platform becomes a trap: when you want to leave, what does it cost, and can you even export your own data?

Built on 03-capability-evaluation-rubric.md. The rubric scored what each platform can do and deliberately excluded one dimension: exit cost. This file scores that excluded dimension. The pressure is fit-before-lock-in seen from the far side — once the platform is chosen, the boundary quietly hardens as data gravity accumulates, and the only defense is the escape hatch you build on day one.


What the rubric refused to score

By now you can place a platform in one of the three families, read where it sits on the map by data gravity, and produce a defensible capability rubric that clears poor fits. The rubric measured nine things the agent can do. It said nothing about what happens when month eighteen brings a price hike, a better platform, an acquisition that kills the product, or a regulator who demands you move state to a new region the vendor doesn't serve.

The build-vs-buy file named exit cost as a lever and even estimated it (5–7 engineer-months for the bank's hypothetical SaaS migration). It did not show you how that number is built, or which architecture choices shrink it. That is this file. You will learn to decompose exit cost into the specific surfaces that fuse you to a platform, and to design the escape hatch — the choices that keep leaving cheap even on a platform you intend to stay on for years.

What this file solves

A team picks a capable platform, ships, and a year later finds it cannot leave: the orchestration is written in the vendor's closed flow language, the prompts and tools use a proprietary format, conversation logs and long-term memory live in the vendor's store with no full export, and the agent's identity is wired to the vendor's IAM. None of these blocked anything until the day they wanted out — and then all four blocked at once. This file decomposes lock-in into four named surfaces, shows how to compute exit cost in engineer-months before signing, and gives the concrete architecture moves that keep the escape hatch open. The first move is to stop asking "how good is this platform" and start asking "if this platform doubled its price tomorrow, how many engineer-months and how much un-exportable data stand between us and the exit."

Why "we can always migrate later" is the lie that traps you

The instinct after a clean rubric win is relief: the platform is good, we can always move later if it disappoints. That sentence is true on day one and false on day three hundred, for the same reason a database migration is trivial with ten rows and a quarter-long project with ten billion. Lock-in is not a switch the vendor flips. It accumulates. Every conversation logged, every long-term memory written, every tool wired in the vendor's format, every prompt tuned against the vendor's quirks raises the wall a little. By the time you want to leave, the wall is built — and you built it, one trajectory at a time.

So the question is never "can we migrate later." Migration is always possible; the only variable is the price. The real question is "what is that price, what makes it grow, and which choices today keep it small?"

When a portable orchestration is still trapped by its data

The bank did the right thing on orchestration. It runs LangGraph — a framework it owns and can move — on AgentCore. File 02 called this the smart composition: portable logic on rented ops. So the bank assumed it was safe from lock-in.

Then a residency rule changed. The regulator now requires the internal-ops agent's long-term memory — every KYC interaction and compliance-memo draft — to move to a new sovereign-cloud region the bank was standing up. The orchestration moved in an afternoon: redeploy the same LangGraph graph elsewhere. But the memory did not.

Migration attempt: move internal-ops agent to sovereign region

Orchestration (LangGraph graph):   moved in 1 afternoon. Portable. ✓
Tool wiring (Gateway connectors):  re-pointed in 2 days. Mostly portable. ✓
Long-term memory (AgentCore Memory): 14 months of accumulated memory records.
   Export API returns processed memory summaries, NOT the raw event stream
   they were derived from. Re-deriving on the new platform changes the
   summaries → agent behavior shifts → re-validation required. ✗
Identity (IAM roles, KMS key bindings): re-issued; every tool's auth
   re-bound to new region's KMS. 3 weeks. Partial. ~

The orchestration was portable and the data was not. So the real lock-in was never the code; it was the state that accumulated behind the boundary. A portable graph that reads from a non-portable memory store is still trapped; the trap just sits in a different surface. The escape hatch the bank built (owned orchestration) covered one of four surfaces and left the other three exposed.

So how do you keep all four surfaces movable, on a platform you fully intend to keep using? You name each surface, price its exit, and design each to stay portable before the data piles up behind it.

The exit-cost rule. Evaluate a platform by the cost of leaving it, not only the cost of using it. Exit cost is the sum across four surfaces — orchestration, prompt/tool formats, data, identity — and it grows with every trajectory you run. The escape hatch is the set of day-one choices that keep each surface portable; build it while exit cost is still zero.

Why this rule exists. The primitive is that a platform's value comes from owning state and infrastructure for you, and owned state is exactly what cannot be cheaply un-owned. The constraint is asymmetry: a platform makes adoption frictionless and exit expensive, because frictionless adoption sells and expensive exit retains. The naive approach — "migrate later" — assumes exit cost is constant when it grows monotonically with usage, so the cheapest moment to make leaving cheap is before you have anything to take with you.


1) The four lock-in surfaces — where you actually fuse to a platform

Lock-in is not one thing. It is four distinct surfaces, each with a different exit cost and a different escape hatch. Score them separately, because a platform can be wide open on one and welded shut on another.

1. Orchestration lock-in. How the agent's logic is expressed. The trap is a proprietary, closed flow language — a visual builder or vendor DSL where your branching, loops, and state transitions exist only inside the vendor's product and cannot be exported as portable code. The escape hatch is owning the orchestration in a portable framework (LangGraph, an open Agent Framework) so the logic is your code, runnable elsewhere. This is the surface the bank got right.

2. Prompt and tool-format lock-in. How prompts and tool/function schemas are written. The trap is vendor-specific prompt templates, tool-definition formats, and connector schemas that don't transfer — re-writing every tool definition and re-tuning every prompt for a new platform's quirks. The escape hatch is standard, portable formats: tool schemas expressed as MCP servers, prompts kept as plain templated text you own, not buried in the vendor's builder.

3. Data lock-in (data gravity). Where conversation logs, session state, long-term memory, and embeddings live, and whether you can export them whole. The trap is a store you cannot fully export — the partial export the bank hit, where processed summaries come out but the raw events that produced them do not. This is the heaviest surface because it grows every second the agent runs and it is the hardest to fake-fix later. The escape hatch is owning the source of truth: write trajectories and memory to your store and treat the platform's memory as a derived cache, or insist on full raw export as a contract term before signing.

4. Identity and infra lock-in. How the agent authenticates, signs calls, and reaches tools — the vendor's IAM, KMS bindings, networking, and billing identity. The trap is that every tool's auth, every signed model call, and every network path is wired to one cloud's identity model. The escape hatch is hard here, because identity is inherently cloud-specific; the realistic move is to keep the abstraction (a thin auth layer your code calls) rather than scattering vendor identity calls through the agent.
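The thin auth layer from surface 4 can be sketched as follows. This is a minimal illustration, not the bank's implementation: `ToolCredentials`, `AuthProvider`, `StaticAuthProvider`, and `call_tool` are hypothetical names, and the stand-in provider returns fake strings where a real one would wrap the cloud's IAM/KMS SDK calls.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass(frozen=True)
class ToolCredentials:
    """What a tool call needs, stripped of any cloud-specific type."""
    token: str
    key_id: str


class AuthProvider(Protocol):
    """The only auth surface the agent code is allowed to touch."""
    def credentials_for(self, tool_name: str) -> ToolCredentials: ...


class StaticAuthProvider:
    """Stand-in provider for this sketch (hypothetical). A real provider
    would wrap the cloud's IAM/KMS calls, and those calls would live only here."""

    def __init__(self, region: str) -> None:
        self.region = region

    def credentials_for(self, tool_name: str) -> ToolCredentials:
        return ToolCredentials(
            token=f"token:{self.region}:{tool_name}",
            key_id=f"key:{self.region}",
        )


def call_tool(auth: AuthProvider, tool_name: str, payload: dict) -> dict:
    """Agent-side code: depends on AuthProvider, not on any vendor SDK."""
    creds = auth.credentials_for(tool_name)
    return {"tool": tool_name, "signed_with": creds.key_id, "payload": payload}


# Moving regions (or clouds) swaps the provider, not the call sites.
old = call_tool(StaticAuthProvider("region-a"), "kyc_lookup", {"id": 7})
new = call_tool(StaticAuthProvider("region-b"), "kyc_lookup", {"id": 7})
```

The design choice is that `call_tool` never imports a vendor identity library, so a migration edits one provider class rather than every tool call.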

SURFACE              TRAP (welded shut)                ESCAPE HATCH (kept open)
──────────────────   ───────────────────────────────  ────────────────────────────────
Orchestration        closed visual flow / vendor DSL   own logic in portable framework
Prompt/tool format   vendor templates & tool schemas   MCP tool servers + owned prompts
Data (gravity)       store with partial/no export      own source of truth, platform = cache
Identity/infra       IAM/KMS/network welded in         thin auth abstraction, isolate calls

   exit cost  =  Σ(engineer-months to rebuild each surface)  +  un-exportable data penalty
   and every one of these grows with usage — heaviest on the data row

The diagram is the file. Four surfaces, four traps, four hatches. The bank built the hatch on row one and left rows two through four exposed — which is why a portable orchestration still couldn't move its memory.


2) Picture first — exit cost as a wall that rises while you stay

flowchart TD
    A[Day 1: exit cost ~ 0<br/>nothing accumulated yet] --> B[Build escape hatches now<br/>portable orch, MCP tools, owned data, auth abstraction]
    A --> C[Skip hatches<br/>use vendor flow, vendor tools, vendor store]
    B --> D[Month 18: exit cost stays LOW<br/>logic moves, data is yours, tools re-point]
    C --> E[Month 18: exit cost HIGH<br/>rebuild flow, re-write tools, data half-stuck]
    E --> F{Price hike / vendor pivot /<br/>residency change arrives}
    F --> G[Trapped: leaving costs quarters<br/>so you accept the new terms]
    D --> H{Same event arrives}
    H --> I[Free to leave or negotiate<br/>from strength]

The two paths start identically — a working agent on a chosen platform — and diverge only when an external event forces the question. The team that built hatches negotiates from strength; the team that didn't accepts whatever terms arrive, because the alternative is a multi-quarter rebuild. Notice the escape hatch costs almost nothing on day one and is nearly impossible to retrofit once data has piled up. That asymmetry is the whole reason to build it early.


3) The bank's two agents through the four surfaces — one running example

The bank intends to stay on AgentCore. The point of scoring lock-in is not to leave; it is to keep leaving possible so the bank negotiates renewals from strength and survives a forced move. Walk each surface for both agents.

Attempt A — the tempting move: "we own the orchestration, so we're portable"

The bank's first lock-in review checked one box: orchestration is LangGraph, owned, portable. Conclusion: low lock-in, move on. This is the trap the residency change exposed — it scored one surface and assumed the other three followed.

Attempt B — the right move: score all four surfaces and build the missing hatches

Orchestration. LangGraph on AgentCore. Hatch already open — the graph runs on any runtime that hosts LangGraph, or self-hosted. Exit cost: low (days). ✓

Prompt/tool format. The bank wired its in-VPC fraud and KYC tools through AgentCore Gateway. Gateway speaks MCP, so the tool schemas are portable: re-pointing them at another MCP-aware runtime is configuration, not a rewrite. But the bank's prompts were being tuned by trial and error against Bedrock model quirks. The fix: keep prompts as owned, versioned templates in the repo, model-agnostic where possible. Exit cost after fix: low-medium (days to re-point tools, some prompt re-tuning per model).

Data. The heavy surface. AgentCore Memory accumulates short-term events and long-term memory records. The export gives processed summaries, not the raw event stream. The bank's fix: dual-write — every trajectory and every memory-relevant event is written to the bank's own store (S3 + a database it owns) as the source of truth, and AgentCore Memory is treated as a fast derived cache. If the bank leaves, it re-derives memory on the new platform from its own raw events. Exit cost after fix: medium and bounded (re-derivation is mechanical), versus high-and-unbounded before.

Identity/infra. Wired to AWS IAM and KMS — inherently AWS-specific. The bank cannot make IAM portable, but it isolates auth behind a thin internal auth module so a future migration changes one module, not every tool call. Exit cost: medium, and unavoidable for any cloud (this is the price of cloud-native security).

Surfaces scored for both bank agents (1 = welded shut, 5 = fully portable)

                      support agent    internal-ops agent   exit cost after hatch
Orchestration              5                  5              low (days)
Prompt/tool format         4                  4              low-med (re-point MCP tools)
Data (gravity)             4*                 4*             medium, BOUNDED (dual-write)
Identity/infra             2                  2              medium (cloud-specific, isolated)
                                                             ──────────────────────────────
                                          *4 only AFTER dual-write; was 2 before

Verdict: the bank stays on AgentCore, but now with three hatches it didn't have before — owned prompts, dual-written data, isolated auth. The lock-in review didn't change the platform; it changed the architecture on top of the platform so the platform is no longer a trap.
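As a rough sketch of the prompt/tool hatch named in the verdict, here is what "owned prompts" and a portable tool definition might look like as plain data in a repo. The prompt text, the `fraud_check` tool, and the helper names are hypothetical; the tool dict uses the `name`, `description`, and `inputSchema` fields that an MCP tool listing carries.

```python
import json
from string import Template

# Prompt kept as plain templated text in the repo (hypothetical content).
KYC_SUMMARY_PROMPT = Template(
    "Summarize the KYC interaction for customer $customer_id.\n"
    "Cite only facts present in the transcript:\n$transcript"
)

# Tool described as plain data: a name, a description, and a JSON Schema
# for the input. The tool itself is a made-up example.
FRAUD_CHECK_TOOL = {
    "name": "fraud_check",
    "description": "Score a transaction for fraud risk.",
    "inputSchema": {
        "type": "object",
        "properties": {
            "transaction_id": {"type": "string"},
            "amount": {"type": "number"},
        },
        "required": ["transaction_id", "amount"],
    },
}


def render_prompt(template: Template, **values: str) -> str:
    """Render with substitute() so a missing variable fails loudly."""
    return template.substitute(**values)


def missing_required(tool: dict, arguments: dict) -> list[str]:
    """Return required argument names absent from a proposed call."""
    required = tool["inputSchema"].get("required", [])
    return [name for name in required if name not in arguments]


prompt = render_prompt(
    KYC_SUMMARY_PROMPT, customer_id="C-102", transcript="(transcript text)"
)
gaps = missing_required(FRAUD_CHECK_TOOL, {"transaction_id": "T-9"})
tool_as_json = json.dumps(FRAUD_CHECK_TOOL)  # plain data, versionable in the repo
```

Nothing here depends on a vendor builder, which is the point: the prompt and the schema survive a platform change as files, and only the re-tuning per model remains.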

Teacher voice. Read what the dual-write did. It cost the bank a little extra write traffic and one owned store (S3 plus a database). In return it moved the data surface from "score 2, exit cost unbounded" to "score 4, exit cost bounded." The single highest-leverage lock-in move is almost always owning the source of truth for your state, because data is the surface that grows forever and exports worst. Build that hatch first.
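A minimal sketch of the dual-write idea, under stated assumptions: `OwnedEventLog` stands in for the bank's own store and `PlatformMemoryCache` for the platform's memory, both in-memory and hypothetical, and the "derivation" is a toy per-session event count rather than real memory summarization.

```python
import json
from dataclasses import dataclass, field


@dataclass
class OwnedEventLog:
    """Stand-in for the owned store: append-only raw events."""
    lines: list[str] = field(default_factory=list)

    def append(self, event: dict) -> None:
        self.lines.append(json.dumps(event, sort_keys=True))

    def replay(self) -> list[dict]:
        return [json.loads(line) for line in self.lines]


@dataclass
class PlatformMemoryCache:
    """Stand-in for the platform's memory: derived and rebuildable."""
    summaries: dict[str, int] = field(default_factory=dict)

    def ingest(self, event: dict) -> None:
        # Toy derivation: count events per session.
        sid = event["session_id"]
        self.summaries[sid] = self.summaries.get(sid, 0) + 1


def record_event(event: dict, owned: OwnedEventLog, cache: PlatformMemoryCache) -> None:
    """Write the raw event to the owned log first, then feed the cache."""
    owned.append(event)
    cache.ingest(event)


def rebuild_cache(owned: OwnedEventLog) -> PlatformMemoryCache:
    """On a new platform, re-derive memory from the owned raw events."""
    fresh = PlatformMemoryCache()
    for event in owned.replay():
        fresh.ingest(event)
    return fresh


owned, cache = OwnedEventLog(), PlatformMemoryCache()
for sid in ["s1", "s1", "s2"]:
    record_event({"session_id": sid, "text": "..."}, owned, cache)
rebuilt = rebuild_cache(owned)
```

In this toy the rebuilt cache matches the original exactly because the derivation is deterministic. A real derivation step may not be, which is why the owned raw events, not the derived summaries, are the thing worth guaranteeing.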


4) Why dual-write, not "demand full export" — under this workload

The plausible alternative to dual-writing your own state is simpler: just require full raw export as a contract term and pull the data when you leave. Why carry the dual-write cost continuously instead of a one-time export at exit?

Because the contract term and the export reality diverge exactly when you need them most. A "full export" clause is only as good as the vendor's export API on the day you invoke it — and that API is built to satisfy the letter of the clause (here is an export) not the spirit (here is everything, in a form you can re-ingest). The bank's residency event proved this: the export returned processed summaries, technically a "full export of memory," missing the raw events that made the summaries reproducible. By the time you discover the gap, the data is already trapped and you have no leverage.

Dual-write pays a small continuous cost to remove that risk entirely: the source of truth never lived behind the boundary, so there is nothing to export. For the bank — regulated, with residency rules that arrive late and non-negotiably — paying a steady small premium to guarantee the data is always yours beats betting a quarter-long migration on a vendor's export API behaving generously under pressure. If the workload were a throwaway internal prototype with no residency risk, the contract-term-and-hope approach would be fine; the dual-write earns its cost only when the data is regulated, valuable, or large.


5) The property that flips exit cost: data gravity over time

The single dimension that most changes exit cost is how much state has accumulated behind the boundary — data gravity as a function of time. Orchestration exit cost is roughly flat (the graph is the same size whether you've run it once or a million times). Identity exit cost is roughly flat. But data exit cost climbs with every trajectory, and it climbs fastest exactly where export is worst.

Exit cost over time, by surface:

Orchestration  ────────────────────────────────  flat (logic doesn't grow)
Identity/infra ────────────────────────────────  flat (cloud-specific, fixed)
Prompt/tool    ──────╮ then flat after tools wired
Data gravity   ╱╱╱╱╱╱╱╱╱╱╱╱╱╱╱╱╱╱╱╱╱╱╱╱╱╱╱╱╱╱  climbs forever, exports worst
                t0                            t+18mo

This is why "we can always migrate later" inverts over time. At t0 every surface is cheap. By t+18mo three surfaces cost roughly what they did once they were set up, and one (data) has quietly become the bulk of the cost. The bank's residency migration took an afternoon for orchestration and weeks for memory, and the ratio only worsens with scale. Bias every lock-in review toward the data surface, because it is the one that betrays the "migrate later" assumption.
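The shape of those curves can be put into a toy model. Every coefficient below is an illustrative assumption chosen only to reproduce the shapes in the chart (flat, rise-then-flat, and steadily climbing); none of them is a measurement from the bank or any platform.

```python
def exit_cost_by_surface(month: int) -> dict[str, float]:
    """Toy exit cost per surface, in engineer-months (assumed coefficients)."""
    return {
        "orchestration": 0.2,                 # flat: the logic does not grow
        "identity": 0.5,                      # flat: cloud-specific, fixed
        "prompt_tool": min(month, 3) * 0.2,   # rises while tools are wired, then flat
        "data": 0.25 * month,                 # climbs with every month of usage
    }


def data_share(month: int) -> float:
    """Fraction of total exit cost attributable to the data surface."""
    costs = exit_cost_by_surface(month)
    total = sum(costs.values())
    return costs["data"] / total if total else 0.0


shares = {m: round(data_share(m), 2) for m in (0, 6, 18)}
```

Whatever the real coefficients are, the qualitative result is the same as long as only the data term depends on time: its share of the total rises toward one.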


6) The lock-in failure walked through deeply — a vendor pivot, end to end

Replay a different forcing event than residency, because the shape recurs: a vendor pivot.

Month 0   Team ships on a capable SaaS-adjacent platform. Orchestration in the
          vendor's visual builder (no portable export). Tools in vendor format.
          Conversation + memory in vendor store. Identity = vendor IAM.
          Exit cost ~0. Nobody scores it. "We can migrate later."
Month 6   Heavy usage. 4M conversations logged in vendor store. Memory rich.
Month 11  Vendor announces it's pivoting the product line and sunsetting the
          tier the team is on in 9 months. Forced migration, not optional.
Month 12  Migration scoped:
            - orchestration: visual flow has no code export → rebuild from scratch
              in a portable framework. ~4 engineer-months.
            - tools: re-write every tool definition in MCP. ~1.5 engineer-months.
            - data: export API returns conversations but not the structured
              memory; 4M conversations export as flat JSON, re-ingestion lossy.
              ~3 engineer-months + behavior re-validation.
            - identity: re-wire every auth path to new platform. ~2 eng-months.
          Total: ~10 engineer-months, on a 9-month forced clock. Slips.
Month 20  Sunset deadline arrives. Migration limps in late. Several months of memory effectively lost.

Nothing here is a vendor betrayal — products get sunset, prices change, companies get acquired. The failure is that the team scored zero of the four surfaces and built zero hatches, so a foreseeable external event (vendors pivot; it is not rare) turned into a ten-engineer-month fire drill on a clock the team didn't control. Not a vendor-reliability problem; a hatch-absence problem. The fix is upstream and cheap: build the four hatches when exit cost is still zero.
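The Month 12 scoping is the exit-cost sum in action. A small sketch, using the four estimates from the timeline; `SurfaceEstimate` and `exit_cost` are hypothetical names, and the un-exportable data penalty is left at zero because the timeline does not price it in engineer-months.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SurfaceEstimate:
    surface: str
    engineer_months: float


def exit_cost(estimates: list[SurfaceEstimate], unexportable_penalty: float = 0.0) -> float:
    """Sum of per-surface rebuild effort plus an optional data penalty."""
    return sum(e.engineer_months for e in estimates) + unexportable_penalty


# Estimates taken from the Month 12 scoping in the timeline.
scoped = [
    SurfaceEstimate("orchestration", 4.0),
    SurfaceEstimate("prompt_tool", 1.5),
    SurfaceEstimate("data", 3.0),
    SurfaceEstimate("identity", 2.0),
]

total = exit_cost(scoped)
largest = max(scoped, key=lambda e: e.engineer_months)
shares = {e.surface: e.engineer_months / total for e in scoped}
```

Here `total` comes to 10.5, the "~10 engineer-months" in the timeline. Running the same sum at month 0, before anything had accumulated, is the cheap version of this exercise.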

Mini-FAQ. "Isn't building escape hatches just premature optimization we'll never need?" No. The prompt/tool hatch (owned prompts, MCP tools) costs almost nothing, the orchestration hatch is logic you would write anyway, and the data hatch (owning your source of truth) pays off the first time you need any analytics, eval, or audit on your own trajectories, migration or not. You build those three for daily value; portability is the bonus you get for free. The only one you build purely for exit insurance is the auth abstraction.


7) Cost movement — what the escape hatch costs and what it buys

Surface             Hatch                                Day-1 cost of hatch                    Exit cost WITHOUT hatch                  Exit cost WITH hatch
──────────────────  ───────────────────────────────────  ─────────────────────────────────────  ───────────────────────────────────────  ────────────────────
Orchestration       own logic in portable framework      small (you'd write logic anyway)       rebuild flow: 3–5 eng-months             days
Prompt/tool format  MCP tools + owned prompts            near zero                              re-write all tools: 1–2 eng-months       days (re-point)
Data (gravity)      dual-write your own source of truth  small continuous write cost + 1 store  rebuild/re-derive: 3+ eng-months, lossy  medium, bounded
Identity/infra      thin auth abstraction                small (one module)                     re-wire every call: 2+ eng-months        medium, one module

Read it as a movement. Each hatch trades a small, mostly one-time day-one cost for a large, growing exit cost avoided. The data row is the standout: a steady trickle of extra writes against one owned store converts an unbounded, lossy, multi-quarter migration into a bounded mechanical one. The subsystem that absorbs the hatch cost is your own infrastructure (one extra store, one auth module, a little discipline on prompt and tool formats) — and what it buys is negotiating leverage and survival of a forced move. For the bank, the dual-write's continuous cost is trivial against the certainty that residency rules will change again in a regulated domain.


8) Operational signals — is the escape hatch still open?

Healthy. Your orchestration runs unchanged on at least one alternative runtime in a periodic portability test. Your trajectories and memory exist in a store you own, independent of the platform. Tool schemas are MCP and re-point with config. Nobody can name a surface that would take more than "weeks" to move.

First metric that degrades. The fraction of agent state that exists only behind the vendor boundary. The moment long-term memory, structured logs, or tuned prompts live solely in the vendor's store with no owned copy, the data hatch is closing and exit cost is climbing silently. Track "percent of state with an owned source of truth"; healthy is ~100%, and any drop is the hatch leaking.
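One way to track "percent of state with an owned source of truth" is a simple inventory check. This is a sketch under assumptions: `StateArtifact` and the sample inventory are hypothetical, and it counts by kind of state rather than by records or bytes, which a real check might prefer.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StateArtifact:
    name: str
    in_vendor_store: bool
    in_owned_store: bool


def owned_fraction(artifacts: list[StateArtifact]) -> float:
    """Fraction of state kinds that have an owned copy."""
    if not artifacts:
        return 1.0
    owned = sum(1 for a in artifacts if a.in_owned_store)
    return owned / len(artifacts)


def vendor_only(artifacts: list[StateArtifact]) -> list[str]:
    """State that exists solely behind the vendor boundary."""
    return [a.name for a in artifacts if a.in_vendor_store and not a.in_owned_store]


# Hypothetical inventory for illustration.
inventory = [
    StateArtifact("trajectories", in_vendor_store=True, in_owned_store=True),
    StateArtifact("long_term_memory", in_vendor_store=True, in_owned_store=False),
    StateArtifact("prompts", in_vendor_store=False, in_owned_store=True),
    StateArtifact("structured_logs", in_vendor_store=True, in_owned_store=True),
]

fraction = owned_fraction(inventory)
leaking = vendor_only(inventory)
```

In this sample the fraction is 0.75 and the leak is `long_term_memory`, which is exactly the kind of drop the metric is meant to surface.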

Misleading metric people watch. "Do we use a portable framework?" — a yes here feels like portability but covers only the orchestration surface, exactly the bank's first mistake. Orchestration portability with welded data is still a trap.

The signal an experienced lead checks first. A live exit drill: "if we had to leave this platform in 90 days, write the migration plan and the engineer-month estimate per surface, today." A lead runs this at selection and re-runs it annually. The estimate is the exit cost; if it has grown since last year, find which surface grew and re-open its hatch.
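The year-over-year comparison in the exit drill can be sketched as a diff of two per-surface estimates. The function name and the sample numbers are hypothetical, chosen only to show the mechanics.

```python
def surfaces_that_grew(
    last_year: dict[str, float],
    this_year: dict[str, float],
    tolerance: float = 0.0,
) -> list[str]:
    """Surfaces whose engineer-month estimate grew, largest growth first."""
    def growth(surface: str) -> float:
        return this_year[surface] - last_year.get(surface, 0.0)

    grew = [s for s in this_year if growth(s) > tolerance]
    return sorted(grew, key=growth, reverse=True)


# Hypothetical drill results, in engineer-months per surface.
drill_last_year = {"orchestration": 0.2, "prompt_tool": 0.5, "data": 1.0, "identity": 1.0}
drill_this_year = {"orchestration": 0.2, "prompt_tool": 0.5, "data": 2.5, "identity": 1.0}

to_reopen = surfaces_that_grew(drill_last_year, drill_this_year)
```

With these numbers `to_reopen` contains only `data`, the surface whose hatch needs re-opening.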


9) Boundary of applicability — when lock-in barely matters

Strong fit for aggressive escape-hatch design. Regulated domains, long-lived agents, valuable accumulated state, and any platform whose pricing or roadmap you don't control. The bank — regulated, residency-exposed, multi-year — is the textbook case for owning all four surfaces.

Where escape-hatch obsession becomes pathological. A throwaway prototype, a six-week experiment, or a low-value internal tool where the agent will be deleted before any forced event arrives. Building a dual-write store and an auth abstraction for a prototype that dies in a month is the inverse waste — paying continuous portability cost for state you'll never need to move. Match hatch investment to the agent's expected lifespan and the value of its accumulated state.

Scale / regime that invalidates the intuition. "We own the orchestration, so we're portable" holds at zero accumulated data and inverts at scale — at millions of trajectories, the data surface dominates exit cost and the orchestration surface is a rounding error. The intuition that breaks most often is treating lock-in as a single number ("how locked in are we?") when it is four numbers that grow at different rates; the data number is the one that ambushes you.


10) The wrong model to drop: "lock-in is a property of the platform"

The seductive wrong idea is that lock-in is something the platform has — that "open" platforms are safe and "proprietary" ones are traps, and your job is to pick an open one. Reality: lock-in is a property of your architecture on the platform. The bank ran an open framework on a managed runtime and still trapped its data, because portability is decided surface by surface by your choices — where you keep the source of truth, what format your tools use, how you wire auth — not by a label on the vendor.

Replace "pick an open platform" with "build open architecture on whatever platform you pick." A proprietary platform with your data dual-written to your own store and your orchestration owned is less locked-in than an open platform where you let all four surfaces fuse to the vendor. The hatch is yours to build or skip, regardless of the vendor's openness.


11) Other lock-in failure shapes

  • Orchestration-only portability — owning the framework and assuming that means portable, while data, tools, and identity weld silently (the bank's first error).
  • Export-clause faith — trusting a "full export" contract term that the vendor's API satisfies in letter, not spirit, discovered only at exit.
  • Memory that won't round-trip — exporting processed summaries that can't be re-ingested to reproduce the same agent behavior, forcing re-validation.
  • Prompt drift into the vendor's quirks — tuning prompts so tightly to one model's behavior that they must be re-tuned wholesale to move, an invisible prompt-format lock-in.
  • Tool sprawl in vendor format — dozens of tools defined in a proprietary connector schema, each a rewrite to move, instead of MCP servers that re-point.
  • Identity scattered through the code — vendor IAM/KMS calls sprinkled across the agent instead of behind one auth module, turning a migration into a thousand edits.
  • Gravity-blind cost reviews — re-estimating exit cost without noticing the data surface grew 10× since last year while the others stayed flat.
  • Hatch-for-a-prototype — over-investing in portability for a short-lived agent that will be deleted before any forced event.

12) Pattern transfer — where exit-cost thinking recurs

  • Build-vs-buy (file 01) — exit cost is the lever file 01 named and this file decomposes; the wall a team hits is exactly a high exit cost discovered at the worst moment.
  • Data gravity on the map (file 02) — the "which cloud does your data live in" gravity that decided AgentCore vs Vertex is the same force as data lock-in; gravity that picks a platform also traps you in it.
  • Model vendor strategy (module 12) — choosing a model provider has the identical four-surface shape one layer down: prompt format, fine-tune artifacts, embeddings, and API identity all fuse to a model vendor, and dual-writing eval sets is the same hatch.
  • Database migration in classic infra — the "migrate later is cheap" lie is the same shape as a schema/storage migration that's trivial at ten rows and a quarter at ten billion; data gravity is the shared constraint.
  • Tool contracts as MCP (module 19) — keeping tools as MCP servers is both a contract pattern and a portability hatch; the same choice serves two pressures.

13) The lock-in audit — five yes/no questions

  1. Have you scored all four surfaces (orchestration, prompt/tool format, data, identity) separately, not just orchestration?
  2. Do your trajectories and long-term memory have an owned source of truth, so the platform's store is a derived cache you could rebuild from?
  3. Are your tools defined in a portable format (MCP) and your prompts owned and versioned, so re-pointing is config not rewrite?
  4. Have you written today's exit-cost estimate in engineer-months per surface, and is it smaller than the cost of staying on bad terms?
  5. Is the data surface — the one that grows forever and exports worst — the one you've hardened most?

If question 2 is "no," your portability is the bank's first illusion: a movable graph reading from an immovable store.


Where this appears in production

Escape hatches that keep leaving cheap:

  • A regulated bank dual-writing trajectories to its own S3 + database: treats AgentCore Memory as a derived cache so a residency-forced move re-derives memory instead of losing it.
  • MCP tool servers: tools defined as Model Context Protocol servers re-point at any MCP-aware runtime (AgentCore Gateway, Foundry tool catalog) as configuration, not a rewrite.
  • LangGraph orchestration on a managed runtime: owned graph runs on AgentCore, Agent Engine, Foundry, or self-hosted, keeping the orchestration surface portable (the bank's one hatch from file 02).
  • A thin internal auth module: isolates AWS IAM/KMS calls so a future cloud move edits one module instead of every tool call.
  • Owned, versioned prompt templates: prompts in the repo, model-agnostic where possible, so a model swap is a re-test not a rewrite.

Traps that lock-in scoring catches:

  • A team on a closed visual flow builder: orchestration has no code export, so a forced migration means rebuilding the agent's logic from scratch in 3–5 engineer-months.
  • Partial-export memory stores: return processed summaries but not the raw events, so re-ingestion is lossy and behavior must be re-validated (the bank's residency wall).
  • Vendor-format tool sprawl: dozens of connectors in a proprietary schema, each a rewrite, versus MCP servers that re-point.
  • Scattered vendor identity calls: IAM/KMS sprinkled through the agent, turning one migration into a thousand edits.

Where data gravity decided the whole story:

  • A fintech that "owned its orchestration": found at exit that 4M logged conversations and structured memory couldn't round-trip, the heaviest surface it never scored.
  • A SaaS-platform sunset: a vendor pivot forced a 10-engineer-month migration on a 9-month clock because no surface had a hatch.
  • A healthcare provider: kept its own copy of every trajectory from day one, so a BAA-driven platform change was a re-point, not a rebuild.


Pause and recall

  1. Name the four lock-in surfaces and the escape hatch for each.
  2. Why does "we can always migrate later" invert over time? Which surface drives the inversion?
  3. What did the bank's residency event reveal about owning orchestration but not data?
  4. Why dual-write your own source of truth instead of trusting a "full export" contract clause?
  5. Why is lock-in a property of your architecture, not of the platform?
  6. Which surfaces have roughly flat exit cost over time, and which one climbs forever?
  7. What is the live exit drill, and what does its engineer-month estimate represent?
  8. When is aggressive escape-hatch design a waste?

Interview Q&A

Q1. Your team picked a capable platform and says "we can always migrate later." What's wrong with that?
A. Migration is always possible; the only variable is its price, and that price grows monotonically with usage. Exit cost is near zero on day one and high by month eighteen, driven mostly by accumulated state behind the boundary. "Migrate later" assumes a constant cost that actually climbs, so the cheapest moment to make leaving cheap is now, by building escape hatches across the four surfaces before any data piles up.
Common wrong answer to avoid: "Migration is a one-time project we'll scope when needed." It's cheapest before you have state to move and most expensive exactly when you're forced to move.

Q2. The bank owns its LangGraph orchestration. Is it portable?
A. Only on one of four surfaces. Orchestration is portable, but prompt/tool format, data, and identity are separate surfaces that lock in independently. The residency event proved it: the graph moved in an afternoon while 14 months of long-term memory wouldn't round-trip because the export returned summaries, not the raw events. Owning the orchestration and assuming portability is the most common lock-in mistake.
Common wrong answer to avoid: "Yes, they own the framework, so they're portable." Orchestration portability with welded data is still a trap; score all four surfaces.

Q3. Why dual-write your own state instead of requiring a full-export clause?
A. A full-export clause is only as good as the vendor's export API on the day you invoke it, and that API satisfies the letter of the clause, not the spirit: it returns an export, often processed summaries that won't re-ingest to reproduce behavior. You discover the gap at exit, with no leverage. Dual-writing pays a small continuous cost so the source of truth never lived behind the boundary; there's nothing to export because you already have it. For regulated, valuable, or large state, that certainty beats betting a quarter on a vendor's export behaving generously.
Common wrong answer to avoid: "Just negotiate a strong export clause." Clauses are satisfied in letter; the data still won't round-trip when you need it.

Q4. Which lock-in surface should you harden first, and why? A. Data. Orchestration and identity exit costs are roughly flat over time, but data grows with every trajectory and exports worst, so it's the surface that betrays the "migrate later" assumption. Owning the source of truth for trajectories and memory — dual-writing to a store you control — converts an unbounded, lossy migration into a bounded mechanical one, and it pays off immediately for analytics, eval, and audit regardless of any migration. Common wrong answer to avoid: "Harden orchestration first — it's the agent's brain." Orchestration is cheap to move; data is the surface that grows forever and traps you.

Q5. A teammate says lock-in is bad because the platform is proprietary. Correct or reframe. A. Reframe. Lock-in is a property of your architecture on the platform, not of the platform's openness. A proprietary platform with your data dual-written to your own store, MCP tools, owned prompts, and isolated auth is less locked-in than an open framework where you let all four surfaces fuse to the vendor. Pick a platform for fit; build open architecture on top of whatever you pick. Common wrong answer to avoid: "So we should only use open-source platforms." Openness of the vendor doesn't decide portability; your surface-by-surface choices do.

Q6. The rubric (file 03) cleared AgentCore and gravity picked it. Where does lock-in change that decision? A. It doesn't change the platform choice; it changes the architecture on top of it. The rubric deliberately excluded exit cost, so lock-in scoring is the separate pass that, for AgentCore, keeps the bank on the platform but adds three hatches it lacked — owned prompts, dual-written data, isolated auth. Lock-in would only override the rubric if a cleared platform had a fatally un-exportable data surface and no dual-write was possible; then a within-noise rubric tie should break toward the more portable candidate. Common wrong answer to avoid: "Lock-in means we should have picked the open framework." The framework was excluded on team capacity in file 03; the fix is hatches on the chosen platform, not a different platform.

Q7. Is a stuck migration a build-vs-buy bug (file 01), a landscape/gravity bug (file 02), or a lock-in bug (this file)? A. Diagnose by what's stuck. If new requirements keep hitting "the vendor can't do that," it's a build-vs-buy boundary bug (file 01). If two clouds are fighting because data lives in the wrong one, it's a gravity bug (file 02). If you want to leave and can't because state won't move, it's a lock-in bug — and within lock-in, name which of the four surfaces is welded. The bank's residency stall is squarely a data-surface lock-in bug. Common wrong answer to avoid: "It's all the same vendor problem." The three are distinct: boundary blocks new needs, gravity fights clouds, lock-in blocks exit.
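
Q1 and Q4 rest on the same shape: three surfaces cost roughly the same to exit whenever you leave, while the data surface costs more with every month of accumulated state. A minimal sketch of that shape follows. Every coefficient and the function name `exit_cost_months` are invented for illustration; they are assumptions, not estimates taken from the bank's case.

```python
def exit_cost_months(month: int, owns_data: bool) -> float:
    """Toy exit cost in engineer-months after `month` months on a platform.

    All coefficients are hypothetical placeholders that only show the shape
    of the curve; replace them with your own per-surface estimates.
    """
    # Orchestration + prompt/tool format + identity: roughly flat over time.
    flat = 0.25 + 0.5 + 1.0
    if owns_data:
        # Dual-written: the source of truth is already yours, so the cost is bounded.
        data = 0.25
    else:
        # Welded: the cost grows with every month of state behind the boundary.
        data = 0.25 * month
    return flat + data


for m in (1, 6, 18):
    print(m, exit_cost_months(m, owns_data=False), exit_cost_months(m, owns_data=True))
```

Under these made-up numbers the two curves start together and only the welded one climbs, which is the sense in which "migrate later" assumes a constant that is not constant.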


Design/debug exercise (10 min)

Step 1 — Model it. Score the bank's support agent across the four surfaces:

Surface            Current state                          Hatch to build         Exit cost after
Orchestration      LangGraph on AgentCore (owned)         already open           low (days)
Prompt/tool        MCP tools via Gateway; prompts tuned   own + version prompts  low-med
                   ad-hoc to Bedrock model
Data               AgentCore Memory; export = summaries   dual-write to own S3   medium, BOUNDED
                   only (unbounded exit cost!)            + DB as source of truth
Identity           IAM + KMS, scattered calls             thin auth module       medium (one module)

Live exit drill (could you leave within 90 days?): YES after hatches. Re-derive memory, re-point tools, swap the auth module.

Step 2 — Your turn. Score the bank's internal-ops agent across the same four surfaces. It has the heavier residency exposure (KYC + compliance memos must stay in-region), so weight the data surface hardest and decide whether dual-write alone suffices or you also need region-pinned owned storage. Then take one agent from your own backlog and run the four-surface scoring plus a 90-day exit drill, writing an engineer-month estimate per surface.

Step 3 — Reproduce from memory. Redraw the four-surface table (surface / trap / escape hatch) cold, then connect it to file 02: the same data gravity that chose AgentCore by where the bank's data lives is the force that would trap the bank if it let memory accumulate behind the boundary — gravity selects and gravity imprisons. If you can name the four surfaces, the hatch for each, and which one grows forever, you own this chapter.
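
One way to keep the four-surface scoring honest is to write it down as data, so an unbounded surface cannot hide inside a sum. The sketch below mirrors the Step 1 table. The `Surface` class, the `total` helper, and every engineer-month figure are hypothetical placeholders; the table gives only qualitative levels (low, medium, unbounded), so substitute your own estimates.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Surface:
    name: str
    hatch: str                      # the escape hatch to build
    months_now: Optional[float]     # engineer-months to exit today; None means unbounded
    months_after: float             # engineer-months to exit once the hatch exists


def total(surfaces, after_hatches: bool) -> Optional[float]:
    """Sum exit cost across surfaces; return None if any surface is unbounded."""
    costs = [s.months_after if after_hatches else s.months_now for s in surfaces]
    if any(c is None for c in costs):
        return None
    return sum(costs)


# Hypothetical estimates for the support agent; only the ordering follows the table.
support_agent = [
    Surface("orchestration", "already open", 0.25, 0.25),
    Surface("prompt/tool", "own + version prompts", 1.0, 0.5),
    Surface("data", "dual-write to owned store", None, 1.0),
    Surface("identity", "thin auth module", 2.0, 1.0),
]

print("exit cost today:", total(support_agent, after_hatches=False))
print("exit cost after hatches:", total(support_agent, after_hatches=True))
```

Returning `None` rather than a large number is a deliberate choice in this sketch: an export that yields only summaries has no finite migration estimate, and the total should say so.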


Operational memory

This chapter scored the dimension the rubric refused to score: exit cost. The important idea is that lock-in is not one number and not a property of the platform — it is four surfaces (orchestration, prompt/tool format, data, identity) on your architecture, each with its own exit cost and its own escape hatch, and the data surface grows forever while the others stay roughly flat. The bank owned its orchestration and still trapped its memory, which is why "we own the framework, so we're portable" is the seductive lie this file dismantles.

You learned to keep the escape hatch open: own the orchestration in a portable framework, define tools as MCP and own your prompts, dual-write trajectories and memory to a store you control so the platform's store is a derived cache, and isolate identity behind one auth module. That addresses the opening failure: when the bank's residency rule changed, the surfaces with hatches moved in days and the one without (memory) cost weeks, so the fix is to build the missing hatches while exit cost is still near zero.
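
The dual-write hatch can be sketched in a few lines. This is a minimal illustration under stated assumptions, not a real integration: `VendorMemory` is a hypothetical stand-in for whatever memory client the platform provides, the owned store is a local JSON-lines file standing in for the bank's own storage, and the log-and-continue error policy is a placeholder.

```python
import json
import logging
from pathlib import Path


class VendorMemory:
    """Hypothetical stand-in for a platform memory client (not a real SDK)."""

    def __init__(self):
        self.records = []

    def put(self, record: dict) -> None:
        self.records.append(record)


class DualWriter:
    """Persist each raw event to an owned log first, then to the vendor store."""

    def __init__(self, owned_log: Path, vendor: VendorMemory):
        self.owned_log = owned_log
        self.vendor = vendor

    def record(self, event: dict) -> None:
        # The owned log is the source of truth: write the raw event before anything else.
        with self.owned_log.open("a", encoding="utf-8") as f:
            f.write(json.dumps(event, sort_keys=True) + "\n")
        # The vendor store is a derived cache; a failure here loses no data.
        try:
            self.vendor.put(event)
        except Exception:
            logging.warning("vendor write failed; event is safe in the owned log")

    def rebuild(self, target: VendorMemory) -> int:
        """Re-derive a fresh store from the owned log (the exit-drill path)."""
        count = 0
        with self.owned_log.open(encoding="utf-8") as f:
            for line in f:
                target.put(json.loads(line))
                count += 1
        return count
```

The ordering is the point of the sketch: because the raw event lands in the owned log before the vendor call, `rebuild` can repopulate any replacement store without asking the old platform for an export.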

Carry this diagnostic forward: run a live 90-day exit drill at selection and annually, and track "percent of state with an owned source of truth." If that percent drops, the data hatch is leaking and exit cost is climbing silently. When you want to leave and can't, name which of the four surfaces is welded before blaming the vendor.
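
The "percent of state with an owned source of truth" metric is simple to compute once each store reports two counts. The sketch below assumes a hypothetical input shape (store name mapped to total records and records that also exist in an owned store); the function names and the sample counts are illustrative only.

```python
def owned_state_percent(records_by_store: dict) -> float:
    """Percent of records that also exist in a store you control.

    `records_by_store` maps a store name to (total_records, records_with_owned_copy).
    An empty input counts as 100%: there is no state behind the boundary yet.
    """
    total = sum(t for t, _ in records_by_store.values())
    owned = sum(o for _, o in records_by_store.values())
    return 100.0 if total == 0 else 100.0 * owned / total


def hatch_is_leaking(previous_percent: float, current_percent: float) -> bool:
    """A drop between two measurements means new state is landing only on the vendor side."""
    return current_percent < previous_percent


# Hypothetical counts: trajectories are fully dual-written, long-term memory is not.
snapshot = {"trajectories": (1000, 1000), "long_term_memory": (400, 100)}
print(round(owned_state_percent(snapshot), 1))
```

Tracked per release or per month, a falling value flags a leaking data hatch long before an exit is forced.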

Remember:

  • Score exit cost, not just usage cost; it's the cost of leaving, summed across four surfaces.
  • The four surfaces — orchestration, prompt/tool format, data, identity — lock in independently; score each separately.
  • Data is the surface that grows forever and exports worst; harden it first by owning the source of truth.
  • Dual-write beats an export clause: a clause is satisfied in letter, your own data is yours in fact.
  • Lock-in is a property of your architecture, not the platform's openness; build open architecture on whatever you pick.
  • Build the hatches when exit cost is near zero; retrofitting gets harder with every month of state that piles up behind the boundary.

Bridge. We kept leaving cheap by owning the four surfaces — but notice the new pressure that creates. Owning your data source of truth, dual-writing trajectories, and isolating identity all add infrastructure and cost, and the platform itself bills per runtime-hour, per memory record, per conversation. We scored what each platform can do and how hard it is to leave; we have not yet scored what it costs to run — and the three families' cost curves diverge violently at 10× and 100× volume, sometimes inverting the day-one cheapest choice. The next file builds the cost-and-scaling model. → 05-cost-and-scaling-model.md