05. Cost and scaling model — the curve that re-opens the decision at 100×

~19 min read. The pilot ran 10,000 conversations a month and the bill was a rounding error. Then the agent caught on, traffic hit 1,000,000 conversations a month, and the finance review pulled the platform decision back open. The cheapest choice at pilot is rarely the cheapest choice at scale.

Built on 04-lock-in-and-portability.md. You can now keep the escape hatch open across four surfaces. This file uses that portability to do something the rubric and the lock-in pass could not: read the 100× curve. The pressure is cost — and the three families' curves diverge so sharply at scale that the cheapest day-one choice can become the most expensive one at volume, re-opening fit-before-lock-in as a live decision.


What every earlier file priced at pilot scale

The build-vs-buy file gave rough per-task numbers (~$2/conversation for SaaS). The rubric excluded cost entirely. The lock-in file added a small continuous cost for the dual-write hatch. All of it was implicitly priced at pilot volume — small enough that the numbers were footnotes. That is exactly when cost mistakes are made, because at pilot scale every family is cheap and the differences are invisible.

This file makes cost the subject. You will learn to decompose an agent's bill into its three layers, project each layer to 10× and 100× volume, and read where the three families' curves cross, because the crossing point is where the platform decision re-opens. Whether the bank's earlier choice (AgentCore) survives scrutiny is a number you compute, not a vibe you carry from the pilot.

What this file solves

A team ships on the platform that was cheapest and fastest at pilot, traffic grows two orders of magnitude, and the bill grows faster than traffic because a fee the team ignored at pilot scale now dominates. Finance pulls the decision back open mid-flight, when migration is expensive and the team is already locked in on three surfaces. This file gives you a three-layer cost model, the curve shape for each family, and a worked projection of the bank's traffic from 10K to 1M conversations a month. The first move is to stop quoting a single per-conversation number and start projecting which layer dominates at which volume, because the layer that's negligible at 10K can be the entire bill at 1M.

Why a single per-task cost number lies

The naive cost model is one number: cost per conversation, measured at pilot, multiplied by projected volume. It lies because an agent's cost is not one number — it is three layers that scale differently. Model tokens scale roughly linearly with traffic. Platform/runtime fees scale with traffic but with their own per-unit shape (per runtime-hour, per memory record, per API call) that can be sub-linear or super-linear. Your own infra (the dual-write store, the trace pipeline, the owned auth) scales mostly with data volume, not conversation count. Multiply a pilot per-task number by 100 and you assume all three layers scale linearly and proportionally — and the platform-fee layer almost never does. The single number hides which layer will dominate, which is the only thing that matters at scale.

When the cheapest pilot becomes the most expensive production

The bank's finance team models the support agent at three volumes. At pilot (10K conversations/month) every option looks affordable: the runtime and self-host carry tens to a few hundred dollars in platform fees, and even the SaaS option's ~$20K is small enough for finance to wave through. The team almost picks by pilot cost. Then they project to 1M/month:

Support agent, ~5 turns/conversation. Model tokens roughly equal across options
(same models, same traffic), so compare the NON-token layers that diverge:

                          10K conv/mo    100K conv/mo    1M conv/mo
SaaS vertical             ~$20K          ~$200K          ~$2,000,000
(Agentforce $2/conv,                                    ← linear, fixed, brutal
 customer-facing model)                                   at scale

Hyperscaler runtime       ~$50–200       ~$500–2K        ~$5K–20K
(AgentCore: runtime                                      ← sub-linear; runtime-hours
 $0.0895/vCPU-hr +                                         + memory records, not /conv
 $0.00945/GB-hr +
 memory + gateway calls)

Framework self-host       ~$300          ~$1.5K          ~$8K–15K
(your EC2/GKE + your                                    ← your raw infra, lowest at
 trace + your memory;                                     scale but you run it all
 fixed ops floor)

At 10K the SaaS option ($20K) looks fine and ships in days. At 1M the same per-conversation price produces a $2M/month platform bill (about $24M/year), while the runtime, billed by runtime-hours and memory records rather than per conversation, sits two orders of magnitude lower. So the real cost question is not "what does a conversation cost"; it is "which fee scales with my traffic, and does it scale linearly, sub-linearly, or with a cliff." Agentforce's $2/conversation is a fixed linear fee: double the traffic, double the platform bill, forever. The runtime's fee is consumption of compute and memory, which amortizes as utilization rises. That structural difference, invisible at 10K, is a $2M/month fact at 1M.

So how do you see the crossing point before traffic forces it on you? You build the three-layer model once, project each layer to 10× and 100× separately, and find the volume where the family curves cross.

The cost-curve rule. Project cost as three layers that scale differently — model tokens (linear), platform fees (family-specific shape), your own infra (data-driven) — not as one per-task number times volume. The platform-fee layer's shape (per-conversation linear vs per-resource sub-linear) decides the family at scale, and its crossing point is where the decision re-opens.

Why this rule exists. The primitive is that different fee structures integrate differently over volume: a fixed per-event price integrates to a straight line, while a consumption price on a shared resource integrates to a flattening curve as utilization rises. The constraint is that pilot volume is below every crossing point, so all curves look identical there. The naive single-number model multiplies the wrong layer by the wrong factor, hiding the crossing point exactly where the decision is cheapest to make.
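One way to make "integrates differently" concrete is a simplified two-shape comparison. The symbols are introduced here for illustration and are not taken from any pricing page: v is monthly conversations, p a fixed per-conversation price, F a fixed monthly floor, and m a marginal cost per conversation. Treating an amortizing fee as "floor plus marginal" is an assumption that only approximates a real consumption price.

```latex
\[
C_{\text{linear}}(v) = p\,v, \qquad C_{\text{floor}}(v) = F + m\,v
\]
\[
\frac{C_{\text{linear}}(v)}{v} = p, \qquad
\frac{C_{\text{floor}}(v)}{v} = \frac{F}{v} + m \;\longrightarrow\; m \quad (v \to \infty)
\]
\[
C_{\text{linear}}(v^{*}) = C_{\text{floor}}(v^{*})
\;\Longrightarrow\;
v^{*} = \frac{F}{p - m}, \qquad p > m
\]
```

The per-conversation cost of the linear fee never moves, while the floor-plus-marginal cost falls toward m. Below v* the linear fee is cheaper; above it, the gap grows with every additional conversation.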


1) The three cost layers — what scales with what

Every agentic platform's bill is the sum of three layers, and the whole skill is knowing which layer dominates at your volume.

Layer 1 — Model tokens. The inference cost: input + output tokens per turn × turns per conversation × conversations. Scales roughly linearly with traffic and is largely the same across families if you use the same models — a conversation costs about the same in tokens whether it runs on a runtime, a framework, or a SaaS. This layer usually dominates the total bill at every scale, which is why it's tempting to ignore the other two — but it doesn't differentiate the families, so it's not where the decision lives.

Layer 2: Platform / runtime fees. What the platform charges to run the agent, on top of tokens. This is where the families diverge, and the divergence is in the shape:
  • SaaS vertical: typically per-conversation or per-action, a fixed price per unit of business work (Agentforce: $2/conversation, or Flex Credits at $500/100K credits, about $0.10/action at 20 credits per action). Scales linearly and never amortizes.
  • Hyperscaler runtime: consumption of compute and managed services. AgentCore bills $0.0895/vCPU-hour, $0.00945/GB-hour, plus per-memory-record and per-1K-gateway-call fees; Vertex Agent Engine bills $0.0864/vCPU-hour and $0.0090/GB-hour with a free tier. Scales with resource-seconds, which amortizes as utilization rises.
  • Framework self-host: your raw cloud compute (EC2/GKE) plus what you build, a fixed ops floor plus near-marginal cost per conversation. Lowest at scale, highest in engineer-months.

Layer 3 — Your own infra. The escape-hatch and operational cost from file 04: the dual-write store, the trace pipeline, owned memory, the auth module, on-call. Scales mostly with data volume and ops surface, not conversation count. Modest for a runtime (managed services carry most of it), large for self-host (you run everything).

                  scales with        family divergence    dominates total at
LAYER 1 tokens    traffic (linear)    none (same models)   every scale (but doesn't decide)
LAYER 2 platform  traffic, FAMILY     EVERYTHING           — this layer decides the family —
        fees      SHAPE
LAYER 3 own infra data + ops surface  some (self-host big) low for runtime, high for self-host

   total bill = tokens (same everywhere) + platform fee (diverges) + your infra
   the DECISION lives in layer 2's shape, not in the total

The diagram is the file. Layer 1 is the biggest number and decides nothing. Layer 2 is the smaller number that decides everything, because its shape is what diverges across families at scale.
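To see why a per-resource fee is not a per-conversation fee, it can help to compute the compute portion of layer 2 directly from the quoted rates. The sketch below is illustrative: the rates and the Vertex free-tier amounts (50 vCPU-hours, 100 GB-hours a month) are the ones quoted in this file, while `ResourceRates`, `compute_fee`, and the usage figures are hypothetical names and inputs. Memory-record and gateway-call charges are left out because no unit prices are given for them here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ResourceRates:
    """Per-resource prices in USD, as quoted in the text."""
    vcpu_hour: float
    gb_hour: float
    free_vcpu_hours: float = 0.0  # monthly free tier, if any
    free_gb_hours: float = 0.0


# Rates as quoted in this file; check current pricing pages before relying on them.
AGENTCORE = ResourceRates(vcpu_hour=0.0895, gb_hour=0.00945)
VERTEX_AGENT_ENGINE = ResourceRates(
    vcpu_hour=0.0864, gb_hour=0.0090, free_vcpu_hours=50, free_gb_hours=100
)


def compute_fee(rates: ResourceRates, vcpu_hours: float, gb_hours: float) -> float:
    """Monthly compute fee only (memory records and gateway calls excluded)."""
    billable_vcpu = max(0.0, vcpu_hours - rates.free_vcpu_hours)
    billable_gb = max(0.0, gb_hours - rates.free_gb_hours)
    return billable_vcpu * rates.vcpu_hour + billable_gb * rates.gb_hour


if __name__ == "__main__":
    # Hypothetical usage: 1,000 vCPU-hours and 2,000 GB-hours in one month.
    for name, rates in [("AgentCore", AGENTCORE), ("Vertex Agent Engine", VERTEX_AGENT_ENGINE)]:
        print(f"{name}: ${compute_fee(rates, 1_000, 2_000):,.2f}/mo")
```

The inputs are resource-hours, not conversations. How many conversations fit into one vCPU-hour depends on utilization, which is exactly where the amortization comes from.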


2) Picture first — the curves crossing

flowchart TD
    A[Pilot: 10K conv/mo] --> B[All three families within noise<br/>platform fees tiny vs tokens]
    B --> C{Project to 100K then 1M}
    C --> D[SaaS: per-conv fee linear<br/>→ straight line up, brutal at 1M]
    C --> E[Runtime: per-resource fee<br/>→ sub-linear, amortizes with utilization]
    C --> F[Self-host: fixed ops floor + marginal<br/>→ lowest slope, but you run it all]
    D --> G{Find crossing points}
    E --> G
    F --> G
    G --> H[SaaS cheapest below ~low-volume crossover;<br/>runtime cheapest in the middle band;<br/>self-host cheapest only above high-volume<br/>crossover where ops floor amortizes]

The decisive node is G — the crossing points. Below the first crossover, SaaS wins on total cost-of-ownership because its near-zero build and ops outweigh its per-conversation fee. In the middle band, the runtime wins: managed ops without the linear per-conversation tax. Only above a high crossover, where volume is large enough to amortize a self-hosted ops floor, does framework self-host win on raw cost — and even then only if you can staff the ops. The bank must find which band 1M conversations/month lands in.


3) The bank's traffic projected — one running example

The bank's support agent runs ~5 turns per conversation. The mix matters: file 01 established that 70% of volume is simple FAQ, and the bank routes FAQ to a cheaper model (the model-flexibility axis it weighted at 4 in file 03). Project the support agent from pilot to scale.

Attempt A — the tempting move: project the pilot number forward

The bank's first model takes the pilot's blended cost (say $0.40/conversation all-in at 10K/month) and multiplies: 1M × $0.40 = $400K/month, $4.8M/year. Alarming, and wrong on two counts: it (a) assumes the platform-fee layer scales linearly when the runtime's doesn't, and (b) ignores that the FAQ-routing cuts the token layer. A single blended number projected linearly is the error this file exists to prevent.

Attempt B — the right move: project three layers separately

SUPPORT AGENT at 1,000,000 conv/mo, 5 turns each = 5M turns/mo
70% FAQ (3.5M turns) routed to cheap model, 30% (1.5M turns) to strong model

LAYER 1 — tokens (the big number, but same across families):
  FAQ turns:    3.5M × ~$0.002/turn (cheap model)  ≈ $7,000/mo
  Complex turns:1.5M × ~$0.02/turn  (strong model)  ≈ $30,000/mo
  Token layer total                                 ≈ $37,000/mo
  [Routing FAQ to the cheap model saved ~$63K/mo vs all-strong (5M × $0.02 = $100K):
   file 03's model-flexibility axis paying off in layer 1.]

LAYER 2 — platform fee (the number that DECIDES, projected by family shape):
  SaaS @ $2/conv:                    1M × $2          = $2,000,000/mo  ← linear cliff
  AgentCore @ resource consumption:  est. compute-hrs
    + memory records + gateway calls                 ≈ $8,000–15,000/mo ← sub-linear
  Self-host raw infra:               your GKE + ops   ≈ $6,000–12,000/mo ← lowest, you run it

LAYER 3 — own infra (dual-write store, traces, on-call):
  Runtime:  managed carries most                      ≈ $3,000/mo
  Self-host: you run memory, traces, autoscale, on-call ≈ $20,000/mo + 2-3 engineers

Now the decision is visible. The token layer ($37K) is the same whichever family runs the agent, so it doesn't decide. The platform layer is where the families split by a factor of ~150×: SaaS at $2M/month against the runtime's ~$10K/month. Layer 3 keeps self-host from winning despite its low platform fee, because the bank's six-person team would need 2–3 more engineers to run memory, tracing, and on-call at this volume.

TOTAL monthly bill at 1M conv/mo (tokens + platform + own infra):
  SaaS vertical:        $37K + $2,000K + ~$1K  ≈ $2,038,000/mo   (platform fee = the whole story)
  Hyperscaler runtime:  $37K + ~$11K  + $3K    ≈ $51,000/mo      ← the bank's pick survives
  Framework self-host:  $37K + ~$9K   + $20K   ≈ $66,000/mo + 2-3 FTE

Verdict for the support agent: the runtime (AgentCore) wins decisively at 1M/month, at ~$51K/month against SaaS's ~$2M and self-host's ~$66K-plus-headcount. The per-conversation SaaS fee that was a $20K footnote at pilot is a $2M/month wall at scale. And the runtime beats self-host not on platform fee (close) but on layer 3: managed ops save the bank three engineers it doesn't have. The pilot-era choice holds, but only because the bank computed the curve instead of trusting the pilot number.
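The two attempts can be reproduced in a few lines, which makes it easy to rerun the projection when an estimate changes. This is a minimal sketch using the figures from the worked example (midpoint platform fees of $11K and $9K, per-turn token prices of $0.002 and $0.02). The function and class names are hypothetical, and headcount is deliberately left out of the totals.

```python
from dataclasses import dataclass

CONVERSATIONS = 1_000_000
TURNS_PER_CONVERSATION = 5
CHEAP_PER_TURN = 0.002   # USD per FAQ turn, cheap model
STRONG_PER_TURN = 0.02   # USD per complex turn, strong model


def token_layer(conversations: int, faq_percent: int = 70) -> float:
    """Layer 1: FAQ turns go to the cheap model, the rest to the strong model."""
    turns = conversations * TURNS_PER_CONVERSATION
    faq_turns = turns * faq_percent // 100
    return faq_turns * CHEAP_PER_TURN + (turns - faq_turns) * STRONG_PER_TURN


@dataclass(frozen=True)
class Family:
    name: str
    platform_fee: float  # layer 2, USD/month at 1M conversations
    own_infra: float     # layer 3, USD/month, headcount not included

    def total(self, tokens: float) -> float:
        return tokens + self.platform_fee + self.own_infra


FAMILIES = [
    Family("SaaS vertical", 2.0 * CONVERSATIONS, 1_000),
    Family("Hyperscaler runtime", 11_000, 3_000),
    Family("Framework self-host", 9_000, 20_000),
]


def naive_projection(blended_per_conversation: float, conversations: int) -> float:
    """Attempt A: one blended pilot number multiplied by volume."""
    return blended_per_conversation * conversations


if __name__ == "__main__":
    tokens = token_layer(CONVERSATIONS)
    print(f"naive (Attempt A): ${naive_projection(0.40, CONVERSATIONS):,.0f}/mo")
    print(f"token layer:       ${tokens:,.0f}/mo")
    print(f"all-strong tokens: ${token_layer(CONVERSATIONS, faq_percent=0):,.0f}/mo")
    for family in FAMILIES:
        print(f"{family.name:<20} ${family.total(tokens):,.0f}/mo")
```

Keeping the token layer as a separate function is the point of the exercise: it can be tuned (routing share, per-turn price) without touching the layer that separates the families.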

Teacher voice. Look at where the decision actually lived. The token layer was the biggest line ($37K) and it decided nothing: it is the same on every family. The platform layer was smaller in absolute terms for the runtime ($11K) but it spread the families across two orders of magnitude. The number you stare at is rarely the number that decides; find the layer that diverges, not the layer that's largest.


4) Why the runtime and not self-host — under this workload

At 1M conversations/month the framework-self-host platform fee (~$9K) is lower than the runtime's (~$11K). On raw infra cost, self-host wins. So why does the bank stay on the runtime?

Because layer 3 dominates the difference. At this volume the bank would self-host the memory store, the trace pipeline, the autoscaler, the region-pinning, the secrets and identity wiring, and 24×7 on-call for a regulated agent, which the projection prices at ~$20K/month plus 2–3 additional engineers the six-person team does not have. The runtime folds all of that into its managed fee. The ~$2K/month the runtime charges above self-host's raw infra buys back three headcount and the on-call burden, which for a small team facing a high compliance bar is the cheapest money the bank spends.

The crossover where self-host actually wins is higher and steeper than its platform-fee line suggests, because the win requires also amortizing the ops floor across enough volume and engineers. For a hyperscaler-class product agent at tens of millions of conversations with a dedicated platform team, self-host's marginal-cost advantage finally swamps the ops floor and it wins — that is the regime file 01 flagged where the cost math flips. The bank is not in that regime; its volume is high enough to kill SaaS but not high enough to justify owning the ops.
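The runtime-versus-self-host comparison can be turned around into a budget question: how much may self-host spend on layer 3, headcount included, before it stops tying the runtime? The sketch below uses the 1M/month figures from the projection. The function names are hypothetical, and tokens are omitted because they are equal on both sides.

```python
def self_host_ops_budget(runtime_platform: float, runtime_infra: float,
                         self_host_platform: float) -> float:
    """Largest monthly layer-3 spend (infra plus loaded headcount) at which
    self-host still matches the runtime's non-token cost."""
    return runtime_platform + runtime_infra - self_host_platform


def self_host_overshoot(self_host_infra: float, monthly_headcount_cost: float,
                        budget: float) -> float:
    """How far self-host's layer 3 exceeds the break-even budget.
    A negative result would mean self-host wins."""
    return self_host_infra + monthly_headcount_cost - budget


if __name__ == "__main__":
    budget = self_host_ops_budget(11_000, 3_000, 9_000)
    print(f"break-even layer-3 budget for self-host: ${budget:,.0f}/mo")
    # Even with headcount priced at zero, the $20K/month infra estimate overshoots.
    print(f"overshoot before headcount: ${self_host_overshoot(20_000, 0, budget):,.0f}/mo")
```

With these inputs the break-even budget is $5K a month, so the projected $20K of self-hosted infra already misses it by $15K before a single engineer is priced in.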


5) The property that flips the curve: fee shape, not fee size

The single dimension that most changes the cost decision at scale is the shape of the platform fee, not its size at pilot. A small per-conversation fee and a small per-resource fee look identical at 10K and diverge violently by 1M, because one integrates to a line and the other to a flattening curve.

Platform fee vs volume:

SaaS per-conversation  ╱╱╱╱╱╱╱╱╱╱╱╱╱╱╱╱╱╱╱╱╱╱╱╱╱╱  straight line, no amortization
Runtime per-resource   ╱╱╱╱╱──────────────────────  rises then flattens as utilization climbs
Self-host fixed+marginal ▁▁▁▁▁╱╱╱╱╱╱╱╱╱╱╱╱╱╱╱╱╱╱╱  high floor, low slope after

  pilot                    crossover band              scale
  (all equal)              (runtime wins)              (self-host can win if ops amortize)

This is the counterintuitive result: the family with the highest day-one platform fee (self-host's ops floor) can be the cheapest at extreme scale, while the family with the lowest day-one fee (SaaS, near-zero build) can be the most expensive at scale. Pilot cost ranking and production cost ranking can fully invert. The bank's $20K SaaS pilot would have been the most expensive path at volume; reading the fee shape at selection is what kept the decision honest.
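The rank inversion can be sketched numerically by giving each family a floor and a per-conversation slope. The parameters below are a deliberate linearization of the 1M/month projection (non-token layers only, headcount excluded), so they are assumptions rather than published prices, and a real runtime fee would flatten rather than stay straight. `FeeShape`, `cheapest`, and `crossover` are hypothetical names.

```python
from typing import NamedTuple


class FeeShape(NamedTuple):
    name: str
    floor: float      # fixed USD/month
    per_conv: float   # marginal USD per conversation

    def cost(self, conversations: float) -> float:
        return self.floor + self.per_conv * conversations


# Linearized from the 1M/month projection; illustrative only.
SHAPES = [
    FeeShape("SaaS vertical", 1_000, 2.0),
    FeeShape("Hyperscaler runtime", 3_000, 0.011),
    FeeShape("Framework self-host", 20_000, 0.009),
]


def cheapest(conversations: float) -> str:
    """Name of the family with the lowest non-token cost at this volume."""
    return min(SHAPES, key=lambda s: s.cost(conversations)).name


def crossover(a: FeeShape, b: FeeShape) -> float:
    """Volume where two shapes cost the same (slopes must differ)."""
    return (b.floor - a.floor) / (a.per_conv - b.per_conv)


if __name__ == "__main__":
    for volume in (500, 1_000_000, 20_000_000):
        print(f"{volume:>12,} conv/mo -> {cheapest(volume)}")
    runtime, self_host = SHAPES[1], SHAPES[2]
    print(f"runtime/self-host crossover: {crossover(runtime, self_host):,.0f} conv/mo")
```

Under these assumed parameters the runtime/self-host crossover lands near 8.5M conversations a month. Adding a hypothetical $45K/month of loaded headcount to the self-host floor pushes it to about 31M, which is the "tens of millions" regime. The specific volumes are artifacts of the linearization; the ordering of the bands is the part that carries over.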


6) The cost failure walked through deeply — the per-conversation cliff

Replay the failure as a timeline, because this shape — a linear per-unit fee invisible at pilot — recurs across every SaaS-priced agent.

Month 0   Ship support agent on SaaS @ $2/conversation. Pilot 10K/mo = $20K.
          Finance: "rounding error." Nobody projects the fee shape.
Month 4   Agent is good; marketing routes more traffic to it. 100K/mo = $200K.
          First eyebrow raised. Still "acceptable, it deflects human agents."
Month 8   Viral moment + WhatsApp rollout. 600K/mo = $1.2M/mo platform fee alone.
          Finance pulls the platform decision back open mid-flight.
Month 9   Migration scoped — but the team is locked in on three surfaces
          (file 04): SaaS orchestration not portable, data partial-export,
          identity welded. Exit cost: ~8 engineer-months. On a budget clock.
Month 10  Two bad options: pay the linear fee that grows with every success,
          or eat an 8-month migration. Success is now a cost liability.

The failure is structural, not a billing surprise — the price per conversation never changed. What changed is that nobody projected the shape of a linear fee against a traffic curve that, if the agent succeeds, only goes up. Not an over-spending problem; a fee-shape-blindness problem. Worse, success causes the cost: the better the agent deflects, the more it's used, the higher the linear bill. The fix is upstream — read the fee shape at selection and reject a linear per-conversation fee for any agent expected to grow into high volume.
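Replaying the timeline in code shows that the unit price never moves while the bill does. This is a small sketch using the three volumes from the timeline. The function names are hypothetical, and the 25% discount in the usage example is an invented figure used only to show that a discount changes the slope, not the shape.

```python
PRICE_PER_CONVERSATION = 2.00  # USD, unchanged for the whole timeline

# (month, conversations per month) taken from the timeline
TIMELINE = [(0, 10_000), (4, 100_000), (8, 600_000)]


def replay(timeline, price=PRICE_PER_CONVERSATION):
    """Return (month, volume, monthly_fee, fee_per_conversation) rows for a linear fee."""
    rows = []
    for month, volume in timeline:
        fee = volume * price
        rows.append((month, volume, fee, fee / volume))
    return rows


def with_discount(timeline, discount: float):
    """A negotiated discount lowers the price per unit but the fee stays linear."""
    return replay(timeline, PRICE_PER_CONVERSATION * (1 - discount))


if __name__ == "__main__":
    for month, volume, fee, unit in replay(TIMELINE):
        print(f"month {month}: {volume:>7,} conv -> ${fee:>9,.0f}/mo (${unit:.2f}/conv)")
    # Hypothetical 25% discount: the month-8 bill is still 60x the month-0 bill.
    discounted = with_discount(TIMELINE, 0.25)
    print(f"growth with discount: {discounted[-1][2] / discounted[0][2]:.0f}x")
```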

Mini-FAQ. "Isn't SaaS per-conversation pricing fine because it includes everything: no ops, no build?" At low and stable volume, yes; that's the band where SaaS wins on total cost. The trap is a per-conversation fee on an agent expected to scale: the fee is fixed per unit and never amortizes, so the more successful the agent, the more the pricing punishes you. Bundled convenience is worth a premium at 10K/month and is a $2M/month tax at 1M/month.


7) Cost movement — what each family's fee shape costs at scale

| Family | Platform-fee shape | Cheapest at | Most expensive at | What you pay for the win |
| --- | --- | --- | --- | --- |
| SaaS vertical | per-conversation / per-action, linear | low + stable volume | high volume (no amortization) | convenience now, linear tax forever |
| Hyperscaler runtime | per-resource (vCPU/GB-hr, memory, calls), sub-linear | middle volume band | extreme volume (managed markup) | managed ops, amortizing fee |
| Framework self-host | fixed ops floor + low marginal | extreme volume | low volume (floor not amortized) | lowest marginal cost, full ops burden + headcount |

Read it as a movement across volume. As volume rises, the cheapest family walks from SaaS (low) to runtime (middle) to self-host (extreme) — if you can staff the ops at the high end. The subsystem that absorbs each win differs: SaaS pays with a linear fee, the runtime pays with a managed markup over raw infra, self-host pays with engineer headcount and on-call. For the bank at 1M/month, the runtime's managed markup (~$2K/month over self-host's raw infra) is far cheaper than the three engineers self-host would demand — so the runtime wins the total cost of ownership, not the raw infra line.


8) Operational signals — is the cost model still honest?

Healthy. Cost is tracked as three separate layers, each projected to 10× and 100× of current volume, and the platform-fee layer's shape is known (linear vs per-resource). The bill grows slower than or proportional to traffic, and the layer that dominates the total (tokens) is being optimized separately from the layer that decides the family (platform fee).

First metric that degrades. Cost-per-conversation as volume rises. On a healthy runtime it falls (fixed costs amortize); on a linear SaaS fee it stays flat forever; if it rises, a fee with a cliff or a super-linear component (per-memory-record growth, retry storms multiplying token cost) is kicking in. Watch the slope of cost-per-conversation against volume, not the absolute number.

Misleading metric people watch. Total monthly bill. The total is dominated by the token layer, which is the same across families and doesn't decide anything — staring at it hides the platform-fee layer where the family decision actually lives. A team can obsess over a $37K token bill while a $2M platform fee shape goes unmodeled.

The signal an experienced lead checks first. "What does the platform-fee layer look like at 10× and 100× current volume, by fee shape?" — asked at selection, projected separately from tokens. A lead reads the shape of layer 2 first, because that's the line that crosses and re-opens the decision; the token layer is real money but it's a known proportional cost, not a decision input.
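The slope check on cost-per-conversation is simple enough to automate against billing exports. The following is a minimal sketch: `unit_cost_trend` is a hypothetical helper, the 2% tolerance is an arbitrary threshold chosen for illustration, and the sample bills in the usage example are made up.

```python
def unit_cost_trend(samples, tolerance=0.02):
    """Classify how cost-per-conversation moves as volume rises.

    samples: iterable of (conversations, monthly_bill) with conversations > 0.
    Compares the lowest- and highest-volume samples and returns
    "falling", "flat", or "rising". tolerance is the relative change
    still treated as flat.
    """
    ordered = sorted(samples)
    if len(ordered) < 2:
        raise ValueError("need at least two volume samples")
    (low_volume, low_bill), (high_volume, high_bill) = ordered[0], ordered[-1]
    first = low_bill / low_volume
    last = high_bill / high_volume
    change = (last - first) / first
    if change < -tolerance:
        return "falling"
    if change > tolerance:
        return "rising"
    return "flat"


if __name__ == "__main__":
    linear_fee = [(10_000, 20_000), (100_000, 200_000), (1_000_000, 2_000_000)]
    amortizing = [(10_000, 3_110), (1_000_000, 14_000)]      # made-up bills
    creeping = [(10_000, 1_000), (100_000, 15_000)]          # made-up bills
    for label, data in [("linear fee", linear_fee), ("amortizing", amortizing), ("creeping", creeping)]:
        print(f"{label}: {unit_cost_trend(data)}")
```

A "rising" result is the one worth an alert, since it points at a cliff or a super-linear component rather than at traffic growth.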


9) Boundary of applicability — when cost curves don't decide

Strong fit for curve modeling. Any agent expected to grow in volume, especially customer-facing agents whose success drives traffic up. The bank's support agent — 70% FAQ, growing, WhatsApp-bound — is the textbook case where the fee shape at 1M decides the platform.

Where curve modeling becomes pathological. A low-volume, stable internal agent that will never leave the pilot band — the bank's internal-ops agent at a few thousand calls/day. There, the curves never cross within the agent's lifetime, time-to-value and control dominate, and an elaborate 100× projection is wasted effort. Match the modeling depth to the expected volume growth; flat-volume agents don't need curve analysis.

Scale / regime that invalidates the intuition. "The cheapest pilot is the cheapest platform" holds only below the first crossover and inverts above it — SaaS cheapest at 10K becomes most expensive at 1M. The other inversion: "self-host is always cheapest at scale" holds on raw infra but breaks when ops headcount is priced in, because below the volume that amortizes the ops floor and the on-call team, the runtime's managed markup is cheaper than the engineers self-host demands.


10) The wrong model to drop: "cost per conversation × volume"

The seductive wrong idea is that an agent's cost is a per-conversation number you measure at pilot and multiply by projected volume. It fails on two counts: it treats three differently-scaling layers as one, and it projects the platform-fee layer linearly when only the SaaS shape is linear. A runtime's per-resource fee amortizes; a self-host's fixed floor amortizes; multiplying a blended pilot number by 100 assumes neither does.

Replace "per-conversation × volume" with "three layers, each projected by its own scaling shape, with the platform-fee layer's shape read explicitly." The bank's correct projection — token layer flat across families, platform layer diverging 150×, infra layer deciding runtime-vs-self-host — is what a single multiplied number could never have shown.


11) Other cost failure shapes at scale

  • Fee-shape blindness — modeling a linear per-conversation fee as if it amortizes, so a successful agent becomes a runaway bill.
  • Token-layer fixation — optimizing the biggest line (tokens) while ignoring the deciding line (platform fee shape).
  • Retry-storm amplification — a flaky tool or a loop bug multiplying turns per conversation, inflating the token layer super-linearly with no traffic increase.
  • Memory-record creep — a runtime's per-memory-record fee growing faster than conversations because long-term memory accumulates without pruning.
  • Hidden mandatory add-on — a SaaS fee that excludes a required data platform (Agentforce + Data Cloud), so real per-conversation cost is higher than the headline.
  • Pilot-band selection — choosing the family that's cheapest at 10K and shipping it into a 1M future without finding the crossover.
  • Ops-floor denial — assuming self-host's low marginal cost wins at scale while ignoring the 2–3 engineers and on-call it demands below the amortization volume.
  • Success-as-liability — a pricing shape where the agent costing more is the consequence of it working better, with no path to amortize.
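Of the shapes above, retry-storm amplification is the easiest to quantify. As a rough sketch, assume each model call fails independently with a fixed probability and every attempt costs the same tokens; both are simplifying assumptions, and the function names are hypothetical. The expected number of attempts per turn is then a truncated geometric sum, and the retry cap bounds the multiplier.

```python
def expected_attempts(failure_rate: float, max_retries: int) -> float:
    """Expected model calls per turn when each call fails independently with
    probability failure_rate and is retried at most max_retries times."""
    if not 0.0 <= failure_rate < 1.0:
        raise ValueError("failure_rate must be in [0, 1)")
    if max_retries < 0:
        raise ValueError("max_retries must be >= 0")
    # Attempt k+1 happens only if the first k attempts all failed:
    # 1 + f + f^2 + ... + f^max_retries
    return sum(failure_rate ** k for k in range(max_retries + 1))


def amplified_token_bill(base_bill: float, failure_rate: float, max_retries: int) -> float:
    """Token layer after retries, assuming every attempt costs the same tokens."""
    return base_bill * expected_attempts(failure_rate, max_retries)


if __name__ == "__main__":
    base = 37_000  # token layer from the worked projection, USD/month
    for retries in (0, 1, 3, 10):
        bill = amplified_token_bill(base, failure_rate=0.5, max_retries=retries)
        print(f"max_retries={retries:>2}: ${bill:,.0f}/mo")
```

The multiplier can never exceed 1 / (1 - failure_rate) under these assumptions, and a low retry cap keeps it well below that ceiling.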

12) Pattern transfer — where cost-curve thinking recurs

  • Lock-in (file 04) — the cost curve re-opens the decision exactly when exit cost is high; the two files combine into the synthesis decision (rubric clears, gravity picks, exit cost and cost curve decide whether to stay).
  • Build-vs-buy (file 01) — file 01 flagged that "self-host wins when volume is huge and the per-call platform fee dominates"; this file is that crossover, computed.
  • Model vendor strategy (module 12) — the token layer is a model-pricing decision nested inside the platform decision; routing FAQ to a cheap model is the same cost-shape thinking one layer down.
  • Capacity planning / autoscaling in classic infra — fixed-floor-plus-marginal vs per-request pricing is the identical reserved-vs-on-demand crossover; the fee-shape-decides-at-scale insight transfers wholesale.
  • Retry storms (distributed systems) — the retry-storm amplification of the token layer is the same failure geometry as a retry storm saturating a backend; bounded retries cap both.

13) The cost-model audit — five yes/no questions

  1. Have you split cost into three layers (tokens, platform fee, own infra) instead of one per-task number?
  2. Do you know the shape of the platform-fee layer for each candidate — linear per-conversation, per-resource sub-linear, or fixed-floor-plus-marginal?
  3. Have you projected each layer separately to 10× and 100× of current volume and found where the family curves cross?
  4. For self-host, have you priced the ops floor and the headcount, not just the raw infra?
  5. For any growing agent, have you checked whether its pricing makes success a cost liability (a linear per-unit fee that never amortizes)?

If question 2 is "no," you are projecting a blended pilot number forward, and the crossover will surprise you mid-flight.


Where this appears in production

Fee shapes that decide the family at scale:
  • Salesforce Agentforce at $2/conversation: a linear per-conversation fee that's a footnote at 10K/mo and ~$2M/month at 1M/mo; Flex Credits ($500/100K credits, about $0.10/action at 20 credits per action) are the same linear shape per action.
  • AWS Bedrock AgentCore consumption billing: $0.0895/vCPU-hour, $0.00945/GB-hour, plus per-memory-record and per-1K-gateway-call fees; a per-resource shape that amortizes as utilization rises.
  • Google Vertex Agent Engine: $0.0864/vCPU-hour, $0.0090/GB-hour with a free tier (50 vCPU-hrs, 100 GB-hrs/mo); the same amortizing per-resource shape as AgentCore.
  • Self-hosted LangGraph on GKE/EC2: raw compute plus a fixed ops floor; lowest marginal cost at extreme volume, but only after the floor and on-call team amortize.

Where the layers dominate differently:
  • A high-volume FAQ support agent: token layer dominated by FAQ traffic, cut sharply by routing 70% to a cheap model (the file-03 model-flexibility axis paying off in layer 1).
  • A long-running research agent: memory-record fees creep as long-term memory accumulates; pruning the per-record growth is the cost lever.
  • A buggy agent in a retry loop: token layer balloons super-linearly with no traffic increase; bounded retries cap the amplification.
  • A regulated bank at 1M conv/mo: runtime wins on total cost of ownership ($51K/mo) because managed ops save three engineers self-host would demand.

Where the curve re-opened the decision:
  • A startup on per-conversation SaaS: viral success drove traffic up and the linear fee turned the agent's success into a budget crisis, forcing a mid-flight migration.
  • An enterprise that modeled only the token layer: obsessed over the biggest line and missed the platform-fee shape that diverged 150× at scale.
  • A team that self-hosted "for cost": found the ops headcount erased the marginal-cost win below its actual volume and moved to a managed runtime.


Pause and recall

  1. Name the three cost layers and what each scales with.
  2. Which layer dominates the total bill, and why does it not decide the family?
  3. Why does a single per-conversation number projected linearly lie at scale?
  4. What is the difference in shape between a SaaS per-conversation fee and a runtime per-resource fee?
  5. At 1M conv/mo, why did the runtime beat self-host despite self-host's lower platform fee?
  6. What is "success as a cost liability," and which fee shape causes it?
  7. When is elaborate cost-curve modeling a waste?
  8. What signal does an experienced lead check first, and why not the total bill?

Interview Q&A

Q1. A team wants to ship on a SaaS agent at $2/conversation because the pilot bill is trivial. The agent is expected to grow 100×. What do you say? A. The pilot bill is trivial because it's at pilot volume, below every crossover. A $2/conversation fee is linear and never amortizes, so at 1M conversations/month it costs ~$2M/month while a per-resource runtime fee at the same volume is ~$10K/month, two orders of magnitude apart. For a growing agent, a linear per-conversation fee makes success a cost liability. I'd model the platform-fee shape to 100× before shipping and likely reject SaaS for a high-growth agent. Common wrong answer to avoid: "Ship it, the pilot cost is fine and it deflects human agents." Pilot cost is below the crossover; the linear fee is a $2M/month wall at scale.

Q2. Which cost layer should you optimize, and which should you watch to decide the platform? A. They're different layers. Optimize the token layer — it dominates the total bill — with model routing, prompt trimming, and bounded retries. But decide the platform on the platform-fee layer's shape, because that's the line that diverges across families at scale. The token layer is the same on every family using the same models, so it's real money but not a decision input. Staring at the total (mostly tokens) hides the deciding layer. Common wrong answer to avoid: "Optimize and decide on the total monthly bill." The total is dominated by tokens and hides the platform-fee shape where the family decision lives.

Q3. At 1M conversations/month, self-host's raw infra is cheaper than the runtime. Why might you still pick the runtime? A. Because layer 3 (your own infra and ops) dominates the difference at that volume. Self-host means running the memory store, trace pipeline, autoscaler, region-pinning, and 24×7 on-call, which a projection prices at ~$20K/month plus 2–3 engineers a small team doesn't have. The runtime folds that into a managed fee ~$2K/month above raw infra. That markup buys back three headcount and the on-call burden, the cheapest money a small, compliance-bound team spends. Self-host wins only above the volume that amortizes both the ops floor and the team. Common wrong answer to avoid: "Self-host is cheaper at scale, always pick it." Cheaper on raw infra, not on total cost of ownership; the ops headcount erases the win below the amortization volume.

Q4. Why does projecting "blended cost per conversation × volume" give the wrong answer? A. It collapses three differently-scaling layers into one and projects the platform-fee layer linearly when only the SaaS shape is linear. Tokens scale linearly, a runtime's per-resource fee amortizes sub-linearly, and your own infra scales with data not conversations. Multiplying one blended pilot number by 100 assumes all three scale linearly and proportionally, which hides the crossing point — exactly the number that decides the platform. Common wrong answer to avoid: "It's a fine first approximation." It systematically hides the fee-shape divergence that is the whole decision at scale.

Q5. The bank's token layer is $37K/month and its platform fee is $11K. Which number decides the platform, and why? A. The $11K platform fee, even though it's smaller. The $37K token layer is the same across all three families using the same models, so it can't differentiate them; it's a fixed proportional cost. The platform-fee layer is what diverges: $11K on the runtime against $2M on SaaS and ~$9K-plus-headcount on self-host. The decision lives in the layer that diverges, not the layer that's largest. Common wrong answer to avoid: "Tokens — it's the biggest number." Biggest isn't deciding; tokens are equal across families, so they're irrelevant to the choice.

Q6. Cost just pulled the bank's platform decision back open at 600K conversations/month. Is this a build-vs-buy bug (file 01), a lock-in bug (file 04), or a cost-curve bug (this file)? A. It's a cost-curve bug that becomes a lock-in problem. The decision re-opened because nobody projected the platform-fee shape to scale (this file). It hurts because the team is locked in on three surfaces (file 04), so the forced migration costs ~8 engineer-months. And it traces to a build-vs-buy choice (file 01) that picked a linear-fee SaaS for a growing agent. The root cause is fee-shape blindness; the pain multiplier is the absent escape hatch. Common wrong answer to avoid: "Just renegotiate the per-conversation rate." The rate isn't the problem; the linear shape is, and a discount on a linear fee still scales linearly.


Design/debug exercise (10 min)

Step 1 — Model it. Project the bank's support agent platform fee by family to 1M conv/mo:

1M conv/mo, 5 turns each. Compare LAYER 2 (platform fee) by shape:
  SaaS @ $2/conv:        1M × $2                    = $2,000,000/mo  (linear, no amortization)
  AgentCore per-resource: compute-hrs + memory recs
                          + gateway calls            ≈ $11,000/mo     (sub-linear, amortizes)
  Self-host raw infra:    your GKE + autoscale       ≈ $9,000/mo      (low slope, high ops)
Then add LAYER 1 (tokens ≈ $37K, same all families) and LAYER 3 (infra: runtime ~$3K,
self-host ~$20K + 2-3 FTE). Total: SaaS ~$2.04M, runtime ~$51K, self-host ~$66K + headcount.
Decision: runtime — kills SaaS on fee shape, beats self-host on layer-3 ops/headcount.
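The same arithmetic as a small script, so the totals and the deciding layer can be recomputed when any input changes. The dollar figures are the ones in the projection above. Two simplifications are assumptions of this sketch: SaaS own-infra is set to 0 because the text does not price it, and self-host headcount is left out because no dollar figure is given for it.

```python
# Monthly cost in USD at 1M conversations/month, from the Step 1 projection.
# ASSUMPTION: SaaS own_infra = 0 (not priced in the text).
# Self-host headcount (2-3 FTE) is excluded: the text gives no dollar value.
LAYERS = {
    "saas":      {"tokens": 37_000, "platform_fee": 2_000_000, "own_infra": 0},
    "runtime":   {"tokens": 37_000, "platform_fee": 11_000,    "own_infra": 3_000},
    "self_host": {"tokens": 37_000, "platform_fee": 9_000,     "own_infra": 20_000},
}


def totals(layers):
    """Sum the three layers for each family."""
    return {family: sum(costs.values()) for family, costs in layers.items()}


def layer_spread(layers):
    """Max minus min across families, per layer: how far each layer diverges."""
    layer_names = next(iter(layers.values())).keys()
    return {
        name: max(c[name] for c in layers.values())
        - min(c[name] for c in layers.values())
        for name in layer_names
    }


for family, total in totals(LAYERS).items():
    print(f"{family:<10} ${total:>10,}/mo")

spread = layer_spread(LAYERS)
print("deciding layer:", max(spread, key=spread.get))
```

The token layer has zero spread across families, which is why it cannot decide anything; the platform-fee layer has by far the largest.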

Step 2 — Your turn. Project the bank's internal-ops agent (a few thousand calls/day, ~100K conversations/month, low growth, heavy RAG over documents). Build all three layers, find whether the family curves cross within its volume band, and decide whether cost even matters here or whether control and residency dominate (hint: at flat low volume, the curves never cross and cost is not the decision). Then take one growing agent from your own backlog and find its crossover volume.
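For the crossover search, a small helper can do the algebra once each family's curve is approximated as a monthly floor plus a per-conversation marginal cost. This is a sketch under that straight-line assumption; `crossover_volume` and `in_band` are hypothetical names, and the self-host parameters in the usage lines are illustrative stand-ins read off the Step 1 figures (a $20K ops floor and $9K of infra spread over 1M conversations), not measured values. Substitute your own layer estimates.

```python
from typing import Optional


def crossover_volume(
    floor_a: float, slope_a: float, floor_b: float, slope_b: float
) -> Optional[float]:
    """Volume where floor_a + slope_a*v equals floor_b + slope_b*v.

    Returns None when the lines are parallel or only meet at a
    non-positive volume, i.e. one family is cheaper at every real volume.
    """
    if slope_a == slope_b:
        return None
    volume = (floor_b - floor_a) / (slope_a - slope_b)
    return volume if volume > 0 else None


def in_band(volume: Optional[float], low: float, high: float) -> bool:
    """True if a crossover exists and falls inside the agent's expected volume band."""
    return volume is not None and low <= volume <= high


# Linear SaaS fee ($2/conversation, per the chapter) against an
# ASSUMED self-host curve: $20K/month floor + $0.009/conversation.
v = crossover_volume(0.0, 2.0, 20_000.0, 0.009)
print(f"crossover near {v:,.0f} conversations/month")
# ASSUMPTION: an 80K-120K band for a flat ~100K/month agent.
print("inside band:", in_band(v, 80_000, 120_000))
```

If the crossover falls outside the band the agent will actually occupy, the ranking of families does not change within that band, and the decision turns on whatever else separates them.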

Step 3 — Reproduce from memory. Redraw the three-layer model (tokens / platform fee / own infra — what each scales with, which decides) and the fee-shape curves (SaaS linear, runtime sub-linear, self-host floor-plus-marginal) cold, then connect it to file 04: the cost curve re-opens the decision exactly when the escape hatch is most needed, because a forced migration under a runaway fee is only survivable if exit cost is low. If you can draw the three layers, the three curve shapes, and name which layer decides, you own this chapter.


Operational memory

This chapter made cost the subject and showed that the cheapest pilot is rarely the cheapest production. The important idea is that an agent's bill is three layers that scale differently — model tokens (linear, same across families), platform fees (family-specific shape, where the decision lives), and your own infra (data and ops driven) — not one per-task number times volume. The platform-fee shape, not its pilot size, decides the family at scale: a linear per-conversation fee integrates to a straight line that never amortizes, while a per-resource fee flattens as utilization rises.

You learned to project each layer separately to 10× and 100× and find where the family curves cross. That solves the opening failure because the bank's support agent at 1M conversations/month would cost ~$2M/month in SaaS platform fees against ~$51K/month total on the runtime. In the platform-fee layer alone that is $2M against ~$11K, a roughly 180× divergence in a line that was a $20K footnote at pilot. The runtime also beat self-host, not on raw infra but on layer 3, where managed ops saved three engineers the small team didn't have.

Carry this diagnostic forward: when someone quotes a cost-per-conversation, ask for the three layers and the platform-fee shape projected to your scale. Watch the slope of cost-per-conversation against volume, not the absolute total — a flat slope means a linear fee that will punish success. If success makes the agent more expensive with no amortization, the fee shape is wrong for a growing agent.
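The slope check can be run against any quoted fee schedule. The sketch below is one way to do it, assuming you can express the vendor's fee as a function of monthly volume; `amortizes` and its 1% tolerance are hypothetical choices, and the floor-plus-marginal curve uses illustrative numbers rather than a real price list.

```python
def cost_per_conversation(total_cost, volumes):
    """Unit cost at each volume, given a function volume -> total monthly cost."""
    return [total_cost(v) / v for v in volumes]


def amortizes(total_cost, volumes, tolerance=0.01):
    """True if unit cost drops by more than `tolerance` (as a fraction)
    between the lowest and the highest volume."""
    unit = cost_per_conversation(total_cost, volumes)
    return (unit[0] - unit[-1]) / unit[0] > tolerance


def linear_fee(volume):
    # $2 per conversation, the SaaS rate used in this chapter.
    return 2.0 * volume


def floor_plus_marginal(volume):
    # ASSUMPTION: illustrative $20K/month floor plus $0.009 per conversation.
    return 20_000 + 0.009 * volume


VOLUMES = [10_000, 100_000, 1_000_000]
for name, curve in [("linear", linear_fee), ("floor+marginal", floor_plus_marginal)]:
    units = ", ".join(f"${u:.3f}" for u in cost_per_conversation(curve, VOLUMES))
    print(f"{name:<15} {units}  amortizes={amortizes(curve, VOLUMES)}")
```

A curve whose unit cost stays flat across two orders of magnitude is the linear shape the diagnostic warns about.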

Remember:

  • Project three layers, each by its own scaling shape; never multiply one blended per-task number by volume.
  • The token layer is the biggest line and decides nothing — it's the same across families.
  • The platform-fee shape decides the family at scale: linear per-conversation never amortizes; per-resource flattens.
  • Self-host's low marginal cost wins only above the volume that amortizes its ops floor and on-call headcount.
  • A linear per-conversation fee makes success a cost liability; reject it for a growing agent.
  • Find the crossover at selection; finding it mid-flight means re-opening the decision while locked in.

Bridge. We've now scored what a platform can do, what it costs to leave, and what it costs to run at scale — three quantitative passes that produce a defensible decision. But every one of them can be overruled by a single non-negotiable fact: a regulator's residency rule, a PII-handling requirement, a tenant-isolation demand, or an audit obligation that the platform simply cannot satisfy. These gates don't get weighed against the rubric; they override it — a platform that fails a governance gate is out regardless of capability, portability, or cost. The next file walks the security, governance, and compliance gates that kill platforms before a single agent ships. → 06-security-governance-and-compliance.md