Skip to content

03. Capability rubric — turning "the demo felt great" into a defensible score

~19 min read. Two runtimes are on the table. Both demoed well. Procurement wants a number, not a vibe. This file builds the rubric that produces the number — and weights it so it answers the bank's question, not a generic one.

Built on 02-platform-landscape-map.md. You can now place players by the three families and position them by what they optimize for. This file converts that positioning into the capability rubric — a weighted score across nine axes — so two real candidates can be compared on the dimensions this use case cares about, not on a generic checklist.


What the map left as a slogan

The map told you AgentCore optimizes for cloud-native security, Vertex for model breadth and memory, Foundry for the Microsoft estate. Useful for narrowing. Useless for choosing between two finalists, because "optimizes for security" does not tell you whether its memory survives a crash, whether its tool calling handles your private services, or whether its eval hooks let you gate a release. Slogans rank intentions; a rubric ranks fit.

This file assumes you have narrowed to two or three candidates (the bank narrowed to AgentCore and Vertex Agent Engine, with self-hosted LangGraph as a control) and now must produce a comparison that survives a procurement review and a six-month "why did we pick this" retro.

What this file solves

A team picks a platform on demo impression, then discovers in production that the axis it never tested — say, eval hooks or human-in-the-loop — is exactly the one the use case needed. This file gives you nine capability axes, a weighting method tied to the use case, and a worked scoring of the bank's real candidates. The first move is to stop scoring platforms on "how good is it" and start scoring on "how good is it at the things this use case will stress," because a uniform checklist over-weights what the vendor demos and under-weights what your traffic will break.

Why an unweighted feature checklist lies

The naive rubric is a feature checklist: list capabilities, tick boxes, sum the ticks, highest score wins. It lies for two reasons. First, it treats all axes as equally important, so a platform that aces ten axes you don't need and fails the one you do beats a platform that nails the one that matters. Second, it scores presence ("has memory") not fit ("has the kind of memory that survives a mid-trajectory crash and pins to my region"). A checklist rewards breadth; production rewards depth on the few axes your traffic stresses.

When the highest-scoring platform is the wrong one

The bank's first analyst builds an unweighted checklist. Eighteen features, one point each. Vertex Agent Engine scores 16/18 (huge model menu, GA memory, code execution, eval, governance). AgentCore scores 14/18. Vertex wins on the slide.

Unweighted checklist (excerpt):
                          Vertex   AgentCore
  200+ model menu            ✓         ~        (AgentCore: fewer but enough)
  GA managed memory          ✓         ✓
  Code execution             ✓         ✓
  Built-in eval              ✓         ✓
  Multi-agent A2A            ~         ~
  Native KMS per-call sign   ~         ✓        ← weight 1, same as model menu
  In-VPC private tool call   ✓         ✓
  IAM identity bank uses     ✗         ✓        ← weight 1, same as model menu
  ...
  TOTAL                      16        14

The checklist says Vertex. But the two axes where AgentCore wins — native KMS per-call signing and the IAM identity the bank already runs — are the exact axes the regulator and the bank's security team will block on. The 200+ model menu, where Vertex shines, is worth almost nothing to a bank that will use two or three approved models. So the real problem is not that the platforms are hard to compare; it is that an unweighted sum drowns the two decisive axes in fourteen irrelevant ones. The fix is to weight each axis by how hard this use case will stress it, then score fit, not presence.

So how do you weight without fooling yourself into justifying the answer you already wanted? You derive weights from the use case's stress profile before you score any vendor, and you write the weights down so the procurement review can challenge them.

The rubric rule. Score platforms on weighted fit to the use case's stress profile, not on unweighted feature presence. Set the weights from the use case before scoring any vendor, and make every score a fit judgment ("survives a crash, pins to region") not a checkbox ("has memory").

Why this rule exists. The primitive is that capability only matters under load: a feature you never stress is free to be mediocre, and a feature your traffic hammers must be excellent. The constraint is that vendors demo breadth and hide the depth of the few axes you'll stress. An unweighted checklist inherits the vendor's framing — breadth — and inverts your priorities, which is why the highest checklist score can be the wrong platform.


1) The nine capability axes — what each one actually tests

These are the dimensions every agentic platform lives or dies on. For each, the question is not "does it have this" but "does it have the kind this use case will stress."

  1. Orchestration — can it run the topology you need (single call, ReAct loop, orchestrator, multi-agent) with the control you need over the loop, branching, and state transitions? Test: does it expose the graph, or hide it behind a closed flow?
  2. Tool / function calling — can it call the tools you actually have, including private in-VPC services, with reliable schema enforcement and parallel calls? Test: can it reach a service with no public endpoint?
  3. Memory / state — short-term session state and long-term memory: does it persist, survive a crash, pin to a region, and isolate per tenant? Test: kill the process mid-trajectory — what survives?
  4. RAG integration — can it ground answers in your corpus with the retrieval quality, permission-awareness, and freshness you need? Test: does it respect per-user document permissions?
  5. Multi-agent — if you need peers or planner/worker, does it support handoffs, agent-to-agent protocols, and shared state cleanly? Test: can two agents coordinate without you hand-rolling the message bus?
  6. Observability / tracing — can you see every step, tool call, token, and decision, and export traces to your own stack? Test: can you reconstruct a failed trajectory from the platform's logs alone?
  7. Eval hooks — can you wire offline eval sets, online scoring, and release gates so a regression blocks a deploy? Test: can a failing eval stop a rollout automatically?
  8. Human-in-the-loop — can you insert approval gates at named steps above named thresholds, with a clean handoff to a human? Test: can you require human sign-off on a refund over ₹50,000?
  9. Model flexibility — can you choose, route between, and swap models without rewriting the agent, including bringing your own approved models? Test: can you route cheap traffic to a cheap model and swap a model in a config change?
                 ORCHESTRATION ── TOOLS ── MEMORY ── RAG ── MULTI-AGENT
                       │            │         │       │         │
        the loop & ────┘            │         │       │         └──── peers/handoffs
        topology control            │         │       │
                       private  ────┘         │       └──── grounding + permissions
                       in-VPC calls           │
                       survive-a-crash, ──────┘
                       region-pinned state

                 OBSERVABILITY ── EVAL HOOKS ── HUMAN-IN-LOOP ── MODEL FLEXIBILITY
                       │             │              │                  │
        export ───────┘             │              │                  └── route + swap, BYO model
        traces to                   │              └── approval gates at thresholds
        your stack         release-gating on eval regression

The nine axes split naturally: the first five are what the agent can do (capability), the last four are whether you can trust it in production (operability). A demo exercises the first five and skips the last four — which is exactly why teams under-weight observability, eval, and human-in-the-loop until an incident.


2) Picture first — the rubric as a weighted radar, not a sum

flowchart TD
    A[Use case stress profile] --> B[Assign weight 1-5 to each of 9 axes]
    B --> C[Score each candidate 1-5 for FIT on each axis]
    C --> D[Weighted score = sum of weight x fit per axis]
    D --> E{Any axis with high weight<br/>scored 1-2?}
    E -->|Yes| F[Disqualifier — investigate before trusting total]
    E -->|No| G[Compare weighted totals]
    G --> H[Sanity-check: does the winner pass the gates<br/>that file 06 will impose?]

The decisive node is E. A platform can have a great weighted total and still be disqualified by a single high-weight axis scored 1-2 — because in production, one critical capability at zero sinks the whole agent regardless of how the rest score. A weighted sum can hide a fatal hole; the disqualifier check is what catches it. For the bank, "in-VPC private tool call" is a high-weight axis; any platform scoring 1-2 there is out, no matter its total.


3) The bank's weights — derived from the stress profile, before any vendor

The bank writes the weights first, from the use case, so the scoring can't be reverse-engineered to a favorite. Weight 1 = barely stressed, 5 = heavily stressed and decisive.

Axis Weight Why this weight for the bank
Orchestration 4 ReAct loop with branching on policy/tier; needs real loop control, not a closed flow
Tool / function calling 5 Must call the in-VPC fraud-scoring and KYC services — no public endpoint
Memory / state 4 Conversation continuity + must survive crash and pin to ap-south-1
RAG integration 3 Grounds support answers in policy docs; permission-aware but not the core
Multi-agent 1 Both agents are single-domain ReAct; no peer negotiation needed
Observability / tracing 4 Regulator + on-call need full trace export to the bank's stack
Eval hooks 3 Want release-gating on a support-quality eval set
Human-in-the-loop 5 Refunds/limit changes over thresholds must require human sign-off
Model flexibility 4 Route 70% FAQ to a cheap model; BYO approved models; swap without rewrite

Two axes are weight 5 (tools, human-in-the-loop) because the regulator and the security team will block on them. Multi-agent is weight 1 because file 01 already established both agents are single-domain. The model menu — Vertex's headline strength — folds into "model flexibility" at weight 4, but the bank values routing and swap, not menu size, so a 200-model menu and a 5-model menu can score the same fit here.

Teacher voice. The weights are the whole argument. Once you write "tools = 5, multi-agent = 1," the comparison is half-decided, and it's decided by the use case rather than by whichever vendor demos hardest. Spend your scrutiny on the weights; they encode your judgment about what production will stress. A reviewer who disagrees with the pick should attack the weights, not the scores.


4) Scoring the candidates — fit, not presence

Now score each candidate 1-5 for fit on each axis (5 = excellent fit for the bank's specific stress, 1 = poor). These scores reflect the verified 2026 capabilities from the landscape file.

Axis Weight AgentCore Vertex Agent Engine Self-host LangGraph
Orchestration 4 4 (hosts LangGraph, full loop control) 4 (ADK + hosts frameworks) 5 (you own the graph)
Tool / function calling 5 5 (Gateway + in-VPC via AWS net) 4 (in-VPC needs cross-cloud care) 5 (you wire any tool)
Memory / state 4 4 (managed memory, region-pinned) 5 (Memory Bank GA) 3 (you build + operate it)
RAG integration 3 4 4 (strong) 4 (LlamaIndex/own)
Multi-agent 1 3 4 (A2A-ish) 4
Observability / tracing 4 4 (CloudWatch + export) 4 (Cloud Trace) 3 (you build the pipeline)
Eval hooks 3 3 4 (built-in eval, GA) 3 (you wire it)
Human-in-the-loop 5 4 (gate in your owned logic) 4 (gate in your owned logic) 5 (you own every gate)
Model flexibility 4 4 (Bedrock model access, route in your code) 5 (200+ menu, route in your code) 5 (any model, any router)

Now the weighted total: multiply weight × fit per axis, sum.

AgentCore:
  4×4 + 5×5 + 4×4 + 3×4 + 1×3 + 4×4 + 3×3 + 5×4 + 4×4
= 16 + 25 + 16 + 12 + 3 + 16 + 9 + 20 + 16 = 133

Vertex Agent Engine:
  4×4 + 5×4 + 4×5 + 3×4 + 1×4 + 4×4 + 3×4 + 5×4 + 4×5
= 16 + 20 + 20 + 12 + 4 + 16 + 12 + 20 + 20 = 140

Self-host LangGraph:
  4×5 + 5×5 + 4×3 + 3×4 + 1×4 + 4×3 + 3×3 + 5×5 + 4×5
= 20 + 25 + 12 + 12 + 4 + 12 + 9 + 25 + 20 = 139

Vertex tops the weighted total (140) over self-host LangGraph (139) and AgentCore (133). If you stopped here, you'd pick Vertex — and you might still be wrong, because the rubric is one input, not the verdict.


5) Why the weighted winner still isn't the pick — the disqualifier and the gravity

Two things override the raw total.

First, the disqualifier check from section 2. No candidate scored 1-2 on a high-weight axis, so none is auto-disqualified. But the gap between 133 and 140 is seven points out of ~140 — five percent. A five-percent rubric gap is noise, not signal. The scores carry judgment error larger than five percent. When candidates land within a few percent, the rubric has done its job (it eliminated bad fits) and the decision passes to factors the rubric can't capture.

Second, the gravity from file 02. The bank's core-banking data and identity live on AWS. On Vertex, the weight-5 "tool / function calling" axis scored 4 specifically because in-VPC calls to AWS-resident services require cross-cloud networking and identity — a standing integration cost and a second control plane. That single fact, worth one point on the rubric, is worth far more in operations: it means the bank runs two clouds for one agent forever.

Rubric says:        Vertex 140  >  LangGraph 139  >  AgentCore 133   (5% spread = noise)
Gravity says:       AgentCore — data + identity already on AWS, in-VPC calls free
Team-capacity says: not LangGraph — 6-person team can't own agent ops
Decision:           AgentCore

The pick is AgentCore, even though it has the lowest weighted total of the three. The rubric didn't pick it; the rubric cleared it (no disqualifiers, within noise of the top) and then gravity and team capacity decided. This is the right use of a rubric: it is a filter and a forcing function for honest weighting, not an oracle that emits the answer.

Mini-FAQ. "Then why bother scoring at all if gravity decides?" Because the rubric proves AgentCore is good enough on every axis the use case stresses — it scored 4-5 on all five weight-4-and-5 axes. Without the rubric you wouldn't know that; you might have rejected AgentCore on the unweighted checklist (14/18) where it looked worst. The rubric's job was to show the within-noise tie so gravity could legitimately break it.


6) The failure walked through: trusting the total blindly

Replay what happens if the bank's analyst ships the rubric's raw winner.

Decision: Vertex Agent Engine (rubric total 140, the highest).
Month 1   Works. Memory Bank is excellent. Model menu impresses leadership.
Month 2   In-VPC calls to AWS-resident fraud service require a cross-cloud
          identity bridge + private interconnect. Standing cost, new control plane.
Month 4   Security review flags the cross-cloud egress path as extra attack surface;
          residency proof now spans two clouds' audit trails.
Month 6   On-call now reasons about two clouds' failure modes for one agent.

Nothing here is a Vertex defect. Vertex is excellent. The failure is methodological: the analyst let a five-percent rubric gap override a structural gravity fact worth far more than five percent. Not a platform-quality problem; a rubric-misuse problem. The rubric is an input that eliminates poor fits and forces honest weighting. It is not the decision. The decision weighs the cleared candidates against gravity, team capacity, cost (file 05), and governance gates (file 06).


7) Cost movement — what the rubric buys and what it costs

Approach Effort to produce What it catches What it misses When it misleads
Vibe / demo impression minutes nothing rigorous every axis not demoed always over-weights demoed breadth
Unweighted checklist hours feature gaps relative importance drowns decisive axes in irrelevant ones
Weighted rubric (this file) a day poor fit on stressed axes; forces honest weights gravity, team capacity, exact cost curve when a small total gap is read as signal
Weighted rubric + gravity + cost + gates days the actual decision nothing major if done in order only if any single input is treated as the whole answer

The movement: each step up the ladder costs more effort and catches a different class of error. The weighted rubric's specific contribution is to convert demo impression into a defensible, attackable artifact — a reviewer can challenge a weight or a score, which they cannot do to a vibe. The subsystem that pays is your team's time (about a day), and the payoff is a decision that survives the procurement review and the six-month retro because every number has a reason written next to it.


8) Operational signals — is the rubric being used well?

Healthy. Weights were written from the use case before any vendor was scored, every score is a fit judgment with a one-line reason, and the rubric is used to eliminate poor fits and surface within-noise ties — then gravity, cost, and gates decide.

First metric that degrades. The spread between candidates' totals. When the top candidates land within ~5%, the rubric is telling you it has done all it can; treating that spread as a verdict is the first misuse. Watch for the spread shrinking and the analyst still declaring a "winner."

Misleading metric people watch. The raw total alone, divorced from weights and disqualifiers. A total of 140 means nothing without seeing that two weight-5 axes drove it and that gravity sits outside the rubric. People quote the total like a benchmark score; it isn't one.

The signal an experienced lead checks first. "Did anyone score a 1 or 2 on a weight-4-or-5 axis?" — the disqualifier scan — and "are the top candidates within noise, and if so what breaks the tie?" A lead reads the high-weight rows first and the total last.


9) Boundary of applicability — when the rubric helps and when it misleads

Strong fit. Comparing two-to-four narrowed candidates in the same or adjacent families, where you need a defensible artifact for procurement and a record for the retro. The rubric shines at "we've shortlisted three; justify the pick."

Where the rubric becomes pathological. When teams treat the total as an oracle and let a noise-level gap override structural facts (gravity, exit cost, governance). The rubric then launders a weak decision in quantitative clothing — the most dangerous failure, because the number looks objective.

Scale / regime that invalidates the intuition. "Higher total = better platform" holds only when the gap exceeds the judgment error in the scores (call it ~10%). Below that, totals are ties and the decision is elsewhere. The intuition that breaks most often: that quantifying makes a decision objective. Weighted scores are structured judgment, not measurement — their value is forcing the judgment into the open, not removing it.


10) The wrong model to drop: "the rubric outputs the decision"

The seductive wrong idea is that a good rubric is a function — use case in, platform out — so once you've scored, you're done. Reality: the rubric is a filter and a forcing function. It filters out poor fits (anything scoring 1-2 on a high-weight axis) and forces you to write down what production will stress. It does not capture cloud gravity, exit cost, team capacity, the cost curve at 100×, or the governance gates — all of which can override a within-noise total.

Replace "rubric outputs the decision" with "rubric clears the field, then gravity, cost, and gates decide among the cleared candidates." The bank's pick — AgentCore, the lowest total — is the proof: the rubric cleared it, gravity chose it.


11) Other failure shapes around capability scoring

  • Reverse-engineered weights — picking the favorite first, then setting weights to make it win; defeated by writing weights before scoring any vendor.
  • Presence-not-fit scoring — ticking "has memory" without testing whether the memory survives a crash or pins to your region.
  • Demo-axis bias — high scores on the five capability axes the demo exercises, blanks on the four operability axes (observability, eval, human-in-loop) it skips.
  • Total-as-oracle — reading a 5% gap as a verdict and overriding gravity or exit cost.
  • Disqualifier blindness — summing past a weight-5 axis scored 2, so a fatal hole hides inside a healthy total.
  • Stale scores — scoring against last year's capabilities (e.g., before Vertex Memory Bank went GA), so the rubric encodes outdated facts.
  • Single-scorer bias — one analyst's scores never challenged; weights and scores should be reviewed by someone who can attack them.

12) Pattern transfer — where weighted scoring recurs

  • Model vendor strategy (module 12) — choosing a model provider uses the same weighted-fit-not-presence logic: weight axes (latency, cost, context, safety) by the workload, score fit, and never read a small total gap as signal.
  • AI product requirements (module 00) — the use case's stress profile that sets the weights is exactly the requirements artifact; the rubric is requirements made quantitative.
  • Lock-in & portability (file 04) — exit cost is a factor the rubric deliberately excludes; file 04 scores it separately so it can override a within-noise total, just as gravity did here.
  • Eval gates (module 03, agent observability) — the eval-hooks axis is where this module touches eval-gating: a platform that can't block a rollout on a regression scores low on the axis a quality-sensitive use case weights heavily.

13) The rubric audit — five yes/no questions

  1. Were the weights written from the use case's stress profile before any vendor was scored, and are they recorded?
  2. Is every score a fit judgment with a one-line reason, not a presence checkbox?
  3. Did you scan for disqualifiers — any weight-4-or-5 axis scored 1-2 — before reading the total?
  4. If the top candidates are within ~10%, did you stop treating the total as a verdict and bring in gravity, cost, exit cost, and gates?
  5. Are the scores current to this year's verified capabilities, not last year's?

Where this appears in production

Platforms scored on the capability axes: - AWS Bedrock AgentCore — scores high on tools (Gateway + in-VPC via AWS networking) and identity for AWS-native shops; the bank's pick despite the lowest raw total. - Google Vertex Agent Engine — scores high on memory (Memory Bank GA), model flexibility (200+ Model Garden), and built-in eval; tops the bank's raw total but loses on gravity. - Self-hosted LangGraph — scores max on orchestration, tools, and human-in-the-loop (you own every gate) but low on memory/observability (you build them) and disqualified by team capacity. - Microsoft Foundry Agent Service — scores high on observability and governance for Microsoft-estate enterprises; the natural finalist when identity is Entra, not IAM.

Axis-by-axis, where each capability decides a real choice: - Tool calling (in-VPC) — a bank or insurer ruling out any platform that can't reach a private fraud or claims service with no public endpoint. - Memory survives-a-crash — a long-running research agent where losing mid-trajectory state means redoing expensive work. - Human-in-the-loop thresholds — a fintech requiring human sign-off on refunds over a limit, scoring platforms on clean approval gates. - Model flexibility (routing) — a high-volume support agent routing FAQ traffic to a cheap model to cut cost without rewriting the agent. - Eval hooks (release gating) — a regulated lender blocking a deploy automatically when a support-quality eval regresses. - Observability export — a team needing every trace in its own stack for the regulator, scoring down any platform that locks traces inside its console.

Where rubric misuse bit teams: - A retailer that shipped the rubric's raw winner — ignored cloud gravity worth more than the 5% total gap and ran two clouds for one agent. - A team that scored presence not fit — picked a platform with "memory" that didn't survive a crash and lost trajectories in production. - A vendor-led scorecard — weights set by the sales engineer over-weighted the demoed axes; the team re-ran it with use-case weights and flipped the pick.


Pause and recall

  1. Name the nine capability axes. Which four are operability rather than capability, and why does a demo skip them?
  2. Why does an unweighted checklist lie? Give the two reasons.
  3. What is the disqualifier check, and why can a high total still be fatally flawed?
  4. The bank's rubric totals: Vertex 140, LangGraph 139, AgentCore 133. Which did the bank pick, and why?
  5. What does "score fit, not presence" mean for the memory axis specifically?
  6. When does a total-gap become noise rather than signal?
  7. Why must weights be written before scoring any vendor?
  8. What is the rubric's real job, if it isn't to output the decision?

Interview Q&A

Q1. How do you compare two agentic platforms without picking by demo? A. Build a weighted rubric across nine axes — orchestration, tools, memory, RAG, multi-agent, observability, eval, human-in-the-loop, model flexibility. Write weights from the use case's stress profile before scoring any vendor, score each axis for fit (not presence) with a one-line reason, scan for disqualifiers on high-weight axes, then read the total last. The total filters poor fits; it doesn't decide. Common wrong answer to avoid: "Tick the features and sum them." Unweighted sums over-weight demoed breadth and drown the decisive axes.

Q2. Your rubric says Platform A (140) beats Platform B (133), but you picked B. Defend that. A. A 5% gap is within the judgment error of the scores — it's a tie, not a verdict. The rubric cleared both (no disqualifiers, both strong on every high-weight axis). The tie-breaker is a structural fact the rubric deliberately excludes: B's data and identity already live where ours do, so its weight-5 tool-calling axis is free in operations while A's costs a second cloud forever. The rubric's job was to prove B is good enough; gravity broke the tie. Common wrong answer to avoid: "The higher total should win." Below ~10% the totals are noise; reading them as signal launders a weak decision.

Q3. Why weight multi-agent at 1 for the bank but tools and human-in-the-loop at 5? A. Weights come from what production will stress. Both bank agents are single-domain ReAct loops — no peer negotiation — so multi-agent is barely exercised (weight 1). Tools must reach private in-VPC services the regulator's use case demands, and refunds over a threshold legally require human sign-off — both are decisive and heavily stressed (weight 5). Weighting is where use-case judgment enters; it's the part a reviewer should attack. Common wrong answer to avoid: "Weight everything equally to stay objective." Equal weights inherit the vendor's breadth framing and invert your priorities.

Q4. What does "score fit, not presence" change in practice? A. It turns "has memory" (presence, 1 point either way) into "memory that survives a mid-trajectory crash and pins to ap-south-1" (fit, scored 1-5). A platform can have memory that loses state on a crash — present but unfit. Fit scoring requires a test per axis: kill the process and see what survives; call a private service and see if it reaches; require sign-off over a threshold and see if it gates. Common wrong answer to avoid: "Presence is good enough; if it has the feature it works." Presence is what the demo shows; fit is what production needs.

Q5. A weight-5 axis scored 2 but the platform has the highest total. Ship it? A. No — that's a disqualifier. A capability the use case heavily stresses scoring 2 means the agent fails where it matters most, regardless of how the other axes inflate the total. The disqualifier scan runs before the total is read. Either the platform is out, or you have a mitigation that raises the real fit and you re-score with the reason written down. Common wrong answer to avoid: "The total is highest, ship it." A weighted sum can hide a fatal hole; one critical capability at 2 sinks the agent.

Q6. Is the bank's final pick a rubric outcome (file 03), a landscape/gravity outcome (file 02), or a build-vs-buy outcome (file 01)? A. It's all three in sequence: file 01 chose the runtime family by boundary, file 02 narrowed by gravity to AWS, and file 03's rubric cleared the AWS candidate as good-enough on every stressed axis and showed the within-noise tie that let gravity decide. The rubric didn't pick AgentCore; it proved AgentCore wasn't a poor fit, which the unweighted checklist had falsely implied. Common wrong answer to avoid: "The rubric picked it." The rubric filtered and forced honest weights; gravity and team capacity made the call.


Design/debug exercise (10 min)

Step 1 — Model it. Score the bank's support agent on three axes to see the method:

Axis: Tool/function calling   Weight: 5 (must reach in-VPC fraud service)
  AgentCore fit: 5 — Gateway + AWS networking reach private services natively
  Vertex fit:    4 — reachable but via cross-cloud identity bridge (standing cost)
Axis: Multi-agent             Weight: 1 (single-domain ReAct, no negotiation)
  Both ~3-4 — irrelevant; weight makes the score nearly free
Axis: Human-in-the-loop       Weight: 5 (refund > ₹50k needs sign-off)
  Both 4 — gate lives in owned orchestration on either runtime
Reason recorded next to every weight and score.

Step 2 — Your turn. Score the bank's internal-ops agent across all nine axes: first set its weights (hint: tools and memory stay high, human-in-the-loop may drop since it's internal, RAG may rise for document retrieval), then score AgentCore and one rival for fit with a one-line reason each, compute weighted totals, scan for disqualifiers, and decide whether the totals are within noise. Then run a platform decision from your own backlog the same way.

Step 3 — Reproduce from memory. Reproduce the nine-axis list and the weighted-score formula (weight × fit, summed) cold, and connect it to file 02: the rubric clears candidates the gravity-based decision then chooses among, which is why the bank's lowest-total candidate (AgentCore) was the pick. If you can write the nine axes and explain why the highest total lost, you own this chapter.


Operational memory

This chapter built the rubric that turns "the demo felt great" into a defensible score: nine capability axes, weighted by the use case's stress profile before any vendor is scored, with every score a fit judgment rather than a presence checkbox. The important idea is that the rubric's job is to clear poor fits and force honest weighting — not to emit the decision. A weighted total within ~10% of another is a tie, and the decision passes to factors the rubric excludes.

You learned to weight from the stress profile, score fit per axis with a recorded reason, scan for disqualifiers on high-weight axes, and read the total last. That solves the opening failure because the bank's unweighted checklist had ranked AgentCore worst (14/18) while drowning the two decisive axes; the weighted rubric showed AgentCore strong on every stressed axis and within noise of the top — so cloud gravity could legitimately pick it despite the lowest total.

Carry this diagnostic forward: when someone quotes a rubric total, ask for the weights, the high-weight-axis scores, and the disqualifier scan before trusting the number. A 5% gap is noise; treat it as such and bring in gravity, exit cost, cost-at-scale, and the governance gates.

Remember:

  • Weight axes from the use case's stress profile before scoring any vendor; the weights are the argument.
  • Score fit ("survives a crash, reaches the private service"), not presence ("has memory").
  • Scan for disqualifiers — a weight-5 axis scored 1-2 sinks the agent regardless of total.
  • A total within ~10% is a tie; the rubric cleared the field, gravity/cost/gates decide.
  • The demo exercises the five capability axes and skips the four operability axes — weight those deliberately.
  • The bank picked its lowest-total candidate (AgentCore) because the rubric cleared it and gravity broke the tie.

Bridge. The rubric cleared AgentCore and gravity picked it. But notice what the rubric deliberately refused to score: exit cost. We compared what each platform can do, not what it costs to leave. A platform can ace all nine axes and still be a trap if your orchestration, prompts, and data fuse to it so tightly that leaving takes quarters. The next file scores the dimension the rubric excluded — lock-in and portability — and shows how to keep an escape hatch open even on the platform you chose. → 04-lock-in-and-portability.md