13. Honest admission¶
Twelve chapters of discipline. None of them solve the problem entirely. This chapter lists the gaps a thoughtful lead is transparent about — with their team, their stakeholders, and themselves — when telling the story of what tool integration contracts can and cannot do.
The discipline in this module imports thirty years of API design experience and adapts it for a new kind of client. That is a real advance over treating tools as opaque function wrappers. It is not a complete solution. The honest admissions below are the gaps where current practice falls short, where the engineering community does not yet have a clean answer, and where a lead should expect to make judgment calls rather than apply rules.
The goal is not pessimism. It is calibrated confidence: the module has taught a discipline that meaningfully reduces incident surface, and there are categories of problem the discipline does not eliminate.
1 — Prompt injection is still mostly the agent's problem¶
Scope and credential isolation reduce the blast of a successful injection. They do not prevent the injection from happening or from convincing the model to do harmful things within scope. A tenant-bound credential cannot exfiltrate other tenants' data; it can still exfiltrate the current tenant's data through legitimate read tools, send a misleading email under the current tenant's identity, or trigger a chain of in-scope operations that aggregate to something the user did not intend.
The contract layer is a containment system, not a prevention system. Prevention requires:
- Agent-level prompt design that ignores instruction-shaped content from untrusted inputs (module
03_ai_security_safety/01_prompt_injection_security) - Output filters that catch attempts to exfiltrate via legitimate-looking calls
- Approval gates on actions that move data outside the tenant (module 01 chapter 07)
A lead should state plainly: contracts reduce the worst-case blast; they do not stop the attempt. The agent layer must do its share.
2 — Semantic drift is detectable only by invariant or impact¶
Chapter 09 listed six drift shapes. Five of them are mechanically detectable; the sixth — semantic shift where field name and type stay the same but meaning changes — is detectable only if your contract has an invariant postcondition that the new meaning violates, or if customer impact reports come in.
There is no general solution. The mitigations:
- Invariant postconditions on every tool that touches money or identity
- Periodic manual audits of downstream changelogs against your assumptions
- A contract pact that exercises end-to-end semantics, not just shapes
Even with these, a semantic shift in a low-value field that does not violate any invariant and does not produce visible customer impact can drift undetected for months. A lead should acknowledge: some categories of drift will get past us; the cost is bounded only by how soon downstream impact becomes visible.
3 — Idempotency keys do not survive every workflow design¶
Chapter 04 establishes that the key must come from the outermost retrying layer. This works when the platform has a durable workflow engine with persistent step state. It does not work — or works only partially — in setups where:
- The agent is single-process and stateless; a crash loses the in-flight key
- The orchestrator is a different system than the agent runtime and they do not share state
- The tool is invoked from multiple entry points (chat, API, batch job) that do not share a key strategy
- The workflow design itself does not have stable "step boundaries" (free-form agents that loop without persisted checkpoints)
Each is solvable, but solving each is non-trivial work. A platform that does not have a durable workflow engine is going to have idempotency gaps that no amount of contract design fixes from below. The fix is architectural: adopt durable workflows (module 02_durable_agent_workflows). The honest admission: if the platform above the contract layer is non-durable, the contract layer cannot fully compensate.
4 — The error contract is only as good as the model's reading of it¶
The structured error has a model_action field. There is no enforcement that the model follows it. Modern models follow well-written model_action fields reasonably well, but:
- The model can be confused by long error responses with multiple branches
- A poorly-prompted agent ignores the structure and parrots
human_hintto the user even when not asked to - An adversarial input can cause the model to act on a misleading combination of error fields
- New model versions can change how errors are interpreted
The contract has done its job by producing a clear, machine-readable error. Whether the model acts on it correctly is a prompt-engineering and eval concern (module 13, module 04_ai_product_evals). A lead should acknowledge: we built the contract to make correct behaviour possible; verifying it actually happens is the eval layer's responsibility.
5 — Scope minting at scale is an unsolved cost problem¶
Per-call short-lived credentials are the right pattern (chapter 06). At scale — thousands of calls per minute across hundreds of tools and tenants — the credential issuer becomes a hot dependency. Sub-millisecond minting, key rotation, distributed validation, and audit emission all have to scale together. Most token-issuance services in production today are sized for human-frequency operations (a developer logs in, gets a token, uses it for hours) rather than agent-frequency operations (a request comes in, the contract layer mints a token, uses it for milliseconds, discards it, and another follows in five seconds).
The honest admission: short-lived credentials are the security right answer; making them work without becoming a platform bottleneck is real platform engineering work, and many platforms compromise here in ways they should be explicit about.
6 — Contracts are mostly silent about cost¶
Module 01 chapter 08 is about cost as an architectural concern. This module includes a cost field in audit and an estimated_usd on the per-call record, but it does not establish a discipline of cost contracts. Questions like "what is the maximum cost of one call to this tool?" "what is the cost growth rate of this tool month-over-month?" "which agent is consuming this tool's budget?" are answered, if at all, through dashboards layered on top of the audit log — not through the contract itself.
This is a real gap. As agent platforms scale, cost contracts will become as important as security contracts. The discipline is younger; the community does not yet have well-shared patterns for cost-as-contract.
7 — Dual-run windows assume a stable consumer set¶
Chapter 08's dual-run pattern works when consumers are identifiable (the agent platform team, internal services, partner teams). It breaks down when consumers are anonymous, transient, or numerous — for example, when a tool contract is exposed through an MCP server consumed by other agents the producer team has no contact with.
The fix is registry-side: every consumer registers, the producer tracks them. But "every consumer registers" is itself a discipline that not every platform enforces. Open MCP-style ecosystems are particularly prone to this: anyone can build a client against your contract; you have no idea who they are.
A lead should be transparent: dual-run only works when we know who the consumers are. For open ecosystems, contract changes are riskier and breaking changes need longer windows and louder comms.
8 — Audit completeness is a constant fight¶
Chapter 11 establishes the per-call audit record. Real platforms always have gaps:
- A code path was added without going through the contract layer (shortcut for an "internal" call)
- A new field was added without being included in the audit emission
- A redaction was missed for a sensitive field
- A tracing context was dropped at a layer boundary
- Audit records get lost during ingestion peaks
- A retention bug deletes records before their window expires
A "complete" audit log is an asymptote, not a state. Continuous audit-of-the-audit (records-missing-fields metric, redaction sweep queries, retention verification) is the discipline that keeps the asymptote close. A lead should plan for this as ongoing work, not a one-time setup.
9 — The contract layer is itself code, and itself has bugs¶
Every chapter in this module describes machinery: validators, dedup stores, translators, scope resolvers, audit pipelines, monitors. Each is a software system with its own bug surface. A bug in the dedup store produces duplicates. A bug in the translator misclassifies errors. A bug in the scope resolver issues over-broad credentials. A bug in the audit emitter loses records.
The contract layer is more important than the tools it wraps, because every tool's correctness now depends on the contract layer's correctness. Treating the contract layer as a tier-zero system — with its own tests, SLOs, on-call, postmortem culture, and rollback discipline — is a load-bearing prerequisite the module does not always make explicit.
10 — The discipline is young; the community is converging slowly¶
Compared to traditional API design (decades of OpenAPI, REST, gRPC, OAuth conventions), the discipline of tool integration contracts for agents is two to three years old at the time of writing. Standards like MCP exist; conventions for idempotency keys, scope grammars, and error shapes are emerging but not standardised; libraries and frameworks vary widely; some platforms still treat tools as Python wrappers and have not yet adopted contract registries.
This means: practices in this module that read as obvious in five years are, today, the leading edge. They will evolve. A lead should hold them with confidence — the underlying engineering reasoning is sound — and with humility — the specific shapes and conventions will change.
The most honest closing posture: the discipline is real; it is in motion; what we have built is a defensible position today and a checkpoint for tomorrow's improvements.
What this module does not teach¶
Listed once, explicitly:
- How to design prompts that lead the model to call tools correctly (module 13)
- How to evaluate whether the agent is using tools well in aggregate (module
04_ai_product_evals) - How to safely run a multi-step workflow with compensation and recovery (module 02)
- How to design the gateway that fronts model providers (module
02_ai_infrastructure/01_model_gateway_provider_ops) - How to operate on-call for AI incidents (module
02_ai_infrastructure/06_ai_runbooks_oncall) - How to red-team agent tool use against prompt injection (module
03_ai_security_safety/01_prompt_injection_security) - How to architect the agent itself (module 01)
This module assumes those neighbours exist or are being built. Without them, contracts on their own are necessary but not sufficient.
How to use this module after reading it¶
The realistic learning path:
- Audit one existing tool against the chapter-12 checklist. Identify reds and yellows. Pick the three highest-leverage gaps to close in the next sprint.
- Stand up the contract registry. Even as a git directory. Migrate existing tool definitions into it. Add the meta-schema check (chapter 10) in CI.
- Add the
UPSTREAM_UNCLASSIFIEDmonitor to every tool. It is the highest-value drift signal for the lowest cost. - Implement audit emission with the per-call record from chapter 11. Make redaction at write the only mode.
- For the next new tool, follow the checklist top to bottom. A clean new tool is a reference point for how older tools should evolve.
- Then come back to this chapter every six months and re-read the gaps. Some will have moved. New ones will have appeared.
Closing¶
Tool integration contracts are the load-bearing engineering boundary between a non-deterministic model and the production systems it acts on. The discipline taught here gives the boundary a name, a shape, a set of fields, and a rhythm of testing, monitoring, and revision.
It is not the only discipline in the agent stack. It is not bulletproof. It is what stands between "the model made a call and something happened" and "the model made a call, the contract enforced what we agreed it would, and we know exactly what happened, on whose authority, at what cost, with what evidence."
That difference is what production-grade agent platforms are built on.
Bridge. This module's discipline ends at the contract boundary. Beyond it, the next layer is the operational practice of the engineering team itself — judgment, communication, decision quality, ownership rhythms. The next module is
20_engineering_leadership_judgment, which is where the contracts you have learned to design become the conversations you have to lead. → ../20_engineering_leadership_judgment/00-eli5.md