10. Contract testing¶

Drift detection catches changes the downstream made without telling you. Contract testing catches changes you are about to make to your own contract — before they hit production. This chapter builds the testing discipline: golden inputs, schema fuzz, version-compatibility tests, and the CI gates that hold the line.

A platform engineer at a Mumbai SaaS company runs git log on the contract registry and finds that the update_customer tool has changed thirty-one times in eighteen months. Most changes were small — added an enum value, clarified a description, extended an error code. The team's process is to PR the contract, get review, and merge. There is no test that runs against the contract itself. There is no validator that the schema is well-formed. There is no compatibility check that the new version is backwards-compatible with the previous one. The thirty-second change is meant to be a minor bump that adds an optional field. The change is merged. The next deploy goes out. The agent starts producing calls with the new optional field. The downstream Salesforce integration starts rejecting those calls with a 400 because the field name conflicts with a reserved Salesforce attribute. Nobody had tested the contract against the downstream. The change is rolled back two hours later; the postmortem identifies the missing CI gate, and the team spends a week building one.

This chapter teaches the gate. It is a small number of test types, each catching a different category of contract bug, arranged so they run in CI on every contract PR.

The four test types¶

Every contract has four test surfaces. Most platforms eventually have all four.

#	Test type	What it catches	When it runs
1	Schema well-formedness	Syntactic mistakes in the contract itself: missing required slots, invalid types, broken refs	On every contract PR
2	Version compatibility	A "minor" bump that is actually a breaking change	On every contract PR
3	Pact tests	The contract works against the real downstream system	On every contract PR; also on schedule
4	Behavioural tests	Idempotency, error handling, retry safety, scope enforcement work as the contract claims	On every contract PR

Schema fuzz testing is sometimes called out as a fifth type but lives most naturally inside type 4.

Test 1 — Schema well-formedness¶

The cheapest, dumbest, most important test. Does the contract YAML/JSON parse, validate against the contract meta-schema, and contain all the required slots?

The meta-schema covers:

Top-level slots: identity, class, schema, side_effects, errors, operational
Identity fields: name, version, owner, purpose
Class is one of the four enum values
Schema is a valid JSON Schema document with additionalProperties: false on parameters
Every error has code, retriable, human_hint, model_action
Operational has idempotency, scope, rate_limits, observability, sla
Bridge to the next file exists (for the curriculum side — non-curricular contracts don't need this)

# tests/contract_meta_schema.py
import yaml  # PyYAML

def validate_contract(path):
    # Parse the contract; safe_load builds only plain Python types.
    with open(path) as f:
        contract = yaml.safe_load(f)
    # ContractMetaSchema is the project's own validator (its import is not shown here).
    ContractMetaSchema().validate(contract)
    # extra checks beyond schema:
    assert contract["class"] in {"read", "write-idempotent",
                                  "write-non-idempotent", "irreversible"}
    if contract["class"] == "irreversible":
        # Irreversible tools need human gating or a written waiver.
        assert contract.get("human_gating", {}).get("required") is True or \
               contract.get("human_gating_waived_rationale"), \
               "Irreversible tools must justify if human gating is waived."
    for err in contract["errors"]:
        assert "code" in err and "retriable" in err \
            and "human_hint" in err and "model_action" in err
        if err["retriable"]:
            # Retriable errors must say how to retry.
            assert "retry_policy" in err

This test runs in a few milliseconds. It blocks every PR that drifts from the meta-shape. Add it first; add the other tests after.
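
The chapter does not show what sits behind `ContractMetaSchema`. As a rough illustration of the required-slot part of the meta-schema list above, here is a minimal standard-library sketch. The function name `missing_slots` and the assumption that a contract is a plain nested dict are hypothetical, chosen for illustration; a real registry would more likely express this as a JSON Schema document.

```python
# Hypothetical sketch: required slots copied from the meta-schema list above.
REQUIRED_TOP_LEVEL = ("identity", "class", "schema", "side_effects", "errors", "operational")
REQUIRED_IDENTITY = ("name", "version", "owner", "purpose")
REQUIRED_ERROR = ("code", "retriable", "human_hint", "model_action")
REQUIRED_OPERATIONAL = ("idempotency", "scope", "rate_limits", "observability", "sla")


def missing_slots(contract: dict) -> list[str]:
    """Return dotted paths of required slots that are absent from the contract."""
    missing = [slot for slot in REQUIRED_TOP_LEVEL if slot not in contract]
    identity = contract.get("identity", {})
    missing += [f"identity.{f}" for f in REQUIRED_IDENTITY if f not in identity]
    operational = contract.get("operational", {})
    missing += [f"operational.{f}" for f in REQUIRED_OPERATIONAL if f not in operational]
    # Every declared error must carry all four error fields.
    for i, err in enumerate(contract.get("errors", [])):
        missing += [f"errors[{i}].{f}" for f in REQUIRED_ERROR if f not in err]
    return missing
```

Returning the full list of missing paths, rather than failing on the first one, lets a PR author fix everything in one pass; the CI check would then simply assert that the list is empty.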

Test 2 — Version compatibility¶

A check that the new version of the contract is backwards-compatible with the previous version, unless it is a major bump.

Procedure on every PR:

1. Read the contract's previous version (the one currently in production).
2. Read the contract in the PR.
3. Compute the diff:
    Field added (optional or required)
    Field removed
    Field renamed
    Type changed
    Enum values added or removed
    Error code added or removed
    Class changed
4. Classify the diff against semver rules from chapter 08.
5. Compare to the version bump in the PR:
    If the diff is "no change," the bump must be patch or none.
    If the diff is "additive only," the bump must be minor or major.
    If the diff includes any removal, rename, or type change, the bump must be major.
6. Fail the CI check if the version bump does not match the diff.

def test_version_bump_matches_diff(previous: Contract, current: Contract):
    diff = compute_contract_diff(previous, current)
    required_bump = classify_bump(diff)  # "patch" | "minor" | "major"
    # Assumed to return "none" when the two versions are equal.
    actual_bump = compare_versions(previous.version, current.version)
    # Allowed bumps per required level, mirroring the procedure above.
    allowed = {
        "patch": {"none", "patch"},   # no change: patch or none
        "minor": {"minor", "major"},  # additive only: minor or major
        "major": {"major"},           # removal, rename, type change: major
    }
    assert actual_bump in allowed[required_bump], \
        f"Diff requires {required_bump}, PR bumped {actual_bump}"

The most common bug this catches: an "innocent" change that is actually breaking. Removing a field "nobody uses." Renaming a field "for clarity." Tightening a regex. Each of these, once it slips past human review, will produce a postmortem. The version-bump CI gate is uncompromising about it.

It also catches the reverse: a minor or major bump on a change that alters nothing in the contract surface. That keeps consumer migration budget from being spent on what is really a patch.
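
The helpers `classify_bump` and `compare_versions` are named in the test but not defined in this chapter. The sketch below shows one way they might look, under stated assumptions: `ContractDiff` is a hypothetical summary type, versions are plain `MAJOR.MINOR.PATCH` strings without pre-release tags, and every addition is treated as additive (chapter 08's rules may classify a newly required field differently).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ContractDiff:
    """Hypothetical diff summary; a real diff would carry field-level detail."""
    added: frozenset = frozenset()
    removed: frozenset = frozenset()
    renamed: frozenset = frozenset()
    type_changed: frozenset = frozenset()


def classify_bump(diff: ContractDiff) -> str:
    """Map a diff to the smallest bump the rules above permit."""
    if diff.removed or diff.renamed or diff.type_changed:
        return "major"
    if diff.added:
        return "minor"
    return "patch"


def compare_versions(previous: str, current: str) -> str:
    """Name the most significant version component that differs, or "none".

    This sketch does not check that current is newer than previous.
    """
    prev = tuple(int(part) for part in previous.split("."))
    cur = tuple(int(part) for part in current.split("."))
    for name, old, new in zip(("major", "minor", "patch"), prev, cur):
        if old != new:
            return name
    return "none"
```

For the scenario in the opening story, a diff with one added optional field classifies as `"minor"`, so a bump from `1.4.2` to `1.5.0` would pass this gate; the Salesforce conflict is the kind of failure the pact tests in the next section are meant to catch.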

Test 3 — Pact tests¶

Chapter 09 introduced pacts as a drift signal. They are also the strongest pre-merge test: a pact suite that exercises the contract against the real downstream (staging) catches breakages before they hit production.

A minimum pact suite per tool:

Happy path (one per significant variant — successful refund, successful partial refund)
One pact per error code (PAYMENT_NOT_FOUND, AMOUNT_EXCEEDS_REFUNDABLE, etc.)
Idempotency replay — call twice with the same key; assert one side effect
Scope refusal — call with a credential lacking the required scope; assert PermissionError
Tenant isolation — call against a resource of a different tenant; assert refusal
Unknown-error fallthrough — force the downstream to produce an unmapped error (sometimes feasible by triggering an obscure path); assert UPSTREAM_UNCLASSIFIED is returned cleanly

Pacts run in CI on every contract PR. They also run on a schedule (chapter 09 detail) to catch upstream drift.

When the downstream cannot be hit in CI¶

Sometimes the downstream is expensive, requires special credentials, or has rate limits that make running pacts in every PR impractical. Two fallbacks:

Recorded fixtures. A previous run captures the request-response pairs. The pact suite replays the recorded responses. This catches contract-side breakages (schema mismatches, translator drift, error-handling bugs) but does not catch downstream drift. Combine with scheduled live pacts.

High-fidelity fakes. A maintained fake of the downstream system that implements the surface the contract uses. More work to maintain; faster to run; risk of fake drifting from real downstream behaviour.

The decision tree: live pacts when the downstream supports it; recorded fixtures plus scheduled live pacts when it does not; fakes only when both are impractical and the maintenance is worth it.
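
To make the recorded-fixture fallback concrete, here is a minimal sketch of a replaying stand-in for the downstream. The class name `ReplayDownstream`, the `request`/`response` keys, and the idea of keying recordings by canonical JSON are all assumptions for illustration; real pact tooling has its own recording formats.

```python
import json


class ReplayDownstream:
    """Hypothetical stand-in that answers from recorded request-response pairs."""

    def __init__(self, recordings: list[dict]):
        # Key each recording by a canonical JSON form of its request,
        # so dict key order does not affect the lookup.
        self._responses = {
            self._key(rec["request"]): rec["response"] for rec in recordings
        }

    @staticmethod
    def _key(request: dict) -> str:
        return json.dumps(request, sort_keys=True)

    def call(self, request: dict) -> dict:
        key = self._key(request)
        if key not in self._responses:
            # An unrecorded request means the contract side changed what it sends.
            raise LookupError(f"no recording for request: {key}")
        return self._responses[key]
```

Failing loudly on an unrecorded request is the point of the design: a contract change that alters the outgoing request shape surfaces in the PR, while drift in what the real downstream returns is left to the scheduled live pacts.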

Test 4 — Behavioural tests¶

The category that exercises the contract layer's own logic. Not the downstream system, not the schema, not the version — the behaviour the contract promises.

Examples:

def test_idempotency_replay_returns_same_result():
    key = new_idempotency_key()
    args = valid_refund_args()
    r1 = contract_call(tool="issue_refund", idempotency_key=key, args=args)
    r2 = contract_call(tool="issue_refund", idempotency_key=key, args=args)
    assert r1 == r2
    # exactly one downstream call happened:
    assert downstream_mock.call_count == 1

def test_retry_with_new_key_creates_two_effects():
    args = valid_refund_args()
    contract_call(tool="issue_refund", idempotency_key="k1", args=args)
    contract_call(tool="issue_refund", idempotency_key="k2", args=args)
    assert downstream_mock.call_count == 2

def test_scope_denial_returns_permission_error():
    result = contract_call(
        tool="issue_refund",
        agent_identity=AGENT_WITHOUT_REFUND_SCOPE,
        args=valid_refund_args(),
    )
    assert result.error.code == "PERMISSION_OUT_OF_SCOPE"
    assert result.error.retriable is False

def test_tenant_binding_refuses_cross_tenant_call():
    payment = create_payment(tenant="acme")
    result = contract_call(
        tool="issue_refund",
        tenant_id="other-tenant",
        args=valid_refund_args() | {"payment_id": payment.id},
    )
    assert not result.ok
    assert result.error.code in {"NOT_FOUND", "PERMISSION_OUT_OF_SCOPE"}

def test_dry_run_does_not_execute_side_effect():
    args = valid_refund_args() | {"dry_run": True}
    result = contract_call(tool="issue_refund", args=args)
    assert result.ok
    assert result.value["status"] == "would_succeed"
    assert downstream_mock.call_count == 0

def test_unknown_optional_field_is_rejected():
    args = valid_refund_args() | {"surprise_field": "x"}
    result = contract_call(tool="issue_refund", args=args)
    assert not result.ok
    assert result.error.code == "INPUT_VALIDATION_FAILED"

def test_schema_fuzz_does_not_crash_contract():
    # property-based test: random shapes do not produce uncaught exceptions
    for _ in range(1000):
        args = random_jsonish()
        result = contract_call(tool="issue_refund", args=args)
        # either ok or a structured error, never a crash
        assert result.ok or result.error is not None

These tests exercise the contract layer's own promises: idempotency dedup, retry semantics, scope enforcement, tenant binding, validation, error handling. They do not require the real downstream — mocks or fakes are fine.
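
For orientation, the dedup behaviour that the first two tests exercise can be sketched in a few lines. This is an illustrative in-memory version only: the class name `IdempotentCaller` is hypothetical, and a production contract layer would need durable storage, key expiry, concurrency control, and a policy for the same key arriving with different arguments.

```python
class IdempotentCaller:
    """Hypothetical contract-layer wrapper that dedups calls by idempotency key."""

    def __init__(self, downstream):
        self._downstream = downstream  # any callable that takes the args dict
        self._results = {}             # idempotency key -> stored result

    def call(self, idempotency_key: str, args: dict):
        # Replay: return the stored result without touching the downstream.
        if idempotency_key in self._results:
            return self._results[idempotency_key]
        result = self._downstream(args)
        self._results[idempotency_key] = result
        return result
```

Wrapped around a counting fake, a replay with the same key returns the stored result and leaves the call count at one, while a new key produces a second downstream call, which is exactly what the two tests assert.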

The fuzz test at the end is the cheapest defence against the contract layer crashing on unexpected input. Property-based testing libraries (Hypothesis for Python, QuickCheck-derived libraries for other languages) generate random shapes; the contract is expected to either succeed or return a structured error, never crash.
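
The fuzz test calls `random_jsonish()` without defining it. A standard-library sketch of such a generator follows; it differs from the call in the test by taking an explicit `random.Random` instance so that a failing run can be reproduced from its seed, and the depth and size limits are arbitrary choices for illustration.

```python
import random
import string


def random_jsonish(rng: random.Random, depth: int = 0):
    """Return a random JSON-like value: a scalar, a list, or a dict with string keys."""
    scalars = [
        lambda: None,
        lambda: rng.choice([True, False]),
        lambda: rng.randint(-10**6, 10**6),
        lambda: rng.uniform(-1e6, 1e6),
        lambda: "".join(rng.choices(string.printable, k=rng.randint(0, 12))),
    ]
    # Stop nesting at depth 3 so generated values stay small.
    if depth >= 3 or rng.random() < 0.5:
        return rng.choice(scalars)()
    if rng.random() < 0.5:
        return [random_jsonish(rng, depth + 1) for _ in range(rng.randint(0, 4))]
    return {
        "".join(rng.choices(string.ascii_lowercase, k=rng.randint(1, 8))):
            random_jsonish(rng, depth + 1)
        for _ in range(rng.randint(0, 4))
    }
```

A hand-rolled generator like this finds crashes but reports them as-is; a property-based library such as Hypothesis additionally shrinks a failing input to a smaller one, which is usually worth the dependency once the fuzz test starts finding real bugs.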

CI gate composition¶

The PR for a contract change runs the tests in this order, fast first:

Schema well-formedness (milliseconds)
Version compatibility against the previous version (milliseconds)
Behavioural tests against mocked downstream (seconds)
Pact tests against staging downstream (minutes — but parallelisable per tool)

Each gate must pass; otherwise the PR is blocked. The scheduled live pact run (chapter 09) is independent and is the drift detector, not a PR gate.

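A platform that has not yet built pacts can start with Tests 1, 2, and 4 (well-formedness, version compatibility, and behavioural tests), which are the cheapest three and the first three gates in the order above, and add pacts incrementally per tool.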
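
The ordering rule (fast gates first, stop at the first failure) can be sketched as a small runner. In practice this usually lives in the CI system's own stage configuration rather than in Python; the function name `run_gates` and the convention that a gate signals failure by raising `AssertionError` are assumptions for illustration.

```python
from typing import Callable


def run_gates(gates: list[tuple[str, Callable[[], None]]]) -> list[str]:
    """Run gates in the given order and stop at the first failure.

    Each gate is a (name, check) pair where check raises AssertionError on failure.
    Returns the names of the gates that passed before any failure.
    """
    passed = []
    for name, check in gates:
        try:
            check()
        except AssertionError as exc:
            print(f"BLOCKED at {name}: {exc}")
            break
        passed.append(name)
    return passed
```

The PR is mergeable only when the returned list is as long as the list of gates. Stopping early means a malformed contract never spends minutes of staging time on pacts.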

What contract testing does not solve¶

Model behaviour. The contract can be perfect, and the model can still pick the wrong tool, pass the wrong arguments, or surface the wrong message. Contract testing is about the contract layer, not the agent's prompt or tool-choice logic. Agent-level testing lives in module 04_ai_product_evals.
Downstream regressions that stay within the tested behaviour. If the downstream goes from returning success in 200 ms to returning it in 5 s, your contract tests still pass but production users notice. SLAs are a separate signal (chapter 11 audit covers latency capture).
Coordination failures. A test gate cannot replace a notification protocol with the downstream team.

What contract testing does guarantee: a tool contract change that passes the gates and ships to production is well-formed, correctly versioned, compatible with the running downstream, and faithful to the behaviours the contract claims. That is a tighter envelope than what most platforms ship today.

How contract testing interacts with the other surfaces¶

Schema (chapter 02). Tests 1 and 4 enforce schema correctness.
Class (chapter 03). Test 1 enforces class-specific requirements (irreversible must justify human gating, etc.).
Idempotency (chapter 04). Tested directly by the behavioural tests (Test 4).
Error contract (chapter 05). Tested by pacts (per error code) and behavioural tests (translator correctness).
Scope (chapter 06). Tested by behavioural tests and pacts (scope denial, tenant isolation).
Validation (chapter 07). Tested by behavioural tests (preconditions, postconditions, dry-run).
Versioning (chapter 08). Test 2 enforces the bump rules.
Drift (chapter 09). Scheduled pact runs are the drift signal.

How to recognise broken contract testing in the wild¶

A contract repo with no tests/ directory
A contract PR merges with no CI runs against the downstream
Engineers manually test changes against staging before merging
The team has been burned by a "minor" change that turned out to be major and there is no version-bump gate
A "pact" suite exists but only runs locally, never in CI
The fuzz test for contract crashes does not exist

Interview Q&A¶

Q1. What is the single CI gate you would build first if a platform has none? The schema well-formedness check. Cheapest, fastest, blocks the largest class of obvious mistakes. A close second is the version-compatibility check, because most production incidents from contract changes come from the version bump being wrong. Pacts and behavioural tests are higher-value but more expensive to set up; well-formedness and version compatibility are weeks of work to roll out across all tools, not months. Wrong-answer notes: "pacts" is the most-complete answer but not the cheapest-first answer; the question is about ordering.

Q2. Your downstream system is rate-limited and you cannot run pacts on every PR. How do you maintain coverage? A two-layer approach. PRs run recorded-fixture pacts — the contract is exercised against captured request-response pairs. A schedule (every six hours or daily) runs live pacts against staging, with an alert if any pact fails. The PR layer catches contract-side regressions; the scheduled layer catches downstream drift. Optionally, the highest-value tools run live pacts on PR even with the rate-limit cost. The point is not to have one solution but to have coverage at the granularity that matters. Wrong-answer notes: "skip pacts" loses the drift signal; "always run live" runs into the rate limit.

Q3. A teammate pushes a PR that removes an optional field from a tool and bumps the version by a patch ("nobody is using it"). Which test would block this? The version-compatibility test (Test 2). Removing a field, even an optional one, is a major-bump change because some caller may have been sending it. The test computes the diff (field removed), classifies it as major, compares to the PR's actual bump (patch), and fails the gate. The teammate then either restores the field, or commits to the major bump and the dual-run window. The discussion that follows is what the test was designed to provoke. Wrong-answer notes: "schema validation" catches malformed contracts, not version mistakes; "pact tests" check the contract against the downstream, not the bump against the diff, so at best they surface the breakage indirectly.

Q4. The contract layer is supposed to enforce tenant binding. What test would you write to verify this on every PR? A behavioural test that creates a resource in tenant A, then attempts to call the tool with tenant_id=B and the resource's ID. The contract should refuse with PermissionError: out_of_scope or NotFound (either is acceptable; some platforms hide existence to avoid leaking that the resource exists in another tenant). The test runs on every PR. If the call succeeds for tenant A's resource when made with tenant B's identity, tenant binding is broken, and that is the worst class of bug for a multi-tenant system. Wrong-answer notes: "manual review" is what fails when tests don't exist; "the downstream enforces it" leaves the contract layer untested.

What to do differently after reading this¶

Build the meta-schema for contracts. Check every contract against it in CI.
Add the version-compatibility check next. Make CI compute and classify diffs automatically.
For each tool, write a minimum pact suite. Start with happy path and one per error code. Grow over time.
Add behavioural tests for idempotency, scope, tenant binding, and dry-run mode. These are the contract-layer's own promises.
Run a fuzz test that throws random shapes at every contract; the contract must never crash, only return structured errors.
Document the CI gate composition. New tools follow the same gates from day one.

Bridge. Contract testing catches changes before merge. Once a tool is in production, every call needs to be audited so incidents can be reconstructed, drift can be investigated, and the contract can be defended on its own terms. The next chapter builds observability and audit: what to log per call, what to redact, how to replay a call from logs, and how the audit log is the operating substrate for the rest of the discipline. → 11-observability-and-audit.md