09. Integration drift detection

Versioning catches changes the producer team makes intentionally. Drift is what happens when the producer team makes a change they didn't realise broke you — or didn't tell you about. This chapter builds the monitors and contract pacts that catch silent breakage in minutes instead of weeks.

A platform engineer at a Chennai e-commerce company gets paged because the agent's create_order tool has started returning errors at a 7% rate, where it had been steady at 0.3% for months. The error code is UPSTREAM_UNCLASSIFIED. Digging into the audit log, the engineer finds that the downstream order service has started returning a new error: LISTING_NOT_AVAILABLE. This is a new error code the order service shipped on Tuesday, three days ago, because they restructured how out-of-stock listings are handled. Nobody told the agent platform team. The translator (chapter 05) does not map this code, so it falls into the catch-all and the model surfaces a confusing message to customers. The fix is small — extend the translator — but the delay is what hurts: three days of confused customers because the team did not have a monitor watching for upstream drift, and the only signal was a customer-support ticket spike that took two days to trace back.

The downstream system will change without telling you. This is the steady-state assumption — not an accusation, just a fact of operating across team boundaries. Drift detection is the engineering response: make the silent change visible at the contract layer, fast.

What drift looks like

Drift is any change in the downstream system's behaviour that the contract author did not anticipate or coordinate. Six common shapes:

Drift shape	What changed	How it leaks
New error code	Downstream returns a code your translator does not map	`UPSTREAM_UNCLASSIFIED` rate rises
New field in response	Downstream adds a field your schema does not declare	Postcondition shape validation fails (if strict) or field silently ignored (if lax)
Removed field	Downstream stops sending a field you depended on	Postcondition invariant violation, or null reference downstream
Renamed field	Downstream renames a field; old name is gone	Same as removed
Semantic shift	Field name and type are identical but meaning changed (e.g., `amount` was paise, now rupees)	Silent — only visible through downstream metrics or customer impact
Behaviour change	Same shape, different outcomes (e.g., what used to be a sync call is now async with a status of `pending`)	Postcondition violation; or model receives a status it doesn't know how to handle

The first four are detectable mechanically. The last two require either contract pacts that exercise the semantics or postcondition checks that catch impossible-looking outcomes.
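The four mechanically detectable shapes reduce to set comparisons against what the contract declares. A minimal sketch, assuming the contract can expose its declared response fields and its mapped error codes as plain sets; the function name `detect_shape_drift` and the `error_code` key are illustrative assumptions, not part of any real contract layer:

```python
from __future__ import annotations

from typing import Any


def detect_shape_drift(
    declared_fields: set[str],
    mapped_error_codes: set[str],
    response: dict[str, Any],
) -> list[str]:
    """Return drift findings for one raw downstream response."""
    findings: list[str] = []
    code = response.get("error_code")  # hypothetical key for the upstream code
    if code is not None:
        # New error code: downstream returned a code the translator does not map.
        if code not in mapped_error_codes:
            findings.append(f"new error code: {code}")
        return findings  # error responses carry no success fields to compare
    present = set(response)
    # New field: sent by downstream but not declared in the schema.
    findings += [f"new field: {n}" for n in sorted(present - declared_fields)]
    # Removed or renamed field: declared in the schema but no longer sent.
    findings += [f"missing field: {n}" for n in sorted(declared_fields - present)]
    return findings
```

A rename shows up as one new field plus one missing field, which is why the table treats it the same as a removal. Semantic shifts and behaviour changes pass this check untouched.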

The drift signal stack

Multiple signals catch drift at different points. Each has a different sensitivity and latency.

+----------------------------------------------------------+
|                                                          |
|   1. Contract pacts (in CI, against staging downstream)  |   --- fastest signal
|      catch: shape, semantic, behaviour                   |       (caught pre-prod)
|                                                          |
|   2. Postcondition violations (per-call, in production)  |   --- fast signal
|      catch: shape, invariant, response surprises         |       (caught on first prod call)
|                                                          |
|   3. UPSTREAM_UNCLASSIFIED rate (aggregated metric)      |   --- medium signal
|      catch: new error codes, unmapped responses          |       (caught within minutes)
|                                                          |
|   4. 4xx / 5xx rate per tool (per-tool error rate)       |   --- medium signal
|      catch: any sudden upstream-side rejection           |       (caught within minutes)
|                                                          |
|   5. Schema diff against documented downstream API       |   --- slow signal
|      catch: published downstream changes you missed      |       (caught daily/weekly)
|                                                          |
|   6. Customer impact reports                             |   --- slowest signal
|      catch: semantic drift everything else missed        |       (caught in days/weeks)
|                                                          |
+----------------------------------------------------------+

A serious platform runs all six. The opening incident relied on signal 6 alone (the customer ticket spike); signals 2–4, and signal 3 in particular, would have caught the same drift in minutes.

Signal 1 — Contract pacts

A contract pact is a test the platform team writes that exercises the downstream system from the agent platform's perspective. It runs against a real (staging) instance of the downstream, sends realistic calls, and asserts that the responses match the contract.

Pacts cover what unit tests cannot: the joint behaviour of contract and downstream.

A pact looks like:

def test_pact_issue_refund_happy_path():
    # Arrange: ensure a known-state payment exists in staging
    payment = staging.create_payment(amount_minor=100000, currency="INR")

    # Act: call the contract under test
    result = contract_layer.call(
        tool="issue_refund",
        version_built_against="2.1.0",
        arguments={
            "payment_id": payment.id,
            "amount_minor": 50000,
            "currency": "INR",
            "reason_code": "customer_request",
            "idempotency_key": new_idempotency_key(),
        },
        tenant_id=TEST_TENANT,
    )

    # Assert: full shape + semantics
    assert result.ok
    assert result.value["status"] in {"pending", "succeeded"}
    assert result.value["amount_minor"] == 50000
    assert result.value["currency"] == "INR"
    assert "refund_id" in result.value
    # No surprise fields:
    expected_keys = {"refund_id", "status", "amount_minor", "currency", "created_at"}
    assert set(result.value.keys()) <= expected_keys

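def test_pact_issue_refund_amount_exceeds():
    # Refunding more than the payment amount must be rejected.
    payment = staging.create_payment(amount_minor=100000, currency="INR")
    result = contract_layer.call(
        tool="issue_refund",
        # `...` stands for the remaining required arguments, elided here.
        arguments={"payment_id": payment.id, "amount_minor": 200000, ...},
    )
    assert not result.ok
    assert result.error.code == "AMOUNT_EXCEEDS_REFUNDABLE"
    # Retrying the same over-sized refund cannot succeed.
    assert result.error.retriable is False

def test_pact_issue_refund_unknown_payment():
    # A payment ID that does not exist in staging must map to a specific code,
    # not fall through to the catch-all.
    result = contract_layer.call(
        tool="issue_refund",
        # `...` stands for the remaining required arguments, elided here.
        arguments={"payment_id": "pay_doesnotexist", "amount_minor": 1000, ...},
    )
    assert not result.ok
    assert result.error.code == "PAYMENT_NOT_FOUND"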

Three tests per tool is thin coverage. A serious pact suite has ten to thirty per tool: one per major precondition, one per error code, the happy path, and an unknown-error fallthrough.

The pacts run:

On every PR that changes the contract.
On a schedule (for example, every six hours or once a day) against staging. This catches downstream changes the platform team did not initiate.
Optionally, in production against a synthetic tenant — for the most critical tools.

A failing pact is a contract-layer alarm. It is more valuable than any unit test, because it is the joint statement of "the contract works against the real downstream."

Signal 2 — Postcondition violations

Chapter 07 introduced postconditions. Each postcondition violation is a drift signal.

The infrastructure:

Postconditions emit a structured event when they fail: tool, kind (shape/invariant/state), detail, tenant.
A monitor counts the event rate per tool. Any non-zero rate is interesting; a sustained rate is an alert.
An on-call dashboard shows the top tools by postcondition violation rate.

A new shape violation on a tool that previously had none is almost always a downstream schema change. The on-call investigates, the translator or schema is updated, the violation rate drops back to zero.
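The event-and-counter infrastructure above can be sketched in a few lines. This is an in-memory illustration only: the class names `PostconditionViolation` and `ViolationMonitor` are assumptions, and a real deployment would emit to whatever metrics pipeline the platform already runs.

```python
from __future__ import annotations

from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class PostconditionViolation:
    """Structured event emitted when a postcondition fails."""
    tool: str
    kind: str    # "shape", "invariant", or "state"
    detail: str
    tenant: str


class ViolationMonitor:
    """Counts violation events per (tool, kind)."""

    def __init__(self) -> None:
        self.counts: Counter = Counter()

    def record(self, event: PostconditionViolation) -> None:
        self.counts[(event.tool, event.kind)] += 1

    def top_tools(self, n: int = 5) -> list[tuple[str, int]]:
        """Tools ranked by total violations, as an on-call dashboard would show."""
        per_tool: Counter = Counter()
        for (tool, _kind), count in self.counts.items():
            per_tool[tool] += count
        return per_tool.most_common(n)
```

Keeping `kind` in the counter key lets on-call separate shape drift from invariant drift without going back to the raw events.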

Signal 3 — UPSTREAM_UNCLASSIFIED rate

Chapter 05 introduced the catch-all error code. Its rate is the leading indicator of new downstream error codes.

The monitor:

Per tool, count UPSTREAM_UNCLASSIFIED responses per minute.
Baseline: near zero (a few per day at most).
Alert: sustained non-zero rate, or a sudden spike.

When the alert fires, on-call inspects the audit logs to find the underlying upstream code and message. The translator is extended. The rate drops back.

This is the single highest-value drift signal because it is the cheapest to wire and catches the most common drift shape (new error code).
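A sketch of the monitor, assuming one-minute buckets kept in memory. The class name and both thresholds (`sustained_minutes`, `spike_threshold`) are illustrative assumptions to be tuned per tool, not recommended values:

```python
from __future__ import annotations

from collections import defaultdict


class UnclassifiedRateMonitor:
    """Per-tool, per-minute count of UPSTREAM_UNCLASSIFIED responses."""

    def __init__(self, sustained_minutes: int = 5, spike_threshold: int = 20) -> None:
        self.sustained_minutes = sustained_minutes
        self.spike_threshold = spike_threshold
        # tool -> {minute bucket -> count}
        self.buckets: dict[str, dict[int, int]] = defaultdict(lambda: defaultdict(int))

    def record(self, tool: str, error_code: str, timestamp: float) -> None:
        """Count the response only if it fell into the catch-all."""
        if error_code == "UPSTREAM_UNCLASSIFIED":
            self.buckets[tool][int(timestamp // 60)] += 1

    def should_alert(self, tool: str, now: float) -> bool:
        """Alert on a spike in the current minute or a sustained non-zero rate."""
        current = int(now // 60)
        counts = self.buckets[tool]
        window = [counts.get(current - i, 0) for i in range(self.sustained_minutes)]
        spike = window[0] >= self.spike_threshold
        sustained = all(c > 0 for c in window)
        return spike or sustained
```

A single stray unclassified response does not page anyone; five consecutive minutes with at least one, or twenty in a single minute, does.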

Signal 4 — 4xx / 5xx rate per tool

The fallback when nothing else catches it. Per-tool error rate, broken down by the contract's error enum.

Baseline error rate per tool (each tool has its own).
Alert on deviation: sustained rate above baseline, sudden change in distribution across error codes.

A tool whose error rate was 0.3% for months that suddenly hits 7% is a drift event regardless of what the codes say. The on-call may find it is the same underlying issue as a UPSTREAM_UNCLASSIFIED spike, or it may be a different shape (e.g., the downstream tightened a validation rule and now rejects more inputs).
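Both alert conditions can be expressed as small pure functions. This is a sketch under stated assumptions: the multiplier `factor`, the `min_calls` floor, and the choice of total variation distance for the distribution check are illustrative, not prescribed by the chapter.

```python
from __future__ import annotations

from collections import Counter


def error_rate_deviates(
    baseline_rate: float,
    errors: int,
    calls: int,
    factor: float = 3.0,
    min_calls: int = 200,
) -> bool:
    """True when the observed error rate exceeds `factor` times the baseline."""
    if calls < min_calls:
        return False  # too few calls in the window to judge
    return errors / calls > baseline_rate * factor


def distribution_shift(baseline: Counter, current: Counter) -> float:
    """Total variation distance between two error-code distributions (0 to 1)."""
    b_total = sum(baseline.values())
    c_total = sum(current.values())
    if b_total == 0 or c_total == 0:
        return 0.0
    codes = set(baseline) | set(current)
    return 0.5 * sum(abs(baseline[c] / b_total - current[c] / c_total) for c in codes)
```

With the numbers from the opening incident, `error_rate_deviates(0.003, 70, 1000)` is true: 7% is far above three times the 0.3% baseline. The distribution check covers the case where the total rate holds steady but the mix of codes changes.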

Signal 5 — Schema diff against documented downstream API

If the downstream system publishes an OpenAPI or similar spec, a daily job can diff the latest spec against the one your contract was built against. New fields, removed fields, type changes, new error codes — all show up in the diff.

A diff is not necessarily a problem; the downstream team may have added something you do not consume. But it is a signal that the downstream changed, and the contract author should review whether the change is consequential.

This signal catches published drift. It misses drift that the downstream team did not publish (which is common; published specs lag implementation).
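The daily diff can be as simple as comparing two field-to-type mappings. The sketch below assumes the spec has already been flattened into a `{field_name: type_name}` dict per response; parsing a real OpenAPI document into that form is out of scope here, and `diff_schema` is a hypothetical name:

```python
from __future__ import annotations


def diff_schema(
    built_against: dict[str, str],
    latest: dict[str, str],
) -> dict[str, list[str]]:
    """Compare the spec the contract was built against with the latest one."""
    old_fields, new_fields = set(built_against), set(latest)
    return {
        "added": sorted(new_fields - old_fields),
        "removed": sorted(old_fields - new_fields),
        "type_changed": sorted(
            name
            for name in old_fields & new_fields
            if built_against[name] != latest[name]
        ),
    }
```

An empty result in all three lists means the published spec has not moved. Anything else goes to the contract author for review, since an added field may be irrelevant to what the contract consumes.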

Signal 6 — Customer impact reports

The signal of last resort. A support ticket says "the agent told me the wrong refund amount" or "the schedule appointment confirmation never came through." Triage traces it back to a tool call; the tool call shows behaviour that does not match the contract.

By the time this signal fires, the drift has been in production for days. The goal of signals 1–5 is to make this signal redundant.

If signal 6 is firing regularly on a tool, that tool's contract pacts and postconditions are inadequate.

Reading the audit log for drift

When an alarm fires, the on-call's first move is the audit log. The audit log (chapter 11) captures the raw downstream request and response. A few queries on it isolate drift:

"Show me the raw response payloads for the last 50 calls to issue_refund that returned UPSTREAM_UNCLASSIFIED." — surfaces the new error code immediately.
"Compare the response field set for issue_refund calls in the last 24h against the schema." — reveals new fields the downstream is sending.
"Group postcondition violations on disburse_loan by kind." — sorts out shape drift from invariant drift.

This is why the audit log captures the raw upstream payload, not just the contract's structured response. The contract's response is the model-facing view; the raw payload is the drift-investigation view.
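The first two queries might look like the following if audit records were available as Python dicts. The record layout (`tool`, `contract_error_code`, `raw_response`) and the upstream `code` key are assumptions made for illustration; the actual audit schema is defined in chapter 11.

```python
from __future__ import annotations

from collections import Counter
from typing import Any


def unclassified_upstream_codes(
    records: list[dict[str, Any]], tool: str, limit: int = 50
) -> Counter:
    """Tally raw upstream codes behind the latest UPSTREAM_UNCLASSIFIED calls."""
    matches = [
        r for r in records
        if r["tool"] == tool and r["contract_error_code"] == "UPSTREAM_UNCLASSIFIED"
    ]
    recent = matches[-limit:]  # assumes records are ordered oldest first
    return Counter(r["raw_response"].get("code", "<no code>") for r in recent)


def undeclared_fields(
    records: list[dict[str, Any]], tool: str, declared: set[str]
) -> set[str]:
    """Fields seen in successful raw responses that the schema does not declare."""
    seen: set[str] = set()
    for r in records:
        if r["tool"] == tool and r["contract_error_code"] is None:
            seen |= set(r["raw_response"])
    return seen - declared
```

Neither function works if only the contract's structured response is stored, because the catch-all has already erased the upstream code by that point.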

Drift response: from signal to fix

The standard response flow:

1. Alarm fires (one of signals 1–5). On-call is paged.
2. Triage: which tool, which signal, what changed. Use audit logs for raw payloads.
3. Classify: is this a new error code (extend translator), a new field (extend schema), a removed field (refuse or accommodate), a semantic shift (escalate), a behaviour change (escalate)?
4. Short-term mitigation:
   - For new error codes: extend the translator within the hour; ship a hotfix.
   - For added optional fields: usually no-op (the contract ignores extra fields unless strict).
   - For removed fields the contract depends on: rollback if possible; otherwise switch the affected calls to a degraded path.
   - For semantic shifts: stop the affected calls; alert the downstream owner; coordinate.
5. Coordinate: contact the downstream owner. Get on a call. Confirm the change. Negotiate either a rollback on their side or a versioned migration on yours.
6. Long-term fix: extend contract pacts to cover the case that drifted; add a postcondition that would have caught it; document the lesson in the tool's changelog.
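The classify and short-term mitigation steps can be kept as data so the runbook and any tooling stay in sync. A minimal sketch; the enum and mapping names are hypothetical, and the action strings paraphrase the flow above:

```python
from enum import Enum


class DriftKind(Enum):
    NEW_ERROR_CODE = "new_error_code"
    ADDED_FIELD = "added_field"
    REMOVED_FIELD = "removed_field"
    SEMANTIC_SHIFT = "semantic_shift"
    BEHAVIOUR_CHANGE = "behaviour_change"


# Short-term action per drift kind, taken from the response flow.
SHORT_TERM_ACTION = {
    DriftKind.NEW_ERROR_CODE: "extend the translator and ship a hotfix",
    DriftKind.ADDED_FIELD: "usually no-op unless the schema is strict",
    DriftKind.REMOVED_FIELD: "roll back if possible, else use a degraded path",
    DriftKind.SEMANTIC_SHIFT: "stop affected calls and alert the downstream owner",
    DriftKind.BEHAVIOUR_CHANGE: "escalate to the downstream owner",
}


def short_term_action(kind: DriftKind) -> str:
    """Look up the first mitigation for a classified drift event."""
    return SHORT_TERM_ACTION[kind]
```

Whatever the kind, coordination with the downstream owner and the long-term fix still follow; the mapping covers only the first hour.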

Step 5 is the part most teams skip. The downstream owner needs to hear that their change broke a consumer; without that signal, the next drift incident is just as silent.

How drift detection interacts with the other surfaces

Error contract (chapter 05). The UPSTREAM_UNCLASSIFIED catch-all is the load-bearing primitive for drift detection. Skipping it makes drift invisible.
Validation (chapter 07). Postconditions are drift detectors that fire on a per-call basis, not just an aggregated metric.
Versioning (chapter 08). Drift is the unintentional twin of versioning. When you find drift, the response often includes asking the downstream team to ship the change as a proper major-bump with a dual-run window.
Audit (chapter 11). Audit logs are where on-call investigates drift. Without raw payload capture, drift detection lacks evidence.

How to recognise broken drift detection

A tool has no contract pacts in CI
A tool has no UPSTREAM_UNCLASSIFIED monitor
Postcondition violation events are not collected or counted
The on-call dashboard does not surface drift signals
"We found out about the breakage from a customer" appears in postmortems
The audit log does not capture raw upstream payloads, so drift investigation is guesswork

Interview Q&A

Q1. What is the cheapest, highest-value drift monitor to wire first?

The UPSTREAM_UNCLASSIFIED rate per tool. Cheap because the catch-all error code already exists if you followed chapter 05; you just count it. High-value because new error codes from downstream are the most common drift shape; they are how the chapter's opening incident leaked. Alert on any sustained non-zero rate. Within minutes of a downstream code change, the on-call has the signal.

Wrong-answer notes: "contract pacts" is the most thorough monitor but not the cheapest; the question is about leverage. "Customer reports" is signal 6, the failure case.

Q2. The downstream team adds a new field to a response. Your contract uses additionalProperties: false on the response schema. What happens, and what should you do?

The postcondition shape validator fails on every call. The contract returns UPSTREAM_UNCLASSIFIED or a more specific drift error. The on-call is paged. Triage shows the new field is harmless (informational). The short-term fix is to relax the schema (allow the extra field or list it as optional). The longer fix is to talk to the downstream team: the change should have been versioned, even if it was additive. The lesson is that strict additionalProperties: false on responses gives you a strong drift signal but requires fast operator response; some teams choose additionalProperties: true on responses (with field-name diff monitors) to trade detection sensitivity for fewer false alarms. Either is defensible; the choice is explicit.

Wrong-answer notes: "just allow extra fields everywhere" loses the detection signal entirely.

Q3. The downstream system has shipped a semantic shift: a field called amount was previously in paise and is now in rupees. The shape is identical. What catches this?

Most likely signal 2 (postconditions on invariants) or signal 6 (customer impact), depending on how invariant-checked the response is. A well-designed contract has a postcondition like "disbursed amount equals requested amount"; that catches the shift on the first call. Without that postcondition, the agent disburses 100x the intended amount, the customer notices, and signal 6 fires days later. The lesson is that invariant postconditions are not optional on tools that touch money; the cost of an invariant postcondition is one extra comparison per call, and the value is catching exactly this kind of semantic drift.

Wrong-answer notes: "the schema would catch it" is wrong because the type and shape are identical; the meaning changed.
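The invariant postcondition from Q3 is a single comparison. A sketch, assuming the request carries `amount_minor` in paise and the response echoes the downstream's view of the amount under the same key; the function name and field names are illustrative:

```python
from __future__ import annotations

from typing import Any, Optional


def check_amount_invariant(
    requested_minor: int, response: dict[str, Any]
) -> Optional[str]:
    """Return a violation message if the response amount differs from the request."""
    actual = response.get("amount_minor")
    if actual != requested_minor:
        return (
            f"invariant violated: requested {requested_minor} minor units, "
            f"downstream reported {actual!r}"
        )
    return None  # invariant holds


# A request for 50000 paise answered with 500 (the same sum expressed in rupees)
# fails the comparison even though the type and shape are unchanged.
violation = check_amount_invariant(50000, {"amount_minor": 500, "currency": "INR"})
```

The check only fires when the response reflects the downstream's own accounting of the amount. If the downstream merely echoes the request value back, a pact that reads the resulting balance from staging is the stronger guard.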

Q4. Your team owns the agent platform. The downstream team owns Salesforce integration. Drift incidents keep happening because Salesforce ships changes weekly. What do you propose at the next leadership meeting?

A contract registry shared between both teams; the Salesforce-integration team ships changes as versioned contract updates with dual-run windows; the agent platform team commits to running contract pacts on a schedule against Salesforce staging and reporting failures back to the downstream team within an SLA. Both teams agree to a notification protocol for breaking changes (calendar invite, not a Slack message). This is an organisational fix, not just a technical one: the technical drift detectors are already in place; the failure is in the coordination.

Wrong-answer notes: "tell them to slow down their releases" is not a fix and produces friction; the fix is the protocol, not the cadence.

What to do differently after reading this

Wire signal 3 (UPSTREAM_UNCLASSIFIED rate per tool) today on every tool. It is the cheapest, highest-value monitor.
For every tool, draft a minimum pact suite: one happy path, one per error code, one unknown-fallthrough. Run them on a schedule against staging.
Add invariant postconditions to every tool that touches money, identity, or anything else where a semantic shift is catastrophic.
Capture raw upstream payloads in the audit log so on-call has evidence during drift investigation.
Establish a notification protocol with each downstream team. The protocol is a contract too.

Bridge. Drift detection catches changes the producer didn't tell you about. Contract testing catches changes you are about to make to your own contract — before they hit production. The next chapter builds the testing discipline: golden inputs, schema fuzz, contract-version compatibility tests, and the CI gates that hold the line. → 10-contract-testing.md