11. Observability per provider

Privacy is what the gateway enforces about where calls happen. Observability is what the gateway records about how they happen: per-provider dashboards, per-tenant cost rollups, per-alias latency baselines, drift alarms, and the alarm panel the on-call reads first when something is off.


A platform engineer at a Mumbai content company is paged because an interactive feature is slow. The team's first instinct is to blame the agent code. The on-call instead opens the gateway dashboard. Per-provider latency for Anthropic in ap-south-1 has shifted: p95 went from 700 ms to 1,800 ms in the last forty minutes. The same model in us-east-1 (the fallback region) is at its baseline of 1,100 ms. The dashboard tells the story before any code is read: the primary region is degraded, the fallback chain is doing its job, and users are seeing slightly higher latency but no errors. The on-call posts the diagnosis in the incident channel and starts the playbook (switch primary weight to the alternate region until the primary recovers). Total triage time: about three minutes, from page to root cause.

This chapter is what makes that triage possible. The dashboards, the metrics, the audit log, and the alarm panel are designed so that the gateway's view is the first place to look, not the last.


What the gateway uniquely sees

Other systems see slices. The gateway sees every call, every response, every fallback step, every cache decision, every cost, every credential use, every privacy decision. That is the value of the boundary; the observability layer is what makes the value retrievable.

The gateway owns six categories of signal:

| Category | Examples |
|---|---|
| Per-provider health | latency p50/p95/p99, error rate by code, throttle rate |
| Per-alias quality | eval scores in production traffic, refusal rate, output-length distribution |
| Per-tenant usage | calls/sec, tokens/sec, cost/sec, budget consumption |
| Routing and fallback | fallback step frequency, cache hit rate, refusal rate |
| Quota and credential | bucket fill levels, credential issuance, scope denials |
| Compliance | per-region call counts per tenant, zone violations (should be zero) |

A dashboard organised by these categories is the on-call's first stop.


The per-provider dashboard

For each provider, per region, per concrete model:

+----------------------------------------------------+
|  anthropic / claude-sonnet-4-6 / ap-south-1        |
+----------------------------------------------------+
|  Calls/min (live):         12,478                  |
|  p50 latency:              462 ms                  |
|  p95 latency:              781 ms      [BASELINE]  |
|  p99 latency:             1,124 ms                 |
|  Error rate:               0.23%        [BASELINE] |
|  Top error codes:          UPSTREAM_TIMEOUT (38%)  |
|                            RATE_LIMITED (22%)      |
|                            INVALID_REQUEST (15%)   |
|  Throttle rate (429):      0.04%                   |
|  Quota headroom:           71%                     |
+----------------------------------------------------+

Each pane has a baseline (recent moving average) and an alarm threshold. A latency or error-rate departure from baseline is an automatic alarm.
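
A minimal sketch of how one pane's baseline and departure check could work. The class name `RollingBaseline`, the window size, and the 2× ratio are illustrative assumptions, not the gateway's actual implementation; the readings are made up.

```python
from collections import deque
from typing import Optional


class RollingBaseline:
    """Moving-average baseline over the most recent `window` samples."""

    def __init__(self, window: int) -> None:
        self.samples = deque(maxlen=window)

    def add(self, value: float) -> None:
        self.samples.append(value)

    def baseline(self) -> Optional[float]:
        if not self.samples:
            return None
        return sum(self.samples) / len(self.samples)

    def departs(self, value: float, ratio: float) -> bool:
        """True when `value` is at least `ratio` times the current baseline."""
        base = self.baseline()
        return base is not None and value >= ratio * base


# Hypothetical p95 readings (ms), one per interval, for a single pane.
p95_baseline = RollingBaseline(window=6)
for reading_ms in [770, 790, 781, 775, 785, 785]:
    p95_baseline.add(reading_ms)

print(p95_baseline.baseline())                # 781.0
print(p95_baseline.departs(800, ratio=2.0))   # False
print(p95_baseline.departs(1800, ratio=2.0))  # True
```

One design choice worth making explicit: readings that trip the check should probably be kept out of the window, otherwise a long incident slowly becomes the new baseline.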

The dashboard is per-provider and per-region, not aggregated globally, because alarms must fire along the same dimensions where failures occur. A global aggregate hides region-down events.
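
The effect of aggregation can be shown with a small sketch. The latency samples are hypothetical, and the nearest-rank percentile is one common definition among several; the point is only that a degraded region with a small share of traffic disappears from a global p95.

```python
import math
from collections import defaultdict


def p95(values):
    """Nearest-rank 95th percentile of a non-empty list."""
    ordered = sorted(values)
    rank = math.ceil(0.95 * len(ordered))
    return ordered[rank - 1]


# Hypothetical latency samples (ms), keyed by (provider, model, region).
healthy = ("anthropic", "claude-sonnet-4-6", "us-east-1")
degraded = ("anthropic", "claude-sonnet-4-6", "ap-south-1")
samples = [(healthy, 450)] * 97 + [(degraded, 1800)] * 3

by_key = defaultdict(list)
for key, latency_ms in samples:
    by_key[key].append(latency_ms)

global_p95 = p95([latency_ms for _, latency_ms in samples])
per_key_p95 = {key: p95(values) for key, values in by_key.items()}

print(global_p95)             # 450: the aggregate looks healthy
print(per_key_p95[degraded])  # 1800: the per-region pane shows the problem
```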


The per-alias dashboard

For each alias, across all candidates:

+----------------------------------------------------+
|  smart-reasoner (all candidates)                   |
+----------------------------------------------------+
|  Calls/min (live):         8,912                   |
|  Primary candidate share:  91%                     |
|  Fallback step 1 share:    7%                      |
|  Fallback step 2 share:    1.5%                    |
|  Cache hit rate:           34%   (exact-match)     |
|  Refusal rate:             0.02%                   |
|  Cost / 1k calls:          $4.21                   |
|  Eval score (rolling 1h):  0.84  [baseline 0.86]   |
+----------------------------------------------------+

This view is for the team that owns the alias's product. A drift in the eval score (production-traffic eval, not just synthetic) is a leading indicator of model behaviour shift. A drift in cost per 1k calls indicates either input distribution change or a price-book change.
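
As a rough illustration, the candidate-share and cost panes can be derived from audit records like this. The field names `fallback_step` and `cost_usd` follow the audit fields listed later in this chapter; treating step 0 as the primary candidate, and the record values themselves, are assumptions.

```python
from collections import Counter


def alias_summary(records):
    """Return (share of calls per fallback step, cost in USD per 1k calls)."""
    if not records:
        raise ValueError("no audit records for this alias")
    total = len(records)
    steps = Counter(record["fallback_step"] for record in records)
    share = {step: count / total for step, count in steps.items()}
    cost_per_1k = 1000 * sum(record["cost_usd"] for record in records) / total
    return share, cost_per_1k


# Hypothetical records: 18 calls served by the primary, 2 by fallback step 1.
records = (
    [{"fallback_step": 0, "cost_usd": 0.004}] * 18
    + [{"fallback_step": 1, "cost_usd": 0.006}] * 2
)

share, cost_per_1k = alias_summary(records)
print(share)
print(round(cost_per_1k, 2))
```

Because fallback candidates can be priced differently from the primary, a shift in fallback share will also move cost per 1k calls, so the two panes are best read together.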


The per-tenant dashboard

For each tenant, near-real-time:

+----------------------------------------------------+
|  acme-corp (Tenant)                                |
+----------------------------------------------------+
|  Calls (last 24h):         482,113                 |
|  Spend (last 24h):         $1,847                  |
|  Budget consumed:          61% of monthly          |
|  Top features:             chat 62%, summary 28%   |
|  Privacy zone:             in-region-only          |
|  Zone violations:          0                       |
|  Quota saturation events:  3 (last 24h)            |
+----------------------------------------------------+

Tenant dashboards serve product owners, finance, and support. The "zone violations" pane is the compliance integrity check — it should always read zero.
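
A sketch of the integrity check behind the "zone violations" pane, together with the budget figure. It assumes each audit record carries the region that served the call and that the tenant's policy resolves to a set of allowed regions; the field names and values are hypothetical.

```python
def tenant_summary(records, allowed_regions, monthly_budget_usd):
    """Summarise one tenant's month-to-date audit records."""
    spend = sum(record["cost_usd"] for record in records)
    violations = [r for r in records if r["region"] not in allowed_regions]
    return {
        "calls": len(records),
        "spend_usd": spend,
        "budget_consumed": spend / monthly_budget_usd,
        "zone_violations": len(violations),
    }


# Hypothetical tenant pinned to ap-south-1 ("in-region-only").
records = [
    {"region": "ap-south-1", "cost_usd": 0.5},
    {"region": "ap-south-1", "cost_usd": 0.5},
    {"region": "ap-south-1", "cost_usd": 0.5},
    {"region": "us-east-1", "cost_usd": 0.5},  # should never happen
]

summary = tenant_summary(records, {"ap-south-1"}, monthly_budget_usd=10.0)
print(summary)
```

Counting violations from the audit log after the fact is a second line of defence; the routing layer is still expected to prevent them in the first place.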


The drift panel

A focused dashboard for slow-shifting signals that catch silent provider changes (chapter 09):

+----------------------------------------------------+
|  Drift signals — last 7 days                       |
+----------------------------------------------------+
|  Eval score (smart-reasoner):    0.86 -> 0.83 (-3) |
|  Output tokens / call (chat):    312 -> 358 (+15%) |
|  Refusal rate (summariser):      0.01% -> 0.4% (+) |
|  UPSTREAM_UNCLASSIFIED rate:      stable           |
|  Postcondition violation rate:    stable           |
|  Latency p95 (sonnet ap-south):   712 -> 740 (+4%) |
+----------------------------------------------------+

Each row is a comparison: now vs the prior 7-day average. Rows above tolerance are flagged. The on-call investigates the top flagged row first.
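
The row comparison can be sketched as follows. The baseline and current values are taken from the panel above; the per-signal tolerances and the choice to rank flagged rows by relative change are assumptions.

```python
def drift_rows(baseline, current, tolerance):
    """Compare current values with the prior 7-day average, flagged rows first."""
    rows = []
    for name, base in baseline.items():
        now = current[name]
        if base == 0:
            # A zero baseline has no relative change; any rise counts as infinite.
            change = float("inf") if now > 0 else 0.0
        else:
            change = (now - base) / base
        rows.append({
            "signal": name,
            "change": change,
            "flagged": abs(change) > tolerance[name],
        })
    # Flagged rows first, then largest absolute relative change.
    rows.sort(key=lambda row: (not row["flagged"], -abs(row["change"])))
    return rows


baseline = {"eval_score": 0.86, "output_tokens_per_call": 312, "latency_p95_ms": 712}
current = {"eval_score": 0.83, "output_tokens_per_call": 358, "latency_p95_ms": 740}
# Hypothetical tolerances, expressed as relative change.
tolerance = {"eval_score": 0.02, "output_tokens_per_call": 0.10, "latency_p95_ms": 0.10}

rows = drift_rows(baseline, current, tolerance)
for row in rows:
    marker = "FLAG" if row["flagged"] else ""
    print(f"{row['signal']:<26} {row['change']:+.1%} {marker}")
```

With these tolerances the output-token and eval-score rows are flagged and the latency row is not, so the on-call would start with output tokens per call.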


What metrics matter, and where they come from

A compact list of the metrics every gateway emits:

| Metric | Source | Used for |
|---|---|---|
| gateway.calls.total | Per-call audit | Live throughput |
| gateway.calls.errors{code} | Audit | Error rate by structured code |
| gateway.latency.ms{provider,model,region} | Audit | Per-provider latency baselines |
| gateway.tokens.input{...} | Audit | Cost computation, drift signal |
| gateway.tokens.output{...} | Audit | Cost computation, drift signal |
| gateway.cost.usd{tenant,feature,alias} | Audit + price book | Cost dashboards, budget enforcement |
| gateway.cache.hits{kind} | Audit | Cache hit-rate dashboards |
| gateway.fallback.step{from,to} | Audit | Fallback frequency dashboards |
| gateway.bucket.fill{tenant,provider} | Quota plane | Quota saturation alarms |
| gateway.credentials.issued{} | Credential plane | Credential usage audit |
| gateway.zone.violations{} | Routing audit | Compliance integrity (should be zero) |
| gateway.itself.up{instance} | Health check | Gateway-itself SLOs |

Every metric is dimensioned by the obvious slicing fields (provider, region, alias, tenant, feature). Cardinality explosion is a real risk; some platforms keep per-tenant dimensions only on cost metrics to limit it.
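
To make the "Source: Audit" column concrete, here is a sketch that derives metric points from a single audit record. Metric names come from the table above; the record fields `latency_ms`, `tenant_id`, `feature_id`, and `error_code` are assumed for illustration, as is mapping the `model` label to `model_version`. The output is a plain list of tuples, not any particular metrics library's API.

```python
def metrics_from_audit(record):
    """Return a list of (metric name, labels, value) points for one call."""
    used = record["model_used"]
    provider_labels = {
        "provider": used["provider"],
        "model": used["model_version"],
        "region": used["region"],
    }
    cost_labels = {
        "tenant": record["tenant_id"],
        "feature": record["feature_id"],
        "alias": record["model_alias"],
    }
    points = [
        ("gateway.calls.total", {}, 1),
        ("gateway.latency.ms", provider_labels, record["latency_ms"]),
        ("gateway.tokens.input", provider_labels, record["usage"]["input_tokens"]),
        ("gateway.tokens.output", provider_labels, record["usage"]["output_tokens"]),
        ("gateway.cost.usd", cost_labels, record["cost_usd"]),
    ]
    if record.get("error_code"):
        points.append(("gateway.calls.errors", {"code": record["error_code"]}, 1))
    return points


# Hypothetical audit record.
record = {
    "model_alias": "smart-reasoner",
    "model_used": {
        "provider": "anthropic",
        "model_version": "claude-sonnet-4-6",
        "region": "ap-south-1",
    },
    "usage": {"input_tokens": 900, "output_tokens": 312},
    "cost_usd": 0.0042,
    "latency_ms": 462,
    "tenant_id": "acme-corp",
    "feature_id": "chat",
    "error_code": None,
}

for name, labels, value in metrics_from_audit(record):
    print(name, labels, value)
```

Note that only the cost metric carries tenant and feature labels here, which is one way to apply the cardinality limit described above.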


The alarm panel

The on-call's home view. A short list of high-signal alarms.

| Alarm | Threshold | Action |
|---|---|---|
| Per-provider latency p95 > 2× baseline for 5 min | dynamic | Investigate; consider routing weight shift |
| Per-provider error rate > 2× baseline for 5 min | dynamic | Investigate; consider fallback to alternate |
| 429 rate per provider > 0.1% | fixed | Tighten internal cap; raise with provider |
| UPSTREAM_UNCLASSIFIED rate > 0 sustained | fixed | Extend translator; investigate provider change |
| Postcondition violations > 0 sustained | fixed | Investigate shape drift |
| Bucket saturation > 90% for a tenant | dynamic | Notify tenant; check policy |
| Cost spend rate > 1.5× baseline for an hour | dynamic | Investigate runaway; check for regression |
| Zone violation count > 0 | fixed | Page immediately; security incident |
| Gateway itself unhealthy | fixed | Page; tier-zero outage |
| Audit emission rate < expected | fixed | Investigate; missing audit is a blind spot |

Two of these are page-now alarms (zone violation, gateway-itself down). The rest are investigate-now alarms that the on-call addresses within their SLA.
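
A dynamic alarm such as "p95 > 2× baseline for 5 min" needs a sustained condition, not a single bad sample. The sketch below assumes one reading per minute and a hypothetical `SustainedAlarm` class; the `page` flag separates page-now alarms from investigate-now alarms.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class SustainedAlarm:
    name: str
    ratio: float          # breach when value > ratio * baseline
    sustain_minutes: int  # consecutive one-minute breaches required to fire
    page: bool = False    # True for page-now alarms
    _streak: int = field(default=0, repr=False)

    def observe(self, value: float, baseline: float) -> Optional[str]:
        """Feed one per-minute reading; return the action once the alarm fires."""
        if value > self.ratio * baseline:
            self._streak += 1
        else:
            self._streak = 0  # any healthy minute resets the count
        if self._streak >= self.sustain_minutes:
            return "page" if self.page else "investigate"
        return None


latency_alarm = SustainedAlarm("provider-latency-p95", ratio=2.0, sustain_minutes=5)

# Hypothetical per-minute p95 readings (ms) against a 781 ms baseline.
readings = [1700, 1700, 1700, 1700, 700, 1800, 1800, 1800, 1800, 1800]
results = [latency_alarm.observe(ms, baseline=781) for ms in readings]
print(results)
```

The first four breaches do not fire because the fifth minute is healthy and resets the streak; the alarm fires only on the last reading, after five consecutive breaches.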


Audit log as the substrate

Every dashboard, every metric, every alarm in this chapter reads from the audit log (templates from module 19 chapter 11; this module's calls have a slightly different shape, captured in chapter 02).

A few model-specific fields on top of module 19's template:

  • model_alias, model_used.{provider, model_version, region}
  • usage.{input_tokens, output_tokens, cache_read_tokens, cache_write_tokens}
  • cost_usd, price_book_version
  • cache_status, fallback_step, degraded, primary_failure_reason
  • route_resolution — the candidates considered, the one selected, the reason

The audit is queryable in close-to-real-time for live dashboards and persisted long enough for compliance (chapter 06 of module 19 covers retention; same principles apply, with longer retention for cost and compliance).
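
The model-specific fields listed above could be typed roughly as follows. This is a sketch only: module 19's base template is not reproduced here, the field types are guesses, and every value in the example record is hypothetical.

```python
from dataclasses import asdict, dataclass
from typing import List, Optional


@dataclass
class ModelUsed:
    provider: str
    model_version: str
    region: str


@dataclass
class Usage:
    input_tokens: int
    output_tokens: int
    cache_read_tokens: int = 0
    cache_write_tokens: int = 0


@dataclass
class RouteResolution:
    candidates: List[str]  # candidates considered
    selected: str          # the one selected
    reason: str            # why it was selected


@dataclass
class ModelCallAudit:
    model_alias: str
    model_used: ModelUsed
    usage: Usage
    cost_usd: float
    price_book_version: str
    cache_status: str
    fallback_step: int
    degraded: bool
    primary_failure_reason: Optional[str]
    route_resolution: RouteResolution


primary = "anthropic:claude-sonnet-4-6:ap-south-1"
fallback = "anthropic:claude-sonnet-4-6:us-east-1"

audit = ModelCallAudit(
    model_alias="smart-reasoner",
    model_used=ModelUsed("anthropic", "claude-sonnet-4-6", "us-east-1"),
    usage=Usage(input_tokens=900, output_tokens=312),
    cost_usd=0.0042,
    price_book_version="example-v1",
    cache_status="miss",
    fallback_step=1,
    degraded=True,
    primary_failure_reason="UPSTREAM_TIMEOUT",
    route_resolution=RouteResolution([primary, fallback], fallback, "primary timed out"),
)

print(asdict(audit)["model_used"])
```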


Cost vs cardinality

Dimensions like tenant_id, feature_id, caller_identity are valuable but expensive. A platform with thousands of tenants × hundreds of features × dozens of aliases can produce millions of time series per metric (2,000 × 200 × 24 is already 9.6 million). A reasonable compromise:

  • Keep full cardinality for cost metrics (finance needs the breakdown)
  • Reduce cardinality for latency and error metrics (group rare tenants into a bucket)
  • Maintain audit log records at full cardinality for queryable on-demand reports

This trade is platform-specific. Plan it explicitly rather than letting it grow.
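
One way to group rare tenants into a bucket, sketched under assumptions: keep the busiest N tenants as their own label value and fold the rest into "other". The function names, the tenant names other than acme-corp, and the cut-off of 50 are all hypothetical.

```python
from collections import Counter


def top_tenants(calls_by_tenant, n):
    """The n busiest tenants keep their own label value."""
    return {tenant for tenant, _ in Counter(calls_by_tenant).most_common(n)}


def tenant_label(tenant_id, keep):
    """Label used on latency and error metrics; rare tenants share a bucket."""
    return tenant_id if tenant_id in keep else "other"


def series_count(tenants, features, aliases):
    """Worst-case number of time series for one metric."""
    return tenants * features * aliases


calls_by_tenant = {"acme-corp": 482_113, "globex": 90_400, "initech": 310, "hooli": 12}
keep = top_tenants(calls_by_tenant, n=2)
print(sorted(tenant_label(tenant, keep) for tenant in calls_by_tenant))
# ['acme-corp', 'globex', 'other', 'other']

print(series_count(2_000, 200, 24))   # 9600000 at full cardinality
print(series_count(50 + 1, 200, 24))  # 244800 with the top 50 tenants plus "other"
```

The audit log still holds the original tenant_id, so a per-tenant latency report for a rare tenant remains possible as an on-demand query.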


How observability interacts with everything else

  • Routing (chapter 03) — the routing scorer reads recent latency and error metrics live.
  • Fallback (chapter 04) — fallback step frequency is a key health signal.
  • Quota (chapter 05) — bucket fills are dashboard rows.
  • Credential (chapter 06) — credential usage is audited.
  • Cost (chapter 07) — cost dashboards are observability.
  • Cache (chapter 08) — hit rates are core signals.
  • Drift (chapter 09) — the drift panel reads from observability.
  • Privacy (chapter 10) — zone violations are alarms.

How to recognise broken observability in the wild

  • Incidents start with "let me check the logs" instead of "let me check the dashboard"
  • Latency baselines are absent; "what is normal" is a judgement call
  • Per-provider dashboards do not exist; provider behaviour is invisible
  • Cost is a monthly report, not a live dashboard
  • The alarm panel has more than 20 alarms (most ignored) or fewer than 5 (most signals missed)
  • Audit log volume is unknown; missing emissions are not detected

Interview Q&A

Q1. Why dashboard per-provider per-region instead of one aggregated latency view?

Because most operational events are scoped to a specific provider in a specific region. Aggregation across regions hides a region-down event; aggregation across providers hides a provider-specific outage. The on-call needs to see the dimension where things are going wrong, and those dimensions are providers and regions. Aggregated views are useful for executive summaries but not for triage.

Wrong-answer notes: "for completeness" is vague; the specific value is that the dashboard's dimensions match the dimensions of incidents.

Q2. The chapter-opening incident: how does the on-call diagnose in three minutes?

The latency dashboard for anthropic:claude-sonnet-4-6:ap-south-1 shows that p95 has jumped. Cross-reference: the fallback step share for smart-reasoner shows fallback step 1 (us-east-1) at 25%, up from 7%. The picture is clear: the primary region is degraded, fallback is doing its job, and user latency is up but there are no errors. The on-call posts the diagnosis and applies the playbook (shift weights or wait for recovery). The audit log has the per-call evidence if a postmortem needs it.

Wrong-answer notes: "we'd grep the logs" is what takes thirty minutes instead of three.

Q3. Eval scores in production traffic are part of the drift panel. How are they computed?

A sample of production calls (often 1–5%) is paired with the eval suite. Each pair runs through an LLM-as-judge or a rule-based grader and produces a score. The scores are aggregated over a rolling hour and compared against the prior week's baseline. Module 04_ai_product_evals owns the eval definition and the judge calibration; the gateway emits the production calls and consumes the scores into the dashboard. A drop beyond tolerance is the leading drift signal.

Wrong-answer notes: "we look at user feedback" lags by days; eval-on-production-traffic catches drift within an hour.

Q4. What is the difference between an alarm and a page?

An alarm fires when a signal departs from baseline and warrants attention; the on-call addresses it within an SLA (e.g., 30 minutes), but it does not necessarily wake them up. A page is a now-event: the on-call is interrupted, regardless of time. Pages should be rare. In this chapter's alarm panel only two alarms page: gateway-itself down and zone violation. Some platforms also page on a total spending anomaly or on stopped audit emission. Most alarms are not pages. With too many pages, the on-call suffers alarm fatigue and starts ignoring real ones; with too few, real incidents are missed.

Wrong-answer notes: "all alarms are pages" produces fatigue; "no pages, just dashboards" misses time-sensitive incidents.
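
The sampling and aggregation described in Q3 can be sketched as below. Hash-based sampling is one possible way to pick a stable 1–5% of calls and is an assumption here, as are the function names and the tolerance; the judge itself is out of scope, so the scores are supplied directly.

```python
import hashlib


def sampled(call_id: str, rate: float) -> bool:
    """Deterministically select roughly `rate` of calls by hashing the call id."""
    digest = hashlib.sha256(call_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return bucket < rate


def eval_drift(scores_last_hour, baseline, tolerance):
    """Return (rolling mean score, True if it dropped beyond tolerance)."""
    rolling = sum(scores_last_hour) / len(scores_last_hour)
    return rolling, (baseline - rolling) > tolerance


# Roughly 5% of hypothetical call ids are selected for evaluation.
call_ids = [f"call-{i}" for i in range(10_000)]
selected = [call_id for call_id in call_ids if sampled(call_id, rate=0.05)]
print(len(selected) / len(call_ids))

# Hypothetical judge scores for the sampled calls in the last hour.
rolling, drifted = eval_drift([0.85, 0.83, 0.84], baseline=0.86, tolerance=0.01)
print(round(rolling, 2), drifted)
```

Hashing the call id rather than drawing a random number means the same call is always in or out of the sample, which makes a score reproducible when the eval is re-run.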


What to do differently after reading this

  • Build the per-provider, per-region dashboard first. It is the highest-leverage view.
  • Add the per-tenant cost and quota dashboards.
  • Maintain the drift panel; review weekly.
  • Define each alarm explicitly: what threshold, what action, who is paged or alerted, what SLA. Reject "watch dashboards" as an alarm.
  • Maintain audit completeness as a first-class metric.
  • Test the dashboards quarterly with a fire drill: simulate a provider outage and verify the on-call diagnoses it from dashboards alone.
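
Audit completeness as a first-class metric can be as simple as a ratio. The sketch assumes the gateway keeps an independent count of calls served (for example, at the request handler) that can be compared with the number of audit records emitted in the same window; the function names and the 99.9% floor are illustrative.

```python
def audit_completeness(calls_served: int, audit_records: int) -> float:
    """Fraction of served calls that produced an audit record."""
    if calls_served == 0:
        return 1.0  # nothing served, nothing missing
    return audit_records / calls_served


def audit_gap_alarm(calls_served: int, audit_records: int, min_ratio: float = 0.999) -> bool:
    """True when audit emission has fallen below the expected ratio."""
    return audit_completeness(calls_served, audit_records) < min_ratio


print(audit_gap_alarm(calls_served=12_478, audit_records=12_478))  # False
print(audit_gap_alarm(calls_served=12_478, audit_records=11_900))  # True
```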

Bridge. Eleven chapters have built the gateway surface by surface. The last two synthesise. The next chapter is the architect's checklist — twenty items that distinguish a gateway you can defend from a gateway you cannot. → 12-architect-checklist.md