03. API Design at Boundaries: drawing the right roads between zones

~16 min read. Great components still fail if the roads between them are confusing or dangerous.

Built on the ELI5 in 00-eli5.md. The road — the communication path between zones — now becomes the contract that keeps the city moving safely.

1) An API is not just a URL. It is the road shape.

Once you split a system into components, empty arrows are not enough.

A box-and-arrow diagram says two zones talk. It does not say how safely, how often, or with what guarantees. That missing detail is the API boundary.

At HLD level, an API answer must cover these questions:
- who calls whom?
- sync or async?
- request-response or stream?
- human-facing or service-to-service?
- strict consistency or eventual consistency?
- what happens on retries?

See. This is why the road metaphor works well. A narrow lane is fine for occasional admin traffic. A wide highway is better for internal low-latency calls. A one-way freight route fits event streams.

Different roads serve different movement. Different APIs serve different communication patterns.

A poor boundary looks like this:

┌──────────┐   ┌───────────────┐   ┌─────────────┐
│  Client  │──→│ Order Service │──→│ Payment Svc │
└──────────┘   └───────────────┘   └─────────────┘

The arrows exist, but nothing is said about payloads, timeouts, retries, or duplicate requests.

A better HLD note says:
- client to gateway: HTTPS REST API
- order to payment: internal gRPC call with 300 ms timeout
- order created event to analytics: async event stream
- notification trigger: queue-backed async worker
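The same note can live as structured data next to the diagram, so a review can spot a road that is missing its rules. This is only a minimal sketch: the `Boundary` type, its field names, and the `missing_timeouts` helper are hypothetical, and it assumes the order service is what triggers notifications. Only the 300 ms timeout comes from the note above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Boundary:
    """One road between two zones. All field names are illustrative."""
    caller: str
    callee: str
    protocol: str                     # e.g. "REST", "gRPC", "event-stream", "queue"
    synchronous: bool                 # does the caller wait for the answer?
    timeout_ms: Optional[int] = None  # only meaningful for sync calls


BOUNDARIES = [
    Boundary("client", "gateway", "REST", synchronous=True),
    Boundary("order", "payment", "gRPC", synchronous=True, timeout_ms=300),
    Boundary("order", "analytics", "event-stream", synchronous=False),
    Boundary("order", "notification", "queue", synchronous=False),  # caller assumed
]


def missing_timeouts(boundaries):
    """Return sync boundaries that have no deadline written down yet."""
    return [b for b in boundaries if b.synchronous and b.timeout_ms is None]


for b in missing_timeouts(BOUNDARIES):
    print(f"{b.caller} -> {b.callee}: sync {b.protocol} call has no timeout")
```

Run against the note above, the check flags the client-to-gateway road, because the note never states its timeout.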

Now the road is real, not decorative.

2) Why REST, why gRPC, why GraphQL?

Now what is the problem? People choose a protocol by trend.

Do not do that. Start from the access pattern.

REST

REST is a good default for external APIs. It is simple, debuggable, cache-friendly, and widely understood.

Use it when:
- browsers, mobile apps, or partners are primary clients
- resource-oriented operations are clear
- tooling and observability simplicity matter
- a little extra latency is acceptable

gRPC

gRPC is strong for internal service-to-service calls. It gives typed contracts, code generation, HTTP/2 multiplexing, and lower serialization overhead.

Use it when:
- services call each other at high volume
- low latency matters
- you control both client and server stacks
- strict schemas let services evolve quickly without breaking callers

GraphQL

GraphQL helps when clients need flexible reads from many entities. It shines when frontend teams want one fetch shaped to one screen.

Use it when:
- data comes from multiple backing services
- clients need variable field selection
- overfetching or underfetching hurts often
- read aggregation is more important than raw simplicity

But see the tradeoff. GraphQL can hide expensive fan-out. REST can create too many round trips. gRPC can be awkward for third-party consumers.

So what to do? Match the road to the traffic.
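The "use it when" lists above can be compressed into a rough first-pass rule. Treat this as a conversation starter, not a decision procedure: the function name, its three flags, and the order of the checks are all assumptions made for illustration.

```python
def suggest_protocol(external_clients: bool,
                     flexible_reads_across_entities: bool,
                     caller_needs_answer: bool = True) -> str:
    """Rule-of-thumb only; real choices weigh many more factors."""
    if not caller_needs_answer:
        return "async event or queue"   # side effects leave the sync road
    if flexible_reads_across_entities:
        return "GraphQL"                # one fetch shaped to one screen
    if external_clients:
        return "REST"                   # simple, debuggable, widely understood
    return "gRPC"                       # internal, high volume, typed contracts


print(suggest_protocol(external_clients=True, flexible_reads_across_entities=False))
print(suggest_protocol(external_clients=False, flexible_reads_across_entities=False))
```

The useful part is not the answer it returns but that the inputs are traffic questions, not trend questions.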

3) Contracts, versioning, and idempotency

An API boundary is healthy only when the contract is boringly clear.

That means:
- fields are named predictably
- required vs optional is explicit
- errors are categorized
- timeouts are defined
- retry safety is defined
- version evolution is planned

Let us take one order-create API. Client sends:
- user_id
- restaurant_id
- item_ids
- payment_method_id
- idempotency_key

Server returns:
- order_id
- state = PENDING_PAYMENT
- created_at
- estimated_total
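Written as types, the contract above might look like the sketch below. The field names come from the lists; the Python types, the choice to hold `estimated_total` in the smallest currency unit, and the `validate` helper are assumptions for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import List


@dataclass(frozen=True)
class CreateOrderRequest:
    user_id: str
    restaurant_id: str
    item_ids: List[str]
    payment_method_id: str
    idempotency_key: str   # lets the server recognise a retry


@dataclass(frozen=True)
class CreateOrderResponse:
    order_id: str
    state: str             # "PENDING_PAYMENT" right after creation
    created_at: datetime
    estimated_total: int   # assumption: smallest currency unit (paise)


def validate(req: CreateOrderRequest) -> List[str]:
    """Return a list of problems; an empty list means the request is well-formed."""
    problems = []
    if not req.item_ids:
        problems.append("item_ids must not be empty")
    if not req.idempotency_key:
        problems.append("idempotency_key is required")
    return problems


request = CreateOrderRequest("user123", "rest9", ["item1", "item2"], "pm7",
                             "checkout:user123:cart456:v1")
response = CreateOrderResponse("ord_1", "PENDING_PAYMENT",
                               datetime.now(timezone.utc), 45_000)
print(validate(request), response.state)
```

Making required fields explicit in a type is one way to keep the contract boringly clear; a schema language such as protobuf or OpenAPI would serve the same purpose.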

Why an idempotency key? Because networks lie. The client may time out after the request has already reached the server. If the user taps again, the second call must not create a second order.

Worked example with numbers: Assume 8,000 checkout attempts per minute. Assume 2% of client calls time out and retry. Retries per minute = 8,000 × 0.02 = 160 retries.

Without idempotency, worst-case duplicate orders per minute = 160. Per hour = 160 × 60 = 9,600 duplicate orders. If average order value is ₹450, duplicate money at risk per hour = 9,600 × 450 = ₹4,320,000.
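The arithmetic above is easy to re-run with your own traffic numbers. A small sketch using the same inputs; the variable names are just labels for the figures in the text.

```python
CHECKOUTS_PER_MINUTE = 8_000
RETRY_PERCENT = 2              # 2% of client calls time out and retry
AVG_ORDER_VALUE_INR = 450

retries_per_minute = CHECKOUTS_PER_MINUTE * RETRY_PERCENT // 100
duplicates_per_hour = retries_per_minute * 60        # worst case: every retry duplicates
money_at_risk_per_hour = duplicates_per_hour * AVG_ORDER_VALUE_INR

print(retries_per_minute)       # 160
print(duplicates_per_hour)      # 9600
print(money_at_risk_per_hour)   # 4320000
```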

Now add idempotency. Client sends the key checkout:user123:cart456:v1. Server stores the first successful result against that key for 24 hours. A retry with the same key returns the same order_id. Duplicate order risk drops from 160 per minute to near zero, barring implementation bugs.
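A minimal in-memory sketch of that behaviour follows. The class and method names are hypothetical, and a plain dict stands in for the store; a real service would need a shared store with an atomic insert-if-absent, because two retries can arrive at the same moment.

```python
import time
import uuid

TTL_SECONDS = 24 * 60 * 60  # keep results for 24 hours, as in the text


class IdempotentOrderService:
    """In-memory sketch only; not safe for concurrent or multi-node use."""

    def __init__(self, clock=time.time):
        self._clock = clock
        self._results = {}  # idempotency_key -> (expires_at, response)

    def create_order(self, idempotency_key: str, cart_id: str) -> dict:
        now = self._clock()
        cached = self._results.get(idempotency_key)
        if cached and cached[0] > now:
            return cached[1]          # retry: replay the stored response
        response = {
            "order_id": f"ord_{uuid.uuid4().hex[:12]}",
            "state": "PENDING_PAYMENT",
            "cart_id": cart_id,
        }
        self._results[idempotency_key] = (now + TTL_SECONDS, response)
        return response


service = IdempotentOrderService()
first = service.create_order("checkout:user123:cart456:v1", "cart456")
retry = service.create_order("checkout:user123:cart456:v1", "cart456")
print(first["order_id"] == retry["order_id"])  # True: same order, no duplicate
```

The key design choice is that the server replays the stored response instead of rejecting the retry, so the client cannot tell a retry from a first attempt.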

Simple, no?

Versioning next. Suppose v1 returns one delivery_address string. Suppose v2 wants structured fields: line1, city, pincode.

Bad move:
- silently change v1 response shape

Better moves:
- add backward-compatible optional fields
- expose /v2/orders for breaking changes
- keep schema evolution rules documented per consumer type
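One way to picture "expose /v2/orders for breaking changes" is two serializers over the same stored order. This is a sketch under assumptions: the `/v1/orders` path, the function names, the internal `address` shape, and the sample address are all invented for illustration.

```python
def order_v1(order: dict) -> dict:
    """v1 contract: delivery_address stays a single string."""
    addr = order["address"]
    return {
        "order_id": order["order_id"],
        "delivery_address": f'{addr["line1"]}, {addr["city"]} {addr["pincode"]}',
    }


def order_v2(order: dict) -> dict:
    """v2 contract (breaking change): structured address fields."""
    addr = order["address"]
    return {
        "order_id": order["order_id"],
        "delivery_address": {
            "line1": addr["line1"],
            "city": addr["city"],
            "pincode": addr["pincode"],
        },
    }


# Hypothetical routing table: old clients keep the shape they were built against.
ROUTES = {"/v1/orders": order_v1, "/v2/orders": order_v2}

stored = {
    "order_id": "ord_1",
    "address": {"line1": "12 Example Road", "city": "Bengaluru", "pincode": "560001"},
}
print(ROUTES["/v1/orders"](stored)["delivery_address"])
print(ROUTES["/v2/orders"](stored)["delivery_address"]["city"])
```

Both versions read from one internal representation, so widening the road does not mean maintaining two copies of the data.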

The road must stay driveable while you widen it.

4) Design for latency and failure at the boundary

At HLD level, every sync call adds waiting. If service A calls B, and B calls C, user latency stacks.

Let us do the math. User request budget = 500 ms.
- Gateway overhead = 40 ms
- Order service work = 60 ms
- Payment service network + processing = 180 ms
- Inventory service check = 90 ms
- Notification enqueue = 20 ms
- Observability and auth overhead = 30 ms

Total = 40 + 60 + 180 + 90 + 20 + 30 = 420 ms.

Looks okay, yes? Now replace three hops with their p95 spike values:
- payment becomes 280 ms
- inventory becomes 140 ms
- gateway becomes 60 ms

New total = 60 + 60 + 280 + 140 + 20 + 30 = 590 ms.
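The same budget check can be kept as a tiny script, so the sum is redone whenever a hop is added. A sketch using the numbers above; the dictionary keys are just labels. Adding up per-hop p95 values is only a rough estimate of end-to-end p95, since hops do not always spike together.

```python
BUDGET_MS = 500

typical = {
    "gateway": 40, "order": 60, "payment": 180,
    "inventory": 90, "notification_enqueue": 20, "observability_auth": 30,
}
p95_overrides = {"payment": 280, "inventory": 140, "gateway": 60}
spiky = {**typical, **p95_overrides}   # same hops, three replaced by p95 values

for label, hops in (("typical", typical), ("p95", spiky)):
    total = sum(hops.values())
    verdict = "within" if total <= BUDGET_MS else "over"
    print(f"{label}: {total} ms, {verdict} the {BUDGET_MS} ms budget")
```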

Now you have blown the 500 ms budget by 90 ms. So what to do? Move non-critical work off the sync road. Notification should be async. Analytics should be async. The inventory check can often shrink to a short reservation, with the full stock update happening later.

A useful boundary diagram looks like this:

┌──────────┐   ┌─────────────┐   ┌───────────────┐
│  Client  │──→│ API Gateway │──→│ Order Service │
└──────────┘   └─────────────┘   └───────┬───────┘
                                         │
                      ┌──────────────────┼──────────────────┐
                      ▼                  ▼                  ▼
              ┌───────────────┐  ┌───────────────┐  ┌───────────────┐
              │ Payment gRPC  │  │ Inventory RPC │  │  Order Event  │
              └───────────────┘  └───────────────┘  └───────┬───────┘
                                                            ▼
                                                    ┌───────────────┐
                                                    │ Notif Worker  │
                                                    └───────────────┘

This tells you three important things fast. One, the user waits for payment and inventory. Two, the user does not wait for notifications. Three, one road is sync and another is event-driven.
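In code, the split between the two roads can be as small as "call and wait" versus "enqueue and return". The sketch below uses an in-process queue and a thread as stand-ins for a real broker and worker. The function names are hypothetical, the payment and inventory calls are stubs, and reserving inventory before charging is an assumption, not something the diagram specifies.

```python
import queue
import threading

notifications = queue.Queue()   # stand-in for a real message queue
sent = []


def charge_payment(order_id: str) -> bool:      # stub for the sync payment call
    return True


def reserve_inventory(order_id: str) -> bool:   # stub for the sync inventory call
    return True


def checkout(order_id: str) -> str:
    """The user waits only for the calls made inside this function."""
    if not reserve_inventory(order_id):         # assumed order: reserve, then charge
        return "OUT_OF_STOCK"
    if not charge_payment(order_id):
        return "PAYMENT_FAILED"
    notifications.put(order_id)   # enqueue and return; do not wait for delivery
    return "CONFIRMED"


def notification_worker() -> None:
    while True:
        order_id = notifications.get()
        if order_id is None:      # sentinel: stop the worker
            break
        sent.append(f"order {order_id} confirmed")


worker = threading.Thread(target=notification_worker, daemon=True)
worker.start()
print(checkout("ord_1"))          # CONFIRMED, without waiting for the notification
notifications.put(None)
worker.join()
```

If the notification worker is slow or down, `checkout` still returns, which is exactly the point of taking that work off the sync road.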

Now add boundary hygiene:
- set deadlines, not infinite waits
- return stable error codes, not random strings
- keep APIs coarse enough to avoid chatty traffic
- keep ownership clear so one team can evolve one contract
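The first two hygiene rules can be sketched together: one deadline shared by the whole request, and a small fixed set of error codes. Everything named here (`Deadline`, `ErrorCode`, `call_payment`) is hypothetical, and the network call itself is left as a comment. The 300 ms cap reuses the order-to-payment timeout from earlier.

```python
import time
from enum import Enum


class ErrorCode(str, Enum):
    """Stable, documented codes that callers can branch on."""
    DEADLINE_EXCEEDED = "DEADLINE_EXCEEDED"
    PAYMENT_DECLINED = "PAYMENT_DECLINED"


class Deadline:
    """Absolute deadline shared by every hop of one request."""

    def __init__(self, budget_ms: int, clock=time.monotonic):
        self._clock = clock
        self._expires_at = clock() + budget_ms / 1000

    def remaining_ms(self) -> int:
        return max(0, int((self._expires_at - self._clock()) * 1000))


def call_payment(deadline: Deadline) -> dict:
    timeout_ms = min(300, deadline.remaining_ms())  # never wait past the budget
    if timeout_ms == 0:
        return {"ok": False, "error": ErrorCode.DEADLINE_EXCEEDED.value}
    # A real client would pass timeout_ms to the network call here.
    return {"ok": True, "timeout_ms": timeout_ms}


print(call_payment(Deadline(budget_ms=500)))
```

Passing the remaining budget downstream, instead of a fresh timeout per hop, stops a slow early hop from silently stretching the whole request.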

One more trap. Do not let the frontend orchestrate ten backend calls if the screen needs one combined answer. That creates a traffic jam across too many roads. A gateway or backend-for-frontend (BFF) can make those calls server-side and return one combined response.
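A backend-for-frontend endpoint of that kind might look like the sketch below: three backend reads run concurrently behind one call, under one overall timeout. The three `fetch_*` functions are stubs with invented return values, and `order_screen` is a hypothetical name.

```python
import asyncio


async def fetch_order(order_id: str) -> dict:        # stub backend call
    return {"order_id": order_id, "state": "PENDING_PAYMENT"}


async def fetch_restaurant(order_id: str) -> dict:   # stub backend call
    return {"name": "Example Kitchen"}


async def fetch_eta(order_id: str) -> dict:          # stub backend call
    return {"eta_minutes": 30}


async def order_screen(order_id: str, timeout_s: float = 0.5) -> dict:
    """One combined answer for one screen; backend calls run concurrently."""
    order, restaurant, eta = await asyncio.wait_for(
        asyncio.gather(
            fetch_order(order_id),
            fetch_restaurant(order_id),
            fetch_eta(order_id),
        ),
        timeout=timeout_s,   # raises a timeout error if the group is too slow
    )
    return {"order": order, "restaurant": restaurant, "eta": eta}


print(asyncio.run(order_screen("ord_1")))
```

The client now makes one round trip, and the fan-out happens on the fast internal network where it is also easier to observe and limit.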

Where this lives in the wild

Stripe Payments API — idempotency keys let merchants safely retry payment creation without double-charging customers after network timeouts.
Google Ads — bidding, budget, and serving services use typed internal RPC contracts because millisecond-level service calls need strict schemas and low overhead.
GitHub GraphQL API — clients fetch exactly the repository, issue, and review fields a screen needs without many REST round trips.
Slack — Web API endpoints serve external integrations, while internal event pipelines and realtime messaging paths use different communication shapes for different traffic.
Swiggy — checkout, payment, restaurant availability, and notification paths mix sync APIs with async events so users do not wait for side effects.

Pause and recall

Why is an API boundary more than just a URL name on a diagram?
When is gRPC a better road than REST?
In the idempotency example, where did the ₹4,320,000 risk number come from?
Which work should usually leave the sync request path first: notifications or payment confirmation?

Interview Q&A

Q: Why REST not gRPC for a public partner API? A: Because public consumers value simplicity, broad tool support, and easy debugging. REST over HTTPS is easier across mixed stacks, proxies, and human-operated integrations. gRPC is excellent internally, but not always friendly for every external consumer.

Common wrong answer to avoid: "REST is old, so modern systems should expose gRPC everywhere."

Q: Why GraphQL not many REST calls for one complex screen? A: Because one screen may need nested data from many entities with varying fields. GraphQL can reduce overfetching and round trips when the read pattern is flexible. But you still need cost controls so one query does not become hidden fan-out chaos.

Common wrong answer to avoid: "GraphQL is automatically faster than REST in every case."

Q: Why idempotency keys, not just client-side button disabling? A: Because duplicate submission can happen after the request leaves the device. Retries from mobile SDKs, proxies, or gateways can still replay a mutating call. Server-side idempotency protects the money path even when clients misbehave.

Common wrong answer to avoid: "Disable the button and duplicates disappear."

Q: Why async notifications, not synchronous notification calls inside checkout? A: Because checkout success matters to the user immediately, while notifications are side effects. Keeping notifications sync burns latency budget and widens the failure blast radius. Async delivery keeps the critical road short and more reliable.

Common wrong answer to avoid: "Make everything synchronous so the system stays easier to reason about."

Apply now (5 min)

Take one boundary from a product you know: create order, send chat message, or upload file.

Do this quickly:
1. Choose REST, gRPC, GraphQL, or event stream.
2. Write one request example with 4-6 fields.
3. Write one response example with status and identifiers.
4. Add one timeout number.
5. Decide whether retries are safe, unsafe, or idempotent.

Sketch from memory: Draw one sync road and one async road leaving the same service. Label which path the user waits for and which path can happen later. If that is clear, your boundary thinking is improving.

Bridge. The roads are now named and disciplined, but data still needs homes. Next we choose the right warehouses so each component knows where state should live. → 04-data-modeling-and-storage.md