12. Service discovery: finding moving services without hardcoding moving addresses

~14 min read. In cloud systems, addresses change faster than people remember them.

Built on the ELI5 in 00-eli5.md. The town directory — the list of who is open and where — becomes service discovery.

1) Why discovery exists in the first place

See, containers restart, pods reschedule, and VMs come and go. If clients keep fixed IP lists in config files, those lists rot quickly.

A registry becomes the live town directory for services. Instances register themselves, refresh health, and disappear when they fail or stop renewing.

The registry answers three questions: which service names exist, which instances are healthy, and what address each instance currently holds.
Discovery sits between naming and routing. A name like payment-service becomes an actual host, port, or endpoint.
Freshness matters. A stale registry can send traffic into a dead pod faster than any human can notice.

service name: payment-service

registry
  |- 10.0.1.14:8080 healthy
  |- 10.0.2.7:8080  healthy
  `- 10.0.3.91:8080 draining

Worked example. A checkout API needs payment-service. It asks the registry, not a spreadsheet. The answer changes as pods roll, fail, and heal.
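
The register, renew, and expire cycle above can be sketched as a toy in-memory registry. This is a minimal illustration, not how Consul or etcd implement leases: the `Registry` class, the `Lease` record, the 10-second TTL, and the injectable `clock` are all assumptions made for the example.

```python
import time
from dataclasses import dataclass


@dataclass
class Lease:
    address: str
    expires_at: float


class Registry:
    """Toy registry: an instance stays listed only while it keeps renewing."""

    def __init__(self, ttl_seconds=10.0, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock          # injectable so expiry can be tested without sleeping
        self.leases = {}            # service name -> {address: Lease}

    def register(self, service, address):
        # Registering and renewing are the same operation: push the expiry forward.
        lease = Lease(address, self.clock() + self.ttl)
        self.leases.setdefault(service, {})[address] = lease

    def deregister(self, service, address):
        # Graceful shutdown: remove the entry instead of waiting for the lease to lapse.
        self.leases.get(service, {}).pop(address, None)

    def lookup(self, service):
        # Return only addresses whose lease has not expired.
        now = self.clock()
        entries = self.leases.get(service, {}).values()
        return sorted(lease.address for lease in entries if lease.expires_at > now)
```

An instance that crashes never calls `deregister`, which is why the lease expiry matters: once the TTL passes without a renewal, `lookup` stops returning that address.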

So discovery is not glamorous. It is the boring map that keeps requests from getting lost.

2) Client-side discovery gives clients the endpoint list

In client-side discovery, the caller fetches healthy instances and decides where to send traffic. The client library usually handles caching, retries, and balancing.

checkout client
    |
    | ask registry
    v
  [A, B, C]
    |
    | round robin / least loaded
    v
 payment instance B

This pattern gives the client more control. It can prefer same-zone endpoints, avoid overloaded nodes, or attach richer policies before making the call.

Good fit when the platform already ships a smart client library, like a strong internal SDK.
Harder fit when many languages exist, because every language now needs correct balancing and health behavior.
Caching helps latency, but stale caches need TTLs and refresh-on-error logic.
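
One way to picture the TTL and refresh-on-error rule is the sketch below. It assumes a hypothetical `fetch` callable that asks the registry for a service's addresses and a hypothetical `send` callable that raises `ConnectionError` when an endpoint is dead; neither is a real library API.

```python
import time


class CachedResolver:
    """Client-side cache of endpoint lists with a TTL and refresh-on-error."""

    def __init__(self, fetch, ttl_seconds=5.0, clock=time.monotonic):
        self.fetch = fetch          # hypothetical: service name -> list of addresses
        self.ttl = ttl_seconds
        self.clock = clock
        self.cache = {}             # service -> (expires_at, addresses)

    def resolve(self, service):
        entry = self.cache.get(service)
        if entry and entry[0] > self.clock():
            return entry[1]         # cache hit: no registry round trip
        return self.refresh(service)

    def refresh(self, service):
        addresses = list(self.fetch(service))
        self.cache[service] = (self.clock() + self.ttl, addresses)
        return addresses

    def call(self, service, send):
        # First attempt uses the cache; a connection error forces one refresh and retry.
        try:
            return send(self._pick(self.resolve(service)))
        except ConnectionError:
            return send(self._pick(self.refresh(service)))

    @staticmethod
    def _pick(addresses):
        if not addresses:
            raise LookupError("no endpoints available")
        return addresses[0]
```

The single retry is a deliberate simplification. A real client would likely add backoff and a retry budget so that a failing registry is not hammered by every caller at once.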

Worked example. An orders service receives three healthy inventory instances from Consul. It picks the closest zone first, then falls back cross-zone only on timeout.
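
The zone preference in this example could look roughly like the following. The `Endpoint` record, the zone labels, and the `send` callable are illustrative assumptions, and the sketch treats a `TimeoutError` as the only signal to move to the next endpoint.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Endpoint:
    address: str
    zone: str


def order_by_locality(endpoints, local_zone):
    """Same-zone endpoints first, then the rest; order is kept within each group."""
    local = [e for e in endpoints if e.zone == local_zone]
    remote = [e for e in endpoints if e.zone != local_zone]
    return local + remote


def call_with_fallback(endpoints, local_zone, send):
    """Try endpoints in locality order, moving on only when a call times out."""
    last_error = None
    for endpoint in order_by_locality(endpoints, local_zone):
        try:
            return send(endpoint)
        except TimeoutError as exc:
            last_error = exc
    # Every endpoint timed out, or the list was empty to begin with.
    raise last_error or LookupError("no endpoints to try")
```

Note where the complexity landed: this logic lives in the caller, so every client language in the fleet would need its own correct copy.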

Simple, no? Client-side discovery keeps power near the caller, but also keeps complexity there.

3) Server-side discovery hides the list behind a proxy or balancer

In server-side discovery, clients call one stable front door. A load balancer, sidecar, or gateway asks the registry and chooses the backend instance.

checkout client
    |
    v
 stable VIP / gateway / sidecar
    |
    | asks registry
    v
 payment instance B

This pattern keeps clients simpler. The routing policy lives in fewer places, which means one team can update balancing logic without touching every application.

Great for polyglot fleets, because Python, Java, Go, and Node clients can stay dumb.
Useful when Envoy, NGINX, or a managed load balancer already exists in the platform path.
Watch the extra hop. Central routing can add latency and become a concentrated failure point.

Worked example. A mobile API calls an internal gateway. The gateway resolves profile-service through etcd-backed xDS data and sends traffic only to endpoints that are passing their health checks.
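
Seen from the gateway's side, the idea can be sketched as below. This is not Envoy or xDS code. The `Gateway` class and its `snapshot` callable, which returns `(address, passing)` pairs for a service, are hypothetical stand-ins for whatever discovery feed the proxy consumes.

```python
import itertools


class Gateway:
    """Toy front door: callers name a service, the gateway picks the backend."""

    def __init__(self, snapshot):
        # hypothetical callable: service name -> list of (address, passing) pairs
        self.snapshot = snapshot
        self.counters = {}          # one round-robin counter per service

    def route(self, service):
        # Filter on health first, then rotate through what is left.
        passing = [address for address, ok in self.snapshot(service) if ok]
        if not passing:
            raise LookupError(f"no passing endpoints for {service}")
        counter = self.counters.setdefault(service, itertools.count())
        return passing[next(counter) % len(passing)]
```

The caller only ever knows the gateway's address and the service name. Swapping round robin for another policy means changing `route` once, not every client.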

So the town directory is still there. The caller simply does not open it directly.

4) Consul, etcd, DNS, health checks, and load balancing together

Now combine the pieces. A service registry is useful only when registrations stay fresh and traffic selection respects health.

Consul offers service registration and health checks, and it integrates well with sidecars and gateways.
etcd is often the strongly consistent store underneath a control plane, which then publishes discovery data to other components.
DNS-based discovery is simpler. A service name resolves to changing records, but TTLs and caching limit freshness.
Health checks decide whether an instance should remain discoverable. Registration without health is a lie written neatly.
Load balancers consume discovery data and apply routing methods like round robin, least requests, or locality preference.
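
As one example of a routing method fed by discovery data, here is a minimal least-requests picker. The `LeastRequestsBalancer` class and its method names are assumptions for illustration. Production balancers typically add sampling, weights, and thread safety, all of which this sketch omits.

```python
class LeastRequestsBalancer:
    """Pick the endpoint with the fewest in-flight requests."""

    def __init__(self, addresses):
        self.in_flight = {address: 0 for address in addresses}

    def update(self, addresses):
        # Apply a fresh discovery result: keep counts for survivors, drop the rest.
        self.in_flight = {a: self.in_flight.get(a, 0) for a in addresses}

    def acquire(self):
        if not self.in_flight:
            raise LookupError("no endpoints in pool")
        # Ties go to the earliest-listed address because dicts keep insertion order.
        address = min(self.in_flight, key=self.in_flight.get)
        self.in_flight[address] += 1
        return address

    def release(self, address):
        # The endpoint may have been removed by update() while the request ran.
        if address in self.in_flight:
            self.in_flight[address] -= 1
```

Callers pair every `acquire` with a `release` when the request finishes, and the discovery watcher calls `update` whenever the healthy set changes.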

instance starts
   -> register in registry
   -> pass readiness check
   -> appear in balancer pool

instance unhealthy
   -> fail health check
   -> removed from pool
   -> optionally deregister

Worked example. A pod becomes Ready after warming caches. Only then should Kubernetes Service endpoints or an external balancer send live traffic there.
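
The two lifecycles above reduce to a small rule: an instance is routable only while it is both registered and passing. A minimal sketch of that rule follows, with the `BalancerPool` class and its method names invented for the example rather than taken from Kubernetes or any balancer.

```python
class BalancerPool:
    """Track which registered instances are currently allowed to receive traffic."""

    def __init__(self):
        self.registered = set()
        self.ready = set()

    def register(self, address):
        # Registration alone does not make an instance routable.
        self.registered.add(address)

    def report_health(self, address, passing):
        if address not in self.registered:
            return                  # ignore checks for unknown instances
        if passing:
            self.ready.add(address)
        else:
            self.ready.discard(address)

    def deregister(self, address):
        self.registered.discard(address)
        self.ready.discard(address)

    def routable(self):
        return sorted(self.ready)
```

A failing check removes the instance from the pool but leaves it registered, so it can rejoin as soon as it passes again.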

DNS can act like a lightweight town directory, but remember TTL lag. A short TTL improves freshness while increasing query load.
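
A rough back-of-the-envelope model makes the trade-off concrete. It assumes each caching resolver honors the TTL and re-queries once per TTL for a given name, which ignores negative caching, prefetching, and clients that cache longer than they should. The function name and the numbers are illustrative only.

```python
def dns_tradeoff(resolvers, ttl_seconds):
    """Return (worst-case staleness in seconds, steady-state queries per second)."""
    # A record cached just before a change can be served for up to one full TTL.
    worst_staleness = ttl_seconds
    # Each resolver re-asks once per TTL, so load scales inversely with the TTL.
    queries_per_second = resolvers / ttl_seconds
    return worst_staleness, queries_per_second


# 600 caching resolvers asking for one service name:
print(dns_tradeoff(600, 30))   # (30, 20.0)
print(dns_tradeoff(600, 5))    # (5, 120.0)
```

Under these assumptions, cutting the TTL from 30 seconds to 5 shrinks the stale window sixfold and multiplies query load by the same factor.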

The lesson is practical. Discovery, health, and balancing are one pipeline. Break one stage, and requests suffer even if the other two work perfectly.

Where this lives in the wild

Netflix platform engineer — uses discovery and routing libraries so service names resolve to healthy instances across many rapidly changing deployments.
Google Kubernetes platform engineer — relies on Services, EndpointSlices, and cluster DNS so pods find each other despite constant rescheduling.
HashiCorp operator at a fintech — uses Consul registration and health checks to keep internal APIs discoverable and safely routable.
Amazon load balancing engineer — connects health signals and target groups so traffic shifts away from draining or failing instances quickly.
Uber infrastructure engineer — combines discovery data with locality-aware load balancing to reduce cross-zone latency and failure blast radius.

Pause and recall

Why do static IP lists fail quickly in container and autoscaling environments?
What extra responsibility does client-side discovery place on application libraries?
Why can server-side discovery simplify a polyglot fleet at the cost of another hop?
How do health checks, registry freshness, and load balancing depend on each other?

Interview Q&A

Q: Why is service discovery needed even when every service already has a name? A: Because names must resolve to current healthy addresses. The name alone does not tell you which instances are alive right now.

Common wrong answer to avoid: "DNS or config files already make discovery a solved non-problem."

Q: Why pick client-side discovery in some systems? A: Because smart clients can use richer policies like zone affinity, adaptive retries, or custom balancing based on application context.

Common wrong answer to avoid: "Client-side discovery is always simpler because there is no proxy."

Q: Why pick server-side discovery in many platforms? A: Because routing behavior can be centralized in gateways or sidecars, keeping application code smaller across many languages.

Common wrong answer to avoid: "Server-side discovery removes all latency and all operational risk."

Q: Why are health checks inseparable from service discovery? A: Because a registry that keeps unhealthy instances advertised is worse than no registry. It actively routes users to failure.

Common wrong answer to avoid: "Registration time is enough; runtime health does not matter much."

Apply now (5 min)

Pick one service call in your system, then redesign its discovery path. Write the service name, the registry source, the health signal, the caching rule, and whether routing should happen in the client or in a proxy.

Sketch from memory:

the registry table mapping one service name to three changing healthy endpoints,
the client-side discovery flow where the caller chooses endpoint B,
and the server-side discovery flow where a stable front door chooses the backend.

Bridge. Finding a service is not enough; the next job is surviving when that service misbehaves. → 13-circuit-breaker-bulkhead.md