01. Metrics, Logs, and Traces¶

⏱️ Estimated time: 18 min | Level: intermediate

ELI5 callback: In the hospital analogy, the thermometer, medical chart, and X-ray matter because one patient can look stable from only one angle.

1) The three signals answer different questions¶

Observability starts with questions, not tools. Use the thermometer first to spot rate, latency, and saturation shifts.

Metrics answer how much, how often, and how bad.

Logs answer what exactly happened at a moment.

Traces answer where the request spent time end to end.

One signal alone always hides some truth.

See. Fast teams reduce ambiguity before they chase fixes.

Now watch. Each signal removes a different kind of blindness.

A healthy stack keeps all three available during stress.

┌──────────┬───────────────────────┬─────────────────────────┐
│ Signal   │ Best question         │ Typical weakness        │
├──────────┼───────────────────────┼─────────────────────────┤
│ Metrics  │ Is something off?     │ Low detail              │
│ Logs     │ What exactly broke?   │ High volume             │
│ Traces   │ Where was time spent? │ Sampling and overhead   │
└──────────┴───────────────────────┴─────────────────────────┘

Open the X-ray when one request crosses many services.

Start with the user complaint, then pick the right signal.
Use metrics first when you need a quick system pulse.
Use logs when you need fields, messages, and exact payload clues.
Use traces when latency hides across several services.

2) When metrics win¶

Metrics compress many events into simple numbers.

That makes dashboards cheap to scan and cheap to alert on.

A rate, latency, or saturation graph tells you trend and shape.

Metrics also aggregate well across hosts, regions, and tenants.

So what to do first during an outage?

Check error rate, request rate, and latency before anything else. Read the medical chart (the logs) afterwards, when exact payload and error fields matter.

Simple, no? You want a map before street-level detail.

Metrics are strongest for detection and capacity planning.

Counters show totals like requests, errors, and retries.
Gauges show current state like queue depth or memory usage.
Histograms show latency spread instead of a single average.
Ratios like success rate keep business impact visible.
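The four metric shapes above can be sketched in plain Python. This is a minimal illustration, not a real metrics client: the `Histogram` class, the bucket bounds, and the metric names are all assumptions made up for the example. It also shows why a histogram beats an average: the mean looks healthy while the 99th percentile does not.

```python
from bisect import bisect_left
from collections import Counter


class Histogram:
    """Hypothetical fixed-bucket histogram; bounds are upper limits in ms."""

    def __init__(self, bounds):
        self.bounds = sorted(bounds)
        # One extra slot counts observations above the largest bound.
        self.counts = [0] * (len(self.bounds) + 1)
        self.total = 0

    def observe(self, value_ms):
        # bisect_left puts a value equal to a bound into that bound's bucket.
        self.counts[bisect_left(self.bounds, value_ms)] += 1
        self.total += 1

    def quantile_upper_bound(self, q):
        """Return the upper bound of the bucket that contains quantile q."""
        target = q * self.total
        seen = 0
        for bound, count in zip(self.bounds + [float("inf")], self.counts):
            seen += count
            if seen >= target:
                return bound
        return float("inf")


counters = Counter()             # counters only ever go up
gauges = {"queue_depth": 0}      # gauges hold the latest observed value
latency = Histogram([50, 100, 250, 1000])

# Assumed traffic: 95 fast requests and 5 slow ones.
samples_ms = [20] * 95 + [900] * 5
for ms in samples_ms:
    counters["requests_total"] += 1
    latency.observe(ms)
counters["errors_total"] += 2
gauges["queue_depth"] = 7

# A ratio keeps business impact visible.
success_rate = 1 - counters["errors_total"] / counters["requests_total"]
mean_ms = sum(samples_ms) / len(samples_ms)
p50_ms = latency.quantile_upper_bound(0.50)
p99_ms = latency.quantile_upper_bound(0.99)
```

With this made-up traffic the mean sits well under 100 ms, yet the p99 bucket is the 1000 ms one. The tail pain is only visible in the histogram.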

3) When logs win¶

Logs preserve event detail that aggregates would throw away.

You see user ids, request ids, error classes, and branch choices.

That detail matters when two failures share one metric spike. One monitor alarm should page only for user pain.

Logs are also useful for audit trails and business workflows.

But raw logs can become noisy very fast.

See. Unstructured text without fields becomes search pain.

Good logs are sparse, structured, and tied to request context.

Keep the question in mind before storing everything forever.

Prefer JSON or key-value fields over random prose.
Include request_id, user_id, service, route, and status.
Log state transitions and exceptional paths, not every loop step.
Redact secrets and personal data before shipping logs.
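The logging habits above could look roughly like this with Python's standard `logging` module. Treat it as a sketch under assumptions: the `JsonFormatter` class, the `fields` key passed through `extra`, the logger name, and the list of redacted keys are all hypothetical choices for illustration.

```python
import io
import json
import logging

# Hypothetical list of keys that must never leave the process in clear text.
REDACTED_KEYS = {"password", "card_number", "email"}


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object with redacted sensitive keys."""

    def format(self, record):
        payload = {"level": record.levelname, "message": record.getMessage()}
        # "fields" is our own convention, attached via logging's `extra`.
        for key, value in getattr(record, "fields", {}).items():
            payload[key] = "[REDACTED]" if key in REDACTED_KEYS else value
        return json.dumps(payload, sort_keys=True)


def build_logger(stream):
    logger = logging.getLogger("checkout")  # assumed service name
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


buffer = io.StringIO()
log = build_logger(buffer)

# Log an exceptional path with request context, not every loop step.
log.warning(
    "payment declined",
    extra={"fields": {
        "request_id": "req-123",
        "user_id": "u-42",
        "service": "checkout",
        "route": "/pay",
        "status": 402,
        "card_number": "4111111111111111",
    }},
)

line = json.loads(buffer.getvalue())
```

Each line is now searchable by field, and the card number never reaches the log stream.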

Another monitor alarm should watch error-budget burn rate.
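Burn rate is commonly defined as the observed error ratio divided by the error budget that the SLO allows. A minimal sketch of that arithmetic follows; the function name, the SLO target, and the paging threshold are assumptions picked for illustration, not recommendations.

```python
def burn_rate(error_ratio, slo_target):
    """How fast the error budget is being spent.

    A value of 1.0 means the budget would last exactly the SLO window.
    """
    budget = 1.0 - slo_target
    if budget <= 0:
        raise ValueError("slo_target must be below 1.0")
    return error_ratio / budget


# Assumed numbers: a 99.9% SLO and 0.5% of requests currently failing.
rate = burn_rate(error_ratio=0.005, slo_target=0.999)

PAGE_THRESHOLD = 10.0  # hypothetical threshold for this example
should_page = rate >= PAGE_THRESHOLD
```

Here the budget burns about five times faster than planned, which is worth a ticket under this made-up threshold but not a page.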

4) When traces win¶

Traces connect one request across many services and hops.

They show call order, parent-child spans, and wait time.

That is why traces shine for tail latency and hidden dependencies.

They also expose retries, fan-out, and slow downstream links.

A single flame graph can collapse hours of guessing.

Now watch. The path often matters more than one bad host.

Traces are weaker for long-term aggregation: sampling keeps only a fraction of requests, so totals built from traces alone are incomplete.

Use them for causality, not for every dashboard tile. Keep a short playbook for signal triage during confusion.

Trace ids link all spans inside one request journey.
Span attributes explain operation name, status, and key tags.
Parent-child relationships reveal which service blocked progress.
Correlate traces back to metrics and logs for the full story.
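To make the parent-child idea concrete, here is a small sketch that models spans as plain data and asks which span spent the most time in its own work. The `Span` fields, the service names, and the timings are invented for the example, and the self-time calculation assumes that child spans do not overlap.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Span:
    trace_id: str
    span_id: str
    parent_id: Optional[str]
    service: str
    name: str
    start_ms: int
    end_ms: int

    @property
    def duration_ms(self):
        return self.end_ms - self.start_ms


def self_time_ms(span, spans):
    """Time spent in the span itself, excluding its direct children.

    Assumes children run one after another without overlapping.
    """
    children = [s for s in spans if s.parent_id == span.span_id]
    return span.duration_ms - sum(c.duration_ms for c in children)


def slowest_by_self_time(spans):
    return max(spans, key=lambda s: self_time_ms(s, spans))


# One request journey: every span shares the same trace id.
spans = [
    Span("t1", "a", None, "gateway", "POST /checkout", 0, 500),
    Span("t1", "b", "a", "checkout", "place_order", 10, 480),
    Span("t1", "c", "b", "payments", "charge_card", 20, 420),
    Span("t1", "d", "b", "inventory", "reserve_items", 430, 470),
]

culprit = slowest_by_self_time(spans)
```

The gateway span is the longest overall, but almost all of its time is spent waiting on children. Self time points at the payments call instead.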

5) The real power is correlation¶

Mature teams move between signals instead of defending one favorite.

A latency spike starts in metrics, narrows in traces, and finishes in logs.

Sometimes the order flips when a customer reports one broken request.

Then a trace gives the path and metrics show blast radius.

See. Complement beats purity.

The best dashboards link charts to traces and logs directly.

Query language matters, but navigation speed matters more.

Design the stack so one clue can open the next clue quickly.

Use shared labels like service, route, tenant, and version.
Pass correlation ids through every synchronous and async boundary.
Keep retention tuned to investigation depth, not vanity collection.
Review missing visibility after every serious incident.

Your playbook should say which query comes first.
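One way to carry a correlation id across an async boundary in Python is `contextvars`, since `asyncio` tasks copy the current context when they are created. The sketch below is an assumption-heavy illustration: the function names, service names, and the in-memory `events` list stand in for real handlers and a real log pipeline.

```python
import asyncio
import contextvars
import uuid

correlation_id = contextvars.ContextVar("correlation_id", default="unset")


def log_event(events, service, message):
    # Every event picks up the id from the current context automatically.
    events.append({
        "correlation_id": correlation_id.get(),
        "service": service,
        "message": message,
    })


async def charge_card(events):
    await asyncio.sleep(0)  # yield so concurrent requests interleave
    log_event(events, "payments", "card charged")


async def handle_request(events, incoming_id=None):
    # Reuse the caller's id when present, otherwise mint a new one.
    correlation_id.set(incoming_id or uuid.uuid4().hex)
    log_event(events, "checkout", "request received")
    # create_task copies the current context, so the id follows the work.
    await asyncio.create_task(charge_card(events))


async def main():
    events = []
    # gather wraps each coroutine in its own task with its own context copy.
    await asyncio.gather(
        handle_request(events, "req-a"),
        handle_request(events, "req-b"),
    )
    return events


events = asyncio.run(main())
```

Even though the two requests interleave, each event keeps the id of the request that caused it, so one id can pull up the matching logs and spans.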

Where this lives in the wild¶

E-commerce checkout teams use metrics for cart drops, logs for failed payments, and traces for slow dependencies.
Ride-hailing platforms trace dispatch paths because one request crosses many internal services.
SaaS B2B products keep structured logs for audit-heavy admin workflows.
Streaming platforms watch latency histograms because averages hide tail pain.
Platform teams wire chart-to-trace links so on-call engineers pivot fast.

Pause and recall¶

Which signal helps first when you only know that error rate jumped?
Why can logs solve a mystery that metrics cannot explain alone?
Why do traces matter more as service count grows?
What shared labels make correlation faster across all three signals?

Interview Q&A¶

Q: Why are metrics usually the first stop during an incident?
A: They summarize system health fast across large scope, so you see trend and blast radius before diving into detail.
Common wrong answer to avoid: "Because metrics are always more accurate" - they are faster to scan, but they intentionally hide detail.

Q: Why not rely only on logs for observability?
A: Logs carry detail, but they are expensive to search at scale and weak for broad trend detection.
Common wrong answer to avoid: "Because logs are old technology" - the problem is not age; the problem is search cost and signal overload.

Q: Why do traces become more valuable in microservices?
A: Request latency and failures spread across boundaries, so causality is harder to see without end-to-end paths.
Common wrong answer to avoid: "Because traces replace dashboards" - traces explain paths, while dashboards still show trends and scope.

Q: How should the three pillars work together in practice?
A: Use metrics for detection, traces for path narrowing, and logs for precise event detail and remediation clues.
Common wrong answer to avoid: "Pick one based on team preference" - the strongest workflow is coordinated, not ideological.

Apply now (5 min)¶

Pick one recent bug from your work. Write one metric you would watch, one log field you would need, and one trace span you would inspect. Then note the first clue and the second clue in your preferred investigation order. Keep it concrete. One request path. One failure mode.

Bridge. Three instruments identified. But how do we collect them consistently? → 02