05. Distributed Tracing

⏱️ Estimated time: 20 min | Level: advanced

ELI5 callback: In the hospital analogy, the X-ray shows the path of the problem, while the thermometer shows spread and the medical chart explains exact events.

1) The trace model: trace, span, and context

A thermometer spike tells you when to open traces.

A trace represents one end-to-end request journey.

A span represents one timed operation inside that journey.

Spans can be nested to show causality and waiting.

Context carries the trace id and parent relationship forward.

See. Without context, you have fragments, not a trace.

Span names should describe operations, not random code symbols.

Attributes should explain route, dependency, status, and version.

Events inside spans capture meaningful milestones without starting new spans blindly.

┌──────────── trace abc123 ────────────┐
│ root span: HTTP /checkout            │
│ ├─ span: auth service                │
│ ├─ span: inventory check             │
│ └─ span: payment gateway             │
└──────────────────────────────────────┘

The X-ray is useful only when propagation stays intact.

One request usually maps to one root span.
Child spans reveal fan-out and nested waiting.
Attributes should stay bounded and searchable.
Status codes should reflect success, error, or timeout clearly.
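
To make the model above concrete, here is a minimal sketch of a trace, its spans, and the context that links them. The class names, field names, and attribute keys (`http.route`, `service.version`) are hypothetical choices for illustration, not the API of any particular tracing library.

```python
from __future__ import annotations

import secrets
import time
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class SpanContext:
    """The part that travels between services: ids only, no payload."""
    trace_id: str
    span_id: str


@dataclass
class Span:
    name: str                      # an operation, e.g. "HTTP /checkout"
    context: SpanContext
    parent_id: Optional[str]       # None marks the root span
    start: float = field(default_factory=time.monotonic)
    end: Optional[float] = None
    attributes: dict[str, str] = field(default_factory=dict)
    events: list[tuple[float, str]] = field(default_factory=list)
    status: str = "unset"          # later "ok", "error", or "timeout"

    def add_event(self, message: str) -> None:
        # A milestone inside the span; cheaper than starting a child span.
        self.events.append((time.monotonic(), message))

    def finish(self, status: str = "ok") -> None:
        self.end = time.monotonic()
        self.status = status


def start_span(name: str, parent: Optional[Span] = None) -> Span:
    # A child reuses the parent's trace id; a root span mints a new one.
    trace_id = parent.context.trace_id if parent else secrets.token_hex(16)
    ctx = SpanContext(trace_id=trace_id, span_id=secrets.token_hex(8))
    parent_id = parent.context.span_id if parent else None
    return Span(name=name, context=ctx, parent_id=parent_id)


root = start_span("HTTP /checkout")
root.attributes.update({"http.route": "/checkout", "service.version": "1.4.2"})
payment = start_span("payment gateway", parent=root)
payment.add_event("card authorized")
payment.finish("ok")
root.finish("ok")
```

The shared `trace_id` plus the `parent_id` pointer is what turns separate spans into one tree. If either is lost, the spans still exist but no longer form a trace.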

2) Propagation is the whole game

Trace quality depends more on propagation than on beautiful UIs.

If headers or message metadata lose context, the story breaks.

HTTP, gRPC, queues, and batch jobs all need explicit handling.
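
As one illustration of explicit handling, the sketch below writes and reads context using the W3C Trace Context `traceparent` layout (`version-traceid-parentid-flags`). The lesson does not name a wire format, so choosing this one is an assumption. The helper names `inject` and `extract` are hypothetical, and the plain `dict` stands in for HTTP headers or message metadata.

```python
import re
from typing import Optional, Tuple

# version "00", 32 hex chars of trace id, 16 hex chars of parent span id,
# and 2 hex chars of flags, all lowercase.
_TRACEPARENT = re.compile(r"00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})")


def inject(carrier: dict, trace_id: str, span_id: str, sampled: bool = True) -> None:
    """Write the current context into outgoing headers or message metadata."""
    flags = "01" if sampled else "00"
    carrier["traceparent"] = f"00-{trace_id}-{span_id}-{flags}"


def extract(carrier: dict) -> Optional[Tuple[str, str, bool]]:
    """Read (trace_id, parent_span_id, sampled) or None if absent or malformed."""
    match = _TRACEPARENT.fullmatch(carrier.get("traceparent", ""))
    if match is None:
        return None
    trace_id, parent_id, flags = match.groups()
    # All-zero ids are treated as invalid.
    if set(trace_id) == {"0"} or set(parent_id) == {"0"}:
        return None
    return trace_id, parent_id, bool(int(flags, 16) & 0x01)


outgoing = {}
inject(outgoing, trace_id="4bf92f3577b34da6a3ce929d0e0e4736", span_id="00f067aa0ba902b7")
incoming = extract(outgoing)
```

When `extract` returns `None`, the receiving service has no parent to attach to and would start a new, disconnected trace. This sketch ignores header-name case folding, which a real HTTP integration would need to handle.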

Async boundaries are where many teams lose the chain.

So what to do?

Standardize propagation libraries and test them in integration environments.

Include retries and dead-letter paths in those tests.

The medical chart should store trace ids for exact event lookup.

Simple, no? If the baton drops, the race story dies.

Propagate through synchronous calls and asynchronous messages.
Capture links when one span relates to another without direct parentage.
Preserve ids across thread pools and worker handoffs.
Validate propagation after framework or proxy upgrades.
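
The thread pool item above is easy to demonstrate in-process. The following is a minimal sketch, assuming the active trace id is kept in a `contextvars.ContextVar`; the variable and helper names are hypothetical. It copies the caller's context explicitly before handing work to a pool.

```python
import contextvars
from concurrent.futures import ThreadPoolExecutor
from typing import Optional

# Hypothetical holder for the active trace id in this process.
current_trace_id: contextvars.ContextVar = contextvars.ContextVar(
    "current_trace_id", default=None
)


def worker_task() -> Optional[str]:
    # Runs on a pool thread; reports whichever trace id it can see.
    return current_trace_id.get()


def submit_with_context(pool: ThreadPoolExecutor, fn, *args):
    # Snapshot the caller's context and run the task inside that snapshot.
    ctx = contextvars.copy_context()
    return pool.submit(ctx.run, fn, *args)


current_trace_id.set("abc123")
with ThreadPoolExecutor(max_workers=1) as pool:
    plain = pool.submit(worker_task).result()
    kept = submit_with_context(pool, worker_task).result()
```

Here `kept` carries the caller's trace id. With a plain `submit`, the worker thread usually sees only the default value, which is how a span silently loses its parent at a handoff. The exact behavior of the plain call can depend on the Python version and build, so an integration test is the safer check.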

3) Sampling decides cost and visibility trade-offs

Full tracing for every request can be expensive.

Sampling controls that cost.

One monitor alarm can point responders to exemplar traces.

Head-based sampling decides early, often at request start.

Tail-based sampling decides later, after seeing outcomes.

Tail sampling is stronger for rare slow or failed requests.

Head sampling is simpler and cheaper in the hot path.
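
A head-based decision can be sketched as a pure function of the trace id, so every service that sees the same id reaches the same verdict without coordination. This is an illustrative ratio sampler under that assumption, not the exact algorithm of any specific tracing SDK. The function name is hypothetical.

```python
def head_sample(trace_id: str, ratio: float) -> bool:
    """Decide at request start, using nothing but the trace id."""
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must be between 0.0 and 1.0")
    # Treat the low 64 bits of the hex trace id as a number in [0, 2**64).
    bucket = int(trace_id[-16:], 16)
    # Keep the trace when its bucket falls below the chosen fraction.
    return bucket < ratio * 2**64


always = head_sample("4bf92f3577b34da6a3ce929d0e0e4736", 1.0)
never = head_sample("4bf92f3577b34da6a3ce929d0e0e4736", 0.0)
```

The decision is cheap because it needs no request outcome. That is also its limit: it cannot know that this particular request will turn out slow or failed.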

Now watch. Sampling policy changes what your team can learn.

Match the policy to investigation goals, not only storage budget.

Keep error traces at higher sample rates than healthy traces.
Consider route-aware policies for critical customer paths.
Watch bias when sampling misses rare but important patterns.
Document what investigators should expect to be absent.
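
The list above can be sketched as a tail-based policy that runs once a trace has finished. Every threshold, ratio, and route name below is a hypothetical value chosen for illustration. Keeping all error traces is one possible reading of "higher sample rates", not a rule from this lesson.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class FinishedTrace:
    trace_id: str        # 32 hex characters
    route: str
    duration_ms: float
    has_error: bool


# Hypothetical policy values; tune them to investigation goals.
CRITICAL_ROUTES = {"/checkout"}
SLOW_MS = 1000.0
HEALTHY_KEEP_RATIO = 0.01
CRITICAL_KEEP_RATIO = 0.10


def tail_decision(trace: FinishedTrace) -> Tuple[bool, str]:
    """Return (keep, reason) after the outcome of the request is known."""
    if trace.has_error:
        return True, "error"
    if trace.duration_ms >= SLOW_MS:
        return True, "slow"
    # Healthy traffic: route-aware baseline, deterministic per trace id.
    ratio = CRITICAL_KEEP_RATIO if trace.route in CRITICAL_ROUTES else HEALTHY_KEEP_RATIO
    bucket = int(trace.trace_id[-16:], 16)
    if bucket < ratio * 2**64:
        return True, "baseline"
    return False, "dropped"


decision = tail_decision(FinishedTrace("f" * 32, "/search", 1500.0, False))
```

Returning a reason alongside the verdict gives investigators a record of why a trace was kept. Publishing the ratios tells them what they should expect to be missing.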

Another monitor alarm can watch sampling or collector drops.

4) Reading a trace visualization well

A waterfall is not just a pretty chart.

It tells you sequencing, concurrency, and blocked time.

Start at the root span and scan for the longest gap.

Then check whether time is compute, network, queue, or dependency wait.

See. Latency often hides in waiting, not in CPU.
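
One way to make "scan for the longest gap" mechanical is to compute each span's self time: the part of its duration not covered by any child. The sketch below assumes spans arrive as simple records with millisecond timestamps. The record shape and the timings are invented for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class SpanRecord:
    span_id: str
    parent_id: Optional[str]
    name: str
    start_ms: float
    end_ms: float


def self_time_ms(span: SpanRecord, children: List[SpanRecord]) -> float:
    """Time inside `span` that no child span accounts for."""
    covered = 0.0
    cursor = span.start_ms  # everything before cursor is already counted
    for child in sorted(children, key=lambda c: c.start_ms):
        start = max(child.start_ms, cursor)
        end = min(child.end_ms, span.end_ms)
        if end > start:  # skip children fully overlapped by earlier ones
            covered += end - start
            cursor = end
    return (span.end_ms - span.start_ms) - covered


def rank_by_self_time(spans: List[SpanRecord]) -> List[Tuple[float, str]]:
    children = defaultdict(list)
    for s in spans:
        if s.parent_id is not None:
            children[s.parent_id].append(s)
    ranked = [(self_time_ms(s, children[s.span_id]), s.name) for s in spans]
    return sorted(ranked, reverse=True)


spans = [
    SpanRecord("r", None, "HTTP /checkout", 0.0, 500.0),
    SpanRecord("a", "r", "auth service", 10.0, 60.0),
    SpanRecord("i", "r", "inventory check", 60.0, 120.0),
    SpanRecord("p", "r", "payment gateway", 130.0, 480.0),
]
ranked = rank_by_self_time(spans)
# ranked[0] is (350.0, "payment gateway")
```

A large self time on a leaf span can be compute or a wait on something uninstrumented. A large self time on a parent usually means a gap between children, such as queueing or scheduling, and is worth a closer look.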

Compare healthy and unhealthy traces for the same route.

Look for new branches, retries, or widened child spans after deploys.

Trace comparison is one of the fastest ways to see regression shape.

A playbook should say how to inspect a slow waterfall first.

Focus on changed structure before chasing tiny duration noise.
Compare percentiles of trace durations, not only one sample.
Inspect span attributes for version and region drift.
Use exemplars if your metrics backend supports them.
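
Comparing structure first can be as simple as counting span names in a healthy trace and a slow trace for the same route. The sketch below does that. The span names in the slow trace, including `fraud check`, are hypothetical.

```python
from collections import Counter
from typing import Dict, List


def structure_diff(healthy_names: List[str], slow_names: List[str]) -> Dict[str, int]:
    """Span-count change per operation name; positive means more in the slow trace."""
    healthy, slow = Counter(healthy_names), Counter(slow_names)
    diff = {}
    for name in sorted(set(healthy) | set(slow)):
        delta = slow[name] - healthy[name]
        if delta != 0:
            diff[name] = delta
    return diff


healthy = ["HTTP /checkout", "auth service", "inventory check", "payment gateway"]
slow = [
    "HTTP /checkout", "auth service", "inventory check",
    "payment gateway", "payment gateway", "payment gateway",  # retries
    "fraud check",                                            # new branch
]
changes = structure_diff(healthy, slow)
# changes == {"fraud check": 1, "payment gateway": 2}
```

Extra spans of an existing name suggest retries. A name that appears only in the slow trace suggests a new branch, often introduced by a deploy.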

5) Correlation makes traces operationally useful

Traces alone are powerful, but they become operational when linked.

A dashboard should open a representative slow trace quickly.

A log entry should include the trace id for request reconstruction.

An incident review should ask which traces were missing.

See. Tooling matters, but workflow matters more.

Correlation also helps with ownership.

When a span shows a slow dependency, routing gets faster.

When a trace shows retry storms, mitigation gets sharper.

Link metrics panels to exemplars or trace search templates.
Put trace ids into structured logs automatically.
Use service maps carefully; they hint topology, not root cause.
Review trace gaps after every major architecture change.

The playbook should include correlation steps back to logs and metrics.
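
Putting trace ids into structured logs automatically can be sketched with the standard `logging` module. A filter stamps each record with the active trace id and a formatter emits JSON. The context variable, class names, and JSON field names are hypothetical. The in-memory stream is only there so the output is easy to inspect.

```python
import contextvars
import io
import json
import logging

# Hypothetical holder for the active trace id; "-" means no active trace.
current_trace_id = contextvars.ContextVar("current_trace_id", default="-")


class TraceIdFilter(logging.Filter):
    def filter(self, record: logging.LogRecord) -> bool:
        # Attach the id to every record so call sites never pass it by hand.
        record.trace_id = current_trace_id.get()
        return True


class JsonFormatter(logging.Formatter):
    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "level": record.levelname,
            "message": record.getMessage(),
            "trace_id": getattr(record, "trace_id", "-"),
        })


stream = io.StringIO()  # a real service would write to stdout or a log shipper
handler = logging.StreamHandler(stream)
handler.setFormatter(JsonFormatter())
handler.addFilter(TraceIdFilter())

logger = logging.getLogger("checkout")
logger.setLevel(logging.INFO)
logger.addHandler(handler)
logger.propagate = False

current_trace_id.set("abc123")
logger.info("payment declined")
entry = json.loads(stream.getvalue())
```

With the id in every log entry, an engineer can search logs by `trace_id` and jump from an exact event to the full request path.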

Where this lives in the wild

Microservice-heavy commerce stacks rely on traces for checkout fan-out paths.
API gateway teams use traces to spot proxy retries and upstream wait time.
Queue-based architectures trace message hops to reveal backlog and poison-message pain.
Platform observability teams use trace exemplars to jump from latency spikes to real requests.
Service owners compare traces before and after releases to confirm regression shape quickly.

Pause and recall

What breaks when context propagation fails across a queue or proxy?
Why can tail sampling be better for rare failures?
What should you inspect first in a waterfall view?
Why are traces much stronger when linked to metrics and logs?

Interview Q&A

Q: Why is propagation more important than fancy trace visualization?
A: Because without continuous context, the UI only shows disconnected spans and cannot explain end-to-end causality.
Common wrong answer to avoid: "Because UIs do not matter" - they do matter; broken context just destroys the raw story first.

Q: When would tail sampling beat head sampling?
A: When rare slow or failed requests are the main target, because tail decisions can retain unusual outcomes more reliably.
Common wrong answer to avoid: "Tail sampling is always better" - it is richer, but also more operationally complex and costly.

Q: How do you read a slow trace effectively?
A: Start at the root, find the longest waiting segments, and compare the structure with healthy traces for the same route.
Common wrong answer to avoid: "Just inspect the last span" - the problem can begin much earlier in the chain.

Q: Why put trace ids into logs?
A: They let engineers pivot from exact events to the full request path, which speeds reconstruction during debugging.
Common wrong answer to avoid: "Only for compliance" - the main gain is investigation speed and correlation.

Apply now (5 min)

Draw one request path from browser to database in your system. Mark the root span, three child spans, and one async boundary where propagation could fail. Then choose one sampling rule for healthy traffic and one for errors. Keep the design small enough to explain in two minutes.

Bridge. Traces reveal problems. But what do we DO when things break? → 06