02. OpenTelemetry Instrumentation

⏱️ Estimated time: 20 min | Level: intermediate

ELI5 callback: In the hospital analogy, the thermometer, medical chart, and X-ray are only useful when every ward records them with the same labels.

1) Why OpenTelemetry matters

The thermometer becomes trustworthy only after every team uses consistent instrumentation names. OpenTelemetry gives one vocabulary for signals across languages.

That solves a very boring but very painful problem.

Otherwise, different teams emit different names, units, and tags.

Then your dashboards fight the codebase instead of helping it.

See. Standardization is reliability work, not paperwork.

OTel does not replace backends like Grafana or Datadog.

It standardizes instrumentation, context, and export paths.

That keeps vendor choice from leaking into every service.

The X-ray breaks when context propagation headers are dropped.

┌──────────┐  SDK   ┌─────────────┐   export   ┌──────────────┐
│ Service  │ ─────→ │  OTel SDK   │ ─────────→ │  Collector   │
└──────────┘        ├─────────────┤            ├──────────────┤
                    │ traces      │            │ process      │
                    │ metrics     │            │ batch        │
                    │ logs        │            │ route        │
                    └─────────────┘            └──────┬───────┘
                                                      ▼
                                            observability backend

Common semantic conventions keep names and tags comparable.
Resource attributes identify service, version, region, and host.
Context propagation keeps one request connected across boundaries.
Exporters send data onward without hardwiring your app to one vendor.
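
To make context propagation concrete, here is a minimal sketch using only the Python standard library. It follows the W3C `traceparent` header layout (`version-traceid-parentid-flags`), but the helper names `extract`, `inject`, and `start_span` are made up for illustration and are not the OTel SDK API. It also skips details a real propagator handles, such as rejecting all-zero IDs.

```python
import re
import secrets

# W3C Trace Context "traceparent" layout: version-traceid-parentid-flags
TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def extract(headers):
    """Return (trace_id, parent_span_id) from incoming headers, or None."""
    match = TRACEPARENT_RE.match(headers.get("traceparent", ""))
    if match is None:
        return None
    trace_id, parent_id, _flags = match.groups()
    return trace_id, parent_id


def start_span(incoming_headers):
    """Continue the caller's trace if a valid header arrived, else start a new one."""
    parent = extract(incoming_headers)
    trace_id = parent[0] if parent else secrets.token_hex(16)
    return {
        "trace_id": trace_id,
        "span_id": secrets.token_hex(8),
        "parent_id": parent[1] if parent else None,
    }


def inject(span, outgoing_headers):
    """Write the current span into outgoing headers so the next hop can join."""
    outgoing_headers["traceparent"] = f"00-{span['trace_id']}-{span['span_id']}-01"
    return outgoing_headers


# Service A starts a trace and calls service B with the header attached.
span_a = start_span({})
headers = inject(span_a, {})
span_b = start_span(headers)   # same trace, child of A
span_lost = start_span({})     # header dropped: an unrelated trace begins
```

`span_b` shares the trace id of `span_a`. `span_lost` does not, which is exactly how one request fragments when a proxy or queue strips the header.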

2) SDK pieces you should recognize

The SDK has providers, instruments, processors, and exporters.

A tracer provider creates tracers and holds the span processors.

A meter provider creates meters, which create counters, histograms, and gauges.

Log support is improving and is often routed through structured logging pipelines. The medical chart should carry the same trace and request identifiers.

The important habit is consistent initialization at process start.

Now watch. Late initialization loses startup failures and warmup clues.

Set resource attributes once and review them like API fields.

Small naming discipline saves huge debugging time later.
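
One way to treat resource attributes like API fields is to load and validate them once at startup. The sketch below reads the standard `OTEL_SERVICE_NAME` and `OTEL_RESOURCE_ATTRIBUTES` environment variables. The `REQUIRED_KEYS` policy and the function names are assumptions for illustration, and the parser is simplified (no percent-decoding of values).

```python
# Hypothetical team policy: every service must declare these keys.
REQUIRED_KEYS = ("service.name", "service.version", "deployment.environment")


def parse_resource_attributes(raw):
    """Parse 'key1=value1,key2=value2' into a dict, skipping malformed pairs."""
    attributes = {}
    for pair in raw.split(","):
        key, sep, value = pair.partition("=")
        if sep and key.strip():
            attributes[key.strip()] = value.strip()
    return attributes


def load_resource(environ):
    """Build resource attributes once, at process start, and fail fast on gaps."""
    attributes = parse_resource_attributes(environ.get("OTEL_RESOURCE_ATTRIBUTES", ""))
    if environ.get("OTEL_SERVICE_NAME"):
        # OTEL_SERVICE_NAME takes precedence over service.name in the list.
        attributes["service.name"] = environ["OTEL_SERVICE_NAME"]
    missing = [key for key in REQUIRED_KEYS if key not in attributes]
    if missing:
        raise ValueError(f"missing resource attributes: {missing}")
    return attributes


# Passing a plain dict keeps the example testable; a service would pass os.environ.
resource = load_resource({
    "OTEL_SERVICE_NAME": "checkout",
    "OTEL_RESOURCE_ATTRIBUTES": (
        "service.version=1.4.2,deployment.environment=staging,cloud.region=eu-west-1"
    ),
})
```

Failing at startup is a deliberate choice here. A missing version label is much cheaper to catch on deploy than in the middle of an incident.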

Counters are for monotonic totals like requests or retries.
Histograms are for latency and size distributions.
Spans wrap operations with start time, end time, and attributes.
Processors decide batching and enrichment before export.
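
The counter and histogram rules above can be modelled in a few lines. This is a toy model, not the OTel metrics API: the `Counter` and `Histogram` classes, the metric names, and the bucket bounds are all made up for illustration.

```python
import bisect


class Counter:
    """Monotonic total: it only ever goes up."""

    def __init__(self, name, unit):
        self.name, self.unit, self.value = name, unit, 0

    def add(self, amount=1):
        if amount < 0:
            raise ValueError("counters are monotonic; use a gauge for values that fall")
        self.value += amount


class Histogram:
    """Counts observations per bucket so percentiles can be estimated later."""

    def __init__(self, name, unit, bounds):
        self.name, self.unit = name, unit
        self.bounds = sorted(bounds)                 # inclusive upper bounds
        self.counts = [0] * (len(self.bounds) + 1)   # last slot is the overflow bucket

    def record(self, value):
        self.counts[bisect.bisect_left(self.bounds, value)] += 1


request_total = Counter("checkout.requests", unit="{request}")
latency = Histogram("checkout.latency", unit="ms", bounds=[50, 100, 250])

for elapsed_ms in (12, 48, 75, 100, 900):
    request_total.add()
    latency.record(elapsed_ms)

print(request_total.value)   # 5
print(latency.counts)        # [2, 2, 0, 1]
```

Note the unit lives on the instrument, not in a comment. That is the part reviewers should check.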

3) Auto-instrumentation versus manual instrumentation

Auto-instrumentation is the fast way to cover common frameworks.

It captures HTTP servers, clients, database calls, and queues quickly. One monitor alarm can watch collector queue growth.

That gives immediate baseline traces and some metrics.

But business meaning rarely comes for free.

So what to do?

Start with auto-instrumentation, then add manual spans around critical workflows.

Instrument checkout, signup, payment, and recommendation decisions explicitly.

Simple, no? Generic hooks first. Domain meaning next.

Another monitor alarm can watch exporter failures before data disappears.

Auto coverage is great for libraries you do not want to patch.
Manual spans add operation names your team actually understands.
Manual attributes should capture version, tenant, and important branch choices.
Review cardinality before adding free-form values like raw user IDs.
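
Here is what a manual span around a business step might look like. It is a standard-library stand-in, not the OTel tracing API: `manual_span`, `FINISHED_SPANS`, `authorize_payment`, and the attribute names are hypothetical. The shape is the point: a name your team understands, plus version, tenant, and the branch taken.

```python
import time
from contextlib import contextmanager

FINISHED_SPANS = []   # stand-in for a span processor and exporter


@contextmanager
def manual_span(name, **attributes):
    """Record one operation with start time, end time, attributes, and status."""
    span = {
        "name": name,
        "attributes": dict(attributes),
        "status": "ok",
        "start": time.monotonic(),
    }
    try:
        yield span
    except Exception as exc:
        span["status"] = "error"
        span["attributes"]["error.type"] = type(exc).__name__
        raise
    finally:
        span["end"] = time.monotonic()
        FINISHED_SPANS.append(span)


def authorize_payment(amount_cents, tenant):
    # Hypothetical business step that auto-instrumentation would not name for you.
    with manual_span("payment.authorize", tenant=tenant, app_version="1.4.2") as span:
        route = "manual_review" if amount_cents > 100_000 else "auto_approve"
        span["attributes"]["decision.route"] = route   # the important branch choice
        return route


authorize_payment(2_500, tenant="acme")
authorize_payment(250_000, tenant="acme")
```

Every attribute here has a small, bounded set of values. A raw user id would not, which is why it stays out.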

4) Exporters and the collector pipeline

Exporters move telemetry out of the process.

The collector is where you batch, transform, sample, and route.

That central layer reduces duplicated config in every service.

It also keeps credentials and backend logic away from app code.

See. The collector is an operations control point.

You can fan data to one or many backends.

You can drop noisy attributes before storage cost explodes.

You can enrich records with environment labels consistently.

Write a playbook for collector outages and fallback routes.

Receivers accept OTLP and sometimes Prometheus scrape input.
Processors batch, sample, limit memory, and redact fields.
Exporters forward data to tracing, metrics, and logging systems.
Multiple collectors can run as sidecars, agents, or gateways.
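
The receive, process, export flow can be sketched as plain functions. This is a toy model of the idea, not collector code or configuration: `redact`, `enrich`, `batch`, `ListExporter`, and the attribute names are all illustrative assumptions.

```python
def redact(records, blocked_keys):
    """Processor: drop attributes that are sensitive or too noisy to store."""
    return [
        {key: value for key, value in record.items() if key not in blocked_keys}
        for record in records
    ]


def enrich(records, **labels):
    """Processor: add environment labels consistently to every record."""
    return [{**record, **labels} for record in records]


def batch(records, size):
    """Processor: group records so exporters make fewer, larger calls."""
    return [records[i:i + size] for i in range(0, len(records), size)]


class ListExporter:
    """Stand-in backend that just remembers the batches it was sent."""

    def __init__(self):
        self.batches = []

    def export(self, records):
        self.batches.append(records)


def run_pipeline(received, exporters):
    records = redact(received, blocked_keys={"user.email"})
    records = enrich(records, environment="staging")
    for chunk in batch(records, size=2):
        for exporter in exporters:   # fan out to every configured backend
            exporter.export(chunk)


tracing_backend, archive_backend = ListExporter(), ListExporter()
received = [{"span": f"op-{i}", "user.email": "a@example.com"} for i in range(3)]
run_pipeline(received, [tracing_backend, archive_backend])
```

The services that produced `received` know nothing about redaction rules or how many backends exist. That is the control point the collector gives you.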

5) Rollout, governance, and sharp edges

Instrumentation is code, so treat it like product code.

Review naming, units, and tags during pull requests.

Add load tests because telemetry can add CPU and memory cost.

Missing propagation breaks traces more often than bad exporters.

High-cardinality tags can also hurt storage and query speed.
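
A quick back-of-the-envelope check shows why. In the worst case, the number of distinct time series for one metric is the product of the distinct values of each attribute. The attribute names and values below are made up for illustration.

```python
from math import prod


def series_count(attribute_values):
    """Upper bound on distinct series: product of distinct values per attribute."""
    return prod(len(set(values)) for values in attribute_values.values())


bounded = {
    "http.route": ["/checkout", "/signup", "/search"],
    "status_class": ["2xx", "4xx", "5xx"],
    "region": ["eu", "us"],
}
# Hypothetical: the same metric after someone adds a raw user id attribute.
unbounded = {**bounded, "user.id": [f"user-{n}" for n in range(10_000)]}

print(series_count(bounded))     # 18
print(series_count(unbounded))   # 180000
```

One extra attribute turned 18 series into 180,000. Run this kind of estimate in review, before the attribute ships.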

See. Better instrumentation does not mean more instrumentation.

Keep a small golden set of metrics and spans per critical path.

Then expand only when a real investigation gap appears.

Publish a naming guide for services, routes, and status tags.
Track collector health as carefully as application health.
Add version labels so rollouts can be compared clearly.
Test telemetry locally, in staging, and under production-like traffic.

The playbook should tell teams where to add manual spans.

Where this lives in the wild

Polyglot platform teams use OTel so Go, Java, and Python services look consistent.
Kubernetes platform engineers deploy collectors as daemonsets or gateway deployments.
Fintech apps add manual spans around payment authorization and fraud decisions.
Data teams export traces and metrics to different backends without rewriting app code.
Internal developer platforms ship OTel defaults so product teams inherit sane telemetry quickly.

Pause and recall

What problem does OTel solve before any backend query even runs?
Why is auto-instrumentation a starting point, not the full answer?
What jobs belong in the collector instead of in application code?
Why can high-cardinality attributes become dangerous?

Interview Q&A

Q: Why introduce OpenTelemetry instead of using each vendor SDK directly?
A: It keeps instrumentation portable and consistent, so changing backends touches less application code and fewer naming rules.
Common wrong answer to avoid: "Because vendors are all the same" - they are not; OTel reduces lock-in and drift.

Q: When is auto-instrumentation enough, and when is manual work needed?
A: Auto hooks cover frameworks quickly, but manual spans are needed for business workflows, decisions, and domain-specific attributes.
Common wrong answer to avoid: "Auto-instrumentation covers everything important" - it rarely captures your business boundaries well.

Q: Why use a collector instead of exporting straight from every service?
A: The collector centralizes batching, routing, enrichment, and credential handling, which keeps apps simpler and operations safer.
Common wrong answer to avoid: "Because apps cannot export directly" - they can; central control is the real value.

Q: What is the most common trace quality failure in rollout?
A: Broken context propagation across service or queue boundaries, because the request chain then fragments into unrelated spans.
Common wrong answer to avoid: "Missing dashboards" - dashboards matter later; propagation breaks the raw data itself.

Apply now (5 min)

Take one service in your stack. List the startup resource attributes, one auto-instrumented dependency, and two manual spans you would add tomorrow. Then draw where the collector would batch, sample, and export that data. Keep the list small and realistic.

Bridge. Data collected. But what should we actually measure and alert on? → 03