04. Dashboards and Queries
⏱️ Estimated time: 21 min | Level: intermediate
ELI5 callback: In the hospital analogy, the thermometer trends, the monitor alarm thresholds, and the medical chart details should appear in one useful view.
1) A dashboard is a decision surface
Good dashboards answer a live question for a specific audience.
They are not wallpaper for big screens.
Start with who is looking and what decision they must make.
Then start the dashboard itself with a thermometer that reflects user pain.
On-call engineers need fast anomaly detection and drill-down links.
Product leaders need service health and customer impact summaries.
See. Same data. Different decision horizon.
Every panel should earn its place.
If a panel never changes action, remove it.
┌─────────────────────── Dashboard flow ────────────────────────┐
│ overview panel → suspect panel → focused query → linked trace │
│       │                │               │              │       │
│  scope first     narrow cause       confirm        detail     │
└───────────────────────────────────────────────────────────────┘
Jump to an X-ray when one panel shows latency without cause.
- Put the golden signals at the top where eyes land first.
- Keep time range obvious because context changes every chart.
- Use consistent units and labels across dashboards.
- Prefer fewer panels with clear purpose over giant walls.
2) Querying with PromQL mindset
PromQL is about selecting, grouping, and transforming time series.
The first habit is knowing your metric type.
Counters need rates or increases before comparison.
Gauges can often be graphed directly.
Histograms need quantiles or bucket math with care.
So what should you do when a chart looks wrong?
Check labels, time window, rate function, and aggregation order.
Query bugs are common, especially with sparse or reset-heavy data.
Another X-ray view helps compare healthy and unhealthy request paths.
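Two of the checks above, the rate function and aggregation order, are easier to see with numbers. The sketch below is plain Python, not PromQL. `increase()` here is a hypothetical, simplified stand-in for counter math with reset handling (it skips the extrapolation that Prometheus's real `rate()` and `increase()` apply), and both instance series are made-up data.

```python
from typing import List


def increase(samples: List[float]) -> float:
    """Total growth of a counter across samples, treating any drop as a reset."""
    total = 0.0
    for prev, curr in zip(samples, samples[1:]):
        if curr >= prev:
            total += curr - prev
        else:
            # The counter restarted from zero, so count what accrued since then.
            total += curr
    return total


# Two hypothetical instances scraped at the same five timestamps.
instance_a = [100.0, 110.0, 120.0, 130.0, 140.0]  # steady, +10 per scrape
instance_b = [500.0, 510.0, 5.0, 15.0, 25.0]      # restarted after the 2nd scrape

# Per-series increase first, then sum: each reset is handled on its own series.
rate_then_sum = increase(instance_a) + increase(instance_b)

# Sum raw counters first, then take the increase: one restart looks like
# a reset of the whole sum.
summed = [a + b for a, b in zip(instance_a, instance_b)]
sum_then_rate = increase(summed)

print(rate_then_sum, sum_then_rate)  # 75.0 185.0
```

Summing first turns one instance's restart into a reset of the combined series and inflates the result. That is why the usual PromQL shape is `sum by (...) (rate(...))`: rate per series, then aggregate.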
- rate() is usually better than raw counters for live traffic.
- sum by() changes the question, so read grouping labels carefully.
- avg() can hide hot shards or one bad region.
- Increase time window when noise dominates, not when truth disappears.
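The avg() warning above can be checked with a small, hedged example. The shard names, error ratios, and the 5% threshold below are all invented for illustration. The point is only the arithmetic: nine quiet shards pull the average under the line while one shard is clearly failing.

```python
from statistics import mean

ALERT_THRESHOLD = 0.05  # hypothetical: 5% errors counts as unhealthy

# Hypothetical per-shard error ratios: nine healthy shards, one hot shard.
error_ratio_by_shard = {f"shard-{i}": 0.001 for i in range(1, 10)}
error_ratio_by_shard["shard-10"] = 0.30

fleet_avg = mean(error_ratio_by_shard.values())
worst_shard, worst_ratio = max(error_ratio_by_shard.items(), key=lambda kv: kv[1])

avg_looks_healthy = fleet_avg < ALERT_THRESHOLD
worst_is_unhealthy = worst_ratio > ALERT_THRESHOLD

print(f"avg={fleet_avg:.4f} worst={worst_shard} ({worst_ratio:.2f})")
```

On a dashboard, the same idea usually means placing a `max by (...)` or `topk()` panel next to the average rather than replacing it.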
3) RED and USE give panel discipline
RED stands for rate, errors, and duration.
It is ideal for request-serving services.
USE stands for utilization, saturation, and errors.
It is strong for resources like CPU, disks, queues, and workers.
Simple, no? One method is service-facing, one is resource-facing.
Blend them when one service depends heavily on one scarce resource.
RED tells you user pain.
USE tells you whether the machine or queue is the bottleneck.
Whichever framing you pick, a medical chart query should back each major dashboard panel.
- Put RED panels first on customer-facing service dashboards.
- Put USE panels first on infrastructure component dashboards.
- Add business counters when a technical graph misses revenue impact.
- Label panels with the question they answer, not only metric names.
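To make the three RED numbers concrete, here is a minimal sketch that derives them from raw request records. The `Request` type, the `red_summary` helper, the choice of status >= 500 as "error", and the sample data are all assumptions for illustration. A real dashboard would compute these from metrics, not from a list in memory.

```python
from dataclasses import dataclass
from math import ceil
from typing import Dict, List


@dataclass
class Request:
    status: int         # HTTP status code
    duration_ms: float  # time spent serving the request


def red_summary(requests: List[Request], window_seconds: float) -> Dict[str, float]:
    """Rate, error ratio, and p95 duration for one time window."""
    if not requests:
        raise ValueError("need at least one request in the window")
    durations = sorted(r.duration_ms for r in requests)
    errors = sum(1 for r in requests if r.status >= 500)
    # Nearest-rank p95: the value at 1-indexed position ceil(0.95 * n).
    rank = ceil(95 * len(durations) / 100)
    return {
        "rate_per_s": len(requests) / window_seconds,
        "error_ratio": errors / len(requests),
        "p95_ms": durations[rank - 1],
    }


# Hypothetical 10-second window: 17 fast successes, one slow success,
# one slow failure, one fast failure.
sample = [Request(200, 40.0)] * 17 + [
    Request(200, 900.0),
    Request(500, 1200.0),
    Request(503, 30.0),
]
summary = red_summary(sample, window_seconds=10.0)
print(summary)  # {'rate_per_s': 2.0, 'error_ratio': 0.1, 'p95_ms': 900.0}
```

Each key maps to one panel question: how much traffic, how much of it fails, and how slow is the slow tail.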
4) Grafana patterns that help during incidents
Grafana works best when dashboards guide a narrative.
Top row for overview.
Middle rows for suspects.
Lower rows for deep detail and breakdowns.
The monitor alarm should link straight to the dashboard you trust.
Repeat templates by service, region, or tenant carefully.
Now watch. Repetition can clarify or create clutter.
Use annotations for deploys, config changes, and incidents.
Without change markers, teams overfit random graph wiggles.
Keep a playbook beside the dashboard for first-response steps.
- Link from panel to logs or traces with inherited variables.
- Use templated variables for region, service, and environment.
- Keep legends readable; too many series hide the message.
- Prefer percentiles over averages for user-facing latency panels.
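Latency percentiles on a panel are often estimated from histogram buckets, which is where the "bucket math with care" warning applies. The sketch below shows one common estimate: find the bucket holding the target rank, then interpolate linearly inside it. `bucket_quantile` is a hypothetical helper, the bucket counts are invented, and it assumes finite, sorted, cumulative buckets with a lowest bound of 0 (no `+Inf` bucket handling).

```python
from typing import List, Tuple


def bucket_quantile(q: float, buckets: List[Tuple[float, int]]) -> float:
    """Estimate quantile q from cumulative (upper_bound, count) buckets.

    Assumes observations are spread evenly inside each bucket.
    """
    total = buckets[-1][1]
    if not 0 < q <= 1 or total == 0:
        raise ValueError("q must be in (0, 1] and the histogram non-empty")
    rank = q * total
    lower_bound, lower_count = 0.0, 0
    for upper_bound, cumulative in buckets:
        if cumulative >= rank:
            # Fraction of this bucket's observations that sit below the rank.
            fraction = (rank - lower_count) / (cumulative - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, cumulative
    return buckets[-1][0]


# Hypothetical request durations in seconds: 100 requests in four buckets.
latency_buckets = [(0.1, 50), (0.5, 90), (1.0, 98), (2.0, 100)]

p50 = bucket_quantile(0.50, latency_buckets)
p95 = bucket_quantile(0.95, latency_buckets)
print(p50, p95)
```

The estimate can only be as precise as the bucket boundaries. If the p95 lands in a wide bucket, the panel shows an interpolated guess, so choose bucket bounds around the latency targets you care about.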
5) Common dashboard anti-patterns
The most common mistake is building a museum.
Many panels, no story, no owner.
The second mistake is hiding thresholds and expected ranges.
Users cannot infer healthy state from raw numbers alone.
The third mistake is mixing low-value and high-value routes in one aggregate.
See. Aggregation can erase the only problem that matters.
The fourth mistake is no review cycle.
Dashboards should age with the architecture, not freeze in time.
- Remove stale panels after major migrations or retired services.
- Review each dashboard after incidents for missing drill-downs.
- Keep one dashboard per job, not one giant board for everything.
- Make dashboards load fast, or nobody will trust them during pressure.
The playbook should list the exact PromQL or log query to run.
Where this lives in the wild
- SRE teams use overview dashboards to spot region-level impact within seconds.
- Product engineers build route-specific RED boards for sign-up and checkout paths.
- Platform teams use USE dashboards for queue workers, caches, and databases.
- FinOps-minded teams annotate dashboards with release events to connect cost and traffic shifts.
- Executive reviews often consume simplified service health summaries derived from deeper engineering boards.
Pause and recall
- Why should every dashboard panel answer a clear decision question?
- When is RED a better framing than USE?
- Why can avg() hide a reliability problem?
- What role do annotations play during investigations?
Interview Q&A
Q: Why do dashboards fail even when they show lots of data?
A: Because panels without audience, thresholds, and drill-down paths create visual load without improving decisions.
Common wrong answer to avoid: "Because engineers are bad at graphs" - the deeper issue is missing purpose, not missing taste.
Q: Why is PromQL aggregation order so important?
A: Summing before or after rate, quantile, or grouping operations can change the meaning and distort the actual question.
Common wrong answer to avoid: "Because PromQL is quirky syntax" - syntax matters less than the semantics of time series math.
Q: When should you choose RED over USE?
A: Choose RED for request-serving paths where user impact is best seen through traffic, failures, and latency.
Common wrong answer to avoid: "Always choose RED because it is simpler" - USE is better for resources and bottleneck diagnosis.
Q: How do you keep dashboards useful over time?
A: Review them after incidents and architecture changes, then remove stale panels and add missing navigation links.
Common wrong answer to avoid: "Set them once and trust monitoring forever" - dashboards decay as systems change.
Apply now (5 min)
Take one dashboard you already use. Label each panel as overview, suspect, or deep-dive. Remove one panel that does not change action, and add one annotation source you currently miss. Then write the main question the dashboard should answer in one sentence.
Bridge. Dashboards built. But metrics don't show the request path. → 05