11. On-call health and burnout

The apparatus runs on humans. A technically perfect apparatus staffed by exhausted engineers degrades faster than a moderately good apparatus staffed by healthy ones. The discipline is to measure on-call load, defend fairness, and act on burnout before it converts to attrition.

Continues from 10-drills-and-game-days.md. This chapter develops the human-load side of the rotation plane. Recurring concepts in bold: page load, off-hours burden, alert fairness, handoff load, rotation rebalance, engineer satisfaction.

The apparatus's six surfaces are all designed; the previous nine chapters built them. This chapter is about keeping the humans inside the apparatus healthy enough to operate it.


The load metrics that matter

A healthy on-call rotation produces measurable signals; the signals worth tracking:

| Metric | What it measures | Healthy range |
|---|---|---|
| Pages per shift | How often the on-call is interrupted | 0-2 per shift for stable apparatus |
| Off-hours page rate | Pages between 22:00 and 06:00 local | < 30% of total pages |
| Sustained shift length | Hours spent actively on an incident | < 4 hours per shift typical |
| Acknowledge-to-context latency | Time from page to first action | Tracked per page; rising trend is the signal |
| Page-shift recovery time | Days before the engineer is fully back to project work | 1-2 days typical for a major incident |
| Engineer satisfaction | Surveyed quarterly, scaled 1-5 | 3.5+ typical for healthy rotations |

A rotation trending the wrong way on any of these (load metrics rising, satisfaction falling) should trigger intervention before the metric reaches a dangerous level. The intervention is structural (rebalance, alert tuning, training), not motivational.
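Two of the table's metrics can be computed directly from a page log. The following is a minimal sketch, assuming each page is recorded with the engineer's local wall-clock time; the `Page` record, the function names, and the sample data are hypothetical, and the thresholds are the healthy ranges from the table.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Page:
    engineer: str
    local_time: datetime  # wall-clock time in the paged engineer's time zone


def is_off_hours(t: datetime) -> bool:
    # 22:00 to 06:00 local, the off-hours window from the table.
    return t.hour >= 22 or t.hour < 6


def load_summary(pages: list[Page], shifts: int) -> dict[str, float]:
    total = len(pages)
    off = sum(1 for p in pages if is_off_hours(p.local_time))
    return {
        "pages_per_shift": total / shifts if shifts else 0.0,
        "off_hours_rate": off / total if total else 0.0,
    }


def out_of_range(summary: dict[str, float]) -> list[str]:
    # Healthy ranges from the table: 0-2 pages per shift, off-hours below 30%.
    flags = []
    if summary["pages_per_shift"] > 2:
        flags.append("pages_per_shift")
    if summary["off_hours_rate"] >= 0.30:
        flags.append("off_hours_rate")
    return flags


# Hypothetical log: 9 pages over 2 shifts, 4 of them between 22:00 and 06:00.
hours = [23, 2, 3, 5, 9, 11, 14, 16, 20]
pages = [Page("engineer-a", datetime(2024, 3, 4, h, 15)) for h in hours]
summary = load_summary(pages, shifts=2)
flags = out_of_range(summary)
```

With this sample log both metrics fall outside the healthy range, which is the cue for a structural intervention rather than a conversation about resilience.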


Alert fatigue as a structural risk

Repeated false-positive pages are the primary driver of on-call burnout. The mechanism:

  • The on-call is paged.
  • The page is not a real incident.
  • The on-call's investment (sleep, attention, project context loss) is uncompensated by the avoided harm.
  • After several such pages, the on-call's trust in the apparatus erodes.
  • The next real page is responded to slowly or skeptically.

The apparatus's response is alert tuning (chapter 03's signals), not "tell the on-call to take alerts seriously." Engineers respond rationally to noisy streams, and the rational response to a noisy alert is to mute it.

A useful target: false-positive rate below 20% per alert per quarter. Above this, the alert is muted in practice. The apparatus's discipline is to find these alerts and retune.
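Finding the alerts that miss the 20% target is a small aggregation over the quarter's pages. This is a sketch under the assumption that each page has been labelled after the fact as a real incident or not; the alert names and counts are invented for illustration.

```python
from collections import Counter
from collections.abc import Iterable

FP_TARGET = 0.20  # the per-alert, per-quarter target stated above


def false_positive_rates(pages: Iterable[tuple[str, bool]]) -> dict[str, float]:
    """pages: (alert_name, was_real_incident) pairs for one quarter."""
    fired: Counter[str] = Counter()
    false_alarms: Counter[str] = Counter()
    for alert, was_real in pages:
        fired[alert] += 1
        if not was_real:
            false_alarms[alert] += 1
    return {alert: false_alarms[alert] / fired[alert] for alert in fired}


def alerts_to_retune(pages: Iterable[tuple[str, bool]],
                     target: float = FP_TARGET) -> list[str]:
    # An alert at or above the target has not met "below 20%".
    rates = false_positive_rates(pages)
    noisy = [a for a, rate in rates.items() if rate >= target]
    # Noisiest first; ties broken by name so the output is stable.
    return sorted(noisy, key=lambda a: (-rates[a], a))


# Hypothetical quarter: one noisy alert, one healthy alert.
quarter = (
    [("latency_p99", False)] * 7 + [("latency_p99", True)] * 13
    + [("error_rate", False)] * 1 + [("error_rate", True)] * 9
)
rates = false_positive_rates(quarter)
retune = alerts_to_retune(quarter)
```

The output is a worklist for the tuning sprint, ordered so the alert doing the most damage to trust is handled first.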


Off-hours burden

Pages between 22:00 and 06:00 (local time of the engineer) carry disproportionate cost — sleep loss, family disruption, next-day productivity loss. A rotation that places off-hours pages disproportionately on the same engineers will burn them out.

Mitigations:

  • Time-zone diversification. Spread rotation across engineers in different time zones so off-hours pages distribute.
  • Off-hours severity tiering. Lower-severity pages defer to business hours; only P1 incidents page off-hours.
  • Compensation for off-hours response. Time-off or compensation that acknowledges the disproportionate cost. The compensation is structural, not heroic.
  • Off-hours bench. A dedicated engineer for after-hours response, separate from the daytime primary, for high-volume rotations.

A rotation that does none of these will see attrition on the engineers who carry the off-hours load.
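Off-hours severity tiering reduces to a routing decision made in the engineer's local time. The sketch below assumes severities are labelled "P1", "P2", and so on, and uses fixed UTC offsets to stay self-contained; the function names and return values are hypothetical, and a production router would use named time zones to handle daylight saving.

```python
from datetime import datetime, timedelta, timezone, tzinfo

# Fixed offsets for illustration only (IST has no daylight saving time).
IST = timezone(timedelta(hours=5, minutes=30))
UTC_MINUS_8 = timezone(timedelta(hours=-8))


def is_off_hours(local: datetime) -> bool:
    # 22:00 to 06:00 in the engineer's local time.
    return local.hour >= 22 or local.hour < 6


def route(severity: str, fired_at: datetime, engineer_tz: tzinfo) -> str:
    """Return "page" or "defer". fired_at must be timezone-aware."""
    local = fired_at.astimezone(engineer_tz)
    if severity == "P1" or not is_off_hours(local):
        return "page"
    return "defer"  # held until business hours


fired = datetime(2024, 3, 4, 18, 0, tzinfo=timezone.utc)  # 23:30 in IST
p2_in_ist = route("P2", fired, IST)
p1_in_ist = route("P1", fired, IST)
p2_in_west = route("P2", fired, UTC_MINUS_8)  # 10:00 local
```

The last call also illustrates time-zone diversification: the same alert that would wake an engineer in IST lands mid-morning for an engineer eight hours behind UTC.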


The handoff load

Rotation handoffs cost time: the outgoing primary writes a summary, the incoming primary reads and asks questions, both spend 30-60 minutes synchronously. The cost is real; teams skip handoffs to save the time, then pay 5-10× the cost in the next incident's mean time to context.

The fix is to budget for handoff explicitly. The handoff is a calendar event; the time is allocated; the team accepts the cost because the alternative is worse.


A worked example — the burning-out senior

The Pune fintech's senior engineer Anjali has been the primary on-call for the lending agent for 18 months. Quarterly survey shows her satisfaction at 2.1. The rotation manager investigates:

  • Pages per shift: 4-5 (above range).
  • Off-hours rate: 45% (above range).
  • Engineer satisfaction: 2.1 (low).
  • Acknowledge latency: rising over the last two quarters.

The diagnosis is structural, not personal. Anjali is overloaded by an apparatus that has grown without rotation rebalance. The interventions:

  • Alert tuning: two noisy alerts retuned over a sprint; false-positive rate drops from 35% to 12%.
  • Rotation rebalance: a sibling team's engineer is trained into the rotation; primary load splits between two engineers.
  • Off-hours bench: a third engineer covers 22:00-06:00 IST in a separate shift.
  • Anjali takes two weeks off; her project work backlog is explicitly reprioritised.

Three months later: Anjali's satisfaction is at 3.6. The rotation has not lost its senior; the apparatus has been redesigned around the human load.


Fairness across rotations

Fairness is not just within a single rotation; it is across the team's portfolio of rotations. Patterns to avoid:

  • The same engineers on every weekend. Distribute weekend duty across the team.
  • The senior gets the harder rotation. The senior often carries the harder rotation already through their expertise; piling on the harder paging schedule too is a recipe for attrition.
  • The newcomer gets the easier rotation forever. Newcomers should rotate through harder surfaces with training, not be insulated indefinitely.
  • The lead is also primary. Leadership work plus primary on-call is double load; one or the other should be lighter.

Fairness is visible. Publishing the per-engineer page count, off-hours rate, and satisfaction quarterly makes imbalances visible and addressable.
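A fairness report can start with something as simple as counting weekend duty per engineer. This is an illustrative sketch: the roster names are invented, and the 1.5x tolerance is an assumed threshold for "carrying more than a fair share", not a figure from this chapter.

```python
from collections import Counter
from datetime import date


def weekend_duty_counts(assignments: list[tuple[str, date]]) -> Counter:
    """Count Saturday and Sunday on-call days per engineer."""
    counts: Counter = Counter()
    for engineer, day in assignments:
        if day.weekday() >= 5:  # Monday is 0, so 5 and 6 are the weekend
            counts[engineer] += 1
    return counts


def overloaded(counts: Counter, roster: list[str],
               tolerance: float = 1.5) -> list[str]:
    """Engineers carrying more than tolerance x the even share."""
    fair_share = sum(counts.values()) / len(roster)
    return sorted(e for e in roster if counts[e] > tolerance * fair_share)


# Hypothetical March 2024 schedule; the 4th is a Monday and is not counted.
schedule = [
    ("asha", date(2024, 3, 2)), ("asha", date(2024, 3, 3)),
    ("asha", date(2024, 3, 9)), ("asha", date(2024, 3, 10)),
    ("ben", date(2024, 3, 16)), ("chen", date(2024, 3, 17)),
    ("ben", date(2024, 3, 4)),
]
counts = weekend_duty_counts(schedule)
flagged = overloaded(counts, roster=["asha", "ben", "chen"])
```

The same shape of report extends to per-engineer page count and off-hours rate; the point is that the numbers are published per engineer, not averaged away.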


When to declare a rotation broken

Some rotations cannot be made healthy by tuning; the underlying load is structurally too high. Signals:

  • Pages per shift consistently above 4.
  • Off-hours rate above 50% despite mitigation.
  • Engineer satisfaction below 2.5 across multiple quarters.
  • Attrition from the rotation above team baseline.

The intervention is not more tuning; it is rotation rebalance, additional staffing, or apparatus simplification. Declaring the rotation broken is a leadership conversation, not a denial.

A common cause: the apparatus has accumulated alerts and runbooks beyond what the staffing supports. The fix is to either staff up (slow), simplify (faster), or remove apparatus coverage with explicit risk acceptance (rare but sometimes correct).
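The signals above can be checked mechanically against quarterly history. In this sketch, "consistently" and "across multiple quarters" are interpreted as "in each of the last two quarters", which is an assumption; the `QuarterHealth` record and its field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QuarterHealth:
    pages_per_shift: float
    off_hours_rate: float   # fraction of pages, measured after mitigations
    satisfaction: float     # quarterly survey, 1-5
    attrition_rate: float   # fraction of engineers who left the rotation


def broken_signals(history: list[QuarterHealth],
                   team_attrition_baseline: float,
                   min_quarters: int = 2) -> list[str]:
    """Names of signals that held in every one of the last min_quarters."""
    if len(history) < min_quarters:
        return []  # not enough history to call anything persistent
    recent = history[-min_quarters:]
    checks = {
        "pages_per_shift": lambda q: q.pages_per_shift > 4,
        "off_hours_rate": lambda q: q.off_hours_rate > 0.50,
        "satisfaction": lambda q: q.satisfaction < 2.5,
        "attrition": lambda q: q.attrition_rate > team_attrition_baseline,
    }
    return [name for name, is_bad in checks.items()
            if all(is_bad(q) for q in recent)]


# Hypothetical rotation, oldest quarter first.
history = [
    QuarterHealth(4.5, 0.55, 2.3, 0.10),
    QuarterHealth(5.0, 0.48, 2.1, 0.12),
]
signals = broken_signals(history, team_attrition_baseline=0.05)
```

A non-empty result is an input to the leadership conversation, not a substitute for it: the check says the load has stayed out of range, not which of rebalance, staffing, or simplification is the right response.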


Operational signals

Healthy. Per-rotation load metrics are within healthy ranges. Engineer satisfaction is stable or rising. Rotation rebalances happen proactively, not after attrition.

First degrading metric. Pages-per-shift climbing on a rotation. Capacity is being consumed by load.

Misleading metric. Total page count across the apparatus. The aggregate hides per-rotation imbalances; an apparatus with a healthy aggregate can have one rotation collapsing.

Expert graph. Per-engineer dashboard: page count, off-hours rate, satisfaction, project-time impact. The per-engineer view catches imbalances the aggregate misses.
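A small numeric illustration of why the aggregate misleads. The rotation names and counts below are invented; the 2-pages-per-shift ceiling is the healthy range used earlier in the chapter.

```python
# (pages, shifts) per rotation over the same period; hypothetical numbers.
rotations = {
    "search": (5, 14),
    "support": (3, 14),
    "billing": (8, 14),
    "lending": (63, 14),
}

HEALTHY_MAX_PAGES_PER_SHIFT = 2

total_pages = sum(pages for pages, _ in rotations.values())
total_shifts = sum(shifts for _, shifts in rotations.values())
aggregate = total_pages / total_shifts

per_rotation = {name: pages / shifts
                for name, (pages, shifts) in rotations.items()}
collapsing = [name for name, rate in per_rotation.items()
              if rate > HEALTHY_MAX_PAGES_PER_SHIFT]
```

Here the aggregate works out to about 1.4 pages per shift, comfortably inside the healthy range, while the lending rotation alone runs at 4.5. Only the per-rotation (and per-engineer) breakdown exposes it.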


Boundary of applicability

Strong fit. Teams running ongoing AI on-call with measurable load. The discipline of measuring and intervening is essential.

Pathology. Treating on-call health as a culture problem rather than an apparatus problem. Burnout is structural; the apparatus's load is the cause. Pep talks and team-building events do not fix overloaded rotations.

Scale limit. Very large platforms have many rotations; the meta-problem is portfolio-level fairness. The pattern is a quarterly review with rotation health as a leadership metric.


Failure-prone assumption

The seductive wrong belief: strong engineers can handle the load. They can — for a while. Then they leave. The apparatus loses its institutional knowledge with their departure; the next engineer enters a degraded rotation and the cycle repeats. The correct belief: the apparatus's load is finite; it must fit the staffing, not the other way around.


Where this appears in production

  • A fintech publishes per-engineer rotation load quarterly; imbalances are addressed before attrition.
  • A telecom AI has an off-hours compensation policy; engineers are willing to take rotation duty.
  • A consumer chatbot treats burnout as "the engineer needs a break"; the same engineer returns to the same overloaded rotation and the cycle repeats.
  • A healthtech AI has time-zone diversification; off-hours pages distribute across geographies.
  • A coding assistant has a satisfaction survey quarterly; trends are reviewed by leadership.
  • A retail AI has alert false-positive rate as a tracked metric; high-FP alerts are retuned.
  • A logistics AI budgets handoff time explicitly; mean time to context after handoff is low.
  • A government AI has a structural on-call compensation policy; turnover from on-call is below team baseline.
  • A B2B SaaS discovered after attrition that the lead was also primary on-call; restructured.
  • A travel platform has the off-hours bench for the highest-volume rotation; daytime on-call is unburdened by sleep loss.
  • A payments AI runs a quarterly rotation health review; degrading rotations are remediated.
  • A legal AI has an alert tuning sprint scheduled quarterly; false-positive rate is sustained low.
  • A staffing AI runs fairness audits (same engineers on the same weekends? same off-hours?); imbalances are caught and corrected.
  • A search-rerank service has the on-call dashboard accessible to leadership; load is visible.
  • A document AI declared a rotation broken; restructured with additional staffing rather than continuing to overload.
  • A media AI had a senior burn out and leave; postmortem identified structural overload; rotation rebalance now proactive.
  • An ad-tech AI has fairness as an explicit policy in the rotation spec; deviations require lead approval.
  • A real-estate AI caught alert fatigue early through satisfaction surveys; retuning sprint reset the trend.
  • A medical AI has a regulatory on-call rotation that is paid extra for the off-hours load; turnover is low.
  • A small SaaS ignored on-call health; both engineers in the rotation left; the apparatus collapsed.

Recall / checkpoint

  1. Name the six load metrics that matter for rotation health.
  2. What is alert fatigue, and how does the apparatus respond to it?
  3. List three mitigations for off-hours burden.
  4. Why is "the senior can handle it" a failure-prone assumption?
  5. What signals indicate a rotation is structurally broken?
  6. What is fairness across rotations and why does it matter?
  7. How does the handoff load compare to the cost of skipping handoffs?

Interview Q&A

Q1. A senior engineer on the rotation is burning out. Walk through the diagnosis and intervention.

The diagnosis is structural, not personal. Check the load metrics: pages per shift, off-hours rate, satisfaction score. If any are out of healthy ranges, the apparatus's load on this engineer is the cause. Interventions: alert tuning to reduce false positives, rotation rebalance to distribute load, off-hours bench for the worst-hit shifts, structural compensation acknowledging off-hours cost. The engineer takes time off as part of the recovery, but the apparatus changes are what prevent recurrence.

Common wrong answer to avoid: "tell them to push through". Push-through accelerates attrition.

Q2. The team's alert false-positive rate is 35%. What is the structural fix?

Schedule an alert tuning sprint. Identify the noisiest alerts. For each: tighten thresholds, add slice confirmation, lengthen sustain windows, split severity tiers (paging vs. ticket). Validate post-tuning against the last quarter's postmortems for recall. The aim is per-alert FP rate below 20%. Maintenance: schedule alert tuning as a recurring quarterly activity, not a one-time fix.

Common wrong answer to avoid: "the on-call should learn which to trust". Engineers respond rationally to noisy streams; the tune is the fix.

Q3. How does off-hours compensation work, and why is it structural rather than discretionary?

Off-hours pages cost the engineer sleep, family time, next-day productivity. Discretionary compensation (a thank-you, a mention in the team meeting) does not pay for these. Structural compensation: documented policy for paid time-off accruing per off-hours page, or higher compensation rate for off-hours rotation shifts. The policy is in writing; the application is automatic; the engineer does not have to ask.

Common wrong answer to avoid: "engineers are well-paid, they don't need extra". Pay covers expected work; off-hours pages are an exceptional load.

Q4. The team's rotation has the lead as primary on-call. What is the problem?

The lead has leadership responsibilities (meetings, mentoring, planning) that compete with on-call duty. Either the leadership work suffers (the team's planning degrades) or the on-call suffers (the lead is unavailable during meetings, hands off to backup frequently). The fix is to separate roles: the lead is in the escalation graph as lead-on-call, but the primary is a different engineer. Combining the roles is double duty; both suffer.

Common wrong answer to avoid: "the lead knows the most, they should be primary". Knowledge can be transferred through training; load cannot be eliminated by expertise.

Q5. What is the difference between an exhausted engineer and a structurally overloaded rotation?

The engineer is the visible symptom; the rotation is the cause. An exhausted engineer can be rested; the rotation, if unchanged, will exhaust the next engineer. Treating exhaustion as personal misses the structural cause. The fix is rotation rebalance, alert tuning, or apparatus simplification: changes to the system, not just the person.

Common wrong answer to avoid: "give them a vacation". Necessary but insufficient.

Q6. The team is below industry baseline on engineer satisfaction. What does the apparatus owe them?

Structural review: load metrics, alert tuning, off-hours mitigation, handoff time allocation, rotation rebalance if warranted. The aim is to bring the apparatus's load into a sustainable range, not to motivate engineers into accepting unsustainable load. The apparatus is changeable; the engineers are not interchangeable.

Common wrong answer to avoid: "improve the team culture". Culture follows structural reality; toxic culture often signals structural overload.


Design / debug exercise (10 minutes)

Modelled example. Walk through the worked example (Anjali's case). Identify each load metric that flagged the problem, each intervention chosen, and the apparatus changes that produced sustainable load.

Your turn. Pick your team's rotations. For each, compute (or estimate) the six load metrics. Identify any rotation outside healthy ranges. Design the interventions you would propose.

Reproduce from memory. Write the six load metrics and their healthy ranges from memory. The signal of internalisation is that the metrics land in under two minutes; the test is that you can apply them to a hypothetical rotation quickly.


Operational memory

This chapter explained the human-load side of the rotation plane: load metrics, alert fatigue, off-hours burden, fairness across rotations, and when to declare a rotation broken. The important idea is that the apparatus's load is finite and must fit the staffing — engineers respond to load rationally, and unsustainable load produces attrition before it produces complaint.

You learned to measure rotation health, intervene structurally (alert tuning, rotation rebalance, off-hours bench), and treat burnout as a system signal rather than a personal failing. That solves the opening failure because the apparatus's humans now operate within sustainable load.

Carry this diagnostic forward: when an engineer says they are "fine" but the load metrics say otherwise, trust the metrics. Engineers under-report load; the apparatus's job is to measure what they will not say.

Remember:

  • Six load metrics: pages/shift, off-hours rate, sustained shift, ack latency, recovery time, satisfaction.
  • Alert fatigue is structural; the response is alert tuning, not motivation.
  • Off-hours compensation is policy, not discretion.
  • Fairness across rotations needs visible per-engineer load reporting.
  • Declaring a rotation broken is a leadership conversation, not a denial.

Bridge. The apparatus is built, exercised, and sustainable. The architect checklist condenses the module into the items the lead engineer runs through to validate the apparatus on any AI feature. The next chapter is that checklist. → 12-architect-checklist.md