06. Turning 40,000 calls a day into something the business can act on

~17 min read. The live pressure is over. The caller hung up. Now there are 40,000 recordings a day, each a few minutes of lossy audio, and the business wants to know: which calls went wrong, why customers are angry this week, whether agents followed the script, and which calls leaked a card number. No turn budget. Just scale, cost, and a thousand ways to measure the wrong thing.

Built on 05-agent-assist-realtime-guidance.md. The same ASR (chapter 03) and transcripts now run offline, in bulk. The turn budget is gone; the pressures are scale, cost, and data quality. This is also where false containment (chapter 01) finally gets caught and where disposition accuracy (chapter 07) gets audited.

Note: live ASR (chapter 03) optimized for turn-taking feel and entity accuracy under a deadline. Offline analytics has no deadline, so it makes different choices — bigger models, diarized two-channel audio, multiple passes. This chapter focuses on the seam unique to bulk analysis: scoring every call cheaply and correctly, and avoiding metrics that measure the wrong thing at scale.

What the live layers left behind, and the question they can't answer

Chapters 02–05 handled one call at a time, under a clock. They left behind artifacts: a recording, a transcript, a disposition, a sentiment trace. Individually, those are just exhaust. In aggregate, they're the only window the business has into what's actually happening across 40,000 calls a day — and the live layers, by design, couldn't see it. A bot answering one call can't tell you that this week duplicate-charge complaints tripled, or that one agent skips the required disclosure on every call, or that the bot's "65% deflection" (chapter 01) is hiding a wall of abandoned callers.

Those are aggregate questions, answerable only by processing everything after the fact. That's post-call analytics: transcribe every call, score it, mine it for topics and sentiment, grade agents and the bot against QA criteria, and flag compliance violations. No latency pressure — a summary that takes 30 seconds is fine when the call is already over. The pressures invert: now it's scale (40k calls × minutes each, every day), cost (you can't run a premium model on every second of audio without a budget blowup), and data quality (garbage transcripts produce confident garbage analytics).

By the end you can lay out the bulk pipeline, name what gets measured (sentiment, QA scorecards, summaries, topics, compliance flags), see why scoring on a sample beats scoring everything badly, and recognize the metrics that look like insight but measure noise.

What this file solves

A contact center can have dashboards full of sentiment scores, QA grades, and topic charts that are all confidently wrong — because the transcripts were lossy, the sentiment model is fooled by polite anger, the QA criteria are unauditable, or the sample is biased toward calls that completed. This file shows how to transcribe and score calls in bulk, what each metric actually measures (and mismeasures), and how to keep the analysis cheap and trustworthy at 40k calls/day — so the dashboards drive real decisions instead of confident noise.

Why "score every call with the best model" blows the budget and still lies

The obvious build: take every recording, run it through a premium ASR, then run a premium LLM over each full transcript for sentiment, summary, QA, and topics. It's the natural instinct — analytics should be thorough, and compute is cheap-ish.

Two failures, fast. First, cost: 40,000 calls/day × ~5 minutes × premium ASR (~$0.02/min) comes to roughly $4,000 a day for transcription alone, before the premium LLM pass over each transcript, and most of that spend scores routine calls that tell you nothing new. Second, and worse, the analytics are confidently wrong on a chunk of calls because the source transcripts are lossy (chapter 02's 8 kHz μ-law channel) and the sentiment model misreads tone. You get a beautiful dashboard built on bad inputs, which is more dangerous than no dashboard, because people act on it.

So the real problem is not "we need a bigger analytics budget" and not "we need a better sentiment model." It is that uniform full-fidelity scoring of every call spends the most on the least informative calls while inheriting the source transcript's errors as confident metrics. How can the pipeline spend compute where it pays and stop treating model outputs as ground truth?

That question shapes the whole pipeline: tier the work (a cheap pass on everything, expensive analysis only where it matters, such as flagged calls, samples, and outliers), and treat every metric as an estimate with error, validated against human-labeled samples, not as truth. Amazon Connect's conversational analytics follows this shape: it analyzes 100% of contacts for cheap signals (categories, sentiment, summaries) but reserves scorecard evaluations and deep review for sampled or flagged contacts.

Rule: tier the compute, and treat every metric as an estimate to validate

The load-bearing rule of analytics: score cheaply at full coverage, score expensively only where it pays, and validate every metric against human-labeled samples — because an analytics number inherits all the errors of the transcript and the model beneath it. Coverage and trust come from tiering and validation, not from running the best model on everything.

Why this rule exists. The primitive is that an analytics metric is a composition of estimates: ASR (lossy) → model scoring (imperfect) → aggregation. Each layer adds error, and the final number carries all of it with false confidence. The constraint is that at 40k calls/day you cannot afford full-fidelity everywhere, and you cannot afford to act on un-validated numbers. The rule splits coverage (cheap, everywhere) from depth (expensive, targeted) and ties both to a human-labeled ground truth so you know the error bars.
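To make the compounding concrete, here is a minimal sketch. It assumes the layers fail independently and that a wrong transcript always yields a wrong score, both simplifications, and the per-layer accuracies are made-up placeholders rather than measurements.

```python
def composed_accuracy(layer_accuracies):
    """Rough end-to-end accuracy when every layer must be right.

    Assumes independent errors per layer (a simplification).
    """
    result = 1.0
    for accuracy in layer_accuracies:
        result *= accuracy
    return result


# Hypothetical per-layer numbers, for illustration only.
layers = {"asr_usable_transcript": 0.90, "model_scoring": 0.85}
end_to_end = composed_accuracy(layers.values())
print(f"end-to-end accuracy is roughly {end_to_end:.3f}")
```

Under these assumptions two individually decent layers compose into a noticeably weaker metric. The real figure can only come from the human-labeled sample.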

1) The bulk analytics pipeline: what runs after the call

Trace one day's calls through the offline pipeline.

        POST-CALL ANALYTICS PIPELINE (40k calls/day, batch)

  Recordings + live artifacts (transcript, disposition, sentiment trace)
        │
        ▼
  ┌─────────────────────────────────────────────────────────────────┐
  │ 1. TRANSCRIBE   diarized, 2-channel (caller vs agent), redact PII │
  │                 cheap/standard ASR — no deadline, batch it        │
  ├─────────────────────────────────────────────────────────────────┤
  │ 2. CHEAP PASS (100% coverage):                                    │
  │    sentiment · topic/intent categories · auto-summary · keywords  │
  ├─────────────────────────────────────────────────────────────────┤
  │ 3. FLAG + SAMPLE → route to expensive analysis:                   │
  │    compliance flags · QA scorecards · outlier/long-call review    │
  ├─────────────────────────────────────────────────────────────────┤
  │ 4. AGGREGATE + VALIDATE against human-labeled sample (error bars) │
  └─────────────────────────────────────────────────────────────────┘
        │
        ▼
  Dashboards · coaching · bot-improvement loop · compliance audit

Step 1 differs from the live ASR (chapter 03) in two ways that the lack of a deadline allows: it diarizes properly using two-channel audio (caller on one channel, agent/bot on the other — the channel separation chapter 03 wanted), and it redacts PII before the transcript is stored (chapter 08). Step 2 runs cheap signals over everything. Step 3 spends the expensive compute only on flagged or sampled calls. Step 4 ties the numbers to human-labeled ground truth so you know how wrong they are.
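Step 3's routing decision can be sketched as below. This is an illustration under assumptions, not a vendor's logic: the `Call` fields, the category names, the thresholds, and the sample rate are all hypothetical.

```python
import random
from dataclasses import dataclass

# Hypothetical category names and thresholds; tune against your own data.
FLAGGED_CATEGORIES = {"billing_dispute", "compliance_hit"}
LOW_SENTIMENT = -0.5
LONG_CALL_SECONDS = 1200


@dataclass
class Call:
    call_id: str
    category: str      # from the cheap pass
    sentiment: float   # -1.0 .. 1.0, from the cheap pass
    duration_s: int


def deep_analysis_reason(call, rng, sample_rate):
    """Return why a call goes to the expensive tier, or None if it stays cheap."""
    if call.category in FLAGGED_CATEGORIES:
        return "flagged_category"
    if call.sentiment < LOW_SENTIMENT:
        return "low_sentiment"
    if call.duration_s > LONG_CALL_SECONDS:
        return "long_call_outlier"
    if rng.random() < sample_rate:
        return "random_sample"  # keeps an unbiased slice of routine calls
    return None


def route(calls, sample_rate=0.02, seed=0):
    """List (call_id, reason) for every call routed to deep analysis."""
    rng = random.Random(seed)
    routed = []
    for call in calls:
        reason = deep_analysis_reason(call, rng, sample_rate)
        if reason is not None:
            routed.append((call.call_id, reason))
    return routed
```

The random-sample branch matters as much as the flags: without it the expensive tier only ever sees calls the cheap pass already suspected, so a blind spot in the cheap pass never gets discovered.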

For the billing line, this is where the duplicate-charge call from chapters 02–05 finally gets graded: did the bot resolve it or strand the caller (false containment, chapter 01)? Did the agent follow the dispute-handling script? Was the card number properly kept out of the transcript (chapter 08)? None of that is visible live; all of it is visible here.

2) Picture: the funnel from coverage to depth

The mental model that keeps analytics affordable and honest: a funnel — wide and cheap at the top (every call gets light signals), narrow and expensive at the bottom (few calls get deep human-level review), with a ground-truth probe running alongside to keep everything calibrated.

        ALL 40k CALLS  ───────────────────────────────────  cheap, 100%
        sentiment · categories · summary · keyword spot       (pennies each)
              │  flag anomalies, sample, route outliers
              ▼
        ~FLAGGED + SAMPLED  ──────────────────  QA scorecards, deep LLM
        compliance hits, low-sentiment, outliers   (dollars each, few %)
              │
              ▼
        HUMAN REVIEW  ──────  the few that need eyes (disputes, escalations)
              │
        ┌─────┴──────────────────────────────────────────┐
        │ GROUND-TRUTH PROBE: human-label a random sample │
        │ → measure how wrong each automated metric is    │
        └─────────────────────────────────────────────────┘

The funnel controls cost (you don't run the dollar-per-call analysis on every penny-per-call routine contact). The probe controls trust (a random human-labeled sample tells you the sentiment model is, say, 80% accurate, so you read the dashboard with that error bar). Without the funnel, you go broke; without the probe, you act on confident lies. Both are non-negotiable at scale.
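The probe's "error bar" can be computed from the labeled sample itself. The sketch below uses the Wilson score interval, one common choice for a proportion; the sample size of 200 labeled calls is an assumed figure for illustration.

```python
import math


def wilson_interval(correct, total, z=1.96):
    """Approximate 95% Wilson score interval for an accuracy estimate."""
    if total <= 0:
        raise ValueError("need at least one labeled call")
    p = correct / total
    denom = 1 + z ** 2 / total
    centre = (p + z ** 2 / (2 * total)) / denom
    half_width = (
        z * math.sqrt(p * (1 - p) / total + z ** 2 / (4 * total ** 2)) / denom
    )
    return centre - half_width, centre + half_width


# Hypothetical probe: humans agreed with the model on 160 of 200 sampled calls.
low, high = wilson_interval(correct=160, total=200)
print(f"sentiment accuracy: 80% observed, plausible range {low:.2f} to {high:.2f}")
```

With 200 labels an observed 80% is compatible with roughly 74% to 85%, so the size of the labeled sample sets how tightly the dashboard can be read.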

3) The running example: grading the billing dispute call, in aggregate

Thread the call one last time, now as data. The duplicate-charge call (handled by the bot in chapters 02–04, escalated to a human with assist in chapter 05) is one of 40,000 today. Two ways the analytics could grade it.

Attempt A: score the raw transcript with one premium LLM pass

The pipeline takes the single-channel live transcript (no diarization), feeds it to a premium LLM, and asks for sentiment, summary, QA score, and compliance. The transcript has the bot's words and the caller's words interleaved without clear speaker labels (chapter 03's mixed-channel problem), and a lossy stretch garbled the account number. The LLM confidently produces: sentiment "positive," QA "100% compliant," summary "balance inquiry resolved." Every one is plausible, and at least two are wrong: the call was actually a frustrated dispute, not a resolved balance inquiry, and the model read the caller's polite phrasing as positive while missing the anger.

Attempt B: diarized, redacted, tiered, validated

The pipeline re-transcribes from two-channel audio (caller separated from agent), redacts the card number, and runs the cheap sentiment and category pass, which labels the call "billing dispute: duplicate charge" with sentiment "negative, trending toward neutral." Because a dispute is a flagged category, the call routes to the QA scorecard pass: did the agent verify identity, acknowledge the charge, follow the credit policy, and give the required disclosure? The scorecard checks each item against the speaker-labeled transcript. The summary is drafted and the disposition cross-checked. The call also sits inside the day's aggregate, where a human-labeled sample tells the team the sentiment model is ~82% accurate, so "negative-trending" is read as "probably negative, ±some error."

The hard part hiding here: diarization and channel separation are prerequisites for every downstream metric. You cannot score "did the agent follow the script" if you don't know which words the agent said. Garbage speaker attribution makes confident garbage QA. This is chapter 03's speaker-separation problem resurfacing — but now it's not about the bot transcribing itself live; it's about whether the whole analytics stack can attribute words correctly.
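A small sketch of why attribution gates the scorecard. The `Turn` type and the disclosure phrase are hypothetical, and a real scorer would match more loosely than a substring test; the point is only that the check refuses to score when it cannot tell who spoke.

```python
from dataclasses import dataclass

# Hypothetical disclosure wording; real scripts vary by line of business.
REQUIRED_DISCLOSURE = "this call may be recorded"
KNOWN_SPEAKERS = {"agent", "caller"}


@dataclass
class Turn:
    speaker: str  # derived from the audio channel: "agent" or "caller"
    text: str


def agent_gave_disclosure(turns):
    """True or False when attribution is usable; None when it cannot be scored."""
    if any(turn.speaker not in KNOWN_SPEAKERS for turn in turns):
        return None  # unknown speaker: decline to score rather than guess
    return any(
        turn.speaker == "agent" and REQUIRED_DISCLOSURE in turn.text.lower()
        for turn in turns
    )
```

Returning `None` instead of a guess keeps unattributable calls out of the adherence rate, where they would otherwise show up as confident passes or fails.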

4) Why tiered batch scoring instead of premium-model-on-everything: choosing under a scale-and-cost workload

The plausible alternative: compute is getting cheaper, so just run the best model on every call and skip the tiering complexity.

Premium model on every call — simplest pipeline, uniform quality, no routing logic. But cost scales linearly with volume and most spend lands on routine calls that produce no new insight, and it still inherits transcript errors as confident metrics. At 40k/day it's expensive and not more trustworthy.
Tiered batch (cheap everywhere, expensive on flagged/sampled) + validation — cheap signals give full coverage, expensive analysis concentrates where it pays (compliance, disputes, outliers, escalations), and a human-labeled probe gives error bars. More pipeline complexity, far lower cost, and more trustworthy because it's calibrated.

For a 40k-call/day center where 70%+ of calls are routine and the value is in the flagged tail, tiering wins decisively. The deciding question: does every call need deep analysis, or does the value concentrate in a flagged subset (compliance hits, disputes, angry callers, outliers)? In a contact center, value concentrates — so spend there and sample the rest.

5) The property that changes the design: the sample you score is biased toward the calls that completed

The dimension people miss is survivorship bias in the analytics input. The calls you can fully transcribe and score are the ones that completed — but the most important signal often lives in the calls that didn't: the caller who hung up on the bot in frustration (false containment, chapter 01), the dropped call, the 8-second "are you there?" abandon. If your analytics only scores clean completed calls, your sentiment looks better than reality and your "resolution rate" counts the survivors.

   What you score:    completed calls          sentiment 78% positive
   What you ignore:   abandoned/hung-up calls  (the angriest, excluded)
   ───────────────────────────────────────────────────────────────────
   True sentiment is worse; the abandons were the unhappy callers leaving.
   Deflection looked great (ch 01) precisely because abandons weren't scored.

This asymmetry should change your design: deliberately include the failed, short, and abandoned calls in analytics — they carry the false-containment signal that the headline metrics hide. This is exactly how chapter 01's false containment finally gets caught: cross the deflection rate against the abandon/repeat-call rate in the analytics, where you can see the abandoned calls that the live deflection metric counted as wins. The calls that didn't survive are the ones with the most to tell you.
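Crossing deflection against abandons and repeat calls might look like the sketch below. The `BotCall` fields are hypothetical, and the 48-hour repeat window is an assumption borrowed from the interview answer later in this chapter.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class BotCall:
    caller_id: str
    started_at: datetime
    transferred_to_agent: bool  # False means the live metric counted it as deflected
    abandoned: bool             # caller hung up before any resolution


def containment_report(calls, repeat_window=timedelta(hours=48)):
    """Compare headline deflection with deflection net of abandons and call-backs."""
    if not calls:
        raise ValueError("no calls to analyze")
    starts_by_caller = defaultdict(list)
    for call in calls:
        starts_by_caller[call.caller_id].append(call.started_at)

    deflected = [c for c in calls if not c.transferred_to_agent]
    false_contained = 0
    for call in deflected:
        deadline = call.started_at + repeat_window
        called_back = any(
            call.started_at < later <= deadline
            for later in starts_by_caller[call.caller_id]
        )
        if call.abandoned or called_back:
            false_contained += 1

    total = len(calls)
    return {
        "headline_deflection": len(deflected) / total,
        "verified_containment": (len(deflected) - false_contained) / total,
    }
```

The gap between the two numbers is the false-containment estimate. It only exists if abandoned and short calls are present in `calls` to begin with.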

6) One failure walked through: the sentiment dashboard that drove the wrong coaching

Incident: the contact center runs sentiment analytics and coaches agents on calls that scored "negative." Three months in, the coaching isn't improving CSAT, and some of the best agents are being flagged for "negative" calls. The sentiment model's accuracy dashboard says 85%.

The chain: the sentiment model scored tone and word choice, not outcome. A great agent handling an angry fraud-victim call — calm, empathetic, resolving a genuinely bad situation — produced a transcript full of negative words ("fraud," "stolen," "charge I didn't make," "upset"). The model scored it "negative." Meanwhile a bot call where a frustrated caller was polite ("okay, sure, thanks, bye") before hanging up to call back scored "positive." The team coached the empathetic agents to "improve sentiment" and praised the calls that actually failed.

The root cause is not a bad sentiment model — 85% accuracy on its task (tone) is fine. It's that sentiment-of-words was used as a proxy for call-quality-of-outcome, and they're different things. The fix: score outcome (was the issue resolved, did the caller call back, was the disposition correct) as the primary quality signal, and use sentiment as a secondary signal interpreted with the context of the call type. And validate against human-labeled outcomes, not just human-labeled sentiment. This is the same wrong-proxy trap as chapter 01's "good transcript means good outcome" — measuring the conversation instead of the result, now at analytics scale.
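One way to encode "outcome first, sentiment second" is sketched here. The call-type names and the decision rule are assumptions for illustration, not a recommended policy.

```python
# Hypothetical call types where negative wording is expected even on a good call.
HARD_CALL_TYPES = {"fraud", "hardship", "billing_dispute"}


def needs_coaching_review(resolved, repeat_call, disposition_correct,
                          sentiment, call_type):
    """Queue a call for human coaching review based on outcome, then tone."""
    outcome_ok = resolved and not repeat_call and disposition_correct
    if not outcome_ok:
        return True  # outcome failed: review regardless of how polite it sounded
    # Outcome was good; negative tone only matters where it is unexpected.
    return sentiment == "negative" and call_type not in HARD_CALL_TYPES
```

Under this rule the empathetic agent on the fraud call is left alone, and the polite bot call that ended in a call-back is queued, which is the reverse of what the sentiment-only dashboard did.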

7) Cost movement: where the analytics dollars go and how tiering cuts them

Per-day cost for a 40k-call/day center (illustrative; ~5 min average call, varies by vendor):

| Stage | Naive (premium everything) | Tiered |
| --- | --- | --- |
| Transcription | 40k × 5 min × ~$0.02 = ~$4,000/day | standard ASR, batch: ~$2,000/day |
| Cheap pass (sentiment/category/summary) | included in premium LLM/call | ~$0.005/call × 40k = ~$200/day |
| Expensive analysis (QA/compliance/deep) | premium LLM × 40k | only ~5–10% flagged: ~$0.05 × ~3k = ~$150/day |
| Human review | unbudgeted, swamped | sampled + flagged only: bounded |
| Rough total | thousands/day, mostly wasted | far lower, spent where it pays |
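The table's arithmetic, reproduced as a sketch so the inputs can be swapped for real vendor prices. The unit prices are the illustrative ones above; the standard ASR rate of $0.01/min is inferred from the ~$2,000/day figure, and the premium LLM pass is left out because the table does not price it.

```python
CALLS_PER_DAY = 40_000
AVG_MINUTES_PER_CALL = 5


def daily_costs(premium_asr_per_min=0.02, standard_asr_per_min=0.01,
                cheap_pass_per_call=0.005, deep_per_call=0.05,
                deep_calls=3_000):
    """Return (naive transcription cost, tiered cost breakdown) in dollars/day."""
    minutes = CALLS_PER_DAY * AVG_MINUTES_PER_CALL
    naive_transcription = minutes * premium_asr_per_min
    tiered = {
        "transcription": minutes * standard_asr_per_min,
        "cheap_pass": CALLS_PER_DAY * cheap_pass_per_call,
        "deep_analysis": deep_calls * deep_per_call,
    }
    return naive_transcription, tiered


naive, tiered = daily_costs()
print(f"naive transcription alone: ${naive:,.0f}/day")
print(f"tiered automated total:    ${sum(tiered.values()):,.0f}/day")
```

Human review is excluded from both sides, as in the table, because it is bounded by team size rather than by a per-call price.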

The pressure evolution: tiering relieves cost pressure (you stop paying premium on routine calls) but creates routing-and-flagging pressure. You must correctly decide which calls get deep analysis, and a bad flag means an important call gets only the cheap pass; the flagging logic and the validation probe absorb that risk. Validation relieves the confident-lie risk but creates a continuous human-labeling cost (you must keep a fresh labeled sample as call patterns drift), absorbed by a small QA team. Both costs are far smaller than the waste they prevent.

8) Signals that the analytics is the problem

Healthy: automated metrics tracked with known error bars (validated against a labeled sample), outcome-based quality scores that correlate with repeat-call and CSAT, analytics that include abandoned/short calls.

First metric to degrade: the gap between an automated metric and its human-labeled ground truth. When the sentiment model or QA scorer drifts (new call types, new products, model update), the validation gap widens before anyone notices the dashboards are wrong — which is exactly why the probe runs continuously.
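A minimal sketch of watching that gap: compare model-versus-human agreement on the current labeled batch against a baseline batch. The 5-point alert threshold is an arbitrary assumption, and a production check would also account for sample size.

```python
def agreement_rate(pairs):
    """Share of (automated_label, human_label) pairs that match."""
    pairs = list(pairs)
    if not pairs:
        raise ValueError("need at least one labeled pair")
    return sum(auto == human for auto, human in pairs) / len(pairs)


def drift_alert(baseline_pairs, current_pairs, max_drop=0.05):
    """Return (alert, baseline_rate, current_rate) for the validation gap."""
    baseline = agreement_rate(baseline_pairs)
    current = agreement_rate(current_pairs)
    return (baseline - current) > max_drop, baseline, current
```

Run per metric and per call type where possible, since a model can stay accurate overall while drifting badly on one new product line.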

Misleading metric people watch: the analytics model's own accuracy dashboard (e.g., "sentiment 85% accurate"). Accurate at what task? 85% on tone tells you nothing about whether tone predicts outcome — the section-6 trap.

First graph an expert opens: automated metric vs human-labeled sample over time (the calibration drift), and deflection/containment cross-referenced with abandon and repeat-call rates including the short/abandoned calls — this is where chapter 01's false containment finally shows itself. The second graph: QA score distribution by agent, sanity-checked against outcome (a "low QA, high resolution" agent means the scorecard is measuring the wrong thing).

9) Boundary: where bulk analytics shines, where it misleads

Bulk analytics shines on aggregate, trend, and compliance questions over high volume: this week's rising topics, script-adherence rates, sentiment trends, compliance-violation counts, bot-vs-human resolution gaps. At scale, even a noisy per-call signal averages into a usable trend, and the funnel keeps it cheap.

It misleads on individual-call judgment and on anything used punitively: flagging one agent for one "negative" call, or making a firing decision on an automated QA score. Per-call, the error bar that's invisible in an aggregate dominates — a single call's sentiment or QA score can easily be wrong. The scale limit that inverts intuition: analytics gets more reliable in aggregate as volume grows (noise averages out) but is least reliable exactly where it's most tempting to use it — on the individual call, the individual agent, the punitive decision. Trust the trend; verify the individual with a human.
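The aggregate-versus-individual gap can be made concrete with the standard error of a proportion. This is a minimal sketch assuming independent calls and the 78% figure used illustratively in this chapter. It captures random noise only: a systematic bias, such as a model that misreads polite anger, does not shrink with volume, which is why the validation probe is still needed.

```python
import math


def standard_error(p, n):
    """Standard error of a proportion p estimated from n independent calls."""
    if n <= 0:
        raise ValueError("n must be positive")
    return math.sqrt(p * (1 - p) / n)


for n in (1, 100, 10_000):
    margin = 1.96 * standard_error(0.78, n)
    print(f"n={n:>6}: 0.78 plus or minus {margin:.3f}")
```

The margin shrinks with the square root of the call count, so a trend over thousands of calls is tight while a single call's score is close to uninformative.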

10) Wrong assumption: "if it's on the analytics dashboard, it's a fact"

The seductive idea: the dashboard says sentiment is 78% positive, QA adherence is 92%, so those are facts. They're estimates — compositions of a lossy transcript, an imperfect model, and a possibly-biased sample — each carrying error the dashboard renders as a clean number. Acting on the number as truth is how you coach the wrong agents (section 6) and celebrate false containment (chapter 01).

Replace it with: every analytics number is an estimate with an error bar, only as good as the transcript and sample beneath it — validate before you act. This reorders how the dashboard is used: trends with error bars drive decisions; individual numbers get human verification before anyone is coached or any call is judged. It's the same "the conversation is the interface, not the result" correction from chapter 01, now applied to the metrics themselves: the dashboard is a measurement, not the truth.

11) Other ways analytics bites

No diarization — speaker attribution is wrong, so "did the agent follow the script" is unanswerable; every QA score is suspect.
PII not redacted before storage — card numbers and SSNs sit in the analytics transcript store, expanding compliance scope (chapter 08).
Sentiment as outcome proxy — empathetic agents on hard calls flagged "negative"; the section-6 coaching disaster.
Survivorship bias — only completed calls scored; abandons (the angriest) excluded, inflating every metric.
Topic mining on garbled transcripts — lossy ASR produces phantom topics; the trend chart tracks ASR errors, not customer issues.
QA scorecard rubric drift — criteria written once, never re-validated; agents game the measurable parts.
Compliance flag false positives — every call mentioning "card" flagged, swamping the compliance team with noise.
Acting on un-validated metrics — coaching, staffing, or bot changes driven by numbers with unknown error bars.
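The redaction and false-positive items above share one remedy: match on digit patterns that pass a checksum rather than on the word "card." The sketch below uses the Luhn check that payment card numbers satisfy; the regular expression and the `[REDACTED_PAN]` placeholder are assumptions, and it only handles digits written as numerals, not numbers an ASR spelled out as words.

```python
import re

# 13 to 19 digits, optionally separated by single spaces or hyphens.
CANDIDATE_PAN = re.compile(r"\b(?:\d[ -]?){12,18}\d\b")


def luhn_valid(digits):
    """Luhn checksum: double every second digit from the right, sum, mod 10."""
    total = 0
    for index, char in enumerate(reversed(digits)):
        value = int(char)
        if index % 2 == 1:
            value *= 2
            if value > 9:
                value -= 9
        total += value
    return total % 10 == 0


def redact_card_numbers(text):
    """Replace Luhn-valid digit runs; leave other numbers and words untouched."""
    def _replace(match):
        digits = re.sub(r"\D", "", match.group())
        return "[REDACTED_PAN]" if luhn_valid(digits) else match.group()

    return CANDIDATE_PAN.sub(_replace, text)


print(redact_card_numbers("my card is 4111 1111 1111 1111 thanks"))
```

A call that merely says "I lost my card" passes through unflagged, while a Luhn-valid number is both redacted and countable as a genuine compliance hit.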

12) Pattern transfer

Tiering is the same as hot/cold storage or sampling-based monitoring — you can't afford full fidelity on everything, so you keep cheap signals on all of it and expensive analysis on the hot/flagged subset. The funnel is a cost-vs-coverage tradeoff identical to log sampling: full coverage at low fidelity, full fidelity at low coverage.
Survivorship bias — structurally identical to only logging successful requests and concluding the system is healthy: the failures dropped out of the dataset, so the metric lies upward. Including abandoned calls is logging the failures too.
Validation against ground truth is the eval-set discipline — same as never trusting a model's training metric without a held-out human-labeled set. An un-validated analytics dashboard is a model reporting its own training loss as accuracy. The probe is the held-out eval.

13) Design test

Does the pipeline tier compute (cheap on all, expensive on flagged/sampled), or run premium on everything?
Is every transcript diarized/two-channel before any per-speaker metric (QA, script adherence) is computed?
Is every automated metric validated against a continuously-refreshed human-labeled sample with known error bars?
Does the analytics include abandoned and short calls, so false containment and the angriest callers aren't excluded?
Is quality scored on outcome (resolution, repeat-call, correct disposition), not just sentiment-of-words?

Where this appears in production

Amazon Connect conversational analytics (formerly Contact Lens) — analyzes 100% of contacts for sentiment, categories, and summaries; reserves scorecard evaluations and deep review for flagged/sampled calls; redacts PII from recordings and transcripts.
NICE Enlighten / CXone QM — automated QA scorecards and interaction analytics at scale across all contacts.
Genesys Cloud speech & text analytics — topic/sentiment mining and automated quality management.
Verint — workforce-engagement analytics: QA automation, compliance, and trend mining.
CallMiner — conversation-intelligence platform specializing in compliance and risk-flagging across calls.
Observe.AI — post-call QA, auto-scoring, and coaching insights for contact centers.
Gong / Chorus — conversation analytics on sales calls (same shape: transcribe, score, mine topics, coach).
Cresta analytics — outcome-linked scoring tying behaviors to results, not just tone.
Two-channel diarized recording — caller and agent on separate channels so per-speaker QA is reliable.
PII redaction (Connect, Presidio) — strips card numbers/SSNs from transcripts before storage (chapter 08).
Auto-summarization — drafts the disposition summary post-call, cutting after-call work (links to chapter 05/07).
Compliance flagging (disclosure detection) — confirms required disclosures were read on regulated calls.
Topic/intent mining — surfaces "duplicate charge" complaints trending this week to feed bot improvement.
Containment-quality analytics — cross-references deflection against abandon/repeat to catch false containment (chapter 01).
Human-labeled validation sampling — the QA team's labeled sample that calibrates every automated metric.

Recall

How do the pressures of post-call analytics differ from the live layers (chapters 02–05)?
Why does running a premium model on every call blow the budget and still produce wrong metrics?
What does the funnel (coverage vs depth) control, and what does the ground-truth probe control?
Why is diarization/channel separation a prerequisite for QA scorecards?
What is survivorship bias in analytics, and why does excluding abandoned calls inflate every metric?
Why can a sentiment dashboard cause a center to coach its best agents the wrong way?
Where is bulk analytics most reliable, and where is it most dangerous to trust?

Interview Q&A

Q1. Your analytics bill is huge and the dashboards still seem off. What do you change?
Stop running premium models on every call. Tier it: cheap signals (sentiment, categories, summary) at 100% coverage, expensive analysis (QA, compliance, deep LLM) only on flagged and sampled calls, and validate every metric against a continuously-refreshed human-labeled sample. That cuts cost dramatically and makes the numbers trustworthy, because tiering controls spend and the probe controls trust.
Common wrong answer to avoid: "switch to a cheaper model across the board." Uniform cheap scoring lowers cost but worsens trust; the win is tiering plus validation, not one model everywhere.

Q2. Why diarize and use two-channel audio offline when the live bot didn't bother?
Because every per-speaker metric ("did the agent follow the script," "who said the disclosure") requires knowing which words each speaker said, and a single mixed channel makes that guesswork. Offline has no deadline, so you can re-transcribe from two-channel audio and diarize properly. Garbage speaker attribution produces confident garbage QA scores. It's chapter 03's speaker-separation problem, now blocking the entire analytics stack.
Common wrong answer to avoid: "the live transcript is good enough." The live transcript was tuned for turn-taking on a mixed channel; it can't reliably attribute who-said-what for QA.

Q3. Sentiment analytics flagged several of your best agents for "negative" calls. What's wrong?
Sentiment-of-words is being used as a proxy for call quality, and they differ. An empathetic agent on a fraud or hardship call produces a negative-word-heavy transcript and scores "negative" despite handling it perfectly, while a polite-but-failed bot call scores "positive." Score outcome (resolution, repeat-call, correct disposition) as the primary quality signal and use sentiment as a context-interpreted secondary signal. Validate against human-labeled outcomes.
Common wrong answer to avoid: "improve the sentiment model's accuracy." It may be accurate at tone; the bug is using tone as a stand-in for outcome.

Q4. The bot's deflection looked great live, but you suspect false containment. How do analytics catch it?
Include the abandoned and short calls in the analytics: the live deflection metric counted them as wins, but they're the frustrated hang-ups. Cross-reference deflection against the 48-hour repeat-call and abandon rates. If callers the bot "deflected" called back or abandoned mid-call, that's false containment surfacing in the aggregate, which the per-call live layer structurally couldn't see.
Common wrong answer to avoid: "the deflection metric says 65%, so containment is fine." That's the exact metric that counts abandoned failures as successes; analytics catches it precisely by scoring the calls the live layer excluded.

Q5. Can you use the automated QA score to put an agent on a performance plan?
Not on its own. Aggregate analytics is reliable for trends but unreliable per individual call: a single call's QA or sentiment score can easily be wrong, and the error bar invisible in an aggregate dominates one call. Anything punitive needs human review of the actual calls. Trust the trend; verify the individual. Analytics is least reliable exactly where it's most tempting to use it punitively.
Common wrong answer to avoid: "the dashboard says 60% adherence, that's grounds for action." Acting punitively on an un-verified per-agent automated score coaches and penalizes on noise.

Q6. Topic mining shows a new "topic" spiking this week, but no one knows what it is. What do you check first?
Whether it's a real customer trend or an ASR artifact. Lossy transcripts (chapter 02's channel) produce garbled tokens that cluster into phantom topics. Pull the actual transcripts behind the spike and listen to a sample of the calls; if the "topic" is a transcription error pattern, you're tracking ASR noise, not customer behavior. Validate the topic against the underlying audio before acting.
Common wrong answer to avoid: "escalate the new topic to product immediately." Acting on an unvalidated topic that may be a transcription artifact wastes the org's attention on noise.

Q7. (Cumulative) A call's analytics transcript contains a full card number. Which chapters failed?
Chapters 08 and 02 primarily, surfacing in 06. The card should have been captured via a PCI-safe DTMF path (chapter 08) so it never entered the audio the recorder tapped (chapter 02's bridge), and any residual PII should have been redacted before the transcript was stored (chapter 06 step 1, chapter 08). Finding it in the analytics store means the capture path leaked it and redaction didn't catch it, so the entire analytics store is now in PCI scope. Fix the capture, not the dashboard.
Common wrong answer to avoid: "redact it from the analytics dashboard now." It's already stored in scope; redaction at display doesn't undo that the card entered the recording, transcript, and analytics pipeline.

Design/debug exercise (10 min)

Step 1 — Modeled example. Walk the billing-dispute call through the funnel (sections 1–3, Attempt B): transcribe two-channel + redact → cheap pass (category "billing dispute," sentiment) → flagged as dispute → QA scorecard pass → aggregate with error bars from the validation sample. For each stage, write the one failure if it's skipped (e.g., skip diarization → QA score is meaningless).

Step 2 — Your turn. Design the analytics for the chapter-01 false-containment problem on the billing line: which calls must you include that naive analytics would exclude, which two metrics do you cross-reference, and how do you validate that a "deflected" call was actually resolved versus abandoned? Note where the abandoned-call signal comes from.

Step 3 — Reproduce from memory. Redraw the coverage-vs-depth funnel with the ground-truth probe (section 2) cold. Then connect it back to chapter 01: mark where false containment gets caught, and to chapter 03: mark where the speaker-separation problem (diarization) gates every per-speaker metric.

Operational memory

This chapter explained why a contact center can have full dashboards of sentiment, QA, and topic metrics that are confidently wrong — built on lossy transcripts, wrong-proxy models, and survivorship-biased samples. The important idea is that analytics is a composition of estimates, so you tier the compute (cheap everywhere, expensive on flagged/sampled) to afford coverage, and validate every metric against a human-labeled sample so you know its error bar before you act.

You learned to transcribe diarized two-channel audio and redact PII first, run cheap signals at full coverage, route only flagged/sampled calls to expensive QA and compliance analysis, include the abandoned and short calls so false containment surfaces, and score outcome over sentiment-of-words. That solves the opening "confident garbage dashboard" failure because the failure was never a weak model — it was uniform scoring of biased inputs treated as truth.

Carry this diagnostic forward: when a dashboard drives a bad decision, check the validation gap and the sample for survivorship bias before trusting the number. When analytics flags good agents, check whether sentiment is being used as an outcome proxy. When a topic spikes, listen to the calls before believing the trend.

Remember:

Tier compute (cheap at full coverage, expensive on flagged/sampled); premium-on-everything is costly and no more trustworthy.
Every analytics number is an estimate with an error bar — validate against a human-labeled sample before acting.
Diarize/two-channel before any per-speaker metric; wrong attribution makes confident garbage QA.
Include abandoned and short calls; excluding them hides the angriest callers and false containment.
Score outcome (resolution, repeat-call, correct disposition), not sentiment-of-words; trust trends, verify individuals.

Bridge. Analytics can only grade what it can see, and what it sees is the transcript and the disposition — which means those records have to actually exist, be correct, and be attached to the right account. That depends on the AI authenticating the caller mid-call and writing structured outcomes back into the CRM, the same systems integration that carries the warm-transfer baton. Wiring the AI into Salesforce and Zendesk — auth, screen pop, disposition — is the next seam. → 07-crm-cti-and-systems-integration.md
