17. Naive Bayes: update the odds, one clue at a time

Six minutes. Start with base rates. Add word clues. Get a fast probability machine that is wrong in theory and useful in practice.

Built on the ELI5 in 00-eli5.md. The confidence score — starting from a prior and moving as new clues arrive — is the central picture here. Naive Bayes is the model's quick odds-updater.

The picture before math

See. An email arrives. Before reading the words, the model already knows base rates. Ham is common. Spam is rarer. So the model starts with a prior. Then new clues arrive. "Free money." "Urgent." "Click here." Each clue pushes the odds. That is Bayes.

prior  +  word clues  →  updated belief
 40%        data      →      91%

The prediction is the class with highest posterior probability. The confidence score is that posterior itself. So what is the "naive" part? It assumes the clues are independent given the class. That is often false. Yet the model still works surprisingly well.

Bayes' theorem in one breath

For a class C and features x:

P(C | x) = P(x | C) P(C) / P(x)

In classification, we usually compare classes. P(x) is the same for all classes. So we use:

P(C | x) ∝ P(x | C) P(C)

Read it like this:

- P(C) = prior before reading the email
- P(x | C) = how likely these words are if class C is true
- P(C | x) = updated belief after seeing the words

The theorem is simple. Estimating P(x | C) is the hard part.
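As a minimal sketch of the formula, the function below multiplies each prior by its likelihood and divides by P(x), computed as the sum over classes. The function name `posterior` and the two likelihood values are assumptions made up for illustration; only the 0.4 / 0.6 priors echo the spam example used later.

```python
def posterior(priors, likelihoods):
    """Apply Bayes' theorem over a set of classes.

    priors:      {class: P(C)}
    likelihoods: {class: P(x | C)} for one observed x
    """
    # Numerator of Bayes' theorem for each class: P(x | C) * P(C)
    joint = {c: likelihoods[c] * priors[c] for c in priors}
    # P(x) by the law of total probability; identical for every class
    evidence = sum(joint.values())
    return {c: joint[c] / evidence for c in joint}


priors = {"spam": 0.4, "ham": 0.6}
likelihoods = {"spam": 0.20, "ham": 0.02}  # hypothetical P(x | C) values
result = posterior(priors, likelihoods)
print(result)
```

Dropping the division by `evidence` leaves the class ranking unchanged, which is why the proportional form is enough for prediction.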

Why it is called "naive"

Suppose the words are:

- free
- click
- urgent

These are not independent. If "free" appears, "click" becomes more likely. If "urgent" appears, "click" may also become more likely. Naive Bayes ignores these correlations. It says:

P(x1, x2, x3 | C) ≈ P(x1 | C) P(x2 | C) P(x3 | C)

That is clearly wrong in strict probability terms. So why does it still work? Because classification only needs the ranking of class scores to be useful. Even if the probability is not perfectly calibrated, the right class often still gets the highest score. Especially in text. Words are correlated. But spam words still pile evidence in the right direction. The model often gets the prediction right even if the confidence score is too confident.

Three common versions

Gaussian Naive Bayes

Use when features are continuous. Assume each feature follows a Gaussian inside each class. Good for low-dimensional numeric data. Example: email length, number of links, sender reputation.

Multinomial Naive Bayes

Use when features are counts. Best known for text classification. Word counts are the killer app. If spam emails contain "free" and "win" often, Multinomial NB loves that.

Bernoulli Naive Bayes

Use when features are binary. Word present or absent. Clicked or not clicked. Link present or absent.

So what to remember?

- Gaussian → continuous values
- Multinomial → counts
- Bernoulli → binary indicators
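The three variants differ only in how a single feature's likelihood is computed. The sketch below shows one plausible per-feature form for each; the function names and parameters are illustrative assumptions, not any library's API.

```python
import math


def gaussian_likelihood(x, mean, var):
    """Density of a continuous value under a class-specific Gaussian."""
    return math.exp(-((x - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)


def multinomial_likelihood(count, p_word):
    """Class-dependent part of a word seen `count` times; p_word = P(word | class).

    The multinomial coefficient is omitted because it is the same for every class.
    """
    return p_word ** count


def bernoulli_likelihood(present, p_present):
    """Binary feature: absence contributes (1 - p), so it counts as evidence too."""
    return p_present if present else 1.0 - p_present


print(gaussian_likelihood(0.0, mean=0.0, var=1.0))
print(multinomial_likelihood(2, 0.5))       # 0.25
print(bernoulli_likelihood(False, 0.2))     # 0.8
```

One practical difference worth noting: the Bernoulli form penalizes a class when an expected word is missing, while the multinomial form simply ignores words with a count of zero.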

Worked example: spam classification with word counts

Let the vocabulary be only three words:

- free
- win
- meeting

Suppose the training set has:

- 4 spam emails
- 6 ham emails

So the priors are:

P(spam) = 4/10 = 0.4
P(ham)  = 6/10 = 0.6

Now count words across all spam emails. Spam totals:

- free = 8
- win = 5
- meeting = 1

Total spam word count inside this tiny vocabulary:

8 + 5 + 1 = 14

Ham totals:

- free = 1
- win = 0
- meeting = 9

Total ham word count:

1 + 0 + 9 = 10

We will classify the new email:

"free win"

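Step 1: use Laplace smoothing

Without smoothing, P(win | ham) would be zero, because "win" never appears in ham. That single zero factor would force the whole ham score to zero. So we add 1 to every word count and, to keep the probabilities summing to 1, add the vocabulary size V = 3 to each denominator. For spam: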

P(free | spam) = (8 + 1) / (14 + 3) = 9/17
P(win  | spam) = (5 + 1) / (14 + 3) = 6/17

For ham:

P(free | ham) = (1 + 1) / (10 + 3) = 2/13
P(win  | ham) = (0 + 1) / (10 + 3) = 1/13
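The smoothing step above can be sketched with exact fractions. The helper name `smoothed_likelihoods` and its `alpha` parameter are assumptions for illustration; the counts are the ones from the example, and the vocabulary is taken to be the keys of the count dictionary.

```python
from fractions import Fraction


def smoothed_likelihoods(word_counts, alpha=1):
    """Return P(word | class) with add-alpha smoothing over the vocabulary."""
    vocab_size = len(word_counts)
    total = sum(word_counts.values())
    return {
        word: Fraction(count + alpha, total + alpha * vocab_size)
        for word, count in word_counts.items()
    }


spam_counts = {"free": 8, "win": 5, "meeting": 1}
ham_counts = {"free": 1, "win": 0, "meeting": 9}

p_spam = smoothed_likelihoods(spam_counts)
p_ham = smoothed_likelihoods(ham_counts)
print(p_spam["free"], p_spam["win"])  # 9/17 6/17
print(p_ham["free"], p_ham["win"])    # 2/13 1/13

# With alpha=0 (no smoothing) the unseen word gets probability zero in ham.
print(smoothed_likelihoods(ham_counts, alpha=0)["win"])  # 0
```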

Step 2: compute unnormalized scores

For spam:

score(spam)
= P(spam) · P(free | spam) · P(win | spam)
= 0.4 · (9/17) · (6/17)
≈ 0.0747

For ham:

score(ham)
= P(ham) · P(free | ham) · P(win | ham)
= 0.6 · (2/13) · (1/13)
≈ 0.0071

Step 3: compare

Spam score is about ten times larger. So the email is classified as spam. If we normalize:

P(spam | x) ≈ 0.0747 / (0.0747 + 0.0071) ≈ 0.91
P(ham  | x) ≈ 0.09

So the model says:

- prediction: spam
- confidence score: about 91%

That 91% is often overconfident. Naive Bayes is good at ranking, not always at calibration.
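Steps 2 and 3 can be checked in a few lines. This sketch plugs in the smoothed likelihoods from Step 1 as literals and uses exact fractions so no rounding creeps in before the final print; the variable names are illustrative.

```python
from fractions import Fraction
from math import prod

priors = {"spam": Fraction(4, 10), "ham": Fraction(6, 10)}
# Smoothed likelihoods from Step 1.
likelihoods = {
    "spam": {"free": Fraction(9, 17), "win": Fraction(6, 17)},
    "ham": {"free": Fraction(2, 13), "win": Fraction(1, 13)},
}
email = ["free", "win"]

# Naive step: multiply the prior by one likelihood per word.
scores = {
    c: priors[c] * prod(likelihoods[c][w] for w in email)
    for c in priors
}
total = sum(scores.values())
posteriors = {c: float(scores[c] / total) for c in scores}
prediction = max(scores, key=scores.get)

print({c: round(float(s), 4) for c, s in scores.items()})  # {'spam': 0.0747, 'ham': 0.0071}
print(prediction, round(posteriors["spam"], 2))            # spam 0.91
```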

Why Naive Bayes wins on text

Text has three properties Naive Bayes likes.

1. Huge feature dimension.
2. Very sparse vectors.
3. Tiny labeled datasets are common.

Each word contributes a small likelihood update, so you do not need heavy optimization or gradient descent. Training is one counting pass over the data, so it often finishes in seconds. Also, rare words are highly informative. "Lottery" and "unsubscribe" push hard toward spam. "Schedule" and "meeting" push toward ham. So even the naive independence assumption gets the class ranking right surprisingly often.
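To make "no heavy optimization" concrete, here is a minimal sketch of training as pure counting. The `train` function and the three-document corpus are hypothetical, invented for illustration, and tokenization is just lowercasing plus whitespace splitting.

```python
from collections import Counter, defaultdict


def train(labeled_docs):
    """Count class frequencies and per-class word frequencies in one pass."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    for label, text in labeled_docs:
        class_counts[label] += 1
        word_counts[label].update(text.lower().split())
    return class_counts, word_counts


# Hypothetical toy corpus.
docs = [
    ("spam", "free lottery win free"),
    ("spam", "win free prize"),
    ("ham", "meeting schedule tomorrow"),
]
class_counts, word_counts = train(docs)
print(class_counts["spam"], word_counts["spam"]["free"])  # 2 3
print(word_counts["ham"]["free"])                         # 0
```

Priors come from `class_counts` and smoothed likelihoods from `word_counts`; there is no loss function and no iteration.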

Why the independence assumption can still hurt

Now the failure mode. If features are strongly correlated, the model double-counts evidence. Example:

- word = "offer"
- bigram = "special offer"

These are related clues. Naive Bayes may treat them as two separate votes. Then the confidence score becomes too extreme. Also, if the real boundary depends on interactions like:

- free AND click together matter
- neither alone matters much

Naive Bayes struggles. It does not model interactions naturally.
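The double-counting effect is easy to see numerically. In this sketch, feeding "free" twice stands in for a second, perfectly correlated feature, which is a simplifying assumption. The likelihoods are the smoothed values from the worked example, and `spam_posterior` is an illustrative helper.

```python
def spam_posterior(words, priors, likelihoods):
    """Normalized naive Bayes posterior for the 'spam' class."""
    scores = {}
    for c in priors:
        score = priors[c]
        for w in words:
            score *= likelihoods[c][w]
        scores[c] = score
    return scores["spam"] / sum(scores.values())


priors = {"spam": 0.4, "ham": 0.6}
likelihoods = {
    "spam": {"free": 9 / 17, "win": 6 / 17},
    "ham": {"free": 2 / 13, "win": 1 / 13},
}

once = spam_posterior(["free", "win"], priors, likelihoods)
# Same clue fed twice: no new information, but the model counts a second vote.
twice = spam_posterior(["free", "free", "win"], priors, likelihoods)
print(round(once, 3), round(twice, 3))  # 0.913 0.973
```

The prediction stays the same, but the confidence climbs even though nothing new was learned.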

A practical trick: work in log space

Multiplying many tiny probabilities underflows. So real systems usually add logs.

log score(C) = log P(C) + Σ log P(x_j | C)

Same ranking. Better numerics. Interviewers love this answer because it sounds shipped, not memorized.
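A small sketch of the underflow problem and the log-space fix follows. The 200 identical likelihoods are a hypothetical stand-in for a long email. The `normalize_log_scores` helper goes beyond the formula above: it uses the standard log-sum-exp trick, added here as an assumption about how one might recover posteriors from log scores.

```python
import math

# Hypothetical: 200 word likelihoods of 0.01 each for one class.
probs = [0.01] * 200
prior = 0.4

product = prior * math.prod(probs)  # underflows in float64
log_score = math.log(prior) + sum(math.log(p) for p in probs)

print(product)    # 0.0
print(log_score)  # a finite negative number


def normalize_log_scores(log_scores):
    """Turn per-class log scores into posteriors via log-sum-exp."""
    top = max(log_scores.values())
    # Subtracting the max keeps every exponent <= 0, so exp() cannot overflow.
    shifted = {c: math.exp(s - top) for c, s in log_scores.items()}
    total = sum(shifted.values())
    return {c: v / total for c, v in shifted.items()}


print(normalize_log_scores({"spam": -1000.0, "ham": -1002.0}))
```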

Naive Bayes vs Logistic Regression

Both are linear classifiers in the common cases: Multinomial and Bernoulli Naive Bayes produce a log-odds score that is linear in the features, just like logistic regression. (Gaussian Naive Bayes with per-class variances gives a quadratic boundary instead.) Naive Bayes learns P(features | class) and applies Bayes. Logistic regression learns P(class | features) directly. So Naive Bayes is generative; logistic regression is discriminative. Naive Bayes converges fast on tiny sparse data. Logistic regression usually wins once you have enough data because it does not assume feature independence. For very small bag-of-words datasets, try Naive Bayes first. For most other problems, try logistic regression.
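One way to see the kinship is to rewrite Multinomial Naive Bayes as a bias plus one weight per word. The sketch below does this with the smoothed likelihoods from the worked spam example, including "meeting" at 2/17 and 10/13; the names `bias`, `weights`, and `spam_log_odds` are illustrative.

```python
import math

priors = {"spam": 0.4, "ham": 0.6}
likelihoods = {
    "spam": {"free": 9 / 17, "win": 6 / 17, "meeting": 2 / 17},
    "ham": {"free": 2 / 13, "win": 1 / 13, "meeting": 10 / 13},
}

# A bias and one weight per word: the same shape as a logistic regression model.
bias = math.log(priors["spam"] / priors["ham"])
weights = {
    w: math.log(likelihoods["spam"][w] / likelihoods["ham"][w])
    for w in likelihoods["spam"]
}


def spam_log_odds(word_counts):
    """Linear score: bias + sum of count * weight."""
    return bias + sum(n * weights[w] for w, n in word_counts.items())


log_odds = spam_log_odds({"free": 1, "win": 1})
p_spam = 1 / (1 + math.exp(-log_odds))  # sigmoid of the log-odds
print(round(p_spam, 2))  # 0.91
```

The functional form matches logistic regression. The difference is how the weights are obtained: here each comes from counts in closed form, while logistic regression fits them jointly by optimization.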

When Naive Bayes wins and loses

Use it for tiny datasets, sparse text, fast baselines, and quick interpretable word-level evidence. It struggles with strongly correlated features, interaction-heavy boundaries, tasks that need calibrated probabilities, and dense continuous problems. In those cases, logistic regression, boosting, or deep models usually do better.

Where this lives in the wild

Gmail and Outlook spam filtering. Word-count signals and sender-pattern counts make Naive Bayes a classic baseline and, in some simple pipelines, still a useful production component.
Zendesk ticket routing. Tiny labeled support datasets and sparse word features make Multinomial NB a fast first classifier for issue categories.
Reuters and Bloomberg news tagging. Short-text topic tagging with bag-of-words features often starts with linear baselines like Naive Bayes because setup cost is tiny.
Shopify review moderation. Sparse phrase counts for “refund”, “fake”, or “broken” are enough for a quick NB-style triage model before heavier moderation systems take over.
On-device email classification. When memory and CPU are tight, Naive Bayes remains attractive because training and inference are both lightweight.

Interview Q&A

Q: Why does Naive Bayes work even though the independence assumption is wrong?
A: Because classification cares about comparing class scores, not building a perfect joint distribution. Even when features are correlated, the correct class often still gets the highest score because the evidence piles up in the right direction. The model is bad probability theory but often good decision ranking.
Common wrong answer to avoid: "Because the features are actually independent in text." They are not. Words are highly correlated.

Q: When do you use Gaussian vs Multinomial vs Bernoulli Naive Bayes?
A: Gaussian for continuous numeric features, Multinomial for counts like bag-of-words, and Bernoulli for binary present/absent indicators. The right version depends on how the feature values are generated. It is a data-model match question, not a preference question.
Common wrong answer to avoid: "Multinomial is always best for text." If features are binary presence flags, Bernoulli can be better.

Q: Why do we need Laplace smoothing?
A: Because an unseen feature in a class would otherwise get probability zero, and one zero term kills the whole product. Smoothing prevents brittle certainty from tiny sample counts. It is especially important with rare words and small datasets.
Common wrong answer to avoid: "Only to improve accuracy." The primary reason is numerical and statistical robustness.

Q: When would you pick Naive Bayes over logistic regression?
A: When training data is tiny, when features are sparse, high-dimensional counts as in bag-of-words text, or when you need a fast baseline with almost no tuning. Naive Bayes trains in one counting pass; logistic regression needs iterative optimization. On small sparse data, a simpler, higher-bias model can beat a more flexible one.
Common wrong answer to avoid: "Naive Bayes is always worse because of the independence assumption." On small data with many features, Naive Bayes often beats logistic regression because each parameter is estimated separately from simple counts, which needs far less data than fitting all weights jointly.

Apply now (5 min)

Take the spam example. By hand, recompute:

1. P(free | spam) with Laplace smoothing.
2. P(win | ham) with Laplace smoothing.
3. The two unnormalized class scores.

Then answer:

- Why would the ham score become zero without smoothing?
- Why can the model still classify well even if the final 91% is badly calibrated?

Then, without looking, sketch from memory:

1. Prior → likelihood → posterior.
2. One sentence on why "naive" is wrong but useful.
3. The three variants: Gaussian, Multinomial, Bernoulli.

If you can do all three in 90 seconds, you own Naive Bayes.

Bridge. We have now covered the classifier families. But sometimes you do not want to classify at all — you want to compress. The next file shrinks dimensions while keeping signal. Read 18-dimensionality-reduction.md next.