10. Random forests: the voting panel, made literal
Why averaging many noisy trees beats one careful tree. Three minutes. One picture. The variance trick.
Built on the ELI5 in 00-eli5.md. The voting panel is this whole file. Every random forest is that panel, made out of trees instead of people.
The picture before the math
See. One decision tree is a clever but jumpy model. Move two applicants in or out of the training set, and the tree reshuffles its splits completely. High variance. Overfitting lives one prune-step away.
Now think about a loan desk. Five underwriters, each biased by a different lens. None is reliable alone. But their average is steadier than any one of them — if their mistakes are independent.
That is the whole idea of a random forest. Build many trees. Make sure they disagree for honest reasons, not the same reason. Average their votes. The voting panel is calmer than any member.
    single tree (one tree)              forest (voting panel)
    prediction wobbles applicant-       each tree wobbles, but their
    to-applicant as data shifts         average is much smoother

    ^                                   ^
    | wobble                            | tiny wobble
    |   /\    /\                        |  _/\__/\___
    |  /  \  /  \                       |
    +-------------> applicants          +-------------> applicants
Same bias. Way less variance. That is the trade we are buying.
Why averaging works, but only if mistakes are independent
Pure math fact. If you average N predictions, each with variance σ², all independent:

$$\mathrm{Var}(\text{average}) = \frac{\sigma^2}{N}$$

Five independent trees with σ² = 0.04 give an average with variance 0.008. Five times calmer.
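But trees trained on the same data are not independent. They share most splits. Averaging correlated noise barely helps. The corrected formula:

$$\mathrm{Var}(\text{average}) = \rho\,\sigma^2 + \frac{1-\rho}{N}\,\sigma^2$$

ρ is the average pairwise correlation between trees. If ρ = 1 (identical trees), Var(average) = σ²: averaging gave us nothing. If ρ = 0, we get the full σ² / N discount. For anything in between, ρσ² is a floor that adding more trees cannot remove.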
So the engineering goal is brutally simple. Push ρ down. Make the trees disagree on purpose. That is what bootstrap sampling and random feature subsets do.
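A minimal sketch of the variance-of-an-average formula, Var(average) = ρσ² + (1 − ρ)σ²/N, in Python. The function name `variance_of_average` is illustrative, and the setup (every tree has the same variance, every pair of trees has the same correlation) is the simplifying assumption the formula itself makes.

```python
def variance_of_average(sigma2: float, n_trees: int, rho: float = 0.0) -> float:
    """Variance of the mean of n_trees predictions.

    Assumes every prediction has variance sigma2 and every pair of
    predictions has the same correlation rho (a simplification).
    """
    if n_trees < 1:
        raise ValueError("n_trees must be at least 1")
    # rho * sigma2 is the floor that no amount of averaging removes;
    # (1 - rho) * sigma2 / n_trees is the part that shrinks with more trees.
    return rho * sigma2 + (1.0 - rho) * sigma2 / n_trees


if __name__ == "__main__":
    for rho in (0.0, 0.2, 0.6, 1.0):
        value = variance_of_average(0.04, 5, rho)
        print(f"rho={rho:.1f}  Var(average)={value:.4f}")
```

With σ² = 0.04 and five trees, ρ = 0 gives 0.008 and ρ = 1 gives back the full 0.04, which is the whole argument for pushing ρ down.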
"Without diversity, averaging is useless" — three numerical attempts¶
Let us see it in numbers (illustrative ones, not a fitted model). Start with the same data, same algorithm, no randomness. Then add bootstrap. Then add feature subsets. Watch the trees decorrelate and the variance of the average drop.
We have a tiny dataset of 6 loan applicants. The forest has 3 trees. Each tree, when built, gives a probability of deny for some new applicant X.
Attempt 1: same data, same features (no randomness)
Every tree sees the full dataset and the full feature list. The tree-building algorithm is deterministic given the same inputs.
| Tree | Saw | Predicted P(deny) for X |
|---|---|---|
| T1 | full data, all features | 0.80 |
| T2 | full data, all features | 0.80 |
| T3 | full data, all features | 0.80 |
| Average | | 0.80 |
Three identical trees. Variance across them is zero. Average equals each. We did three times the work for nothing. ρ = 1. Useless.
Attempt 2: bootstrap samples only (bagging)
Each tree gets a different bootstrap sample: draw 6 applicants with replacement from the 6. On average only about two-thirds of the distinct applicants show up in a given sample (about 63% for large datasets); the remaining slots are duplicates. Different trees, different splits.
| Tree | Saw | Predicted P(deny) for X |
|---|---|---|
| T1 | bootstrap A, all features | 0.85 |
| T2 | bootstrap B, all features | 0.70 |
| T3 | bootstrap C, all features | 0.75 |
| Average | | 0.767 |
Now there is real disagreement. But the trees still picked from the same full feature menu. If one feature dominates (say credit_score), every tree splits on it first. Trees stay correlated. ρ is maybe 0.6. Better, but not great.
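Bootstrap sampling is easy to sketch with the standard library. The helper names below are illustrative, and the row indices 0 to 5 stand in for the six applicants. The second helper computes the expected share of distinct rows in one sample, 1 − (1 − 1/n)ⁿ, which is about 0.67 for n = 6 and settles near 0.63 as n grows.

```python
import random


def bootstrap_indices(n_rows: int, rng: random.Random) -> list[int]:
    """Draw n_rows row indices with replacement from range(n_rows)."""
    return [rng.randrange(n_rows) for _ in range(n_rows)]


def expected_unique_fraction(n_rows: int) -> float:
    """Expected share of distinct rows that appear in one bootstrap sample."""
    # A given row is missed by all n_rows draws with probability (1 - 1/n)^n.
    return 1.0 - (1.0 - 1.0 / n_rows) ** n_rows


rng = random.Random(0)  # fixed seed so the sketch is repeatable
for name in ("A", "B", "C"):
    sample = bootstrap_indices(6, rng)
    print(f"bootstrap {name}: rows {sorted(sample)}, distinct = {len(set(sample))}")

print(f"expected distinct share, n=6:    {expected_unique_fraction(6):.3f}")
print(f"expected distinct share, n=1000: {expected_unique_fraction(1000):.3f}")
```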
Attempt 3: bootstrap + random feature subsets at each split
Now at every split, the tree is only allowed to consider, say, √p features chosen at random. Sometimes credit_score is in the menu. Sometimes it is not — and the tree must use income or employment_years instead. Trees take genuinely different shapes.
| Tree | Saw | Predicted P(deny) for X |
|---|---|---|
| T1 | bootstrap A, random features at each split | 0.90 |
| T2 | bootstrap B, random features at each split | 0.55 |
| T3 | bootstrap C, random features at each split | 0.72 |
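| Average | | 0.723 |
The most disagreement of the three attempts. Now ρ might be 0.2. Individual trees spread out more, but because their errors overlap less, the variance of the average drops the most. This is what RandomForestClassifier does by default.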
The lesson — bootstrap alone helps, but feature subsets at each split are what truly decorrelate the trees. Both knobs together = the voting panel with genuinely different lenses, not five trees staring at the same credit report.
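One way to check the decorrelation claim on your own data is to measure how correlated the individual trees' outputs are. The sketch below assumes scikit-learn and NumPy are installed; the synthetic dataset, the helper name `mean_tree_correlation`, and the `max_depth=4` setting (used only so that trees emit graded probabilities) are illustrative choices, not part of the three-attempt example. The size of the gap between the two settings depends on the data, so treat the printed numbers as something to inspect rather than a guaranteed result.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier


def mean_tree_correlation(forest, X) -> float:
    """Average pairwise correlation between the trees' P(class 1) outputs."""
    per_tree = np.array([tree.predict_proba(X)[:, 1] for tree in forest.estimators_])
    corr = np.corrcoef(per_tree)                   # trees x trees matrix
    upper = corr[np.triu_indices_from(corr, k=1)]  # count each pair once
    return float(np.nanmean(upper))                # skip any constant-output tree


# Synthetic stand-in for loan data (an assumption, not a real dataset).
X, y = make_classification(n_samples=600, n_features=16, n_informative=4,
                           random_state=0)
X_train, y_train, X_test = X[:400], y[:400], X[400:]

settings = {
    "bagging only (all features per split)": None,
    "random forest (sqrt features per split)": "sqrt",
}
for label, max_features in settings.items():
    forest = RandomForestClassifier(n_estimators=50, max_features=max_features,
                                    max_depth=4, random_state=0)
    forest.fit(X_train, y_train)
    rho_hat = mean_tree_correlation(forest, X_test)
    print(f"{label}: mean tree correlation = {rho_hat:.2f}")
```

Setting `max_features=None` keeps the bootstrap but lets every split see every feature, which corresponds to Attempt 2; `max_features="sqrt"` corresponds to Attempt 3.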
Worked example: 5 trees, one applicant, the panel vote
Applicant X arrives. Five trees in the forest. Each gives a probability of deny.
| Tree | P(deny) for X |
|---|---|
| T1 | 0.90 |
| T2 | 0.70 |
| T3 | 0.40 |
| T4 | 0.80 |
| T5 | 0.60 |
| Forest average | 0.68 |
Spread of single trees: 0.40 to 0.90. That is a 0.50 swing. One tree alone could push the applicant into the wrong bucket. The average lands at 0.68 — past the usual 0.5 threshold, so the panel says "deny", confidently but not absurdly.
Now compute the variance reduction. Sample variance of the 5 tree predictions:
mean = 0.68
deviations = [+0.22, +0.02, -0.28, +0.12, -0.08]
squared = [0.0484, 0.0004, 0.0784, 0.0144, 0.0064]
sum = 0.148
σ²(single) = 0.148 / 4 = 0.037
If trees were perfectly independent, Var(average) = 0.037 / 5 = 0.0074. Five times calmer than a single tree.
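In practice trees are correlated. With realistic ρ ≈ 0.3, the formula gives:

$$\mathrm{Var}(\text{average}) = \rho\,\sigma^2 + \frac{1-\rho}{N}\,\sigma^2 = 0.3 \times 0.037 + \frac{0.7 \times 0.037}{5} \approx 0.0163$$

Still less than half the single-tree variance. Same bias, much less wobble. The voting panel in numbers.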
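The arithmetic in this worked example can be reproduced with the standard library. This is a sketch of the hand calculation only; ρ = 0.3 is the same assumed correlation used in the text, not something estimated from data.

```python
from statistics import mean, variance

tree_probs = [0.90, 0.70, 0.40, 0.80, 0.60]  # P(deny) from trees T1..T5
n_trees = len(tree_probs)

forest_avg = mean(tree_probs)
sigma2 = variance(tree_probs)  # sample variance, divides by n - 1

var_independent = sigma2 / n_trees
rho = 0.3  # assumed average correlation between trees
var_correlated = rho * sigma2 + (1 - rho) * sigma2 / n_trees

print(f"forest average          = {forest_avg:.2f}")
print(f"single-tree variance    = {sigma2:.4f}")
print(f"Var(average), rho = 0   = {var_independent:.4f}")
print(f"Var(average), rho = 0.3 = {var_correlated:.4f}")
print(f"share of single-tree variance left = {var_correlated / sigma2:.2f}")
```

The last line comes out at 0.44, so the correlated panel keeps a bit under half of the single-tree variance.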
Pause and recall. Without scrolling — what does bootstrap sampling do? What do random feature subsets do? Why does averaging not reduce bias? Which knob matters more for decorrelating trees?
Bias unchanged, variance crushed: the picture
Each tree is grown deep. Deep trees have low bias (they can capture any wiggle) and high variance (they wobble per applicant). Averaging deep, biased-the-same-way trees:
             bias                   variance
             ━━━━━━━━━━             ━━━━━━━━━━
    single  |▓▓▓▓▓▓▓▓▓▓| same      |▓▓▓▓▓▓▓▓▓▓| big
    forest  |▓▓▓▓▓▓▓▓▓▓| same      |▓▓▓▓|       small
This is why people say "random forest reduces variance, not bias". If every tree is systematically wrong in the same direction, the average is also wrong in that direction. Averaging cannot fix a bias that all trees share.
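A toy simulation makes the same point. It is not a real forest: each "tree" here is just the truth plus a bias shared by every tree plus its own independent Gaussian noise. The function name `simulate` and the numbers 0.10 and 0.20 are illustrative assumptions.

```python
import random


def simulate(n_trees: int, shared_bias: float, noise_sd: float,
             n_trials: int = 5000, seed: int = 0) -> tuple[float, float]:
    """Return (mean error, variance of error) for an average of n_trees.

    Toy model: every tree's error is the same shared_bias plus its own
    independent Gaussian noise with standard deviation noise_sd.
    """
    rng = random.Random(seed)
    errors = []
    for _ in range(n_trials):
        tree_errors = [shared_bias + rng.gauss(0.0, noise_sd) for _ in range(n_trees)]
        errors.append(sum(tree_errors) / n_trees)  # the panel's averaged error
    mean_error = sum(errors) / n_trials
    var_error = sum((e - mean_error) ** 2 for e in errors) / n_trials
    return mean_error, var_error


single_bias, single_var = simulate(n_trees=1, shared_bias=0.10, noise_sd=0.20)
forest_bias, forest_var = simulate(n_trees=25, shared_bias=0.10, noise_sd=0.20)

print(f"single tree: mean error = {single_bias:.3f}, variance = {single_var:.4f}")
print(f"25 trees:    mean error = {forest_bias:.3f}, variance = {forest_var:.4f}")
```

The mean error stays near the shared bias of 0.10 in both runs, while the variance of the 25-tree average is a small fraction of the single-tree variance.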
Feature importance in a forest
Feature importance in a random forest averages the Gini-reduction across all trees.
That makes it more stable than single-tree importance because averaging reduces variance and softens some of the high-cardinality bias.
But impurity-based importance still over-rates features with many split opportunities.
So what to do in production? Prefer permutation importance: shuffle one feature's values on held-out data and measure how much the score drops.
In sklearn, model.feature_importances_ is impurity-based.
sklearn.inspection.permutation_importance is usually the better default.
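A short side-by-side of the two importance measures, assuming scikit-learn is installed. The synthetic dataset and the `feature_i` labels are placeholders; the sketch computes permutation importance on a held-out split, which is a common choice rather than a requirement of the API.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data standing in for a real tabular problem (an assumption).
X, y = make_classification(n_samples=500, n_features=6, n_informative=3,
                           n_redundant=0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

# Impurity-based importance: computed from the training-time splits.
impurity = model.feature_importances_

# Permutation importance: mean drop in score when one column is shuffled.
perm = permutation_importance(model, X_test, y_test, n_repeats=10, random_state=0)

for i, (imp, pm) in enumerate(zip(impurity, perm.importances_mean)):
    print(f"feature_{i}: impurity = {imp:.3f}  permutation = {pm:.3f}")
```

Impurity-based values sum to 1 across features; permutation values are score drops, so the two columns are on different scales and only their rankings are comparable.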
Where this lives in the wild
The voting panel ships in production at scale.
- scikit-learn RandomForestClassifier defaults. n_estimators=100, max_features='sqrt', bootstrap=True. The defaults are tuned to give you a working voting panel out of the box.
- Loan-approval screening. Banks use random forests as a stronger second-pass model after simple rules. The forest captures interactions between income, credit history, and employment stability.
- Fraud scoring. Card and payments teams use forests or forest-like bagged trees when they want robustness with minimal tuning. Averaging keeps false alarms steadier.
- Recommendation candidate screening. Teams use random forests to rank which users or items are worth sending into a more expensive downstream ranker.
- Anomaly detection — Isolation Forest. Same averaging trick applied differently. Many random partitioning trees vote on weirdness. SOC and fraud pipelines use it because it is fast and simple.
The pattern. Wherever you need a quick, dependable tabular model with stable ranking and useful importance hints, the voting panel earns its seat.
Interview Q&A
Q: Why does random forest reduce variance but not bias?
A: Averaging cancels the independent part of the noise across trees. With N trees and average correlation ρ, the variance falls from σ² to ρσ² + (1 − ρ)σ²/N: the uncorrelated part shrinks by 1/N and the correlated part stays. But every tree is biased in the same direction. Averaging same-direction errors does not cancel them. So the forest inherits the bias of one tree, with much less wobble.
Common wrong answer to avoid: "Random forest reduces both because it averages everything." It only reduces what differs across trees. Shared bias survives untouched.
Q: Why random feature subsets at each split, not just per tree?
A: Per-tree feature subsets give some diversity, but the dominant feature can still recolonize every level inside that tree. Subset-per-split forces every node to consider a fresh random menu — sometimes the strongest feature is not even available. That is what truly drops ρ.
Common wrong answer to avoid: "Per-tree is the same as per-split, just slower." It is not — the diversity collapses quickly if the dominant feature is always available.
Q: When does random forest beat gradient boosting?
A: Three cases. Very noisy targets where boosting would chase the noise. Situations where you do not want to tune learning rate and depth carefully. And workloads where parallel training across cores matters, because every RF tree is independent while boosting is sequential.
Common wrong answer to avoid: "boosting is always better." Boosting often wins on clean leaderboard-style data, but RF wins plenty of real jobs on robustness and simplicity.
Q: What is out-of-bag error and why is it free?
A: Each bootstrap sample leaves out about 37% of the original rows. That tree never saw those rows. So you can predict each row using only the trees that never saw it, then score those predictions: you get a held-out-style estimate without doing a separate cross-validation split. Free validation, baked into bagging.
Common wrong answer to avoid: "out-of-bag is just training accuracy with a fancy name." No — those rows were literally excluded from that tree's training sample.
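In scikit-learn the out-of-bag estimate is one constructor flag away. A minimal sketch, assuming scikit-learn is installed and using synthetic data as a placeholder; the held-out split is there only to compare against, since the OOB score does not need it.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic data (an assumption, chosen only to make the sketch runnable).
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# oob_score=True asks the forest to score each training row using only
# the trees whose bootstrap sample did not contain that row.
forest = RandomForestClassifier(n_estimators=200, oob_score=True,
                                bootstrap=True, random_state=0)
forest.fit(X_train, y_train)

print(f"OOB accuracy:      {forest.oob_score_:.3f}")
print(f"held-out accuracy: {forest.score(X_test, y_test):.3f}")
```

The two numbers are usually close but not identical, which is what you would expect from two different held-out-style estimates of the same model.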
Q: Does adding more trees to a random forest ever hurt?
A: No. Unlike boosting, random forest does not overfit just because you keep adding independent trees. More trees only smooth the estimate and reduce variance; the real tradeoff is compute and latency.
Common wrong answer to avoid: "more trees can overfit." That is boosting-style thinking, not bagging.
Apply now (5 min)
Hand-ensemble three toy predictions. Pretend three trees say 0.30, 0.55, 0.80 for some applicant.
- Compute the forest average. (Should be 0.55.)
- Compute the spread (max − min = 0.50). That is your single-tree variance band.
- Now imagine if all three trees had said 0.55, 0.55, 0.55 (no diversity). What is the average? 0.55. What did averaging buy you? Nothing. Same answer, three times the cost.
Now sketch from memory:
- The bias-variance picture — bias unchanged, variance crushed.
- The three-attempt table — same data → bootstrap → bootstrap + feature subsets, with the variance dropping.
- One sentence: why the voting panel only works when its members disagree for independent reasons.
If you can reproduce all three in 90 seconds, you own the variance-reduction story.
Bridge. Random forest crushed variance but left bias alone. So how do we attack bias? Build trees sequentially, each fixing what the last voting panel got wrong. That is gradient boosting. Read 11-gradient-boosting.md next.