11. Gradient boosting and XGBoost — sequential trees correcting residuals¶
A different ensemble shape. Not a voting panel. A chain of shallow trees where each one fixes the price errors left by the previous ones.
Built on the ELI5 in
00-eli5.md. The voting panel returns here in a new shape — sequential weak models, prediction errors, and underfitting cured one residual at a time.
The picture before the math¶
See. Random forest was a panel — many trees, each looking at the house independently, then voting. Independent biases cancel out. Variance reduces.
Now imagine a harder house-pricing dataset. One tree alone cannot price well. The voting panel helps, but the average still leans toward the easy middle and misses the hard signal.
So we change the protocol.
The first tree makes a quick, rough prediction. Then the second tree looks only at where tree 1 was wrong, and writes a correction. The third tree looks at where trees 1+2 are still wrong, and writes another correction. And so on.
House → Tree 1 (rough price) → residual
→ Tree 2 (corrects price error) → smaller residual
→ Tree 3 (corrects again) → smaller still
→ ...
This is gradient boosting. The voting panel from ELI5, but sequential and correcting instead of parallel and voting.
Each tree is weak — a stump or a shallow tree, depth 3 to 6. Alone, useless. Chained with corrections, brutal.
What "fitting the residual" means¶
Residual = y - prediction so far. The mistake the ensemble has not yet fixed.
We do not train each tree on the original target. We train it on what is left over after the previous trees did their work. Each tree's job is to predict the gap.
The ensemble is a sum:
lr is the learning rate — usually 0.01 to 0.1. It says "trust each new tree only a little". Small steps, many of them.
Why small? Because if each correction is too aggressive, the chain overshoots. The ensemble flips between over-corrections like a panicky junior. Small steps let the chain settle.
Worked example — three rounds, real numbers¶
Three houses. True sale prices y = [90, 100, 140] lakh. Learning rate lr = 0.5 (large for visibility).
Round 0 — baseline. Predict the mean price for every house.
| House | y (₹L) | Pred₀ | Residual (price error) |
|---|---|---|---|
| 1 | 90 | 110.0 | -20.0 |
| 2 | 100 | 110.0 | -10.0 |
| 3 | 140 | 110.0 | +30.0 |
Round 1 — fit tree₁ on residuals. Tree₁ outputs [-18, -8, +28]. Update.
| House | y (₹L) | Pred₁ | New residual |
|---|---|---|---|
| 1 | 90 | 101.0 | -11.0 |
| 2 | 100 | 106.0 | -6.0 |
| 3 | 140 | 124.0 | +16.0 |
Errors shrunk. Not zero. Now another tree.
Round 2 — fit tree₂ on the new residuals. Tree₂ outputs [-10, -6, +14].
| House | y (₹L) | Pred₂ | New residual |
|---|---|---|---|
| 1 | 90 | 96.0 | -6.0 |
| 2 | 100 | 103.0 | -3.0 |
| 3 | 140 | 131.0 | +9.0 |
Round 3 — tree₃ outputs [-6, -2, +8].
Residuals now [-3, -2, 5] lakh. Three weak trees, each useless alone, together closing the price gap.
Run 100 such rounds with lr = 0.05, and the ensemble nails the patterns the single tree would have missed.
Pause and recall. Without scrolling — what does each tree fit? Why does the sum work? How is this chain different from the random-forest voting panel?
Random forest vs gradient boosting — the contrast¶
Same building block (decision tree). Opposite philosophy.
| Aspect | Random forest | Gradient boosting |
|---|---|---|
| Trees built | In parallel, independently | Sequentially, each fixes the last |
| Each tree sees | Bootstrap sample + feature subset | Residuals of current ensemble |
| Each tree is | Deep (low bias, high variance) | Shallow (depth 3-6) |
| Reduces | Variance (averaging cancels noise) | Bias (each step closes a gap) |
| Underfitting | Rarely — voting panel is rich | Real risk if too few trees |
| Overfitting | Hard to hit — averaging protects | Easy — too many trees memorize |
| Tuning | Forgiving | Sensitive (lr × n_trees × depth) |
Random forest is the voting panel — diverse biases cancel. Gradient boosting is the chain of correctors — each one chases what is left.
For most tabular problems with rich interactions, the chain wins. The panel is steadier; the chain is sharper.
How the chain breaks — three numerical scenarios¶
The "X breaks" claim. Boosting overfits when the chain is too aggressive and too long. Three regimes, same dataset, same depth (4):
Scenario A — lr = 0.5, 1000 trees → overfits hard¶
Big steps, lots of them. The chain memorizes noise.
train RMSE val RMSE
50 0.42 0.55
200 0.18 0.61
500 0.04 0.78 ← val rising while train collapses
1000 0.01 0.94 ← classic overfitting
Train looks perfect. Val is worse than the baseline at round 1000. The chain memorized rare sale quirks. That is overfitting, exact.
Scenario B — lr = 0.05, 100 trees → underfits¶
Small steps, too few. The chain stops before reaching the signal.
Both errors high. Both still falling. We pulled the plug too soon. Underfitting — the model never finished pricing.
Scenario C — lr = 0.05, 1000 trees → good¶
Small steps, enough of them. The chain converges and sits.
train RMSE val RMSE
50 0.71 0.74
200 0.41 0.48
500 0.22 0.39
1000 0.15 0.38 ← val plateaued; train still falling but val stable
Val flattens at the right point. With early stopping on a held-out set, the chain stops itself when val stops improving.
The rule. Lower lr × more trees × early stopping ≈ best of both worlds. High lr + many trees is the trap. Many lr × n_trees combinations work; the product matters more than either alone.
XGBoost, LightGBM, CatBoost — the modern flavors¶
Vanilla gradient boosting works but is slow. The modern libraries add engineering and regularization.
- XGBoost. Second-order gradients (uses both first and second derivatives of the loss for sharper updates). Built-in L1/L2 regularization on leaf weights. Sparse-aware splits — missing values get their own default direction at each split. Histogram-based split finding.
- LightGBM. Histogram + leaf-wise growth (grows the leaf with biggest loss reduction, not level-wise). Faster on large data. GOSS sampling — keeps high-gradient rows, samples low-gradient ones.
- CatBoost. Native categorical handling without one-hot. Ordered boosting to avoid target leakage in categories. Fewer footguns when columns are categorical-heavy.
All three: same chain-of-correctors core. Different engineering, different defaults.
Key hyperparameters — what each knob does¶
| Knob | What it controls | Typical |
|---|---|---|
learning_rate (eta) |
Trust per tree | 0.01-0.1 |
n_estimators |
Length of the chain | 100-5000 (use early stopping) |
max_depth |
Tree complexity per step | 3-8 |
subsample |
Row fraction per tree | 0.6-1.0 |
colsample_bytree |
Column fraction per tree | 0.6-1.0 |
min_child_weight |
Min samples to split | 1-10 |
reg_lambda / reg_alpha |
L2 / L1 on leaf weights | 0-10 |
The trick — tune lr low, set n_estimators high, use early stopping on a validation set. The library finds the right chain length for you.
Feature importance and SHAP — what the ensemble used¶
Gradient boosting gives feature importance by tracking average gain per split. Good for a fast global ranking: square footage, neighbourhood score, roof age, distance to metro.
But feature importance is global only. It tells you what mattered on average, not why this one house got this prediction.
For production explanations, use SHAP. It gives per-prediction attributions that add up to the house's predicted price shift relative to baseline. On tree ensembles, shap.TreeExplainer is fast and is the standard choice.
Where this lives in the wild¶
- Kaggle tabular dominance. XGBoost and LightGBM win or place in nearly every tabular competition. From 2014 onward the winners' write-ups are gradient boosting + careful feature engineering, almost without exception.
- Stripe Radar — fraud detection. Gradient-boosted trees score every card swipe in milliseconds. Each transaction's risk is a sum of corrections — amount × merchant × geo × velocity interactions that linear models cannot catch.
- Uber and Lyft ETA prediction. LightGBM models estimate trip time from origin, destination, time of day, weather, and traffic features. Boosted trees handle the threshold effects (rush hour vs not) that linear regression smears out.
- Yandex search ranking and Bing ads click prediction. Both ship gradient-boosted trees (CatBoost was born inside Yandex, LambdaMART inside Microsoft) for learning-to-rank — predicting which result a user clicks. Click prediction is XOR-shaped: query × document × user features interact richly.
- Airbnb dynamic pricing. Boosted trees estimate nightly price from location, season, lead time, events, host quality, and recent booking pace. Same city, same size, different weekend demand — the residual-correction chain catches it.
Interview Q&A¶
Q: Why does gradient boosting usually beat random forest on tabular tasks?
A: Random forest reduces variance by averaging high-variance trees. Gradient boosting reduces bias by chaining corrections — each tree fits what the ensemble still gets wrong. Tabular data with rich feature interactions has bias the panel cannot reach but the chain can. Boosting also tunes the bias-variance tradeoff finely via lr, depth, and length.
Common wrong answer to avoid: "boosting uses more trees so it's better". Both can use thousands of trees. The difference is sequential correction vs parallel averaging — different mathematical objectives.
Q: What is the role of the learning rate in boosting?
A: It scales each tree's contribution. Small lr (e.g., 0.05) means each correction nudges gently, so we need many trees but the ensemble settles smoothly. Large lr (e.g., 0.5) means each tree shifts predictions a lot, causing oscillation and faster overfit. Lower lr + more trees + early stopping is the standard recipe.
Common wrong answer to avoid: "learning rate doesn't matter, just train more trees". Wrong. With lr = 0.5 and 1000 trees, you overfit hard. The product lr × n_trees is what matters, with low lr giving a smoother optimum.
Q: Why is XGBoost faster than vanilla gradient boosting?
A: Three reasons. First, histogram-based split finding bins continuous features so each split tries ~256 thresholds instead of every unique value. Second, second-order gradients (using the Hessian) give a cleaner closed-form for leaf weights and split scores, so fewer trees are needed. Third, sparse-aware splits handle missing values and zeros without iterating over them. LightGBM goes further with leaf-wise growth and gradient-based row sampling.
Common wrong answer to avoid: "because it is written in C++." The speedup is not just implementation language. The algorithm itself does less work per split and regularizes more intelligently.
Q: When should you NOT reach for boosting?
A: Three cases. Raw images, audio, or text — where representation learning matters and CNNs/transformers dominate. Very small datasets (< 1000 rows) — boosting overfits and a regularized linear model with hand features wins. Hard latency budgets at huge scale — a deep boosted ensemble of 1000 trees can be slower than a quantized linear model or a small neural net on GPU.
Common wrong answer to avoid: "always use XGBoost for tabular". Almost true, not exactly. For very small data or extreme latency, simpler wins. For high-dimensional structured signals (images, sequences), deep nets win. Match model to data geometry.
Apply now (5 min)¶
Take a notebook. Three houses. Targets y = [90, 100, 140] lakh. Start with baseline prediction = mean = 110 for all rows.
Round 1 — residuals are [-20, -10, 30]. Suppose tree₁ predicts [-16, -8, 24]. With lr = 0.5, compute new predictions and new residuals.
Round 2 — fit a tree to the new residuals. Pick reasonable outputs. Compute again.
Round 3 — repeat. Watch the price errors shrink toward zero.
Then, without looking — sketch from memory:
- The chain-of-trees picture (three boxes, residuals flowing).
- The formula
F(x) = tree_1 + lr · tree_2 + lr · tree_3 + .... - The three-scenario table (
lr×n_trees→ over/under/good). - One sentence: how is this chain different from random forest's voting panel?
If you can do all four in 90 seconds, you own gradient boosting.
Bridge. The chain finds the right answer — but only if the validation gate is honest. A leaky split tells you the chain converged when it actually memorized. Next we set up the splits that mimic deployment. Read
12-evaluation-cv.md.