09. Decision trees — branching boundaries from yes/no questions¶

A flowchart of threshold questions. Each split cuts the feature list into two cleaner groups. Stop when each leaf is mostly approve or deny.

Built on the ELI5 in 00-eli5.md. The feature list and overfitting both return here. The tree is the first model in this module that does not draw a single line — it draws a staircase of yes/no boxes.

The mental model — a flowchart, not a formula¶

See. A decision tree is a flowchart. At each step you ask one question about the feature list. Based on the answer, the applicant goes left or right. You ask another question. And another. Until the applicant lands at a leaf — a final box that says approve or deny.

Picture a loan desk. The underwriter walks the application through a small checklist:

                  [ credit_score > 700? ]
                  /                      \
                no                       yes
                /                          \
           [ deny ]                 [ income > 8? ]
                                    /             \
                                  no              yes
                                  /                 \
                       [ employment > 2? ]      [ approve ]
                       /                \
                     no                 yes
                     /                    \
                [ deny ]              [ approve ]

That is the whole model. No matrix multiplications. No gradients. Just a tree of threshold questions, where each split makes the two groups more pure than they were before.

So what is the math doing? The math only picks the next question. Out of all features and all possible thresholds, find the one split that separates classes best. Repeat for each child. Stop when nothing helps.

What "separates best" means — Gini impurity¶

We need a number that says "this group is mixed" or "this group is pure". The standard choice is Gini impurity:

Gini = 1 - Σ pᵢ²

where pᵢ is the fraction of class i in the group.

All one class → Gini = 0. Pure.
Half and half → Gini = 0.5. Maximally mixed.

Information gain (entropy-based) is the same idea with -Σ pᵢ log pᵢ instead. Same shape, same minimum at pure, same maximum at 50/50. Trees built with either look almost identical in practice.

Worked example — eight loan applicants, build the first split¶

Eight applicants. Three features: income (₹ lakhs), credit score, and employment years. Label: 1 = approve, 0 = deny.

#	income	credit_score	employment_years	approve?
1	4	620	1	0
2	5	640	2	0
3	6	660	3	0
4	7	690	4	0
5	8	710	3	1
6	9	730	5	1
7	10	760	6	1
8	11	780	8	1

Four approve, four deny. Parent group:

Gini_parent = 1 - (4/8)² - (4/8)² = 1 - 0.25 - 0.25 = 0.5

Maximally mixed. Now try two candidate splits.

Candidate A — `credit_score > 700`¶

Left (score ≤ 700): applicants 1,2,3,4 → 0 approve, 4 deny. Gini_L = 0. Right (score > 700): applicants 5,6,7,8 → 4 approve, 0 deny. Gini_R = 0.

Weighted average: (4/8)·0 + (4/8)·0 = 0. Gini gain = 0.5 - 0 = 0.5. Perfect split.

Candidate B — `income > 8`¶

Left (income ≤ 8): applicants 1,2,3,4,5 → 1 approve, 4 deny. Gini_L = 1 - (1/5)² - (4/5)² = 0.32. Right (income > 8): applicants 6,7,8 → 3 approve, 0 deny. Gini_R = 0.

Weighted average: (5/8)·0.32 + (3/8)·0 = 0.20. Gini gain = 0.5 - 0.20 = 0.30. Decent, but worse than A.

The tree picks A — credit_score > 700 as the root split. That is one full step of the algorithm. Recurse on each child the same way. Stop when a leaf is pure or hits the depth limit.

The boundary in 2D — axis-aligned rectangles¶

Each split is one vertical or horizontal line in feature space. Stack splits → you carve the plane into axis-aligned rectangles. Each rectangle = one leaf = one prediction.

credit_score
    ^
780 |        |   A    A
740 |        |   A    A
700 |________|____________
680 | D   D  |   D
620 | D   D  |
    +-----------------------> income (₹ lakhs)
      4   6   8   10   12

D = deny leaf, A = approve leaf. The boundary is a step function — only horizontal and vertical cuts. This is why a tree handles mixed types natively: a numeric question is "x > threshold?", a categorical question is "x ∈ {self_employed, salaried}?". Same yes/no shape. No scaling, no encoding, no dummy variables. Drop the feature list in. The tree picks what to ask.

Where trees break — overfitting, in numbers¶

A single deep tree is textbook overfitting from the ELI5. Give it depth and it will memorize. It will carve a private rectangle around every noisy applicant.

Take 100 applicants. The true rule: approve if credit_score > 700. But 5% of labels are noise. Train trees of different depths.

Depth	Splits	Train acc	Val acc	What's happening
1	1	0.93	0.94	Underfitting — one cut, captures the broad rule, ignores useful nuance
5	~30	0.97	0.95	Sweet spot — refines a bit, still general
20	~200	1.00	0.86	Overfitting — wraps each noisy point in its own leaf

At depth 20, the tree is the junior underwriter who memorized that "applicant #47 had credit score 701, employment 11 months, and still defaulted, so anyone like that exact row must be denied." Train accuracy 100%. Validation collapses. Pure overfitting.

Three rough fixes for a single tree: 1. max_depth — hard ceiling. Forces blunt rules. 2. min_samples_leaf — every leaf must contain ≥ N applicants. No leaf-of-one memorization. 3. Cost-complexity pruning (ccp_alpha) — penalize tree size, prune weakest splits.

But the real fix is the voting panel — many trees voting. That is the next file.

The diagonal problem — staircases approximating slopes¶

Trees draw axis-aligned boxes. What if the true boundary is a clean diagonal?

True boundary (diagonal)        Tree approximation (staircase)
credit_score ↑                  credit_score ↑
780 | A \                       780 | A |__
740 |    \ A                    740 |   | A |__
700 |     \                     700 |___|   |   |
660 | D    \                    660 | D |___|   |
620 |   D   \  A                620 |   D   |___| A
    +-----------> income            +-----------> income

A diagonal needs many axis-aligned cuts to approximate. Each cut is one extra split. The tree gets there, but spends depth on geometry instead of signal — and depth is exactly what brings overfitting. Linear models or a feature engineered as income × good_credit_flag would handle some of that shape in one shot.

So. Trees love thresholds and interactions. They struggle with smooth slopes.

Regression trees — same tree, numeric leaves¶

Trees predict numbers too. Replace Gini with MSE as the split criterion. Each leaf predicts the mean of the rows that land there, not a class vote. A house-price tree might split on square footage, then neighbourhood, then building age. Each final leaf just says: the average price of homes in this box is ₹78 lakh. Same staircase picture. Numeric prediction instead of approve/deny.

Feature importance — which columns did the tree keep using?¶

Each feature's importance is the total Gini reduction from all splits that used that feature. If credit_score creates the biggest purity gains across the tree, it gets the highest importance. Useful, yes — but be careful. High-cardinality features get unfairly high importance because they offer more split opportunities. So treat tree importance as a clue, not as courtroom evidence.

Pause and recall. Without scrolling — what does Gini impurity measure? How did the algorithm pick credit_score > 700 over income > 8? Why does a depth-20 tree on 100 noisy applicants hit 100% train but 86% validation? Where is overfitting in the staircase picture?

Where trees ship in production¶

Loan underwriting rules. Regulated lenders still deploy shallow decision trees because every path can be read aloud: income low, credit short, deny. Adverse-action notices become straightforward.
E-commerce recommendation rules. A shallow tree can route users into coarse intent buckets: high-value repeat buyer, discount hunter, window shopper. That simple branching often feeds the next ranking stage.
Insurance risk scoring. Trees split on age band, claim history, vehicle type, and geography to create explainable premium buckets.
sklearn.tree.DecisionTreeClassifier as a baseline. Every Lead AI Engineer on a new tabular project fits one shallow tree first. It tells you which features matter, what the gross interactions look like, and how much non-linearity the data already wants.

Interview Q&A¶

Q: Why do decision trees handle mixed numeric and categorical features natively? A: Every split is just a yes/no question. Numeric: income > 8? Categorical: employment_type ∈ {salaried, self_employed}? Both produce a binary partition. No scaling, no one-hot, no normalization. Gini only cares about class counts after the split. Common wrong answer to avoid: "trees ignore feature scale because they normalize internally." They do not normalize anything. They just compare one feature against one threshold at a time.

Q: Why doesn't a single decision tree win Kaggle competitions? A: One tree has high variance — small data shifts produce wildly different trees. It also draws axis-aligned boxes, which approximate diagonals poorly. Random forests average many trees to kill variance. Gradient boosting builds many shallow trees that correct each other. The voting panel beats one tree. Common wrong answer to avoid: "trees are obsolete." Boosted trees still dominate tabular ML. The weak point is the single deep tree, not tree methods as a family.

Q: What's the difference between Gini impurity and information gain (entropy)? A: Both measure mixed-ness of a node. Gini = 1 - Σ pᵢ², entropy = -Σ pᵢ log pᵢ. Both peak at 50/50 and bottom at pure leaves. Gini is faster because there is no log. In practice, the resulting trees are usually very similar. Common wrong answer to avoid: "entropy is always more accurate." It is usually just a slightly different scoring rule, not a magic upgrade.

Q: How do regression trees differ from classification trees? A: Replace Gini or entropy with MSE as the split criterion. Leaf predictions are means, not class votes. CART handles both — the R in CART is regression. A house-price tree is still a tree, just with numeric leaves. Common wrong answer to avoid: "you can't use trees for regression." You absolutely can — it is built into CART itself.

Q: A tree gives 100% train accuracy and 70% validation accuracy. What's the first knob you turn? A: Reduce max_depth and increase min_samples_leaf. The tree has memorized — it has carved a leaf around noisy rows. Cap depth at 4–6, set leaf size to 20+, retrain. If that still fails, switch to a random forest or gradient boosting and stop over-tuning one tree. Common wrong answer to avoid: "collect more data before touching the model." More data can help, yes, but the first move is to stop the obvious overfitting.

Apply now (5 min)¶

Take the eight applicants above. Cover the credit-score column. Using only income, hand-build a depth-2 tree. Pick the best income threshold for the root by trying > 7, > 8, > 9. Compute Gini for each. Pick the winner. For each child, pick a second income threshold. Write the four leaf predictions.

Then — without looking — sketch from memory:

The flowchart picture (3–4 splits, leaf labels).
The 2D axis-aligned rectangle picture.
One sentence: why depth-20 hits overfitting.

If you can reproduce all three in under 90 seconds, the mental model has stuck.

Bridge. One tree memorizes. Many trees, each on a different bootstrap sample and each looking at a random subset of features, vote together — and the votes cancel out the memorization. That is the voting panel. Read 10-random-forests.md next.