20. Honest admission — what classical ML cannot do well

The previous twenty files were "this breaks, here is the fix." This one is different. We sit with what classical ML is — and name what it is not.

Built on the ELI5 in 00-eli5.md. The feature list is good on structured data. The prediction is clear. The confidence score is usable. The voting panel is strong. But classical ML has walls. Some problems are too raw and messy for manual features. This file names those walls.


Why this file exists

See. Twenty chapters of fixes. Every chapter ended with a solution. The teaching was clean — something breaks, here is the tool, now you own it.

That cleanness hid something. Classical ML has real limits. Not "limits that a cleverer engineer could fix" — structural limits baked into what these models are.

A Lead AI Engineer who cannot name these limits ships classical ML into problems where it will fail quietly. The model trains, the metric looks decent, the A/B test is inconclusive, and six months later someone rewrites it with a neural network and gets a 15-point lift.

So we name the walls. Not to dismiss classical ML — it dominates tabular production to this day — but to know when to stop reaching for it.

From linear regression to SVM, from KNN to clustering — twenty chapters cover supervised and unsupervised, classification and regression, tabular and high-dimensional.


Wall 1 — the feature engineering bottleneck

Classical ML does not learn features. You hand it features. The model finds weights.

For tabular data — age, income, click count, temperature — this is fine. The features are already numbers. Maybe you engineer a few interactions (age × income, log(click_count)), but the heavy lifting is modest.
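
A minimal sketch of that modest lifting, assuming a record shaped like the tabular example in this file. The function name `engineer_features` and the field names are hypothetical. It uses `log1p` rather than a plain log so that a click count of zero does not break it.

```python
import math

def engineer_features(row):
    """Return the raw fields plus two hand-made features for one record."""
    age = row["age"]
    income = row["income"]
    clicks = row["clicks"]
    return {
        "age": age,
        "income": income,
        "clicks": clicks,
        "age_x_income": age * income,      # interaction term
        "log_clicks": math.log1p(clicks),  # log(1 + clicks), safe at zero
    }

row = {"age": 35, "income": 72000, "clicks": 14}
features = engineer_features(row)
print(features)
```

The raw columns were already usable. The two extra features are a few lines of arithmetic, not a research project.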

For images, text, audio — the features are raw pixels, raw characters, raw waveforms. What do you hand the model?

   tabular data                    image data
   ────────────                    ──────────
   age = 35                        pixel[0][0] = 142
   income = 72000                  pixel[0][1] = 138
   clicks = 14                     pixel[0][2] = 145
   city = "Seattle"                ...
                                   pixel[223][223] = 97
   → logistic regression works     → logistic regression on raw pixels?
                                     accuracy far below a ConvNet

The classical pipeline for images was: hand-engineer features (SIFT, HOG, SURF, color histograms), then feed the feature vector to an SVM or random forest. This worked, sort of. On the ImageNet challenge, the best hand-crafted-feature systems topped out at roughly 72 to 74% top-5 accuracy. Then a ConvNet (AlexNet, 2012) hit about 85% top-5 by learning the features from raw pixels.

The lesson. When the gap between raw input and useful features is large, classical ML needs a human to bridge it. Deep learning bridges it automatically. On structured tabular data, the gap is small — classical ML still wins. On images, text, and audio, the gap is a canyon.
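
To make "a human bridges the gap" concrete, here is a toy version of one hand-crafted image feature, an intensity histogram. It is a sketch, not how SIFT or HOG work. The function name, the bin count, and the tiny grayscale "images" are all assumptions for illustration.

```python
def intensity_histogram(image, bins=4, max_value=256):
    """Fraction of pixels in each of `bins` equal-width intensity ranges."""
    counts = [0] * bins
    total = 0
    bin_width = max_value / bins
    for row in image:
        for pixel in row:
            # Clamp so that pixel == max_value - 1 lands in the last bin.
            index = min(int(pixel / bin_width), bins - 1)
            counts[index] += 1
            total += 1
    return [count / total for count in counts]

# Two grayscale images of different sizes (nested lists of 0..255 values).
small = [[0, 64], [128, 255]]
large = [[10, 20, 30], [200, 210, 220], [100, 110, 120]]

print(intensity_histogram(small))
print(intensity_histogram(large))
```

Both images come out as a four-number vector that a classical model can consume. A human chose what to keep (brightness distribution) and what to throw away (all spatial layout). That choice is the bottleneck.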


Wall 2 — no representation learning

Related but deeper than feature engineering. Classical ML operates in the feature space you give it. It cannot discover that "this cluster of pixels is an edge" or "this sequence of words is sarcasm."

Representation learning — the ability to transform raw input into increasingly abstract, useful internal representations — is what neural networks do in their hidden layers. Each layer learns to represent the input at a higher level of abstraction.

   classical ML                  deep learning
   ──────────────                ──────────────
   raw input                     raw input
      ↓                             ↓
   YOU engineer features         layer 1: learns edges
      ↓                             ↓
   model finds weights           layer 2: learns textures
      ↓                             ↓
   prediction                    layer 3: learns parts
                                 layer 4: learns objects
                                 prediction

   you do the hard part          the network does the hard part

For tabular data, you are the representation. The features are the representation. So classical ML's lack of representation learning does not hurt — because you already did it.

For everything else — vision, NLP, speech, genomics on raw sequences — representation learning is the game. Classical ML sits out.
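
A small illustration of why the representation matters, using XOR. The hidden-layer weights below are set by hand, not learned, so this only shows what a good intermediate representation buys you. The grid search over linear rules is an assumption chosen to keep the sketch dependency-free.

```python
from itertools import product

XOR = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

def relu(z):
    return max(0.0, z)

def hidden(x):
    """Hand-set hidden layer: two ReLU units over the raw inputs."""
    s = x[0] + x[1]
    return (relu(s), relu(s - 1))

def linear_accuracy(points, w1, w2, b):
    """Accuracy of the rule 'predict 1 if w1*p0 + w2*p1 + b > 0'."""
    hits = sum(
        int((w1 * p[0] + w2 * p[1] + b > 0) == bool(label))
        for p, label in points
    )
    return hits / len(points)

def best_linear_accuracy(points, grid):
    """Best accuracy any linear rule on the grid achieves."""
    return max(
        linear_accuracy(points, w1, w2, b)
        for w1, w2, b in product(grid, repeat=3)
    )

grid = [i / 2 for i in range(-6, 7)]  # -3.0 to 3.0 in steps of 0.5

raw_best = best_linear_accuracy(XOR, grid)
hidden_best = best_linear_accuracy([(hidden(x), y) for x, y in XOR], grid)
print(raw_best, hidden_best)
```

On the raw inputs no linear rule gets all four points right. After the hidden layer, a linear rule does. A neural network learns that transformation from data. A classical linear model needs someone to hand it over.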


Wall 3 — scaling with data

Classical models have a ceiling. Give a random forest 1 million rows vs 100 million rows — performance plateaus. The model's capacity is bounded by the number and depth of trees, not by data volume.
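
A back-of-envelope sketch of that bound, assuming tree depth is capped at a fixed value (the cap of 12 is hypothetical). A binary tree of depth d has at most 2^d leaves, so the number of distinct regions a tree can carve out does not grow with the row count. A forest grown without a depth cap behaves differently, so treat this as an illustration of the fixed-hyperparameter case only.

```python
def leaves_per_tree(max_depth):
    """Upper bound on leaves in one binary tree of the given depth."""
    return 2 ** max_depth

def rows_per_leaf(n_rows, max_depth):
    """Average rows landing in each leaf if the tree is fully grown."""
    return n_rows / leaves_per_tree(max_depth)

depth = 12  # hypothetical cap
for n_rows in (1_000_000, 100_000_000):
    print(
        f"{n_rows:>11,} rows -> at most {leaves_per_tree(depth)} leaves, "
        f"about {rows_per_leaf(n_rows, depth):,.0f} rows per leaf"
    )
```

With 100x the data, the leaf count is unchanged. The extra rows only sharpen the average inside each leaf.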

Neural networks keep improving with more data. The two curves look roughly like this:

   performance
      │        ●────────────── classical ML (plateaus)
      │       ╱
      │      ╱
      │     ╱        ●
      │    ╱        ╱
      │   ╱       ╱
      │  ╱      ╱               deep learning (keeps climbing)
      │ ╱     ╱
      │╱    ╱
      ├───╱──────────────────→ data size
      small              huge

At small data, classical ML often wins — fewer parameters, less overfitting, faster to train. At large data, neural networks win — more capacity to absorb signal.

The crossover point depends on the problem. For clean tabular data with good features, classical ML can dominate even at millions of rows. For images and text, the crossover happens early — neural networks win even at moderate data sizes because they extract better features.


Wall 4 — sequential and structural data

Some data has structure that classical ML cannot naturally consume.

  • Time series with long-range dependencies. A transformer can attend directly to events 1000 timesteps ago, and a recurrent network can carry them forward in its hidden state. A feature-engineered classical model needs lag_1000 created by hand as a feature, and you must decide in advance which lags matter.
  • Graphs. Social networks, molecular structures, knowledge graphs. Classical ML needs hand-engineered graph features (degree, clustering coefficient, PageRank). Graph neural networks operate on the graph directly.
  • Variable-length sequences. A sentence can be 5 words or 500 words. Classical ML needs fixed-length input — pad, truncate, or bag-of-words. Transformers consume variable-length sequences natively.

None of these walls are absolute. You can engineer features for time series, graphs, and text. People did for decades. The question is whether the engineering cost and accuracy ceiling justify the choice.
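
The three workarounds from the list above can be sketched in a few lines each. The function names and toy inputs are hypothetical. Each one turns structured input into a fixed-length list of numbers, and each one forces a decision up front: which lags, which graph statistic, which vocabulary.

```python
from collections import Counter

def lag_features(series, t, lags):
    """Values of `series` at t - lag for each chosen lag (None if out of range)."""
    return [series[t - lag] if t - lag >= 0 else None for lag in lags]

def node_degrees(edges):
    """Degree of every node in an undirected edge list."""
    degree = Counter()
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    return dict(degree)

def bag_of_words(text, vocabulary):
    """Fixed-length count vector over `vocabulary`, whatever the sentence length."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocabulary]

series = list(range(100, 110))  # toy time series
print(lag_features(series, t=9, lags=[1, 3, 7]))

edges = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "d")]
print(node_degrees(edges))

print(bag_of_words("Refund now now", ["refund", "now", "please"]))
```

Anything outside the chosen lags, statistics, or vocabulary is invisible to the model. Word order is gone too.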


Wall 5 — multimodal inputs

The classical ML toolkit expects one feature list. What if one fraud case brings a receipt image, a support chat transcript, transaction values, and a voice call recording?

Classical ML processes each modality separately. Engineer features from the image (edges, textures). Engineer features from the text (TF-IDF, n-grams). Concatenate with the tabular features. Feed to a model.
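
A sketch of that feature-level fusion. The per-modality numbers are made up, and the helper names are hypothetical. The second helper shows the catch for a linear model: a cross-modal interaction only exists if someone appends it by hand.

```python
def fuse_features(image_features, text_features, tabular_features):
    """Feature-level fusion: one flat vector, modalities side by side."""
    return list(image_features) + list(text_features) + list(tabular_features)

def with_interaction(vector, i, j):
    """Append a hand-specified interaction term vector[i] * vector[j]."""
    return vector + [vector[i] * vector[j]]

# Hypothetical features for one fraud case.
image_features = [0.9]          # blur score of the receipt image
text_features = [2]             # count of "refund" in the chat transcript
tabular_features = [120.0, 3]   # transaction amount, merchant age in days

fused = fuse_features(image_features, text_features, tabular_features)
fused_plus = with_interaction(fused, 0, 1)  # blur x "refund" count

print(fused)
print(fused_plus)
```

The concatenated vector is what the classical model sees. Whether "blurred receipt" and "refund now" matter together is something the human has to guess and encode.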

A multimodal neural network processes all modalities jointly. The image encoder, text encoder, and tabular encoder share a latent space. Cross-attention lets the text attend to the image. The model discovers that "blurred receipt" (image) + "refund now" (text) + unusual merchant history (tabular) = scam, without the human having to spell out that interaction.

Classical ML can only combine modalities at the feature level. Neural networks combine them at the representation level, where cross-modal interactions are learned rather than specified by hand. That is the more expressive option.


Pause and recall. Without scrolling — name the five walls. For each, give one sentence on why it is a structural limit, not just a missing trick. If any is fuzzy, scroll back.


What classical ML is still best at

The walls are real. And yet — in production, classical ML dominates an enormous amount of work. The walls define where not to use it. Here is where to use it.

   Strength          Why it wins                                                    Example
   ────────          ───────────                                                    ───────
   Tabular data      Features are already meaningful. No representation gap.        Credit scoring, churn, pricing, demand forecasting
   Interpretability  A logistic regression coefficient or tree split is auditable.  Regulated industries: FICO, healthcare, insurance
   Small data        Fewer parameters → less overfitting → better with 500 rows     Rare-disease studies, niche B2B analytics
   Speed             Inference in microseconds. No GPU needed.                      Real-time bidding, high-frequency trading, edge devices
   Debuggability     Feature importance, SHAP values, partial dependence plots      Any production system where "why did the model say this?" matters

The honest summary. Classical ML is the right choice for tabular data, small data, interpretable models, and latency-critical systems. It is the wrong choice for raw images, raw text, raw audio, and any problem where the feature engineering burden is larger than the modeling burden.
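
What "auditable" means for a logistic regression, as a minimal sketch. The weights, bias, and feature names are invented for illustration and not taken from any real scorecard.

```python
import math

def explain_logistic(weights, bias, features):
    """Per-feature contributions to the log-odds, plus the resulting probability."""
    contributions = {name: weights[name] * features[name] for name in weights}
    log_odds = bias + sum(contributions.values())
    probability = 1.0 / (1.0 + math.exp(-log_odds))
    return contributions, probability

# Hypothetical fitted model and one applicant.
weights = {"late_payments": 0.8, "utilization": 1.5}
bias = -2.0
applicant = {"late_payments": 2, "utilization": 0.4}

contributions, probability = explain_logistic(weights, bias, applicant)
print(contributions)
print(round(probability, 3))

# exp(weight) is the odds multiplier for a one-unit increase in that feature.
print(round(math.exp(weights["late_payments"]), 3))
```

Every number in the prediction traces back to one weight times one input. That is the property a regulator can read line by line.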


Where this lives in the wild

  • Kaggle tabular competitions (2015–2025). XGBoost and LightGBM dominate tabular leaderboards to this day. Neural networks on tabular data (TabNet, FT-Transformer) match but rarely beat gradient boosting. Classical ML's wall is not on tabular — it is on everything else.
  • AlexNet moment (2012). ImageNet top-5 accuracy jumped from ~74% (hand-crafted features + SVM) to ~85% (ConvNet). The feature engineering wall was hit, and deep learning broke through it. This is the clearest example of Wall 1 in history.
  • NLP before transformers. TF-IDF + logistic regression was the production standard for text classification until ~2018. It worked for simple tasks (spam, sentiment on short text). It failed on nuance, sarcasm, long-range context, and anything requiring world knowledge. BERT replaced it not because classical ML was broken — but because representation learning was better.
  • Fraud detection at banks. Many banks still use logistic regression or gradient boosting for fraud scoring — not because deep learning would not help, but because regulators require model interpretability. The interpretability wall works in classical ML's favor here.
  • Recommendation at Netflix/Spotify. In large recommenders like these, the core ranking model is typically a neural network. But many upstream features can still come from classical ML pipelines: user clusters (k-means), item similarity (cosine on TF-IDF), engagement scores (gradient boosting). Classical ML lives inside deep learning systems.
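
TF-IDF shows up twice in the list above, so here is a minimal sketch of it. It assumes the textbook weighting tf × log(N / df) and whitespace tokenization on non-empty documents. Production libraries typically use smoothed variants, so the numbers will not match theirs exactly.

```python
import math
from collections import Counter

def tf_idf(documents):
    """Return one {term: weight} dict per document using tf * log(N / df)."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors

docs = ["win cash now", "meeting moved to now", "win a prize"]
vectors = tf_idf(docs)
for doc, vector in zip(docs, vectors):
    print(doc, "->", {term: round(w, 3) for term, w in vector.items()})
```

Rare terms like "cash" get more weight than terms shared across documents like "now". The representation is fixed by the formula. Nothing in it can learn that a sentence is sarcastic.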

Interview Q&A

Q: When should you choose classical ML over deep learning?
A: When the data is tabular with meaningful features, when the dataset is small (hundreds to low thousands of rows), when interpretability is required (regulated industries), when latency constraints rule out GPU inference, or when the team does not have deep learning infrastructure. In all these cases, gradient boosting or logistic regression is the pragmatic choice.
Common wrong answer to avoid: "deep learning is always better." On tabular data, gradient boosting matches or beats neural networks in most benchmarks. On small data, a deep network trained from scratch tends to overfit badly.

Q: Why does deep learning beat classical ML on images and text?
A: Because deep learning learns features (representations) from raw input, while classical ML requires hand-engineered features. The gap between raw pixels or raw text and useful features is large — too large for manual engineering to bridge competitively. Neural networks close that gap automatically through hierarchical representation learning.
Common wrong answer to avoid: "because neural networks are more complex." Complexity alone does not help. The key is that the complexity is structured — convolutional layers exploit spatial structure, attention layers exploit sequential structure. Random complexity would overfit.

Q: Will deep learning eventually replace classical ML on tabular data?
A: As of 2025, no. Research (TabNet, FT-Transformer, TabPFN) has narrowed the gap but not closed it. Gradient boosting remains the default for tabular production work because it trains faster, requires less tuning, is more interpretable, and matches neural network accuracy. The structural reason is that tabular features are already good representations — there is less for representation learning to discover.
Common wrong answer to avoid: "yes, it's just a matter of time." This ignores the structural argument. Tabular features are human-engineered representations. The representation learning advantage of deep learning is smallest exactly where the input representation is already good.

Q: What would you do if your classical ML model plateaus and business needs more accuracy?
A: Three checks. First — is the plateau from data quality or model capacity? Clean the data, fix leakage, improve features. Classical ML often has headroom in feature engineering. Second — is the problem fundamentally tabular? If yes, try a different classical model (switch from logistic regression to gradient boosting). Third — if the problem involves unstructured input or the feature engineering burden is growing, consider deep learning. But switch only when the lift justifies the infrastructure and maintenance cost.
Common wrong answer to avoid: "Just throw a bigger model at it." If the bottleneck is bad labels, leakage, or weak features, a larger model only scales the mess.


Apply now (5 min)

No code. No numbers. Just thinking.

Write down three real systems you have worked on or studied. For each:

  1. Was the input tabular, image, text, or mixed?
  2. Were the features hand-engineered or learned?
  3. Did classical ML hit a wall? Which one?
  4. Would deep learning have helped? At what cost (infrastructure, interpretability, latency)?

Then — without looking — sketch from memory:

  1. The five walls (one phrase each).
  2. The scaling curve — classical ML plateaus, deep learning keeps climbing.
  3. One sentence: when is classical ML still the right choice?

If you can do all three in 90 seconds, you own the honest admission.


The end of this module

Twenty tools. One honest admission. That is the classical ML story.

The feature list feeds the prediction. The confidence score tells you how sure. The voting panel votes. Overfitting is controlled by regularization. The threshold is tuned to business cost. And when classical ML hits its walls — unstructured data, representation learning, scale — you know to reach for the next tool.

You can build it. You can ship it. You can debug it. You can interview about it. You can name what it cannot do.

That is the bar.


Bridge. Classical ML is the structured-data workhorse. Fast, cheap, interpretable, and dominant on tabular data. But the next dataset brings an image, a paragraph of text, and a voice recording. Rows and columns are not enough. You need a model that learns to see — a neural network. Start the next module at ../01_neural_network_primitives/00-eli5.md.