07. Feature stores: one recipe for training and serving

~15 min read. Many model failures begin as quiet feature mismatches, not bad algorithms.

Builds on the ELI5 in 00-eli5.md. The assembly line (the factory path where every station expects the right part shape) depends on features being computed the same way in training and production.

Train-serve skew is a recipe mismatch

Look.

A model learns from features, not from raw tables alone. If those features are built differently offline and online, the model trains on one world and predicts in another.

That mismatch is called train-serve skew.

One team may compute 7_day_spend in training with late-arriving refunds removed.

Another service may compute it online without that correction.

The model was trained on one meaning and served on another.

Then everyone acts surprised when production quality drops.

The assembly line did not fail because the model forgot math. It failed because the recipe changed mid-shift.

Picture first.

training path                     serving path
┌──────────────┐                  ┌──────────────┐
│ batch tables │                  │ live events  │
└──────┬───────┘                  └──────┬───────┘
       ▼                                 ▼
┌──────────────┐                  ┌──────────────┐
│ feature code │      same?       │ feature code │
└──────┬───────┘        ?         └──────┬───────┘
       ▼                                 ▼
 train features                     serve features
       │                                 │
       └──────────── skew if different ──┘

Simple, no?

Feature stores exist to reduce that mismatch.

They give teams one managed place to define, retrieve, and serve features consistently.

That does not remove all data problems, but it greatly reduces a very common one.
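To make the 7_day_spend mismatch concrete, here is a minimal sketch. The event data, the function names, and the dates are all hypothetical, chosen only to show how two reasonable-looking recipes can disagree on the same customer.

```python
from datetime import datetime, timedelta

# Hypothetical events for one customer: (timestamp, amount, is_refund).
EVENTS = [
    (datetime(2024, 1, 4, 9, 0), 40.0, False),
    (datetime(2024, 1, 6, 14, 0), 60.0, False),
    (datetime(2024, 1, 8, 11, 0), -25.0, True),  # late-arriving refund
]

AS_OF = datetime(2024, 1, 10, 10, 0)


def spend_7d_training(events, as_of):
    """Training recipe: net spend over the last 7 days, refunds applied."""
    start = as_of - timedelta(days=7)
    return sum(amount for ts, amount, _ in events if start < ts <= as_of)


def spend_7d_serving(events, as_of):
    """Serving recipe: same window, but refund events are never applied."""
    start = as_of - timedelta(days=7)
    return sum(
        amount
        for ts, amount, is_refund in events
        if start < ts <= as_of and not is_refund
    )


train_value = spend_7d_training(EVENTS, AS_OF)  # 75.0
serve_value = spend_7d_serving(EVENTS, AS_OF)   # 100.0
skew = serve_value - train_value                # 25.0
```

Same customer, same window, same feature name, two different numbers. The model learned what 75 means and is shown 100 in production.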

What a feature store actually provides

A feature store is not just a database with a fancy name.

It usually provides feature definitions, offline retrieval for training, online serving for inference, and metadata about freshness.

It also helps with point-in-time joins so training data does not accidentally peek into the future.

That future leak is deadly because it makes offline scores look better than production reality.

The ideal is simple.

Define feature logic once. Use it for both model training and live prediction.

That keeps the assembly line using one recipe instead of two contradictory recipes.

A common shape looks like this.

raw data
   │
   ▼
┌──────────────────┐
│ feature defs     │
│ entities + logic │
└───────┬──────────┘
        ├──────────────→ offline retrieval for training
        │
        └──────────────→ online serving for inference
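The "define once, use twice" idea in the diagram can be sketched in plain Python. This is not how Feast or any specific product does it; `FeatureDef`, `build_training_rows`, and `serve_feature` are hypothetical names used only to show one definition feeding both paths.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class FeatureDef:
    """One shared recipe: a name, the entity it belongs to, and the logic."""
    name: str
    entity: str
    compute: Callable[[List[float]], float]


def total_spend(amounts: List[float]) -> float:
    # The single place where the feature logic lives.
    return float(sum(amounts))


SPEND_7D = FeatureDef(name="spend_7d", entity="customer_id", compute=total_spend)


def build_training_rows(
    feature: FeatureDef, history_by_entity: Dict[str, List[float]]
) -> Dict[str, float]:
    """Offline path: apply the shared definition to many entities at once."""
    return {eid: feature.compute(rows) for eid, rows in history_by_entity.items()}


def serve_feature(feature: FeatureDef, live_rows: List[float]) -> float:
    """Online path: apply the same definition to one entity's live rows."""
    return feature.compute(live_rows)


training = build_training_rows(SPEND_7D, {"C17": [40.0, 60.0, -25.0]})
online = serve_feature(SPEND_7D, [40.0, 60.0, -25.0])
```

Because both paths call `SPEND_7D.compute`, a change to the recipe lands in training and serving together instead of in only one of them.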

Feast is a widely known open-source choice. It gives a practical way to define entities, feature views, offline sources, and online serving.

Managed options also exist from cloud and platform vendors. These reduce operational burden if your team prefers service contracts over platform ownership.

So what to do?

Choose open-source when you want flexibility and can support the platform. Choose managed when reliability and team bandwidth matter more than customization.

The product need should drive the choice. Brand loyalty should not.

Point-in-time joins matter more than people expect

See.

Suppose you train a churn model using account balance and support ticket count.

The prediction timestamp is 10 January, 10:00 AM.

If your training join accidentally uses a balance updated at 4:00 PM (16:00) that same day, you have leaked the future.

The model now learns from evidence unavailable at decision time.

Offline metrics become fake confidence.

A feature store helps by retrieving feature values as they existed at prediction time.

That is the point-in-time join idea. Yes?

Look at the tiny example.

prediction time = 10 Jan, 10:00

customer_id   balance seen at 10:00   tickets seen at 10:00
C17           12,000                  2

wrong join uses balance at 16:00 = 5,000
right join uses balance at 10:00 = 12,000

The wrong join quietly teaches the model from information it never had live.

The right join keeps training reality aligned with serving reality.
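The C17 example can be sketched with the standard library. The year and the 9 January update time are assumptions added so the history has something before 10:00; `value_as_of` and `latest_value` are hypothetical helper names, not a feature store API.

```python
from bisect import bisect_right
from datetime import datetime

# Hypothetical balance history for customer C17, sorted by update time.
BALANCE_HISTORY = [
    (datetime(2024, 1, 9, 18, 0), 12000),   # last update before the prediction
    (datetime(2024, 1, 10, 16, 0), 5000),   # update after the prediction
]

PREDICTION_TIME = datetime(2024, 1, 10, 10, 0)


def value_as_of(history, as_of):
    """Right join: latest value whose timestamp is <= as_of, else None."""
    timestamps = [ts for ts, _ in history]
    idx = bisect_right(timestamps, as_of)
    return history[idx - 1][1] if idx > 0 else None


def latest_value(history):
    """Wrong join for training: whatever the table holds now."""
    return history[-1][1]


right = value_as_of(BALANCE_HISTORY, PREDICTION_TIME)  # 12000
wrong = latest_value(BALANCE_HISTORY)                  # 5000
```

The only difference between the two lookups is whether the prediction timestamp takes part in the join. That one condition is the whole point-in-time rule.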

Feature stores also help with offline retrieval at scale.

You can fetch historical features for many entities, then serve the latest features online for low-latency inference.

That pairing is important. Training wants broad history; production wants fast current values.

A decent feature store gives both lanes without making every team rebuild them.
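A toy version of the two lanes, plus a freshness flag, might look like the sketch below. `TinyFeatureStore` is a hypothetical in-memory class for illustration only; real feature stores back these lanes with a warehouse and a low-latency key-value store.

```python
from datetime import datetime, timedelta


class TinyFeatureStore:
    """In-memory sketch: full history offline, latest value online."""

    def __init__(self, max_age):
        self._history = {}   # entity_id -> [(ts, value), ...] sorted by ts
        self._latest = {}    # entity_id -> (ts, value)
        self._max_age = max_age

    def write(self, entity_id, ts, value):
        rows = self._history.setdefault(entity_id, [])
        rows.append((ts, value))
        rows.sort(key=lambda row: row[0])
        current = self._latest.get(entity_id)
        if current is None or ts >= current[0]:
            self._latest[entity_id] = (ts, value)

    def get_historical(self, requests):
        """Offline lane: point-in-time value for each (entity_id, as_of)."""
        out = []
        for entity_id, as_of in requests:
            seen = [v for ts, v in self._history.get(entity_id, []) if ts <= as_of]
            out.append(seen[-1] if seen else None)
        return out

    def get_online(self, entity_id, now):
        """Online lane: (latest value, is_fresh) for one entity."""
        row = self._latest.get(entity_id)
        if row is None:
            return None, False
        ts, value = row
        return value, (now - ts) <= self._max_age


store = TinyFeatureStore(max_age=timedelta(hours=24))
store.write("C17", datetime(2024, 1, 9, 18, 0), 12000)
store.write("C17", datetime(2024, 1, 10, 16, 0), 5000)
```

Training would call `get_historical` with many entity and timestamp pairs; the serving path would call `get_online` for one entity per request and decide what to do when the value comes back stale.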

When feature stores help a lot, and when they do not

Feature stores help most when many models share important features.

They help when several teams reuse user, merchant, item, or session features across products.

They help when your company has already been burned by train-serve skew.

They help when online freshness, backfills, and point-in-time correctness really matter.

They help when platform reuse beats ad hoc scripts.

The assembly line becomes calmer because feature logic stops splintering across repos and services.

Now the honest part.

Feature stores do not help every AI team equally.

If your system is mostly pure prompt engineering with little structured feature logic, a feature store may add more platform overhead than value.

If your features are trivial and computed directly at request time, the added surface area may not pay off.

If the team cannot maintain another platform reliably, the feature store may become one more brittle dependency.

Simple, no?

Do not install Feast just because it sounds mature.

Use it when shared features, multiple models, or past skew incidents make the need obvious.

Use managed options when you want the capability without platform ownership.

Skip the whole category when the problem is mostly prompts, retrieval, or lightweight rules.

That is senior judgment, not tool snobbery.

Where this lives in the wild

DoorDash delivery ETA — ML platform engineer: shares courier, merchant, and route features across training jobs and real-time scoring services.
Spotify recommendations — recommender systems engineer: reuses user-behavior features across candidate generation and ranking models with freshness controls.
Stripe risk scoring — data platform engineer: protects online fraud models from skew by standardizing transaction features across batch and serving paths.
Instacart search ranking — machine learning engineer: retrieves historical features offline and serves low-latency latest values online for shopper requests.
Grab marketplace models — MLOps engineer: uses shared entity features because many pricing and dispatch models depend on the same signals.

Pause and recall

What is train-serve skew, in one plain sentence?
Why do point-in-time joins matter during training?
What are the usual offline and online jobs of a feature store?
When should a team skip adding a feature store?

Interview Q&A

Q: Why do feature stores reduce train-serve skew?
A: They encourage one shared definition and retrieval path for features used in training and inference, so the model sees more consistent inputs across both environments.
Common wrong answer to avoid: "Because they make models more accurate by default." They reduce inconsistency risk; the model still needs good signal.

Q: Why are point-in-time joins critical for ML training data?
A: They prevent future information from leaking into historical examples, which keeps offline evaluation honest and closer to live serving conditions.
Common wrong answer to avoid: "Because SQL joins are slow otherwise." Speed is secondary; correctness is the main issue.

Q: When is Feast a sensible choice?
A: It makes sense when you want an open-source feature store, shared feature definitions, and your team can support some platform complexity. It is especially useful after skew has already caused pain.
Common wrong answer to avoid: "Always use Feast if you are serious about ML." Serious teams also skip tools that do not fit.

Q: When might a feature store be unnecessary overhead?
A: If the product relies mostly on prompts or trivial real-time features, or if the team cannot maintain another platform surface, the added system may cost more than it helps.
Common wrong answer to avoid: "Never skip it, because every production ML stack needs one." That is cargo cult thinking.

Apply now (5 min)

Exercise. Write down one feature your model uses, then describe how it is computed in training and how it is computed online today.

If the two descriptions differ, mark the mismatch clearly.

Then write one point-in-time rule that would keep the historical join honest.

Sketch from memory. Draw the assembly line with one shared feature definition feeding both offline retrieval and online serving.

Label where skew enters if the recipes diverge.

Bridge. Once features are consistent, the next problem is serving them through real infrastructure at useful latency and cost. So now we move to model serving stacks. → 08-serving-infrastructure.md