13. Honest Admission: MLOps is still messy in the real world

~13 min read. Good teams win through discipline more than perfect tooling.

Built on the ELI5 in 00-eli5.md. The assembly line — the promised neat factory flow — is useful, but real factories are rarely neat.

1) First picture: the floor is more chaotic than the brochure

Look.

┌──────────────┐   ┌──────────────┐   ┌──────────────┐
│ data tools   │──▶│ model tools  │──▶│ deploy tools │
└──────┬───────┘   └──────┬───────┘   └──────┬───────┘
       │                  │                  │
       ▼                  ▼                  ▼
┌──────────────┐   ┌──────────────┐   ┌──────────────┐
│ lineage gaps │   │ eval gaps    │   │ alert gaps   │
└──────┬───────┘   └──────┬───────┘   └──────┬───────┘
       └──────────────────┴──────────────────┘
                          ▼
                 operational discipline

Vendor pages show a smooth factory floor. One platform for data. One platform for training. One platform for deployment. One button for MLOps.

Reality is rougher. Data lives in one place. Features live somewhere else. Model artifacts live in the warehouse. Deployments live in a different control plane. Observability lives in another product.

So what is the real hard part? Not buttons. Operational discipline across teams, artifacts, and decisions. Simple, no?

The assembly line analogy still helps. But do not mistake the diagram for the ground. A neat drawing is not mature operations.

2) Teams usually over-engineer or under-engineer

This is the classic failure split. Some teams buy or build too much. Other teams stay on notebooks for too long. Both choices hurt.

Over-engineering

Some teams build a moon mission for a bicycle. They create a feature store, custom metadata service, multi-stage registry workflow, event bus, and three dashboards, before one model proves durable value.

Now maintenance becomes the product. The ML system sits on top of a tower of internal tools. Every engineer becomes part-time infrastructure support.

Under-engineering

Other teams go the opposite way. They train in ad hoc notebooks. They push models manually. They keep no lineage, weak evals, and no reliable rollback path.

This feels fast early. Then the first incident arrives. No one knows which data built the model, which prompt version is live, or which checkpoint the quality gate approved.

Look. Both extremes lose. The right answer is enough tooling for lineage, safety, and speed. Not much more. Not much less.
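The lineage questions above (which data, which prompt version, which checkpoint) can be answered by one small record per release. The following is a minimal sketch, not a prescribed schema: the class name `LineageRecord`, its fields, and the sample values are all hypothetical, and the byte strings stand in for real checkpoint and dataset files.

```python
import hashlib
import json
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class LineageRecord:
    """One record per released model: enough to answer 'what is live and why'."""
    model_version: str      # hypothetical version label
    checkpoint_sha256: str  # hash of the approved checkpoint bytes
    dataset_sha256: str     # hash of the training data snapshot
    prompt_version: str     # prompt template version shipped with the model
    approved_by: str        # the gate or person that approved promotion


def sha256_of(payload: bytes) -> str:
    """Hex digest used as a content fingerprint."""
    return hashlib.sha256(payload).hexdigest()


def to_json(record: LineageRecord) -> str:
    # sort_keys gives a stable string, so records diff cleanly in version control
    return json.dumps(asdict(record), sort_keys=True)


# Placeholder bytes; a real pipeline would hash the actual files.
record = LineageRecord(
    model_version="support-bot-0.3",
    checkpoint_sha256=sha256_of(b"fake checkpoint bytes"),
    dataset_sha256=sha256_of(b"fake dataset snapshot"),
    prompt_version="prompt-v7",
    approved_by="quality-gate",
)
print(to_json(record))
```

Even a JSON file like this, committed next to the deployment config, is often enough lineage for a first model.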

3) "One-click MLOps" is usually a sales sentence

Vendors are not useless. They solve real problems. But over-promises create bad expectations.

A platform can help you track artifacts. It can help schedule jobs. It can help monitor metrics. It can help promote through the quality gate.

But it cannot remove judgment. It cannot decide which metrics truly matter. It cannot make weak labels trustworthy. It cannot align product, platform, and legal teams by itself.

No universal best stack exists. The right stack depends on team size, risk, traffic shape, model type, and how fast ground truth returns from the world.

A fintech fraud team needs different controls than a marketing classification team. A medical summarization team needs different controls than an internal search bot team. A three-person startup should not mimic a hyperscaler architecture. Yes?

So what to do? Choose boring components that cover your real risks. Ignore tools that mainly add ceremony. The warehouse should solve approved artifact management. The quality gate should solve promotion checks. The production monitor should solve live visibility. If a new tool does none of those, question it hard.
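To make "promotion checks" concrete, here is a minimal sketch of a quality gate as a plain function. The metric names and threshold values are assumptions chosen for illustration; which metrics belong in the gate is exactly the judgment no platform makes for you.

```python
from typing import Dict, List, Tuple

# Hypothetical thresholds; real values depend on the product's risk.
MIN_THRESHOLDS: Dict[str, float] = {"accuracy": 0.90}         # higher is better
MAX_THRESHOLDS: Dict[str, float] = {"p95_latency_ms": 800.0}  # lower is better


def gate(metrics: Dict[str, float]) -> Tuple[bool, List[str]]:
    """Return (passed, reasons). A missing metric counts as a failure."""
    failures: List[str] = []
    for name, floor in MIN_THRESHOLDS.items():
        value = metrics.get(name)
        if value is None or value < floor:
            failures.append(f"{name}={value} below floor {floor}")
    for name, ceiling in MAX_THRESHOLDS.items():
        value = metrics.get(name)
        if value is None or value > ceiling:
            failures.append(f"{name}={value} above ceiling {ceiling}")
    return (not failures, failures)


passed, reasons = gate({"accuracy": 0.93, "p95_latency_ms": 650.0})
print(passed, reasons)
```

Treating a missing metric as a failure is a deliberate choice in this sketch: a candidate that skipped an eval should not pass by omission.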

4) The hard parts remain genuinely hard

Some MLOps pain is not just bad execution. Some of it is an open operational problem. That is worth admitting directly.

Observability for LLMs is immature

System metrics are easy. Business metrics are understandable. But semantic quality in live traffic is still slippery. You can count refusals, length, and latency. You cannot instantly count truthfulness for every answer.
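The countable signals can be sketched in a few lines. This is an illustration under assumptions: the log shape (dicts with `response` and `latency_ms` keys) and the refusal markers are hypothetical, and substring matching is a crude heuristic for refusals. Note what is absent: nothing here measures whether an answer was true.

```python
import math
from typing import Dict, List

# Hypothetical heuristic; real refusal detection needs more care.
REFUSAL_MARKERS = ("i can't help", "i cannot help")


def summarize(logs: List[Dict]) -> Dict[str, float]:
    """Compute the easy signals: refusal rate, mean length, p95 latency."""
    if not logs:
        raise ValueError("no logs to summarize")
    n = len(logs)
    refusals = sum(
        any(marker in row["response"].lower() for marker in REFUSAL_MARKERS)
        for row in logs
    )
    mean_chars = sum(len(row["response"]) for row in logs) / n
    latencies = sorted(row["latency_ms"] for row in logs)
    # Nearest-rank percentile: smallest value with at least 95% of data at or below it.
    p95 = latencies[math.ceil(0.95 * n) - 1]
    return {
        "refusal_rate": refusals / n,
        "mean_chars": mean_chars,
        "p95_latency_ms": p95,
    }


sample = [
    {"response": "ok", "latency_ms": 100},
    {"response": "I can't help with that.", "latency_ms": 200},
    {"response": "done", "latency_ms": 300},
    {"response": "yes", "latency_ms": 400},
]
print(summarize(sample))
```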

Ground truth is often delayed

Fraud labels may arrive weeks later. Customer satisfaction may be noisy and partial. Support resolution may be influenced by human agents after the model. That makes tight learning loops harder.
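One practical consequence of delayed labels is that any quality metric should be computed only over predictions old enough to have been labeled. The sketch below measures label coverage for such "mature" predictions. The record shape (`made_on`, `label`) and the 14-day window are assumptions for illustration.

```python
from datetime import date, timedelta
from typing import Dict, List, Optional


def label_coverage(
    predictions: List[Dict], today: date, min_age_days: int
) -> Optional[float]:
    """Share of predictions at least `min_age_days` old that have a label.

    Returns None when no prediction is old enough to judge yet.
    """
    cutoff = today - timedelta(days=min_age_days)
    mature = [p for p in predictions if p["made_on"] <= cutoff]
    if not mature:
        return None
    labeled = sum(p["label"] is not None for p in mature)
    return labeled / len(mature)


# Hypothetical records: two are old enough to judge, one is too recent.
preds = [
    {"made_on": date(2024, 2, 1), "label": 1},
    {"made_on": date(2024, 2, 5), "label": None},
    {"made_on": date(2024, 2, 28), "label": None},
]
print(label_coverage(preds, today=date(2024, 3, 1), min_age_days=14))
```

Low coverage among mature predictions suggests that accuracy computed from the available labels may be biased toward whichever cases get labeled first.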

Feature stores add power and surface area

Feature stores help when many teams share stable features. But for small teams, they can become a maintenance tax. Freshness bugs, schema drift, backfill complexity, and serving-store parity all add surface area.
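Serving-store parity is one of the few items on that list that is cheap to check. A rough sketch, assuming features for one entity are available as plain dicts from both the offline and online paths; the feature names and the tolerance are hypothetical.

```python
import math
from typing import Dict, List


def parity_mismatches(
    offline: Dict[str, float], online: Dict[str, float], rel_tol: float = 1e-6
) -> List[str]:
    """List feature names that are missing on one side or differ in value."""
    problems: List[str] = []
    for name in sorted(set(offline) | set(online)):
        if name not in offline or name not in online:
            problems.append(f"{name}: missing on one side")
        elif not math.isclose(offline[name], online[name], rel_tol=rel_tol):
            problems.append(
                f"{name}: offline={offline[name]} online={online[name]}"
            )
    return problems


offline_row = {"orders_7d": 3.0, "avg_basket": 21.5}
online_row = {"orders_7d": 3.0, "avg_basket": 19.0, "age_days": 40.0}
for problem in parity_mismatches(offline_row, online_row):
    print(problem)
```

Running a check like this on a sample of entities catches both schema drift (a feature present on only one side) and value drift (stale or differently computed values).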

Data, model, and infra boundaries stay blurry

Was the failure caused by data? By thresholding? By retrieval freshness? By the hosted provider? By the prompt? Real incidents often cut across all of them.

This is why discipline matters more than stack purity. You need clear ownership, reliable metadata, and calm runbooks. The tools help. They do not replace that muscle.

5) Worked example: enough tooling beats maximum tooling

Suppose a six-person startup has one support bot. Traffic is moderate. Risk is medium. Labels arrive with a few days of delay.

Option A is the grand platform. Build a custom feature store. Build internal experiment tracking. Build a deployment controller. Build a bespoke online evaluation layer. Build cross-region failover from day one.

Engineering estimate:

2 platform engineers for 4 months,
1 ML engineer maintaining metadata flows,
slower product iteration during setup.

Option B is the minimum disciplined stack. Use managed training jobs. Use a warehouse for approved model versions. Use the quality gate for automated evals. Use the production monitor for live metrics and alerts. Track prompt versions beside model versions. Write one real incident runbook.

Engineering estimate:

1 engineer for 4 weeks,
moderate vendor cost,
faster product iteration.
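The two estimates can be put on one scale. The arithmetic below counts only the upfront build effort stated above. The ML engineer in Option A is left out because no duration is given for that role, so the real gap is likely wider. The weeks-to-months conversion (52 / 12) is an assumption.

```python
WEEKS_PER_MONTH = 52 / 12  # about 4.33

option_a_months = 2 * 4                    # 2 platform engineers x 4 months
option_b_months = 1 * 4 / WEEKS_PER_MONTH  # 1 engineer x 4 weeks

ratio = option_a_months / option_b_months
print(f"A: {option_a_months} engineer-months")    # A: 8 engineer-months
print(f"B: {option_b_months:.2f} engineer-months")  # B: 0.92 engineer-months
print(f"A costs about {ratio:.1f}x the upfront effort of B")  # about 8.7x
```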

Now compare outcomes. If the bot is still proving value, Option B usually wins. You keep lineage, promotion checks, rollback, and monitoring. You do not build a moon mission for a bicycle.
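Rollback in particular does not need heavy machinery. Here is a toy in-memory sketch of the warehouse idea: an ordered history of approved versions, where rollback means stepping back one entry. The class and method names are hypothetical, and a real registry would persist this history rather than hold it in memory.

```python
from typing import List, Optional


class Warehouse:
    """In-memory stand-in for a registry of approved model versions."""

    def __init__(self) -> None:
        self._history: List[str] = []  # approved versions, oldest first

    def promote(self, version: str) -> None:
        """Record a newly approved version; it becomes the live one."""
        self._history.append(version)

    @property
    def live(self) -> Optional[str]:
        return self._history[-1] if self._history else None

    def rollback(self) -> str:
        """Drop the live version and return the one that becomes live."""
        if len(self._history) < 2:
            raise RuntimeError("no earlier approved version to roll back to")
        self._history.pop()
        return self._history[-1]


w = Warehouse()
w.promote("v1")
w.promote("v2")
print(w.live)        # v2
print(w.rollback())  # v1
```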

Later, if three more products share features, if traffic grows sharply, and if online-offline parity becomes painful, then maybe deeper platform work becomes justified.

See the point. The best stack is staged maturity. Not maximal machinery on day one.

Where this lives in the wild

DoorDash ML platform — a platform lead decides which shared tooling is worth centralizing and which should stay product-local.
OpenAI applied AI teams — engineers work around delayed ground truth and still need shipping discipline for prompts, models, and evaluations.
Spotify recommendation infrastructure — an ML platform architect balances shared feature systems against team autonomy and maintenance burden.
Razorpay risk systems — a data science manager chooses tooling that preserves lineage and rollback without overwhelming a smaller team.
Anthropic enterprise deployment teams — reliability engineers face immature semantic observability even when infrastructure telemetry is strong.

Pause and recall

Why is operational discipline often harder than tool selection?
How do over-engineering and under-engineering fail in different ways?
Why is there no universal best MLOps stack?
When can a feature store add more burden than value?

Interview Q&A

Q: Why do many MLOps platforms still fail to solve the core production problem?
A: Because the core problem is cross-functional discipline under uncertainty. Tools can store artifacts and metrics, but they cannot choose the right safeguards, owners, and trade-offs for your product.

Common wrong answer to avoid: "We just need a better platform vendor." Vendors help, but they do not remove judgment and ownership.

Q: Why can small teams be damaged by copying big-company ML architecture?
A: Large-company stacks assume scale, specialization, and maintenance capacity. Small teams often inherit complexity long before they inherit the traffic or organizational need.

Common wrong answer to avoid: "Best practices are universal." Good principles are transferable, but stack size must match reality.

Q: Why is LLM observability still immature compared with classic infrastructure monitoring?
A: Because semantic quality is harder to measure instantly and reliably than latency or error rate. Many important labels arrive late or remain ambiguous.

Common wrong answer to avoid: "Just add more dashboards." Dashboards help only when the underlying signal is truly measurable.

Q: What is the honest goal for an early MLOps stack?
A: Enough tooling for lineage, safety, speed, and rollback. The goal is dependable delivery, not architectural theater.

Common wrong answer to avoid: "Build everything once so we never revisit it." Maturity should grow in stages, not in one giant upfront leap.

Apply now (5 min)

Take one AI product you know. List the smallest set of tools needed for lineage, promotion, rollback, and monitoring. Then mark which extra tool would be premature today.

Now sketch from memory:

the messy-tooling diagram,
the two failure extremes,
and the staged-maturity example.

Say aloud why a bicycle does not need a moon mission, and why discipline beats brochure architecture.

Bridge. MLOps teaches how to ship and operate AI in general. Next module, voice systems demand all of that under brutal realtime constraints, where latency and turn-taking become first-class engineering problems. → ../00_realtime_voice_agents/00-eli5.md