Skip to content

14. Honest admission — where database confidence should end, and good judgment should begin

~15 min read. Mature engineers know which answers are principles, and which are still experiments.

Built on the ELI5 in 00-eli5.md. The shelf — the storage engine where data physically lives — still matters, but some choices above that shelf remain unsettled and empirical.


1. NewSQL versus NoSQL is still a live debate, not a closed chapter

See. People like clean verdicts. Reality gives families of systems with overlapping strengths. NewSQL systems promise relational models with distributed scale. NoSQL systems promise flexible schemas, simpler scale-out paths, or workload specialization. Both camps have real wins.

A NewSQL-style system can give SQL, transactions, and stronger consistency. That is attractive for developer ergonomics and correctness. But it may introduce coordination cost, operational complexity, or workload constraints. A NoSQL system may give simpler write scale, higher availability, or data-model flexibility. But application logic often absorbs more burden.

NewSQL ─→ stronger relational comfort, harder coordination cost
NoSQL  ─→ specialized scale paths, more app-side tradeoffs

So what is the honest answer? There is no permanent winner. The right choice depends on access patterns, team skill, failure tolerance, and data shape. A product with complex joins may regret premature NoSQL. A write-heavy global event stream may regret over-coordinated relational assumptions.

The shelf choice still matters deeply. But system labels alone do not finish the design. Simple, no? A senior answer names the tradeoff and the unknowns. Not just the favorite technology badge.


2. The best index strategy often appears only after real query evidence

Students want one perfect indexing recipe. Look. Indexes are workload-shaped tools. A beautiful composite index for one query may hurt writes everywhere else. Another index may become useless after product behavior changes. This is why indexing remains empirical.

Suppose a table stores orders with fields: tenant_id, status, created_at, customer_id, total_amount. One team first adds an index on tenant_id only. Then dashboards start filtering by tenant and status. Later, support tooling sorts by created time. Suddenly the original index is not enough. A composite index might help. But which column order wins? The answer depends on actual selectivity and query mix.

query shape changes → useful index changes → write cost changes too

Explain plans help. Production query samples help more. A/B testing different indexes sometimes helps most. This is trial-and-error, but not random trial-and-error. It is instrumented iteration.

So do not pretend the optimal index is obvious from the schema alone. The shelf can store data perfectly well, while the access path still needs measured tuning. That is normal engineering, not failure.


3. Distributed transactions remain painful, even when tools improve

Now let us admit the painful part clearly. Cross-service transactions are still hard. Two-phase commit exists. Sagas exist. Outbox patterns exist. None of them make distributed correctness feel pleasant.

Two-phase commit gives stronger coordination. It also creates blocking risk, coordinator sensitivity, and operational complexity. Sagas reduce lockstep coordination. But they push failure handling into compensating actions. Compensation is often not a true rollback. Refunding money is not the same as never charging it. Sending a cancellation email is not the same as no email.

service A success
service B failure
└─→ now you need compensation, retries, or manual repair

Worked example. An ecommerce checkout reserves inventory, charges payment, and creates shipment. If shipment creation fails after payment succeeds, what is the universal perfect rollback? Usually there is none. You refund, restock, and notify. That is business repair logic, not magical atomic reversal across everything.

So the honest admission is useful. Distributed transactions remain painful. The goal is not perfect elegance. The goal is bounded inconsistency, observable recovery, and safe compensation. That answer sounds more senior than pretending sagas solved everything.


4. Storage cost models are usually directional estimates, not exact science

People ask, "How much will this database design cost?" You should estimate. You should also admit uncertainty. Storage cost depends on data growth, compression, index amplification, replicas, backups, and access patterns. Small assumption errors can multiply quickly.

Suppose raw user events are 2 terabytes per month. Compression may reduce them to 0.7 terabytes. Three replicas raise that to 2.1 terabytes. Secondary indexes add 40% more. Backups add another 50% of logical size. Now your rough stored footprint is: 2.1 × 1.4 × 1.5 = 4.41 terabytes. That is already far from the raw input guess.

And that still ignores compaction overhead, tombstones, hot-tier retention, and query egress. So cost modeling is often heuristic. You build a range, not a fake precise prophecy. That is especially true when workload shape is still moving.

raw data → compression → replicas → indexes → backups → surprise overhead

The shelf itself may be cheap. The surrounding copies, metadata, and lifecycle rules often are not. Therefore speak in ranges. Say what assumption would move the range. That is honest and useful.


5. Vector database tooling is promising, but the landscape is still immature

Now the current hot area. Vector databases are useful. They are also young compared with mature relational engines. Features vary widely across products. Operational lessons are still accumulating. Benchmark claims are often query-set specific.

One tool may shine on ANN recall. Another may shine on filtering with metadata. A third may shine on managed operations. But teams still debate durability semantics, backup flows, hybrid search quality, and cost predictability. Even embedding model changes can reshape the storage plan. That means architecture choices may age quickly.

The honest answer is not pessimism. It is disciplined caution. Use vectors where semantic retrieval clearly helps. Evaluate with your own data. Keep migration paths possible. Avoid deep lock-in until workload shape stabilizes.

useful today ─→ yes
fully settled ecosystem ─→ no

The shelf may now hold vectors, postings, rows, and blobs together. But our collective playbook is still evolving. So close with structured honesty: state what you know, state what remains uncertain, and state how you would validate before scaling harder. That is mature system design.


Where this lives in the wild

  • CockroachDB platform evaluator at a fintech weighs SQL comfort against coordination cost and migration complexity.
  • MongoDB-backed product engineer at a fast-moving startup accepts application-side joins and denormalization for delivery speed.
  • Stripe or Adyen payments engineer still treats cross-service transaction boundaries as careful compensation problems, not trivial atomic workflows.
  • Snowflake or BigQuery cost analyst models storage and compute in ranges because compression and access patterns shift actual bills.
  • RAG infrastructure engineer comparing Pinecone, Milvus, and pgvector tests recall, filtering, and operations with real queries before committing heavily.

Pause and recall

  1. Why is NewSQL versus NoSQL still a live decision rather than a settled winner?
  2. Why does index tuning remain empirical even when schemas look clear?
  3. Why are sagas helpful but still not equal to one perfect rollback?
  4. Why should vector database choices be made with extra caution today?

Interview Q&A

Q: Why admit uncertainty in database design instead of forcing a single confident winner? A: Because many storage choices depend on workload evidence, team capability, and failure costs that are not fully known upfront. Structured honesty leads to better validation plans.

Common wrong answer to avoid: "Because system design is subjective" — it is not random opinion. It is evidence-guided judgment under uncertainty.

Q: Why is the optimal index strategy often discovered through iteration? A: Query mix, selectivity, and write amplification interact in ways the schema alone cannot reveal. Explain plans and production measurements uncover the useful path.

Common wrong answer to avoid: "Because indexes are unpredictable" — they are understandable, but workload-dependent.

Q: Why are distributed transactions still painful even with sagas and outbox patterns? A: Because real-world side effects are not always perfectly reversible. These patterns manage failure better, but they do not erase business-level repair complexity.

Common wrong answer to avoid: "Because engineers have not built the right framework yet" — frameworks help, but the deeper pain comes from coordinating separate systems and side effects.

Q: Why should teams stay cautious about vector database lock-in? A: The ecosystem is evolving quickly, evaluation standards vary, and workload-specific behavior can change the best choice. Leaving migration room is often wise.

Common wrong answer to avoid: "Because vector databases are hype" — they are useful, but the surrounding operational playbook is still maturing.


Apply now (5 min)

Exercise: Pick one product idea and list five unknowns before choosing its primary data store. Include one uncertainty about indexing, one about transaction boundaries, and one about cost growth. Then write a two-line validation plan for each unknown.

Sketch from memory: draw the decision chain from workload assumptions to shelf choice, index tuning, transaction pattern, and cost range. Then circle the places where evidence must replace opinion.


Bridge. We leave storage now and move into event-driven distributed systems, where messages, retries, and asynchronous boundaries become the main story. → ../06_event_driven_distributed_systems/00-eli5.md