04. Managed databases and caches: RDS, DynamoDB, Redis, and trade-offs

⏱️ Estimated time: 18 min | Level: intermediate

ELI5 callback: In the dragon farm, the barn runs the work, the feeding trough holds the data, the fence limits access, the breeding ground scales the herd, and the ledger stops waste. Today we choose managed data services instead of babysitting boxes.

1) See the shape clearly

Managed relational databases, NoSQL key-value stores, and managed caches all matter here. They do not optimise the same pressure. See. Start with workload shape, not vendor branding. Check startup time, runtime length, and host control. Check who patches the base layer. Check whether scale is steady or bursty. Check whether warm state must survive. Simple, no?

Managed relational databases give SQL, joins, and transactional comfort. NoSQL key-value stores give elastic scale for simple access patterns. Managed caches give speed when hot reads must avoid the database. The trick is matching access pattern, not collecting logos.

So what to do? Write the fit matrix before provisioning anything.

- Prioritise the slowest or costliest path.
- Measure idle time honestly.
- Record operational ownership.
- Record rollback method.
- Record debugging path.
- Record compliance limits.

Good teams choose boring defaults first. Fancy choices can wait.
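A fit matrix can be as small as one record per candidate store. The sketch below is one possible shape, not a prescribed format: the `FitRow` fields mirror the checklist above, and every entry in `matrix` is a hypothetical example.

```python
from dataclasses import dataclass, fields


@dataclass
class FitRow:
    """One candidate store, described by the items the checklist asks for."""
    option: str          # which kind of store is being considered
    access_pattern: str  # what the workload actually does
    owner: str           # operational ownership
    rollback: str        # rollback method
    debugging: str       # debugging path
    compliance: str      # compliance limits


def render_matrix(rows):
    """Render the rows as a plain-text table, one line per candidate."""
    names = [f.name for f in fields(FitRow)]
    table = [names] + [[getattr(r, n) for n in names] for r in rows]
    widths = [max(len(row[i]) for row in table) for i in range(len(names))]
    return "\n".join(
        " | ".join(cell.ljust(w) for cell, w in zip(row, widths)).rstrip()
        for row in table
    )


# Hypothetical entries, for illustration only.
matrix = [
    FitRow("managed relational", "orders with joins", "platform team",
           "point-in-time restore", "slow query log", "EU region only"),
    FitRow("key-value store", "session lookup by id", "app team",
           "table export", "per-key metrics", "none recorded"),
]
print(render_matrix(matrix))
```

Keeping the matrix in a reviewable file makes it easy to spot an empty cell, which is usually the assumption nobody wrote down.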

2) Read the decision signals

Use relational services when the data model needs transactions and rich querying. Use key-value or document systems when access patterns are predictable and massive. Use caches when repeated reads dominate and stale data is acceptable briefly. Managed services take over patching and backups, but reduce low-level tuning freedom. Self-hosting may help when you need extreme customisation, but it raises operational load sharply. AI platforms often mix all three: relational for metadata, key-value for features, and a cache for fast session state.

Now use thresholds, not feelings. If latency is sacred, keep warm capacity ready. If cost is sacred, chase utilisation carefully. If control is sacred, reduce abstraction. If delivery speed is sacred, buy managed pieces.

Quick decision prompts:

- Do you need joins or only key lookups?
- What is the read-to-write ratio?
- Can stale data live for seconds or minutes?
- Who handles backups, failover, and upgrades?
- Is latency driven by hot keys or full scans?
- Will scale be smooth or spiky?

See. One clear 'no' can eliminate a whole option. Trade-offs are normal. Document the fallback path. Now watch.
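The first three prompts can be turned into a rough rule of thumb. This is a minimal sketch under stated assumptions: `suggest_store` is a hypothetical helper, and the threshold of 10 reads per write is an illustrative number, not a benchmark result.

```python
def suggest_store(needs_joins, needs_transactions, key_lookups_only,
                  read_write_ratio, stale_ok_seconds):
    """Return store types suggested by the quick decision prompts.

    The thresholds here are illustrative assumptions; replace them with
    numbers measured on your own workload.
    """
    suggestions = []
    if needs_joins or needs_transactions:
        suggestions.append("relational")
    elif key_lookups_only:
        suggestions.append("key-value")
    else:
        suggestions.append("relational")  # boring default first
    # A cache only helps when reads repeat and brief staleness is tolerable.
    if read_write_ratio >= 10 and stale_ok_seconds > 0:
        suggestions.append("cache")
    return suggestions


# A read-heavy key lookup that tolerates 30 seconds of staleness.
print(suggest_store(False, False, True, read_write_ratio=50, stale_ok_seconds=30))
```

The remaining prompts (ownership of backups, hot keys versus scans, spiky scale) are harder to encode and are better answered in the written fallback note.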

3) Map the working path

Data services usually sit behind application or model services. The hot path should be very obvious. Cache hits and misses need different cost shapes. Now watch the minimal flow.

┌────────────┐   ┌────────────┐   ┌────────────┐
│  App/API   │──→│   Cache    │──→│ PrimaryDB  │
└────────────┘   └─────┬──────┘   └─────┬──────┘
                       │                │
                       ▼                ▼
               ┌───────────────┐  ┌────────────┐
               │ Replica/NoSQL │  │   Backup   │
               └───────────────┘  └────────────┘

Reads can hit the cache first for hot objects. Misses fall through to the primary database or key-value store. Replicas or specialised stores offload selective workloads. Backups and snapshots must be automatic, not manual hope. Connection pooling matters when many workers share the same database. If the access pattern is unclear, write it down before picking the engine.

At every arrow, ask who retries. At every box, ask who pays. At every store, ask what expires. Now watch. One metric should sit beside each box. That is how operations stays sane.
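The read path in the diagram is commonly called cache-aside. The sketch below shows the idea with in-memory stand-ins only: `TTLCache` is a hypothetical toy class, not a Redis client, and a plain dict plays the primary database.

```python
import time


class TTLCache:
    """Tiny in-memory stand-in for a managed cache (an assumption, not Redis)."""

    def __init__(self, clock=time.monotonic):
        self._data = {}
        self._clock = clock

    def get(self, key):
        entry = self._data.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if self._clock() >= expires_at:
            del self._data[key]  # expired entries count as misses
            return None
        return value

    def set(self, key, value, ttl_seconds):
        self._data[key] = (value, self._clock() + ttl_seconds)


def read_through(key, cache, load_from_db, ttl_seconds=60):
    """Cache-aside read: try the cache, fall through to the database on a miss."""
    value = cache.get(key)
    if value is not None:
        return value, "hit"
    value = load_from_db(key)
    if value is not None:
        cache.set(key, value, ttl_seconds)
    return value, "miss"


# Usage with a dict standing in for the primary database.
primary_db = {"user:1": {"name": "Ada"}}
cache = TTLCache()
print(read_through("user:1", cache, primary_db.get))  # first read misses
print(read_through("user:1", cache, primary_db.get))  # second read hits
```

Returning the hit or miss label alongside the value makes it straightforward to put one metric beside the cache box.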

4) Notice the common traps

- Using Redis as the only system of record by accident.
- Forcing relational joins onto a key-value database.
- Ignoring connection limits until production traffic arrives.
- Caching everything and forgetting invalidation rules.
- Self-hosting databases for learning when the business needs speed.
- Assuming managed means zero tuning or zero monitoring.

See. Most outages start as silent assumptions. Review these traps before launch:

- Hot partitions can flatten NoSQL throughput.
- Connection storms can knock over relational services.
- Cache stampedes can hurt both latency and cost.
- Unplanned failovers can surprise untested clients.
- Large scans can compete with hot transactional traffic.
- Backups without restore drills create false confidence.

Simple, no? Write failure drills for the top three risks. Decide what degrades first. Decide what must never degrade. Review quotas before launch day. Prefer explicit limits over wishful thinking. Now watch.
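One common guard against cache stampedes is to let a single caller per key rebuild the value while the others wait. The sketch below assumes a single process with threads and uses a plain dict as the cache; `SingleFlight` is a hypothetical helper, and a multi-node setup would need a shared lock or a different technique.

```python
import threading


class SingleFlight:
    """Let one caller per key run the expensive load; others wait and reuse it."""

    def __init__(self):
        self._guard = threading.Lock()
        self._key_locks = {}  # never pruned here; fine for a sketch

    def _lock_for(self, key):
        with self._guard:
            return self._key_locks.setdefault(key, threading.Lock())

    def get(self, key, cache, load):
        value = cache.get(key)
        if value is not None:
            return value
        with self._lock_for(key):
            # Re-check: another thread may have filled the cache while we waited.
            value = cache.get(key)
            if value is None:
                value = load(key)
                cache[key] = value
            return value


# Usage: eight threads ask for the same hot key at once.
load_calls = []

def load_from_db(key):
    load_calls.append(key)  # record how often the "database" is hit
    return key.upper()

flight = SingleFlight()
shared_cache = {}
workers = [
    threading.Thread(target=flight.get, args=("hot", shared_cache, load_from_db))
    for _ in range(8)
]
for w in workers:
    w.start()
for w in workers:
    w.join()
print(len(load_calls))  # number of database loads for the hot key
```

The re-check inside the lock is the important part: without it, every waiting thread would still run the load once the lock is released.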

5) Lock the operating routine

Write the top five queries or key lookups first. Choose the system that matches those access paths. Enable backups, restore tests, and automated patch windows. Add pooling, timeouts, and retry discipline. Define cache TTLs and invalidation ownership. Keep self-hosting as an exception, not a default.

Lock the language across the team. Use the same terms in code, dashboards, and reviews. Review this quick operating list:

- Measure hit rate, not only cache size.
- Measure p95 query latency.
- Review failover behaviour early.
- Partition with access patterns in mind.
- Keep schemas documented.
- Test restore before boasting about backups.

Good platform design keeps the barn, feeding trough, fence, breeding ground, and ledger aligned. So what to do? Create a one-page runbook. Create a one-page cost note. Create a one-page rollback note. Teach the team the same words. That alignment saves real money.

See. Consistency beats cleverness. Benchmark first; opinions come second. Name the owner of every limit. Prefer reversible choices whenever the future is foggy. Document what changes during incidents. Keep one small default path for newcomers. Automate the boring thing as soon as it stabilises. Vendor docs help, but workload data matters more. Good naming prevents bad tickets. Observe p95, not only averages. Small runbooks beat heroic memory. Teach cost with the same seriousness as latency. Now watch how much confusion disappears.
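Two numbers from the operating list, hit rate and p95 latency, are easy to compute by hand. This sketch uses the nearest-rank percentile definition (other definitions interpolate and give slightly different values), and the latency samples are made up for illustration.

```python
import math


def hit_rate(hits, misses):
    """Fraction of cache lookups served from the cache."""
    total = hits + misses
    return hits / total if total else 0.0


def percentile(samples, pct):
    """Nearest-rank percentile: smallest sample with at least pct% at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


# Hypothetical query latencies in milliseconds, with one slow outlier.
latencies_ms = [12, 14, 11, 13, 15, 12, 14, 13, 12, 250]
print(hit_rate(hits=900, misses=100))         # 0.9
print(sum(latencies_ms) / len(latencies_ms))  # mean: 36.6
print(percentile(latencies_ms, 95))           # p95: 250
```

The mean of 36.6 ms describes no real request, while the p95 of 250 ms exposes the slow one, which is why the list asks for p95 and not only averages.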

Where this lives in the wild

Amazon Aurora or RDS for transactional application data. Common choice when teams want SQL without running database nodes themselves.
DynamoDB or Bigtable for predictable key-based scale. Useful for event state, feature lookups, and metadata paths.
ElastiCache or Memorystore for hot reads. Typical layer for sessions, feature caches, and rate counters.
Firestore or Cosmos DB for flexible document access. Common in product teams that need fast iteration with managed operations.
Managed PostgreSQL plus Redis in inference platforms. Very common mix for metadata plus low-latency serving state.

Pause and recall

When does relational beat NoSQL here? Say it without looking up vendor names.
What is a cache actually buying you? Give one concrete example.
Why does managed not mean free from operational thinking? State the trade-off in one line.
What is the first thing to write before choosing the database? Mention one failure mode too.

Interview Q&A

Q. When should you pick RDS or Aurora?
A. Pick it when transactions, consistency, and flexible SQL queries matter.
Common wrong answer to avoid: Use SQL only when scale is small.
Better direction: Describe access pattern and operational burden together.

Q. When does DynamoDB-style storage fit better?
A. It fits when key-based access is clear, scale is high, and latency must stay predictable.
Common wrong answer to avoid: NoSQL is always faster than SQL.
Better direction: Mention partition design and limited query flexibility.

Q. Why put Redis in front of a database?
A. Redis can absorb hot reads, reduce latency, and protect the primary store from repeated identical reads.
Common wrong answer to avoid: Redis makes the database unnecessary.
Better direction: Talk about TTLs and invalidation responsibility.

Q. Managed versus self-hosted: what is the real trade-off?
A. Managed buys time and resilience features; self-hosted buys low-level control at high ops cost.
Common wrong answer to avoid: Managed is expensive, so serious teams self-host.
Better direction: Tie the answer to team maturity and business urgency.

Apply now (5 min)

List one transactional workload.
List one simple key lookup workload.
List one hot read that repeats often.
Assign each to relational, NoSQL, or cache.
Write one reason about access pattern.
Write one reason about operations.
Add one backup or failover requirement.
Mark one place where caching could backfire.

Bridge. Databases managed. But AI needs special hardware — GPUs. → 05