02. Object storage and data: buckets, lifecycle, and lake patterns

⏱️ Estimated time: 19 min | Level: intermediate

ELI5 callback: In the dragon farm, the barn runs the work, the feeding trough holds the data, the fence limits access, the breeding ground scales the herd, and the ledger stops waste. Today we organise the data that feeds every model and pipeline.

1) See the shape clearly

Object storage, lifecycle policies, and data lake zones all matter here. They do not address the same pressure. See. Start with workload shape, not vendor branding. Check startup time, runtime length, and host control. Check who patches the base layer. Check whether scale is steady or bursty. Check whether warm state must survive. Simple, no?

Object storage is cheap, durable, and almost endlessly scalable. Lifecycle policies move old data to colder and cheaper tiers. Data lake zones separate raw, cleaned, and curated assets. AI teams struggle when buckets become giant junk drawers.

So what to do? Write the fit matrix before provisioning anything.

- Prioritise the slowest or costliest path.
- Measure idle time honestly.
- Record operational ownership.
- Record rollback method.
- Record debugging path.
- Record compliance limits.

Good teams choose boring defaults first. Fancy choices can wait.
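
One way to make the fit matrix concrete is to keep one record per option and refuse to provision while any answer is blank. The sketch below is only an illustration: the `FitRow` class, its field names, and the sample values are hypothetical, chosen to mirror the checklist above.

```python
from dataclasses import dataclass, fields


@dataclass
class FitRow:
    """One row of a hypothetical fit matrix (field names are assumptions)."""
    option: str
    slowest_or_costliest_path: str = ""
    idle_time_note: str = ""
    operational_owner: str = ""
    rollback_method: str = ""
    debugging_path: str = ""
    compliance_limits: str = ""


def missing_answers(row: FitRow) -> list[str]:
    """Return the names of the fields that are still blank."""
    return [f.name for f in fields(row) if not getattr(row, f.name).strip()]


# Sample values are invented for illustration only.
row = FitRow(
    option="object storage with lifecycle rules",
    slowest_or_costliest_path="training job reads from the curated zone",
    operational_owner="data-platform team",
    rollback_method="restore the previous object version",
)

print(missing_answers(row))
# ['idle_time_note', 'debugging_path', 'compliance_limits']
```

A blank field is a question nobody has answered yet, which is usually cheaper to discover here than after launch.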

2) Read the decision signals

Use object storage for large files, model weights, logs, and datasets. Keep raw data immutable so audits and reprocessing stay possible. Use lifecycle rules to expire scratch files and archive old snapshots. Partition by date, tenant, or dataset family when reads need filtering. Very small files can destroy throughput and metadata efficiency. Lake layouts need naming discipline before scale arrives.

Now use thresholds, not feelings. If latency is sacred, keep readiness. If cost is sacred, chase utilisation carefully. If control is sacred, reduce abstraction. If delivery speed is sacred, buy managed pieces.

Quick decision prompts:

- Who writes the bucket, and who only reads it?
- Which prefixes represent raw, clean, and curated zones?
- When should temporary outputs expire?
- How will schemas and versions be tracked?
- Will training jobs read many tiny files?
- Do you need cross-region replication?

See. One clear 'no' can eliminate a whole option. Trade-offs are normal. Document the fallback path. Now watch.
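
Partitioning by date, tenant, or dataset family usually comes down to how object keys are built. The sketch below assumes a hypothetical layout with `key=value` path segments, a convention many query engines recognise for partition filtering. The zone names, the function name `partitioned_key`, and the sample values are illustrative, not a prescribed standard.

```python
from datetime import date

# Assumed zone names; adjust to whatever your team has agreed on.
ALLOWED_ZONES = {"raw", "clean", "curated", "scratch"}


def partitioned_key(zone: str, dataset: str, tenant: str, day: date, filename: str) -> str:
    """Build an object key whose prefix encodes zone, dataset, tenant, and date."""
    if zone not in ALLOWED_ZONES:
        raise ValueError(f"unknown zone: {zone!r}")
    return (
        f"{zone}/dataset={dataset}/tenant={tenant}/"
        f"date={day.isoformat()}/{filename}"
    )


key = partitioned_key("raw", "clicks", "acme", date(2024, 5, 1), "part-0000.parquet")
print(key)
# raw/dataset=clicks/tenant=acme/date=2024-05-01/part-0000.parquet
```

Because the date and tenant sit in the prefix, a reader that only needs one day or one tenant can list a narrow prefix instead of scanning the whole bucket.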

3) Map the working path

Most AI data paths look simple from far away. In reality, zones, versions, and expiry rules matter a lot. You want clean movement from landing to serving. Now watch the minimal map.

┌────────────┐   ┌────────────┐   ┌────────────┐
│  Sources   │──→│  Landing   │──→│ Transform  │
└────────────┘   └─────┬──────┘   └─────┬──────┘
                       │                │
                       ▼                ▼
                 ┌────────────┐   ┌────────────┐
                 │ Lake Zones │   │  Catalog   │
                 └────────────┘   └────────────┘

The landing area stores raw inputs exactly as received. Transforms produce cleaner tables, chunks, or features. Lake zones should separate temporary work from trusted outputs. Catalogs or manifests stop teams from guessing file meaning. Lifecycle rules belong in the bucket configuration, not only in someone's head. Without ownership, bucket sprawl becomes invisible waste.

At every arrow, ask who retries. At every box, ask who pays. At every store, ask what expires. Now watch. One metric should sit beside each box. That is how operations stay sane.
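
A manifest can be as small as one record per object with a size, a checksum, and a version marker. The following is a minimal sketch using only the Python standard library. The entry fields, the key, and the sample payload are assumptions for illustration, not a fixed manifest format.

```python
import hashlib
import json


def manifest_entry(key: str, payload: bytes, version: str) -> dict:
    """Describe one object: where it lives, how big it is, and its checksum."""
    return {
        "key": key,
        "bytes": len(payload),
        "sha256": hashlib.sha256(payload).hexdigest(),
        "version": version,
    }


def verify(entry: dict, payload: bytes) -> bool:
    """Check that a payload still matches the size and checksum in its entry."""
    return (
        len(payload) == entry["bytes"]
        and hashlib.sha256(payload).hexdigest() == entry["sha256"]
    )


# Invented sample data standing in for a landed raw file.
payload = b"user_id,clicked\n1,true\n"
entry = manifest_entry("raw/clicks/2024-05-01.csv", payload, version="v1")

print(json.dumps(entry, indent=2))
print(verify(entry, payload))          # True
print(verify(entry, payload + b"x"))   # False
```

Downstream jobs that check the manifest before reading can fail loudly on a changed file instead of training on silently different data.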

4) Notice the common traps

Putting raw and curated data in one flat bucket. Letting temporary training artifacts live forever. Ignoring object counts while focusing only on bytes. Renaming prefixes casually and breaking downstream jobs. Skipping checksums and version markers on critical datasets. Expecting database-style latency from object storage. See. Most outages start as silent assumptions.

Review these traps before launch:

- Tiny-file storms can make training startup painfully slow.
- Wrong lifecycle rules can delete useful data early.
- Cross-account sharing can become messy and unsafe.
- Missing manifests can create silent data drift.
- Uncompressed logs can bloat bills fast.
- Replication without purpose can double cost.

Simple, no? Write failure drills for the top three risks. Decide what degrades first. Decide what must never degrade. Review quotas before launch day. Prefer explicit limits over wishful thinking. Now watch.
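
Watching object counts alongside bytes can be done with a simple pass over a bucket listing. The sketch below works on an in-memory list of `(key, size)` pairs, so it makes no cloud calls. The 8 MiB threshold, the function names, and the sample listing are assumptions to tune against your own workload.

```python
from collections import defaultdict


def prefix_stats(objects, depth=1):
    """Aggregate object count and total bytes per key prefix."""
    stats = defaultdict(lambda: {"count": 0, "bytes": 0})
    for key, size in objects:
        prefix = "/".join(key.split("/")[:depth])
        stats[prefix]["count"] += 1
        stats[prefix]["bytes"] += size
    return dict(stats)


def tiny_file_prefixes(objects, min_mean_bytes=8 * 1024 * 1024, depth=1):
    """Return prefixes whose mean object size is below the threshold."""
    flagged = []
    for prefix, s in prefix_stats(objects, depth).items():
        if s["bytes"] / s["count"] < min_mean_bytes:
            flagged.append(prefix)
    return sorted(flagged)


# Invented listing: two small scratch shards and one large curated file.
listing = [
    ("scratch/run-1/shard-0001.json", 4_000),
    ("scratch/run-1/shard-0002.json", 6_000),
    ("curated/features/part-0000.parquet", 256 * 1024 * 1024),
]

print(tiny_file_prefixes(listing))  # ['scratch']
```

Mean size is a blunt signal, but it is enough to point at the prefixes worth compacting first.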

5) Lock the operating routine

Define bucket names, prefixes, and ownership up front. Separate raw, clean, curated, and scratch locations. Turn on versioning for critical datasets and model artifacts. Attach lifecycle rules to temporary and archival classes. Publish file format, partition, and naming rules. Measure read amplification before calling the lake slow.

Lock the language across the team. Use the same terms in code, dashboards, and reviews.

Review this quick operating list:

- Prefer columnar formats for analytics paths.
- Bundle tiny files before large scans.
- Tag buckets by owner and retention.
- Document reprocessing strategy.
- Protect raw data from accidental edits.
- Keep schema notes discoverable.

Good platform design keeps the barn, feeding trough, fence, breeding ground, and ledger aligned. So what to do? Create a one-page runbook. Create a one-page cost note. Create a one-page rollback note. Teach the team the same words. That alignment saves real money.

See. Consistency beats cleverness. Benchmark first; opinions come second. Name the owner of every limit. Prefer reversible choices whenever the future is foggy. Document what changes during incidents. Keep one small default path for newcomers. Automate the boring thing as soon as it stabilises. Vendor docs help, but workload data matters more. Good naming prevents bad tickets. Observe p95, not only averages. Small runbooks beat heroic memory. Teach cost with the same seriousness as latency. Now watch how much confusion disappears.
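
Lifecycle rules for temporary and archival data are easier to review when they live in code. The sketch below builds a plain dictionary that mirrors the rule shape used by Amazon S3 lifecycle configuration. Other providers use different schemas. The rule IDs, prefixes, and day counts are hypothetical, and nothing here calls a cloud API.

```python
import json


def expire_rule(rule_id: str, prefix: str, days: int) -> dict:
    """Rule that deletes objects under a prefix after a number of days."""
    return {
        "ID": rule_id,
        "Filter": {"Prefix": prefix},
        "Status": "Enabled",
        "Expiration": {"Days": days},
    }


def archive_rule(rule_id: str, prefix: str, days: int, storage_class: str = "GLACIER") -> dict:
    """Rule that moves objects under a prefix to a colder storage class."""
    return {
        "ID": rule_id,
        "Filter": {"Prefix": prefix},
        "Status": "Enabled",
        "Transitions": [{"Days": days, "StorageClass": storage_class}],
    }


# Prefixes and retention periods below are assumptions, not recommendations.
lifecycle = {
    "Rules": [
        expire_rule("expire-scratch", "scratch/", days=7),
        archive_rule("archive-snapshots", "curated/snapshots/", days=90),
    ]
}

print(json.dumps(lifecycle, indent=2))
```

Keeping this structure in version control gives every retention change an author and a review, which helps guard against a wrong rule deleting useful data early.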

Where this lives in the wild

Amazon S3 data lakes with Glue catalogs. Very common pattern for raw ingestion, transformation, and feature storage.
Google Cloud Storage with BigLake or BigQuery external tables. Useful when lake files must feed analytics and ML together.
Azure Blob and ADLS Gen2 lake layouts. Common in enterprise environments that mix Spark, BI, and ML workloads.
Databricks lakehouse patterns on S3, GCS, or ADLS. Good example of structure, metadata, and compute meeting properly.
Feature stores that persist offline data in object storage. Shows how the bucket often becomes the system of record for training.

Pause and recall

Why is immutable raw storage such a big deal? Say it without looking up vendor names.
What problem do lifecycle policies actually solve? Give one concrete example.
Why can tiny files hurt object-storage-based pipelines? State the trade-off in one line.
What must every data lake naming scheme make obvious? Mention one failure mode too.

Interview Q&A

Q. Why use object storage instead of a database for AI datasets?
A. Because it scales cheaply for large blobs, logs, checkpoints, and batch reads.
Common wrong answer to avoid: Databases are old; buckets are the modern replacement.
Better direction: Explain size, access pattern, and cost differences.

Q. What does a basic data lake layout look like?
A. Keep clear raw, cleaned, curated, and scratch zones with ownership and retention rules.
Common wrong answer to avoid: Just create one bucket per team and let them manage it.
Better direction: Talk about discoverability and consistent prefixes.

Q. How do lifecycle policies help?
A. They delete or tier old data automatically so storage does not drift into waste.
Common wrong answer to avoid: They are mainly for compliance paperwork.
Better direction: Tie them to scratch outputs, archives, and bill control.

Q. What is the trap with many small files?
A. Listing, opening, and scanning them can dominate job startup time and metadata overhead.
Common wrong answer to avoid: Small files are fine because storage is infinite.
Better direction: Mention compaction or bundling as the fix.
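
Compaction or bundling can start with a plan for which small files go together. The sketch below is one simple greedy approach, assuming files are packed in listing order until a target size would be exceeded. The function name, file names, sizes, and target are invented for illustration, and real compaction would also rewrite the data.

```python
def plan_bundles(files, target_bytes):
    """Group (name, size) pairs, in order, into bundles of at most target_bytes.

    A single file larger than the target gets a bundle of its own.
    """
    bundles, current, current_size = [], [], 0
    for name, size in files:
        # Close the current bundle if adding this file would exceed the target.
        if current and current_size + size > target_bytes:
            bundles.append(current)
            current, current_size = [], 0
        current.append(name)
        current_size += size
    if current:
        bundles.append(current)
    return bundles


files = [("a.json", 40), ("b.json", 50), ("c.json", 30), ("d.json", 120), ("e.json", 10)]
print(plan_bundles(files, target_bytes=100))
# [['a.json', 'b.json'], ['c.json'], ['d.json'], ['e.json']]
```

Packing in order keeps neighbouring records together, at the cost of some half-empty bundles. Sorting by size first would pack tighter but lose that locality.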

Apply now (5 min)

Draw your current or imagined bucket layout.
Mark raw, clean, curated, and scratch prefixes.
Add one owner name beside each prefix.
Add one retention rule beside each prefix.
List one dataset that must stay immutable.
List one dataset that should expire quickly.
Write one compaction idea for tiny files.
Check whether the naming scheme explains itself.
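
If you want a mechanical version of that last check, a pattern over object keys is one option. The sketch below assumes a hypothetical layout of zone, dataset, tenant, and date segments followed by a file name. The pattern and the sample keys are illustrative, so swap in your own scheme.

```python
import re

# Hypothetical naming scheme: zone/dataset=.../tenant=.../date=YYYY-MM-DD/file
KEY_PATTERN = re.compile(
    r"^(raw|clean|curated|scratch)/"   # zone
    r"dataset=[a-z0-9_]+/"             # dataset family
    r"tenant=[a-z0-9_]+/"              # tenant
    r"date=\d{4}-\d{2}-\d{2}/"         # date partition
    r"[A-Za-z0-9_.\-]+$"               # file name
)


def explains_itself(key: str) -> bool:
    """Return True when a key follows the assumed naming scheme."""
    return KEY_PATTERN.match(key) is not None


print(explains_itself("raw/dataset=clicks/tenant=acme/date=2024-05-01/part-0000.parquet"))  # True
print(explains_itself("final_v2_NEW/data.csv"))                                             # False
```

The pattern checks shape only, so a date such as 2024-13-99 would still pass. Treat it as a first filter, not full validation.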

Bridge. Data stored. But who can access the feeding trough? → 03