10. Object Storage and Data Lakes — Storing Everything You Cannot Afford to Keep in a Database¶

~13 min read. Petabytes of cold data. Databases cry. Object storage does not.

Built on the ELI5 in 00-eli5.md. The shelf — long rows of archive boxes that nobody touches daily but must be accessible on demand — is exactly what object storage models. Cheap, durable, not fast.

1. What Object Storage Is¶

Object storage is a flat namespace of blobs (binary large objects) addressed by a unique key. No folders, no relations, no transactions. Just a key and its bytes.

Traditional filesystem:           Object storage:
/data/books/2024/january.csv      bucket: library-archive
  └─ folder hierarchy               key: books/2024/january.csv
  └─ file permissions                 value: 42 MB blob
  └─ POSIX interface                  metadata: {content-type, size, etag}
  └─ local disk                       storage: distributed, geo-redundant

The major object storage systems are AWS S3, Google Cloud Storage (GCS), and Azure Blob Storage. All offer 11 nines (99.999999999%) durability by distributing erasure-coded chunks across multiple availability zones.

Durability vs availability: S3 Standard is 99.99% available (52 minutes downtime per year) but 11 nines durable (essentially zero data loss). These are separate guarantees. Know both.

2. S3 and GCS — Core Concepts¶

Storage Classes (Cost Tiers)¶

Class                Access frequency   Latency        Cost/GB/month
─────────────────────────────────────────────────────────────────────
S3 Standard          Frequent           ms             ~$0.023
S3 Standard-IA       Infrequent         ms             ~$0.0125
S3 Glacier Instant   Quarterly          ms             ~$0.004
S3 Glacier Flexible  Annual             minutes        ~$0.0036
S3 Glacier Deep      Rarely             hours          ~$0.00099

The shelf analogy: Standard is the front shelf you reach daily. Glacier Deep Archive is the basement warehouse you open once a year.

S3 Intelligent-Tiering moves objects automatically between tiers based on access patterns. Pay a small monitoring fee; save significantly on rarely accessed data.

Consistency Model¶

S3 is strongly consistent for object reads, writes, and listings. Every PUT and DELETE is immediately visible to subsequent GETs and LISTs on the same key.

Key Design Matters¶

S3 partitions based on key prefix. Avoid sequential numeric prefixes like 000001.parquet, 000002.parquet. All go to the same partition, throttling throughput.

Bad:   000001.parquet, 000002.parquet, 000003.parquet
Good:  a1b2/000001.parquet, c3d4/000002.parquet, e5f6/000003.parquet
       (hash prefix distributes across S3 partitions)

3. Data Lake Pattern¶

A data lake stores raw, schema-on-read data at massive scale. No pre-defined schema required on ingest. Schema is applied at query time.

Ingest:  App → S3 (raw JSON, CSV, logs)     ← schema-on-write not required
Process: Spark / Flink reads S3, transforms ← apply schema during processing
Serve:   Athena / BigQuery queries S3        ← SQL on object storage

Data flow:
┌──────────┐  raw data   ┌───────────┐  cleaned data  ┌───────────┐
│  Source  │ ──────────► │  Bronze   │ ─────────────► │  Silver   │
│ (App DB, │             │  Zone     │                │  Zone     │
│  Logs)   │             │ (raw S3)  │                │(Parquet)  │
└──────────┘             └───────────┘                └───────────┘
                                                           │
                                                     aggregated
                                                           ▼
                                                    ┌───────────┐
                                                    │   Gold    │
                                                    │   Zone    │
                                                    │(BI-ready) │
                                                    └───────────┘

This Bronze-Silver-Gold (Medallion) architecture is standard at Databricks and widely adopted.

4. Columnar Formats — Parquet and ORC¶

Row-based formats (CSV, JSON) store all columns of a row together. Columnar formats store all values of a column together. For analytics, columnar wins dramatically.

Row-based (CSV):
  Row 1: [id=1, title="Dune", genre="scifi", year=1965, copies=3]
  Row 2: [id=2, title="Neuromancer", genre="scifi", year=1984, copies=1]

Columnar (Parquet):
  id column:     [1, 2, 3, 4, ...]
  title column:  ["Dune", "Neuromancer", ...]
  genre column:  ["scifi", "scifi", ...]
  year column:   [1965, 1984, ...]
  copies column: [3, 1, ...]

Why columnar is faster for analytics:

Column pruning: SELECT genre, COUNT(*) FROM books GROUP BY genre reads only the genre column. Row-based must read every column for every row.
Compression ratio: repeated values in a column compress dramatically (run-length encoding, dictionary encoding).
Vectorised execution: CPUs process a batch of column values in a single SIMD instruction.

Query: SELECT AVG(copies) FROM books WHERE genre = 'scifi'
Row store: scan 1 TB of all columns, discard 90%
Parquet:   scan only genre column (100 GB) + copies column (100 GB) = 200 GB
           10× less I/O

Parquet vs ORC: - Parquet: preferred in Spark, AWS ecosystem, widely supported. - ORC: preferred in Hive, slightly better compression on highly structured data. - Both support nested schemas, predicate pushdown, and column statistics.

5. Data Lakehouse — The Best of Both Worlds¶

A data lakehouse puts ACID transaction semantics on top of object storage.

Traditional data warehouse:   Fast SQL, ACID, expensive, rigid schema
Traditional data lake:        Cheap storage, flexible, no ACID, no indexing

Data lakehouse:
  ┌─────────────────────────────────────────────────────┐
  │  Metadata layer (Delta Lake / Iceberg / Hudi)       │
  │  ─ Transaction log (ACID commits on Parquet files)  │
  │  ─ Schema enforcement and evolution                 │
  │  ─ Time travel (query data as of past version)      │
  └─────────────────────────────────────────────────────┘
           │
  ┌────────────────────┐
  │   Parquet files    │    ← still on cheap object storage
  │   on S3 / GCS      │
  └────────────────────┘

Delta Lake (Databricks) and Apache Iceberg (Netflix, Apple) are the dominant open formats. Both maintain a transaction log that records every insert, update, and delete as a commit. This enables ACID semantics without a traditional database engine.

Time travel: Query the shelf as it was 30 days ago.

SELECT * FROM loans VERSION AS OF 30  -- Delta Lake syntax
SELECT * FROM loans FOR SYSTEM_TIME AS OF '2024-01-01'  -- Iceberg

6. Cost Comparison — Object Storage vs Database Storage¶

Storage type          Cost/GB/month    Latency    IOPS         Use for
──────────────────────────────────────────────────────────────────────────
EBS gp3 (SSD)         ~$0.08           <1 ms      16,000       DB hot data
EBS io2 (high IOPS)   ~$0.125          <1 ms      256,000      DB critical
S3 Standard           ~$0.023          1–100 ms   unlimited    Data lake
S3 Glacier Deep       ~$0.001          hours      n/a          Archive
RDS storage (gp2)     ~$0.115          <1 ms      3,000        Managed DB

Rule of thumb: object storage is ~5× cheaper than SSD block storage. At petabyte scale this is millions of dollars per year.

Egress costs are the hidden trap. Moving data out of S3 to the internet costs ~$0.09/GB. Cross-region transfer costs ~$0.02/GB. Within the same region, transfer to EC2 is free. Design your compute to run in the same region as your data lake.

Where this lives in the wild¶

Netflix (Data Platform) — Entire data warehouse is an Apache Iceberg lakehouse on S3. 100+ PB of Parquet files. Engineers query with Spark and Trino (PrestoSQL). The shelf holds every view event ever recorded, queryable in seconds via predicate pushdown.

Airbnb (Data Engineering) — Apache Iceberg on S3 supports ACID tables, schema evolution, and time-travel debugging for listing and booking analytics.

Databricks (Platform, Delta Lake) — Delta Lake on object storage gives teams ACID commits, schema enforcement, and efficient batch-plus-stream processing on one lakehouse path.

Stripe (Fraud Detection) — Raw event logs land in S3 as JSON, get transformed to Parquet via Spark jobs, and feed ML feature pipelines efficiently.

Twitter / X (Archive) — Large-scale analytics datasets in ORC or Parquet benefit from predicate pushdown on timestamp filters, keeping date-range scans manageable.

Pause and recall¶

What is the difference between durability and availability? Give S3 numbers for both.
Why is columnar storage (Parquet) faster than row storage (CSV) for analytics queries? Name two specific mechanisms.
What does a data lakehouse add on top of a plain data lake? Name two concrete capabilities.
You need to store 10 TB of library scan logs for 7 years, rarely accessed. Which S3 storage class and format would you choose and why?

Interview Q&A¶

Q: What is the difference between a data lake and a data warehouse? When would you use each?

A data lake stores raw, unstructured or semi-structured data in object storage with schema applied at read time. A data warehouse stores structured, cleaned data with a pre-defined schema, optimised for SQL analytics. Use a data lake for raw event storage and flexible exploration. Use a data warehouse for governed BI reporting with known query patterns. A data lakehouse combines both via a transaction layer over object storage.

Common wrong answer to avoid: Saying a data lake is "just cheaper storage." The architectural difference is schema-on-read versus schema-on-write. This affects how you ingest, govern, and evolve data. Cost is a consequence, not the definition.

Q: Why is Parquet significantly faster than CSV for analytical queries?

Parquet is columnar: it stores all values of a column together. Analytical queries typically touch a few columns across many rows. Parquet reads only the queried columns (column pruning), skips row groups that fail predicate checks (predicate pushdown), and compresses repetitive column values far better than row-oriented formats.

Common wrong answer to avoid: Saying "Parquet is binary so it is faster to parse." Binary encoding helps slightly but is not the main reason. Column pruning and predicate pushdown are the primary performance advantages. A candidate who only mentions binary encoding has a shallow understanding.

Q: What is time travel in a data lakehouse and how does it work?

Time travel allows querying data as it existed at a past point in time or version. Delta Lake and Iceberg maintain a transaction log of all commits (insert, update, delete). To query a past state, the engine reads the log up to the target version and reconstructs which Parquet files were valid at that point. No data is physically deleted on update — old files are retained until explicitly vacuumed.

Common wrong answer to avoid: Saying time travel requires copying data snapshots. The transaction log approach stores only the metadata of what changed, not full copies of the data. Full copies would multiply storage costs with every write. The log-based approach is efficient and is how both Delta Lake and Iceberg implement it.

Q: What are the hidden cost traps in an object storage data lake?

Egress costs: transferring data out of object storage to the internet or across regions costs per GB and can exceed storage costs at high query volumes. API call costs: each LIST and GET request has a per-request charge; inefficient small-file patterns can generate millions of API calls. Small file problem: many small files mean many API calls and poor Parquet compression; compaction jobs are needed to merge them.

Common wrong answer to avoid: Focusing only on storage cost per GB. The raw storage cost is the cheapest part. Compute (Spark clusters querying the lake), egress, and API call costs routinely dominate the total bill in production data lakes.

Apply now (5 min)¶

Exercise: The shelf in the library system stores a scan log: every time a book is scanned in or out, an event is written. The system generates 5 million events per day. Design the storage strategy: choose object storage class, file format, partitioning scheme (how the S3 prefix is structured), and whether to use a lakehouse format. Justify each choice. Then estimate monthly storage cost at $0.023/GB for Standard and $0.001/GB for Glacier Deep after 365 days.

Sketch from memory: Draw the Bronze-Silver-Gold data lake architecture. Label what format the data is in at each stage, what process moves data between stages, and who queries each stage.

Bridge. You can now store petabytes cheaply and query them with SQL. Next: what if your queries are "find me documents similar to this one" rather than exact lookups? → 11-search-and-vector-stores.md