
00. Kafka — ELI5

Kafka is a giant, append-only diary. Writers add pages to the back. Readers each carry their own bookmark. Nobody erases anything; old pages just expire on a schedule. Many readers can read the same diary at different speeds, from any point in time, without disturbing each other.


Picture a vast library with one rule: every event that happens anywhere in the company is written into a numbered diary. The diary is split across many shelves (partitions); events about customer 42 always go to shelf X, events about customer 17 always go to shelf Y. Each shelf has its own chronological page numbers.

Readers come in. Each reader carries a bookmark: "I'm at page 1042 on shelf X." They read the next page, do something with it, and advance their bookmark. The library never marks a page as read and never removes a page because someone read it. It just keeps the diary and lets readers manage their own progress.

The kicker: many readers can read the same shelf simultaneously, each at their own bookmark. A new reader joining tomorrow can start from the oldest page still on the shelf and replay everything that has not yet expired. A reader who fell behind can catch up. A new analytics service can subscribe without disturbing the existing ones.

This is Kafka. A distributed commit log. Partitioned for parallelism. Retained for hours, days, or forever. Replayable. Multi-reader.
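
The diary-and-bookmark model can be sketched in a few lines of plain Python. This is a toy in-memory illustration, not Kafka client code; the names `Shelf` and `Reader` are invented here to mirror the analogy.

```python
from dataclasses import dataclass, field


@dataclass
class Shelf:
    """One partition: an append-only list whose indexes act as page numbers."""
    pages: list = field(default_factory=list)

    def append(self, event) -> int:
        self.pages.append(event)
        return len(self.pages) - 1  # page number (offset) of the new event


@dataclass
class Reader:
    """A reader owns its bookmark; the shelf knows nothing about it."""
    name: str
    bookmark: int = 0

    def read_next(self, shelf: Shelf):
        if self.bookmark >= len(shelf.pages):
            return None  # caught up; nothing new yet
        event = shelf.pages[self.bookmark]
        self.bookmark += 1
        return event


shelf = Shelf()
for event in ["order placed", "order paid", "order shipped"]:
    shelf.append(event)

billing = Reader("billing")
analytics = Reader("analytics")  # joins later, still starts from page 0

billing.read_next(shelf)
billing.read_next(shelf)
first_for_analytics = analytics.read_next(shelf)

print(billing.bookmark, analytics.bookmark, first_for_analytics)
# prints: 2 1 order placed
```

Reading never changes the shelf: both readers see the same three pages, and each advances only its own bookmark.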


The recurring vocabulary

Name                     What it is
topic                    a named log; the unit of subscription
partition                a shard of a topic; events within a partition are strictly ordered
broker                   a Kafka server; multiple brokers form a cluster
producer                 a client that appends events to a topic
consumer                 a client that reads events from a topic
consumer group           a group of consumers cooperating to read a topic; each partition assigned to one consumer in the group
offset                   the page number: each event's position within a partition
commit                   a consumer's record of "I've processed up to offset X"
replica                  a copy of a partition on another broker; for fault tolerance
ISR (in-sync replicas)   replicas currently caught up to the leader
retention                how long events stay in the topic before deletion
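
How offset, commit, and retention interact can be shown with a small simulation. It is a simplified sketch: `PartitionLog` is a hypothetical in-memory class, retention here is purely time-based and applied record by record (real Kafka deletes whole log segments), and the consumer is assumed to resume from the oldest available record when its committed position has expired, which corresponds to the `earliest` setting of `auto.offset.reset`.

```python
from collections import deque


class PartitionLog:
    """Hypothetical in-memory partition holding (offset, timestamp, event) records."""

    def __init__(self):
        self.records = deque()
        self.next_offset = 0

    def append(self, timestamp: float, event: str) -> int:
        offset = self.next_offset
        self.records.append((offset, timestamp, event))
        self.next_offset += 1
        return offset

    def expire(self, now: float, retention_seconds: float) -> None:
        # Drop records older than the retention window; offsets are never reused.
        while self.records and now - self.records[0][1] > retention_seconds:
            self.records.popleft()

    def start_offset(self) -> int:
        # Oldest offset still readable (equals next_offset when the log is empty).
        return self.records[0][0] if self.records else self.next_offset


log = PartitionLog()
for t, event in [(0, "a"), (10, "b"), (20, "c"), (30, "d")]:
    log.append(t, event)

# Committed offset 1 means: offset 0 is processed, resume reading at offset 1.
committed = 1

log.expire(now=35, retention_seconds=20)  # records with timestamp < 15 expire
resume_from = max(committed, log.start_offset())

print(log.start_offset(), resume_from)
# prints: 2 2
```

The surviving records keep offsets 2 and 3; nothing is renumbered. The lagging consumer never sees offset 1, because retention removed it before the consumer got there.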

The picture

        producers (1+)
              │ send to topic 'orders', key=order_id
   ┌───────────────────────────────────────────┐
   │ Topic: orders                              │
   │                                            │
   │  Partition 0:  [e][e][e][e][e][e][e][e]   │
   │  Partition 1:  [e][e][e][e][e]            │
   │  Partition 2:  [e][e][e][e][e][e][e]      │
   │  Partition 3:  [e][e][e]                  │
   │                                            │
   │  (each partition replicated across 3 brokers)
   └────────┬──────────┬──────────┬──────┬─────┘
            │          │          │      │
       Consumer group A (3 consumers)
            │          │          │      │
            ▼          ▼          ▼      ▼
        consumer 1  consumer 2  consumer 3  (consumer 3 owns 2 partitions)

       Consumer group B (independent — reads same partitions)
            │          │          │      │
            ▼          ▼          ▼      ▼
        consumer 1  consumer 2  consumer 3  consumer 4

Two consumer groups read the same topic. Within a group, each partition is assigned to one consumer. Across groups, the same partition can be read by multiple consumers independently — each group has its own offsets.
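
The assignment rule in the picture can be sketched as follows. The round-robin strategy is an illustrative stand-in, not one of Kafka's actual assignors, so which consumer ends up with the extra partition may differ from the diagram; the consumer names and committed offsets are made up.

```python
def assign(partitions, consumers):
    """Round-robin partitions over consumers; each partition gets exactly one owner."""
    ownership = {consumer: [] for consumer in consumers}
    for i, partition in enumerate(partitions):
        owner = consumers[i % len(consumers)]
        ownership[owner].append(partition)
    return ownership


partitions = [0, 1, 2, 3]

group_a = assign(partitions, ["a1", "a2", "a3"])
group_b = assign(partitions, ["b1", "b2", "b3", "b4"])

# Offsets are tracked per (group, partition), so the groups never interfere.
committed = {
    ("A", 0): 8, ("A", 1): 5,
    ("B", 0): 2, ("B", 1): 5,
}

print(group_a)  # {'a1': [0, 3], 'a2': [1], 'a3': [2]}
print(group_b)  # {'b1': [0], 'b2': [1], 'b3': [2], 'b4': [3]}
```

One consequence of the one-owner-per-partition rule: a group with more consumers than partitions leaves the extra consumers idle, for example `assign([0, 1], ["x", "y", "z"])` gives `z` nothing.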


Two facts that surprise new Kafka users

Kafka does not push to consumers; consumers pull. Each consumer asks the broker "what's at offset X+1?" The broker responds. No callbacks, no push. The consumer is in control of pace and position.
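
A pull loop might look like the sketch below. `Broker` and `consume` are hypothetical in-memory stand-ins, not a real client API; the point is only that the consumer chooses the starting offset and the batch size, and the broker does nothing until asked.

```python
class Broker:
    """Hypothetical single-partition broker that only answers fetch requests."""

    def __init__(self, events):
        self.log = list(events)

    def fetch(self, offset: int, max_records: int):
        # Return up to max_records starting at offset; never calls the consumer back.
        return self.log[offset:offset + max_records]


def consume(broker: Broker, start_offset: int, batch_size: int):
    """Pull until caught up; the consumer decides position and batch size."""
    position = start_offset
    processed = []
    while True:
        batch = broker.fetch(position, batch_size)
        if not batch:
            break  # caught up; a real consumer would keep polling
        processed.extend(batch)
        position += len(batch)
    return position, processed


broker = Broker(f"event-{i}" for i in range(7))
position, processed = consume(broker, start_offset=2, batch_size=3)

print(position, processed)
# prints: 7 ['event-2', 'event-3', 'event-4', 'event-5', 'event-6']
```

A slow consumer simply calls `fetch` less often or with a smaller batch; the broker's log is unaffected either way.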

The partition is the unit of order, not the topic. Events within a partition are strictly ordered. Events across partitions have no ordering relationship. If a customer's events must be ordered, all of that customer's events must go to the same partition. The usual way is to set the message key to the customer ID: the producer hashes the key to choose a partition, so equal keys always land on the same partition.
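
Key-based routing can be sketched as below. The hash is an assumption made for illustration: the sketch uses CRC32 from Python's standard library, whereas Kafka's Java producer hashes keys with murmur2, so the partition numbers will not match a real cluster. The customer keys and events are invented.

```python
import zlib


def partition_for(key: str, num_partitions: int) -> int:
    """Map a key to a partition with a stable hash (CRC32 here, as a stand-in)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


NUM_PARTITIONS = 4
events = [
    ("customer-42", "cart created"),
    ("customer-17", "cart created"),
    ("customer-42", "item added"),
    ("customer-42", "checkout"),
]

partitions = {p: [] for p in range(NUM_PARTITIONS)}
for key, event in events:
    partitions[partition_for(key, NUM_PARTITIONS)].append((key, event))

p42 = partition_for("customer-42", NUM_PARTITIONS)
print([event for key, event in partitions[p42] if key == "customer-42"])
# prints: ['cart created', 'item added', 'checkout']
```

Because the partition is `hash(key) % num_partitions`, changing the partition count changes where keys land. Per-key ordering holds only as long as the mapping stays stable.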


When to use Kafka (and when not)

Kafka is right for:

  • Event-driven architectures with multiple consumers.
  • High-throughput data pipelines (millions of events per second).
  • Workloads needing replay (debugging, new consumer onboarding, audit).
  • Stream processing — Flink, Kafka Streams, ksqlDB, Spark Structured Streaming.
  • Cross-team event distribution where producers don't know consumers.

Kafka is wrong for:

  • Request-response (use RPC).
  • Small workloads where simplicity matters (use SQS or Redis pub/sub).
  • Strict point-to-point delivery (use a queue, not a log).
  • Workloads needing ad-hoc filtering or queries (use a database).

The cost of operating Kafka is real: brokers, a metadata quorum (KRaft, or ZooKeeper on older clusters), replication, monitoring. For workloads under ~100K events/day, a managed queue is usually the right choice.
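
To put the ~100K events/day threshold in perspective, a quick back-of-the-envelope calculation helps. The comparison figure of one million events per second is taken from the "millions of events per second" bullet above and is only an order-of-magnitude reference.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86_400

events_per_day = 100_000
average_rate = events_per_day / SECONDS_PER_DAY

# Order-of-magnitude reference for a high-throughput Kafka pipeline.
pipeline_rate = 1_000_000
ratio = pipeline_rate / average_rate

print(f"{average_rate:.2f} events/second on average")
print(f"about {ratio:,.0f}x below a 1M events/second pipeline")
# prints:
# 1.16 events/second on average
# about 864,000x below a 1M events/second pipeline
```

At roughly one event per second on average, the throughput that justifies running a cluster is nowhere in sight, which is why the simpler option usually wins.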


What this module covers

  1. 01-partitions-log-replicas-internals.md — How partitions, the log structure, and replication actually work; the model that produces Kafka's throughput and durability.
  2. 02-producer-consumer-day-to-day.md — Writing producers and consumers, choosing keys, consumer groups, rebalancing, the patterns developers write daily.
  3. 03-rebalance-retention-prod-gotchas.md — Rebalancing storms, retention surprises, schema evolution, monitoring, the production catalogue.

Bridge. Before writing producers and consumers, we open the partition and see what makes Kafka different from a queue. → 01-partitions-log-replicas-internals.md