01. Partitions, log structure, replication: what Kafka actually does
~12 min read. One event sent to a Kafka topic looks like one append. It is — but the append goes to a specific partition, gets replicated to followers, and lands on disk in a particular log segment. Understanding the layers is the difference between a senior interview pass and "I've used Kafka."
Builds on: 00-eli5.md.
The diary picture is enough to start. To debug "why did this consumer fall behind?" or "why did this partition lag?" you need the model.
1) The topic-partition-segment layout
A topic is logical. A partition is physical. A segment is what's actually on disk.
topic 'orders' (logical)
├── partition 0 (physical, on broker 1 as leader, replicas on brokers 2 and 3)
│ ├── 00000000000000000000.log (segment 0, offsets 0 to N)
│ ├── 00000000000000000000.index (offset → byte offset)
│ ├── 00000000000000005000.log (segment 1, offsets 5000 to M)
│ ├── 00000000000000005000.index
│ └── ...
├── partition 1 (on broker 2 as leader, replicas on 1 and 3)
├── partition 2 (on broker 3 as leader, replicas on 1 and 2)
└── partition 3 (on broker 1 as leader, replicas on 2 and 3)
Each partition is a directory on a broker. The directory contains numbered .log files (segments) and .index files (for fast offset lookup).
Appends go to the active (latest) segment. When the segment hits a size threshold (default 1 GB), Kafka closes it and opens a new one. Older segments age out based on retention policy.
The naming convention: each log/index file is named after the first offset in that segment. The file 00000000000000005000.log starts at offset 5000.
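The naming convention makes it cheap to find the segment that holds a given offset: it is the last segment whose base offset is less than or equal to the target. A minimal sketch of that lookup, assuming the base offsets are already known and sorted (the function name is hypothetical, and the sketch does not check the offset against the end of the log):

```python
from bisect import bisect_right

def segment_for_offset(base_offsets, offset):
    """Return the .log file name of the segment that holds `offset`.

    `base_offsets` is the sorted list of first offsets, one per segment.
    """
    if not base_offsets or offset < base_offsets[0]:
        raise ValueError("offset is older than the earliest retained segment")
    # The right segment is the last one whose base offset is <= offset.
    idx = bisect_right(base_offsets, offset) - 1
    # File names are the base offset, zero-padded to 20 digits.
    return f"{base_offsets[idx]:020d}.log"

print(segment_for_offset([0, 5000], 4999))  # first segment
print(segment_for_offset([0, 5000], 5000))  # second segment
```

Within the chosen segment, the .index file plays the same role one level down, mapping an offset to a byte position.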
2) The producer path: what producer.send() actually does
What happens, layer by layer:
- Partition assignment. The producer's partitioner hashes the key (murmur2 by default) and takes the result modulo the number of partitions. The same key always lands on the same partition, as long as the partition count does not change.
- Batching. The producer doesn't send immediately. It buffers messages per partition until a batch reaches batch.size bytes or linger.ms milliseconds have passed (default 0, meaning no deliberate wait, in clients before Kafka 4.0). Larger batches mean better throughput at slightly higher latency.
- Send to broker. The producer knows which broker leads that partition (from cached metadata) and sends the batch to it.
- Broker writes. The leader appends to its local log file. The append lands in memory (page cache); it is not yet fsync'd.
- Replication. Followers fetch the new data from the leader and write it to their local logs.
- Ack. Depending on the acks setting:
  - acks=0: the broker doesn't ack; the producer fires and forgets.
  - acks=1: the leader acks after writing to its local log. Followers may not have replicated yet.
  - acks=all (or -1): the leader waits until all in-sync replicas have written before acking. Strongest guarantee.
The acks=all setting + min.insync.replicas >= 2 is the durability standard. Less than that, and a broker failure can lose recently-written messages.
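The first two steps of the producer path (pick a partition from the key, then buffer per partition) can be sketched in a few lines. This is an illustration only: zlib.crc32 stands in for Kafka's murmur2, so the partition numbers will not match a real cluster, and the function names are hypothetical.

```python
import zlib
from collections import defaultdict

NUM_PARTITIONS = 4  # assumption: a four-partition topic, as in the layout above

def choose_partition(key: bytes, num_partitions: int = NUM_PARTITIONS) -> int:
    # Stand-in hash: zlib.crc32, NOT Kafka's murmur2. Only the shape
    # (hash the key bytes, then take the modulo) mirrors the real partitioner.
    return zlib.crc32(key) % num_partitions

def batch_by_partition(records):
    """Group (key, value) pairs into per-partition batches, preserving send order."""
    batches = defaultdict(list)
    for key, value in records:
        batches[choose_partition(key)].append((key, value))
    return dict(batches)

records = [
    (b"customer-42", b"order-1"),
    (b"customer-7", b"order-2"),
    (b"customer-42", b"order-3"),
]
batches = batch_by_partition(records)
for partition, batch in sorted(batches.items()):
    print(partition, batch)
```

Both customer-42 records end up in the same batch, in send order, which is what gives per-key ordering.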
3) The consumer path: what consumer.poll() actually does
What happens:
- Group coordinator. On startup, the consumer registers with the group coordinator (a broker chosen by hashing the group ID). The coordinator tracks group membership.
- Partition assignment. Each partition is assigned to exactly one consumer in the group (the rebalance). In the classic protocol, one consumer acting as group leader computes the assignment and the coordinator distributes it. Assignment strategies: range, round-robin, sticky, cooperative-sticky.
- Fetch. The consumer asks each leader broker for messages from its assigned partitions, starting at the current committed offset.
- Process. The consumer processes the messages in its main loop.
- Commit. The consumer tells the coordinator "I'm now at offset X." The coordinator stores this in a special internal topic, __consumer_offsets.
If the consumer crashes after processing but before commit, the message will be re-delivered on restart. Kafka is at-least-once by default. Application code must be idempotent.
For at-most-once: commit before processing (you lose messages on crash but never duplicate).
For "exactly-once" (Kafka's transactional guarantee): use Kafka transactions. The producer writes and the consumer commits within a transaction; either both happen or neither. Useful for Kafka-to-Kafka pipelines. For Kafka-to-external (database, API), still need idempotency.
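The at-least-once behaviour is easy to see in a small simulation. The sketch below uses a plain list as the partition and a hypothetical run_consumer helper; it models a crash after processing but before the commit, then a restart from the last committed offset.

```python
def run_consumer(log, committed, handler, crash_after_processing=None):
    """Process log[committed:] one message at a time, committing after each.

    If `crash_after_processing` equals an offset, the consumer "crashes"
    after handling that message but before committing it.
    Returns the committed offset, i.e. the next offset to read.
    """
    for offset in range(committed, len(log)):
        handler(log[offset])
        if offset == crash_after_processing:
            return committed  # crash: the commit for this offset never happens
        committed = offset + 1
    return committed

log = ["order-1", "order-2", "order-3"]
seen = []        # every delivery, duplicates included
applied = set()  # what an idempotent handler ends up applying

def handler(msg):
    seen.append(msg)
    applied.add(msg)  # applying the same message twice is a no-op

committed = run_consumer(log, 0, handler, crash_after_processing=1)
committed = run_consumer(log, committed, handler)  # restart after the crash

print(seen)     # order-2 is delivered twice
print(applied)  # but applied once
```

Moving the commit above the handler call would turn the same loop into at-most-once: the crash would skip order-2 instead of repeating it.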
4) Replication and the ISR
Each partition has a leader and N-1 followers (where N is the replication factor, typically 3). Producers always write to the leader, and consumers read from it by default. Followers fetch from the leader continuously.
ISR: in-sync replicas. A follower is "in sync" if it has caught up to the end of the leader's log at some point within the last replica.lag.time.max.ms (default 30 seconds). Followers that stay behind for longer than that are removed from the ISR.
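A simplified model of the ISR rule, assuming each follower records the last time it was fully caught up. The function and its arguments are hypothetical; the real broker tracks this internally.

```python
REPLICA_LAG_TIME_MAX_MS = 30_000  # mirrors the default quoted above

def in_sync_replicas(leader_id, last_caught_up_ms, now_ms,
                     max_lag_ms=REPLICA_LAG_TIME_MAX_MS):
    """Return the ISR: the leader plus every follower that caught up recently enough.

    `last_caught_up_ms` maps follower broker id -> timestamp (ms) at which the
    follower was last fully caught up with the leader.
    """
    isr = {leader_id}
    for broker_id, caught_up_at in last_caught_up_ms.items():
        if now_ms - caught_up_at <= max_lag_ms:
            isr.add(broker_id)
    return isr

# Broker 2 caught up 5 s ago; broker 3 has been behind for 40 s.
isr = in_sync_replicas(1, {2: 95_000, 3: 60_000}, now_ms=100_000)
print(sorted(isr))
```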
Leader election. If the leader fails, the controller elects a new leader from the ISR. The new leader has all committed messages, because anything acked with acks=all has by definition been written to every replica in the ISR.
min.insync.replicas. A topic-level setting: the minimum ISR size for acks=all to succeed. If min.insync.replicas=2 and only the leader is in sync, producers with acks=all get errors (NotEnoughReplicasException). This trades availability for durability — you'd rather block writes than risk loss.
The standard production config:
- Replication factor 3.
- min.insync.replicas=2.
- acks=all on the producer.
This survives one broker failure with no data loss.
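How acks and min.insync.replicas interact can be written as a small decision function. This is a simplified model with a hypothetical name: it only captures whether the leader accepts the write, not the timing of the ack.

```python
def leader_accepts_write(acks, isr_size, min_insync_replicas):
    """Return True if the leader accepts a produce request in this simplified model."""
    if acks in (0, 1):
        # min.insync.replicas is only enforced for acks=all.
        return True
    if acks == "all":
        # Too few in-sync replicas: the producer gets NotEnoughReplicasException.
        return isr_size >= min_insync_replicas
    raise ValueError(f"unsupported acks value: {acks!r}")

print(leader_accepts_write("all", isr_size=3, min_insync_replicas=2))  # healthy
print(leader_accepts_write("all", isr_size=2, min_insync_replicas=2))  # one broker down
print(leader_accepts_write("all", isr_size=1, min_insync_replicas=2))  # two brokers down
```

With replication factor 3, the middle case is the one the standard config is designed for: one broker is gone and writes still succeed with two copies.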
5) Page cache and the OS
Kafka's throughput comes largely from the OS page cache.
Writes go to memory (page cache), not disk, immediately. The OS flushes to disk asynchronously. Reads almost always come from page cache (recent data) or sequential disk read (older data) — both are fast.
This is why Kafka brokers benefit from large RAM machines: more page cache means more reads served without disk I/O.
Kafka does not rely on fsync on every write. Durability comes from replication — if the leader's page cache dies before flushing, the followers have the data. With acks=all and min.insync.replicas=2, two brokers have to fail simultaneously to lose data.
For very strict durability (regulated workloads), flush.messages=1 forces fsync per message — at significant throughput cost. Most workloads accept the replication-based durability and skip fsync per message.
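The difference between "written to the page cache" and "fsync'd" is visible from any language. A minimal sketch using a temporary file (the append_record helper is hypothetical and is not how the broker is implemented):

```python
import os
import tempfile

def append_record(path, record: bytes, force_fsync: bool = False):
    """Append one record; optionally force it to stable storage."""
    with open(path, "ab") as f:
        f.write(record)
        f.flush()                 # hand the bytes to the OS (page cache)
        if force_fsync:
            os.fsync(f.fileno())  # block until the OS reports the data is on disk

with tempfile.TemporaryDirectory() as d:
    segment = os.path.join(d, f"{0:020d}.log")
    append_record(segment, b"fast-path\n")                       # page cache only
    append_record(segment, b"durable-path\n", force_fsync=True)  # like flush.messages=1
    with open(segment, "rb") as f:
        contents = f.read()

print(contents)
```

Both appends are readable immediately; the only difference is whether the call waits for the disk, and that wait is where the throughput cost comes from.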
6) Log compaction vs. time-based retention
Two retention modes per topic:
Time-based retention (default). Segments older than retention.ms are deleted. Set the retention to suit the consumers' replay needs — 1 day, 7 days, 30 days.
Log compaction (cleanup.policy=compact). Kafka keeps the latest message per key forever. Older messages with the same key are removed. Useful for state stores (e.g., "current price per product"). The topic acts as a key-value store you can replay.
Compaction is a background process; older messages are removed in batches. It's not synchronous.
For audit / event sourcing: time-based retention, long horizon.
For state distribution: log compaction.
You can combine both with cleanup.policy=compact,delete: keep the latest message per key, and also delete data older than the retention period.
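What compaction leaves behind can be sketched as a pure function over a list of records. This is a simplified model with hypothetical SKU data; the real cleaner works segment by segment in the background.

```python
def compact(log):
    """Keep only the latest record per key, preserving offset order.

    `log` is a list of (offset, key, value) tuples in offset order.
    Surviving records keep their original offsets.
    """
    latest_offset = {}
    for offset, key, _ in log:
        latest_offset[key] = offset  # later records overwrite earlier ones
    return [rec for rec in log if latest_offset[rec[1]] == rec[0]]

log = [
    (0, "sku-1", 100),
    (1, "sku-2", 250),
    (2, "sku-1", 110),
    (3, "sku-1", 120),
]
compacted = compact(log)
print(compacted)
```

The result has gaps in the offset sequence, which is expected: consumers of a compacted topic must not assume offsets are contiguous.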
7) Offsets, committing, and __consumer_offsets
Consumer offsets are stored in a special internal topic: __consumer_offsets. The topic has 50 partitions by default; each consumer group's offsets are hashed to one partition.
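The group-to-partition mapping can be illustrated as below. This sketch assumes the broker hashes the group ID with Java's String.hashCode() and takes the absolute value modulo the partition count; treat that as an assumption about broker internals rather than a documented contract. The function names are hypothetical.

```python
def java_string_hashcode(s: str) -> int:
    """Replicate Java's String.hashCode() as a signed 32-bit integer.

    Accurate for characters in the Basic Multilingual Plane.
    """
    h = 0
    for ch in s:
        h = (31 * h + ord(ch)) & 0xFFFFFFFF
    return h - 0x100000000 if h >= 0x80000000 else h

def offsets_partition_for(group_id: str, num_partitions: int = 50) -> int:
    h = java_string_hashcode(group_id)
    # Assumed behaviour: absolute value, with the minimum int mapped to 0.
    non_negative = 0 if h == -2**31 else abs(h)
    return non_negative % num_partitions

print(offsets_partition_for("order-processor"))
print(offsets_partition_for("analytics"))
```

The practical consequence: the broker leading that partition of __consumer_offsets acts as the group's coordinator, which ties this mapping back to the coordinator lookup in the consumer path.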
Commits can be:
- Auto-commit. enable.auto.commit=true. Every auto.commit.interval.ms (default 5000 ms), the consumer commits its current position. Simple, but a crash between commits means everything processed since the last commit is delivered again.
- Manual commit. The consumer calls commit() after processing. More control; you commit only when work is done.
The standard production pattern:
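from kafka import KafkaConsumer  # assumes the kafka-python client

consumer = KafkaConsumer(
    'orders',
    group_id='order-processor',
    enable_auto_commit=False,      # commit manually, after processing
    auto_offset_reset='earliest',
)
for message in consumer:
    process(message)   # process() is the application's own handler
    consumer.commit()  # commit only once the work is done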
auto_offset_reset='earliest' — if no committed offset (new group), start from the beginning. 'latest' skips old data.
8) Rebalancing: the worst-case behaviour
When a consumer joins or leaves a group, the coordinator triggers a rebalance. Under the classic (eager) protocol, all consumers in the group pause processing while partitions are reassigned.
The classic rebalance (eager): stop everyone, reassign all partitions, resume. Painful — consumer downtime proportional to group size.
Modern rebalance (cooperative-sticky, available since Kafka 2.4): only the partitions that need to move are reassigned; the rest keep processing. Much less disruptive.
Even with cooperative rebalance, frequent rebalances are a problem. Causes: consumer crashes, scaling events, network issues that delay heartbeats. The defence is stable consumers: long-lived processes, a session timeout (session.timeout.ms) generous enough to absorb brief network delays, and resilience to transient errors.
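The cost difference between the two protocols can be counted directly. The sketch below is a toy model, not the real assignor: it compares how many partitions are revoked when a third consumer joins a two-consumer group. All names are hypothetical.

```python
def eager_revocations(assignment):
    """Eager protocol: every consumer gives up every partition before reassignment."""
    return sum(len(parts) for parts in assignment.values())

def cooperative_join(assignment, new_member):
    """Move only as many partitions as the new member needs for a fair share.

    Returns (new_assignment, number_of_partitions_moved).
    """
    new_assignment = {c: list(parts) for c, parts in assignment.items()}
    new_assignment[new_member] = []
    total = sum(len(parts) for parts in assignment.values())
    target = total // len(new_assignment)  # fair share for the newcomer
    moved = 0
    while len(new_assignment[new_member]) < target:
        # Take one partition from whichever existing consumer holds the most.
        donor = max(assignment, key=lambda c: len(new_assignment[c]))
        new_assignment[new_member].append(new_assignment[donor].pop())
        moved += 1
    return new_assignment, moved

before = {"c1": [0, 1, 2], "c2": [3, 4, 5]}
after, moved = cooperative_join(before, "c3")
print("eager revokes:", eager_revocations(before))
print("cooperative moves:", moved, after)
```

In this model, four of the six partitions never stop being processed under the cooperative approach, while the eager approach pauses all six.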
9) The threaded example: one event end to end
Producer at app server in Bengaluru sends:
import json

# `producer` is an already-configured KafkaProducer (kafka-python assumed)
producer.send(
    'orders',
    key=b'customer-42',
    value=json.dumps({'order_id': 1234, 'amount': 5000}).encode(),
)
What happens:
T+0    Producer partitioner: hash('customer-42') % 4 = 2. Partition 2.
T+0.1  Producer buffers the message in the partition 2 batch.
T+5    Batch fills (or linger.ms expires); producer sends to the leader of partition 2 (broker 3).
T+5.5  Broker 3 appends to its partition 2 log. Append in page cache.
T+5.6  Broker 3 holds the ack until the followers (brokers 1 and 2) have fetched and written.
T+6    Followers fetch. Broker 1 writes. Broker 2 writes. Both report success.
T+7    Broker 3 acks the producer. Send succeeds.
T+8    A consumer in group 'order-processor' polls. The consumer assigned to
       partition 2 fetches from broker 3. Gets the new message.
T+9    Consumer processes. Calls commit(), which writes the offset to __consumer_offsets.
T+10   A consumer in group 'analytics' (independent group) also reads the same
       partition. It gets the same message at its own pace. The two groups
       don't interfere.
Three properties to notice:
- The producer doesn't know how many consumers exist.
- Multiple consumer groups read the same data independently.
- The message is on three brokers before the ack returns — durable through one broker failure.
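The second property, independent consumer groups, comes down to each group keeping its own offset into the same log. A toy in-memory model (the PartitionLog class is hypothetical, and it commits on read purely for brevity):

```python
class PartitionLog:
    """An append-only list plus one committed offset per consumer group."""

    def __init__(self):
        self.records = []
        self.group_offsets = {}

    def append(self, record):
        self.records.append(record)
        return len(self.records) - 1  # offset of the new record

    def poll(self, group_id, max_records=10):
        start = self.group_offsets.get(group_id, 0)
        batch = self.records[start:start + max_records]
        # Commit on read for brevity; a real consumer commits after processing.
        self.group_offsets[group_id] = start + len(batch)
        return batch

partition = PartitionLog()
for record in ["order-1", "order-2", "order-3"]:
    partition.append(record)

fast = partition.poll("order-processor")            # reads everything
slow_first = partition.poll("analytics", max_records=1)
slow_rest = partition.poll("analytics")             # catches up later
print(fast, slow_first, slow_rest)
```

Reading never removes anything from the log, so the slow group sees exactly the same records as the fast one.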
Operational signals
Healthy. ISR matches replication factor; consumer lag stable; producer throughput stable; no rebalancing storms; disk usage trending with retention setting.
First degrading metric. Consumer lag climbing in a group — consumers can't keep up with producers.
Misleading metric. Total topic throughput — masks per-partition skew (one hot partition can be the bottleneck).
Expert graph. Per-partition lag in each consumer group; ISR size per partition; under-replicated partition count.
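Per-partition lag is the log end offset minus the group's committed offset. The sketch below uses made-up numbers to show how a healthy-looking total can hide one hot partition; the function name and all offsets are hypothetical.

```python
def partition_lag(log_end_offsets, committed_offsets):
    """Lag per partition = log end offset minus the group's committed offset."""
    return {
        p: log_end_offsets[p] - committed_offsets.get(p, 0)
        for p in log_end_offsets
    }

# Hypothetical offsets for a four-partition topic and one consumer group.
log_end = {0: 10_000, 1: 10_200, 2: 48_000, 3: 9_900}
committed = {0: 9_990, 1: 10_180, 2: 12_000, 3: 9_895}

lag = partition_lag(log_end, committed)
hot = max(lag, key=lag.get)
print(lag)
print("hot partition:", hot, "share of total lag:", lag[hot] / sum(lag.values()))
```

Here nearly all of the lag sits on partition 2, which a topic-level total would not reveal.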
Where this appears in production
- LinkedIn (Kafka's origin): multi-trillion messages per day; the reference deployment.
- Netflix: Kafka for the Keystone pipeline; event distribution at petabyte scale.
- Uber: Kafka for ride events, surge calculation, real-time analytics.
- Pinterest: Kafka for activity streams and event-driven services.
- Goldman Sachs / financial institutions: Kafka with log compaction for state distribution; strict durability config.
- A Bengaluru fintech: Kafka for payment events; replication factor 3, acks=all, multi-region replication.
- A Mumbai logistics platform: Kafka topic per region with cooperative-sticky assignment.
- A Pune analytics SaaS: log-compacted topics for tenant configuration state.
Recall / checkpoint
- What is the difference between a topic and a partition?
- What does acks=all mean and what does it trade off?
- What is the ISR and how is it maintained?
- What is the difference between time-based retention and log compaction?
- How are consumer offsets stored?
- What is rebalancing and how does cooperative-sticky reduce its cost?
- Why does Kafka's throughput depend on the OS page cache?
Interview Q&A
Q1. A producer sets acks=1. Walk through the durability risk.
With acks=1, the leader acks after writing to its local log; followers may not have replicated. If the leader fails after the ack but before followers fetch the new data, the message is lost. The fix: acks=all with min.insync.replicas>=2. This requires two brokers to have the data before acking, so one broker failure is safe. The trade-off is latency (waiting for replication) and availability (if the ISR drops below min.insync.replicas, writes fail). For payments, regulated, or audit workloads, acks=all is non-negotiable. Common wrong answer to avoid: "acks=1 is good enough." It was the producer default before Kafka 3.0 (the default is now acks=all), but it is not acceptable for durability-critical workloads.
Q2. The team's consumer group has high lag on one partition but not others. Walk through the diagnosis.
Hot-partition skew. The message key is mapped to a partition by hash; if one key dominates (a single popular customer, a popular product, or a poor key choice), all that key's messages land on one partition. That partition can't be parallelised within the group. Fix: improve the key (compound key, include random suffix where order isn't critical); add more partitions (only helps if keys are diverse); or accept the skew and scale the slow consumer. Common wrong answer to avoid: "add consumers." The partition is the unit of parallelism; more consumers without more partitions don't help.
Q3. The consumer commits before processing. Walk through the failure mode.
At-most-once: if the consumer commits, then crashes before processing, the message is lost, because restart picks up at the committed offset, which is past the failure point. For most workloads, this is wrong; at-least-once (commit after process) plus idempotent handlers is the standard. At-most-once is appropriate only for workloads where loss is acceptable (some metrics, debug logs). Common wrong answer to avoid: "commit timing doesn't matter." It determines the delivery guarantee.
Q4. Rebalancing in the consumer group takes 60 seconds; processing pauses. Walk through the fix.
Use the cooperative-sticky assignment strategy. With the default eager strategy, all consumers pause and all partitions are reassigned. With cooperative-sticky, only the partitions that need to move are reassigned; consumers keep processing their existing partitions. Set partition.assignment.strategy to the cooperative-sticky assignor on every consumer in the group (CooperativeStickyAssignor in the Java client, 'cooperative-sticky' in librdkafka-based clients). Also investigate why rebalances are frequent (usually consumer crashes, scaling churn, or session timeouts) and stabilise those. Common wrong answer to avoid: "reduce session timeout." That produces more rebalances on transient network issues.
Q5. The team wants to add a new analytics consumer without affecting the existing pipeline. Walk through the pattern.
A new consumer group. Kafka's design: multiple consumer groups read the same topic independently. The new group sets auto.offset.reset='earliest' to replay history, or 'latest' for current-only. The existing group's offsets and processing are unchanged. The new consumer can fall behind, fail, or scale, and none of it affects the existing one. This is the structural benefit of Kafka over a queue. Common wrong answer to avoid: "fan out at the producer." Kafka does this for free via groups.
Q6. The team uses log compaction for a config topic but new consumers see stale config briefly. Walk through what's happening.
Compaction is asynchronous. The compactor runs in the background, periodically; between runs, the topic may have multiple versions per key. A new consumer reading from offset 0 sees the historical sequence, including outdated versions, before catching up to the current state. Mitigation: producers always emit the current state; consumers read all messages but apply only the latest per key (using an in-memory map). The pattern works because compaction is eventually consistent. Common wrong answer to avoid: "tune compaction frequency." That helps but doesn't eliminate the gap; in-memory state in consumers is the structural fix.
Operational memory
This chapter explained Kafka's internals: topics and partitions, the log structure, replication and ISR, page cache, retention modes, consumer offsets, and rebalancing. The important idea is that Kafka is a partitioned, replicated, append-only log — and most operational behaviour follows from those three properties.
You learned to follow a message from producer to consumer, choose acks for durability, configure replication and ISR, choose retention mode, commit offsets correctly, and use cooperative rebalancing. That solves the internals; day-to-day code patterns come next.
Carry this diagnostic forward: when Kafka behaves unexpectedly, ask which structural property is involved — partition assignment, replication, retention, offsets, or rebalance.
Remember:
- Topic = logical; partition = physical; segment = on-disk file.
- acks=all + min.insync.replicas=2 + replication factor 3 = production durability.
- Partition is the unit of order and parallelism.
- Page cache > fsync; replication is the durability mechanism.
- Multiple consumer groups read independently; that's Kafka's structural advantage.
Bridge. The model is clear. Day-to-day, you write producers, consumers, choose keys, manage rebalancing. The next chapter is the developer's surface. → 02-producer-consumer-day-to-day.md