03. Rebalance, retention, broker failure — the production surface¶

~11 min read. Kafka in development is forgiving. Kafka in production reveals rebalance storms, disk-full incidents, broker-failover edge cases, and a long tail of operational subtleties. The catalogue.

Builds on: 02-producer-consumer-day-to-day.md.

The previous chapters covered model and SDK. This is what production teaches.

1) Rebalance storms¶

A rebalance is the group coordinator reassigning partitions among consumers. A storm is repeated rebalances that prevent steady-state processing.

Causes:

max.poll.interval.ms too small. Consumer takes longer to process than the interval; coordinator thinks consumer is dead; rebalance.
Frequent consumer crashes. Each crash triggers rebalance.
Autoscaling churn. Scale-up adds members; scale-down removes; each is a rebalance.
Network blips. Heartbeats miss; consumer marked dead.

Defences:

Cooperative-sticky strategy (since Kafka 2.4). Only the moving partitions reassign; the rest keep processing.
max.poll.interval.ms large. Set to ~5× your worst-case batch time.
Stable consumer pods. Avoid OOM, GC pauses long enough to miss heartbeats.
Static group membership (group.instance.id). When set, consumers retain their identity across restarts — no rebalance on a restart if the consumer comes back within session.timeout.ms. Useful for stateful consumers (Kafka Streams, ksqlDB).

consumer = KafkaConsumer(
    'orders',
    group_id='order-processor',
    group_instance_id='order-processor-1',  # static membership
    session_timeout_ms=30000,
    ...
)

2) Retention surprises¶

Kafka retention is per-topic. Two settings:

retention.ms — time-based. Default 7 days.
retention.bytes — size-based. Default unlimited.

The disk fills up when:

Producers spike and retention.bytes is unlimited.
A consumer falls far behind; you can't shorten retention without losing data the lagging consumer needs.
Compaction is misconfigured and doesn't keep up.

Defences:

Set both retention.ms and retention.bytes. Belt and suspenders. The first to trip wins.
Monitor disk usage. Alert at 70%; page at 85%.
Alert on consumer lag. A consumer falling behind retention means imminent data loss for that consumer.

Disk-full on a broker is one of the worst Kafka incidents: the broker can't accept new appends; producers backpressure or fail; depending on min.insync.replicas, the topic may become unwritable.

3) Broker failure and partition reassignment¶

When a broker fails:

Leaders on that broker are re-elected from the ISR on remaining brokers. Should be near-instant.
Replicas on that broker are out of date when the broker comes back. They catch up by fetching from the leader.
Producers see brief errors; the client refreshes metadata and retries; resumes on the new leader.

If the broker is permanently lost, you decommission it; the cluster auto-rebalances replicas onto healthy brokers (or you run a manual reassignment).

The slow case: under-replicated partitions. If a broker is down for hours, the surviving replicas are over-loaded with fetch traffic from clients and with replication catchup when the broker returns. Monitor under_replicated_partitions metric; large values indicate cluster stress.

The data-loss case: unclean leader election. If unclean.leader.election.enable=true (not default), an out-of-sync replica can become leader on failure. This can lose committed data. Default false — better to be unavailable than lose data. Almost never enable.

4) Schema evolution failures¶

Schema registry compatibility rules prevent some evolution bugs. They don't prevent all of them.

Compatible at registry, broken at consumer. A new schema is backward-compatible (consumer can read old data), but the consumer's code expects fields that the producer might omit. Compatibility rules check schema; they don't check application logic.

Default values matter. When adding a field, set a default. Old data without the field deserialises to the default. Without a default, deserialisation fails — production incident waiting.

Removing fields. Removing a field is forward-incompatible: old consumers expect it. Coordinate removal: deploy consumers that ignore the field; then deploy producers that omit it; then schema migration.

Topic-level vs. subject-level compatibility. Schema registry's "subject" by default is topicname-value (one subject per topic). Multi-schema topics (different event types in one topic) need careful naming and tooling.

5) Producer idempotence vs. transactions¶

Producer idempotence (enable_idempotence=True) prevents duplicates on retry within a single producer session. The Kafka broker tracks producer ID + sequence number; duplicates are silently dropped.

Producer transactions (transactional.id + init_transactions()) span multiple sends and Kafka offset commits in one atomic unit. Use for:

Kafka-to-Kafka pipelines (read from topic A, transform, write to topic B atomically).
Exactly-once Kafka Streams jobs.

producer = KafkaProducer(
    transactional_id='order-transformer-1',
    enable_idempotence=True,
    acks='all',
)
producer.init_transactions()
producer.begin_transaction()
try:
    producer.send('output', value=transformed)
    producer.send_offsets_to_transaction({tp: offset}, consumer_group_id)
    producer.commit_transaction()
except Exception:
    producer.abort_transaction()

For Kafka-to-external (database, REST API), transactions don't help — the external system isn't part of Kafka's transaction. You still need application-level idempotency at the destination.

6) Monitoring — what to watch¶

Broker-level:

Under-replicated partitions count.
Offline partition count (catastrophic; should be 0).
Disk usage per broker.
Request latency per request type (produce, fetch, fetch-consumer).
Network throughput.

Topic-level:

Bytes-in/sec per topic.
Messages-in/sec per topic.
Bytes-out-rate / messages-out-rate.

Consumer-level:

Consumer lag per partition (max age, total messages behind).
Rebalance frequency.

Producer-level:

Send rate, send errors.
Batch size distribution.
Request latency.

Tools: Confluent Control Center (commercial), Kafka Manager / Cruise Control (open source), Prometheus JMX exporter for self-hosted, Datadog/New Relic Kafka integrations.

The single most useful metric: consumer lag per partition. Climbing lag means consumers can't keep up; left unchecked, lag exceeds retention and data is lost.

7) Capacity planning — sizing brokers and partitions¶

Brokers. Sized for IOPS (writes are random across topics) and network (replication amplifies traffic 3× for RF=3). Typical: 12-16 cores, 64-128 GB RAM, SSDs with high IOPS, 10+ Gbps NIC.

Partitions. More partitions = more parallelism, more metadata overhead. Per-broker partition count practical limit: ~4000. Per-cluster: 200K-500K. Most teams have far fewer.

Partition count per topic. Sized for peak parallelism: max concurrent consumers in the largest consumer group. If you expect 50 consumers in a group, you need at least 50 partitions. Add headroom (75-100 partitions) so you can scale consumers without re-partitioning.

Re-partitioning is painful. Kafka has no online re-partition. Changing partition count breaks ordering for existing keys (keys remap to different partitions). The fix is creating a new topic with the new count and migrating producers/consumers — a deliberate cutover.

8) Compaction subtleties¶

Log-compacted topics keep the latest message per key, forever. Useful for state distribution.

Subtleties:

Tombstones. To delete a key, produce a message with value=null. Compaction removes the tombstone after delete.retention.ms (default 24 hours) — giving consumers time to see it.
Compaction is asynchronous. Newly-written messages aren't compacted immediately; you may see multiple versions of the same key briefly.
Empty segments not compacted. The active segment isn't compacted; only closed segments are. So fast-changing keys may have many recent versions until segment rolls.
Compaction doesn't reduce storage of unique keys. If every key is unique, compaction does nothing; the topic grows linearly.

For state distribution, log compaction is correct. For event logs (audit, replay), time-based retention is correct. Compaction is not "save disk space"; it's "keep current state forever."

9) Alternatives — when Kafka is overkill¶

Kafka brings overhead: brokers, ZooKeeper/KRaft, schemas, partition planning, consumer groups, retention tuning. For small workloads:

SQS. Managed; simpler; sufficient for queues. No replay.
Redis Streams. Lightweight; in-process if Redis is already there.
RabbitMQ. Better for traditional queues; some pub/sub.
NATS / NATS JetStream. Lighter than Kafka; persistent streaming option.
Pub/Sub (GCP) / EventBridge (AWS). Managed event-distribution.

When to use Kafka:

Throughput > 100K events/sec.
Multiple consumer groups on the same data.
Replay / time travel required.
Stream processing required (Flink, Kafka Streams).
Cross-team event distribution at scale.

Below this, simpler tools win on operational cost.

10) The "Kafka is slow" diagnosis tree¶

Common complaint: "Kafka is slow." Walk through layers:

Producer side. Check request.timeout.ms, batch size, compression. If sends are timing out, broker may be overloaded.
Broker side. Check disk I/O (sluggish disk delays writes), network (saturated NIC), GC pauses (long GC = leader unavailability).
Replication. Under-replicated partitions slow producers with acks='all'. Check ISR size.
Consumer side. Slow processing → consumer lag → reads from older segments → page cache misses → slower reads.
Partition skew. One hot partition slows the whole consumer group.

Each layer has a measurement. Don't guess; instrument.

Operational signals¶

Healthy. Under-replicated partitions = 0; consumer lag stable; producer send latency p99 within target; rebalances rare and quick.

First degrading metric. Consumer lag climbing on one partition (hot key) or all partitions (consumer slow / unscaled).

Misleading metric. Aggregate cluster throughput — masks hot brokers, hot partitions, hot consumer groups.

Expert graph. Per-partition lag heatmap; under-replicated partitions count; rebalance rate.

Where this appears in production¶

LinkedIn — the reference deployment; thousands of brokers, hundreds of thousands of partitions.
Netflix — Cruise Control for cluster balance; KRaft mode for ZK-less operation.
Uber — schema registry + Avro; KRaft mode in transition.
Confluent Cloud / MSK / Aiven — managed Kafka services for teams that don't want to operate it.
A Bengaluru fintech — Kafka with acks='all', replication factor 3, min.insync.replicas=2; uptime > 99.99%.
A Mumbai SaaS — log-compacted topics for tenant config; tombstones for deletion.
A Pune analytics platform — MirrorMaker 2 for multi-region disaster recovery.
A Goa-based startup — moved off Kafka to SQS when realised event volume was 10K/day; operational savings substantial.

Recall / checkpoint¶

What is a rebalance storm and what is the cooperative-sticky defence?
What is static group membership and when does it help?
What goes wrong when a Kafka broker's disk fills up?
Why is unclean.leader.election.enable=false the production default?
What is the difference between idempotent producer and transactional producer?
How do tombstones work in compacted topics?
When is Kafka overkill?

Interview Q&A¶

Q1. The team's Kafka cluster is rebalancing every minute. Walk through the diagnosis. Likely consumers are crashing or being marked dead by the coordinator. Check consumer logs for "max poll interval exceeded" or "session timed out". Causes: long batch processing, GC pauses, network blips, autoscaling churn. Fixes: switch to cooperative-sticky assignment; raise max.poll.interval.ms to 5× worst-case batch; set group.instance.id for static membership; reduce autoscaling thrash. Static membership especially helps when consumers restart cleanly within session.timeout.ms. Common wrong answer to avoid: "scale consumers" — more consumers means more rebalance churn.

Q2. A Kafka broker is running out of disk. Walk through the response. Immediate: investigate which topics are largest; alert teams owning them. Short-term: lower retention on non-critical topics; delete topics no longer in use; expand disk if possible. Medium-term: enable both retention.ms and retention.bytes per topic; monitor disk and alert at 70%. Never expand by enabling unclean.leader.election to "free up" replicas — risks data loss. Adding a broker to the cluster and rebalancing is the safe expansion path. Common wrong answer to avoid: "delete old segments manually" — Kafka manages segments; manual deletion corrupts indexes.

Q3. A consumer falls behind retention. Walk through the loss and the fix. If the consumer's committed offset becomes older than the oldest retained offset, the data between them is lost — Kafka returns offset-out-of-range. The consumer typically resets to auto.offset.reset (earliest or latest); either way, data is lost. Prevention: monitor per-consumer-group lag; alert when lag approaches retention; scale or fix the consumer before crossing. Recovery: if loss is unacceptable, restore from upstream / backup / a replica system. Common wrong answer to avoid: "extend retention retroactively" — only helps for future messages; loss already happened.

Q4. A team enables unclean.leader.election. What is the risk? An out-of-sync replica can become leader on failure. That replica is missing messages that the ISR had. When clients resume against the new leader, those messages are absent — silent data loss. The defence (unclean.leader.election=false) prefers unavailability to data loss. Enable unclean only in workloads where availability is more important than completeness — rarely. Discuss the trade-off explicitly with stakeholders before enabling. Common wrong answer to avoid: "unclean is fine for availability" — depends on whether the application can tolerate silent data loss.

Q5. The team has a hot partition causing per-partition lag of 10×. Walk through the responses. Three options. (1) Revise the key. If customer_id is the key and one customer dominates, compose with a suffix that distributes (e.g., customer_id:1 to customer_id:10 for that one customer). Loses strict order across the suffixes; trades for balance. (2) Add more partitions. Helps if keys are diverse; doesn't help if one key is hot. (3) Accept the skew. Scale the slow consumer separately — multiple consumers in the group cannot share one partition, so the hot partition's consumer is the constraint. Per-partition scaling means giving that partition its own thread / process / pod. Common wrong answer to avoid: "just scale up" — partition is the unit of parallelism; you can't parallelise within a partition for a single consumer group.

Q6. A team uses Kafka for 5K events per day. Walk through the recommendation. Kafka is overkill at that volume. The operational cost — brokers, ZK/KRaft, schema registry, monitoring — dwarfs the value. Alternatives: SQS for queue semantics, Redis Streams for lightweight pub/sub, EventBridge for AWS event distribution, NATS for in-process simplicity. Migrate to a simpler tool; save the operational budget for problems that actually need Kafka. Migration cost is real but bounded; ongoing operations cost recovers it quickly. Common wrong answer to avoid: "keep Kafka, you'll grow into it" — speculative; pay the cost when the volume justifies.

Operational memory¶

This chapter explained Kafka's production surface: rebalance storms, retention surprises, broker failure, schema evolution at the application layer, idempotence vs. transactions, capacity planning, compaction subtleties, and when Kafka is overkill. The important idea is that Kafka is operationally rich; production maturity comes from explicit attention to each surface.

You learned to defend against rebalance storms, manage retention and disk, handle broker failures safely, evolve schemas without breaking consumers, plan capacity, configure compaction correctly, and recognise when to use a simpler tool. That completes the production surface.

Carry this diagnostic forward: when Kafka is suspected in production, ask which production layer is involved — rebalance, retention, replication, schema, partition, or compaction. Each has a structural fix.

Remember:

Cooperative-sticky + large max.poll.interval.ms + static membership = stable groups.
Both retention.ms and retention.bytes; monitor disk; alert on consumer lag.
unclean.leader.election=false for durability; rarely enable.
Compaction is for state distribution; time-based retention for event logs.
Re-partitioning is offline; size partition count with headroom.
Below ~100K events/sec, simpler tools usually win.

Bridge. Kafka covers the streaming-log surface. The next module — 09_aws_core — covers the broader AWS surface: IAM, VPC, EC2, S3, RDS, the core services every cloud-native infrastructure rests on. → ../09_aws_core/00-eli5.md