Kafka looks deceptively simple from the outside — you publish to a topic, someone reads from it. Under the hood it is a fairly intricate distributed system where several pieces have to agree on who owns what before a single byte gets delivered. I spent a good amount of time untangling this, and most articles stop at “partitions give you parallelism” without explaining the actual handshake. Let me go deeper.
What a Kafka Cluster Actually Is
A Kafka cluster is a group of brokers — ordinary JVM processes, each running on its own machine (or container). Every broker stores a slice of the data and knows about the rest of the cluster [1].
Topics are the logical unit you interact with. Physically, a topic is split into partitions, and each partition is spread across multiple brokers as replicas. One of those replicas is the leader — the only one that handles reads and writes for that partition. The rest are followers that silently replicate the leader’s data [1].
So when people say “Kafka scales horizontally,” what they mean is: you can have 100 partitions for a topic, and those 100 partitions can be spread across N brokers — each broker handling a different slice, each consumer in your app reading a different slice in parallel.
Who Coordinates the Cluster — ZooKeeper vs KRaft
This is where a lot of old articles confuse people. Historically, Kafka outsourced its cluster coordination to ZooKeeper. Every broker raced to create an ephemeral node in ZooKeeper; the first one to succeed became the Controller broker — the one responsible for assigning partition leaders across the cluster [2].
That architecture is gone. Since Kafka 3.3 it has been deprecated, and since Kafka 4.0 it is simply not supported [3]. The replacement is called KRaft (Kafka Raft).
KRaft: Kafka Runs Its Own Consensus
In KRaft mode, a subset of brokers (or dedicated nodes) act as controllers and form a Raft quorum. They elect a leader among themselves using the Raft consensus algorithm — majority vote, term numbers to prevent stale leaders, and log replication to keep the quorum consistent [4].
All cluster metadata — topic configs, partition assignments, broker registrations — lives in a single internal Kafka topic called __cluster_metadata. The active controller leader writes events to this log; the follower controllers replay them [3]. Brokers subscribe to this metadata stream and keep their local view updated.
Why does this matter? Because:
- Failover is faster — no external ZooKeeper coordination round-trip [2]
- The metadata log itself is replicated with Kafka’s own guarantees
- You deploy fewer moving parts
How Partition Leaders Are Elected
When a broker dies, someone has to take over its partition leaderships. That job belongs to the KRaft controller.
The controller uses the ISR list (In-Sync Replicas) to decide who can be the new leader [5]. ISR is the set of follower replicas that are fully caught up with the leader — lagging replicas are kicked out of ISR. So if a leader fails:
- The controller detects the broker is gone
- It looks at the ISR for each partition the dead broker led
- It picks one ISR member and promotes it to leader
- It writes that decision into
__cluster_metadata - All other brokers learn the new leader from the metadata stream
Only ISR members are eligible for leader election. This is important — it guarantees the new leader has all committed data and consumers see no gaps [5].
How a Consumer Actually Finds the Right Broker
Here is the part most people gloss over. A consumer doesn’t just magically connect to the right broker. There is a specific handshake.
Step 1 — Bootstrap (One-Time Cluster Discovery)
You configure bootstrap.servers with one or more broker addresses. This list is only for initial discovery, not for ongoing traffic [6].
The client picks any address from the list, opens a TCP connection, and sends a Metadata request. Any broker can answer this — it returns the full broker list and the partition-to-leader mapping for every topic the client cares about [6]. After that single request, the client has a complete map of the cluster and connects directly to the leaders it needs.
If your bootstrap broker dies an hour later, the client is already fine — it knows the rest of the cluster [6].
Step 2 — Finding the Group Coordinator
Consumers don’t just connect to any broker. Every consumer group is anchored to a specific Group Coordinator broker.
The mapping is deterministic: Kafka hashes the group.id string to one of the partitions of the internal __consumer_offsets topic, and the leader of that partition is the Group Coordinator [7]. All consumers with the same group.id end up talking to the same coordinator. Different groups are distributed across different coordinators.
The consumer sends a FindCoordinator request (containing its group.id) to any broker. That broker resolves the hash and replies with the coordinator’s host and port [7].
Step 3 — JoinGroup and Partition Assignment
Once every consumer in the group has found the coordinator, the rebalance protocol kicks in:
- Each consumer sends a
JoinGrouprequest to the coordinator - The coordinator designates one consumer as Group Leader — typically the first to join
- The coordinator sends the full member list to the Group Leader
- The Group Leader runs a partition assignment strategy locally and sends the result back
- The coordinator distributes the assignments to every member via
SyncGroupresponses [7][8]
The Group Leader is just a temporary logical role, not a permanent node. It changes every rebalance [7].
Partition Assignment Strategies
Kafka ships with a few built-in strategies, configurable via partition.assignment.strategy [8]:
| Strategy | How It Works | Best For |
|---|---|---|
| RangeAssignor | Divides partitions numerically across consumers | Simple, predictable splits per topic |
| RoundRobinAssignor | Circular distribution across consumers | Even distribution across topics |
| StickyAssignor | Keeps prior assignments, minimises movement on rebalance | Stateful consumers with warm caches |
| CooperativeStickyAssignor | Like Sticky but allows incremental rebalances (default in 3.0+) | Production — avoids stop-the-world rebalances |
CooperativeStickyAssignor is the default since Kafka 3.0 because it does incremental rebalances — only the partitions that actually need to move are revoked, not all of them at once [8].
What Consumers Can and Cannot Read
One subtle thing: consumers read from the partition leader, not followers [5]. So even if a follower replica is on a broker physically closer to your consumer, the read still goes to the leader.
There is one exception — Follower Fetching (introduced in Kafka 2.4), where a consumer can be configured to read from the nearest replica. But by default, leader reads only.
Consumers also cannot read data beyond the high-water mark — the latest offset that has been acknowledged by all ISR replicas [5]. This prevents consumers from reading data that could be rolled back if the leader dies before followers catch up.
What Happens When a Broker Dies Mid-Consume
Say a consumer is reading partition 1, and its leader broker crashes:
- The KRaft controller detects the broker failure
- It picks a new leader from the ISR for partition 1
- It updates
__cluster_metadata - The consumer’s next fetch request gets a
NOT_LEADER_OR_FOLLOWERerror from the old broker (or a timeout) - The consumer automatically refreshes its metadata and retries the fetch — now against the new leader [6]
The consumer doesn’t need manual intervention. The client library handles the retry and metadata refresh transparently. From your application code’s perspective, there might be a brief pause, but messages are not lost or duplicated (assuming proper offset commit settings).
The Full Picture in One Flow
- Bootstrap broker — only used once, to get a full cluster map
- Group Coordinator — handles join/sync/heartbeat for the consumer group
- Partition Leader — the actual broker the consumer fetches messages from
These can be three different brokers, or the same one. Kafka doesn’t care.
Why This Design Holds Up
The architecture is genuinely elegant once you see it whole. Metadata discovery is separated from coordination which is separated from data transfer. Each layer has a clear owner — KRaft for cluster state, Group Coordinator for consumer group state, Partition Leader for data.
Failures at any layer have defined fallback paths. A dead bootstrap broker is irrelevant after startup. A dead Group Coordinator triggers a FindCoordinator retry. A dead Partition Leader triggers a metadata refresh and reconnect.
Not simple to implement — but very clean to reason about as an operator.
End
Sources
- Kafka Topics, Partitions, and Brokers: Core Architecture — Conduktor
- ZooKeeper and Apache Kafka® Explained: From Legacy to KRaft — Confluent
- KRaft vs ZooKeeper — Apache Kafka Official Docs
- KRaft Explained: How Kafka Implements Raft Consensus — Medium
- Understanding In-Sync Replicas (ISR) in Apache Kafka — GeeksforGeeks
- What is a Kafka Bootstrap Server? — Confluent
- Apache Kafka® Internal Architecture — Consumer Group Protocol — Confluent Developer
- How to Implement Kafka Consumer Assignment Strategies — OneUptime