How Kafka Manages a Cluster and Routes Consumers Right

Kafka looks deceptively simple from the outside — you publish to a topic, someone reads from it. Under the hood it is a fairly intricate distributed system where several pieces have to agree on who owns what before a single byte gets delivered. I spent a good amount of time untangling this, and most articles stop at “partitions give you parallelism” without explaining the actual handshake. Let me go deeper.

What a Kafka Cluster Actually Is

A Kafka cluster is a group of brokers — ordinary JVM processes, each running on its own machine (or container). Every broker stores a slice of the data and knows about the rest of the cluster [1].

Topics are the logical unit you interact with. Physically, a topic is split into partitions, and each partition is copied to multiple brokers as replicas. One of those replicas is the leader, which by default handles all reads and writes for that partition. The rest are followers that silently replicate the leader’s data [1].

So when people say “Kafka scales horizontally,” what they mean is: you can have 100 partitions for a topic, and those 100 partitions can be spread across N brokers — each broker handling a different slice, each consumer in your app reading a different slice in parallel.

[Figure: Kafka cluster overview]

Who Coordinates the Cluster — ZooKeeper vs KRaft

This is where a lot of old articles confuse people. Historically, Kafka outsourced its cluster coordination to ZooKeeper. Every broker raced to create an ephemeral node in ZooKeeper; the first one to succeed became the Controller broker — the one responsible for assigning partition leaders across the cluster [2].

That architecture is gone. ZooKeeper mode was deprecated in Kafka 3.5 (KRaft itself was declared production-ready in 3.3), and Kafka 4.0 removed it entirely [3]. The replacement is called KRaft (Kafka Raft).

KRaft: Kafka Runs Its Own Consensus

In KRaft mode, a subset of brokers (or dedicated nodes) act as controllers and form a Raft quorum. They elect a leader among themselves using the Raft consensus algorithm — majority vote, term numbers to prevent stale leaders, and log replication to keep the quorum consistent [4].
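
To make the majority-vote and term-number ideas concrete, here is a rough Python sketch. It is not KRaft's implementation: the names ControllerNode and run_election are made up for illustration, and the sketch leaves out log comparison, timeouts, and networking entirely.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ControllerNode:
    """Hypothetical controller node; tracks only what voting needs."""
    node_id: int
    current_term: int = 0
    voted_for: Optional[int] = None

    def request_vote(self, candidate_id: int, term: int) -> bool:
        # A candidate with an older term is stale and always rejected.
        if term < self.current_term:
            return False
        # Seeing a newer term clears any vote cast in an earlier term.
        if term > self.current_term:
            self.current_term = term
            self.voted_for = None
        # At most one vote per term.
        if self.voted_for in (None, candidate_id):
            self.voted_for = candidate_id
            return True
        return False


def run_election(candidate: ControllerNode, quorum: List[ControllerNode]) -> bool:
    """Candidate bumps its term, asks every node (itself included), wins on a majority."""
    term = candidate.current_term + 1
    votes = sum(node.request_vote(candidate.node_id, term) for node in quorum)
    return votes >= len(quorum) // 2 + 1
```

With three controllers the majority is two, so the quorum keeps working with one node down. A candidate that shows up late with a term the others have already voted in simply loses.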

All cluster metadata — topic configs, partition assignments, broker registrations — lives in a single internal Kafka topic called __cluster_metadata. The active controller leader writes events to this log; the follower controllers replay them [3]. Brokers subscribe to this metadata stream and keep their local view updated.
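
The "leader writes, everyone else replays" idea is easier to see in code. Below is a minimal sketch of rebuilding cluster state from an ordered log. The record names and fields (register_broker, partition_change, and so on) are invented for this example; real KRaft records are versioned binary structures with different names.

```python
from typing import Dict, List, Tuple

# Hypothetical record shape: (record_type, payload).
MetadataRecord = Tuple[str, dict]


def replay(log: List[MetadataRecord]) -> Dict[str, dict]:
    """Rebuild an in-memory view of the cluster by applying records in order."""
    state: Dict[str, dict] = {"brokers": {}, "leaders": {}}
    for record_type, payload in log:
        if record_type == "register_broker":
            state["brokers"][payload["broker_id"]] = payload["host"]
        elif record_type == "unregister_broker":
            state["brokers"].pop(payload["broker_id"], None)
        elif record_type == "partition_change":
            key = (payload["topic"], payload["partition"])
            state["leaders"][key] = payload["leader"]
    return state
```

Because the state is a pure function of the log, any node that has replayed the same prefix of the log holds the same view of the cluster.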

Why does this matter? Because:

Failover is faster — no external ZooKeeper coordination round-trip [2]
The metadata log itself is replicated with Kafka’s own guarantees
You deploy fewer moving parts

How Partition Leaders Are Elected

When a broker dies, someone has to take over its partition leaderships. That job belongs to the KRaft controller.

The controller uses the ISR list (In-Sync Replicas) to decide who can be the new leader [5]. The ISR is the set of replicas, the leader included, that are fully caught up with the leader; replicas that lag for too long are removed from it. So if a leader fails:

The controller detects the broker is gone
It looks at the ISR for each partition the dead broker led
It picks one ISR member and promotes it to leader
It writes that decision into __cluster_metadata
All other brokers learn the new leader from the metadata stream

Only ISR members are eligible for leader election, unless unclean.leader.election.enable is turned on (it is off by default). This is important: it guarantees the new leader has all committed data, so consumers see no gaps [5].
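
The election steps above boil down to a small selection rule. Here is a sketch of it, assuming the controller walks the partition's replica list in order and takes the first replica that is both in the ISR and on a live broker. The function name elect_leader and the in-order walk are my assumptions for illustration, not a copy of the controller code.

```python
from typing import List, Optional, Set


def elect_leader(replicas: List[int], isr: Set[int], live_brokers: Set[int]) -> Optional[int]:
    """Return the first assigned replica that is both in the ISR and alive.

    Returning None models a partition that stays offline because no
    in-sync replica survived.
    """
    for broker_id in replicas:
        if broker_id in isr and broker_id in live_brokers:
            return broker_id
    return None
```

For a partition with replicas [1, 2, 3] and ISR {1, 3}, losing broker 1 promotes broker 3. Broker 2 is alive but out of sync, so it is skipped.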

How a Consumer Actually Finds the Right Broker

Here is the part most people gloss over. A consumer doesn’t just magically connect to the right broker. There is a specific handshake.

Step 1 — Bootstrap (One-Time Cluster Discovery)

You configure bootstrap.servers with one or more broker addresses. This list is only for initial discovery, not for ongoing traffic [6].

The client picks any address from the list, opens a TCP connection, and sends a Metadata request. Any broker can answer this — it returns the full broker list and the partition-to-leader mapping for every topic the client cares about [6]. After that single request, the client has a complete map of the cluster and connects directly to the leaders it needs.

If your bootstrap broker dies an hour later, the client is already fine — it knows the rest of the cluster [6].
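
To picture what the client holds after bootstrap, here is a small sketch that turns a metadata response into a partition-to-leader address map. The dictionary layout is a simplified stand-in I made up; the real Metadata response is a binary protocol message with many more fields.

```python
from typing import Dict, Tuple

# Hypothetical, simplified metadata response.
metadata_response = {
    "brokers": [
        {"id": 1, "host": "broker-1", "port": 9092},
        {"id": 2, "host": "broker-2", "port": 9092},
        {"id": 3, "host": "broker-3", "port": 9092},
    ],
    "partitions": [
        {"topic": "orders", "partition": 0, "leader": 2},
        {"topic": "orders", "partition": 1, "leader": 3},
        {"topic": "orders", "partition": 2, "leader": 1},
    ],
}


def build_leader_map(response: dict) -> Dict[Tuple[str, int], str]:
    """Map (topic, partition) to the 'host:port' of its current leader."""
    addresses = {b["id"]: f'{b["host"]}:{b["port"]}' for b in response["brokers"]}
    return {
        (p["topic"], p["partition"]): addresses[p["leader"]]
        for p in response["partitions"]
    }
```

Once this map exists, nothing in it points back at bootstrap.servers, which is why losing the bootstrap broker later does not matter.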

Step 2 — Finding the Group Coordinator

Consumers don’t just connect to any broker. Every consumer group is anchored to a specific Group Coordinator broker.

The mapping is deterministic: Kafka hashes the group.id string to one of the partitions of the internal __consumer_offsets topic, and the leader of that partition is the Group Coordinator [7]. All consumers with the same group.id end up talking to the same coordinator. Different groups are distributed across different coordinators.

The consumer sends a FindCoordinator request (containing its group.id) to any broker. That broker resolves the hash and replies with the coordinator’s host and port [7].
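
The hash itself is simple enough to reproduce. The sketch below assumes the broker takes the absolute value of Java's String.hashCode() of the group.id, modulo the partition count of __consumer_offsets (50 is the default for offsets.topic.num.partitions). That formula is my understanding of the broker code rather than something the sources above spell out, so treat it as an illustration.

```python
def java_string_hash(s: str) -> int:
    """Reproduce Java's String.hashCode() as a signed 32-bit integer.

    Assumes characters in the Basic Multilingual Plane, where Python code
    points and Java UTF-16 code units coincide.
    """
    h = 0
    for ch in s:
        h = (31 * h + ord(ch)) & 0xFFFFFFFF
    return h - 0x100000000 if h >= 0x80000000 else h


def coordinator_partition(group_id: str, offsets_partitions: int = 50) -> int:
    """Pick the __consumer_offsets partition whose leader coordinates the group."""
    h = java_string_hash(group_id)
    # Integer.MIN_VALUE has no positive counterpart, so it is treated as 0.
    non_negative = 0 if h == -0x80000000 else abs(h)
    return non_negative % offsets_partitions
```

The practical consequence: the coordinator is a function of the group.id and the current leader of one partition, so it moves when that partition's leader moves.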

Step 3 — JoinGroup and Partition Assignment

Once every consumer in the group has found the coordinator, the rebalance protocol kicks in:

Each consumer sends a JoinGroup request to the coordinator
The coordinator designates one consumer as Group Leader — typically the first to join
The coordinator sends the full member list to the Group Leader in its JoinGroup response
The Group Leader runs a partition assignment strategy locally and sends the result back in its SyncGroup request
The coordinator distributes the assignments to every member via SyncGroup responses [7][8]

The Group Leader is just a temporary logical role, not a permanent node. It can change on any rebalance [7].
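
The division of labour in that exchange (coordinator picks a leader, leader computes, coordinator distributes) can be simulated in a few lines. This is a toy model for a single topic: rebalance and one_by_one are hypothetical names, and picking the first joiner as leader follows the "typically" above rather than any guarantee.

```python
from typing import Callable, Dict, List

Assignment = Dict[str, List[int]]


def rebalance(
    join_order: List[str],
    partitions: List[int],
    assign: Callable[[List[str], List[int]], Assignment],
) -> Dict[str, object]:
    """Simulate one JoinGroup/SyncGroup round for a single topic."""
    # JoinGroup: the coordinator collects members and names a group leader.
    members = list(join_order)
    group_leader = members[0]  # assumption: first member to join leads

    # Only the group leader sees the member list and runs the strategy.
    proposed = assign(members, partitions)

    # SyncGroup: the coordinator hands each member only its own slice.
    sync_responses = {m: proposed.get(m, []) for m in members}
    return {"group_leader": group_leader, "assignments": sync_responses}


def one_by_one(members: List[str], partitions: List[int]) -> Assignment:
    """Hypothetical toy strategy: deal partitions out like cards."""
    result: Assignment = {m: [] for m in members}
    for i, p in enumerate(partitions):
        result[members[i % len(members)]].append(p)
    return result
```

Note that the coordinator never needs to understand the strategy. It only relays what the group leader decided, which is why assignment strategies are a client-side setting.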

Partition Assignment Strategies

Kafka ships with a few built-in strategies, configurable via partition.assignment.strategy [8]:

Strategy	How It Works	Best For
RangeAssignor	Divides partitions numerically across consumers	Simple, predictable splits per topic
RoundRobinAssignor	Circular distribution across consumers	Even distribution across topics
StickyAssignor	Keeps prior assignments, minimises movement on rebalance	Stateful consumers with warm caches
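CooperativeStickyAssignor	Like Sticky but rebalances incrementally (opt-in; listed second in the default since 3.0)	Production, avoids stop-the-world rebalances

Since Kafka 3.0 the default value is the list [RangeAssignor, CooperativeStickyAssignor]. RangeAssignor is the one actually used; CooperativeStickyAssignor is included so that you can switch to it with a single rolling bounce that removes RangeAssignor from the list. It is worth switching because it does incremental rebalances: only the partitions that actually need to move are revoked, not all of them at once [8].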
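
To see how the first two strategies in the table differ, here is a single-topic sketch of each. These are simplified re-implementations for illustration (the function names are mine). The real assignors work on member ids, handle several topics at once, and cope with consumers that subscribe to different topics.

```python
from typing import Dict, List


def range_assign(consumers: List[str], num_partitions: int) -> Dict[str, List[int]]:
    """Contiguous blocks per consumer; the first few consumers absorb any remainder."""
    ordered = sorted(consumers)
    base, extra = divmod(num_partitions, len(ordered))
    result: Dict[str, List[int]] = {}
    start = 0
    for i, consumer in enumerate(ordered):
        size = base + (1 if i < extra else 0)
        result[consumer] = list(range(start, start + size))
        start += size
    return result


def round_robin_assign(consumers: List[str], num_partitions: int) -> Dict[str, List[int]]:
    """Deal partitions to consumers one at a time, wrapping around."""
    ordered = sorted(consumers)
    result: Dict[str, List[int]] = {c: [] for c in ordered}
    for p in range(num_partitions):
        result[ordered[p % len(ordered)]].append(p)
    return result
```

With 7 partitions and consumers c1, c2, c3, the range version gives c1 partitions 0 to 2, c2 partitions 3 and 4, and c3 partitions 5 and 6. The round-robin version gives c1 partitions 0, 3, 6, c2 partitions 1, 4, and c3 partitions 2, 5.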

What Consumers Can and Cannot Read

One subtle thing: consumers read from the partition leader, not followers [5]. So even if a follower replica is on a broker physically closer to your consumer, the read still goes to the leader.

There is one exception: follower fetching (introduced in Kafka 2.4), where a consumer that sets client.rack can read from the nearest replica, provided the brokers are configured with a rack-aware replica.selector.class. By default, reads go to the leader only.

Consumers also cannot read data beyond the high-water mark — the latest offset that has been acknowledged by all ISR replicas [5]. This prevents consumers from reading data that could be rolled back if the leader dies before followers catch up.
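
A small sketch makes the high-water mark less abstract. It assumes the usual convention that each replica's log end offset is the offset of the next record it will write, so the high-water mark is the smallest log end offset in the ISR and consumers may read everything below it. The function names are hypothetical.

```python
from typing import Dict, List, Set


def high_water_mark(log_end_offsets: Dict[int, int], isr: Set[int]) -> int:
    """Smallest log end offset among in-sync replicas (the leader included)."""
    return min(log_end_offsets[broker] for broker in isr)


def visible_records(leader_log: List[str], hwm: int) -> List[str]:
    """Consumers may read offsets below the high-water mark only."""
    return leader_log[:hwm]
```

If the leader holds five records but the slowest in-sync follower has only three, a consumer sees three. When the slow follower drops out of the ISR, the mark moves up to whatever the remaining members all have.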

What Happens When a Broker Dies Mid-Consume

Say a consumer is reading partition 1, and its leader broker crashes:

The KRaft controller detects the broker failure
It picks a new leader from the ISR for partition 1
It updates __cluster_metadata
The consumer’s next fetch request fails: with a timeout or disconnect if the old broker is down, or with a NOT_LEADER_OR_FOLLOWER error if that broker is still up but no longer leads the partition
The consumer automatically refreshes its metadata and retries the fetch — now against the new leader [6]

The consumer doesn’t need manual intervention. The client library handles the retry and metadata refresh transparently. From your application code’s perspective, there might be a brief pause, but messages are not lost or duplicated (assuming proper offset commit settings).
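
The retry behaviour described above can be sketched as a loop around the fetch call. This is a toy model of what the client library does internally, not its actual code: fetch_with_refresh, NotLeaderError, and the callback parameters are all made up for the example, and real clients add backoff between attempts.

```python
from typing import Callable, Dict, List


class NotLeaderError(Exception):
    """Stand-in for the broker's NOT_LEADER_OR_FOLLOWER error code."""


def fetch_with_refresh(
    partition: int,
    leader_cache: Dict[int, str],
    fetch: Callable[[str, int], List[str]],
    refresh_metadata: Callable[[], Dict[int, str]],
    max_attempts: int = 3,
) -> List[str]:
    """Fetch from the cached leader; on a leadership error, refresh and retry."""
    for _ in range(max_attempts):
        try:
            return fetch(leader_cache[partition], partition)
        except (NotLeaderError, ConnectionError):
            # The cached leader is stale or unreachable: reload the map.
            leader_cache.update(refresh_metadata())
    raise RuntimeError(f"no leader found for partition {partition}")
```

The application only ever calls the outer function, which is why a leader change shows up as a short pause rather than an error.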

The Full Picture in One Flow

[Figure: Kafka consumer flow]

Bootstrap broker — only used once, to get a full cluster map
Group Coordinator — handles join/sync/heartbeat for the consumer group
Partition Leader — the actual broker the consumer fetches messages from

These can be three different brokers, or the same one. Kafka doesn’t care.

Why This Design Holds Up

The architecture is genuinely elegant once you see it whole. Metadata discovery is separated from coordination, which in turn is separated from data transfer. Each layer has a clear owner: KRaft for cluster state, the Group Coordinator for consumer group state, and the Partition Leader for data.

Failures at any layer have defined fallback paths. A dead bootstrap broker is irrelevant after startup. A dead Group Coordinator triggers a FindCoordinator retry. A dead Partition Leader triggers a metadata refresh and reconnect.

Not simple to implement — but very clean to reason about as an operator.
