When a ZooKeeper transaction ID (zxid) counter overflows, the ZooKeeper instance forces a leader election and temporarily disconnects clients. No server-side fix exists for this behavior. To prevent service disruption, take client-side steps to handle the temporary disconnection in your Microservices Engine (MSE) ZooKeeper deployment.
Symptom
The ZooKeeper instance triggers leader election unexpectedly. Clients receive a Disconnected event and briefly lose connectivity to the server.
Cause
How zxid works
A zxid is a 64-bit number composed of two parts:
| Part | Bits | Role |
|---|---|---|
| Epoch (high-order 32 bits) | 63 to 32 | Identifies the current leader's term. Increments each time a new leader is elected. |
| Counter (low-order 32 bits) | 31 to 0 | Tracks the transaction sequence within the current leader's term. Increments by 1 for each new transaction. |
A zxid can be represented as a single 64-bit integer or as a pair *(epoch, counter)*.
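The two representations can be converted into each other with plain bit operations. The following is a minimal sketch; the class name ZxidParts and the sample value are illustrative assumptions and are not part of ZooKeeper.

```java
// Sketch: split a 64-bit zxid into (epoch, counter) and join the pair back.
// ZxidParts is a hypothetical helper, not a ZooKeeper class.
final class ZxidParts {
    private ZxidParts() {}

    /** High-order 32 bits: the leader's term. */
    static long epoch(long zxid) {
        return zxid >>> 32;
    }

    /** Low-order 32 bits: the transaction sequence within the term. */
    static long counter(long zxid) {
        return zxid & 0xffffffffL;
    }

    /** Rebuilds the single 64-bit integer from an (epoch, counter) pair. */
    static long compose(long epoch, long counter) {
        return (epoch << 32) | (counter & 0xffffffffL);
    }

    public static void main(String[] args) {
        long zxid = 0x0000000500000003L; // illustrative value: epoch 5, counter 3
        System.out.printf("zxid=0x%016x epoch=%d counter=%d%n",
                zxid, epoch(zxid), counter(zxid));
        // prints: zxid=0x0000000500000003 epoch=5 counter=3
    }
}
```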
What triggers the overflow
When the counter reaches its maximum value (0xffffffff), ZooKeeper has no more room to assign new transaction IDs under the current epoch. The instance forces a leader election to:
- Increment the epoch (high-order 32 bits) to start a new leader term.
- Reset the counter (low-order 32 bits) to 0.
This is expected ZooKeeper behavior, not a bug. No server-side fix exists because the zxid overflow mechanism is built into the ZooKeeper protocol.
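The overflow condition and the resulting rollover can be illustrated with simple arithmetic. The sketch below is not ZooKeeper source code; the class name ZxidRollover, its method names, and the sample values are assumptions chosen for illustration.

```java
// Sketch: detect an exhausted counter and compute the first zxid of the next term.
// ZxidRollover is a hypothetical helper, not a ZooKeeper class.
final class ZxidRollover {
    static final long COUNTER_MAX = 0xffffffffL;

    private ZxidRollover() {}

    /** True when the low-order 32 bits have reached their maximum value. */
    static boolean counterExhausted(long zxid) {
        return (zxid & COUNTER_MAX) == COUNTER_MAX;
    }

    /** Number of transactions left before the counter reaches its maximum. */
    static long remaining(long zxid) {
        return COUNTER_MAX - (zxid & COUNTER_MAX);
    }

    /** Epoch incremented by 1, counter reset to 0. */
    static long firstZxidOfNextEpoch(long zxid) {
        return ((zxid >>> 32) + 1) << 32;
    }

    public static void main(String[] args) {
        long nearLimit = 0x00000005fffffff0L; // illustrative value
        long atLimit = 0x00000005ffffffffL;   // illustrative value
        System.out.println(remaining(nearLimit));      // prints: 15
        System.out.println(counterExhausted(atLimit)); // prints: true
        System.out.printf("0x%016x%n", firstZxidOfNextEpoch(atLimit));
        // prints: 0x0000000600000000
    }
}
```

If you can read the current zxid of an instance from your own monitoring, a calculation like remaining can give a rough idea of how close the instance is to the next forced election.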
Impact on MSE ZooKeeper
For MSE-managed ZooKeeper instances, leader election does not affect instance availability. Clients automatically reconnect after the election completes:
1. The counter reaches 0xffffffff and ZooKeeper triggers leader election.
2. Clients receive a Disconnected event that indicates temporary loss of connectivity.
3. A new leader is elected. The epoch increments and the counter resets to 0.
4. Clients reconnect automatically through the built-in reconnection mechanism.
Client-side impact by recipe
How the Disconnected event affects your application depends on the Apache Curator recipe you use for leader election:
LeaderLatch
When LeaderLatch receives a Disconnected event, the recipe elects another registered instance as the leader. This leadership change may trigger other logic in your application.
Known issue with Apache Flink: In Flink 1.14 and earlier, the system uses LeaderLatch for internal leader election by default. A zxid overflow can cause Flink jobs to restart during the leader election.
LeaderSelector
LeaderSelector tolerates temporary disconnections. It does not relinquish leadership on a temporary disconnection event, which avoids unnecessary re-election and downstream disruption.
Solution
No server-side solution exists because the overflow is part of the ZooKeeper protocol. Minimize disruption with the following client-side change.
Switch from LeaderLatch to LeaderSelector
If your application uses LeaderLatch for leader election, switch to LeaderSelector. LeaderSelector tolerates temporary disconnections, which prevents unnecessary leadership changes during zxid overflow events.
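The following is a minimal sketch of a LeaderSelector setup and should be adapted before use. It assumes Apache Curator is on the classpath. The endpoint, the election path, and the retry settings are hypothetical placeholders. The sketch also assumes that you want to give up leadership only when the session is lost (ConnectionState.LOST), so it implements LeaderSelectorListener directly. Curator's LeaderSelectorListenerAdapter, under the default error policy, also cancels leadership when the connection is suspended, which may not give the tolerance described above.

```java
import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.framework.recipes.leader.CancelLeadershipException;
import org.apache.curator.framework.recipes.leader.LeaderSelector;
import org.apache.curator.framework.recipes.leader.LeaderSelectorListener;
import org.apache.curator.framework.state.ConnectionState;
import org.apache.curator.retry.ExponentialBackoffRetry;

public class SelectorExample {
    public static void main(String[] args) throws Exception {
        // Hypothetical endpoint and election path; replace with your own values.
        String connectString = "mse-example.zk.example.com:2181";
        String electionPath = "/example/leader";

        // Retry settings are illustrative assumptions, not recommended values.
        CuratorFramework client = CuratorFrameworkFactory.newClient(
                connectString, new ExponentialBackoffRetry(1000, 3));
        client.start();

        LeaderSelectorListener listener = new LeaderSelectorListener() {
            @Override
            public void takeLeadership(CuratorFramework c) {
                // Leadership is held until this method returns.
                try {
                    Thread.sleep(Long.MAX_VALUE); // placeholder for leader-only work
                } catch (InterruptedException e) {
                    // Interrupted when leadership is cancelled; return to release it.
                    Thread.currentThread().interrupt();
                }
            }

            @Override
            public void stateChanged(CuratorFramework c, ConnectionState newState) {
                // Assumption: ignore SUSPENDED (temporary disconnection) and
                // relinquish leadership only when the session is lost.
                if (newState == ConnectionState.LOST) {
                    throw new CancelLeadershipException();
                }
            }
        };

        LeaderSelector selector = new LeaderSelector(client, electionPath, listener);
        selector.autoRequeue(); // rejoin the election after leadership is released
        selector.start();

        Thread.sleep(Long.MAX_VALUE); // keep the process alive for this sketch
    }
}
```

Holding leadership through a suspended connection is a trade-off: it avoids churn during a short election, but the application should still stop leader-only work once the session is lost.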