Microservices Engine: Handle zxid overflow in a ZooKeeper instance

Last Updated: Mar 11, 2026

When a ZooKeeper transaction ID (zxid) counter overflows, the ZooKeeper instance forces a leader election and temporarily disconnects clients. No server-side fix exists for this behavior. To prevent service disruption, take client-side steps to handle the temporary disconnection in your Microservices Engine (MSE) ZooKeeper deployment.

Symptom

The ZooKeeper instance triggers leader election unexpectedly. Clients receive a Disconnected event and briefly lose connectivity to the server.

Cause

How zxid works

A zxid is a 64-bit number composed of two parts:

| Part | Bits | Role |
| --- | --- | --- |
| Epoch (high-order 32 bits) | 63-32 | Identifies the current leader's term. Increments each time a new leader is elected. |
| Counter (low-order 32 bits) | 31-0 | Tracks the transaction sequence within the current leader's term. Increments by 1 for each new transaction. |

A zxid can be represented as a single 64-bit integer or as a pair *(epoch, counter)*.
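
The two representations can be converted into each other with plain bit operations. The following is a minimal sketch; the class name `Zxid` and its methods are hypothetical helpers written for illustration, not part of the ZooKeeper client API.

```java
// Illustrative helper (hypothetical, not a ZooKeeper class) that splits and
// rebuilds a zxid according to the bit layout in the table above.
final class Zxid {
    private Zxid() {}

    /** High-order 32 bits: the leader's epoch. */
    static long epoch(long zxid) {
        return zxid >>> 32;
    }

    /** Low-order 32 bits: the transaction counter within the epoch. */
    static long counter(long zxid) {
        return zxid & 0xffffffffL;
    }

    /** Packs an (epoch, counter) pair into a single 64-bit zxid. */
    static long of(long epoch, long counter) {
        return (epoch << 32) | (counter & 0xffffffffL);
    }

    public static void main(String[] args) {
        long zxid = of(5, 3);
        // Prints the packed hexadecimal form followed by the (epoch, counter) pair.
        System.out.println(String.format("0x%016x = (%d, %d)", zxid, epoch(zxid), counter(zxid)));
    }
}
```

For example, the pair (5, 3) packs to `0x0000000500000003`, and splitting that value again yields epoch 5 and counter 3.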

What triggers the overflow

When the counter reaches its maximum value (0xffffffff), ZooKeeper has no more room to assign new transaction IDs under the current epoch. The instance forces a leader election to:

  1. Increment the epoch (high-order 32 bits) to start a new leader term.

  2. Reset the counter (low-order 32 bits) to 0.

This is expected ZooKeeper behavior, not a bug. No server-side fix exists because the zxid overflow mechanism is built into the ZooKeeper protocol.
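
The rollover arithmetic can be sketched as follows. This is an illustration of the bit manipulation only, not ZooKeeper's server code; the class `ZxidRollover` and its methods are hypothetical. The `daysUntilOverflow` estimate additionally assumes a constant transaction rate, which real workloads rarely have, so treat its result as a rough planning figure.

```java
// Hypothetical sketch of the counter rollover described above.
final class ZxidRollover {
    static final long MAX_COUNTER = 0xffffffffL;

    private ZxidRollover() {}

    /** True when the low-order 32 bits have reached their maximum value. */
    static boolean counterExhausted(long zxid) {
        return (zxid & MAX_COUNTER) == MAX_COUNTER;
    }

    /** First zxid of the next term: epoch incremented by 1, counter reset to 0. */
    static long firstZxidOfNextEpoch(long zxid) {
        long nextEpoch = (zxid >>> 32) + 1;
        return nextEpoch << 32;
    }

    /** Rough estimate, assuming a steady rate, of the days left before overflow. */
    static double daysUntilOverflow(long zxid, double transactionsPerSecond) {
        long remaining = MAX_COUNTER - (zxid & MAX_COUNTER);
        return remaining / transactionsPerSecond / 86_400.0;
    }

    public static void main(String[] args) {
        long last = 0x00000005ffffffffL; // epoch 5, counter at its maximum
        System.out.println("exhausted: " + counterExhausted(last));
        System.out.println("next zxid: 0x" + Long.toHexString(firstZxidOfNextEpoch(last)));
        System.out.println("days from a fresh epoch at 1000 tx/s: "
                + daysUntilOverflow(0x0000000600000000L, 1000.0));
    }
}
```

Under the steady-rate assumption, a fresh epoch at 1,000 transactions per second lasts roughly 49.7 days before the counter is exhausted, which is why write-heavy instances can see this election periodically.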

Impact on MSE ZooKeeper

For MSE-managed ZooKeeper instances, the forced leader election does not affect instance availability, and clients reconnect automatically after the election completes. The sequence of events is:

  1. The counter reaches 0xffffffff, and ZooKeeper triggers a leader election.

  2. Clients receive a Disconnected event that indicates temporary loss of connectivity.

  3. A new leader is elected. The epoch increments and the counter resets to 0.

  4. Clients reconnect automatically through the built-in reconnection mechanism.

Client-side impact by recipe

How the Disconnected event affects your application depends on the Apache Curator recipe you use for leader election:

LeaderLatch

When LeaderLatch receives a Disconnected event, the current leader gives up leadership and another registered instance is elected leader. This leadership change may trigger other logic in your application.

Known issue with Apache Flink: Flink 1.14 and earlier use LeaderLatch for internal leader election by default. A zxid overflow can therefore cause Flink jobs to restart during the ZooKeeper leader election.

LeaderSelector

LeaderSelector tolerates temporary disconnections. It does not relinquish leadership on a temporary disconnection event, which avoids unnecessary re-election and downstream disruption.
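
The difference between the two recipes can be modeled in a few lines of plain Java. The sketch below is a simplified model and does not use or reproduce Curator's implementation: the `ConnState` and `Policy` types are hypothetical. The state names follow the ones Curator reports, where a ZooKeeper Disconnected event surfaces as `SUSPENDED` and an expired session as `LOST`. In Curator itself, how a LeaderSelector reacts is determined by the listener's `stateChanged` callback, so verify the behavior against the listener your application registers.

```java
import java.util.Arrays;
import java.util.List;

// Simplified model (hypothetical types, not Curator classes) of how a leader
// reacts to the connection events seen during a zxid-overflow election.
final class DisconnectToleranceModel {
    /** Connection states, named after the ones Curator reports. */
    enum ConnState { CONNECTED, SUSPENDED, RECONNECTED, LOST }

    enum Policy {
        GIVE_UP_ON_DISCONNECT,   // models the LeaderLatch behavior described above
        GIVE_UP_ON_SESSION_LOSS; // models a disconnection-tolerant LeaderSelector

        /** Whether a leader following this policy relinquishes leadership on the event. */
        boolean relinquishes(ConnState state) {
            if (this == GIVE_UP_ON_DISCONNECT) {
                return state == ConnState.SUSPENDED || state == ConnState.LOST;
            }
            return state == ConnState.LOST;
        }
    }

    private DisconnectToleranceModel() {}

    /** Replays the events and reports whether the instance is still leader at the end. */
    static boolean stillLeaderAfter(Policy policy, List<ConnState> events) {
        for (ConnState event : events) {
            if (policy.relinquishes(event)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Events a client is assumed to see during a zxid overflow:
        // a temporary disconnection followed by an automatic reconnection.
        List<ConnState> overflow = Arrays.asList(ConnState.SUSPENDED, ConnState.RECONNECTED);
        for (Policy policy : Policy.values()) {
            System.out.println(policy + " -> still leader: " + stillLeaderAfter(policy, overflow));
        }
    }
}
```

In this model, the policy that gives up on any disconnection loses leadership during the overflow, while the policy that waits for session loss keeps it, because the session survives the brief election.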

Solution

No server-side solution exists because the overflow is part of the ZooKeeper protocol. Minimize disruption with the following client-side change.

Switch from LeaderLatch to LeaderSelector

If your application uses LeaderLatch for leader election, switch to LeaderSelector. LeaderSelector can tolerate temporary disconnections, which prevents unnecessary leadership changes during zxid overflow events.