Why does a zxid overflow occur when a ZooKeeper instance is used? - Microservices Engine

This topic describes the ZooKeeper transaction ID (zxid) overflow that can occur when a ZooKeeper instance is used, and provides the cause of the issue and solutions to it.

Problem description

A ZooKeeper instance forces a leader election and resets the low-order 32 bits of the zxid, which serve as the transaction counter.

Cause

A zxid is a 64-bit number. The high-order 32 bits store the epoch (term) of the current leader, and the low-order 32 bits store a counter that gives the position of a transaction within that epoch. Each time a new transaction is generated, the counter is incremented by 1. When the counter reaches 0xffffffff, the ZooKeeper instance forces a leader election. After the election, the high-order 32 bits of the zxid hold the epoch of the new leader, and the low-order 32 bits are reset to 0.
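The bit layout described above can be illustrated with a minimal Java sketch. The class name `ZxidLayout` and its helper methods are hypothetical and are not part of the ZooKeeper API. The `afterRollover` helper also assumes that the epoch advances by exactly 1 after the forced election, which the description above does not guarantee.

```java
/** Illustrative model of the zxid layout: high 32 bits = epoch, low 32 bits = counter. */
final class ZxidLayout {
    // Largest value the 32-bit counter can hold; reaching it forces a leader election.
    static final long MAX_COUNTER = 0xffffffffL;

    // Extract the leader epoch from the high-order 32 bits.
    static long epochOf(long zxid) {
        return zxid >>> 32;
    }

    // Extract the transaction counter from the low-order 32 bits.
    static long counterOf(long zxid) {
        return zxid & MAX_COUNTER;
    }

    // Combine an epoch and a counter into a 64-bit zxid.
    static long make(long epoch, long counter) {
        return (epoch << 32) | (counter & MAX_COUNTER);
    }

    // Number of transactions that can still be issued before the counter reaches 0xffffffff.
    static long remainingBeforeRollover(long zxid) {
        return MAX_COUNTER - counterOf(zxid);
    }

    // Assumption: the new leader's epoch is the old epoch plus 1; the counter restarts at 0.
    static long afterRollover(long zxid) {
        return make(epochOf(zxid) + 1, 0L);
    }

    public static void main(String[] args) {
        long zxid = make(5L, 0xfffffff0L);
        System.out.printf("zxid=0x%016x epoch=%d counter=0x%08x remaining=%d%n",
                zxid, epochOf(zxid), counterOf(zxid), remainingBeforeRollover(zxid));
        System.out.printf("after rollover: 0x%016x%n", afterRollover(zxid));
    }
}
```

A helper such as `remainingBeforeRollover` could be applied to a zxid read from the server's monitoring output to estimate how close an instance is to a forced election.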

Solution

No server-side solution is available for the zxid overflow issue. You must mitigate its impact on the client.

When a Microservices Engine (MSE) ZooKeeper instance is used, the leader election does not affect the use of the instance, and the client automatically reconnects to the server after the leader election is complete.
If a Disconnected event is reported to the client, for example when the client uses the Curator LeaderLatch recipe, the event is caused by the leader election on the server. The event indicates that the client is disconnected from the server. After the leader election is complete, the client reconnects to the server based on its reconnection mechanism.
When LeaderLatch receives the Disconnected event, it runs a new election among the registered instances, and a different instance may become the leader. This process may trigger other client logic. For example, Flink 1.14 and earlier uses the Curator LeaderLatch recipe for leader election by default. In this case, a Flink job may be restarted during the ZooKeeper leader election.

In your service implementation, you can use the Curator LeaderSelector recipe instead of LeaderLatch so that the client tolerates temporary disconnections from the ZooKeeper instance.