By Zikui
In daily ZooKeeper O&M, we often run into clusters that cannot complete leader election for a long time, or processes that start and then exit again during recovery, accompanied by surging memory usage, CPU spikes, and frequent GC, all of which impact the availability of business services. These problems are often related to the setting of jute.maxbuffer. This article digs into the ZooKeeper source code to explore best practices for the jute.maxbuffer parameter.
First, let's look at the description of jute.maxbuffer on the official website of ZooKeeper:
jute.maxbuffer:
(Java system property: jute.maxbuffer)
...... It specifies the maximum size of the data that can be stored in a znode. The unit is bytes. The default is 0xfffff (1,048,575) bytes, or just under 1 MB. When jute.maxbuffer on the client side is greater than on the server side and the client writes data that exceeds the server-side jute.maxbuffer, the server will get java.io.IOException: Len error.
When jute.maxbuffer on the client side is less than on the server side and the client reads data that exceeds the client-side jute.maxbuffer, the client will get java.io.IOException: Unreasonable length or Packet len is out of range!
From this description, we know that jute.maxbuffer limits the size of Znodes and needs to be set properly on both the server side and the client side; otherwise, it may cause problems.
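To make the second mismatch case concrete (client-side limit smaller than the data being read), here is a minimal client-side sketch. The connection string, the znode path /big-node, and the deliberately tiny limit are assumptions for illustration only, not values from this article.

import org.apache.zookeeper.ZooKeeper;

public class ClientLimitDemo {
    public static void main(String[] args) throws Exception {
        // jute.maxbuffer is read when the client classes are initialized, so set it first.
        // In practice it is usually passed as a JVM argument (-Djute.maxbuffer=...) instead.
        System.setProperty("jute.maxbuffer", "1024"); // deliberately tiny, for demonstration only

        ZooKeeper zk = new ZooKeeper("127.0.0.1:2181", 30000, event -> { });
        try {
            // Assumes /big-node already holds more than 1 KB of data. The reply exceeds the
            // client-side limit, the client IO thread fails with "Packet len ... is out of range!",
            // and the caller typically observes a ConnectionLossException.
            zk.getData("/big-node", false, null);
        } catch (Exception e) {
            System.out.println("Read failed: " + e);
        } finally {
            zk.close();
        }
    }
}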
However, that is not the whole story. Let's look at the definition and references of jute.maxbuffer in ZooKeeper's code:
public static final int maxBuffer = Integer.getInteger("jute.maxbuffer", 0xfffff);
In the org.apache.jute.BinaryInputArchive class, the value of jute.maxbuffer is read from a Java system property. You can see that the default value is just under 1 MB, and the checkLength method references this static field:
// Since this is a rough sanity check, add some padding to maxBuffer to
// make up for extra fields, etc. (otherwise e.g. clients may be able to
// write buffers larger than we can read from disk!)
private void checkLength(int len) throws IOException {
    if (len < 0 || len > maxBufferSize + extraMaxBufferSize) {
        throw new IOException(UNREASONBLE_LENGTH + len);
    }
}
If the len parameter exceeds the sum of maxBufferSize and extraMaxBufferSize, an IOException with the message Unreasonable length is thrown. In production environments, this exception often leads to unexpected leader elections or server startup failures.
Let's look at how extraMaxBuffer, which provides the default for extraMaxBufferSize, is assigned:
static {
    final Integer configuredExtraMaxBuffer =
        Integer.getInteger("zookeeper.jute.maxbuffer.extrasize", maxBuffer);
    if (configuredExtraMaxBuffer < 1024) {
        extraMaxBuffer = 1024;
    } else {
        extraMaxBuffer = configuredExtraMaxBuffer;
    }
}
It can be seen that extraMaxBuffer defaults to the value of maxBuffer, with a lower bound of 1024 bytes (kept for compatibility with earlier versions, where this extra size was hardcoded to 1024). Therefore, by default, the threshold at which the checkLength method throws an exception is maxBuffer + extraMaxBuffer, roughly 2 MB.
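To make the effective limit concrete, here is a small standalone sketch that simply mirrors the two property reads shown above and prints the resulting checkLength threshold; it does not call into ZooKeeper itself.

public class JuteLimits {
    public static void main(String[] args) {
        // Mirrors the logic in org.apache.jute.BinaryInputArchive shown above.
        int maxBuffer = Integer.getInteger("jute.maxbuffer", 0xfffff);                        // ~1 MB by default
        int extraMaxBuffer = Integer.getInteger("zookeeper.jute.maxbuffer.extrasize", maxBuffer);
        if (extraMaxBuffer < 1024) {
            extraMaxBuffer = 1024; // lower bound kept for compatibility with older versions
        }
        // checkLength rejects any serialized field whose length exceeds this sum.
        System.out.println("checkLength threshold = " + (maxBuffer + extraMaxBuffer) + " bytes");
    }
}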
Then, let's look at the reference chain of the checkLength method:
The checkLength method is referenced in two places: the readString and readBuffer methods of the org.apache.jute.BinaryInputArchive class.
public String readString(String tag) throws IOException {
    ......
    checkLength(len);
    ......
}

public byte[] readBuffer(String tag) throws IOException {
    ......
    checkLength(len);
    ......
}
However, these two methods are called by almost all org.apache.jute.Record implementations. In other words, almost every serialized object in ZooKeeper goes through checkLength during deserialization. Therefore, jute.maxbuffer limits not only the size of Znodes but also the size of every Record that calls readString or readBuffer.
This includes the org.apache.zookeeper.server.quorum.QuorumPacket class.
QuorumPacket is the serialization class the leader uses to transmit proposals. The Txn generated by a write request is also serialized and transmitted through this class when data is synchronized between servers, so if a transaction is too large, checkLength fails and throws an exception. For a regular write request this is not a problem: checkLength runs when the request is received, which prevents an oversized QuorumPacket from being produced during request pre-processing. A CloseSession request, however, can still trigger the exception.
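As a rough illustration of how this surfaces, the sketch below builds a QuorumPacket whose payload exceeds the default threshold, serializes it, and then deserializes it with a BinaryInputArchive, which fails in checkLength with Unreasonable length. It assumes the ZooKeeper server artifact is on the classpath; the packet type and payload size are arbitrary example values.

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import org.apache.jute.BinaryInputArchive;
import org.apache.jute.BinaryOutputArchive;
import org.apache.zookeeper.server.quorum.QuorumPacket;

public class QuorumPacketLenDemo {
    public static void main(String[] args) throws Exception {
        // Payload deliberately larger than the default maxBuffer + extraMaxBuffer threshold (~2 MB).
        byte[] hugePayload = new byte[3 * 1024 * 1024];
        int proposalType = 2; // Leader.PROPOSAL; the exact type does not matter for the length check
        QuorumPacket packet = new QuorumPacket(proposalType, 1L, hugePayload, null);

        // Serializing the oversized packet succeeds...
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        packet.serialize(BinaryOutputArchive.getArchive(bos), "packet");

        // ...but deserializing it goes through readBuffer -> checkLength and throws
        // java.io.IOException: Unreasonable length = 3145728
        BinaryInputArchive ia = BinaryInputArchive.getArchive(new ByteArrayInputStream(bos.toByteArray()));
        new QuorumPacket().deserialize(ia, "packet");
    }
}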
We can see how the CloseSessionTxn is generated in the pRequest2Txn method of PrepRequestProcessor:
protected void pRequest2Txn(int type, long zxid, Request request, Record record, boolean deserialize) throws KeeperException, IOException, RequestProcessorException {
    ......
    case OpCode.closeSession:
        long startTime = Time.currentElapsedTime();
        synchronized (zks.outstandingChanges) {
            Set<String> es = zks.getZKDatabase().getEphemerals(request.sessionId);
            for (ChangeRecord c : zks.outstandingChanges) {
                if (c.stat == null) {
                    // Doing a delete
                    es.remove(c.path);
                } else if (c.stat.getEphemeralOwner() == request.sessionId) {
                    es.add(c.path);
                }
            }
            if (ZooKeeperServer.isCloseSessionTxnEnabled()) {
                request.setTxn(new CloseSessionTxn(new ArrayList<String>(es)));
            }
            ......
        }
A CloseSession request itself is very small and easily passes the checkLength check, but the transaction it generates can be very large. As the definition of the org.apache.zookeeper.txn.CloseSessionTxn class shows, this Txn contains the paths of all ephemeral Znodes created by the session. So if a session has created a large number of ephemeral Znodes, then when its CloseSession request is processed, the leader produces an extremely large QuorumPacket when sending the proposal to the followers, and checkLength throws an exception during deserialization. From the followLeader method of the Follower, we can see that when such an exception occurs, the follower disconnects from the leader.
void followLeader() throws InterruptedException {
    ......
    ......
    ......
            // create a reusable packet to reduce gc impact
            QuorumPacket qp = new QuorumPacket();
            while (this.isRunning()) {
                readPacket(qp);
                processPacket(qp);
            }
        } catch (Exception e) {
            LOG.warn("Exception when following the leader", e);
            closeSocket();
            // clear pending revalidations
            pendingRevalidations.clear();
        }
    } finally {
        ......
        ......
    }
When more than half of the followers cannot deserialize the extremely large QuorumPacket, a leader re-election is triggered. Moreover, if the original leader wins the election again, the checkLength check fails when the server loads data and replays transaction logs from disk, specifically when it reads the extremely large CloseSessionTxn it just wrote. This sends the leader back to the LOOKING state, another election starts, and the cycle repeats, so the cluster remains stuck in leader election.
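To get a feel for how quickly a CloseSessionTxn grows, the following sketch serializes one that contains many ephemeral paths and prints its size; the path pattern and count are arbitrary example values.

import java.io.ByteArrayOutputStream;
import java.util.ArrayList;
import java.util.List;
import org.apache.jute.BinaryOutputArchive;
import org.apache.zookeeper.txn.CloseSessionTxn;

public class CloseSessionTxnSize {
    public static void main(String[] args) throws Exception {
        // Path pattern and count are example values chosen only for illustration.
        List<String> ephemeralPaths = new ArrayList<>();
        for (int i = 0; i < 30000; i++) {
            ephemeralPaths.add("/services/app/instances/session-ephemeral-" + i);
        }

        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        new CloseSessionTxn(ephemeralPaths).serialize(BinaryOutputArchive.getArchive(bos), "txn");

        // Around 30,000 paths of ~50 bytes each already approach 1.5 MB, before the rest of
        // the proposal is added on top.
        System.out.println("Serialized CloseSessionTxn size: " + bos.size() + " bytes");
    }
}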
From the analysis above, unexpected leader elections, continuous re-elections, or server startup failures may all result from an unreasonable jute.maxbuffer setting: a client connected to the cluster creates a very large number of ephemeral nodes, and when that session issues a CloseSession request, the resulting oversized proposal causes the followers to disconnect from the leader, ultimately leading to election failure or startup failure.
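For illustration, a client along the following lines can trigger the scenario just described: it creates a large number of ephemeral znodes in a single session, so that closing the session yields an oversized CloseSessionTxn. The connection string, parent path, and node count are assumptions for demonstration, not values taken from this article.

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class EphemeralFlood {
    public static void main(String[] args) throws Exception {
        // Connection string, parent path, and node count are assumptions for demonstration.
        // Assumes the parent znode /load-test already exists.
        ZooKeeper zk = new ZooKeeper("127.0.0.1:2181", 30000, event -> { });
        for (int i = 0; i < 60000; i++) {
            // Each ephemeral path created here ends up inside the session's CloseSessionTxn.
            zk.create("/load-test/app-instance-ephemeral-node-" + i, new byte[0],
                    ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
        }
        // Closing the session generates a CloseSessionTxn containing all of the paths above;
        // with this many paths its serialized size can exceed maxBuffer + extraMaxBuffer.
        zk.close();
    }
}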
First, how can we determine whether a cluster's leader election failure or startup failure is caused by an unreasonable jute.maxbuffer setting?
1. When the checkLength method fails, an exception is thrown with the keyword Unreasonable length. At the same time, the follower disconnects from the leader, logging the keyword Exception when following the leader. You can quickly confirm the cause by searching the logs for these keywords.
2. Newer ZooKeeper versions provide the last_proposal_size metric, which you can use to monitor the size of proposals in the cluster (see the sketch after this list). When a proposal is larger than the value of jute.maxbuffer, the issue needs to be investigated.
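As one possible way to check this metric, the sketch below sends the mntr four-letter command to the client port and prints the proposal-related lines. It assumes mntr is enabled via 4lw.commands.whitelist and that the server version exposes proposal size metrics; host and port are examples.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

public class ProposalSizeCheck {
    public static void main(String[] args) throws Exception {
        // Host and port are examples; "mntr" must be whitelisted on the server.
        try (Socket socket = new Socket("127.0.0.1", 2181)) {
            OutputStream out = socket.getOutputStream();
            out.write("mntr".getBytes(StandardCharsets.US_ASCII));
            out.flush();
            try (BufferedReader reader = new BufferedReader(
                    new InputStreamReader(socket.getInputStream(), StandardCharsets.US_ASCII))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    // Keep only the proposal-related lines, e.g. last_proposal_size.
                    if (line.contains("proposal")) {
                        System.out.println(line);
                    }
                }
            }
        }
    }
}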
How can we set jute.maxbuffer correctly?
In MSE ZooKeeper, you can quickly modify the jute.maxbuffer parameter on the console:
Set MSE ZooKeeper leader election time alerts and node unavailability alerts. First, go to the Alert management page and create a new alert:
Select ZooKeeper Professional Edition for the Alert Contact Group, choose leader election time for the Alert Metric, and then set the threshold:
POD status alert: select ZooKeeper Professional Edition for the Alert Contact Group, and choose ZooKeeper single POD status for the Alert Metric.
Configure alerts for node unavailability and leader election time to detect and troubleshoot problems promptly.