
ZooKeeper Practice: How to Tune jute.maxbuffer

This article delves into the source code of ZooKeeper to explore the best practices for the jute.maxbuffer parameter in ZooKeeper.

By Zikui

Background

In daily ZooKeeper O&M, we often encounter issues where leader election cannot complete for a long time, or where a process starts and then exits during recovery, accompanied by a surge in memory usage, a spike in CPU usage, and frequent GC, all of which impact the availability of business services. These problems may all be related to the setting of jute.maxbuffer. This article delves into the ZooKeeper source code to explore best practices for the jute.maxbuffer parameter.


Analysis

First, let's look at the description of jute.maxbuffer on the official website of ZooKeeper:

jute.maxbuffer:
(Java system property: jute.maxbuffer)
...... It specifies the maximum size of the data that can be stored in a znode. The unit is bytes. The default is 0xfffff (1,048,575) bytes, or just under 1 MB.

If jute.maxbuffer on the client side is greater than on the server side, and the client writes data that exceeds the server-side limit, the server throws java.io.IOException: Len error.

If jute.maxbuffer on the client side is less than on the server side, and the client reads data that exceeds the client-side limit, the client throws java.io.IOException: Unreasonable length or Packet len is out of range!

From this description, jute.maxbuffer limits the size of znodes and must be set consistently on the server side and the client side; otherwise, it may cause problems.

However, that is not the whole story. Let's look for the definition and references of jute.maxbuffer in ZooKeeper's code:

public static final int maxBuffer = Integer.getInteger("jute.maxbuffer", 0xfffff);

In the org.apache.jute.BinaryInputArchive class, the value of jute.maxbuffer is read from a system property. You can see that the default value is just under 1 MB, and the checkLength method references this static value:

// Since this is a rough sanity check, add some padding to maxBuffer to
// make up for extra fields, etc. (otherwise e.g. clients may be able to
// write buffers larger than we can read from disk!)
private void checkLength(int len) throws IOException {
    if (len < 0 || len > maxBufferSize + extraMaxBufferSize) {
        throw new IOException(UNREASONBLE_LENGTH + len);
    }
}

If the len parameter exceeds the sum of maxBufferSize and extraMaxBufferSize, an Unreasonable length exception is thrown. In production, this exception often leads to an unexpected leader election or a server startup failure.

Let's look at how extraMaxBuffer is assigned:

static {
    final Integer configuredExtraMaxBuffer =
        Integer.getInteger("zookeeper.jute.maxbuffer.extrasize", maxBuffer);
    if (configuredExtraMaxBuffer < 1024) {
        extraMaxBuffer = 1024;
    } else {
        extraMaxBuffer = configuredExtraMaxBuffer;
    }
}

It can be seen that extraMaxBuffer defaults to the value of maxBuffer, with a floor of 1024 bytes (for compatibility with earlier versions, where the extra size was fixed at 1 KB). Therefore, with default settings, checkLength throws an exception only once a length exceeds maxBuffer + extraMaxBuffer, i.e., roughly 2 MB; if zookeeper.jute.maxbuffer.extrasize is explicitly set below 1 KB, the threshold bottoms out at maxBuffer + 1 KB.
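To make the arithmetic concrete, the following standalone sketch reproduces the defaulting logic above. The class name ThresholdDemo is made up for illustration; it is not ZooKeeper code, only a mirror of the two property reads:

```java
// Hypothetical standalone sketch mirroring BinaryInputArchive's defaulting
// logic; ThresholdDemo is a made-up class, not part of ZooKeeper.
public class ThresholdDemo {

    // Mirrors: Integer.getInteger("jute.maxbuffer", 0xfffff)
    static int maxBuffer() {
        return Integer.getInteger("jute.maxbuffer", 0xfffff);
    }

    // Mirrors the static block: extrasize defaults to maxBuffer,
    // and is clamped to a minimum of 1024 bytes.
    static int extraMaxBuffer(int maxBuffer) {
        int configured = Integer.getInteger("zookeeper.jute.maxbuffer.extrasize", maxBuffer);
        return configured < 1024 ? 1024 : configured;
    }

    public static void main(String[] args) {
        int max = maxBuffer();
        int extra = extraMaxBuffer(max);
        // With defaults: 0xfffff + 0xfffff = 2,097,150 bytes, roughly 2 MB.
        System.out.println("checkLength threshold = " + (max + extra) + " bytes");
    }
}
```

Running it with no -D flags prints the default threshold; passing -Djute.maxbuffer or -Dzookeeper.jute.maxbuffer.extrasize changes the result the same way it would in ZooKeeper.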

Then, let's look at the reference chain of the checkLength method:

There are two places where checkLength is referenced: the readString and readBuffer methods in the org.apache.jute.BinaryInputArchive class.

    public String readString(String tag) throws IOException {
        ......
        checkLength(len);
        ......
    }

    public byte[] readBuffer(String tag) throws IOException {
        ......
        checkLength(len);
        ......
    }

However, these two methods are referenced in almost all org.apache.jute.Record types. That is to say, almost every serialized object in ZooKeeper goes through checkLength during deserialization. Therefore, jute.maxbuffer limits not only the size of znodes but also the size of every record that is read via readString or readBuffer.
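The effect is easy to reproduce without ZooKeeper itself. The sketch below is a simplified, dependency-free model of jute's length-prefixed framing, not the real BinaryInputArchive; the LengthCheckDemo class and its hard-coded limit (modeling the default maxBuffer + extraMaxBuffer) are illustrative assumptions:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

// Illustrative sketch of jute-style length-prefixed framing; not the real
// BinaryInputArchive. LIMIT models the default maxBuffer + extraMaxBuffer.
public class LengthCheckDemo {
    static final int LIMIT = 0xfffff + 0xfffff; // ~2 MB, the default threshold

    // Frame a payload the way a jute buffer is written: 4-byte length, then bytes.
    static byte[] frame(byte[] payload) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bos);
        out.writeInt(payload.length);
        out.write(payload);
        return bos.toByteArray();
    }

    // Read it back with a checkLength-style guard: reject negative or
    // oversized lengths before allocating the payload.
    static byte[] readFramed(byte[] framed) throws IOException {
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(framed));
        int len = in.readInt();
        if (len < 0 || len > LIMIT) {
            throw new IOException("Unreasonable length = " + len);
        }
        byte[] payload = new byte[len];
        in.readFully(payload);
        return payload;
    }
}
```

A frame whose declared length exceeds the limit is rejected before the payload is even allocated, which is why a reader gives up on an oversized record instead of consuming it.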

This includes the org.apache.zookeeper.server.quorum.QuorumPacket type.

QuorumPacket is the serialization type used to transmit proposals between servers: the transaction (Txn) generated by a write request is serialized into it when data is synchronized from the leader to the followers. If the transaction is too large, checkLength fails and throws an exception. For a regular write request this cannot happen, because checkLength is already applied when the request is received, so an oversized payload is rejected during request pre-processing. A CloseSession request, however, can slip through.

We can see how the CloseSession transaction is generated in the pRequest2Txn method of PrepRequestProcessor:

    protected void pRequest2Txn(int type, long zxid, Request request, Record record, boolean deserialize) throws KeeperException, IOException, RequestProcessorException {
        ......
        case OpCode.closeSession:
            long startTime = Time.currentElapsedTime();
            synchronized (zks.outstandingChanges) {
                Set<String> es = zks.getZKDatabase().getEphemerals(request.sessionId);
                for (ChangeRecord c : zks.outstandingChanges) {
                    if (c.stat == null) {
                        // Doing a delete
                        es.remove(c.path);
                    } else if (c.stat.getEphemeralOwner() == request.sessionId) {
                        es.add(c.path);
                    }
                }  
                if (ZooKeeperServer.isCloseSessionTxnEnabled()) {
                    request.setTxn(new CloseSessionTxn(new ArrayList<String>(es)));
                }
                ......
    }

A CloseSession request itself is tiny and easily passes the checkLength check, but the transaction it generates can be very large. As the definition of the org.apache.zookeeper.txn.CloseSessionTxn type shows, this Txn contains the paths of all ephemeral znodes created by the session. If a session has created a large number of ephemeral znodes, then when the server processes that session's CloseSession request, the leader sends an extremely large QuorumPacket in its proposal to the followers, and deserialization fails in checkLength. As the followLeader method of the follower shows, when such an exception occurs, the follower disconnects from the leader.

void followLeader() throws InterruptedException {
        ......
        ......
        ......
                // create a reusable packet to reduce gc impact
                QuorumPacket qp = new QuorumPacket();
                while (this.isRunning()) {
                    readPacket(qp);
                    processPacket(qp);
                }
            } catch (Exception e) {
                LOG.warn("Exception when following the leader", e);
                closeSocket();

                // clear pending revalidations
                pendingRevalidations.clear();
            }
        } finally {
        ......
        ......
    }
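To get a feel for when this bites, the sketch below estimates the serialized size of a close-session transaction for a session holding many ephemeral nodes. The CloseSessionSizeDemo class is hypothetical, and its model (a 4-byte count followed by length-prefixed UTF-8 paths) only approximates jute's vector encoding:

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical back-of-the-envelope model of a CloseSessionTxn's wire size:
// a 4-byte vector count, then each path as a 4-byte length plus UTF-8 bytes.
public class CloseSessionSizeDemo {

    static int estimatedTxnSize(List<String> ephemeralPaths) {
        int size = 4; // vector length prefix
        for (String path : ephemeralPaths) {
            size += 4 + path.getBytes(StandardCharsets.UTF_8).length;
        }
        return size;
    }

    public static void main(String[] args) {
        // e.g. one session holding 25,000 ephemeral znodes with ~100-byte paths
        List<String> paths = new ArrayList<>();
        for (int i = 0; i < 25_000; i++) {
            paths.add(String.format("/services/demo/provider-%05d-", i) + "x".repeat(70));
        }
        int size = estimatedTxnSize(paths);
        int threshold = 0xfffff + 0xfffff; // default checkLength limit (~2 MB)
        System.out.printf("txn ~= %d bytes, limit = %d, exceeds = %b%n",
            size, threshold, size > threshold);
    }
}
```

With these assumed numbers the estimated transaction is about 2.6 MB, comfortably above the default threshold, so the proposal carrying it would fail deserialization on the followers.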

When more than half of the followers fail to deserialize the oversized QuorumPacket, a new leader election is triggered. Worse, if the original leader wins the election again, the checkLength check fails once more while the server loads its database and replays the transaction log from disk, specifically when it reads the huge CloseSessionTxn it has just written. The leader then falls back to the LOOKING state, the election restarts, and the cluster can stay stuck in this election loop indefinitely.

Cause

From the analysis above, unexpected leader elections, a continuous election loop, or server startup failures can all result from an unreasonable jute.maxbuffer setting: a client connected to the cluster creates a very large number of ephemeral nodes, so when that session issues a CloseSession request, the followers disconnect from the leader, ultimately leading to election failure or startup failure.

Suggestion

Firstly, how can we determine that a cluster's leader election failure or startup failure is caused by an unreasonable jute.maxbuffer setting?

1.  When the checkLength method fails, an exception is thrown whose message contains the keyword Unreasonable length; at the same time, the follower disconnects from the leader and logs Exception when following the leader. You can quickly confirm the cause by searching the logs for these keywords.


2.  Newer versions of ZooKeeper provide the last_proposal_size metric. You can use it to monitor the size of proposals in the cluster; when a proposal approaches or exceeds the value of jute.maxbuffer, it is time to investigate.

How can we set jute.maxbuffer correctly?

  1. The official documentation recommends setting jute.maxbuffer consistently on the client and the server; mismatched values lead to unexpected checkLength failures.
  2. The official documentation also recommends against setting jute.maxbuffer too large, because large znodes can cause synchronization timeouts between servers, and oversized requests are intercepted when they reach the server anyway.
  3. In practice, the value should not be too small either: if jute.maxbuffer is undersized, the server can remain unavailable until you raise the value and restart it.
  4. Lower versions of Dubbo have a duplicate-registration problem; enough duplicates can trip the default 1 MB threshold. With a single Dubbo node's registered path estimated at 670 bytes, the default threshold accommodates at most 1,565 duplicate registrations, so the business side must avoid duplicate registration. In short, when you use ZooKeeper, configure jute.maxbuffer to a sensible value, and bear in mind the case where a single session creates too many ephemeral nodes.
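The Dubbo arithmetic above is easy to check. The sketch below uses a hypothetical RegistrationBudget helper; the 670-byte per-registration size is the estimate quoted in this article, not a measured constant:

```java
// Hypothetical helper: how many registrations of a given size fit under a
// jute.maxbuffer-style limit before checkLength would trip.
public class RegistrationBudget {

    static int maxRegistrations(int limitBytes, int bytesPerRegistration) {
        return limitBytes / bytesPerRegistration;
    }

    public static void main(String[] args) {
        int defaultLimit = 0xfffff;  // 1,048,575 bytes: the default jute.maxbuffer
        int dubboPathSize = 670;     // per-registration size assumed above
        // 1,048,575 / 670 = 1,565, matching the figure quoted in the text.
        System.out.println(maxRegistrations(defaultLimit, dubboPathSize));
    }
}
```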

In MSE ZooKeeper, you can quickly modify the jute.maxbuffer parameter on the console:


Set MSE ZooKeeper leader election time alerts and node unavailability alerts. First, go to the Alert management page and create a new alert:


Select ZooKeeper Professional Edition for the Alert Contact Group, choose leader election time for the Alert Metric, and then set the threshold:


POD status alert: select ZooKeeper Professional Edition for the Alert Contact Group, and choose ZooKeeper single POD status for the Alert Metric.


Configure alerts for node unavailability and leader election time to detect and troubleshoot problems promptly.
