ZooKeeper Practice: Exhausted Disk Space and Unavailable Service Caused by Inappropriate snapCount Setting

By Zikui

Background

In day-to-day ZooKeeper operations, disk capacity on the nodes is a recurring problem. An excessively high TPS or an inappropriate cleanup strategy lets data files and log files accumulate in the cluster until the disk fills up and the server goes down. Recently, an online check revealed that a user's cluster experienced a sudden surge in TPS within a certain period.

The surge produced a large number of snapshot and transaction log files on the disk.

Ultimately, the disk space was exhausted and the node service became unavailable.

This article explores the best practice for solving ZooKeeper disk issues by delving into the mechanism of ZooKeeper data file generation and the parameters related to data file generation in ZooKeeper.

Analysis

What Data Files Are Generated in ZooKeeper

First, we need to explore the basic principle of data persistence in ZooKeeper. To ensure that no data change is lost, ZooKeeper uses a state machine to record and restore data.

Simply put, ZooKeeper has a large map that stores all Znodes, where the key is the path of Znodes and the value is the data, ACL, and states on Znodes. ZooKeeper serializes the large map in memory at a certain time point to obtain a snapshot of the data in memory at this time.

At the same time, another file is used to store the modification operations on the data state in this snapshot after this time point.

When a ZooKeeper node is restarted, data is restored by using existing snapshots and transaction logs.

This method can ensure that the changed data in memory will not be lost, and the write performance will not be significantly impacted. Even if a node is down, the data can be completely restored.
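The snapshot-plus-log idea described above can be illustrated with a toy model. This is a minimal sketch, not ZooKeeper's implementation: the class SnapshotRecoverySketch, its Txn and Snapshot types, and the use of a plain map of path to string are all hypothetical simplifications chosen for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class SnapshotRecoverySketch {
    /** One logged change; a null value means "delete this path". */
    public static final class Txn {
        final long zxid;
        final String path;
        final String value;
        Txn(long zxid, String path, String value) {
            this.zxid = zxid;
            this.path = path;
            this.value = value;
        }
    }

    /** Copy of the whole map, tagged with the last transaction it contains. */
    public static final class Snapshot {
        final long lastZxid;
        final Map<String, String> data;
        Snapshot(long lastZxid, Map<String, String> data) {
            this.lastZxid = lastZxid;
            this.data = data;
        }
    }

    private final Map<String, String> tree = new TreeMap<>();
    private final List<Txn> txnLog = new ArrayList<>();
    private long lastZxid = 0;

    /** Appends the change to the log, then applies it to the in-memory map. */
    public void apply(String path, String value) {
        Txn txn = new Txn(++lastZxid, path, value);
        txnLog.add(txn);
        applyTo(tree, txn);
    }

    public Snapshot takeSnapshot() {
        return new Snapshot(lastZxid, new TreeMap<>(tree));
    }

    public List<Txn> log() {
        return new ArrayList<>(txnLog);
    }

    public Map<String, String> current() {
        return new TreeMap<>(tree);
    }

    /** Rebuilds the map from a snapshot plus every logged change newer than it. */
    public static Map<String, String> restore(Snapshot snapshot, List<Txn> log) {
        Map<String, String> restored = new TreeMap<>(snapshot.data);
        for (Txn txn : log) {
            if (txn.zxid > snapshot.lastZxid) {
                applyTo(restored, txn);
            }
        }
        return restored;
    }

    private static void applyTo(Map<String, String> target, Txn txn) {
        if (txn.value == null) {
            target.remove(txn.path);
        } else {
            target.put(txn.path, txn.value);
        }
    }

    public static void main(String[] args) {
        SnapshotRecoverySketch server = new SnapshotRecoverySketch();
        server.apply("/app", "v1");
        server.apply("/app/config", "a=1");
        Snapshot snapshot = server.takeSnapshot();   // point-in-time copy
        server.apply("/app/config", "a=2");          // later changes live only in the log
        server.apply("/app", null);

        Map<String, String> restored = restore(snapshot, server.log());
        System.out.println("restored state matches live state: "
                + restored.equals(server.current()));
    }
}
```

The sketch also hints at the trade-off discussed later: the longer the interval between snapshots, the more log entries restore has to replay.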

From this point of view, there are mainly two types of data files generated by ZooKeeper: snapshot files of data in memory and transaction log files that store changes. These two types of files are specified respectively by dataDir and dataLogDir in the configuration file, and they are also the two types that occupy a large amount of disk space in ZooKeeper.
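To see how much space each file type takes on a node, the two directories can be summed by file-name prefix. The following is a rough sketch that assumes the usual on-disk layout, in which snapshot files are named snapshot.<zxid> and transaction logs log.<zxid> inside a version-2 subdirectory. The class name DataDirUsage and the default path are hypothetical; pass the real directories as arguments.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class DataDirUsage {
    /** Sums the sizes of regular files in dir whose names start with prefix. */
    public static long bytesWithPrefix(Path dir, String prefix) throws IOException {
        long total = 0;
        // The glob "snapshot.*" or "log.*" matches on the file name only.
        try (DirectoryStream<Path> files = Files.newDirectoryStream(dir, prefix + "*")) {
            for (Path file : files) {
                if (Files.isRegularFile(file)) {
                    total += Files.size(file);
                }
            }
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical default; use <dataDir>/version-2 and <dataLogDir>/version-2.
        Path snapDir = Paths.get(args.length > 0 ? args[0] : "/data/zookeeper/version-2");
        // Without a separate dataLogDir, transaction logs share the snapshot directory.
        Path logDir = args.length > 1 ? Paths.get(args[1]) : snapDir;

        if (!Files.isDirectory(snapDir) || !Files.isDirectory(logDir)) {
            System.err.println("Directory not found; pass the version-2 directories as arguments.");
            return;
        }
        System.out.println("snapshot bytes: " + bytesWithPrefix(snapDir, "snapshot."));
        System.out.println("txn log bytes:  " + bytesWithPrefix(logDir, "log."));
    }
}
```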

When ZooKeeper stores an excessive amount of data or the data changes very frequently, the serialized snapshot files and the transaction log files generated by data changes can become numerous and large. If appropriate cleanup strategies and parameters are not configured, the resulting disk issues can bring cluster nodes down and even make the service unavailable.

public class ZooKeeperServer implements SessionExpirer, ServerStats.Provider {
    ...
    public void takeSnapshot(boolean syncSnap) {
        long start = Time.currentElapsedTime();
        try {
            txnLogFactory.save(zkDb.getDataTree(), zkDb.getSessionWithTimeOuts(), syncSnap);
        } catch (IOException e) {
            LOG.error("Severe unrecoverable error, exiting", e);
            // This is a severe error that we cannot recover from,
            // so we need to exit
            ServiceUtils.requestSystemExit(ExitCode.TXNLOG_ERROR_TAKING_SNAPSHOT.getValue());
        }
            ...
    }
    ...
}

As the takeSnapshot code above shows, if the disk is full when the server tries to write the in-memory data to disk, an IOException is thrown and the server process exits immediately.

For these problems, ZooKeeper officially provides some parameters, so you can effectively reduce the stress on disk by configuring appropriate parameters. Next, let's analyze the relevant parameters in detail.

• autopurge.snapRetainCount: (No Java system property) New in 3.4.0

• autopurge.purgeInterval: (No Java system property) New in 3.4.0

First of all, ZooKeeper officially supports the ability to regularly clean up data files. Through autopurge.snapRetainCount and autopurge.purgeInterval, you can specify the cleanup interval and the number of data files that should be retained during the cleanup.

Note that to enable this feature in ZooKeeper, you must set autopurge.purgeInterval to an integer greater than 0. The value is the interval, in hours, at which the cleanup task runs.

In general, configuring these two parameters to enable the scheduled cleanup can greatly reduce the stress on the disk capacity. However, the minimum interval for the scheduled cleanup tasks is 1 hour, which cannot avoid the disk issue in some special scenarios.

For example, a large number of data change requests during the interval between two disk cleanups will generate a large number of transaction log files and snapshot files. Since the cleanup event interval has not been reached, the data accumulates and eventually fills up the disk before the next disk cleanup. To better understand other parameters, let's first explore when ZooKeeper generates snapshot files and transaction log files.
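A back-of-the-envelope calculation shows how quickly files can pile up inside one purge interval. The sketch below is only an estimate under stated assumptions: the TPS, snapCount, and snapshot size are made-up example figures, a snapshot is assumed to be taken every snapCount/2 to snapCount transactions, and transaction log files (which add further space) are ignored. The class name SnapshotAccumulationEstimate is hypothetical.

```java
public class SnapshotAccumulationEstimate {
    /** Snapshots written in the window if one is taken every txnsPerSnapshot transactions. */
    public static long snapshotsInWindow(long tps, long windowSeconds, long txnsPerSnapshot) {
        return (tps * windowSeconds) / txnsPerSnapshot;
    }

    public static void main(String[] args) {
        long tps = 10_000;                          // assumed sustained write TPS
        long snapCount = 100_000;                   // assumed snapCount setting
        long snapshotBytes = 100L * 1024 * 1024;    // assumed size of one snapshot (100 MiB)
        long windowSeconds = 3_600;                 // one hour, the shortest purge interval

        // The trigger threshold lies between snapCount/2 and snapCount transactions.
        long fewest = snapshotsInWindow(tps, windowSeconds, snapCount);
        long most = snapshotsInWindow(tps, windowSeconds, snapCount / 2);

        double gib = 1024.0 * 1024 * 1024;
        System.out.printf("snapshots per hour: %d to %d%n", fewest, most);
        System.out.printf("snapshot space per hour: %.1f to %.1f GiB%n",
                fewest * snapshotBytes / gib, most * snapshotBytes / gib);
    }
}
```

With these example figures, between 360 and 720 snapshots are written before the next cleanup can run, which is tens of GiB from snapshots alone.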

When ZooKeeper Generates These Data Files

Tracing the callers of ZooKeeperServer.takeSnapshot in the code shows that a new snapshot file is generated in the following situations:

• The node has been started and the data files have been loaded.

• A new leader is elected in the instance.

• A certain condition is met in SyncRequestProcessor (specified by the shouldSnapshot method).

Since the first two cases occur infrequently, let's focus on the third case. The takeSnapshot method is called in the run method of SyncRequestProcessor, and before that, the shouldSnapshot method is called to decide whether a snapshot is due:

private boolean shouldSnapshot() {
    int logCount = zks.getZKDatabase().getTxnCount();
    long logSize = zks.getZKDatabase().getTxnSize();
    return (logCount > (snapCount / 2 + randRoll))
        || (snapSizeInBytes > 0 && logSize > (snapSizeInBytes / 2 + randSize));
}

Here logCount is the number of transactions written to the current transaction log file and logSize is the size of that file in bytes, while snapCount and snapSizeInBytes are the values configured through Java system properties.

randRoll is a random number (0 <= randRoll < snapCount / 2), and randSize is also a random number (0 <= randSize < snapSizeInBytes / 2). The logic of this condition can therefore be summarized as follows:

A snapshot is written when the number of transactions in the current transaction log file exceeds a threshold derived from snapCount and randomized at runtime (snapCount/2 <= threshold < snapCount), or when the size of the current transaction log file exceeds a threshold derived from snapSizeInBytes and randomized at runtime (snapSizeInBytes/2 <= threshold < snapSizeInBytes). Moreover, from the code, it can be seen that before the snapshot is taken, the current transaction log file is rolled: it is flushed to disk and a new transaction log file is started.

public void run() {
    if (shouldSnapshot()) {
        resetSnapshotStats();
        // Roll transaction log files
        zks.getZKDatabase().rollLog();
        // take a snapshot
        if (!snapThreadMutex.tryAcquire()) {
            LOG.warn("Too busy to snap, skipping");
        } else {
            new ZooKeeperThread("Snapshot Thread") {
                public void run() {
                    try {
                        zks.takeSnapshot();
                    } catch (Exception e) {
                        LOG.warn("Unexpected exception", e);
                    } finally {
                        snapThreadMutex.release();
                    }
                }
            }.start();
        }
    }
}
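The randomized threshold can be tried out in isolation. The class below is a standalone sketch that mirrors the shouldSnapshot condition shown earlier; it is not the ZooKeeper source. The class name SnapshotTriggerSketch is hypothetical, and the way randRoll and randSize are drawn here is an assumption consistent with the ranges described above.

```java
import java.util.concurrent.ThreadLocalRandom;

public class SnapshotTriggerSketch {
    private final int snapCount;
    private final long snapSizeInBytes;   // 0 or less disables the size-based trigger
    private int randRoll;
    private long randSize;

    public SnapshotTriggerSketch(int snapCount, long snapSizeInBytes) {
        if (snapCount < 2) {
            throw new IllegalArgumentException("snapCount must be at least 2");
        }
        this.snapCount = snapCount;
        this.snapSizeInBytes = snapSizeInBytes;
        resetSnapshotStats();
    }

    /** Draws new random offsets; intended to be called each time a snapshot is triggered. */
    public void resetSnapshotStats() {
        randRoll = ThreadLocalRandom.current().nextInt(snapCount / 2);
        randSize = snapSizeInBytes > 1
                ? ThreadLocalRandom.current().nextLong(snapSizeInBytes / 2)
                : 0;
    }

    /** Same condition as the shouldSnapshot method quoted in the article. */
    public boolean shouldSnapshot(int logCount, long logSize) {
        return logCount > (snapCount / 2 + randRoll)
                || (snapSizeInBytes > 0 && logSize > (snapSizeInBytes / 2 + randSize));
    }

    public static void main(String[] args) {
        // Example values only: snapCount 100,000 with the size-based trigger disabled.
        SnapshotTriggerSketch trigger = new SnapshotTriggerSketch(100_000, 0);
        for (int round = 1; round <= 3; round++) {
            int logCount = 0;
            while (!trigger.shouldSnapshot(logCount, 0)) {
                logCount++;                  // one more transaction appended to the log
            }
            System.out.println("round " + round + ": snapshot after " + logCount + " transactions");
            trigger.resetSnapshotStats();    // new random threshold for the next log file
        }
    }
}
```

Each round prints a different count between roughly snapCount/2 and snapCount, which is why the number of transactions per log file varies from one file to the next.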

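The two variables snapCount and snapSizeInBytes can be specified in the ZooKeeper configuration file. snapSizeInBytes is not set directly: it is derived from the snapSizeLimitInKb parameter, which is given in kilobytes.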

• snapCount : (Java system property: zookeeper.snapCount)

• snapSizeLimitInKb : (Java system property: zookeeper.snapSizeLimitInKb)

It can be seen that the frequency of generating a snapshot file can be controlled by configuring snapCount and snapSizeLimitInKb in the configuration file.
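How the two settings turn into the values used by shouldSnapshot can be sketched as follows. This is an illustration rather than the ZooKeeper source: the class name SnapshotSettings is hypothetical, and the fallback values (100,000 transactions and 4,194,304 KB) are assumptions taken from the defaults listed in the ZooKeeper administrator's guide, so check them against the version you run.

```java
public class SnapshotSettings {
    /** Reads zookeeper.snapCount; values below 2 are raised to 2 in this sketch. */
    public static int snapCount() {
        int value = Integer.getInteger("zookeeper.snapCount", 100_000);
        return Math.max(value, 2);
    }

    /** Reads zookeeper.snapSizeLimitInKb and converts kilobytes to bytes. */
    public static long snapSizeInBytes() {
        long kb = Long.getLong("zookeeper.snapSizeLimitInKb", 4_194_304L);
        return kb * 1024;
    }

    public static void main(String[] args) {
        // Try: java -Dzookeeper.snapCount=500000 SnapshotSettings
        System.out.println("snapCount       = " + snapCount());
        System.out.println("snapSizeInBytes = " + snapSizeInBytes());
    }
}
```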

Suggestion

From the above analysis, we know that the generation and cleanup of ZooKeeper's data files can be configured through 4 parameters.

• autopurge.snapRetainCount : (No Java system property) New in 3.4.0

• autopurge.purgeInterval : (No Java system property) New in 3.4.0

• snapCount : (Java system property: zookeeper.snapCount)

• snapSizeLimitInKb : (Java system property: zookeeper.snapSizeLimitInKb)

The first two configuration items specify the scheduled cleanup policy of ZooKeeper while the last two parameters set the frequency of generating snapshot files in ZooKeeper.

According to the previous analysis, increasing snapCount and snapSizeLimitInKb reduces how often snapshots are generated. However, values that are too large slow down data loading when a node restarts, because more transactions must be replayed from the transaction log on top of the last snapshot. This increases the service recovery time and the business risk.

The smallest value that enables autopurge.purgeInterval is 1, so cleanup can run at most once per hour, which cannot mitigate the risk in high-frequency write scenarios.

For the ZooKeeper disk issues analyzed above, MSE ZooKeeper has set appropriate parameters, and users can conveniently change related parameters to adapt to business scenarios. In addition to ZooKeeper's data cleanup mechanism, MSE ZooKeeper provides additional safeguard measures to ensure that the cluster is not affected by disk issues.
