You can use the Hadoop Distributed File System (HDFS) balancer to analyze the distribution of data blocks and redistribute data that is stored in DataNodes. This topic describes how to use the HDFS balancer and configure the tuning parameters of the balancer.
Background information
HDFS adopts a master-slave architecture. The NameNode manages the metadata of the file system, such as file names, block information, and file locations. Actual data blocks are stored in multiple DataNodes. The master-slave architecture allows data to be stored in different locations, which improves the fault tolerance of the file system.
As files are added, removed, or modified, data distribution among DataNodes may become uneven. For example, the storage space of some DataNodes may be nearly full while the storage space of other DataNodes remains largely idle. Uneven data distribution reduces the storage efficiency of the file system and increases the risk of data loss, because DataNodes that store a large amount of data are more susceptible to hardware failures.
To resolve the preceding issues, HDFS provides the balancer tool, which is a command line tool used to rebalance data distribution among DataNodes. The HDFS balancer moves data blocks between DataNodes to rebalance data distribution. This ensures that the storage resources of a cluster can be used in an efficient manner.
View the capacity and the storage space usage of DataNodes
You can view the capacity and the storage space usage of DataNodes to learn the allocation of storage resources, identify and resolve issues about insufficient storage at the earliest opportunity, and make sure that data is evenly distributed among nodes. This improves the overall performance and stability of the system.
Log on to the master node of the cluster that you want to manage. For more information, see Log on to a cluster.
Run the following command to view the capacity and the storage space usage of each DataNode:
hdfs dfsadmin -report
The output includes information such as the total capacity, used storage space, storage space usage, and remaining storage space of each DataNode. This helps you identify storage imbalance issues.
If data distribution is extremely uneven, you can start the HDFS balancer. For example, this is the case if the storage space usage of specific DataNodes is much higher than that of other nodes and the difference exceeds the balance threshold, which is 10% by default.
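To decide whether balancing is needed, you can compare each node's usage with the cluster average. The following sketch uses hypothetical usage percentages (substitute the DFS Used% figures reported by hdfs dfsadmin -report for your own cluster):

```shell
#!/bin/sh
# Hypothetical per-DataNode usage percentages; replace with the
# "DFS Used%" values from your own `hdfs dfsadmin -report` output.
usages="93 41 38 52"

# Compute the cluster-wide average usage.
total=0; count=0
for u in $usages; do
  total=$((total + u)); count=$((count + 1))
done
avg=$((total / count))
echo "average usage: ${avg}%"

# Flag nodes whose usage deviates from the average by more than
# the default 10% balancer threshold.
threshold=10
for u in $usages; do
  diff=$((u - avg)); [ $diff -lt 0 ] && diff=$((-diff))
  if [ $diff -gt $threshold ]; then
    echo "node at ${u}% deviates by ${diff}% -> balancing recommended"
  fi
done
```

With these sample numbers, the node at 93% deviates from the 56% average by far more than the threshold, so running the balancer would be warranted.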
Use the HDFS balancer
Method 1: Run the hdfs balancer command
The following shows the syntax of the hdfs balancer command:
hdfs balancer
[-threshold <threshold>]
[-policy <policy>]
[-exclude [-f <hosts-file> | <comma-separated list of hosts>]]
[-include [-f <hosts-file> | <comma-separated list of hosts>]]
[-source [-f <hosts-file> | <comma-separated list of hosts>]]
[-blockpools <comma-separated list of blockpool ids>]
[-idleiterations <idleiterations>]
The following table describes the parameters of the HDFS balancer.
| Parameter | Description |
| --- | --- |
| threshold | The threshold of the disk usage difference, in percent. Default value: 10%. This value ensures that the disk usage of each DataNode differs from the overall disk usage of the cluster by no more than 10%. If the overall disk usage of the cluster is high, set this parameter to a smaller value. If you have added a large number of nodes to the cluster, you can set this parameter to a larger value to move data from high-usage nodes to low-usage nodes. |
| policy | The balancing policy. Valid values: datanode (the default, which balances storage at the DataNode level) and blockpool (which balances storage at the block pool level and therefore also at the DataNode level). |
| exclude | Excludes the specified DataNodes from the balancing operation. |
| include | Specifies the DataNodes on which you want to perform the balancing operation. |
| source | Specifies the DataNodes that serve as source nodes. |
| blockpools | The block pools in which you want to run the HDFS balancer. |
| idleiterations | The maximum number of idle iterations that are allowed before the balancer exits. Default value: 5. |
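A hypothetical invocation that combines several of these flags is shown below. The host names are placeholders; it balances only two newly added DataNodes at a 5% threshold:

```shell
# Balance only two newly added DataNodes (placeholder host names)
# at a 5% threshold, and exit after 10 idle iterations.
hdfs balancer \
  -threshold 5 \
  -policy datanode \
  -include dn-new-1.example.com,dn-new-2.example.com \
  -idleiterations 10
```

This example assumes a running HDFS cluster; adjust the host list and threshold to your environment.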
Method 2: Use the start-balancer.sh tool
You can use the start-balancer.sh tool, which starts the HDFS balancer as a background daemon. To use the start-balancer.sh tool, perform the following operations:
Log on to a node of the cluster that you want to manage. For more information, see Log on to a cluster.
Optional. Run the following command to modify the maximum bandwidth of the HDFS balancer:
hdfs dfsadmin -setBalancerBandwidth <bandwidth in bytes per second>
Note: <bandwidth in bytes per second> specifies the maximum bandwidth, in bytes per second. For example, to configure a maximum bandwidth of 200 MB/s, set <bandwidth in bytes per second> to 209715200, which is calculated based on the following formula: 200 × 1024 × 1024. The complete command is hdfs dfsadmin -setBalancerBandwidth 209715200. If the cluster is heavily loaded, we recommend that you specify a small value for the maximum bandwidth to ensure the continuity of core business. For example, you can set the value to 20971520, which indicates 20 MB/s. If the cluster is idle, we recommend that you specify a large value to make full use of network resources and accelerate the data balancing process. For example, you can set the value to 1073741824, which indicates 1 GB/s.
Run the following commands to switch to the hdfs user and run the HDFS balancer:
DataLake cluster
su hdfs /opt/apps/HDFS/hdfs-current/sbin/start-balancer.sh -threshold 5
Hadoop cluster
su hdfs /usr/lib/hadoop-current/sbin/start-balancer.sh -threshold 5
Note: The -threshold parameter specifies the threshold for data balancing. If you set the threshold to 5%, the HDFS balancer considers data to be evenly distributed, and no longer moves data blocks from or to a DataNode, when the difference between the storage usage of the DataNode and the average storage usage of the cluster is less than or equal to 5%. You can configure this parameter based on your business requirements to achieve the expected balance effect.
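The bandwidth figures recommended in the preceding steps can be sanity-checked with simple arithmetic; this sketch converts MB/s figures into the bytes-per-second values that hdfs dfsadmin -setBalancerBandwidth expects:

```shell
#!/bin/sh
# Convert bandwidth figures (MB/s) into the bytes-per-second values
# expected by `hdfs dfsadmin -setBalancerBandwidth`.
mb=$((1024 * 1024))

low=$((20 * mb))        # 20 MB/s for a heavily loaded cluster
default=$((200 * mb))   # 200 MB/s example from the Note above
high=$((1024 * mb))     # 1 GB/s for an idle cluster

echo "hdfs dfsadmin -setBalancerBandwidth ${default}"
```

Running this prints the complete command for the 200 MB/s case; the low and high values match the 20971520 and 1073741824 figures quoted earlier.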
Run the following command to check the status of the HDFS balancer:
DataLake cluster
tail -f /var/log/emr/hadoop-hdfs/hadoop-hdfs-balancer-master-1-1.c-xxx.log
Hadoop cluster
tail -f /var/log/hadoop-hdfs/hadoop-hdfs-balancer-emr-header-1.cluster-xxx.log
Note: hadoop-hdfs-balancer-master-1-1.c-xxx.log and hadoop-hdfs-balancer-emr-header-1.cluster-xxx.log in the commands are the log file names obtained in the previous step.
If the command output includes Successfully, the HDFS balancer is running as expected.
Tuning parameters of the HDFS balancer
The HDFS balancer consumes system resources. We recommend that you use the HDFS balancer during off-peak hours. By default, you do not need to modify the parameters of the HDFS balancer. If you want to modify the parameters of the HDFS balancer, go to the Configure tab of the HDFS service page in the E-MapReduce (EMR) console and click hdfs-site.xml. On the hdfs-site.xml tab, adjust the configurations of the client and DataNodes based on your business requirements.
Client configurations
| Parameter | Description |
| --- | --- |
| dfs.balancer.dispatcherThreads | The number of dispatcher threads used by the HDFS balancer to determine the blocks that need to be moved. Before the HDFS balancer moves a specific amount of data between two DataNodes, the balancer repeatedly retrieves block lists for moving blocks until the required amount of data is scheduled. Default value: 200. |
| dfs.balancer.rpc.per.sec | The number of remote procedure calls (RPCs) sent by dispatcher threads per second. Default value: 20. Before the HDFS balancer moves data between two DataNodes, the balancer uses dispatcher threads to repeatedly send the getBlocks() RPC to the NameNode, which results in a heavy load on the NameNode. To prevent this issue and balance the cluster load, we recommend that you configure this parameter to limit the number of RPCs sent per second. For example, you can decrease the value to 10 or 5 for a cluster that has a high load to minimize the impact on the overall moving progress. |
| dfs.balancer.getBlocks.size | The total data size of the blocks moved in each iteration. Default value: 2 GB. Before the HDFS balancer moves data between two DataNodes, the balancer repeatedly retrieves block lists for moving blocks until the required amount of data is scheduled. When the NameNode receives a getBlocks() RPC, the NameNode is locked. If an RPC queries a large number of blocks, the NameNode is locked for a long period of time, which slows down data writes. To prevent this issue, we recommend that you configure this parameter based on the NameNode load. |
| dfs.balancer.moverThreads | The total number of threads that are used to move blocks. Each block move requires a thread. Default value: 1000. |
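The client-side keys above are set in hdfs-site.xml. The following fragment is illustrative only: the first and last values are the defaults, and the middle two are hypothetical reductions for a heavily loaded NameNode.

```xml
<!-- Client-side balancer tuning in hdfs-site.xml (illustrative values). -->
<property>
  <name>dfs.balancer.dispatcherThreads</name>
  <value>200</value> <!-- default -->
</property>
<property>
  <name>dfs.balancer.rpc.per.sec</name>
  <value>10</value> <!-- lowered from the default 20 for a loaded NameNode -->
</property>
<property>
  <name>dfs.balancer.getBlocks.size</name>
  <value>1073741824</value> <!-- 1 GB in bytes, instead of the default 2 GB -->
</property>
<property>
  <name>dfs.balancer.moverThreads</name>
  <value>1000</value> <!-- default -->
</property>
```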
DataNode configurations
| Parameter | Description |
| --- | --- |
| dfs.datanode.balance.bandwidthPerSec | The bandwidth that each DataNode can use for the balancing operation. We recommend that you set the bandwidth to 100 MB/s. You can also run the dfsadmin -setBalancerBandwidth command to adjust the bandwidth without restarting DataNodes. For example, you can increase the bandwidth when the cluster load is low and decrease the bandwidth when the cluster load is high. |
| dfs.datanode.balance.max.concurrent.moves | The maximum number of concurrent block moves that are allowed on a DataNode. Default value: 5. You can configure this parameter based on the number of disks. We recommend that you set this parameter to 4 × Number of disks as the upper limit on a DataNode. For example, if a DataNode has 28 disks, set this parameter to 28 on the HDFS balancer and to 112 on the DataNode. You can adjust the value based on the cluster load: increase it when the cluster load is low and decrease it when the cluster load is high. |
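The sizing rule for dfs.datanode.balance.max.concurrent.moves can be sketched as arithmetic, using the 28-disk example from the table above:

```shell
#!/bin/sh
# Derive dfs.datanode.balance.max.concurrent.moves from the disk count,
# using the "4 x number of disks" upper bound described in the table.
disks=28
balancer_moves=$disks            # value used on the HDFS balancer side
datanode_moves=$((4 * disks))    # upper limit on the DataNode side

echo "balancer: ${balancer_moves}, datanode: ${datanode_moves}"
```

For 28 disks this reproduces the 28/112 split from the example; substitute your own disk count.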
FAQ
Q: Why is the difference approximately 20% after the balancing operation is performed even if the threshold parameter is set to 10%?
A: The threshold parameter only ensures that the usage of each DataNode does not deviate from the average usage of the cluster by more than the threshold. As a result, one DataNode can be up to 10% above the average while another is up to 10% below it, so the difference between the DataNodes that have the highest and the lowest usage may reach 20% after the balancing operation is performed. To reduce the difference, you can set the threshold parameter to 5%.
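The worst-case spread can be worked through with a hypothetical cluster average of 50%:

```shell
#!/bin/sh
# Worked example: with a 10% threshold, the balancer only guarantees that
# each node ends up within 10% of the cluster average, so two compliant
# nodes can still sit 20% apart (hypothetical 50% average usage).
avg=50
threshold=10
highest=$((avg + threshold))   # a node at 60% still satisfies the threshold
lowest=$((avg - threshold))    # a node at 40% still satisfies the threshold
spread=$((highest - lowest))
echo "worst-case spread: ${spread}%"
```

Halving the threshold to 5% halves this worst-case spread to 10%.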