Use the Rebalance function to trigger shard rebalancing

Hologres V2.0.21 and later support the Rebalance function, which triggers shard rebalancing. This topic describes how to use the function.

Background information

In normal cases, the metadata of shards is evenly loaded on worker nodes while a Hologres instance is running. In some scenarios, such as after fast recovery is triggered, shards can become unevenly loaded on worker nodes. Shard rebalancing is then required to spread the shard metadata evenly across worker nodes again.

Limits

Only Hologres V2.0.21 and later support the Rebalance function. If the version of your Hologres instance is earlier than V2.0.21, we recommend that you manually upgrade your Hologres instance or join the Hologres DingTalk group to apply for an upgrade. For more information about how to manually upgrade a Hologres instance, see Instance upgrades. For more information about how to join a Hologres DingTalk group, see Obtain online support for Hologres.

Syntax

The syntax for triggering shard rebalancing differs based on the instance type.

General-purpose instances and read-only secondary instances

The Rebalance function is used to trigger rebalancing of shards on worker nodes in general-purpose instances and read-only secondary instances. Syntax:

SELECT hg_rebalance_instance();

Returned result:
- true: Shard rebalancing is triggered and the system starts to rebalance shards.
- false: Shard rebalancing is not required.
- Error: Shard rebalancing failed. For example, if a pod is faulty, an error is returned when you try to trigger shard rebalancing.
The system continues shard rebalancing until the number of shards loaded on each worker node is roughly the same. For example, rebalancing stops when the difference in the number of shards between worker nodes is less than or equal to 1. Examples:
- If you configure two worker nodes and two shards, one shard is loaded on each worker node.
- If you configure two worker nodes and three shards, one shard is loaded on one worker node, and two shards are loaded on the other worker node.
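The balance condition in these examples comes down to simple arithmetic. The following Python sketch is an illustration only and is not part of Hologres. The function name `balanced_shard_counts` is hypothetical, and the code only computes one set of per-worker shard counts that satisfies the "difference of at most 1" condition. It does not model how Hologres actually assigns shards to worker nodes.

```python
from typing import List


def balanced_shard_counts(num_workers: int, num_shards: int) -> List[int]:
    """Return per-worker shard counts that differ by at most 1.

    Hypothetical helper for illustration; not a Hologres API.
    """
    if num_workers <= 0:
        raise ValueError("num_workers must be positive")
    if num_shards < 0:
        raise ValueError("num_shards must not be negative")
    # Every worker gets `base` shards; `extra` shards are left over.
    base, extra = divmod(num_shards, num_workers)
    # The first `extra` workers each carry one additional shard.
    return [base + 1 if i < extra else base for i in range(num_workers)]


print(balanced_shard_counts(2, 2))  # [1, 1]
print(balanced_shard_counts(2, 3))  # [2, 1]
```

The two calls reproduce the two examples above: two shards on two worker nodes give one shard each, and three shards on two worker nodes give a two-and-one split.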
The shard rebalancing process takes approximately 2 to 3 minutes. The actual duration depends on the number of table groups in the Hologres instance: the more table groups the instance has, the longer shard rebalancing takes. During shard rebalancing, data writes are interrupted for approximately 15 seconds.
The shard rebalancing process is asynchronous. You can execute the following SQL statement to query the shard rebalancing progress:

SELECT hg_get_rebalance_instance_status();

Returned result:
- DOING: Shard rebalancing is being performed.
- DONE: Shard rebalancing is complete.
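Because the process is asynchronous, a client typically triggers rebalancing and then polls the status until it reports DONE. The following Python sketch shows one way to do this, under several assumptions. `run_scalar` is a hypothetical callable that you supply; it executes one SQL statement and returns the single value of the result. The code assumes that the driver maps the `true`/`false` result to a Python boolean. The polling interval, the timeout, and the `NOT_REQUIRED` label are illustrative choices, not Hologres behavior.

```python
import time
from typing import Callable

# SQL statements taken from this topic.
REBALANCE_SQL = "SELECT hg_rebalance_instance();"
STATUS_SQL = "SELECT hg_get_rebalance_instance_status();"


def rebalance_and_wait(
    run_scalar: Callable[[str], object],
    poll_seconds: float = 10.0,
    timeout_seconds: float = 600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Trigger shard rebalancing and poll until the status is DONE.

    `run_scalar` is a hypothetical helper: it runs one SQL statement and
    returns the single value of the result. Errors raised by the driver
    (for example, when rebalancing fails) propagate to the caller.
    """
    triggered = run_scalar(REBALANCE_SQL)
    if not triggered:
        # false: shard rebalancing is not required.
        # "NOT_REQUIRED" is a local label, not a Hologres status value.
        return "NOT_REQUIRED"

    waited = 0.0
    while waited <= timeout_seconds:
        status = run_scalar(STATUS_SQL)
        if status == "DONE":
            return "DONE"
        # Status is DOING (or something unexpected): wait and check again.
        sleep(poll_seconds)
        waited += poll_seconds
    raise TimeoutError("shard rebalancing did not finish within the timeout")
```

With a PostgreSQL-compatible driver such as psycopg2, `run_scalar` could be a small wrapper that calls `cursor.execute(sql)` and returns `cursor.fetchone()[0]`. The default timeout of 10 minutes is an assumption chosen to leave headroom over the typical duration of a few minutes.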

FAQ

How do I detect and troubleshoot the uneven distribution of shards?

In normal cases, the load of each worker node is balanced. However, if no shard is loaded on a worker node, the load of the worker node is much lower than the load of other worker nodes.

Example of even shard distribution: In this example, data queries and writes are performed on a Hologres instance that is configured with 10 worker nodes. The metric data shown in the following figure indicates that the CPU utilization of each worker node is roughly the same.
Example of uneven shard distribution: In this example, no shard is loaded on one worker node. The metric data shown in the following figure indicates that the CPU utilization of the worker node on which no shard is loaded is much lower than the CPU utilization of the other worker nodes.
You can execute the following SQL statement to check whether shards are loaded on each worker node:
```
SELECT DISTINCT worker_id FROM hologres.hg_worker_info;
```
- The following figure shows the returned result.
- The returned result indicates that shards are loaded on nine worker nodes, and no shard is loaded on one worker node.
- Solution: Perform shard rebalancing. The metric data indicates that after shard rebalancing is complete, the CPU utilization of the previously idle worker node rises to roughly the same level as that of the other worker nodes.
- Check the shard loading of each worker node again. The check result indicates that shards are loaded on all 10 worker nodes.
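The check above compares the number of distinct `worker_id` values returned by the query with the number of worker nodes configured for the instance. A minimal Python sketch of that comparison follows. The function name `workers_without_shards` and the worker IDs are hypothetical placeholders, and the total number of worker nodes is assumed to be known from the instance configuration.

```python
from typing import Iterable


def workers_without_shards(loaded_worker_ids: Iterable[str], total_workers: int) -> int:
    """Return how many worker nodes have no shard loaded.

    `loaded_worker_ids` holds the worker_id values returned by
    SELECT DISTINCT worker_id FROM hologres.hg_worker_info;
    """
    loaded = len(set(loaded_worker_ids))  # deduplicate defensively
    if loaded > total_workers:
        raise ValueError("more worker IDs returned than worker nodes configured")
    return total_workers - loaded


# Hypothetical worker IDs standing in for the nine rows in the example.
rows = [f"worker-{i}" for i in range(1, 10)]
print(workers_without_shards(rows, total_workers=10))  # 1
```

A result greater than 0 suggests that at least one worker node holds no shard and that shard rebalancing may be worth triggering.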