Elastic High Performance Computing (E-HPC) provides an auto scaling feature that dynamically allocates compute nodes based on the configured auto scaling policy. The system automatically adds or removes compute nodes in response to real-time workloads to improve cluster availability and reduce costs. This topic describes how to configure auto scaling.
Benefits
Adds compute nodes based on the real-time workloads of your cluster to improve cluster availability.
Reduces the number of compute nodes to save costs without compromising cluster availability.
Stops faulty nodes and creates replacement nodes to improve fault tolerance.
Limits
You can configure auto scaling only for clusters in which all nodes run Linux operating systems.
You can configure auto scaling only for clusters with PBS, Slurm, Deadline, or SGE schedulers.
E-HPC does not support auto scaling based on memory usage.
Important: To use auto scaling effectively, we recommend that you specify the number of required vCPUs when you submit a job. Note that the memory size that you specify for the job cannot exceed the memory capacity of the Elastic Compute Service (ECS) instances.
Usage notes
Before you use the auto scaling service, make sure that the scheduler service and the domain account service work as expected. After you enable auto scaling, the management node must be in the running state.
If you need to shut down or restart the management node, perform the operation only after idle nodes are released and no jobs are running on the compute nodes. We recommend that you disable auto scaling before you shut down or restart the management node, and enable auto scaling again after the management node restarts.
Procedure
Open the Auto Scale page.
Log on to the E-HPC console.
In the top navigation bar, select a region.
In the left-side navigation pane, choose Auto Scale.
From the Cluster drop-down list on the Auto Scale page, select the cluster for which you want to configure auto scaling.
In the Global Configurations section, configure the parameters. The following table describes the parameters that you can configure.
Parameter
Description
Enable Autoscale
Specifies whether to enable Auto Grow and Auto Shrink for all queues in the cluster.
Note: If the settings in the Queue Configuration section are different from the settings in the Global Configurations section, the settings in the Queue Configuration section take precedence.
Compute Nodes
The range for the number of compute nodes in the cluster after auto scaling. The upper limit is the sum of the maximum numbers of compute nodes configured for all queues in the cluster. The lower limit is the sum of the minimum numbers of compute nodes configured for all queues in the cluster.
Scale-in Time (Minute)
If the continuous idle duration of a compute node exceeds the scale-in duration, the node is released.
The continuous idle duration is the scale-in check interval multiplied by the number of consecutive idle times. By default, the check interval is 2 minutes. The number of consecutive idle times is the number of consecutive scale-in checks during which the compute node is idle.
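The release condition above can be sketched as follows. This is an illustrative sketch, not the E-HPC implementation; the function and parameter names are hypothetical, and the 2-minute check interval is the documented default.

```python
def should_release(consecutive_idle_checks: int,
                   scale_in_time_min: int,
                   check_interval_min: int = 2) -> bool:
    """Return True when a node's continuous idle duration exceeds the scale-in time.

    The continuous idle duration is the check interval multiplied by the
    number of consecutive scale-in checks during which the node was idle.
    """
    idle_duration = consecutive_idle_checks * check_interval_min
    return idle_duration > scale_in_time_min
```

For example, with the default 2-minute interval and a scale-in time of 4 minutes, a node that has been idle for 3 consecutive checks (6 minutes) is released, while a node idle for 2 checks (4 minutes) is not.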
Image Type
The image type of the compute nodes that you want to add to the cluster. Only the images that are compatible with the image of the existing compute nodes in the cluster are supported.
Exceptional Nodes
Select the nodes that you want to exclude from auto scaling.
If you want to retain a compute node, you can configure the node as an exceptional node. Then, the node is not released regardless of whether it is idle.
Hyper-threading
By default, Hyper-Threading (HT) is enabled for all ECS instances. For specific ECS instance types, you can disable HT for better performance. For more information, see Instance type limits and Disable HT for compute nodes.
In the Queue Configuration section, select a queue and click Edit to configure the parameters.
Parameter
Description
Auto Grow and Auto Shrink
Specifies whether to enable Auto Grow and Auto Shrink. By default, both switches are turned off.
Note: If the settings in the Queue Configuration section are different from the settings in the Global Configurations section, the settings in the Queue Configuration section take precedence.
Queue Compute Nodes
The range of the number of compute nodes in the queue.
Maximum Nodes: The maximum number of compute nodes ranges from 0 to 5000. This value limits how far the queue can scale out.
Minimum Nodes: The minimum number of compute nodes ranges from 0 to 1000. This value limits how far the queue can scale in.
Important: If you set the Minimum Nodes parameter to a non-zero value, the queue retains at least that number of nodes during cluster scale-in, and those idle nodes are not released. Specify this parameter with caution to avoid wasted resources and costs from idle nodes in the queue.
Prefix of Hostnames
The hostname prefix of the compute nodes. The prefix is used to distinguish between the nodes of different queues.
Maximum Nodes in Each Round of Scale-out
The maximum number of compute nodes that can be added in each round of scale-out. The default value 0 specifies that the maximum number of compute nodes that can be added in each round of scale-out is not limited.
We recommend that you configure this parameter to control your costs on compute nodes.
If you set this parameter to A and you want to add B nodes, nodes are added based on the following rules:
If B is less than or equal to A, B nodes are added.
If B is greater than A, A nodes are added.
Note: In addition to this parameter, the number of nodes in a cluster is also limited by the specified maximum number of nodes that can be added in a single queue and the specified maximum number of nodes that can be added in the cluster.
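The capping rule above can be expressed as a small sketch. The function name is illustrative, not part of the E-HPC API, and other queue-level and cluster-level node limits are ignored here.

```python
def nodes_added_per_round(requested: int, max_per_round: int) -> int:
    """Cap the number of nodes added in one round of scale-out.

    A cap of 0 means the per-round maximum is not limited.
    """
    if max_per_round == 0:
        return requested
    return min(requested, max_per_round)
```

For example, with the cap set to 10, a request for 25 nodes adds 10 nodes in that round, while a request for 5 nodes adds all 5.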
Minimum Scale-out Nodes in Each Round
The minimum number of compute nodes that must be added in each round of scale-out. The default value 1 specifies that at least one node must be added.
In specific scenarios, you may need to add at least a specific number of nodes to ensure that your business can run as expected. In this case, you can specify the minimum number of nodes that must be added in each round. If the number of available ECS instances is less than both the specified minimum number of nodes and the number of required nodes, the cluster is not scaled out, which avoids wasting resources.
If you set this parameter to A and you want to add B nodes, nodes are added in the following scenarios:
If B is less than or equal to A: if the number of available ECS instances is greater than or equal to B, B nodes are added. Otherwise, the cluster is not scaled out.
If B is greater than A: if the number of available ECS instances is greater than or equal to B, B nodes are added. If it is less than B but greater than or equal to A, A nodes are added. If it is less than A, the cluster is not scaled out.
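The scenarios above can be sketched as a single function. This is an illustrative sketch under the stated rules, not the E-HPC implementation; the names are hypothetical.

```python
def scale_out_count(requested: int, min_per_round: int, available: int) -> int:
    """Return the number of nodes a scale-out round adds.

    requested     -- B, the number of nodes the jobs need
    min_per_round -- A, the minimum that must be added in one round
    available     -- number of ECS instances currently available
    A return value of 0 means the cluster is not scaled out.
    """
    if available >= requested:
        return requested          # enough stock: add everything requested
    if requested > min_per_round and available >= min_per_round:
        return min_per_round      # partial round: add the minimum A
    return 0                      # below the minimum: do not scale out
```

For example, with a minimum of 5 per round, a request for 8 nodes adds 8 when 9 instances are available, 5 when only 6 are available, and nothing when only 4 are available.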
Automatic Configuration of the Minimum Node Number for Each Scale-out
If you turn on this switch, the minimum number of nodes for each scale-out is equal to the number of nodes required by the job. The minimum node number cannot be greater than 99.
Hostname Suffix
The suffix of the hostname. The suffix is used to distinguish between the nodes of different queues.
Image Type
The image type of the nodes that you want to add in a queue. You can specify different image types for different queues.
Image ID
The ID of the image to which the nodes that you want to add in a queue belong. You can specify different image IDs for different queues.
Note: This parameter is valid only for the current queue. If you do not specify an image type or image ID, the image type of the nodes to add is the same as the image type specified in the global configurations. If no image type is specified in the global configurations, the default image type of the cluster is used.
Whether instance types are unordered
If you turn on this switch, the system selects instance types in descending order of the number of instances in stock during auto scaling to ensure the delivery of resources.
Configuration List
Configure the compute nodes that you want to add. Each configuration list includes the following configurations:
Zone: a zone in the region where the cluster resides.
vSwitch ID: the vSwitch that is bound to the VPC of the cluster in the selected zone.
Instance Type: the instance type of the compute nodes that you want to add in a queue.
Note: If multiple instance types are configured for the queue, the cluster is scaled out based on the available instance types, task quantity, and GPU quantity in sequence. For example, suppose each node in a queue must have at least 16 cores to meet your business requirements, and the queue is configured with 8-core, 16-core, and 32-core instance types. ECS instances with 16 cores are automatically added to the queue. If no 16-core ECS instances are available, 32-core instances are added instead.
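The selection behavior in the note can be sketched as picking the smallest in-stock instance type that satisfies the core requirement. This is an illustrative sketch, not the actual E-HPC selection logic, and the instance type names below are hypothetical.

```python
def pick_instance_type(required_cores, candidates, in_stock):
    """Pick the smallest in-stock instance type that meets the core requirement.

    candidates -- iterable of (instance_type, cores) pairs configured for the queue
    in_stock   -- set of instance types with available capacity
    """
    for name, cores in sorted(candidates, key=lambda c: c[1]):
        if cores >= required_cores and name in in_stock:
            return name
    return None  # no suitable type available

# Hypothetical queue configuration with 8-, 16-, and 32-core types.
queue_types = [("typeA-8core", 8), ("typeB-16core", 16), ("typeC-32core", 32)]
```

With all three types in stock, a 16-core requirement selects the 16-core type; if the 16-core type is out of stock, the 32-core type is selected instead.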
Bid Strategy: the bidding method configured for the nodes that you want to add.
Maximum Price per Hour: You must set a maximum hourly price only when Bid Strategy is set to Preemptible instance with maximum bid price.
System Disk
The system disk of the compute nodes that you want to add.
Data Disk
The data disk that is attached to the compute nodes that you want to add. Configure the type, size, and performance level of the data disk, and specify whether to release the data disk with the compute nodes and whether to encrypt the data disk based on your business requirements.
In the upper-right corner of the page, read and select Alibaba Cloud International Website Product Terms of Service, and click OK.
Optional. View the auto scaling diagram of the cluster.
The auto scaling diagram shows the changes in the number of nodes over time during the auto scaling process based on the auto scaling policy that you configured. The diagram also shows the time consumed by node scale-in and scale-out at key points in time.
Note: You can specify the number of simulated concurrent nodes in the auto scaling diagram to simulate the changes of compute nodes during auto scaling.