How to scale a node pool by adjusting the expected number of nodes - Container Service for Kubernetes

You can manually scale a node pool by adjusting its expected number of nodes. The system then keeps the node count at the expected value, which improves O&M efficiency. Scale out a node pool to meet growing business requirements, and scale in a node pool to reduce resource costs.

Note

Container Service for Kubernetes (ACK) also supports auto scaling. You can use node auto scaling or node instant scaling to automatically adjust node resources, thereby enhancing scheduling capacity. For more information, see Overview of node scaling.

Introduction to node pool scaling

The expected number of nodes refers to the number of nodes to be retained in a node pool. It indicates the number of nodes in the node pool when the node pool reaches the final state. After you specify the expected number of nodes in a node pool, the nodes in the node pool are automatically scaled to the specified number.

Scale out the node pool

Set the expected number of nodes to a value that is greater than the current value. The node pool is then automatically scaled out. If the system fails to add nodes to the node pool, it automatically retries. The scale-out configuration varies based on the node pool configuration. The instance type and zone of the new nodes depend on the scaling policy that is used. For more information about node pool scaling policies, see Scaling policies.

During the scaling process of a node pool, billing is based on the specifications of the instances that are actually created and used. For example, assume that a node pool is configured with two instance types, the billing method is pay-as-you-go, and the Scaling Policy is set to Priority. During a scale-out operation, two nodes of type A are added in the zone of the first-priority vSwitch. If type A resources then become insufficient, three nodes of type B are added in the zone of the second-priority vSwitch. The cost for one hour is the unit price of each instance type multiplied by the number of nodes and the billing duration, that is, (Node A unit price × 2 × 1) + (Node B unit price × 3 × 1).
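
The billing formula above can be worked through with a short sketch. The unit prices below are hypothetical placeholders chosen only for illustration. Actual prices depend on the instance type and region.

```python
from decimal import Decimal

def scale_out_cost(additions, hours):
    """Sum unit_price * node_count * hours over each instance type added.

    additions: iterable of (hourly_unit_price, node_count) pairs.
    """
    return sum(
        (Decimal(price) * count * hours for price, count in additions),
        Decimal("0"),
    )

# Hypothetical hourly unit prices, not real ECS prices.
PRICE_A = "0.50"
PRICE_B = "0.80"

# Two type A nodes and three type B nodes, billed for one hour.
cost = scale_out_cost([(PRICE_A, 2), (PRICE_B, 3)], hours=1)
print(cost)  # 3.40
```

Decimal is used instead of float so that the currency arithmetic stays exact.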

The system performs the following steps to scale out a node pool.

Create ECS instances: Auto Scaling, the underlying service used by ACK to scale node pools, automatically creates Elastic Compute Service (ECS) instances. After you modify the expected number of nodes, ACK automatically changes the expected number of instances in the scaling group of Auto Scaling to scale out the node pool based on the node pool configurations. The status of the node pool changes to Expanding. After Auto Scaling creates ECS instances, the status of the node pool changes to Activated. For more information about the Expected Number of Instances feature, see Expected number of instances.
Important
Instances of GPU-accelerated ECS Bare Metal Instance families ebmgn7 and ebmgn7e cannot automatically delete the Multi-Instance GPU (MIG) configuration. When ACK adds instances of the preceding instance families, ACK automatically resets the MIG configuration retained on the instances. The reset may be time-consuming. In this case, you may fail to add the instances to a cluster.
- For more information about how to troubleshoot the issue, see What do I do if I fail to add ECS Bare Metal instances?
- For more information about the ebmgn7e instance family, see GPU-accelerated compute-optimized instance families (gn, ebm, and scc series).
Add the ECS instances to the cluster: After Auto Scaling creates ECS instances, the ECS instances automatically run the cloud-init script maintained by ACK to initialize the nodes and add the nodes to the node pool. The operational log is saved to the /var/log/messages file on each node. You can log on to a node and run the grep cloud-init /var/log/messages command to view the log.
Note
- After a node is added to the node pool, the operational log in the /var/log/messages file is automatically deleted. Therefore, the log records only information about failures to add nodes to the node pool.
- If the system fails to add a node to the node pool, the relevant log data in the /var/log/messages file is synchronized to the task result. You can view the task details on the Cluster Tasks tab of the cluster details page.

Scale in the node pool

Set the expected number of nodes to a value that is smaller than the current value. Then, the node pool is automatically scaled in.

When the system scales in a node pool:
- If the scaling policy is set to Priority, the system preferably removes the newly created ECS instances from the scaling group.
- If the scaling policy is set to Distribution Balancing, the system filters the zones where the ECS instances are deployed based on the policy. Then, the newly created ECS instances are preferably removed from the scaling group to ensure that the numbers of ECS instances in different zones of the scaling group are close or the same.
- If the scaling policy is set to Cost Optimization, the system removes ECS instances from the scaling group in the descending order of vCPU prices.
When a scale-in activity is triggered by changing the expected number of nodes, ACK can remove nodes without the need to drain the nodes first. If you want to drain the nodes before they are removed, refer to Remove a node.
When the system scales in a node pool, only pay-as-you-go ECS instances are released. Subscription ECS instances are not released. If you need to release subscription nodes that have not expired, log on to the ECS console and change their billing method to pay-as-you-go. For more information, see Change the billing method of an instance from subscription to pay-as-you-go.
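
The removal order described above can be modeled with a small sketch. This is an illustrative model only, not the Auto Scaling implementation. The `Instance` class, its field names, and the policy strings are hypothetical. The sketch assumes that subscription instances are simply skipped, and it omits Distribution Balancing, which also needs per-zone counts.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Instance:
    instance_id: str
    created_at: int    # hypothetical creation timestamp; larger means newer
    vcpu_price: float  # hypothetical price per vCPU
    billing: str       # "pay-as-you-go" or "subscription"

def pick_for_removal(instances, count, policy):
    """Return the IDs of up to `count` instances to remove, in removal order."""
    # Only pay-as-you-go instances are released during scale-in.
    candidates = [i for i in instances if i.billing == "pay-as-you-go"]
    if policy == "priority":
        # Newly created instances are removed first.
        ordered = sorted(candidates, key=lambda i: i.created_at, reverse=True)
    elif policy == "cost_optimization":
        # Instances are removed in descending order of vCPU price.
        ordered = sorted(candidates, key=lambda i: i.vcpu_price, reverse=True)
    else:
        raise ValueError(f"policy not modeled in this sketch: {policy}")
    return [i.instance_id for i in ordered[:count]]

pool = [
    Instance("i-a", 1, 0.05, "pay-as-you-go"),
    Instance("i-b", 2, 0.03, "pay-as-you-go"),
    Instance("i-c", 3, 0.09, "subscription"),
    Instance("i-d", 4, 0.04, "pay-as-you-go"),
]
print(pick_for_removal(pool, 2, "priority"))           # ['i-d', 'i-b']
print(pick_for_removal(pool, 2, "cost_optimization"))  # ['i-a', 'i-d']
```

In both cases the subscription instance `i-c` is never selected, even though it is the newest and the most expensive.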

Procedure

Log on to the ACK console. In the left-side navigation pane, click Clusters.
On the Clusters page, find the cluster that you want to manage and click its name. In the left-side navigation pane, choose Nodes > Node Pools.
Find the node pool that you want to scale and, in the Actions column, choose More > Scale.
(Optional) If CloudOps Orchestration Service (OOS) authorization is not completed, you must perform this step. You can perform the following steps to create the AliyunOOSLifecycleHook4CSRole role and assign the role to OOS.
1. Click AliyunOOSLifecycleHook4CSRole.
  Note
  If the current account is an Alibaba Cloud account, click AliyunOOSLifecycleHook4CSRole.
  If the current account is a RAM user, make sure that your Alibaba Cloud account is assigned the AliyunOOSLifecycleHook4CSRole role. Then, attach the AliyunRAMReadOnlyAccess policy to the RAM user. For more information, see Grant permissions to a RAM user.
2. On the Cloud Resource Access Authorization page, click Agree to Authorization.
Set the Expected Nodes parameter and configure other parameters as prompted.
- If the status of the node pool in the node pool list displays Expanding, the system is scaling out the node pool. If the status of the node pool changes to Activated, the node pool is scaled out.
  Important
  If the security group of the cluster denies access to 100.64.0.0/10, new nodes cannot be added to the cluster.
- If the status of the node pool in the node pool list displays Removing, the system is scaling in the node pool. If the status of the node pool changes to Activated, the node pool is scaled in.
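
Before you scale out, it can help to check whether any deny rule in the cluster security group covers 100.64.0.0/10. The following sketch uses only the Python standard library. The list of deny-rule CIDR blocks is a hypothetical input that you would collect from your own security group. Real security groups also evaluate rule priority and allow rules, so treat the result as a list of rules to review rather than a definitive verdict.

```python
import ipaddress

# The address range that new nodes must be able to reach.
REQUIRED = ipaddress.ip_network("100.64.0.0/10")

def overlapping_deny_rules(deny_cidrs):
    """Return the deny-rule CIDR blocks that overlap 100.64.0.0/10."""
    return [
        cidr for cidr in deny_cidrs
        if ipaddress.ip_network(cidr).overlaps(REQUIRED)
    ]

# Hypothetical deny rules copied from a security group.
rules = ["100.100.0.0/16", "10.0.0.0/8", "0.0.0.0/0"]
print(overlapping_deny_rules(rules))  # ['100.100.0.0/16', '0.0.0.0/0']
```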

Unrecommended operations and solutions

The expected number of nodes refers to the number of nodes retained in a node pool. Unrecommended operations may result in node pool scaling failures and cause business losses. The following table describes the unrecommended operations and suggestions on how to fix the issues caused by these operations.

Important

Do not perform the unrecommended operations in the following table.

Unrecommended operation	Node pool behavior	Suggestion
Remove nodes by running the `kubectl delete node` command.	ACK compares the expected number of nodes only with the number of ECS instances in the scaling group. It does not compare the expected number of nodes with the actual number of nodes in the cluster. If you use the API server to remove nodes, the ECS instances that host the nodes are not released. As a result, the actual number of nodes in the node pool does not change. However, the status of the nodes that are removed from the cluster changes to Unknown.	If you have performed this operation, click the name of the node pool and then remove the nodes on the Nodes tab. Note: You do not need to select Drain the Node because the nodes are already removed from the cluster. You can select Release ECS Instance based on your business requirements. The ECS instances of the following nodes are not released after you perform the preceding operation, so you must log on to the ECS console and release them manually: nodes that were manually added to the cluster, and subscription nodes.
Manually release ECS instances in the ECS console or by calling the API.	Node pools are aware of the releases of ECS instances and can automatically create ECS instances to reach the expected number of nodes.	ACK compares the expected number of nodes with the actual number of nodes in the node pool to detect ECS instance releases and create new ECS instances. This helps avoid business losses. We recommend that you use the ACK console to remove nodes. For more information, see Remove a node. The ECS instances of the following nodes are not released after you perform the preceding operation, so you must log on to the ECS console and release them manually: nodes that were manually added to the cluster, and subscription nodes.
Remove ECS instances from the scaling group in the Auto Scaling console without changing the expected number of instances.	Node pools are aware of the releases of ECS instances and can automatically create ECS instances to reach the expected number of nodes.	Do not modify the scaling groups used by node pools in case the node pools cannot function as normal.
ECS instances are automatically released when the subscription expires.	Node pools are aware of the releases of ECS instances and can automatically create ECS instances to reach the expected number of nodes.	ACK compares the expected number of nodes with the actual number of nodes in the node pool to detect ECS instance releases and create new ECS instances. This helps avoid business losses. We recommend that you remove or renew subscription ECS instances that are about to expire at the earliest opportunity. For more information about how to remove a node, see Remove a node. For more information about how to renew an instance, see Renew ECS instances.
Use the Auto Scaling console or API to enable health checks for the scaling group.	After you enable health checks for the scaling group, the system automatically creates new ECS instances when identifying unhealthy ECS instances, such as suspended ECS instances.	By default, health checks are disabled for scaling groups used by ACK. ECS instances are added to ACK clusters only when nodes are released. Do not modify the scaling groups used by node pools in case the node pools cannot function as normal.
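
The first two rows of the table behave differently because the expected number of nodes is compared with the ECS instances in the scaling group, not with the nodes registered in the cluster. The following sketch models that comparison. It is a simplified illustration with hypothetical instance IDs and function names, not ACK's actual reconciliation logic.

```python
def instances_to_create(expected_nodes, scaling_group_instances):
    """Number of ECS instances needed to reach the expected number of nodes."""
    return max(expected_nodes - len(scaling_group_instances), 0)

expected = 3
scaling_group = {"i-1", "i-2", "i-3"}  # ECS instances in the scaling group
cluster_nodes = {"i-1", "i-2", "i-3"}  # nodes registered with the API server

# Case 1: `kubectl delete node` removes the node object only.
cluster_nodes_after_delete = cluster_nodes - {"i-3"}
after_kubectl_delete = instances_to_create(expected, scaling_group)
print(len(cluster_nodes_after_delete), after_kubectl_delete)  # 2 0

# Case 2: the ECS instance itself is released, so the scaling group shrinks.
after_ecs_release = instances_to_create(expected, scaling_group - {"i-3"})
print(after_ecs_release)  # 1
```

In the first case the cluster has only two nodes, but no replacement is created because the scaling group still holds three instances. In the second case one new instance is created to restore the expected number.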

Error codes for scaling failures and solutions

Node pool scaling may fail due to reasons such as insufficient inventory. You can click the name of your ACK cluster on the Clusters page of the ACK console, click the Cluster Tasks tab, and then click View Cause to view the cause of a node pool scaling failure.

The following table describes the error codes for common node pool scaling failures.

Error code	Cause	Solution
RecommendEmpty.InstanceTypeNoStock	The inventory of ECS instances in the current zone is insufficient.	Modify the node pool by specifying vSwitches in different zones for the node pool and selecting multiple ECS instance types to improve the success rate of node creation. Note: On the Node Pools page, click the name of the node pool that you want to manage. The scalability of the node pool is displayed next to Scaling Group on the Overview tab. You can estimate the success rate of scaling the current node pool based on the scalability.
NodepoolScaleFailed.FailedJoinCluster	Nodes fail to be added to the ACK cluster.	Log on to one of the nodes and run the `grep cloud-init /var/log/messages` command to view the operational log and check the error message.
InvalidAccountStatus.NotEnoughBalance	Your account does not have a sufficient balance.	Top up your account first.
InvalidParameter.NotMatch	The `Image bootMode BIOS does not match instanceType bootMode` error message indicates that the specified instance type does not support the specified OS image boot mode.	Select another instance type. You can click Details in the Actions column of the node pool that you want to manage on the Node Pools page and then click the Overview tab to view basic information about the node pool, such as the OS and image ID. You can call the DescribeImageSupportInstanceTypes operation to query the instance types supported by the OS images used in ACK. For more information about the OS images supported by ACK, see OS images.
QuotaExceed.ElasticQuota	The number of ECS instances created based on the specified instance type in the current region has exceeded the quota limit.	Select another instance type, reduce the number of existing ECS instances, or go to the Quota Center and request a quota increase.
InvalidResourceType.NotSupported	The specified instance type is not supported in the current zone or out of stock.	Call the DescribeAvailableResource operation to query the instance types supported in the current zone and change the instance type used by the node pool.
InvalidImage.NotSupported	The `The specified image does not support vSGX instance.` error message indicates that the OS image of the node pool does not support security-enhanced instances.	Select another instance type. You can click Details in the Actions column of the node pool that you want to manage on the Node Pools page and then click the Overview tab to view basic information about the node pool, such as the OS and image ID. You can call the DescribeImageSupportInstanceTypes operation to query the instance types supported by the OS images used in ACK. For more information about the OS images supported by security-enhanced instances, see Create a trusted instance in the ECS console.
InvalidParameter.NotMatch	The `The specified instanceType only support vTPM image.` error message indicates that the specified OS image does not support security-enhanced instances.	Select another instance type. You can click Details in the Actions column of the node pool that you want to manage on the Node Pools page and then click the Overview tab to view basic information about the node pool, such as the OS and image ID. You can call the DescribeImageSupportInstanceTypes operation to query the instance types supported by the OS images used in ACK. For more information about the OS images supported by security-enhanced instances, see Create a trusted instance in the ECS console.
QuotaExceeded.PrivateIpAddress	The idle private IP addresses provided by the current vSwitch are insufficient.	Specify more vSwitches for the node pool and try again.
InvalidParameter.KmsNotEnabled	The specified Key Management Service (KMS) key is disabled.	Log on to the KMS console and enable the key.
InvalidInstanceType.NotSupported	The `The specified instanceType is not supported by the image architecture.` error message indicates that the current instance type does not support the specified OS image.	Select another instance type. You can click Details in the Actions column of the node pool that you want to manage on the Node Pools page and then click the Overview tab to view basic information about the node pool, such as the OS and image ID. You can call the DescribeImageSupportInstanceTypes operation to query the instance types supported by the OS images used in ACK. For more information about the OS images supported by ACK, see OS images.
InsufficientBalance.CreditPay	Your account does not have a sufficient balance.	Top up your account first.
ApiServer.InternalError	The `an error on the server (\"Get \\\"https://192.168.xxx.xxx:xxx/api/v1/nodes\\\": dial tcp 192.168.xxx.xxx:xxx: connect: connection refused\") has prevented the request from succeeding` error message indicates that access to the API server of the ACK cluster fails.	Check whether the API server is accessible or available. For more information, see Troubleshoot cluster access issues in the ACK console.
RecommendEmpty.InstanceTypeNotAuthorized	You do not have the permissions to use the specified instance type.	Submit a ticket to acquire the required permissions on ECS.
Account.Arrearage	Your account does not have a sufficient balance.	Top up your account first.
Err.QueryEndpoints	Access to the API server of the ACK cluster fails.	Check whether the API server is accessible or available. For more information, see Troubleshoot cluster access issues in the ACK console.
RecommendEmpty.DiskTypeNoStock	The inventory of disks is insufficient in the specified zone.	Specify more vSwitches for the node pool or select another disk type.
InvalidParameter.KMSKeyId.KMSUnauthorized	You do not have the permissions to access KMS.	Log on to the ECS console and assign the AliyunECSDiskEncryptDefaultRole role to ECS. For more information, see Encryption-related permissions.
InvalidParameter.Conflict	The `The specified disk category (xxxx) is not support the specified instance type.` error message indicates that the current instance type does not support the specified disk type.	Select another instance type or disk type.
NotSupportSnapshotEncrypted.DiskCategory	System disk encryption supports only enhanced SSDs (ESSDs).	Select another disk type. For more information about disk types and disk encryption, see Create a node pool.
ScalingActivityInProgress	The node pool is being scaled.	Try again later. To avoid scaling conflicts, do not scale node pools in the Auto Scaling console.
Instance.StartInstanceFailed	The ECS instances fail to start up.	Try again later. To troubleshoot this issue, submit a ticket to the ECS team.
OperationDenied.NoStock	The current ECS instance type is out of stock in the specified zone.	Select another instance type. The scalability of a node pool dynamically changes based on the stock of ECS instances, which affects the success rate of node pool scaling. For more information, see Check the scalability of a node pool.
RecommendEmpty.InstanceTypeNoStock	The current ECS instance type is out of stock in the specified zone.	Select another instance type. The scalability of a node pool dynamically changes based on the stock of ECS instances, which affects the success rate of node pool scaling. For more information, see Check the scalability of a node pool.
NodepoolScaleFailed.WaitForDesiredSizeTimeout	The scale-out task times out.	Perform the following steps to view the task details: 1. Log on to the ACK console. In the left-side navigation pane, click Clusters. 2. On the Clusters page, find the cluster that you want to manage and click its name. In the left-side navigation pane, choose Nodes > Node Pools. 3. Click the name of the node pool that you want to manage and click the Scaling Activities tab to view the details of the scale-out task.
ApiServer.TooManyRequests	The task is throttled by the Kubernetes API server of the cluster.	Reduce the request frequency or try again later.
NodepoolScaleFailed.PartialSuccess	Some nodes failed to be created due to insufficient inventory.	Change the instance types used by the node pool and then try again. The scalability of a node pool dynamically changes based on the stock of ECS instances, which affects the success rate of node pool scaling. For more information, see Check the scalability of a node pool.
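
If you handle scaling failures in your own tooling, a lookup like the following sketch can group some of the error codes in the table by the kind of action they call for. The error codes come from the table. The category names and the `categorize` function are hypothetical groupings made for illustration and are not part of any ACK API.

```python
# Error codes from the table above, grouped by a hypothetical action category.
ERROR_CATEGORIES = {
    "RecommendEmpty.InstanceTypeNoStock": "change_instance_type_or_zone",
    "OperationDenied.NoStock": "change_instance_type_or_zone",
    "NodepoolScaleFailed.PartialSuccess": "change_instance_type_or_zone",
    "InvalidAccountStatus.NotEnoughBalance": "top_up_account",
    "InsufficientBalance.CreditPay": "top_up_account",
    "Account.Arrearage": "top_up_account",
    "ApiServer.InternalError": "check_api_server",
    "Err.QueryEndpoints": "check_api_server",
    "ApiServer.TooManyRequests": "retry_later",
    "ScalingActivityInProgress": "retry_later",
    "Instance.StartInstanceFailed": "retry_later",
}

def categorize(error_code):
    """Return the action category for a code, or a fallback for other codes."""
    return ERROR_CATEGORIES.get(error_code, "see_error_table")

print(categorize("Account.Arrearage"))         # top_up_account
print(categorize("InvalidParameter.Conflict")) # see_error_table
```

Codes that are not in the mapping fall back to `see_error_table`, because their solutions depend on the specific error message.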

References

For more information about the operations and precautions when you remove nodes from a node pool, see Remove a node.
For more information about O&M tasks for node pools, such as upgrading the node pool, auto repair, and patching OS CVE vulnerabilities for node pools, see Node pool O&M.
For more information about best practices for node pools, such as using a deployment set to distribute your ECS instances to different physical servers to ensure high availability and preemptible instance-based node pools, see Best practices for nodes and node pools.