Container Service for Kubernetes (ACK) allows you to scale a node pool by modifying the expected number of nodes in the node pool. You can scale out node pools to meet the requirements of business development and scale in node pools to reduce resource costs. Node pool scaling can be automated to improve O&M efficiency.
Prerequisites
A cluster is created and there are node pools in the cluster. For more information, see Create a node pool.
A kubectl client is connected to the ACK cluster. For more information, see Obtain the kubeconfig file of a cluster and use kubectl to connect to the cluster.
Introduction to node pool scaling
The expected number of nodes is the number of nodes that a node pool retains when it reaches its final state. After you specify the expected number of nodes for a node pool, the node pool is automatically scaled to that number.
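The reconciliation described above can be sketched in a few lines. This is a minimal, hypothetical Python example: the NodePool fields are illustrative and do not reflect the actual ACK data model.

```python
# Hypothetical sketch of desired-size reconciliation: compare the expected
# number of nodes with the current number and decide the scaling action.
from dataclasses import dataclass

@dataclass
class NodePool:
    current_nodes: int
    desired_nodes: int

def plan_scaling(pool: NodePool) -> str:
    """Return the action needed to bring the pool to its expected size."""
    delta = pool.desired_nodes - pool.current_nodes
    if delta > 0:
        return f"scale out by {delta} node(s)"
    if delta < 0:
        return f"scale in by {-delta} node(s)"
    return "no change"
```

For example, raising the expected number of nodes from 3 to 5 yields a scale-out by 2 nodes.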
Scale out the node pool
Set the expected number of nodes to a value that is greater than the current value. The node pool is then automatically scaled out. We recommend this method for scale-outs because the system automatically retries when it fails to add nodes to the node pool.
The scale-out configuration varies based on the node pool configuration. The instance type and zone of the nodes depend on the scaling policy that is used. For more information about node pool scaling policies, see Scaling policies.
The system performs the following steps to scale out a node pool:
Create ECS instances: Auto Scaling, the underlying service used by ACK to scale node pools, automatically creates Elastic Compute Service (ECS) instances. After you modify the expected number of nodes, ACK automatically changes the expected number of instances in the scaling group of Auto Scaling to scale out the node pool. The status of the node pool changes to Expanding. After Auto Scaling creates ECS instances, the status of the node pool changes to Activated. For more information about the Expected Number of Instances feature, see Expected number of instances.
Important: Instances of the GPU-accelerated ECS Bare Metal instance families ebmgn7 and ebmgn7e cannot automatically delete the Multi-Instance GPU (MIG) configuration. When ACK adds instances of these instance families, it automatically resets the MIG configuration retained on the instances. The reset may be time-consuming and may cause the instances to fail to be added to the cluster.
For more information about how to troubleshoot the issue, see What do I do if I fail to add ECS Bare Metal instances?
For more information about the ebmgn7e instance family, see ebmgn7e, GPU-accelerated compute-optimized ECS Bare Metal Instance family.
Add the ECS instances to the cluster: After Auto Scaling creates ECS instances, the instances automatically run the cloud-init script maintained by ACK to initialize the nodes and add them to the node pool. The operational log is saved to the /var/log/messages file on each node. You can log on to a node and run the grep cloud-init /var/log/messages command to view the log.
Note: After a node is added to the node pool, the operational log in the /var/log/messages file is automatically deleted. Therefore, the log records only information about failures to add nodes to the node pool.
If the system fails to add a node to the node pool, the relevant log data in the /var/log/messages file is synchronized to the task result. You can view the task details on the Cluster Tasks tab of the cluster details page.
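The manual log check described above can be sketched as a small filter. The log lines and failure keywords in this Python example are assumptions for illustration, not the exact format that cloud-init emits:

```python
# Hypothetical sketch: filter cloud-init entries from /var/log/messages for
# lines that look like node-join failures. The keywords are illustrative.
def find_cloud_init_errors(log_lines):
    """Return cloud-init log lines that suggest a failure."""
    keywords = ("error", "failed", "timeout")
    return [
        line
        for line in log_lines
        if "cloud-init" in line and any(k in line.lower() for k in keywords)
    ]
```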
Scale in the node pool
Set the expected number of nodes to a value that is smaller than the current value. Then, the node pool is automatically scaled in.
When the system scales in a node pool:
If the scaling policy is set to Priority, the system preferentially removes the most recently created ECS instances from the scaling group.
If the scaling policy is set to Distribution Balancing, the system first filters the zones where the ECS instances are deployed based on the policy, and then preferentially removes the most recently created ECS instances so that the numbers of ECS instances across the zones of the scaling group stay balanced.
If the scaling policy is set to Cost Optimization, the system removes ECS instances from the scaling group in descending order of vCPU price.
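The three removal orders above can be sketched as follows. The instance records and field names (created_at, zone, vcpu_price) are hypothetical; real data comes from the Auto Scaling and ECS APIs.

```python
# Hypothetical sketch of the three removal orders used during scale-in.
from collections import Counter

def pick_removals(instances, count, policy):
    """Return the instances that a scale-in would remove first."""
    if policy == "Priority":
        # The most recently created instances are removed first.
        order = sorted(instances, key=lambda i: i["created_at"], reverse=True)
        return order[:count]
    if policy == "Cost Optimization":
        # Instances are removed in descending order of vCPU price.
        order = sorted(instances, key=lambda i: i["vcpu_price"], reverse=True)
        return order[:count]
    if policy == "Distribution Balancing":
        # Repeatedly remove the newest instance from the zone that currently
        # has the most instances, so that zone counts stay balanced.
        remaining = sorted(instances, key=lambda i: i["created_at"], reverse=True)
        removed = []
        for _ in range(count):
            zones = Counter(i["zone"] for i in remaining)
            biggest = max(zones, key=zones.get)
            chosen = next(i for i in remaining if i["zone"] == biggest)
            remaining.remove(chosen)
            removed.append(chosen)
        return removed
    raise ValueError(f"unknown policy: {policy}")
```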
When a scale-in activity is triggered by changing the expected number of nodes, ACK can remove nodes without the need to drain the nodes first. If you want to drain the nodes before they are removed, refer to Remove a node.
When the system scales in a node pool, only pay-as-you-go ECS instances are released. Subscription ECS instances are not released. If you need to release subscription nodes that have not expired, log on to the ECS console and change their billing method to pay-as-you-go. For more information, see Change the billing method of an instance from subscription to pay-as-you-go.
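The billing-method rule above can be sketched as a simple filter. The charge-type strings mirror the ECS API values PostPaid (pay-as-you-go) and PrePaid (subscription); the record format itself is hypothetical.

```python
# Sketch: on scale-in, only pay-as-you-go (PostPaid) ECS instances are
# candidates for release; subscription (PrePaid) instances are kept.
def releasable(instances):
    """Return the instances that a scale-in is allowed to release."""
    return [i for i in instances if i["charge_type"] == "PostPaid"]
```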
Procedure
Log on to the ACK console. In the left-side navigation pane, click Clusters.
On the Clusters page, find the cluster that you want to manage and click its name. In the left-side navigation pane, choose Nodes > Node Pools.
Find the node pool that you want to scale. In the Actions column, choose More > Scale.
(Optional) If CloudOps Orchestration Service (OOS) authorization is not complete, perform this step to create the AliyunOOSLifecycleHook4CSRole role and assign it to OOS.
Click AliyunOOSLifecycleHook4CSRole.
Note: If the current account is an Alibaba Cloud account, click AliyunOOSLifecycleHook4CSRole. If the current account is a RAM user, make sure that your Alibaba Cloud account is assigned the AliyunOOSLifecycleHook4CSRole role, and then attach the AliyunRAMReadOnlyAccess policy to the RAM user. For more information, see Grant permissions to a RAM user.
On the Cloud Resource Access Authorization page, click Agree to Authorization.
Set the Expected Nodes parameter and configure other parameters as prompted.
If the status of the node pool in the node pool list is Expanding, the system is scaling out the node pool. When the status changes to Activated, the scale-out is complete.
Important: If the security group of the cluster denies access to the 100.64.0.0/10 CIDR block, new nodes cannot be added to the cluster.
If the status of the node pool in the node pool list is Removing, the system is scaling in the node pool. When the status changes to Activated, the scale-in is complete.
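To pre-check the security group condition mentioned above, you can test whether any allowed source CIDR fully contains 100.64.0.0/10. This is a minimal sketch using Python's standard ipaddress module; the list-of-CIDRs input format is an assumption (real rules come from the ECS security group APIs).

```python
# Sketch: check whether the allowed source CIDRs of a security group cover
# the 100.64.0.0/10 range that new nodes need to reach.
import ipaddress

REQUIRED = ipaddress.ip_network("100.64.0.0/10")

def covers_required_range(allow_cidrs):
    """True if any allowed CIDR fully contains 100.64.0.0/10."""
    return any(
        REQUIRED.subnet_of(ipaddress.ip_network(c)) for c in allow_cidrs
    )
```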
Unrecommended operations and solutions
The expected number of nodes refers to the number of nodes retained in a node pool. The unrecommended operations in the following table may cause node pool scaling failures and business losses. Do not perform these operations. The table also provides suggestions on how to fix the issues that they cause.
Unrecommended operation | Node pool behavior | Suggestion
--- | --- | ---
Remove nodes by calling the Kubernetes API server, for example, by running the kubectl delete node command. | ACK compares the expected number of nodes only with the number of ECS instances in the scaling group. It does not compare the expected number of nodes with the actual number of nodes in the cluster. If you use the API server to remove nodes, the ECS instances that host the nodes are not released. As a result, the actual number of nodes in the node pool does not change, but the status of the removed nodes changes to Unknown. | To remove nodes, reduce the expected number of nodes of the node pool or follow the steps in Remove a node.
Manually release ECS instances in the ECS console or by calling the ECS API. | Node pools are aware of the releases of ECS instances and can automatically create ECS instances to reach the expected number of nodes. | Do not manually release ECS instances. To remove nodes, reduce the expected number of nodes of the node pool instead.
Remove ECS instances from the scaling group in the Auto Scaling console without changing the expected number of instances. | Node pools are aware of the releases of ECS instances and can automatically create ECS instances to reach the expected number of nodes. | Do not modify the scaling groups used by node pools. Otherwise, the node pools may not function as expected.
ECS instances are automatically released when their subscriptions expire. | Node pools are aware of the releases of ECS instances and can automatically create ECS instances to reach the expected number of nodes. | ACK compares the expected number of nodes with the actual number of nodes in the node pool to detect ECS instance releases and create new ECS instances. This helps avoid business losses. We recommend that you remove or renew subscription ECS instances that are about to expire at the earliest opportunity.
Enable health checks for the scaling group in the Auto Scaling console or by calling the API. | After you enable health checks for the scaling group, the system automatically creates new ECS instances when it identifies unhealthy ECS instances, such as suspended ECS instances. | By default, health checks are disabled for the scaling groups used by ACK, and new ECS instances are added to ACK clusters only after nodes are released. Do not modify the scaling groups used by node pools. Otherwise, the node pools may not function as expected.
Error codes for scaling failures and solutions
Node pool scaling may fail due to reasons such as insufficient inventory. You can click the name of your ACK cluster on the Clusters page of the ACK console, click the Cluster Tasks tab, and then click View Cause to view the cause of a node pool scaling failure.
The following table describes the error codes for common node pool scaling failures.
Error code | Cause | Solution
--- | --- | ---
RecommendEmpty.InstanceTypeNoStock | The inventory of ECS instances in the current zone is insufficient. | Modify the node pool to specify vSwitches in different zones and select multiple ECS instance types to improve the success rate of node creation. Note: On the Node Pools page, click the name of the node pool that you want to manage. The scalability of the node pool is displayed next to Scaling Group on the Overview tab. You can estimate the success rate of scaling the node pool based on the scalability.
NodepoolScaleFailed.FailedJoinCluster | Nodes fail to be added to the ACK cluster. | Log on to one of the nodes and run the grep cloud-init /var/log/messages command to view the operational log and troubleshoot the failure.
InvalidAccountStatus.NotEnoughBalance | Your account does not have a sufficient balance. | Top up your account first. |
InvalidParameter.NotMatch | The specified instance type does not match another parameter, such as the specified image or disk. | Select another instance type.
QuotaExceed.ElasticQuota | The number of ECS instances created from the specified instance type in the current region has exceeded the quota limit. | Request a quota increase in the Quota Center console, submit a ticket, or select another instance type.
InvalidResourceType.NotSupported | The specified instance type is not supported in the current zone or out of stock. | Call the DescribeAvailableResource operation to query the instance types supported in the current zone and change the instance type used by the node pool. |
InvalidImage.NotSupported | The specified image is not supported by the selected instance type. | Select another instance type.
QuotaExceeded.PrivateIpAddress | The idle private IP addresses provided by the current vSwitch are insufficient. | Specify more vSwitches for the node pool and try again. |
InvalidParameter.KmsNotEnabled | The specified Key Management Service (KMS) key is disabled. | Log on to the KMS console and enable the key. |
InvalidInstanceType.NotSupported | The specified instance type is not supported. | Select another instance type.
InsufficientBalance.CreditPay | Your account does not have a sufficient balance. | Top up your account first. |
ApiServer.InternalError | An internal error occurs in the API server of the cluster. | Check whether the API server is accessible or available. For more information, see ACK console troubleshooting (cluster access exceptions).
RecommendEmpty.InstanceTypeNotAuthorized | You do not have the permissions to use the specified instance type. | Submit a ticket to acquire the required permissions on ECS. |
Account.Arrearage | Your account does not have a sufficient balance. | Top up your account first. |
Err.QueryEndpoints | Access to the API server of the ACK cluster fails. | Check whether the API server is accessible or available. For more information, see ACK console troubleshooting (cluster access exceptions). |
RecommendEmpty.DiskTypeNoStock | The inventory of disks is insufficient in the specified zone. | Specify more vSwitches for the node pool or select another disk type. |
InvalidParameter.KMSKeyId.KMSUnauthorized | You do not have the permissions to access KMS. | Log on to the ECS console and assign the AliyunECSDiskEncryptDefaultRole role to ECS. For more information, see Grant access to KMS keys through RAM roles. |
InvalidParameter.Conflict | The specified instance type conflicts with the specified disk type. | Select another instance type or disk type.
NotSupportSnapshotEncrypted.DiskCategory | System disk encryption supports only enhanced SSDs (ESSDs). | Select another disk type. For more information about disk types and disk encryption, see Create a node pool. |
ScalingActivityInProgress | The node pool is being scaled. | Try again later. To avoid scaling conflicts, do not scale node pools in the Auto Scaling console.
Instance.StartInstanceFailed | The ECS instances fail to start up. | Try again later. To troubleshoot this issue, submit a ticket to the ECS team. |
OperationDenied.NoStock | The current ECS instance type is out of stock in the specified zone. | Select another instance type. The scalability of a node pool dynamically changes based on the stock of ECS instances, which affects the success rate of node pool scaling. For more information, see Check the scalability of a node pool. |
RecommendEmpty.InstanceTypeNoStock | The current ECS instance type is out of stock in the specified zone. | Select another instance type. The scalability of a node pool dynamically changes based on the stock of ECS instances, which affects the success rate of node pool scaling. For more information, see Check the scalability of a node pool. |
NodepoolScaleFailed.WaitForDesiredSizeTimeout | The scale-out task timed out. | On the Clusters page of the ACK console, click the name of the cluster, click the Cluster Tasks tab, and then click View Cause to view the task details.
ApiServer.TooManyRequests | The task is throttled by the Kubernetes API server of the cluster. | Reduce the request frequency or try again later. |
NodepoolScaleFailed.PartialSuccess | Some nodes failed to be created due to insufficient inventory. | Change the instance types used by the node pool and then try again. The scalability of a node pool dynamically changes based on the stock of ECS instances, which affects the success rate of node pool scaling. For more information, see Check the scalability of a node pool. |
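Several remediations in the table above recur (top up the account, add vSwitches, retry). They can be condensed into a small lookup for automation. This is a hypothetical triage helper, not an official ACK mapping:

```python
# Hypothetical triage helper: map recurring error codes from the table to a
# broad remediation category. Unknown codes fall back to the console.
REMEDIATION = {
    "InvalidAccountStatus.NotEnoughBalance": "top up account",
    "InsufficientBalance.CreditPay": "top up account",
    "Account.Arrearage": "top up account",
    "QuotaExceeded.PrivateIpAddress": "add vSwitches",
    "ApiServer.TooManyRequests": "retry later",
}

def triage(code):
    """Return a coarse remediation hint for a scaling-failure error code."""
    return REMEDIATION.get(code, "see task details in the ACK console")
```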
References
For more information about the operations and precautions when you remove nodes from a node pool, see Remove a node.
For more information about O&M tasks for node pools, such as upgrading the node pool, auto repair, and patching OS CVE vulnerabilities for node pools, see Node pool O&M.
For more information about best practices for node pools, such as using a deployment set to distribute your ECS instances to different physical servers to ensure high availability and preemptible instance-based node pools, see Best practices for nodes and node pools.