Container Service for Kubernetes: Scale a node pool

Last Updated: Oct 10, 2024

Container Service for Kubernetes (ACK) allows you to scale a node pool by modifying the expected number of nodes in the node pool. You can scale out a node pool to meet growing business demand and scale it in to reduce resource costs. Node pool scaling can also be automated to improve O&M efficiency.

Prerequisites

Introduction to node pool scaling

The expected number of nodes refers to the number of nodes to be retained in a node pool. It indicates the number of nodes in the node pool when the node pool reaches the final state. After you specify the expected number of nodes in a node pool, the nodes in the node pool are automatically scaled to the specified number.
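The relationship between the current and expected number of nodes can be illustrated with a minimal sketch. The function name `plan_scaling` and its return values are hypothetical and are not part of any ACK API. The sketch only shows how the difference between the two numbers determines whether the node pool is scaled out or scaled in.

```python
def plan_scaling(current_nodes: int, expected_nodes: int) -> tuple[str, int]:
    """Return the scaling direction and the number of nodes to add or remove.

    Hypothetical helper for illustration; ACK performs this comparison itself.
    """
    if current_nodes < 0 or expected_nodes < 0:
        raise ValueError("node counts must be non-negative")
    delta = expected_nodes - current_nodes
    if delta > 0:
        return ("scale_out", delta)   # expected value is greater than the current value
    if delta < 0:
        return ("scale_in", -delta)   # expected value is smaller than the current value
    return ("none", 0)                # the node pool is already in its final state


if __name__ == "__main__":
    print(plan_scaling(3, 5))
    print(plan_scaling(5, 2))
```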

Scale out the node pool

Set the expected number of nodes to a value that is greater than the current value. Then, the node pool is automatically scaled out. If you want to scale out a node pool, we recommend that you use this method. This way, the system can automatically retry when it fails to add nodes to the node pool.

Note

The scale-out configuration varies based on the node pool configuration. The instance type and zone of the nodes depend on the scaling policy that is used. For more information about node pool scaling policies, see Scaling policies.

The system performs the following steps to scale out a node pool.

  1. Create ECS instances: Auto Scaling, the underlying service used by ACK to scale node pools, automatically creates Elastic Compute Service (ECS) instances. After you modify the expected number of nodes, ACK automatically changes the expected number of instances in the scaling group of Auto Scaling to scale out the node pool. The status of the node pool changes to Expanding. After Auto Scaling creates ECS instances, the status of the node pool changes to Activated. For more information about the Expected Number of Instances feature, see Expected number of instances.

    Important

    Instances of GPU-accelerated ECS Bare Metal Instance families ebmgn7 and ebmgn7e cannot automatically delete the Multi-Instance GPU (MIG) configuration. When ACK adds instances of the preceding instance families, ACK automatically resets the MIG configuration retained on the instances. The reset may be time-consuming. In this case, you may fail to add the instances to a cluster.

  2. Add the ECS instances to the cluster: After Auto Scaling creates ECS instances, the ECS instances automatically run the cloud-init script maintained by ACK to initialize the nodes and add the nodes to the node pool. The operational log is saved to the /var/log/messages file on each node. You can log on to a node and run the grep cloud-init /var/log/messages command to view the log.

    Note
    • After a node is added to the node pool, the operational log in the /var/log/messages file is automatically deleted. Therefore, the log records only information about failures to add nodes to the node pool.

    • If the system fails to add a node to the node pool, the relevant log data in the /var/log/messages file is synchronized to the task result. You can view the task details on the Cluster Tasks tab of the cluster details page.
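If you copy the log off a node for offline analysis, the same filtering that grep cloud-init performs can be sketched in Python. The sample log lines below are made up for illustration and do not reflect the actual format of ACK log entries. The function name `cloud_init_lines` is also hypothetical.

```python
def cloud_init_lines(log_text: str) -> list[str]:
    """Return the log lines that mention cloud-init, like `grep cloud-init`."""
    return [line for line in log_text.splitlines() if "cloud-init" in line]


# Made-up sample content; real entries in /var/log/messages will differ.
SAMPLE_LOG = """\
Oct 10 10:00:01 node-1 systemd: example unrelated line
Oct 10 10:00:05 node-1 cloud-init: example failure line
Oct 10 10:00:09 node-1 kernel: example unrelated line
"""

if __name__ == "__main__":
    for line in cloud_init_lines(SAMPLE_LOG):
        print(line)
```

To filter a real file, read its content first, for example with `open("/var/log/messages").read()`, and pass the text to the function.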

Scale in the node pool

Set the expected number of nodes to a value that is smaller than the current value. Then, the node pool is automatically scaled in.

  • When the system scales in a node pool:

    • If the scaling policy is set to Priority, the system preferably removes the newly created ECS instances from the scaling group.

    • If the scaling policy is set to Distribution Balancing, the system filters the zones where the ECS instances are deployed based on the policy. Then, the newly created ECS instances are preferably removed from the scaling group to ensure that the numbers of ECS instances in different zones of the scaling group are close or the same.

    • If the scaling policy is set to Cost Optimization, the system removes ECS instances from the scaling group in the descending order of vCPU prices.

  • When a scale-in activity is triggered by changing the expected number of nodes, ACK can remove nodes without the need to drain the nodes first. If you want to drain the nodes before they are removed, refer to Remove a node.

  • When the system scales in a node pool, only pay-as-you-go ECS instances are released. Subscription ECS instances are not released. If you need to release subscription nodes that have not expired, log on to the ECS console and change their billing method to pay-as-you-go. For more information, see Change the billing method of an instance from subscription to pay-as-you-go.
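The scale-in rules above can be sketched as a selection function. This is a simplified illustration, not the actual Auto Scaling logic: the `Instance` fields, the policy strings, and `pick_for_release` are all hypothetical, and the Distribution Balancing policy is omitted because its zone filtering is not described in enough detail here. The sketch assumes that only pay-as-you-go instances are candidates for release.

```python
from dataclasses import dataclass


@dataclass
class Instance:
    instance_id: str
    created_at: int     # creation timestamp; a larger value means newer
    vcpu_price: float   # hypothetical per-vCPU price
    billing: str        # "pay-as-you-go" or "subscription"


def pick_for_release(instances: list[Instance], count: int, policy: str) -> list[str]:
    """Return the IDs of up to `count` instances to release during a scale-in."""
    # Subscription instances are not released by a scale-in activity.
    candidates = [i for i in instances if i.billing == "pay-as-you-go"]
    if policy == "priority":
        # Newly created instances are removed first.
        ordered = sorted(candidates, key=lambda i: i.created_at, reverse=True)
    elif policy == "cost_optimization":
        # Instances are removed in descending order of vCPU price.
        ordered = sorted(candidates, key=lambda i: i.vcpu_price, reverse=True)
    else:
        raise ValueError(f"unsupported policy in this sketch: {policy}")
    return [i.instance_id for i in ordered[:count]]


POOL = [
    Instance("i-a", 100, 0.05, "pay-as-you-go"),
    Instance("i-b", 300, 0.03, "pay-as-you-go"),
    Instance("i-c", 200, 0.08, "subscription"),
    Instance("i-d", 250, 0.07, "pay-as-you-go"),
]

if __name__ == "__main__":
    print(pick_for_release(POOL, 2, "priority"))
    print(pick_for_release(POOL, 2, "cost_optimization"))
```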

Procedure

  1. Log on to the ACK console. In the left-side navigation pane, click Clusters.

  2. On the Clusters page, find the cluster that you want to manage and click its name. In the left-side navigation pane, choose Nodes > Node Pools.

  3. Find the node pool that you want to scale and choose More > Scale in the Actions column.

  4. (Optional) If CloudOps Orchestration Service (OOS) authorization is not completed, you must perform this step. You can perform the following steps to create the AliyunOOSLifecycleHook4CSRole role and assign the role to OOS.

    1. Click AliyunOOSLifecycleHook4CSRole.

      Note
      • If the current account is an Alibaba Cloud account, click AliyunOOSLifecycleHook4CSRole.

      • If the current account is a RAM user, make sure that your Alibaba Cloud account is assigned the AliyunOOSLifecycleHook4CSRole role. Then, attach the AliyunRAMReadOnlyAccess policy to the RAM user. For more information, see Grant permissions to a RAM user.

    2. On the Cloud Resource Access Authorization page, click Agree to Authorization.

  5. Set the Expected Nodes parameter and configure other parameters as prompted.

    • If the status of the node pool in the node pool list displays Expanding, the system is scaling out the node pool. If the status of the node pool changes to Activated, the node pool is scaled out.

      Important

      If the security group of the cluster denies access to 100.64.0.0/10, new nodes cannot be added to the cluster.

    • If the status of the node pool in the node pool list displays Removing, the system is scaling in the node pool. If the status of the node pool changes to Activated, the node pool is scaled in.
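Before you scale out, you may want to check whether any deny rule in the security group of the cluster covers part of 100.64.0.0/10. The following sketch uses the Python standard library ipaddress module to test CIDR overlap. The function name and the plain list of CIDR strings are assumptions made for illustration. The sketch does not query the security group and ignores rule priority, direction, and protocol.

```python
import ipaddress

# The address range that must remain accessible for nodes to join the cluster.
REQUIRED_RANGE = ipaddress.ip_network("100.64.0.0/10")


def conflicting_deny_rules(deny_cidrs: list[str]) -> list[str]:
    """Return the deny-rule CIDR blocks that overlap 100.64.0.0/10."""
    return [
        cidr
        for cidr in deny_cidrs
        if ipaddress.ip_network(cidr).overlaps(REQUIRED_RANGE)
    ]


if __name__ == "__main__":
    # Hypothetical deny rules copied by hand from a security group.
    rules = ["10.0.0.0/8", "100.64.0.0/10", "100.100.0.0/16", "0.0.0.0/0"]
    print(conflicting_deny_rules(rules))
```

An empty result means that none of the listed deny rules overlaps the range. It does not prove that the range is reachable.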

Unrecommended operations and solutions

The expected number of nodes refers to the number of nodes retained in a node pool. Unrecommended operations may result in node pool scaling failures and cause business losses. The following table describes the unrecommended operations and suggestions on how to fix the issues caused by these operations.

Important

Do not perform the unrecommended operations in the following table.

Unrecommended operation: Remove nodes by running the kubectl delete node command.

Node pool behavior: ACK compares the expected number of nodes only with the number of ECS instances in the scaling group. It does not compare the expected number of nodes with the actual number of nodes in the cluster. If you use the API server to remove nodes, the ECS instances that host the nodes are not released. As a result, the actual number of nodes in the node pool does not change. However, the status of the nodes that are removed from the cluster changes to Unknown.

Suggestion:

  • If you have performed this operation, you can click the name of the node pool and then remove the nodes on the Nodes tab to remove the nodes from the node pool.

    Note

    You do not need to select Drain the Node because the nodes are already removed from the cluster. You can select Release ECS Instance based on your business requirements.

  • The ECS instances of the following nodes are not released after you perform the preceding operation. You need to log on to the ECS console and manually release these ECS instances.

    • Nodes that are manually added to the cluster.

    • Subscription nodes.

Unrecommended operation: Manually release ECS instances in the ECS console or by calling the API.

Node pool behavior: Node pools are aware of the releases of ECS instances and can automatically create ECS instances to reach the expected number of nodes.

Suggestion:

  • ACK compares the expected number of nodes with the actual number of nodes in the node pool to detect ECS instance releases and create new ECS instances. This helps avoid business losses. We recommend that you use the ACK console to remove nodes. For more information, see Remove a node.

  • The ECS instances of the following nodes are not released after you perform the preceding operation. You need to log on to the ECS console and manually release these ECS instances.

    • Nodes that are manually added to the cluster.

    • Subscription nodes.

Unrecommended operation: Remove ECS instances from the scaling group in the Auto Scaling console without changing the expected number of instances.

Node pool behavior: Node pools are aware of the releases of ECS instances and can automatically create ECS instances to reach the expected number of nodes.

Suggestion: Do not modify the scaling groups used by node pools. Otherwise, the node pools may not function as expected.

Unrecommended operation: ECS instances are automatically released when the subscription expires.

Node pool behavior: Node pools are aware of the releases of ECS instances and can automatically create ECS instances to reach the expected number of nodes.

Suggestion: ACK compares the expected number of nodes with the actual number of nodes in the node pool to detect ECS instance releases and create new ECS instances. This helps avoid business losses. We recommend that you remove or renew subscription ECS instances that are about to expire at the earliest opportunity.

Unrecommended operation: Use the Auto Scaling console or API to enable health checks for the scaling group.

Node pool behavior: After you enable health checks for the scaling group, the system automatically creates new ECS instances when identifying unhealthy ECS instances, such as suspended ECS instances.

Suggestion: By default, health checks are disabled for scaling groups used by ACK. ECS instances are added to ACK clusters only when nodes are released. Do not modify the scaling groups used by node pools. Otherwise, the node pools may not function as expected.
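The effect of running kubectl delete node can be illustrated with a small sketch. Because the expected number of nodes is compared only with the ECS instances in the scaling group, an instance whose node was deleted through the API server still counts toward the node pool. The function name and the instance IDs below are hypothetical, and the sketch assumes that you have already collected both sets of IDs yourself.

```python
def instances_without_nodes(
    scaling_group_instances: set[str], cluster_node_instances: set[str]
) -> list[str]:
    """Return instance IDs that are in the scaling group but no longer back a cluster node."""
    return sorted(scaling_group_instances - cluster_node_instances)


if __name__ == "__main__":
    # Hypothetical IDs: i-003 was removed with kubectl delete node,
    # so it is still in the scaling group but not registered in the cluster.
    in_scaling_group = {"i-001", "i-002", "i-003"}
    in_cluster = {"i-001", "i-002"}

    print(len(in_scaling_group))   # the count that is compared with the expected number
    print(instances_without_nodes(in_scaling_group, in_cluster))
```

In this example the scaling group still holds three instances, so no replacement is created even though only two nodes remain in the cluster.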

Error codes for scaling failures and solutions

Node pool scaling may fail due to reasons such as insufficient inventory. You can click the name of your ACK cluster on the Clusters page of the ACK console, click the Cluster Tasks tab, and then click View Cause to view the cause of a node pool scaling failure.

The following table describes the error codes for common node pool scaling failures.

Error code

Cause

Solution

RecommendEmpty.InstanceTypeNoStock

The inventory of ECS instances in the current zone is insufficient.

Modify the node pool by specifying vSwitches in different zones for the node pool and selecting multiple ECS instance types to improve the success rate of node creation.

Note

On the Node Pools page, click the name of the node pool that you want to manage. The scalability of the node pool is displayed next to Scaling Group on the Overview tab. You can determine the success rate of scaling the current node pool based on the scalability.

NodepoolScaleFailed.FailedJoinCluster

Nodes fail to be added to the ACK cluster.

Log on to one of the nodes and run the grep cloud-init /var/log/messages command to view the operational log and check the error message.

InvalidAccountStatus.NotEnoughBalance

Your account does not have a sufficient balance.

Top up your account first.

InvalidParameter.NotMatch

The error message "Image bootMode BIOS does not match instanceType bootMode" indicates that the specified instance type does not support the boot mode of the specified OS image.

Select another instance type.

  • You can click Details in the Actions column of the node pool that you want to manage on the Node Pools page and then click the Overview tab to view basic information about the node pool, such as the OS and image ID.

  • You can call the DescribeImageSupportInstanceTypes operation to query the instance types supported by the OS images used in ACK.

  • For more information about the OS images supported by ACK, see Overview of OS images.

QuotaExceed.ElasticQuota

The number of ECS instances created based on the specified instance type in the current region has exceeded the quota limit.

You can perform the following operations:

  • Select another instance type.

  • Reduce the number of existing ECS instances.

  • Go to the Quota Center and request a quota increase.

InvalidResourceType.NotSupported

The specified instance type is not supported in the current zone or is out of stock.

Call the DescribeAvailableResource operation to query the instance types supported in the current zone and change the instance type used by the node pool.

InvalidImage.NotSupported

The error message "The specified image does not support vSGX instance." indicates that the OS image of the node pool does not support security-enhanced instances.

Select another instance type.

  • You can click Details in the Actions column of the node pool that you want to manage on the Node Pools page and then click the Overview tab to view basic information about the node pool, such as the OS and image ID.

  • You can call the DescribeImageSupportInstanceTypes operation to query the instance types supported by the OS images used in ACK.

  • For more information about the OS images supported by security-enhanced instances, see Create a trusted instance in the ECS console.

InvalidParameter.NotMatch

The error message "The specified instanceType only support vTPM image." indicates that the specified OS image does not support security-enhanced instances.

Select another instance type.

  • You can click Details in the Actions column of the node pool that you want to manage on the Node Pools page and then click the Overview tab to view basic information about the node pool, such as the OS and image ID.

  • You can call the DescribeImageSupportInstanceTypes operation to query the instance types supported by the OS images used in ACK.

  • For more information about the OS images supported by security-enhanced instances, see Create a trusted instance in the ECS console.

QuotaExceeded.PrivateIpAddress

The idle private IP addresses provided by the current vSwitch are insufficient.

Specify more vSwitches for the node pool and try again.

InvalidParameter.KmsNotEnabled

The specified Key Management Service (KMS) key is disabled.

Log on to the KMS console and enable the key.

InvalidInstanceType.NotSupported

The error message "The specified instanceType is not supported by the image architecture." indicates that the current instance type does not support the specified OS image.

Select another instance type.

  • You can click Details in the Actions column of the node pool that you want to manage on the Node Pools page and then click the Overview tab to view basic information about the node pool, such as the OS and image ID.

  • You can call the DescribeImageSupportInstanceTypes operation to query the instance types supported by the OS images used in ACK.

  • For more information about the OS images supported by ACK, see Overview of OS images.

InsufficientBalance.CreditPay

Your account does not have a sufficient balance.

Top up your account first.

ApiServer.InternalError

The following error message indicates that access to the API server of the ACK cluster fails: an error on the server ("Get \"https://192.168.xxx.xxx:xxx/api/v1/nodes\": dial tcp 192.168.xxx.xxx:xxx: connect: connection refused") has prevented the request from succeeding

Check whether the API server is accessible or available. For more information, see ACK console troubleshooting (cluster access exceptions).

RecommendEmpty.InstanceTypeNotAuthorized

You do not have the permissions to use the specified instance type.

Submit a ticket to acquire the required permissions on ECS.

Account.Arrearage

Your account does not have a sufficient balance.

Top up your account first.

Err.QueryEndpoints

Access to the API server of the ACK cluster fails.

Check whether the API server is accessible or available. For more information, see ACK console troubleshooting (cluster access exceptions).

RecommendEmpty.DiskTypeNoStock

The inventory of disks is insufficient in the specified zone.

Specify more vSwitches for the node pool or select another disk type.

InvalidParameter.KMSKeyId.KMSUnauthorized

You do not have the permissions to access KMS.

Log on to the ECS console and assign the AliyunECSDiskEncryptDefaultRole role to ECS. For more information, see Grant access to KMS keys through RAM roles.

InvalidParameter.Conflict

The error message "The specified disk category (xxxx) is not support the specified instance type." indicates that the current instance type does not support the specified disk type.

Select another instance type or disk type.

NotSupportSnapshotEncrypted.DiskCategory

System disk encryption supports only enhanced SSDs (ESSDs).

Select another disk type. For more information about disk types and disk encryption, see Create a node pool.

ScalingActivityInProgress

The node pool is already being scaled. Try again later.

To avoid scaling conflicts, do not scale node pools in the Auto Scaling console.

Instance.StartInstanceFailed

The ECS instances fail to start up.

Try again later. To troubleshoot this issue, submit a ticket to the ECS team.

OperationDenied.NoStock

The current ECS instance type is out of stock in the specified zone.

Select another instance type.

The scalability of a node pool dynamically changes based on the stock of ECS instances, which affects the success rate of node pool scaling. For more information, see Check the scalability of a node pool.

RecommendEmpty.InstanceTypeNoStock

The current ECS instance type is out of stock in the specified zone.

Select another instance type.

The scalability of a node pool dynamically changes based on the stock of ECS instances, which affects the success rate of node pool scaling. For more information, see Check the scalability of a node pool.

NodepoolScaleFailed.WaitForDesiredSizeTimeout

The scale-out task times out.

Perform the following steps to view the task details:

  1. Log on to the ACK console. In the left-side navigation pane, click Clusters.

  2. On the Clusters page, find the cluster that you want to manage and click its name. In the left-side navigation pane, choose Nodes > Node Pools.

  3. Click the name of the node pool that you want to manage and click the Scaling Activities tab to view the details of the scale-out task.

ApiServer.TooManyRequests

The task is throttled by the Kubernetes API server of the cluster.

Reduce the request frequency or try again later.

NodepoolScaleFailed.PartialSuccess

Some nodes failed to be created due to insufficient inventory.

Change the instance types used by the node pool and then try again.

The scalability of a node pool dynamically changes based on the stock of ECS instances, which affects the success rate of node pool scaling. For more information, see Check the scalability of a node pool.
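If you handle scaling failures in a script, some of the error codes in the preceding table can be grouped by the kind of action that the table suggests. The following sketch is one possible grouping. The group names and the `suggest_action` helper are hypothetical, only a subset of the codes is included, and the mapping should be checked against the table before it is relied on.

```python
# Hypothetical grouping of a subset of the error codes by suggested action.
ACTION_BY_ERROR_CODE = {
    # Insufficient inventory: change the instance type, disk type, or zones.
    "RecommendEmpty.InstanceTypeNoStock": "change_instance_type_or_zone",
    "OperationDenied.NoStock": "change_instance_type_or_zone",
    "RecommendEmpty.DiskTypeNoStock": "change_instance_type_or_zone",
    "NodepoolScaleFailed.PartialSuccess": "change_instance_type_or_zone",
    # Insufficient account balance: top up the account.
    "InvalidAccountStatus.NotEnoughBalance": "top_up_account",
    "InsufficientBalance.CreditPay": "top_up_account",
    "Account.Arrearage": "top_up_account",
    # API server not reachable: check cluster access.
    "ApiServer.InternalError": "check_api_server",
    "Err.QueryEndpoints": "check_api_server",
    # Transient conditions: try again later.
    "ScalingActivityInProgress": "retry_later",
    "ApiServer.TooManyRequests": "retry_later",
    "Instance.StartInstanceFailed": "retry_later",
}


def suggest_action(error_code: str) -> str:
    """Return the suggested action group, or a fallback for codes not listed here."""
    return ACTION_BY_ERROR_CODE.get(error_code, "view_task_details")


if __name__ == "__main__":
    print(suggest_action("Account.Arrearage"))
    print(suggest_action("InvalidParameter.NotMatch"))
```

Codes that are not in the mapping fall back to viewing the task details on the Cluster Tasks tab.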

References

  • For more information about the operations and precautions when you remove nodes from a node pool, see Remove a node.

  • For more information about O&M tasks for node pools, such as upgrading the node pool, auto repair, and patching OS CVE vulnerabilities for node pools, see Node pool O&M.

  • For more information about best practices for node pools, such as using a deployment set to distribute your ECS instances to different physical servers to ensure high availability and preemptible instance-based node pools, see Best practices for nodes and node pools.