Container Service for Kubernetes (ACK) is a managed service that you can use to run Kubernetes without managing the technical architecture and key components of Kubernetes yourself. This means that you no longer have to worry about misconfigured control plane components causing service downtime or interruptions. We recommend that you read this topic to fully understand the risks that may arise when you use ACK.
Usage notes
Data plane components
Data plane components are system components that run on your Elastic Compute Service (ECS) instances, such as CoreDNS, Ingress, kube-proxy, Terway, and kubelet. Because these components run on nodes that you control, you and ACK share responsibility for ensuring the stability of data plane components.
ACK provides the following features for data plane components:
Management and maintenance capabilities including custom component configurations, periodic component optimization, bug fixes, Common Vulnerabilities and Exposures (CVE) patches, and the relevant documentation.
Observability into components by providing monitoring and alerting capabilities and generating log files for key components, which are obtainable through Simple Log Service (SLS).
Best practices and suggestions for component configurations based on the size of the cluster in which the components are deployed.
Periodic component inspection and alerting. The inspection items include but are not limited to component versions, component configurations, component loads, component topology, and the number of component pods.
We recommend that you follow these suggestions when you use data plane components:
Use the latest component version. New releases may contain bug fixes and new features. When a new component version is released, choose an appropriate time to update your components based on the instructions provided in the user guide. This helps prevent issues that may be caused by outdated components. For more information, see Component overview.
Specify the email addresses and mobile phone numbers of alert contacts in the alert center of ACK. Then, specify the notification methods. Alibaba Cloud can then use the specified notification methods to send alerts and notifications. For more information, see Alert management.
When you receive alerts that notify you of component stability risks, follow the instructions to mitigate the risks at the earliest opportunity.
If you want to configure custom component parameters, we recommend that you call the ACK API or modify the parameters in the ACK console. Custom component parameters that are modified by using other methods may cause the components to malfunction. For more information, see Manage components.
Do not use the APIs of Infrastructure as a Service (IaaS) services to modify the environment of data plane components. For example, do not use the ECS API to change the status of the ECS instances on which data plane components run or to modify the security groups or network settings of the worker nodes, and do not use the Server Load Balancer (SLB) API to modify the configurations of the SLB instances that are used by your cluster. Improper changes to IaaS resources may cause the components to malfunction.
Some data plane components may inherit bugs or vulnerabilities from their open source versions. We recommend that you update your components when ACK provides updated versions to ensure the stability of your business.
Cluster update
Use the cluster update feature of ACK to update the Kubernetes versions of your ACK clusters. Other methods may cause stability or compatibility issues. For more information, see Manually upgrade ACK clusters.
ACK provides the following features to support cluster updates:
Version updates for ACK clusters.
Pre-update checks to ensure that an ACK cluster meets the conditions for version updates.
Release notes that describe new Kubernetes versions and compare new versions with earlier versions.
Pre-update notifications that inform you of the risks that may arise due to resource changes caused by version updates.
We recommend that you follow these suggestions when you use the cluster update feature:
Before you perform an update, we recommend that you perform a precheck and fix the identified issues.
Read and understand the release notes of new Kubernetes versions. Check the status of your cluster and workloads based on the update risks that are reported by ACK. Then, evaluate the impacts of updating the cluster. For more information, see Overview of Kubernetes versions supported by ACK.
You cannot roll back cluster updates. Before you update a cluster, prepare for the update and make sure that you have a backup plan.
Update your cluster to the latest Kubernetes version before the Kubernetes version that is used by your cluster is deprecated by ACK. For more information, see Support for Kubernetes versions.
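As a rough aid for planning updates, the sketch below compares a cluster's Kubernetes minor version with the newest version that your platform supports, to estimate how far behind the cluster is. The version strings are illustrative placeholders, not values taken from ACK's release schedule.

```python
# Sketch: estimate how many minor releases a cluster lags behind the
# newest supported Kubernetes version. Versions here are examples only.

def minor_version(v: str) -> tuple:
    """Parse 'v1.28.3' or '1.28.3' into (major, minor)."""
    parts = v.lstrip("v").split(".")
    return int(parts[0]), int(parts[1])

def minors_behind(cluster: str, latest: str) -> int:
    """Number of minor releases between the cluster and the latest version."""
    cmaj, cmin = minor_version(cluster)
    lmaj, lmin = minor_version(latest)
    if cmaj != lmaj:
        raise ValueError("major version mismatch; check the release notes")
    return lmin - cmin

# A cluster on 1.26 when 1.30 is the latest supported version lags by
# four minor releases and should be scheduled for staged updates.
print(minors_behind("v1.26.3", "v1.30.1"))
```

Because cluster updates cannot be rolled back, a check like this is only a planning aid; the actual update path must follow the version-by-version instructions in the release notes.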
Kubernetes configurations
Do not change key Kubernetes configurations. For example, do not change the following directories or modify the paths, links, and content of the files in the directories:
/var/lib/kubelet
/var/lib/docker
/etc/kubernetes
/etc/kubeadm
/var/lib/containerd
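To help detect unexpected local changes to these directories, a generic sketch like the following can snapshot file checksums and report differences. This is a read-only illustration of the idea, not an ACK tool; on a node you would point it at paths such as /etc/kubernetes.

```python
# Sketch: snapshot SHA-256 digests of files under a directory so that
# unexpected modifications to key Kubernetes directories can be noticed.
import hashlib
from pathlib import Path

def snapshot(root: str) -> dict:
    """Map each file's relative path under root to its SHA-256 digest."""
    base = Path(root)
    return {
        str(p.relative_to(base)): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(base.rglob("*"))
        if p.is_file()
    }

def diff(before: dict, after: dict) -> list:
    """Relative paths whose content changed, appeared, or disappeared."""
    keys = set(before) | set(after)
    return sorted(k for k in keys if before.get(k) != after.get(k))
```

Running `snapshot("/etc/kubernetes")` at two points in time and diffing the results flags files that were modified in between; it does not tell you whether a change was legitimate, only that one occurred.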
Do not use the annotations that are reserved by Kubernetes in YAML templates. Otherwise, your application may fail to locate resources or send requests, and may behave abnormally. Labels and annotations prefixed with kubernetes.io/ or k8s.io/ are reserved for key components. Example: pv.kubernetes.io/bind-completed: "yes".
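A quick pre-deployment check for reserved prefixes might look like the following sketch. The helper names are my own, not part of any ACK or Kubernetes tooling; the prefix rule follows the Kubernetes convention that kubernetes.io/ and k8s.io/ (including subdomains such as pv.kubernetes.io/) are reserved.

```python
# Sketch: flag labels or annotations in a manifest's metadata whose key
# prefix is reserved by Kubernetes (kubernetes.io/ or k8s.io/, including
# subdomains such as pv.kubernetes.io/).

RESERVED_DOMAINS = ("kubernetes.io", "k8s.io")

def is_reserved(key: str) -> bool:
    """True if the key's prefix is a reserved Kubernetes domain."""
    if "/" not in key:
        return False  # unprefixed keys are never reserved
    prefix = key.split("/", 1)[0]
    return any(prefix == d or prefix.endswith("." + d) for d in RESERVED_DOMAINS)

def reserved_keys(metadata: dict) -> list:
    """All reserved label and annotation keys in a metadata mapping."""
    keys = list(metadata.get("labels", {})) + list(metadata.get("annotations", {}))
    return [k for k in keys if is_reserved(k)]
```

For example, `is_reserved("pv.kubernetes.io/bind-completed")` is true, while a vendor-owned key such as `example.com/team` is not flagged.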
ACK Serverless clusters
To simplify cluster O&M, ACK Serverless clusters provide fully managed system components, which ACK deploys and maintains after you enable them. ACK Serverless clusters are not eligible for compensation in the following scenario: ACK is not liable for business losses caused by user errors, such as accidental deletion of the Kubernetes resources used by the fully managed system components, and no compensation is provided in such cases.
Cluster registration
When you register an external Kubernetes cluster with ACK in the ACK console, make sure that the network connectivity between the cluster and Alibaba Cloud is stable.
ACK allows you to register external Kubernetes clusters but does not ensure the stability of the external clusters and cannot prevent accidental operations on these clusters. Proceed with caution when you configure labels, annotations, and tags for the nodes in an external cluster by using the cluster registration proxy. Improper configurations may cause applications to malfunction.
App catalogs
The App Marketplace of ACK provides the app catalog feature to help you install applications that are built on open source software. ACK cannot prevent defects in these open source applications. Proceed with caution when you install them. For more information, see App Marketplace.
High-risk operations
The following operations are considered high-risk operations in ACK. Improper execution may cause stability issues and, in severe cases, cluster failures. Read and understand the impacts of the following high-risk operations before you perform them:
High-risk operations on clusters
| Category | High-risk operation | Impact | How to recover |
| --- | --- | --- | --- |
| API server | Delete the SLB instance that is used to expose the API server. | You cannot manage the cluster. | Unrecoverable. You must create a new cluster. For more information, see Create an ACK managed cluster. |
| Worker nodes | Modify the security group of nodes. | The nodes may become unavailable. | Add the nodes back to the original security group, which is created when you create the cluster. For more information, see Manage ECS instances in security groups. |
| | The subscriptions of nodes expire or nodes are removed. | The nodes become unavailable. | Unrecoverable. |
| | Reinstall the node OS. | Components are uninstalled from the nodes. | Remove the nodes and then add them to the cluster again. For more information, see Remove nodes and Add existing ECS instances to an ACK cluster. |
| | Update component versions. | The nodes may become unavailable. | Roll back to the original component versions. |
| | Change the IP addresses of nodes. | The nodes become unavailable. | Change the IP addresses of the nodes back to the original IP addresses. |
| | Modify the parameters of key components, such as kubelet, docker, and containerd. | The nodes may become unavailable. | Refer to the ACK official documentation and configure the component parameters. |
| | Modify node OS configurations. | The nodes may become unavailable. | Restore the configurations, or remove the worker nodes and then purchase new nodes. |
| | Modify the system time of nodes. | The components on the nodes do not work as expected. | Reset the system time of the nodes. |
| Master nodes in ACK dedicated clusters | Modify the security group of master nodes. | The master nodes may become unavailable. | Add the master nodes back to the original security group, which is created when you create the cluster. For more information, see Manage ECS instances in security groups. |
| | The subscriptions of master nodes expire or master nodes are removed. | The master nodes become unavailable. | Unrecoverable. |
| | Reinstall the node OS. | Components are uninstalled from the master nodes. | Unrecoverable. |
| | Update master nodes or the etcd component. | The cluster may become unavailable. | Roll back to the original component versions. |
| | Delete or format the directories that store business-critical data on nodes, such as /etc/kubernetes. | The master nodes become unavailable. | Unrecoverable. |
| | Change the IP addresses of master nodes. | The master nodes become unavailable. | Change the IP addresses of the master nodes back to the original IP addresses. |
| | Modify the parameters of key components, such as etcd, kube-apiserver, and docker. | The master nodes may become unavailable. | Refer to the ACK official documentation and configure the component parameters. |
| | Replace the certificates of master nodes or the etcd component. | The cluster may become unavailable. | Unrecoverable. |
| | Increase or decrease the number of master nodes. | The cluster may become unavailable. | Unrecoverable. |
| | Modify the system time of nodes. | The components on the nodes do not work as expected. | Reset the system time of the nodes. |
| Other services | Use Resource Access Management (RAM) to modify permissions. | Resources such as SLB instances may fail to be created. | Restore the permissions. |
High-risk operations on node pools
| High-risk operation | Impact | How to recover |
| --- | --- | --- |
| Delete scaling groups. | Node pool exceptions occur. | Unrecoverable. You must create new node pools. For more information about how to create a node pool, see Procedure. |
| Use kubectl to remove nodes from a node pool. | The number of nodes in the node pool that is displayed in the ACK console is different from the actual number. | Remove nodes in the ACK console, by calling the ACK API, or by configuring the Expected Nodes parameter of the node pool. For more information, see Remove nodes and Create a node pool. |
| Manually release ECS instances. | Incorrect information may be displayed on the node pool details page. A node pool is configured with the Expected Nodes parameter when the node pool is created. After you release the ECS instances, ACK automatically scales out the node pool to the value of the Expected Nodes parameter. | Unrecoverable. To release ECS instances in a node pool, configure the Expected Nodes parameter of the node pool in the ACK console or by calling the ACK API. You can also remove the nodes that are deployed on the ECS instances. For more information, see Create a node pool and Remove nodes. |
| Manually scale in or scale out a node pool that has auto scaling enabled. | The auto scaling component automatically adjusts the number of nodes in the node pool after you manually scale in or scale out the node pool. | Unrecoverable. You do not need to manually scale a node pool that has auto scaling enabled. |
| Change the upper or lower limit of instances that a scaling group can contain. | Scaling errors may occur. | |
| Add existing nodes to a cluster without backing up the data on the nodes. | The data on the nodes is lost after the nodes are added to the cluster. | Unrecoverable. |
| Store business-critical data on the system disk. | If you enable auto repair for a node pool, the system may handle node exceptions by resetting node configurations. As a result, data on the system disk is lost. | Unrecoverable. Store business-critical data on data disks, cloud disks, File Storage NAS (NAS) file systems, or Object Storage Service (OSS) buckets. |
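To adjust node counts declaratively instead of releasing ECS instances by hand, you would set the expected node count through the ACK API. The sketch below only builds the request body; the field names follow my understanding of the ModifyClusterNodePool API and should be verified against the current API reference before use.

```python
# Sketch: build the JSON body for a declarative node-count change.
# Field names (scaling_group.desired_size) are my assumption about the
# ACK ModifyClusterNodePool API; verify against the API reference.
import json

def desired_size_body(desired: int) -> str:
    """Request body that declares the expected node count of a node pool."""
    if desired < 0:
        raise ValueError("desired node count cannot be negative")
    return json.dumps({"scaling_group": {"desired_size": desired}})

print(desired_size_body(3))
```

The point of the declarative approach is that ACK reconciles the pool toward the declared size, instead of fighting your manual ECS releases by scaling the pool back out.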
High-risk operations on networks and load balancing
| High-risk operation | Impact | How to recover |
| --- | --- | --- |
| Specify the following kernel parameter setting: | Network connectivity issues occur. | Replace the setting with the following content: |
| Specify the following kernel parameter settings: | Network connectivity issues occur. | Replace the settings with the following content: |
| Specify the following kernel parameter setting: | Pods fail to pass health checks. | Replace the setting with the following content: |
| Specify the following kernel parameter setting: | Network address translation errors occur. | Replace the setting with the following content: |
| Specify the following kernel parameter setting: | Network connectivity issues occasionally occur. | Replace the setting with the following content: |
| Install firewall software, such as firewalld or ufw. | The container network becomes inaccessible. | Uninstall the firewall software and restart the nodes. |
| The security group of a node does not open UDP port 53 to the pod CIDR block. | DNS cannot work as expected in the cluster. | Refer to the ECS official documentation and modify the security group configuration to open UDP port 53 to the pod CIDR block. |
| Modify or delete the tags that ACK adds to SLB instances. | The SLB instances do not work as expected. | Restore the tags. |
| Modify the configurations of the SLB instances that are managed by ACK, including the configurations of the instances, listeners, and vServer groups. | The SLB instances do not work as expected. | Restore the SLB configurations. |
| Remove the | The SLB instances do not work as expected. | Add the annotation to the Service configuration. Note: If a Service is configured to use an existing SLB instance, you cannot modify the configuration to create a new SLB instance for the Service. To use a new SLB instance, you must create a new Service. |
| Manually delete the | The NGINX Ingress controller does not run as expected or may stop running. | Create a Service that has the same name as the deleted Service. |
| Configure the | If the DNS server is not configured properly, DNS resolution may fail. As a result, the cluster cannot run as expected. | If you want to use a self-managed DNS server, we recommend that you configure the DNS server in CoreDNS. For more information, see Configure CoreDNS. |
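The exact kernel parameter values belong in the table above and are omitted here; before changing sysctl settings on a node, it helps to compare the current values against a site-specific baseline. The sketch below is generic: the two expected values shown are illustrative, well-known Kubernetes networking requirements, and your authoritative baseline should come from your own node configuration, not from this sketch.

```python
# Sketch: compare kernel parameters against a site-specific baseline
# before applying changes. The EXPECTED values are illustrative only.

EXPECTED = {
    "net.ipv4.ip_forward": "1",                   # required for pod networking
    "net.bridge.bridge-nf-call-iptables": "1",    # bridged traffic visible to iptables
}

def violations(current: dict) -> list:
    """Parameters whose current value differs from the baseline (or is missing)."""
    return sorted(k for k, v in EXPECTED.items() if current.get(k) != v)
```

On a real node, `current` would be built by parsing the output of `sysctl -a`; here it is passed in as a plain mapping so the check itself stays testable.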
High-risk operations on storage
| High-risk operation | Impact | How to recover |
| --- | --- | --- |
| Unmount cloud disks that are mounted to pods in the ECS console. | I/O errors occur when you write data to the pods. | Restart the pods and clear residual data on the nodes. |
| Unmount disks from their mount paths on nodes. | Pod data is written to local disks. | Restart the pods. |
| Manage cloud disks on the nodes. | Pod data is written to local disks. | Unrecoverable. |
| Mount a cloud disk to multiple pods. | Pod data is written to local disks, or I/O errors occur when you write data to the pods. | Mount the cloud disk to only one pod. Important: Alibaba Cloud disks cannot be shared. Each disk can be mounted to only one pod. |
| Manually delete the NAS directories that are mounted to pods. | I/O errors occur when you write data to the pods. | Restart the pods. |
| Delete the NAS file systems that are mounted to pods, or delete the mount targets that are used to mount NAS file systems. | I/O hangs occur when you write data to the pods. | Restart the ECS instances. For more information about how to restart an ECS instance, see Restart an instance. |
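Because an Alibaba Cloud disk can be attached to only one pod at a time, a disk-backed PersistentVolumeClaim should request the ReadWriteOnce access mode. The manifest below is an illustrative sketch: the claim name, storage class, and size are placeholders rather than values from this document, and the storage class should be verified against the classes available in your cluster.

```yaml
# Illustrative only: names and the storage class are placeholders.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: data-claim                        # hypothetical name
spec:
  accessModes:
    - ReadWriteOnce                       # a cloud disk cannot be multi-attached
  storageClassName: alicloud-disk-essd    # verify against your cluster
  resources:
    requests:
      storage: 20Gi
```

Workloads that genuinely need shared access from multiple pods should use NAS file systems or OSS buckets instead of a single cloud disk.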
High-risk operations on logs
| High-risk operation | Impact | How to recover |
| --- | --- | --- |
| Delete the /tmp/ccs-log-collector/pos directory on a node. | Duplicate logs are collected. | Unrecoverable. The /tmp/ccs-log-collector/pos directory contains information about the progress of log collection. |
| Delete the /tmp/ccs-log-collector/buffer directory on a node. | Logs are lost. | Unrecoverable. The /tmp/ccs-log-collector/buffer directory stores cached log files that have not yet been consumed. |
| Delete the aliyunlogconfig CustomResourceDefinition (CRD) objects. | Logs cannot be collected. | Recreate the deleted aliyunlogconfig CRD objects and the related resources. Deleting the aliyunlogconfig CRD objects also deletes the related log collection tasks, so you must relaunch the tasks after you recreate the objects. Logs that are generated while the aliyunlogconfig CRD objects do not exist cannot be collected. |
| Uninstall logging components. | Logs cannot be collected. | Reinstall the logging component and manually recreate the aliyunlogconfig CRD objects. Uninstalling the logging component also deletes the aliyunlogconfig CRD objects and Logtail. Logs that are generated while the logging component and the aliyunlogconfig CRD objects do not exist cannot be collected. |
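If you need to recreate a deleted collection configuration, an aliyunlogconfig object looks roughly like the sketch below. The field layout reflects my understanding of the Simple Log Service CRD, and all names are placeholders; confirm the schema against the current SLS and ACK documentation before applying it.

```yaml
# Illustrative sketch of an aliyunlogconfig object; verify the schema
# against the current Simple Log Service documentation before use.
apiVersion: log.alibabacloud.com/v1alpha1
kind: AliyunLogConfig
metadata:
  name: app-stdout-config          # hypothetical name
spec:
  logstore: app-stdout             # hypothetical Logstore
  logtailConfig:
    inputType: plugin              # collect container stdout/stderr
    configName: app-stdout-config
    inputDetail:
      plugin:
        inputs:
          - type: service_docker_stdout
            detail:
              Stdout: true
              Stderr: true
```

Recreating the object restarts collection going forward only; logs generated while the configuration was absent are not backfilled.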