This topic provides answers to some frequently asked questions (FAQ) about creating, using, and managing clusters.
Are ACK clusters that run Alibaba Cloud Linux compatible with CentOS-based container images?
Can I change the container runtime of a cluster from containerd to Docker?
What are the differences between containerd, Docker, and Sandboxed-Container?
How do I perform fine-grained authentication when I use a RAM user to manage an ACK cluster?
Can I update an ACK dedicated cluster after I accidentally delete a master node of the cluster?
Can I remove or add master nodes of an ACK dedicated cluster? What are the high-risk operations?
How do I migrate a self-managed Kubernetes cluster to ACK?
Container Service for Kubernetes (ACK) provides a migration solution that allows you to migrate self-managed Kubernetes clusters to ACK clusters without affecting your business during the migration. For more information, see Migration scheme overview.
Are ACK clusters that run Alibaba Cloud Linux compatible with container images that are based on CentOS?
Yes. ACK clusters that run Alibaba Cloud Linux are compatible with CentOS-based container images. For more information, see Alibaba Cloud Linux 3.
Can I change the container runtime of a cluster from containerd to Docker?
After a cluster is created, you cannot change the container runtime of the cluster as a whole. However, you can create node pools that use different container runtimes in the same cluster. For more information, see Create and manage node pools.
You can change the container runtime of a node from Docker to containerd. For more information, see Change the container runtime from Docker to containerd.
Clusters that run Kubernetes 1.24 or later no longer use Docker as the built-in container runtime. You can use containerd as the container runtime for clusters that run Kubernetes 1.24 or later.
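To check which container runtime each node in your cluster currently uses, you can run the following command. This is a generic kubectl query, not an ACK-specific command; the CONTAINER-RUNTIME column in the output shows the runtime and its version for each node:
kubectl get nodes -o wide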
What are the differences between containerd, Docker, and Sandboxed-Container?
Container Service for Kubernetes (ACK) supports the following container runtimes: containerd, Docker, and Sandboxed-Container. We recommend that you use containerd as the container runtime. You can use Docker as the container runtime in clusters that run Kubernetes 1.22 and earlier, and Sandboxed-Container in clusters that run Kubernetes 1.24 and earlier. For more information about the comparison of different container runtimes, see Comparison among Docker, containerd, and Sandboxed-Container. If your cluster uses Docker as the container runtime, you must change the container runtime to containerd before you can update the Kubernetes version of your cluster to 1.24 or later. For more information, see Change the container runtime from Docker to containerd.
Is ACK certified for Level 3 Cybersecurity?
You can enable security hardening based on Alibaba Cloud Linux and configure baseline check policies for your clusters to achieve Multi-Level Protection Scheme (MLPS) 2.0 Level 3 compliance. This includes configuring compliance baseline checks to ensure that your clusters meet the following compliance requirements:
Identity verification
Access control
Security auditing
Intrusion prevention
Malicious code protection
For more information, see ACK security hardening based on MLPS.
How do I collect the diagnostic data of an ACK cluster?
ACK provides the cluster diagnostics feature that you can use to diagnose clusters with a few clicks. This feature helps you troubleshoot cluster issues and node anomalies. For more information, see Work with cluster diagnostics.
You can also collect diagnostic data from master nodes and worker nodes for further analysis. The following section describes how to collect diagnostic data from Linux nodes and Windows nodes.
Worker nodes support Linux and Windows, whereas master nodes support only Linux. The following steps apply to master nodes and worker nodes that run Linux. In this example, the diagnostic data is collected from a master node:
Log on to the master node and run the following command to download a diagnostic script:
curl -o /usr/local/bin/diagnose_k8s.sh http://aliacs-k8s-cn-hangzhou.oss-cn-hangzhou.aliyuncs.com/public/diagnose/diagnose_k8s.sh
You can download the diagnostic script for Linux nodes only from the China (Hangzhou) region.
Run the following command to grant execution permissions to the diagnostic script:
chmod u+x /usr/local/bin/diagnose_k8s.sh
Run the following command to go to a specified directory:
cd /usr/local/bin
Run the following command to run the diagnostic script:
diagnose_k8s.sh
The following output is returned. Each time you run the diagnostic script, a log file with a different name is generated. In this example, the log file is named diagnose_1514939155.tar.gz. The actual file name varies.
......
+ echo 'please get diagnose_1514939155.tar.gz for diagnostics'
please get diagnose_1514939155.tar.gz for diagnostics
+ echo 'Upload diagnose_1514939155.tar.gz'
Upload diagnose_1514939155.tar.gz
Run the following command to query the log file that stores the diagnostic data:
ls -ltr | grep diagnose_1514939155.tar.gz
Replace diagnose_1514939155.tar.gz with the actual name of the generated log file.
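If you want to inspect what was collected before you upload the file, you can list the contents of the archive with a standard tar command. Replace the file name with the name of your generated log file:
tar -tzf diagnose_1514939155.tar.gz | head -n 20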
To collect diagnostic data from a Windows worker node, perform the following steps to download and run a diagnostic script:
Windows can run only on worker nodes.
Log on to an abnormal node. Open the Run dialog box, enter cmd, and then click OK to open Command Prompt.
Run the following command to switch to PowerShell:
powershell
Run the following command to download and run a diagnostic script:
The diagnostic script for a Windows node can be downloaded only from the region in which the node resides. Replace [$Region_ID] in the command with the actual region ID of the node.
Invoke-WebRequest -UseBasicParsing -Uri http://aliacs-k8s-[$Region_ID].oss-[$Region_ID].aliyuncs.com/public/pkg/windows/diagnose/diagnose.ps1 | Invoke-Expression
If the following output is returned, the diagnostic data of the node is collected.
INFO: Compressing diagnosis clues ... INFO: ...done INFO: Please get diagnoses_1514939155.zip for diagnostics
The diagnoses_1514939155.zip file is stored in the directory where the diagnostic script is run.
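To confirm that the archive was generated, you can list it with a standard PowerShell command in the directory where the script was run. The file name varies with each run:
Get-ChildItem -Filter diagnoses_*.zip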
How do I troubleshoot ACK cluster issues?
1. Check cluster nodes
Run the following command to check whether all cluster nodes are in the Ready state:
kubectl get nodes
Run the following command to query the details and events of a node:
Replace [$NODE_NAME] in the command with the actual node name.
kubectl describe node [$NODE_NAME]
For more information about the kubectl output, see Node status.
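If the cluster contains many nodes, you can show only the nodes that are not in the Ready state. This is a generic combination of kubectl and grep, not an ACK-specific command; the -w option prevents the NotReady status from matching the Ready pattern:
kubectl get nodes | grep -vw Ready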
2. Check cluster components
If all cluster nodes run as expected, check the logs of cluster components.
Run the following command to view all components in the kube-system namespace:
kubectl get pods -n kube-system
In the expected output, all pods are in the Running state.
Components whose names start with kube- are system components. Components whose names start with coredns- are the CoreDNS components that provide DNS services within the cluster. If all components are in the Running state, the cluster components run as expected. If a component does not run as expected, perform the following step.
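To list only the pods in the kube-system namespace that are not in the Running state, you can use a field selector. This is a generic kubectl option, not an ACK-specific command:
kubectl get pods -n kube-system --field-selector status.phase!=Running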
Run the following command to query the log of a component:
Replace [$Component_Name] in the command with the actual component name.
kubectl logs -f [$Component_Name] -n kube-system
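If the component has recently crashed and restarted, the current log may not contain the error. In that case, you can view the log of the previous container instance by using the standard --previous flag:
kubectl logs --previous [$Component_Name] -n kube-system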
3. Check the kubelet
Run the following command to view the status of the kubelet:
systemctl status kubelet
If the kubelet is not in the Active state, run the following command to view the kubelet log. Identify and resolve issues based on the log.
journalctl -u kubelet
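The kubelet log can be long. To narrow it down, you can limit the time range and show only error-level messages by using standard journalctl options:
journalctl -u kubelet --since "1 hour ago" -p err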
Common cluster issues
The following table describes common issues and solutions for ACK clusters.
| Issue | Solution |
| --- | --- |
| The API server or a component on the master node stops. As a result, cluster management operations may fail. | The components of ACK support high availability. We recommend that you check whether the components are abnormal. For example, the API server of an ACK cluster uses a Classic Load Balancer (CLB) instance. You can check why your CLB instance is abnormal. |
| The backend data of the API server is lost. | If you created a snapshot before the issue occurred, you can restore data from the snapshot to resolve the issue. If no snapshot was created in advance, contact us for technical support. To prevent this issue, create snapshots on a regular basis. |
| A node fails and all pods on the node stop running. | Create pods by using workloads such as Deployments, StatefulSets, and DaemonSets, as shown in the example after this table. Do not directly create standalone pods. Otherwise, the system cannot reschedule the pods to healthy nodes. |
| The kubelet fails. As a result, pods on the node may not run as expected. | Check the kubelet as described in the preceding section: run systemctl status kubelet to view its status, and run journalctl -u kubelet to identify and resolve issues based on the log. |
| Other issues such as invalid configurations. | If you created a snapshot before the issue occurred, you can restore data from the snapshot to resolve the issue. If no snapshot was created, contact us for technical support. Create snapshots for the volumes managed by the kubelet on a regular basis. For more information, see Create a snapshot of a disk volume. |
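As referenced in the node-failure row of the table, the following is a minimal sketch of a Deployment that manages pods instead of creating them directly. All names and the image are hypothetical placeholders; when a node fails, the Deployment controller recreates the affected replicas on healthy nodes:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-demo              # hypothetical name for illustration
spec:
  replicas: 2                   # two replicas so a single node failure does not stop the workload
  selector:
    matchLabels:
      app: nginx-demo
  template:
    metadata:
      labels:
        app: nginx-demo
    spec:
      containers:
      - name: nginx
        image: nginx:1.25       # placeholder image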
How do I perform fine-grained authentication when I use a RAM user to manage an ACK cluster?
By default, a RAM user or RAM role does not have the permissions to use OpenAPI. To use ACK and manage ACK clusters, you must grant the RAM user or RAM role the AliyunCSFullAccess permissions or the required custom permissions. For more information, see Use RAM to authorize access to clusters and cloud resources.
In addition, you must use the Kubernetes role-based access control (RBAC) mechanism to manage the operation permissions on resources in a cluster. This way, RAM users can manage the internal resources of the cluster, such as creating Deployments and Services.
For scenarios that require fine-grained control over read and write permissions on resources, you can configure custom ClusterRoles and Roles, as shown in the sketch after this paragraph. For more information, see Use custom RBAC roles to restrict resource operations in a cluster.
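The following is a minimal sketch of such a custom Role that grants read-only access to pods in a single namespace, together with a RoleBinding that assigns it to a RAM user mapped into the cluster. All names, the namespace, and the user ID are hypothetical placeholders:
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: pod-reader              # hypothetical Role name
  namespace: demo               # hypothetical namespace
rules:
- apiGroups: [""]
  resources: ["pods"]
  verbs: ["get", "list", "watch"]   # read-only verbs
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: pod-reader-binding      # hypothetical RoleBinding name
  namespace: demo
subjects:
- kind: User
  name: "266531******"          # placeholder for the RAM user ID
  apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: pod-reader
  apiGroup: rbac.authorization.k8s.io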
When you use a RAM user to access the console, you must configure the corresponding cloud service permissions. For example, you can view the scaling activities of node pools and view the cluster monitoring dashboard in the console. For more information, see Required permissions for the ACK console.
What CIDR blocks do I need to configure in the SLB ACLs to allow access to the API server of an ACK cluster?
You must configure the access control lists (ACLs) of the Server Load Balancer (SLB) instance of the API server to allow access from the following CIDR blocks:
The control plane CIDR block of Container Service for Kubernetes: 100.104.0.0/16.
The primary CIDR block and the secondary CIDR blocks (if any) of the virtual private cloud (VPC) where the cluster resides, or the vSwitch CIDR block of the nodes in the cluster.
The public CIDR blocks used by clients that need to access the CLB instance of the API server.
The public CIDR blocks used by edge nodes if your cluster is an ACK Edge cluster.
The virtual private domain (VPD) CIDR blocks if your cluster is an ACK Lingjun cluster.
For more information, see Configure network ACLs for the API server of an ACK cluster.
Can I use Istio in ACK clusters?
Yes. You can use Alibaba Cloud Service Mesh (ASM) in ACK clusters. ASM is a service mesh that is fully compatible with open-source Istio and offers a fully managed control plane, which allows you to focus on developing and deploying your applications. ASM supports multiple operating systems for ACK nodes and different network plug-ins deployed within the cluster. You can add an existing ACK cluster to an ASM instance to use features provided by ASM, such as traffic management, fault handling, centralized monitoring, and log management. For more information, see Add a cluster to an ASM instance. For more information about the billing of ASM, see Billing rules.
How do I connect to master nodes?
ACK dedicated cluster: You can connect to the master nodes of an ACK dedicated cluster by using SSH.
ACK managed cluster: In an ACK managed cluster, the control plane nodes are fully managed, and you cannot log on to them. If you need to log on to a control plane node, create an ACK dedicated cluster instead.
Can I update an ACK dedicated cluster after I accidentally delete a master node of the cluster?
No. After a master node of an ACK dedicated cluster is deleted, you cannot add another master node or update the Kubernetes version of the cluster. In this case, you can only create a new cluster. Note that ACK dedicated clusters are discontinued.
Can I remove or add master nodes of an ACK dedicated cluster? What are the high-risk operations?
No, you cannot remove or add master nodes of an ACK dedicated cluster. If you change the number of master nodes, the cluster may become unavailable and cannot be recovered.
If you perform an incorrect operation on a master node of an ACK dedicated cluster, the master node or the cluster may become unavailable. High-risk operations include changing the master or etcd certificates, modifying core components, deleting or formatting data in core directories such as /etc/kubernetes on the nodes, and reinstalling the operating system. For more information, see High-risk operations on clusters.
What do I do if the API server of an ACK dedicated cluster returns the error "api/v1/namespaces/xxx/resourcequotes": x509: certificate has expired or is not yet valid: current time XXX is after xxx?
Issue
When you create a pod in an ACK dedicated cluster, the API server returns an error indicating that the certificate is expired, or the logs of kube-controller-manager or the events show an error about the expired certificate. The following error message is returned:
"https://localhost:6443/api/v1/namespaces/xxx/resourcequotes": x509: certificate has expired or is not yet valid: current time XXX is after XXX
"https://[::1]:6443/api/v1/namespaces/xxx/resourcequotes": x509: certificate has expired or is not yet valid: current time XXX is after XXX
Cause
In Kubernetes, the API server has a built-in certificate for its internal LoopbackClient connections. The validity period of the certificate provided by the community is one year, and the certificate is automatically rotated only when the API server pod restarts. If the Kubernetes version of your cluster has not been upgraded for more than one year, the internal certificate expires and API requests fail. For more information, see #86552.
To reduce the risk caused by the short validity period of certificates in ACK clusters that run Kubernetes 1.24 or later, the default validity period of the built-in certificates is extended to 10 years. For more information about the changes and the scope of impact, see [Product Changes] The validity period of the API server internal certificate is increased.
Solution
Log on to a master node and run the following command to query the expiration date of the LoopbackClient certificate. Replace XX.XX.XX.XX in the command with the local IP address of the master node.
curl --resolve apiserver-loopback-client:6443:XX.XX.XX.XX -k -v https://apiserver-loopback-client:6443/healthz 2>&1 |grep expire
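If you want to view the exact validity dates of the certificate, an alternative sketch (assuming openssl is installed on the node) is to request the certificate with the apiserver-loopback-client server name and print its dates. Replace XX.XX.XX.XX with the local IP address of the master node:
echo | openssl s_client -connect XX.XX.XX.XX:6443 -servername apiserver-loopback-client 2>/dev/null | openssl x509 -noout -dates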
For ACK clusters whose LoopbackClient certificates are about to expire or have expired (one-year validity period), upgrade the clusters to Kubernetes 1.24 or later. For more information, see Manually upgrade ACK clusters. We recommend that you migrate to an ACK managed Pro cluster. For more information, see Hot migration from ACK dedicated clusters to ACK managed Pro clusters.
For ACK dedicated clusters that cannot be upgraded, log on to all master nodes and manually restart the API server to generate new certificates.
containerd node
crictl pods |grep kube-apiserver- | awk '{print $1}' | xargs -I '{}' crictl stopp {}
Docker node
docker ps | grep kube-apiserver- | awk '{print $1}' | xargs -I '{}' docker restart {}