FAQ about adding edge nodes and solutions - Container Service for Kubernetes

This topic provides answers to some frequently asked questions (FAQ) about using edge nodes in ACK Edge clusters.

How do ACK Edge components distinguish cloud nodes from edge nodes?

ACK Edge determines whether a node is an edge node based on the alibabacloud.com/is-edge-worker label of the node.

If a node is added to a cloud node pool or an edge node pool, the is-edge-worker label is automatically added to the node. If the value of the is-edge-worker label of a node is true, the node is an edge node. If the value of the label is false, the node is a cloud node.
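To see how nodes in a cluster are classified, you can query the label with kubectl. The following is a minimal sketch, assuming kubectl is installed and configured for the ACK Edge cluster. The helper name list_nodes_by_role is hypothetical; it only wraps standard kubectl label selectors.

```shell
# Hypothetical helper: list nodes by role using the is-edge-worker label.
# Assumes kubectl is installed and points at the ACK Edge cluster.
list_nodes_by_role() {
  case "$1" in
    edge)  kubectl get nodes -l alibabacloud.com/is-edge-worker=true ;;
    cloud) kubectl get nodes -l alibabacloud.com/is-edge-worker=false ;;
    *)     echo "usage: list_nodes_by_role edge|cloud" >&2; return 2 ;;
  esac
}
```

For example, `list_nodes_by_role edge` lists only edge nodes. To show the label value as a column for every node, run `kubectl get nodes -L alibabacloud.com/is-edge-worker`.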

How do I add edge nodes to a node pool over an Express Connect circuit?

Take note of the following requirements when you add edge nodes in ACK Edge clusters to a node pool over an Express Connect circuit. For more information, see Special configurations of ACK Edge clusters when Express Connect circuits are used.

When you create an edge node pool, set the node pool type to dedicated. Then, refer to Add an edge node to generate a script used to add the edge node to the dedicated edge node pool.
For more information about dedicated edge node pools, see Create an edge node pool.
Note
If the ACK Edge cluster runs Kubernetes 1.22 or later, you cannot specify the inDedicatedNetwork parameter in the script to add edge nodes to the node pool over an Express Connect circuit. Use a dedicated edge node pool as described above. If the cluster runs a version earlier than 1.22, upgrade the cluster first.
When you add edge nodes to a node pool over an Express Connect circuit, the edge nodes need to communicate with Alibaba Cloud services over private addresses. Make sure that the edge nodes are connected to the relevant Alibaba Cloud services, such as Object Storage Service (OSS), Container Registry, and Server Load Balancer (SLB).

How do I add GPU-accelerated nodes to a node pool?

You must first install the GPU driver.
For more information about the supported driver versions, see NVIDIA driver versions supported by ACK.

You must configure the gpuVersion parameter in the script used to connect the node to the cloud. The following GPU models are supported:

Architecture	GPU model	ACK Edge version
AMD64/x86_64	Nvidia_Tesla_T4	1.16.9-aliyunedge.1 and later
AMD64/x86_64	Nvidia_Tesla_P4	1.16.9-aliyunedge.1 and later
AMD64/x86_64	Nvidia_Tesla_P100	1.16.9-aliyunedge.1 and later
AMD64/x86_64	Nvidia_Tesla_V100	1.18.8-aliyunedge.1 and later
AMD64/x86_64	Nvidia_Tesla_A100	1.20.11-aliyunedge.1 and later
AMD64/x86_64	Nvidia_Tesla_A10	1.20.11-aliyunedge.1 and later
AMD64/x86_64	Nvidia_L20	1.26.3-aliyun.1 and later
AMD64/x86_64	Nvidia_L40	1.26.3-aliyun.1 and later

After you configure the parameter, the connection tool automatically installs nvidia-container-runtime. For more information, see NVIDIA Container Runtime.
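Before you generate the connection script, it can help to confirm that the value you plan to pass as gpuVersion is one of the models in the table above. The following is a rough sketch: the model names are copied from the table, the helper name is_supported_gpu_model is hypothetical, and the check does not verify the ACK Edge version requirement of each model.

```shell
# Model names copied from the table above.
SUPPORTED_GPU_MODELS="Nvidia_Tesla_T4 Nvidia_Tesla_P4 Nvidia_Tesla_P100 Nvidia_Tesla_V100 Nvidia_Tesla_A100 Nvidia_Tesla_A10 Nvidia_L20 Nvidia_L40"

# Hypothetical helper: succeed only if $1 exactly matches a listed model.
# It does not check the minimum ACK Edge version for the model.
is_supported_gpu_model() {
  for model in $SUPPORTED_GPU_MODELS; do
    [ "$model" = "$1" ] && return 0
  done
  return 1
}
```

For example, `is_supported_gpu_model Nvidia_Tesla_T4 && echo supported` prints `supported`, whereas a model that is not in the table makes the function return a non-zero status.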

How do I handle node connection script execution failures?

The following table describes how to handle script execution failures. If your issue is not listed in the table, collect the diagnostics information about the node and submit a ticket. For more information, see the How do I collect the diagnostics information about nodes in an ACK Edge cluster? section of this topic.

Error message	Cause of failure	Suggested solution
The os XXX unsupport	The operating system version of the edge node is not supported.	For more information about the supported operating system versions, see Add an edge node.
invalid nodeName	The node name is invalid.	The node name can contain lowercase letters, hyphens (-), and periods (.). The node name must be 1 to 253 characters in length. The node name cannot start with localhost.
Node route overlaps with service cidr	The route of the node conflicts with the pod CIDR block or Service CIDR block of the cluster.	Recreate the cluster and reconfigure the pod CIDR block or Service CIDR block. Make sure that these CIDR blocks do not conflict with the NameServer address and route of the node.
response error msg: TOKEN_EXPIRED	The token for connecting the node to the cloud is expired.	Generate another script to connect the node to the cloud. Check whether the system clock of the node is normal.
A node named XXX is already exist in the cluster	A node with the same name already exists in the cluster.	Remove the node from the cluster.
error run phase join-node: failed to get cluster info: failed to get cluster-info configmap, Get "https://xx.xxx.xx.xx:6443/api/v1/namespaces/kube-public/configmaps/cluster-info": dial tcp xx.xxx.xx.xx:6443: i/o timeout	The system fails to obtain the information about the cluster.	When edgeadm connects an edge node to the cluster, it must access the API server by using the IP address of the API server. Check whether the access control list (ACL) rules configured for the SLB instance of the API server block the IP address of the edge node.
error run phase join-node: Install edge-hub failed: Copy file /tmp/edge-hub to /usr/bin/edge-hub fail: open /usr/bin/edge-hub: text file busy | 40009 | 40009	Installation of edge-hub fails because the binary file for edge-hub already exists on the node.	Run the `edgeadm reset` command to clear the data on the node, and then execute the node connection script again.
error run phase post-check: timed out waiting for the condition	The system components fail to start up.	Download the edgeadm tool again and run the `edgeadm reset` command to re-install the tool. Make sure that the latest version of edgeadm is used. Check whether the edge node can access the relevant public addresses as expected. For more information about the public addresses, see Network management overview. If the issue persists, collect the diagnostics information about the node and submit a ticket. For more information, see the How do I collect the diagnostics information about nodes in an ACK Edge cluster? section of this topic.
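The invalid nodeName rule in the table can be checked before you run the connection script. The following is a minimal sketch, not the check that edgeadm itself performs. The helper name is_valid_node_name is hypothetical, and accepting digits is an assumption based on the usual Kubernetes naming convention for nodes; the table above lists only lowercase letters, hyphens, and periods.

```shell
# Hypothetical pre-check for a node name, following the rules in the table above.
# Assumption: digits are accepted in addition to lowercase letters, '-' and '.'.
# The body runs in a subshell so that the locale setting does not leak out.
is_valid_node_name() (
  LC_ALL=C   # make the a-z range match ASCII lowercase letters only
  name=$1
  # The name must be 1 to 253 characters in length.
  [ "${#name}" -ge 1 ] && [ "${#name}" -le 253 ] || return 1
  # The name cannot start with localhost.
  case "$name" in localhost*) return 1 ;; esac
  # Reject any character outside the allowed set.
  case "$name" in *[!a-z0-9.-]*) return 1 ;; esac
  return 0
)
```

For example, `is_valid_node_name edge-node-1.example` succeeds, whereas `is_valid_node_name localhost-1` and `is_valid_node_name Edge_Node` both fail.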

What do I do if an edge node fails to be upgraded when I upgrade an ACK Edge cluster?

When you upgrade an edge node pool, if the message "This node has been upgraded successfully" is not returned, troubleshoot the issue by referring to the solutions described in the following table.

Error message	Cause of failure	Suggested solution
edgeadm version xxxx does not match cluster version	The version of the upgrade tool is inconsistent with that of the cluster.	Check whether the control plane of the cluster has been upgraded. Check whether the `TARGET_CLUSTER_VERSION` parameter is correctly specified.
node has already been upgraded to xxx	The version of the node is already updated to the desired version.	If specific components on the node have not been upgraded, retain the logs and submit a ticket.
kubelet target version xxxx does not match cluster version xxxx	The version of kubelet is inconsistent with the version of the control plane of the cluster.	If the `kubelet-version` parameter is specified, check whether the value of the parameter is consistent with the version of the control plane of the cluster. If this parameter is left empty, submit a ticket.
Parameter currentVersion cann't null	An earlier version of edgeadm is used.	Check whether edgeadm of the latest version is used. You can update a cluster from Kubernetes 1.18 to 1.20, or from Kubernetes 1.20 to 1.22.
upgrade kubelet failed at phase install, recover to previous state. error run phase upgrade: xxxx	The cluster fails to be upgraded and has been automatically rolled back to the previous state. The node status is not affected.	Retain the logs and submit a ticket.
upgrade kubelet failed at phase install, recover to previous state recover kubelet failed, err: xxx error run phase upgrade: xxxx	The cluster fails to be upgraded and has been automatically rolled back to the previous state. The node status is affected.	Retain the logs and submit a ticket.

How do I collect the diagnostics information about nodes in an ACK Edge cluster?

If an exception occurs on a node in an ACK Edge cluster, perform the following steps to collect the diagnostics information about the node for data analysis:

Log on to the abnormal node in the ACK Edge cluster.

Run the following command to download the diagnostics script:

```
curl -o /usr/local/bin/diagnose_edge_node.sh https://aliacs-k8s-cn-hangzhou.oss-cn-hangzhou.aliyuncs.com/public/diagnose/diagnose_k8s.sh
```

Run the following command to make the diagnostics script executable:
```
chmod u+x /usr/local/bin/diagnose_edge_node.sh
```
Run the following command to switch to the specified directory:
```
cd /usr/local/bin/
```

Run the following command to run the diagnostics script:

```
./diagnose_edge_node.sh
```

Expected output: Each time you run the diagnostics script, a file with a different name is generated. In this example, the log file is named diagnose_1578310147.tar.gz.

```
......
+ echo 'please get diagnose_1578310147.tar.gz for diagnostics'
please get diagnose_1578310147.tar.gz for diagnostics
+ echo 'Submit the file named diagnose_1578310147.tar.gz to request technical support.'
Submit the file named diagnose_1578310147.tar.gz to request technical support.
```

Run the `ls -l` command (or its `ll` alias, if your shell defines one) to verify that the diagnostics report named diagnose_1578310147.tar.gz exists.
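Because the file name changes on every run, it can be convenient to locate the most recent report automatically. The following is a small sketch, assuming the script writes diagnose_*.tar.gz files to the directory from which it is run (/usr/local/bin in the steps above). The helper name latest_diagnose_archive is hypothetical.

```shell
# Hypothetical helper: print the most recently modified diagnostics archive.
# $1 is the directory to search; defaults to /usr/local/bin as in the steps above.
latest_diagnose_archive() {
  dir=${1:-/usr/local/bin}
  # ls -t sorts by modification time, newest first; errors are hidden
  # so that nothing is printed when no archive exists.
  ls -t "$dir"/diagnose_*.tar.gz 2>/dev/null | head -n 1
}
```

For example, `tar -tzf "$(latest_diagnose_archive)"` lists the contents of the newest report before you attach it to a ticket.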