This topic provides answers to some frequently asked questions (FAQ) about using edge nodes in ACK Edge clusters.
How do ACK Edge components distinguish cloud nodes from edge nodes?
ACK Edge determines whether a node is an edge node based on the alibabacloud.com/is-edge-worker label of the node.
If a node is added to a cloud node pool or an edge node pool, the is-edge-worker label is automatically added to the node. If the value of the is-edge-worker label of a node is true, the node is an edge node. If the value of the label is false, the node is a cloud node.
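The label rule can be checked from the command line. The following is a minimal sketch, not part of the ACK Edge tooling: the function name classify_node is hypothetical, and the kubectl commands in the comments assume you have kubectl access to the cluster and replace <node-name> with a real node name.

```shell
# Hypothetical helper: map the value of the alibabacloud.com/is-edge-worker
# label to a node type, following the rule described in this section.
classify_node() {
  case "$1" in
    true)  echo "edge" ;;
    false) echo "cloud" ;;
    *)     echo "unknown" ;;   # label missing or unexpected value
  esac
}

# Usage against a live cluster (assumes kubectl is configured):
#   # Show the label value as an extra column for every node
#   kubectl get nodes -L alibabacloud.com/is-edge-worker
#   # List only the nodes that are labeled as edge nodes
#   kubectl get nodes -l alibabacloud.com/is-edge-worker=true
#   # Classify a single node
#   value=$(kubectl get node <node-name> \
#     -o jsonpath='{.metadata.labels.alibabacloud\.com/is-edge-worker}')
#   classify_node "$value"
```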
How do I add edge nodes to a node pool over an Express Connect circuit?
Take note of the following requirements when you add edge nodes in ACK Edge clusters to a node pool over an Express Connect circuit. For more information, see Special configurations of ACK Edge clusters when Express Connect circuits are used.
When you create an edge node pool, set the node pool type to dedicated. Then, refer to Add an edge node to generate a script used to add the edge node to the dedicated edge node pool.
For more information about dedicated edge node pools, see Create an edge node pool.
Note: If the Kubernetes version of the ACK Edge cluster is 1.22 or later, you cannot specify the inDedicatedNetwork parameter in the script to add edge nodes to the node pool over an Express Connect circuit. If the version is earlier than 1.22, upgrade the version.
When you add edge nodes to a node pool over an Express Connect circuit, the edge nodes need to communicate with Alibaba Cloud services over private addresses. Make sure that the edge nodes are connected to the relevant Alibaba Cloud services, such as Object Storage Service (OSS), Container Registry, and Server Load Balancer (SLB).
How do I add GPU-accelerated nodes to a node pool?
You must first install the GPU driver.
For more information about the supported driver versions, see NVIDIA driver versions supported by ACK.
You must configure the gpuVersion parameter in the script used to connect the node to the cloud. The following GPU models are supported:
Architecture | GPU model | ACK Edge version |
AMD64/x86_64 | Nvidia_Tesla_T4 | 1.16.9-aliyunedge.1 and later |
AMD64/x86_64 | Nvidia_Tesla_P4 | 1.16.9-aliyunedge.1 and later |
AMD64/x86_64 | Nvidia_Tesla_P100 | 1.16.9-aliyunedge.1 and later |
AMD64/x86_64 | Nvidia_Tesla_V100 | 1.18.8-aliyunedge.1 and later |
AMD64/x86_64 | Nvidia_Tesla_A100 | 1.20.11-aliyunedge.1 and later |
AMD64/x86_64 | Nvidia_Tesla_A10 | 1.20.11-aliyunedge.1 and later |
AMD64/x86_64 | Nvidia_L20 | 1.26.3-aliyun.1 and later |
AMD64/x86_64 | Nvidia_L40 | 1.26.3-aliyun.1 and later |
After you configure the parameter, the connection tool automatically installs nvidia-containerd-runtime. For more information, see NVIDIA Container Runtime.
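Before you generate the connection script, it can help to confirm that a GPU model is listed for your cluster version. The sketch below encodes the supported-model table as a lookup. The function name min_ack_edge_version is hypothetical, and the sketch assumes that the model strings in the table are the values you would pass as gpuVersion.

```shell
# Hypothetical helper: print the earliest ACK Edge version listed in the
# supported-model table for a GPU model, or return 1 if the model is not listed.
min_ack_edge_version() {
  case "$1" in
    Nvidia_Tesla_T4|Nvidia_Tesla_P4|Nvidia_Tesla_P100)
      echo "1.16.9-aliyunedge.1" ;;
    Nvidia_Tesla_V100)
      echo "1.18.8-aliyunedge.1" ;;
    Nvidia_Tesla_A100|Nvidia_Tesla_A10)
      echo "1.20.11-aliyunedge.1" ;;
    Nvidia_L20|Nvidia_L40)
      echo "1.26.3-aliyun.1" ;;
    *)
      return 1 ;;   # model does not appear in the table
  esac
}

# Usage:
#   min_ack_edge_version Nvidia_Tesla_V100 || echo "GPU model not listed"
```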
How do I handle node connection script execution failures?
The following table describes how to handle a script execution failure. If your issue is not described in the table, collect the diagnostics information about the node and submit a ticket. For more information, see the How do I collect the diagnostics information about nodes in an ACK Edge cluster? section of this topic.
Error message | Cause of failure | Suggested solution |
The os XXX unsupport | The operating system version of the edge node is not supported. | For more information about the supported operating system versions, see Add an edge node. |
invalid nodeName | The node name is invalid. | |
Node route overlaps with service cidr | The route of the node conflicts with the pod CIDR block or Service CIDR block of the cluster. | Recreate the cluster and reconfigure the pod CIDR block or Service CIDR block. Make sure that these CIDR blocks do not conflict with the NameServer address and route of the node. |
response error msg: TOKEN_EXPIRED | The token for connecting the node to the cloud has expired. | |
A node named XXX is already exist in the cluster | A node with the same name already exists in the cluster. | Remove the node from the cluster. |
error run phase join-node: failed to get cluster info: failed to get cluster-info configmap, Get "https://xx.xxx.xx.xx:6443/api/v1/namespaces/kube-public/configmaps/cluster-info": dial tcp xx.xxx.xx.xx:6443: i/o timeout | The system fails to obtain the information about the cluster. | When edgeadm connects an edge node, edgeadm must access the API server by using the IP address. Check whether the access control list (ACL) rules configured for the SLB instance of the API server block the IP address. |
error run phase join-node: Install edge-hub failed: Copy file /tmp/edge-hub to /usr/bin/edge-hub fail: open /usr/bin/edge-hub: text file busy | 40009 | 40009 | Installation of edge-hub fails because the binary file for edge-hub already exists on the node. | Run the |
error run phase post-check: timed out waiting for the condition | The system components fail to start up. | |
What do I do if an edge node fails to be upgraded when I upgrade an ACK Edge cluster?
When you upgrade an edge node pool, if the This node has been upgraded successfully message is not returned, troubleshoot the issue by referring to the solutions that are described in the following table.
Error message | Cause of failure | Suggested solution |
edgeadm version xxxx does not match cluster version | The version of the upgrade tool is inconsistent with that of the cluster. | |
node has already been upgraded to xxx | The version of the node is already updated to the desired version. | If specific components on the node have not been upgraded, retain the logs and submit a ticket. |
kubelet target version xxxx does not match cluster version xxxx | The version of kubelet is inconsistent with the version of the control plane of the cluster. | |
Parameter currentVersion cann't null | An earlier version of edgeadm is used. | |
upgrade kubelet failed at phase install, recover to previous state. error run phase upgrade: xxxx | The cluster fails to be upgraded and has been automatically rolled back to the previous state. The node status is not affected. | Retain the logs and submit a ticket. |
upgrade kubelet failed at phase install, recover to previous state recover kubelet failed, err: xxx error run phase upgrade: xxxx | The cluster fails to be upgraded and has been automatically rolled back to the previous state. The node status is affected. | Retain the logs and submit a ticket. |
How do I collect the diagnostics information about nodes in an ACK Edge cluster?
If an exception occurs on a node in an ACK Edge cluster, perform the following steps to collect the diagnostics information about the node for data analysis:
Log on to the abnormal node in the ACK Edge cluster.
Run the following command to download the diagnostics script:
curl -o /usr/local/bin/diagnose_edge_node.sh https://aliacs-k8s-cn-hangzhou.oss-cn-hangzhou.aliyuncs.com/public/diagnose/diagnose_k8s.sh
Run the following command to make the diagnostics script executable:
chmod u+x /usr/local/bin/diagnose_edge_node.sh
Run the following command to switch to the specified directory:
cd /usr/local/bin/
Run the following command to run the diagnostics script:
./diagnose_edge_node.sh
Expected output: Each time you run the diagnostics script, a file with a different name is generated. In this example, the log file is named diagnose_1578310147.tar.gz.
.......
+ echo 'please get diagnose_1578310147.tar.gz for diagnostics'
please get diagnose_1578310147.tar.gz for diagnostics
+ echo 'Submit the file named diagnose_1578310147.tar.gz to request technical support.'
Submit the file named diagnose_1578310147.tar.gz to request technical support.
Run the ll command to verify that the diagnostics report named diagnose_1578310147.tar.gz exists.
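Because the archive name changes on every run, it can be convenient to locate the newest report automatically before you attach it to a ticket. The following is a minimal sketch under stated assumptions: the function name latest_diagnose_archive is hypothetical, and the sketch assumes that the script writes diagnose_*.tar.gz into the directory from which it was run (/usr/local/bin in the steps above).

```shell
# Hypothetical helper: print the most recently modified diagnose_*.tar.gz
# in the given directory (default: current directory).
# Returns 1 if no matching archive exists.
latest_diagnose_archive() {
  dir="${1:-.}"
  # ls -t sorts by modification time, newest first; errors from an
  # unmatched glob are discarded so that the result is simply empty.
  newest=$(ls -t "$dir"/diagnose_*.tar.gz 2>/dev/null | head -n 1)
  [ -n "$newest" ] || return 1
  echo "$newest"
}

# Usage:
#   latest_diagnose_archive /usr/local/bin || echo "no diagnostics report found"
```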