Container Service for Kubernetes (ACK) supports GPU sharing. To enable GPU sharing, you must install cGPU on a node. This topic describes how to update the cGPU version on a node by using the command line or the ACK console.
Prerequisites
A kubectl client is connected to the cluster. For more information, see Obtain the kubeconfig file of a cluster and use kubectl to connect to the cluster.
The ack-ai-installer or the ack-cgpu component is installed in the cluster and updated to the latest version.
For more information about how to update the ack-ai-installer component, see Update the GPU sharing component.
To update the ack-cgpu component, perform the following steps:
Log on to the ACK console. In the left-side navigation pane, click Clusters.
On the Clusters page, find the cluster that you want to manage and click its name. In the left-side navigation pane, go to the Helm page.
On the Helm page, find the ack-cgpu component and click Update in the Actions column. On the page that appears, select a Version, and then click OK.
To update cGPU on a node, you must first stop the applications that run on that node. Perform the update during off-peak hours.
Update solutions
Solution 1 involves more steps, but it does not affect other data on the system disk or data disks of the node.
Solution 2 resets the system disk of the node. If the system disk of your node contains data that you want to keep, use Solution 1.
Solution 1: Run a script
Step 1: Drain a node
Run the following command to mark the node as unschedulable:
kubectl cordon <NODE1_NAME> <NODE2_NAME>...
Run the following command to drain the node:
kubectl drain <NODE1_NAME> <NODE2_NAME>... --grace-period=120 --ignore-daemonsets=true
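When several nodes need to be updated, the two commands above can be wrapped in a small helper. The following is a minimal sketch, not part of the official procedure: the function name drain_nodes and the DRY_RUN switch are hypothetical names introduced for this example. With DRY_RUN=1 the commands are only printed, which makes it possible to review them before anything is evicted.

```shell
#!/usr/bin/env bash
# Sketch: cordon and drain each node in turn before updating cGPU.
# drain_nodes and DRY_RUN are hypothetical names used only in this example.
# When DRY_RUN is 1, the kubectl commands are printed instead of executed.

drain_nodes() {
  local run=""
  if [ "${DRY_RUN:-0}" = "1" ]; then
    run="echo"
  fi
  local node
  for node in "$@"; do
    # Mark the node as unschedulable, then evict its pods.
    $run kubectl cordon "$node" || return 1
    $run kubectl drain "$node" --grace-period=120 --ignore-daemonsets=true || return 1
  done
}
```

For example, DRY_RUN=1 followed by drain_nodes <NODE1_NAME> <NODE2_NAME> prints the four commands that would run for the two nodes. The helper stops at the first command that fails, so a node that cannot be drained is not skipped silently.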
Step 2: Uninstall the earlier version of cGPU
Log on to the node where cGPU is installed and run the following command:
bash /usr/local/cgpu-installer/uninstall.sh
If /usr/local/cgpu-installer/uninstall.sh does not exist, run the following command to download the uninstall script to that path, and then run the preceding bash command again to uninstall the earlier version of cGPU.
wget http://aliacs-k8s-cn-beijing.oss-cn-beijing.aliyuncs.com/gpushare/cgpu-uninstall.sh -O /usr/local/cgpu-installer/uninstall.sh
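The check-then-download logic of this step can be combined into one helper. The following is a minimal sketch under stated assumptions: the function name uninstall_cgpu and its optional path argument are hypothetical, and the mkdir -p call is an added precaution (wget -O fails if the target directory is missing), not something the original procedure states.

```shell
#!/usr/bin/env bash
# Sketch: run the cGPU uninstall script, downloading it first if it is missing.
# uninstall_cgpu and its optional SCRIPT_PATH argument are hypothetical names.

uninstall_cgpu() {
  local script="${1:-/usr/local/cgpu-installer/uninstall.sh}"
  if [ ! -f "$script" ]; then
    # Assumption: create the parent directory so that wget can write the file.
    mkdir -p "$(dirname "$script")" || return 1
    wget http://aliacs-k8s-cn-beijing.oss-cn-beijing.aliyuncs.com/gpushare/cgpu-uninstall.sh -O "$script" || return 1
  fi
  # Run the script that is now present on the node.
  bash "$script"
}
```

Calling uninstall_cgpu without arguments on the node uses the default path from this topic.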
Step 3: Install the new version of cGPU
Run the following commands to set the name of the node where cGPU is installed and restart the cgpu-installer and cgpu-core-installer pods on that node:
export NODE=cn-beijing.192.168.XXX.XXXX # Specify a node. Do not put a space after the equal sign.
kubectl delete pods -n kube-system -l name=cgpu-installer --field-selector spec.nodeName=$NODE
kubectl delete pods -n kube-system -l name=cgpu-core-installer --field-selector spec.nodeName=$NODE
Verify the result
Run the following command to check whether the cGPU version is updated to the new version:
cat /proc/cgpu_km/version
Expected output:
1.5.10
Note: 1.5.10 is the latest version. If a newer cGPU version is released, the version number changes.
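Because the version number changes with each release, a scripted check is more robust if it compares versions instead of matching one fixed string. The following is a minimal sketch: the function name version_at_least is hypothetical, and the comparison relies on sort -V (version sort) from GNU coreutils, which is assumed to be available on the node.

```shell
#!/usr/bin/env bash
# Sketch: check that a reported version is at least the expected version.
# version_at_least is a hypothetical helper introduced for this example.
# Usage: version_at_least ACTUAL EXPECTED
# Returns 0 when ACTUAL is the same as or newer than EXPECTED.

version_at_least() {
  local actual="$1"
  local expected="$2"
  local lowest
  # sort -V orders version strings numerically per component, so the first
  # line of the sorted pair is the lower of the two versions.
  lowest="$(printf '%s\n%s\n' "$expected" "$actual" | sort -V | head -n 1)"
  [ "$lowest" = "$expected" ]
}
```

On the node, the check could be used as version_at_least "$(cat /proc/cgpu_km/version)" 1.5.10, which succeeds when the installed version is 1.5.10 or later.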
Solution 2: Reset a node
Remove and re-add a node
Log on to the ACK console. In the left-side navigation pane, click Clusters.
On the Clusters page, find the cluster that you want to manage and click its name. In the left-side navigation pane, go to the Nodes page.
On the Nodes page, select the node that you want to update, and click Batch Remove. In the Remove Node dialog box, select Drain Node.
Add the node that you removed to the original node pool again. For more information, see Add existing ECS instances to an ACK cluster.
Verify the result
Run the following command to query the pod that runs cgpu-installer on the node:
kubectl get po -l name=cgpu-installer -n kube-system -o wide
Expected output:
NAME                   READY   STATUS    RESTARTS   AGE    IP                NODE                         NOMINATED NODE   READINESS GATES
cgpu-installer-*****   1/1     Running   0          4d2h   192.168.XXX.XX1   cn-beijing.192.168.XXX.XX1   <none>           <none>
cgpu-installer-**2     1/1     Running   0          4d2h   192.168.XXX.XX2   cn-beijing.192.168.XXX.XX2   <none>           <none>
cgpu-installer-**3     1/1     Running   0          4d2h   192.168.XXX.XX3   cn-beijing.192.168.XXX.XX3   <none>           <none>
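In a cluster with many GPU nodes, the pod for the node that was re-added can be picked out of this output by its NODE column. The following is a minimal sketch: the function name pod_on_node is hypothetical, and the sketch assumes that the RESTARTS column is a single number, as in the output above, so that NODE is the seventh whitespace-separated field.

```shell
#!/usr/bin/env bash
# Sketch: find the cgpu-installer pod that runs on a given node.
# pod_on_node is a hypothetical helper introduced for this example.
# It reads `kubectl get po ... -o wide` output on stdin and prints the NAME
# of each pod whose NODE column equals the given node name.
# Assumption: RESTARTS is a single token, so NODE is field 7.

pod_on_node() {
  awk -v node="$1" 'NR > 1 && $7 == node { print $1 }'
}
```

For example, the output of kubectl get po -l name=cgpu-installer -n kube-system -o wide can be piped into pod_on_node followed by the node name. The header row is skipped, and nothing is printed if no pod runs on that node.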
Run the following command to access the cgpu-installer-***** pod:
kubectl exec -ti cgpu-installer-***** -n kube-system -- bash
Run the following command to query the cGPU version:
nsenter -t 1 -i -p -n -u -m -- cat /proc/cgpu_km/version
Expected output:
1.5.10
Note: 1.5.10 is the latest version. If a newer cGPU version is released, the version number changes.