Container Service for Kubernetes: Update the cGPU version on a node

Last updated: Nov 21, 2024

Container Service for Kubernetes (ACK) supports GPU sharing. To enable GPU sharing, you must install cGPU on a node. This topic describes how to update the cGPU version on a node, either by running commands from a CLI or by using the ACK console.

Prerequisites

  • A kubectl client is connected to the cluster. For more information, see Obtain the kubeconfig file of a cluster and use kubectl to connect to the cluster.

  • The ack-ai-installer or the ack-cgpu component is installed in the cluster and updated to the latest version.

    • For more information about how to update the ack-ai-installer component, see Update the GPU sharing component.

    • To update the ack-cgpu component, perform the following steps:

      • Log on to the ACK console. In the left-side navigation pane, click Clusters.

      • On the Clusters page, find the cluster that you want to manage and click its name. In the left-side navigation pane, choose Applications > Helm.

      • On the Helm page, find the ack-cgpu component and click Update in the Actions column. On the page that appears, select a Version, and then click OK.

  • To update a node, you must stop the applications running on the node where the cGPU is installed. Perform the update during off-peak hours.

Update solution

Important
  • Solution 1 involves more steps, but it does not affect other data on the system disk or data disks of the node.

  • Solution 2 resets the system disk of the node. If the system disk of your node holds data that you need to keep, select Solution 1.

Solution 1: Run a script

Step 1: Drain a node

  1. Run the following command to mark the node as unschedulable:

    kubectl cordon <NODE1_NAME> <NODE2_NAME>...
  2. Run the following command to drain the node:

    kubectl drain <NODE1_NAME> <NODE2_NAME>... --grace-period=120 --ignore-daemonsets=true
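
The two commands above can be wrapped in a small helper that handles nodes one at a time and confirms that each node is unschedulable before it is drained. This is a minimal sketch rather than part of the documented procedure: the function name cordon_and_drain is hypothetical, and it assumes that kubectl is already connected to the cluster.

```shell
# Hypothetical helper: cordon, verify, and drain each node passed as an argument.
cordon_and_drain() {
  local node
  for node in "$@"; do
    kubectl cordon "$node" || return 1
    # .spec.unschedulable reads "true" once the node is cordoned.
    if [ "$(kubectl get node "$node" -o jsonpath='{.spec.unschedulable}')" != "true" ]; then
      echo "node $node is still schedulable" >&2
      return 1
    fi
    kubectl drain "$node" --grace-period=120 --ignore-daemonsets=true || return 1
  done
}

# Usage: cordon_and_drain NODE1_NAME NODE2_NAME
```

Stopping at the first failure keeps a partly drained node from going unnoticed when several nodes are updated in one run.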

Step 2: Uninstall the earlier version of cGPU

Log on to the node where the cGPU is installed and run the following command:

bash /usr/local/cgpu-installer/uninstall.sh
Note

If /usr/local/cgpu-installer/uninstall.sh does not exist, run the following command to download the uninstall script to that path, and then run the preceding bash command again to uninstall the earlier version of cGPU.

wget http://aliacs-k8s-cn-beijing.oss-cn-beijing.aliyuncs.com/gpushare/cgpu-uninstall.sh -O /usr/local/cgpu-installer/uninstall.sh
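
The uninstall step and its fallback can be combined into one function that downloads the script only when it is missing and then runs it. This is a sketch under stated assumptions: the function name uninstall_cgpu and the CGPU_DIR override are hypothetical, and the mkdir call is added because wget -O does not create a missing target directory.

```shell
# Sketch: run the cGPU uninstall script, downloading it first if it is missing.
# CGPU_DIR is a hypothetical override; the default is the path used above.
uninstall_cgpu() {
  local dir="${CGPU_DIR:-/usr/local/cgpu-installer}"
  local script="$dir/uninstall.sh"
  local url="http://aliacs-k8s-cn-beijing.oss-cn-beijing.aliyuncs.com/gpushare/cgpu-uninstall.sh"

  if [ ! -f "$script" ]; then
    mkdir -p "$dir" || return 1
    # Remove a partial file on failure so that a retry downloads it again.
    wget "$url" -O "$script" || { rm -f "$script"; return 1; }
  fi
  bash "$script"
}

# Usage (on the node where cGPU is installed): uninstall_cgpu
```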

Step 3: Install the new version of cGPU

Run the following commands to set the name of the node where cGPU is installed and to restart the cgpu-installer and cgpu-core-installer pods on that node:

export NODE=cn-beijing.192.168.XXX.XXXX    # Specify a node. Do not put a space after the equal sign.
kubectl delete pods -n kube-system -l name=cgpu-installer --field-selector spec.nodeName=$NODE
kubectl delete pods -n kube-system -l name=cgpu-core-installer --field-selector spec.nodeName=$NODE
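
Because the field selector depends on the node name, an empty NODE variable makes the delete commands miss the intended pods. The sketch below guards against that and then lists the installer pods on the node. The function name restart_cgpu_installers is hypothetical, and the sketch assumes that the deleted pods are recreated automatically by their controller.

```shell
# Sketch: restart the cGPU installer pods on one node, then list them.
restart_cgpu_installers() {
  local node="$1" label
  if [ -z "$node" ]; then
    echo "usage: restart_cgpu_installers NODE_NAME" >&2
    return 1
  fi
  for label in cgpu-installer cgpu-core-installer; do
    kubectl delete pods -n kube-system -l "name=$label" \
      --field-selector "spec.nodeName=$node" || return 1
  done
  # List the installer pods on the node; rerun until they show Running.
  kubectl get pods -n kube-system \
    -l 'name in (cgpu-installer, cgpu-core-installer)' \
    --field-selector "spec.nodeName=$node" -o wide
}

# Usage: restart_cgpu_installers "$NODE"
```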

Verify the result

Run the following command on the node to check whether cGPU is updated to the new version:

cat /proc/cgpu_km/version

Expected output:

1.5.10
Note

1.5.10 is the latest version at the time of writing. The expected output changes when a newer cGPU version is released.
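
For scripted checks, the version can be compared against a minimum instead of being read by eye. This is a minimal sketch: the function name cgpu_version_at_least and its optional second argument are hypothetical, and it assumes a sort command that supports -V (as in GNU coreutils).

```shell
# Sketch: succeed only if the loaded cGPU version is at least the expected one.
# The second argument is a hypothetical override for the version file path.
cgpu_version_at_least() {
  local expected="$1" file="${2:-/proc/cgpu_km/version}" actual lowest
  if [ ! -r "$file" ]; then
    echo "cannot read $file" >&2
    return 2
  fi
  actual="$(tr -d '[:space:]' < "$file")"
  # sort -V compares version strings component by component.
  lowest="$(printf '%s\n%s\n' "$expected" "$actual" | sort -V | head -n 1)"
  [ "$lowest" = "$expected" ]
}

# Usage (on the node): cgpu_version_at_least 1.5.10 && echo "cGPU is up to date"
```

The node was marked unschedulable in Step 1, and that mark stays in place until it is removed. After the version is confirmed, running kubectl uncordon with the node name makes the node schedulable again.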

Solution 2: Reset a node

Remove and re-add a node

  1. Log on to the ACK console. In the left-side navigation pane, click Clusters.

  2. On the Clusters page, find the cluster that you want to manage and click its name. In the left-side navigation pane, choose Nodes > Nodes.

  3. On the Nodes page, select the node that you want to update, and click Batch Remove. In the Remove Node dialog box, select Drain Node.

  4. Add the node that you removed to the original node pool again. For more information, see Add existing ECS instances to an ACK cluster.

Verify the result

  1. Run the following command to list the cgpu-installer pods, and find the pod that runs on the node that you updated:

    kubectl get po -l name=cgpu-installer -n kube-system -o wide

    Expected output:

    NAME                   READY   STATUS    RESTARTS   AGE    IP                NODE                         NOMINATED NODE   READINESS GATES
    cgpu-installer-*****   1/1     Running   0          4d2h   192.168.XXX.XX1   cn-beijing.192.168.XXX.XX1   <none>           <none>
    cgpu-installer-**2     1/1     Running   0          4d2h   192.168.XXX.XX2   cn-beijing.192.168.XXX.XX2   <none>           <none>
    cgpu-installer-**3     1/1     Running   0          4d2h   192.168.XXX.XX3   cn-beijing.192.168.XXX.XX3   <none>           <none>
  2. Run the following command to open a shell in that pod. Replace cgpu-installer-***** with the pod name from the previous output:

    kubectl exec -ti cgpu-installer-***** -n kube-system -- bash
  3. Run the following command to query the cGPU version:

    nsenter -t 1 -i -p -n -u -m -- cat /proc/cgpu_km/version

    Expected output:

    1.5.10
    Note

    1.5.10 is the latest version at the time of writing. The expected output changes when a newer cGPU version is released.
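
The three verification steps can also be run non-interactively for every cgpu-installer pod at once, which helps when several nodes were reset. This is a sketch only: the function name print_cgpu_versions is hypothetical, and it reuses the nsenter command from step 3 unchanged.

```shell
# Sketch: print "<node> <cGPU version>" for every cgpu-installer pod.
print_cgpu_versions() {
  local pod node version
  kubectl get pods -n kube-system -l name=cgpu-installer \
    -o jsonpath='{range .items[*]}{.metadata.name}{" "}{.spec.nodeName}{"\n"}{end}' |
  while read -r pod node; do
    [ -n "$pod" ] || continue
    # stdin is redirected so that kubectl exec cannot consume the pod list.
    version="$(kubectl exec "$pod" -n kube-system -- \
      nsenter -t 1 -i -p -n -u -m -- cat /proc/cgpu_km/version </dev/null)"
    echo "$node $version"
  done
}

# Usage: print_cgpu_versions
```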