Container Service for Kubernetes: Configure a GPU selection policy for nodes with the GPU sharing feature enabled

Last Updated: Jun 06, 2024

By default, the scheduler allocates all resources of a GPU on a node to pods before it switches to another GPU. This helps prevent GPU fragmentation. In some scenarios, you may want to spread pods across different GPUs on a node so that your business is less affected if a GPU becomes faulty. This topic describes how to configure a GPU selection policy for nodes with the GPU sharing feature enabled.

Prerequisites

Policy description

If a node with the GPU sharing feature enabled has multiple GPUs, you can choose one of the following GPU selection policies:

  • Binpack: This is the default policy. The scheduler allocates all resources of a GPU to pods before it switches to another GPU. This helps prevent GPU fragmentation.

  • Spread: The scheduler attempts to spread pods across different GPUs on the node so that your business is less affected if a GPU becomes faulty.

In this example, a node has two GPUs. Each GPU provides 15 GiB of memory. Pod1 requests 2 GiB of memory and Pod2 requests 3 GiB of memory.

(Figure: GPU allocation of Pod1 and Pod2 under the binpack and spread policies)
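The GPU selection policy is configured per node through node labels, which Step 1 applies through the node pool configuration. The following snippet is a minimal, illustrative sketch of how the two labels appear on a node object; the node name is taken from the example output in Step 3 and is not a requirement:

apiVersion: v1
kind: Node
metadata:
  name: cn-shanghai.192.0.2.109      # Illustrative node name.
  labels:
    ack.node.gpu.schedule: cgpu      # Enables GPU sharing and GPU memory isolation.
    ack.node.gpu.placement: spread   # Selects the spread policy. Binpack is used when this label is not set.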

Step 1: Create a node pool

By default, the binpack policy is used to select GPUs. To use the spread policy, perform the following steps:

  1. Log on to the ACK console. In the left-side navigation pane, click Clusters.

  2. On the Clusters page, find the cluster that you want to manage and click its name to go to the cluster details page. In the left-side navigation pane, choose Nodes > Node Pools.

  3. In the upper-right corner of the Node Pools page, click Create Node Pool.

  4. In the Create Node Pool dialog box, configure the parameters for the node pool and click Confirm Order. The following list describes the key parameters. For more information about other parameters, see Create a node pool.

    • Instance Type: Set Architecture to GPU-accelerated and select multiple GPU-accelerated instance types. The spread policy takes effect only on nodes that have more than one GPU. Therefore, select instance types that provide multiple GPUs.

    • Expected Nodes: Specify the initial number of nodes in the node pool. If you do not want to create nodes in the node pool, set this parameter to 0.

    • Node Label: Click the add icon and add the following two labels:

      • A label whose key is ack.node.gpu.schedule and value is cgpu. This label enables the GPU sharing and GPU memory isolation features.

      • A label whose key is ack.node.gpu.placement and value is spread. This label enables the spread policy.
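After nodes are added to the node pool, you can check whether the two labels are applied. The following command is a minimal sketch that relies only on standard kubectl; it lists the nodes together with the values of the two labels:

kubectl get nodes -L ack.node.gpu.schedule -L ack.node.gpu.placement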

Step 2: Submit a job

  1. Log on to the ACK console. In the left-side navigation pane, click Clusters.

  2. On the Clusters page, find the cluster that you want to manage and click its name. In the left-side navigation pane, choose Workloads > Jobs.

  3. Click Create from YAML in the upper-right part of the page, copy the following code to the Template editor, and then modify the parameters based on the following comments. After you complete the configuration, click Create.

    apiVersion: batch/v1
    kind: Job
    metadata:
      name: tensorflow-mnist-spread
    spec:
      parallelism: 3
      template:
        metadata:
          labels:
            app: tensorflow-mnist-spread
        spec:
          nodeSelector:
            kubernetes.io/hostname: <NODE_NAME> # Replace <NODE_NAME> with the name of a GPU-accelerated node in the cluster, such as cn-shanghai.192.0.2.109.
          containers:
          - name: tensorflow-mnist-spread
            image: registry.cn-beijing.aliyuncs.com/ai-samples/gpushare-sample:tensorflow-1.5
            command:
            - python
            - tensorflow-sample-code/tfjob/docker/mnist/main.py
            - --max_steps=100000
            - --data_dir=tensorflow-sample-code/data
            resources:
              limits:
                aliyun.com/gpu-mem: 4 # Request 4 GiB of GPU memory.
            workingDir: /root
          restartPolicy: Never

    YAML template description:

    • This YAML template defines a TensorFlow MNIST job. The job creates three pods, and each pod requests 4 GiB of GPU memory.

    • The resource limit aliyun.com/gpu-mem: 4 is used to request GPU memory for the pods.

    • To make the GPU selection policy take effect on a specific node, the nodeSelector kubernetes.io/hostname: <NODE_NAME> is added to the YAML template to schedule the pods to that node.
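After the job is created, you can confirm that the three pods are scheduled to the specified node before you proceed to the next step. The following command is a minimal sketch that uses the app label defined in the YAML template; the -o wide option shows the node that runs each pod:

kubectl get pods -l app=tensorflow-mnist-spread -o wide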

Step 3: Verify whether the spread policy is used

Use the GPU inspection tool to query the GPU allocation information of the node.

kubectl inspect cgpu

NAME                   IPADDRESS      GPU0(Allocated/Total)  GPU1(Allocated/Total)  GPU2(Allocated/Total)  GPU3(Allocated/Total)  GPU Memory(GiB)
cn-shanghai.192.0.2.109  192.0.2.109  4/15                   4/15                   0/15                   4/15                   12/60
--------------------------------------------------------------------------------------
Allocated/Total GPU Memory In Cluster:
12/60 (20%)

The output shows that the three pods are allocated to three different GPUs (GPU0, GPU1, and GPU3) on the node. This indicates that the spread policy has taken effect.
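If the GPU inspection tool is not installed, you can still view the total amount of aliyun.com/gpu-mem allocated on the node with standard kubectl, although the output does not show the per-GPU distribution. A minimal sketch, assuming the node name from this example:

kubectl describe node cn-shanghai.192.0.2.109 | grep -A 10 "Allocated resources"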