Container Service for Kubernetes: Labels for enabling GPU scheduling policies and methods for changing label values

Last Updated: Jun 19, 2024

After you install the scheduling component ack-ai-installer provided by the cloud-native AI suite, you can add a label to a GPU-accelerated node to enable a scheduling policy, such as GPU sharing or topology-aware GPU scheduling. This topic describes the labels for enabling different GPU scheduling policies and how to change the value of a label.
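Before you change anything, you can check which GPU scheduling labels are currently set on the nodes of a cluster. The following command is a minimal sketch that assumes kubectl access to the cluster; the -L flag prints each label as a separate column:

kubectl get nodes -L ack.node.gpu.schedule -L ack.node.gpu.placement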

Labels for enabling GPU scheduling policies

ack.node.gpu.schedule

Policy: Exclusive GPU scheduling

Label value: default

Whether other label values are supported: Yes. You can change the value to a label value of the GPU sharing policy (cgpu, core_mem, or share) or the topology-aware GPU scheduling policy (topology).

Description:

  • This label value enables the exclusive GPU scheduling policy. GPU resources are allocated to pods by GPU.

  • By default, GPU-accelerated nodes that are newly added to a cluster do not have the ack.node.gpu.schedule:default label but still use the exclusive GPU scheduling policy.

  • If the ack.node.gpu.schedule label is already added to a node to enable the GPU sharing policy and you want the node to use the exclusive GPU scheduling policy again, you must add the ack.node.gpu.schedule:default label to the node. You cannot switch the node to the exclusive GPU scheduling policy by deleting the ack.node.gpu.schedule label from the node.
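For example, to switch a node that uses GPU sharing back to exclusive GPU scheduling after you complete the reset operations described later in this topic, you can run a command similar to the following. The node name is a placeholder:

kubectl label nodes <your-node-name> ack.node.gpu.schedule=default --overwrite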

Policy: GPU sharing

Label value: cgpu

Whether other label values are supported: Yes. You can change the value to the label value of the exclusive GPU scheduling policy (default) or the topology-aware GPU scheduling policy (topology).

Description:

  • This label value enables the GPU sharing policy.

  • A node on which GPU sharing is enabled is installed with the GPU isolation module cGPU. For more information about cGPU, see Install and use cGPU on a Docker container.

  • By default, the computing power scheduling policy of cGPU is set to 5. For more information about computing power scheduling policies, see Install and use cGPU on a Docker container.

  • Pods can request only GPU memory. GPU memory isolation and computing power sharing are implemented for pods that share the same GPU.
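Based on the preceding description, a pod on a node labeled ack.node.gpu.schedule=cgpu requests GPU memory instead of whole GPUs. The following is a minimal sketch that assumes the extended resource name aliyun.com/gpu-mem (measured in GiB) used by ACK GPU sharing and a hypothetical application image:

kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: gpu-mem-demo
spec:
  containers:
  - name: app
    # Hypothetical image name; replace it with your own CUDA application image.
    image: cuda-app:latest
    resources:
      limits:
        # Request 3 GiB of GPU memory. Pods that share the GPU are isolated by cGPU.
        aliyun.com/gpu-mem: 3
EOF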

Policy: GPU sharing

Label value: core_mem

Whether other label values are supported: Yes. You can change the value to the label value of the exclusive GPU scheduling policy (default) or the topology-aware GPU scheduling policy (topology).

Description:

  • This label value enables the GPU sharing policy.

  • A node on which GPU sharing is enabled is installed with the GPU isolation module cGPU. For more information about cGPU, see Install and use cGPU on a Docker container.

  • By default, the computing power scheduling policy of cGPU is set to 3. For more information about computing power scheduling policies, see Install and use cGPU on a Docker container.

  • Pods must request both GPU memory and computing power. Pods that share the same GPU can use only the GPU memory and computing power that they request.
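With core_mem, a pod declares both dimensions. The following is a minimal sketch that assumes the extended resource names aliyun.com/gpu-mem (GiB) and aliyun.com/gpu-core.percentage (percentage of the computing power of one GPU) used by ACK GPU sharing, with a hypothetical image name:

kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: gpu-core-mem-demo
spec:
  containers:
  - name: app
    # Hypothetical image name; replace it with your own CUDA application image.
    image: cuda-app:latest
    resources:
      limits:
        # Request 3 GiB of GPU memory and 30% of the computing power of one GPU.
        aliyun.com/gpu-mem: 3
        aliyun.com/gpu-core.percentage: 30
EOF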

Policy: GPU sharing

Label value: share

Whether other label values are supported: Yes. You can change the value to cgpu, core_mem, the label value of the exclusive GPU scheduling policy (default), or the topology-aware GPU scheduling policy (topology).

Description:

  • This label value enables the GPU sharing policy.

  • A node on which GPU sharing is enabled with this value is not installed with the GPU isolation module cGPU. As a result, GPU memory is not isolated between pods.

  • Pods can request only GPU memory. Computing power sharing is implemented for pods that share the same GPU.

Policy: Topology-aware GPU scheduling

Label value: topology

Whether other label values are supported: Yes. You can change the value to a label value of the exclusive GPU scheduling policy (default) or the GPU sharing policy (cgpu, core_mem, or share).

Description:

  • This label value enables the topology-aware GPU scheduling policy.

  • GPU resources are allocated to pods by GPU. The GPUs that are allocated to a pod are selected based on the bandwidth of GPU-to-GPU data transfer.

Policy: Dynamic Multi-Instance GPU (MIG) partitioning

Label value: mig

Whether other label values are supported: No.

Description:

  • This label value enables the MIG feature.

  • A node reports the maximum number of MIG instances that it supports to the scheduler. Each container can request at most one MIG instance.

  • Enabling the MIG feature also modifies the hardware attributes of nodes. Therefore, if you want to disable the MIG feature for a node pool, you must delete the current node pool and then create another node pool that has MIG disabled. You cannot switch to another GPU scheduling policy by running the kubectl label command to change the label value.

ack.node.gpu.placement

Policy: GPU sharing

Label value: spread

Whether other label values are supported: Yes. You can change the value to binpack.

Description:

  • The policy takes effect only on nodes that have GPU sharing enabled.

  • On a node that has multiple GPUs, this policy spreads pods across the GPUs.

Policy: GPU sharing

Label value: binpack

Whether other label values are supported: Yes. You can change the value to spread.

Description:

  • The policy takes effect only on nodes that have GPU sharing enabled.

  • On a node that has multiple GPUs, this policy allocates all resources of one GPU to pods before it moves on to the next GPU. This helps avoid GPU fragmentation.

  • If a GPU-accelerated node that has GPU sharing enabled does not have the ack.node.gpu.placement label, the scheduler uses the binpack policy to allocate GPU resources to pods.
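To check which placement policy is in effect on a node, you can query the label directly. The following is a minimal sketch in which the node name is a placeholder; an empty result means that the label is not set and the binpack policy is used:

kubectl get node <your-node-name> -o jsonpath='{.metadata.labels.ack\.node\.gpu\.placement}'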

Change label values

Issues that may occur if you run the kubectl label nodes command or use the label management feature in the ACK console to change label values

The following issues may occur if you run the kubectl label nodes command to switch the GPU scheduling policy of a GPU-accelerated node from Policy A to Policy B, or use the Labels feature on the Nodes page of the ACK console to change label values:

  • Applications that use GPU resources may already be deployed on the node, and their pods requested GPU resources based on Policy A. After the scheduling policy is switched from Policy A to Policy B, the scheduler no longer includes these pods in the GPU resource ledger that it maintains for the node. As a result, the ledger becomes inconsistent with the actual GPU resource allocation, and these applications may compete for GPU resources with other GPU-heavy applications.

  • Some scheduling policies are enabled by modifying the configuration of the node itself. Running the kubectl label nodes command or using the label management feature in the Container Service for Kubernetes (ACK) console changes only the label value and does not reset the configuration of the node. Consequently, the node may fail to enable the specified scheduling policy.
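To see how the scheduler's ledger compares with the pods that actually run on a node, you can inspect the allocated resources of the node. The following is a minimal sketch in which the node name is a placeholder; the extended resource names that appear in the output depend on the scheduling policy in use:

kubectl describe node <your-node-name> | grep -A 10 "Allocated resources"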

To avoid the preceding issues, we recommend that you configure GPU scheduling policies for node pools.

Configure GPU scheduling policies for node pools

Assume that you want to enable GPU sharing (with GPU memory isolation only) and GPU sharing (with GPU memory isolation and computing power limits) for a cluster. In this scenario, you can create two node pools in the cluster:

  • Node Pool A: manages nodes that have GPU memory isolation enabled (ack.node.gpu.schedule=cgpu).

  • Node Pool B: manages nodes that have GPU memory isolation and computing power limits enabled (ack.node.gpu.schedule=core_mem).


To change the GPU scheduling policy of a GPU-accelerated node from the policy used by Node Pool A to the policy used by Node Pool B, remove the node from Node Pool A and add the node to Node Pool B. For more information, see Remove a node and Add existing ECS instances to an ACK cluster.

Manually change the GPU scheduling policy of a node

You can also manually change the GPU scheduling policy of a node. To do this, perform the following operations. A consolidated command sketch follows the steps.

  1. Set the node to Unschedulable: Prevent the node from accepting new pods.

  2. Drain the node: Evict all existing pods from the node.

  3. Log on to the node and reset the label settings: The reset operation varies based on the value of the label. For more information, see Reset label settings.

  4. Change the label value: After you complete the reset operation, run the kubectl label command to change the label value.

  5. Set the node to Schedulable: Allow the node to accept new pods again.
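The following is a minimal sketch of steps 1, 2, 4, and 5 for a node that currently uses GPU sharing. The node name and the new label value are placeholders, and step 3 must be performed on the node itself:

kubectl cordon <your-node-name>
kubectl drain <your-node-name> --ignore-daemonsets --delete-emptydir-data
# Step 3: log on to the node and reset the label settings (see Reset label settings).
kubectl label nodes <your-node-name> ack.node.gpu.schedule=<new-value> --overwrite
kubectl uncordon <your-node-name>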


Reset label settings

Policy: GPU sharing

Label: ack.node.gpu.schedule=cgpu or ack.node.gpu.schedule=core_mem

Operation to reset label settings before changing the label value: Run the following command on the node to uninstall cGPU:

bash /usr/local/cgpu-installer/uninstall.sh

References

For more information about the labels that are used to specify GPU models, how to schedule applications to specific GPU models, and how to avoid scheduling applications to specific GPU models, see Labels used to specify GPU models.