Compared with the Kubernetes scheduler, the scheduler provided by Container Service for Kubernetes (ACK) supports more features, such as gang scheduling, topology-aware CPU scheduling, and Elastic Container Instance-based scheduling. This topic describes how to install ack-co-scheduler in a registered cluster and how to use the scheduling features of ACK. ack-co-scheduler allows you to apply these features in various types of applications, such as big data applications and AI applications, to improve resource utilization.
Prerequisites
A registered cluster is created and an external Kubernetes cluster is connected to the registered cluster. For more information, see Create a registered cluster in the ACK console.
The following table describes the system component versions that are required.
Component          Version
Kubernetes         1.18.8 or later
Helm               3.0 or later
Docker             19.03.5
Operating system   CentOS 7.6, CentOS 7.7, Ubuntu 16.04, Ubuntu 18.04, and Alibaba Cloud Linux
Usage notes
When you deploy a job, you must specify ack-co-scheduler as the scheduler. To do this, set .template.spec.schedulerName to ack-co-scheduler.
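For example, the relevant field in a Deployment or job template looks like the following (only the lines related to the scheduler are shown):

```yaml
spec:
  template:
    spec:
      # Route the pods to the ACK scheduler instead of the default kube-scheduler.
      schedulerName: ack-co-scheduler
```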
Install ack-co-scheduler
Use onectl to install ack-co-scheduler
Install onectl on your on-premises machine. For more information, see Use onectl to manage registered clusters.
Run the following command to install ack-co-scheduler:
onectl addon install ack-co-scheduler
Expected output:
Addon ack-co-scheduler, version **** installed.
Use the ACK console to install ack-co-scheduler
Log on to the ACK console and click Clusters in the left-side navigation pane.
On the Clusters page, click the name of the cluster that you want to manage and choose Operations > Add-ons in the left-side navigation pane.
On the Add-ons page, click the Others tab. On the ack-co-scheduler card, click Install.
In the message that appears, click OK.
Gang scheduling
The gang scheduling feature provided by ACK is developed on top of the Kubernetes scheduling framework. This feature provides a solution to job scheduling in all-or-nothing scenarios.
You can use the following template to deploy a distributed TensorFlow training job that has gang scheduling enabled. For more information about how to run a distributed TensorFlow training job, see Work with gang scheduling.
apiVersion: "kubeflow.org/v1"
kind: "TFJob"
metadata:
  name: "tf-smoke-gpu"
spec:
  tfReplicaSpecs:
    PS:
      replicas: 1
      template:
        metadata:
          creationTimestamp: null
          labels:
            pod-group.scheduling.sigs.k8s.io/name: tf-smoke-gpu
            pod-group.scheduling.sigs.k8s.io/min-available: "2"
        spec:
          schedulerName: ack-co-scheduler # Specify ack-co-scheduler as the scheduler.
          containers:
          - args:
            - python
            - tf_cnn_benchmarks.py
            - --batch_size=32
            - --model=resnet50
            - --variable_update=parameter_server
            - --flush_stdout=true
            - --num_gpus=1
            - --local_parameter_device=cpu
            - --device=cpu
            - --data_format=NHWC
            image: registry.cn-hangzhou.aliyuncs.com/kubeflow-images-public/tf-benchmarks-cpu:v20171202-bdab599-dirty-284af3
            name: tensorflow
            ports:
            - containerPort: 2222
              name: tfjob-port
            resources:
              limits:
                cpu: '10'
            workingDir: /opt/tf-benchmarks/scripts/tf_cnn_benchmarks
          restartPolicy: OnFailure
    Worker:
      replicas: 4
      template:
        metadata:
          creationTimestamp: null
          labels:
            pod-group.scheduling.sigs.k8s.io/name: tf-smoke-gpu
            pod-group.scheduling.sigs.k8s.io/min-available: "2"
        spec:
          schedulerName: ack-co-scheduler
          containers:
          - args:
            - python
            - tf_cnn_benchmarks.py
            - --batch_size=32
            - --model=resnet50
            - --variable_update=parameter_server
            - --flush_stdout=true
            - --num_gpus=1
            - --local_parameter_device=cpu
            - --device=gpu
            - --data_format=NHWC
            image: registry.cn-hangzhou.aliyuncs.com/kubeflow-images-public/tf-benchmarks-cpu:v20171202-bdab599-dirty-284af3
            name: tensorflow
            ports:
            - containerPort: 2222
              name: tfjob-port
            resources:
              limits:
                cpu: 10
            workingDir: /opt/tf-benchmarks/scripts/tf_cnn_benchmarks
          restartPolicy: OnFailure
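The two pod-group labels in the template drive the all-or-nothing behavior: pods that carry the same pod-group.scheduling.sigs.k8s.io/name value are treated as one group, and no pod in the group is scheduled until at least min-available of them can be placed at the same time. The following sketch shows only the gang-related fields; the group name and min-available value here are illustrative, not part of the template above:

```yaml
template:
  metadata:
    labels:
      # All pods that share this group name are scheduled as one unit.
      pod-group.scheduling.sigs.k8s.io/name: my-training-job
      # The group is scheduled only when at least this many pods can be placed.
      pod-group.scheduling.sigs.k8s.io/min-available: "5"
  spec:
    schedulerName: ack-co-scheduler
```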
Topology-aware CPU scheduling
Before you enable topology-aware CPU scheduling, you must install resource-controller in the cluster. For more information, see Manage components.
You can use the following template to create a Deployment that has topology-aware CPU scheduling enabled. For more information about topology-aware CPU scheduling, see Topology-aware CPU scheduling.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-numa
  labels:
    app: nginx-numa
spec:
  replicas: 2
  selector:
    matchLabels:
      app: nginx-numa
  template:
    metadata:
      annotations:
        cpuset-scheduler: "true"
      labels:
        app: nginx-numa
    spec:
      schedulerName: ack-co-scheduler # Specify ack-co-scheduler as the scheduler.
      containers:
      - name: nginx-numa
        image: nginx:1.13.3
        ports:
        - containerPort: 80
        resources:
          requests:
            cpu: 4
          limits:
            cpu: 4
Elastic Container Instance-based scheduling
Elastic Container Instance-based scheduling is a scheduling policy that Alibaba Cloud provides for elastic resource scheduling. You can add annotations to specify the resources that you want to use when you deploy applications. You can specify that only Elastic Compute Service (ECS) instances or elastic container instances are used, or enable the system to request elastic container instances when ECS resources are insufficient. Elastic Container Instance-based scheduling can meet your resource requirements in different workload scenarios.
Before you enable Elastic Container Instance-based scheduling, you must install ack-virtual-node in the cluster. For more information, see Use Elastic Container Instance in ACK clusters.
You can use the following template to create a Deployment that has Elastic Container Instance-based scheduling enabled. For more information about how to use Elastic Container Instance-based scheduling, see Use Elastic Container Instance-based scheduling.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx
  labels:
    app: nginx
spec:
  replicas: 4
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      name: nginx
      annotations:
        alibabacloud.com/burst-resource: eci # Specify the type of resource that you want to use for elastic scheduling.
      labels:
        app: nginx
    spec:
      schedulerName: ack-co-scheduler # Specify ack-co-scheduler as the scheduler.
      containers:
      - name: nginx
        image: nginx
        resources:
          limits:
            cpu: 2
          requests:
            cpu: 2
Add the alibabacloud.com/burst-resource annotation to the template.metadata section of the pod configuration to specify the types of resources that you want to use. Valid values of alibabacloud.com/burst-resource:
If no value is specified, only existing ECS resources in the cluster are used. This is the default setting.
eci: Elastic container instances are used when the ECS resources in the cluster are insufficient.
eci_only: Only elastic container instances are used. The ECS resources in the cluster are not used.
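For example, to run a workload exclusively on elastic container instances instead of ECS nodes, the same annotation can be set to eci_only. The following sketch shows only the relevant fields of the pod template:

```yaml
template:
  metadata:
    annotations:
      # Use only elastic container instances; do not schedule onto ECS nodes.
      alibabacloud.com/burst-resource: eci_only
  spec:
    schedulerName: ack-co-scheduler
```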
Use cGPU
For more information about how to use cGPU, see Use cGPU, Monitor and isolate GPU resources, and Use node pools to control cGPU.