Compared with the Kubernetes scheduler, the scheduler provided by Container Service for Kubernetes (ACK) supports more features, such as gang scheduling, topology-aware CPU scheduling, and Elastic Container Instance-based scheduling. This topic describes how to install ack-co-scheduler in a registered cluster and how to use the scheduling features of ACK. ack-co-scheduler allows you to apply these features in various types of applications, such as big data applications and AI applications, to improve resource utilization.
Prerequisites
A registered cluster is created and an external Kubernetes cluster is connected to the registered cluster. For more information, see Create a registered cluster in the ACK console.
The following table describes the system component versions that are required.
Component          Version
Kubernetes         1.18.8 or later
Helm               3.0 or later
Docker             19.03.5
Operating system   CentOS 7.6, CentOS 7.7, Ubuntu 16.04, Ubuntu 18.04, and Alibaba Cloud Linux
Usage notes
When you deploy a job, you must specify ack-co-scheduler as the scheduler. To do this, set .template.spec.schedulerName to ack-co-scheduler.
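For example, the relevant field in a Deployment or job template looks like the following (only the lines related to the scheduler are shown):

```yaml
spec:
  template:
    spec:
      # Route the pods to the ACK scheduler instead of the default kube-scheduler.
      schedulerName: ack-co-scheduler
```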
Install ack-co-scheduler
Use onectl to install ack-co-scheduler
Install onectl on your on-premises machine. For more information, see Use onectl to manage registered clusters.
Run the following command to install ack-co-scheduler:
onectl addon install ack-co-scheduler
Expected output:
Addon ack-co-scheduler, version **** installed.
Use the ACK console to install ack-co-scheduler
Log on to the ACK console and click Clusters in the left-side navigation pane.
On the Clusters page, click the name of the cluster that you want to manage and choose Operations > Add-ons in the left-side navigation pane.
On the Add-ons page, click the Others tab. On the ack-co-scheduler card, click Install.
In the message that appears, click OK.
Gang scheduling
The gang scheduling feature provided by ACK is developed on top of the Kubernetes scheduling framework. This feature provides a solution to job scheduling in all-or-nothing scenarios.
You can use the following template to deploy a distributed TensorFlow training job that has gang scheduling enabled. For more information about how to run a distributed TensorFlow training job, see Work with gang scheduling.
apiVersion: "kubeflow.org/v1"
kind: "TFJob"
metadata:
  name: "tf-smoke-gpu"
spec:
  tfReplicaSpecs:
    PS:
      replicas: 1
      template:
        metadata:
          creationTimestamp: null
          labels:
            pod-group.scheduling.sigs.k8s.io/name: tf-smoke-gpu
            pod-group.scheduling.sigs.k8s.io/min-available: "2"
        spec:
          schedulerName: ack-co-scheduler # Specify ack-co-scheduler as the scheduler.
          containers:
          - args:
            - python
            - tf_cnn_benchmarks.py
            - --batch_size=32
            - --model=resnet50
            - --variable_update=parameter_server
            - --flush_stdout=true
            - --num_gpus=1
            - --local_parameter_device=cpu
            - --device=cpu
            - --data_format=NHWC
            image: registry.cn-hangzhou.aliyuncs.com/kubeflow-images-public/tf-benchmarks-cpu:v20171202-bdab599-dirty-284af3
            name: tensorflow
            ports:
            - containerPort: 2222
              name: tfjob-port
            resources:
              limits:
                cpu: '10'
            workingDir: /opt/tf-benchmarks/scripts/tf_cnn_benchmarks
          restartPolicy: OnFailure
    Worker:
      replicas: 4
      template:
        metadata:
          creationTimestamp: null
          labels:
            pod-group.scheduling.sigs.k8s.io/name: tf-smoke-gpu
            pod-group.scheduling.sigs.k8s.io/min-available: "2"
        spec:
          schedulerName: ack-co-scheduler
          containers:
          - args:
            - python
            - tf_cnn_benchmarks.py
            - --batch_size=32
            - --model=resnet50
            - --variable_update=parameter_server
            - --flush_stdout=true
            - --num_gpus=1
            - --local_parameter_device=cpu
            - --device=gpu
            - --data_format=NHWC
            image: registry.cn-hangzhou.aliyuncs.com/kubeflow-images-public/tf-benchmarks-cpu:v20171202-bdab599-dirty-284af3
            name: tensorflow
            ports:
            - containerPort: 2222
              name: tfjob-port
            resources:
              limits:
                cpu: 10
            workingDir: /opt/tf-benchmarks/scripts/tf_cnn_benchmarks
          restartPolicy: OnFailure
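The two pod-group labels in the template drive the all-or-nothing behavior: pods that carry the same pod-group.scheduling.sigs.k8s.io/name value are treated as one group, and no pod in the group is scheduled until at least min-available of them can be placed at the same time. The following sketch shows only the gang-related fields; the group name and min-available value here are illustrative, not part of the template above:

```yaml
template:
  metadata:
    labels:
      # All pods that share this group name are scheduled as one unit.
      pod-group.scheduling.sigs.k8s.io/name: my-training-job
      # The group is scheduled only when at least this many pods can be placed.
      pod-group.scheduling.sigs.k8s.io/min-available: "5"
  spec:
    schedulerName: ack-co-scheduler
```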
Topology-aware CPU scheduling
Before you enable topology-aware CPU scheduling, you must install resource-controller in the cluster. For more information, see Manage components.
You can use the following template to create a Deployment that has topology-aware CPU scheduling enabled. For more information about topology-aware CPU scheduling, see Topology-aware CPU scheduling.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-numa
  labels:
    app: nginx-numa
spec:
  replicas: 2
  selector:
    matchLabels:
      app: nginx-numa
  template:
    metadata:
      annotations:
        cpuset-scheduler: "true"
      labels:
        app: nginx-numa
    spec:
      schedulerName: ack-co-scheduler # Specify ack-co-scheduler as the scheduler.
      containers:
      - name: nginx-numa
        image: nginx:1.13.3
        ports:
        - containerPort: 80
        resources:
          requests:
            cpu: 4
          limits:
            cpu: 4
Elastic Container Instance-based scheduling
Elastic Container Instance-based scheduling is a scheduling policy that Alibaba Cloud provides for elastic resource scheduling. You can add annotations to specify the resources that you want to use when you deploy applications. You can specify that only Elastic Compute Service (ECS) instances or elastic container instances are used, or enable the system to request elastic container instances when ECS resources are insufficient. Elastic Container Instance-based scheduling can meet your resource requirements in different workload scenarios.
Before you enable Elastic Container Instance-based scheduling, you must install ack-virtual-node in the cluster. For more information, see Use Elastic Container Instance in ACK clusters.
You can use the following template to create a Deployment that has Elastic Container Instance-based scheduling enabled. For more information about how to use Elastic Container Instance-based scheduling, see Use Elastic Container Instance-based scheduling.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx
  labels:
    app: nginx
spec:
  replicas: 4
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      name: nginx
      annotations:
        alibabacloud.com/burst-resource: eci # Specify the type of resource that you want to use for elastic scheduling.
      labels:
        app: nginx
    spec:
      schedulerName: ack-co-scheduler # Specify ack-co-scheduler as the scheduler.
      containers:
      - name: nginx
        image: nginx
        resources:
          limits:
            cpu: 2
          requests:
            cpu: 2
Add the alibabacloud.com/burst-resource annotation to the template.metadata section of the pod configuration to specify the types of resources that you want to use. Valid values of alibabacloud.com/burst-resource:
If no value is specified, only existing ECS resources in the cluster are used. This is the default setting.
eci: Elastic container instances are used when the ECS resources in the cluster are insufficient.
eci_only: Only elastic container instances are used. The ECS resources in the cluster are not used.
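For example, to run a workload exclusively on elastic container instances instead of ECS nodes, the same annotation can be set to eci_only. The following sketch shows only the relevant fields of the pod template:

```yaml
template:
  metadata:
    annotations:
      # Use only elastic container instances; do not schedule onto ECS nodes.
      alibabacloud.com/burst-resource: eci_only
  spec:
    schedulerName: ack-co-scheduler
```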
Use cGPU
For more information about how to use cGPU, see Use cGPU, Monitor and isolate GPU resources, and Use node pools to control cGPU.