
Container Service for Kubernetes:Work with gang scheduling

Last Updated: Aug 28, 2024

The gang scheduling feature provided by Container Service for Kubernetes (ACK) is built on top of the new kube-scheduler framework. Gang scheduling ensures that a group of correlated pods is scheduled at the same time: if the scheduling requirements cannot be met, none of the pods is scheduled. Gang scheduling provides a solution to job scheduling in all-or-nothing scenarios. It is suitable for distributed big data computing frameworks, such as Spark and Hadoop, whose jobs require all tasks to be scheduled at the same time or to share resources. This topic describes how to enable gang scheduling.

Usage notes

Make sure that the resource capacity of the elastic node pool that you use and the node labels meet the requirements for pod scheduling. Otherwise, pods may fail to be scheduled to the nodes in the node pool.

Prerequisites

  • An ACK Pro cluster that runs Kubernetes 1.16 or later is created. For more information, see Create an ACK Pro cluster.

    Important

    Gang scheduling is available only for ACK Pro clusters. To enable gang scheduling in ACK dedicated clusters, submit a ticket.

Background information

Kubernetes is widely used for online service orchestration. ACK aims to use Kubernetes as a platform for the centralized management of online services and offline jobs, which improves the resource utilization and performance of clusters. However, kube-scheduler cannot meet the scheduling requirements of specific offline workloads that are migrated to Kubernetes clusters. For example, if a job requires all-or-nothing scheduling, all tasks of the job must be scheduled at the same time. If only some of the tasks are started, the started tasks must wait until all the remaining tasks are scheduled. If each submitted job contains unscheduled tasks, all submitted jobs remain in the Pending state and the cluster is deadlocked. To avoid this situation, you must enable gang scheduling for kube-scheduler.

Gang scheduling is a scheduling algorithm that schedules multiple correlated processes to different processors in a parallel system and simultaneously starts these processes. Gang scheduling aims to start all correlated processes at the same time. This ensures that the process group is not stuck when the system fails to start some processes. For example, if you submit a batch job that contains multiple tasks, either all of the tasks are scheduled or none of them is scheduled. Task scheduling in all-or-nothing scenarios is known as gang scheduling.

In ACK, a PodGroup is a group of pods that need to be scheduled at the same time. When you submit a job that requires all-or-nothing scheduling, you can add labels to pods. The labels specify the name of the PodGroup to which the job belongs and the minimum number of tasks that must be scheduled to run the job. kube-scheduler schedules tasks based on the minimum number of tasks that must be scheduled. The tasks are scheduled only when the cluster resources are sufficient to schedule the required number of tasks. Otherwise, the job remains in the Pending state.

How to enable gang scheduling

  1. To enable gang scheduling, add labels that specify name and min-available to the pods. When you use this method, kube-scheduler automatically creates a PodGroup named after the value of the pod-group.scheduling.sigs.k8s.io/name label. You must set the value of pod-group.scheduling.sigs.k8s.io/name to a subdomain name. For more information, see Object names and IDs. A complete pod sketch that uses these labels is provided after this list.

    labels:
        pod-group.scheduling.sigs.k8s.io/name: tf-smoke-gpu
        pod-group.scheduling.sigs.k8s.io/min-available: "3"
    • name: the name of a PodGroup.

    • min-available: the minimum number of pods that must be scheduled to run a job.

  2. In clusters that run Kubernetes 1.22 or later, you can also use one of the following methods to enable gang scheduling, provided that the kube-scheduler version is later than 1.xx.xx-aliyun-4.0.

    • Create a PodGroup and use the pod-group.scheduling.sigs.k8s.io or pod-group.scheduling.sigs.k8s.io/name label to specify the PodGroup to which your pods belong. The pods and the PodGroup must belong to the same namespace.

      Important

      Since version 1.31, ACK no longer supports the PodGroup resource of version scheduling.sigs.k8s.io/v1alpha1. It supports only the PodGroup resource of version scheduling.x-k8s.io/v1alpha1.

      # PodGroup CRD spec
      apiVersion: scheduling.sigs.k8s.io/v1alpha1
      kind: PodGroup
      metadata:
        name: nginx
      spec:
        scheduleTimeoutSeconds: 10
        minMember: 3
      ---
      # Add the pod-group.scheduling.sigs.k8s.io/name label to the pods.
      labels:
        pod-group.scheduling.sigs.k8s.io/name: nginx
    • Add the gang.scheduling.koordinator.sh/name and gang.scheduling.koordinator.sh/min-available annotations to the configurations of the pods that you want to manage, as shown in the following snippet and in the complete pod sketch after the note below. The total-number and mode parameters in the Koordinator API are not supported.

      annotations:  
       gang.scheduling.koordinator.sh/name: "gang-example" 
       gang.scheduling.koordinator.sh/min-available: "2"
Note

Pods that belong to the same PodGroup must be assigned the same priority.
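
For reference, the following snippet is a minimal sketch of two complete pods: the first joins a PodGroup through the pod-group.scheduling.sigs.k8s.io labels, and the second declares its gang through the Koordinator annotations described above. The pod names, container names, images, and resource limits are placeholders for illustration only.

    # Method 1: a pod that declares its PodGroup by using labels.
    apiVersion: v1
    kind: Pod
    metadata:
      name: gang-label-example          # Placeholder pod name.
      labels:
        # The pod belongs to the PodGroup tf-smoke-gpu and is scheduled only when at least 3 member pods can be scheduled.
        pod-group.scheduling.sigs.k8s.io/name: tf-smoke-gpu
        pod-group.scheduling.sigs.k8s.io/min-available: "3"
    spec:
      containers:
      - name: main                      # Placeholder container name.
        image: nginx:latest             # Placeholder image.
        resources:
          limits:
            cpu: "1"
    ---
    # Method 2: a pod that declares its gang by using the Koordinator annotations.
    apiVersion: v1
    kind: Pod
    metadata:
      name: gang-annotation-example     # Placeholder pod name.
      annotations:
        # The pod belongs to the gang gang-example and is scheduled only when at least 2 member pods can be scheduled.
        gang.scheduling.koordinator.sh/name: "gang-example"
        gang.scheduling.koordinator.sh/min-available: "2"
    spec:
      containers:
      - name: main                      # Placeholder container name.
        image: nginx:latest             # Placeholder image.
        resources:
          limits:
            cpu: "1"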

Advanced gang scheduling configurations

Limits

To use advanced gang scheduling configurations in clusters that run Kubernetes 1.22 or later, the kube-scheduler version must be later than 1.xx.xx-aliyun-4.0.

Declare a GangGroup

When you use gang scheduling, a job may include different roles that have different requirements on the value of the min-available parameter. For example, a PyTorch training job uses parameter servers and workers. If you use a single PodGroup to manage the pods of all roles, the min-available requirements of the different roles may not be met at the same time. If you create a separate PodGroup for each role, the pods of the roles cannot be scheduled in one batch. To resolve this issue, we recommend that you use the GangGroup feature to manage multiple gangs as a group. The job can run only when the number of scheduled pods reaches the value of min-available for each gang. This ensures that the min-available requirements of all roles are met.

  • If you use labels to enable gang scheduling, add the following label to the configurations of the pod:

    pod-group.scheduling.sigs.k8s.io/groups: "[\"default/gang-example1\", \"default/gang-example2\"]"
  • If you use PodGroups to enable gang scheduling, add the following label to the configurations of the PodGroups:

    pod-group.scheduling.sigs.k8s.io/groups: "[\"default/gang-example1\", \"default/gang-example2\"]"
  • If you use annotations to enable gang scheduling, add the following label to the configurations of the pod:

    gang.scheduling.koordinator.sh/groups: "[\"default/gang-example1\", \"default/gang-example2\"]"
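
For example, the following sketch declares two PodGroups in the default namespace, gang-example1 and gang-example2, and joins them into one GangGroup by adding the groups label to both. The role comments and minMember values are assumptions for illustration only.

    # PodGroup for one role, for example the parameter servers (minMember value is an assumption).
    apiVersion: scheduling.sigs.k8s.io/v1alpha1
    kind: PodGroup
    metadata:
      name: gang-example1
      namespace: default
      labels:
        # Both PodGroups declare the same GangGroup, so their pods are scheduled as one batch.
        pod-group.scheduling.sigs.k8s.io/groups: "[\"default/gang-example1\", \"default/gang-example2\"]"
    spec:
      minMember: 1
    ---
    # PodGroup for the other role, for example the workers (minMember value is an assumption).
    apiVersion: scheduling.sigs.k8s.io/v1alpha1
    kind: PodGroup
    metadata:
      name: gang-example2
      namespace: default
      labels:
        pod-group.scheduling.sigs.k8s.io/groups: "[\"default/gang-example1\", \"default/gang-example2\"]"
    spec:
      minMember: 4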

Declare a match policy

When you enable gang scheduling, you can declare a match policy to specify which types of pods a PodGroup counts toward the min-available requirement.

  • If you use labels to enable gang scheduling, add the following label to the configurations of the pod:

    pod-group.scheduling.sigs.k8s.io/match-policy: "waiting-and-running"
  • If you use PodGroups to enable gang scheduling, add the following label to the configurations of the PodGroups:

    pod-group.scheduling.sigs.k8s.io/match-policy: "waiting-and-running"
  • If you use annotations to enable gang scheduling, only the once-satisfied match method is supported.

The following list describes the match methods.

  • only-waiting: Only pods that have completed resource preallocation are matched.

  • waiting-and-running: Pods in the Running state and pods that have completed resource preallocation are matched.

  • waiting-running-succeed: Pods in the Succeeded state, pods in the Running state, and pods that have completed resource preallocation are matched.

  • once-satisfied: Only pods that have completed resource preallocation are matched. The PodGroup becomes invalid after the pods are matched.
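
For example, the following sketch is a minimal PodGroup that uses the waiting-and-running match method, so that both running pods and pods that have completed resource preallocation are counted toward minMember. The PodGroup name and minMember value are placeholders for illustration only.

    apiVersion: scheduling.sigs.k8s.io/v1alpha1
    kind: PodGroup
    metadata:
      name: gang-example                # Placeholder PodGroup name.
      labels:
        # Count pods in the Running state as well as pods that have completed resource preallocation.
        pod-group.scheduling.sigs.k8s.io/match-policy: "waiting-and-running"
    spec:
      minMember: 3                      # Placeholder value.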

Examples

In this example, a distributed TensorFlow job is used to demonstrate how to enable gang scheduling. The ACK cluster that is used in this example has four GPUs.

  1. Install Arena and deploy an environment in your cluster to run TensorFlow jobs. For more information, see Install Arena.

    Note

    Arena is a subproject of Kubeflow. Kubeflow is an open source project for Kubernetes-based machine learning. Arena allows you to manage the lifecycle of machine learning jobs by using a CLI or SDK. Lifecycle management includes environment setup, data preparation, model development, model training, and model prediction. This improves the working efficiency of data scientists.

  2. Use the following template to submit a distributed TensorFlow job to the ACK cluster. The job runs one parameter server (PS) pod and four worker pods. Each worker pod requires two GPUs.


    apiVersion: "kubeflow.org/v1"
    kind: "TFJob"
    metadata:
      name: "tf-smoke-gpu"
    spec:
      tfReplicaSpecs:
        PS:
          replicas: 1
          template:
            metadata:
              creationTimestamp: null
              labels:
                pod-group.scheduling.sigs.k8s.io/name: tf-smoke-gpu
                pod-group.scheduling.sigs.k8s.io/min-available: "5"
            spec:
              containers:
              - args:
                - python
                - tf_cnn_benchmarks.py
                - --batch_size=32
                - --model=resnet50
                - --variable_update=parameter_server
                - --flush_stdout=true
                - --num_gpus=1
                - --local_parameter_device=cpu
                - --device=cpu
                - --data_format=NHWC
                image: registry.cn-hangzhou.aliyuncs.com/kubeflow-images-public/tf-benchmarks-cpu:v20171202-bdab599-dirty-284af3
                name: tensorflow
                ports:
                - containerPort: 2222
                  name: tfjob-port
                resources:
                  limits:
                    cpu: '1'
                workingDir: /opt/tf-benchmarks/scripts/tf_cnn_benchmarks
              restartPolicy: OnFailure
        Worker:
          replicas: 4
          template:
            metadata:
              creationTimestamp: null
              labels:
                pod-group.scheduling.sigs.k8s.io/name: tf-smoke-gpu
                pod-group.scheduling.sigs.k8s.io/min-available: "5"
            spec:
              containers:
              - args:
                - python
                - tf_cnn_benchmarks.py
                - --batch_size=32
                - --model=resnet50
                - --variable_update=parameter_server
                - --flush_stdout=true
                - --num_gpus=1
                - --local_parameter_device=cpu
                - --device=gpu
                - --data_format=NHWC
                image: registry.cn-hangzhou.aliyuncs.com/kubeflow-images-public/tf-benchmarks-gpu:v20171202-bdab599-dirty-284af3
                name: tensorflow
                ports:
                - containerPort: 2222
                  name: tfjob-port
                resources:
                  limits:
                    nvidia.com/gpu: 2
                workingDir: /opt/tf-benchmarks/scripts/tf_cnn_benchmarks
              restartPolicy: OnFailure
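
    Assuming that the template is saved to a local file, for example tf-smoke-gpu.yaml (a placeholder file name), you can submit the job by running the following command:

    # Submit the TFJob to the cluster. The file name is a placeholder.
    kubectl apply -f tf-smoke-gpu.yaml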
    • Submit the distributed TensorFlow job without enabling gang scheduling

      Run the following command to query the status of the pods that run the TensorFlow job:

      kubectl get pods

      The output shows that only two worker pods are running and the other worker pods are in the Pending state.

      NAME                    READY   STATUS    RESTARTS   AGE
      tf-smoke-gpu-ps-0       1/1     Running   0          6m43s
      tf-smoke-gpu-worker-0   1/1     Running   0          6m43s
      tf-smoke-gpu-worker-1   1/1     Running   0          6m43s
      tf-smoke-gpu-worker-2   0/1     Pending   0          6m43s
      tf-smoke-gpu-worker-3   0/1     Pending   0          6m43s

      Run the following command to query the log data of the running worker pods.

      kubectl logs -f tf-smoke-gpu-worker-0

      The returned log data indicates that the two running worker pods are waiting for the pending worker pods to start. In the meantime, the GPU resources occupied by the running worker pods remain idle.

      INFO|2020-05-19T07:02:18|/opt/launcher.py|27| 2020-05-19 07:02:18.199696: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:worker/replica:0/task:3
      INFO|2020-05-19T07:02:28|/opt/launcher.py|27| 2020-05-19 07:02:28.199798: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:worker/replica:0/task:2
    • Submit the distributed TensorFlow job with gang scheduling enabled

      Run the following command to query the status of the pods that run the TensorFlow job:

      kubectl get pods

      The computing resources in the cluster are insufficient to schedule the minimum number of pods. Therefore, the PodGroup cannot be scheduled and all pods are in the Pending state.

      NAME                    READY   STATUS    RESTARTS   AGE
      tf-smoke-gpu-ps-0       0/1     Pending   0          43s
      tf-smoke-gpu-worker-0   0/1     Pending   0          43s
      tf-smoke-gpu-worker-1   0/1     Pending   0          43s
      tf-smoke-gpu-worker-2   0/1     Pending   0          43s
      tf-smoke-gpu-worker-3   0/1     Pending   0          43s

      After four more GPUs are added to the cluster, the computing resources in the cluster become sufficient to schedule the minimum number of pods. After the PodGroup is scheduled, the four worker pods start to run. Run the following command to query the status of the pods that run the TensorFlow job:

      kubectl get pods

      Expected output:

      NAME                    READY   STATUS    RESTARTS   AGE
      tf-smoke-gpu-ps-0       1/1     Running   0          3m16s
      tf-smoke-gpu-worker-0   1/1     Running   0          3m16s
      tf-smoke-gpu-worker-1   1/1     Running   0          3m16s
      tf-smoke-gpu-worker-2   1/1     Running   0          3m16s
      tf-smoke-gpu-worker-3   1/1     Running   0          3m16s

      Run the following command to query the log data of a running worker pod:

      kubectl logs -f tf-smoke-gpu-worker-0

      The following output indicates that the training job has started.

      INFO|2020-05-19T07:15:24|/opt/launcher.py|27| Running warm up
      INFO|2020-05-19T07:21:04|/opt/launcher.py|27| Done warm up
      INFO|2020-05-19T07:21:04|/opt/launcher.py|27| Step  Img/sec loss
      INFO|2020-05-19T07:21:05|/opt/launcher.py|27| 1 images/sec: 31.6 +/- 0.0 (jitter = 0.0) 8.318
      INFO|2020-05-19T07:21:15|/opt/launcher.py|27| 10  images/sec: 31.1 +/- 0.4 (jitter = 0.7) 8.343
      INFO|2020-05-19T07:21:25|/opt/launcher.py|27| 20  images/sec: 31.5 +/- 0.3 (jitter = 0.7) 8.142

Error messages

  • Error message: "rejected by podgroup xxx".

  • Possible causes: If you create multiple PodGroups in a cluster, the pods in a PodGroup may fail to be scheduled at the same time due to the backoff queue of kube-scheduler. In this case, pods that have completed resource preallocation may be rejected when the system schedules the pods of subsequent PodGroups. You can ignore the error if this situation lasts no more than 20 minutes. If it lasts more than 20 minutes, submit a ticket.

References

  • For the release notes of kube-scheduler, see kube-scheduler.

  • Kubernetes uses the ResourceQuota object to allocate resources statically. This method does not ensure high resource utilization in Kubernetes clusters. To improve the resource utilization of ACK clusters, Alibaba Cloud has developed the capacity scheduling feature based on the Yarn capacity scheduler and the Kubernetes scheduling framework. This feature uses elastic quota groups to meet the resource requests in an ACK cluster and share resources to improve resource utilization. For more information, see Work with capacity scheduling.