Container Service for Kubernetes (ACK) allows you to use gang scheduling and topology-aware scheduling to enable the scheduler to loop through a list of topology domains until it finds a topology domain that meets the requirements of all pods created by a job. In addition, you can associate node pools in ACK with Elastic Compute Service (ECS) deployment sets to conduct affinity scheduling in finer-grained topology domains than zones. Scheduling pods to ECS instances in the same deployment set reduces the network latency among the pods. This topic describes how to configure topology-aware scheduling.
Introduction to topology-aware scheduling
Pods created by machine learning or big data analysis jobs usually need to communicate frequently. By default, the Kubernetes scheduler evenly spreads pods across nodes in an ACK cluster. Consequently, pods may communicate with high latency, which increases the amount of time required to complete the job.
To improve the efficiency of the preceding jobs, you can deploy pods in the same zone or even on the same rack. This helps reduce the number of hops between pods so that pods can communicate at a lower latency. Kubernetes allows you to schedule pods based on node affinity and pod affinity. However, node affinity and pod affinity in Kubernetes have the following disadvantages.
You cannot configure the scheduler to loop through all topology domains during pod scheduling. With Kubernetes affinity scheduling, the pods of a job are scheduled to the topology domain that contains the first scheduled pod of the job. If that topology domain cannot meet the requirements of the other pods, some pods may become pending. The job cannot move the pods to another zone even if a different topology domain meets the requirements of these pods.
Affinity scheduling is limited by the labels that exist on nodes. Because nodes carry only zone-level topology labels by default, affinity scheduling can place all pods in the same zone but cannot schedule pods in finer-grained topology domains, such as the same deployment set or rack.
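For reference, the Kubernetes-native approach described above typically relies on a pod affinity rule like the following sketch placed in the pod spec; the app: tf-worker label is an assumed example and topology.kubernetes.io/zone is the standard zone label key:

spec:
  affinity:
    podAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            app: tf-worker                           # Assumed label shared by all pods of the job.
        topologyKey: topology.kubernetes.io/zone     # Pods must land in the zone of the first scheduled pod.

Once the first pod is placed, the remaining pods can only be scheduled into that zone. If the zone runs out of resources, those pods stay pending, which is the limitation described above.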
In ACK, you can use gang scheduling and topology-aware scheduling to enable the scheduler to loop through a list of topology domains until it finds the topology domain that meets the requirements of all pods created by a job. In addition, you can associate node pools in ACK with ECS deployment sets to conduct affinity scheduling in finer-grained topology domains than zones. Scheduling pods to ECS instances in the same deployment set reduces the network latency among the pods.
Loop through topology domains during pod scheduling
When gang scheduling labels are added to a job, the ACK scheduler kube-scheduler ensures that the resource requests of all pods created by the job are fulfilled at the same time, and then saves the scheduling result. This allows kube-scheduler to loop through multiple topology domains during pod scheduling. To use this feature, add the following labels and annotations to pods.
Add the following gang scheduling labels to pods. For more information about gang scheduling, see Work with gang scheduling.
labels:
  pod-group.scheduling.sigs.k8s.io/name: tf-smoke-gpu     # Specify the name of the PodGroup, such as tf-smoke-gpu.
  pod-group.scheduling.sigs.k8s.io/min-available: "3"     # Set the value to the number of pods created by the job.
Add the following topology-aware scheduling constraint to pods in the annotations section.
annotations:
  alibabacloud.com/topology-aware-constraint: "{\"name\":\"test\",\"required\":{\"topologies\":[{\"key\":\"kubernetes.io/hostname\"}],\"nodeSelectors\":[{\"matchLabels\":{\"test\":\"abc\"}}]}}"
The value of alibabacloud.com/topology-aware-constraint must be a valid JSON string. Otherwise, the topology-aware scheduling constraint cannot be parsed. The JSON string must be in the following format.

{
  "name": xxx,               # Specify a name.
  "required": {
    "topologies": [
      {
        "key": xxx           # The key of the topology domain for affinity scheduling.
      }
    ],
    "nodeSelectors": [
      {                      # You can refer to the format of the LabelSelector in Kubernetes node affinity.
        "matchLabels": {},
        "matchExpressions": {}
      }
    ]
  }
}
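For illustration, the gang scheduling labels and the topology-aware scheduling constraint can be combined on a workload's pod template. The following is a minimal sketch, assuming an nginx Deployment named nginx-deployment-basic with three replicas and nodes labeled test=abc; the image and resource requests are placeholders:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-deployment-basic                                # Assumed workload name for this example.
spec:
  replicas: 3
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
        pod-group.scheduling.sigs.k8s.io/name: tf-smoke-gpu   # Name of the PodGroup.
        pod-group.scheduling.sigs.k8s.io/min-available: "3"   # Number of pods created by the job.
      annotations:
        alibabacloud.com/topology-aware-constraint: "{\"name\":\"test\",\"required\":{\"topologies\":[{\"key\":\"kubernetes.io/hostname\"}],\"nodeSelectors\":[{\"matchLabels\":{\"test\":\"abc\"}}]}}"
    spec:
      containers:
      - name: nginx
        image: nginx:1.25                                     # Placeholder image.
        resources:
          requests:
            cpu: "1"                                          # Placeholder resource request.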
After you complete the configuration, the scheduler searches, among the nodes with the test=abc label, for a node that fulfills the resource requests of all pods with the pod-group.scheduling.sigs.k8s.io/name: tf-smoke-gpu label. The following output shows the scheduling result:

kubectl get po -ojson | jq '.items[] | {"name":.metadata.name,"ann":.metadata.annotations["alibabacloud.com/topology-aware-constraint"], "node":.spec.nodeName}'
{
  "name": "nginx-deployment-basic-69f47fc6db-6****",
  "ann": "{\"name\": \"test\", \"required\": {\"topologies\":[{\"key\": \"kubernetes.io/hostname\"}], \"nodeSelectors\": [{\"matchLabels\": {\"test\": \"a\"}}]}} ",
  "node": "cn-shenzhen.10.0.2.4"
}
{
  "name": "nginx-deployment-basic-69f47fc6db-h****",
  "ann": "{\"name\": \"test\", \"required\": {\"topologies\":[{\"key\": \"kubernetes.io/hostname\"}], \"nodeSelectors\": [{\"matchLabels\": {\"test\": \"a\"}}]}} ",
  "node": "cn-shenzhen.10.0.2.4"
}
{
  "name": "nginx-deployment-basic-69f47fc6db-j****",
  "ann": "{\"name\": \"test\", \"required\": {\"topologies\":[{\"key\": \"kubernetes.io/hostname\"}], \"nodeSelectors\": [{\"matchLabels\": {\"test\": \"a\"}}]}} ",
  "node": "cn-shenzhen.10.0.2.4"
}
{
  "name": "nginx-deployment-basic-69f47fc6db-l****",
  "ann": "{\"name\": \"test\", \"required\": {\"topologies\":[{\"key\": \"kubernetes.io/hostname\"}], \"nodeSelectors\": [{\"matchLabels\": {\"test\": \"a\"}}]}} ",
  "node": "cn-shenzhen.10.0.2.4"
}
{
  "name": "nginx-deployment-basic-69f47fc6db-x****",
  "ann": "{\"name\": \"test\", \"required\": {\"topologies\":[{\"key\": \"kubernetes.io/hostname\"}], \"nodeSelectors\": [{\"matchLabels\": {\"test\": \"a\"}}]}} ",
  "node": "cn-shenzhen.10.0.2.4"
}
Schedule pods to the same deployment set to reduce the network latency
In some scenarios, scheduling all pods to the same zone may not meet your business requirements, and you may need to conduct affinity scheduling in finer-grained topology domains, such as deployment sets. ECS allows you to create a deployment set to reduce the network latency among the ECS instances in the deployment set. For more information about how to use deployment sets in ACK, see Best practices for associating deployment sets with node pools.
When you create a node pool that is associated with a deployment set, you must add a custom node label, for example np-type=low-latency, to distinguish the node pool from other node pools.
After you complete the configuration, add the following annotations and labels to schedule a job to the deployment set.
Add the following gang scheduling labels to pods. For more information about gang scheduling, see Work with gang scheduling.
labels:
  pod-group.scheduling.sigs.k8s.io/name: xxx             # Specify the name of the PodGroup.
  pod-group.scheduling.sigs.k8s.io/min-available: "x"    # Set the value to the number of pods created by the job.
Add the following topology-aware scheduling constraint to pods in the annotations section.
Important: Replace matchLabels in this example with the custom node label of the node pool associated with the deployment set, and replace name with the actual value.

annotations:
  alibabacloud.com/topology-aware-constraint: "{\"name\":\"test\",\"required\":{\"topologies\":[{\"key\":\"alibabacloud.com/nodepool-id\"}],\"nodeSelectors\":[{\"matchLabels\":{\"np-type\":\"low-latency\"}}]}}"
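Putting the pieces together, a minimal sketch of a workload that targets the node pool associated with the deployment set could look as follows; the Deployment name low-latency-job, the replica count, the image, and the np-type=low-latency node label are assumptions for this example:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: low-latency-job                                          # Assumed workload name for this example.
spec:
  replicas: 4
  selector:
    matchLabels:
      app: low-latency-job
  template:
    metadata:
      labels:
        app: low-latency-job
        pod-group.scheduling.sigs.k8s.io/name: low-latency-job   # Name of the PodGroup.
        pod-group.scheduling.sigs.k8s.io/min-available: "4"      # Number of pods created by the job.
      annotations:
        alibabacloud.com/topology-aware-constraint: "{\"name\":\"test\",\"required\":{\"topologies\":[{\"key\":\"alibabacloud.com/nodepool-id\"}],\"nodeSelectors\":[{\"matchLabels\":{\"np-type\":\"low-latency\"}}]}}"
    spec:
      containers:
      - name: worker
        image: nginx:1.25                                        # Placeholder image.
        resources:
          requests:
            cpu: "1"                                             # Placeholder resource request.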