In scenarios such as machine learning or big data analysis, pods often need to communicate with each other frequently. By default, the Kubernetes scheduler spreads pods evenly across the nodes in a Container Service for Kubernetes (ACK) cluster, so pod-to-pod communication may incur high latency and jobs take longer to complete. In Lingjun clusters, network topology-aware scheduling can assign the pods of a job to the same Layer 1 or Layer 2 forwarding domain, which reduces network latency and accelerates job completion.
Solution overview
Network topology-aware scheduling uses a greedy algorithm to assign tasks to sets of nodes that span the smallest possible portion of the network topology.
Suppose a Lingjun cluster has two layers of network topology: Pod (point of delivery) and access switch (ASW). The ASW layer is the layer that Lingjun nodes connect to directly, while a Pod is a broader topology domain that contains multiple ASWs. Traffic between Lingjun nodes under the same ASW requires at least one hop of network forwarding, and traffic that crosses ASWs requires at least two hops.
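The original figure for this example is not reproduced here. One layout that is consistent with the example below (the grouping of nodes A-H into ASWs and Pods is an assumption made for illustration only):

           Pod 1                     Pod 2
          /     \                   /     \
      ASW 1     ASW 2           ASW 3     ASW 4
      /   \     /   \           /   \     /   \
     A     B   C     D         E     F   G     H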
For a task that requires two nodes, network topology-aware scheduling assigns the task to node pair A-B or node pair E-F.
For a task that requires four nodes, network topology-aware scheduling assigns the task to nodes A-D or nodes E-H.
To determine whether nodes in the ACK Lingjun cluster belong to the same ASW or Pod, check whether the nodes carry the alibabacloud.com/asw-id and alibabacloud.com/point-of-delivery labels. These labels describe each node's network location.
Declare topology structure
To implement network topology-aware scheduling in a Lingjun cluster, declare the cluster-level topology structure and then declare the topology-aware scheduling requirements of the task.
1. Create a file named cluster-network-topology.yaml to declare the two-level topology structure.
2. Create a file named sample-network-topology.yaml to declare the network topology-aware scheduling requirements of the task.
3. Create a file named pi.yaml to declare the Job that carries the network topology-aware scheduling information. When you submit a task, you must include the corresponding JobNetworkTopology information. (A sketch of pi.yaml follows this procedure.)
4. Run the following commands to deploy the YAML files in the cluster:
kubectl apply -f cluster-network-topology.yaml
kubectl apply -f sample-network-topology.yaml
kubectl apply -f pi.yaml
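The contents of the three files are not reproduced in this excerpt. As a rough orientation: cluster-network-topology.yaml describes the two topology layers and the node label key (labelKey) that each layer is read from, sample-network-topology.yaml is the JobNetworkTopology object that states how many workers (workerNum) must be packed into one topology domain, and pi.yaml is an ordinary Kubernetes Job. The following is only a hedged sketch of what pi.yaml might look like; the container image, the pi computation, the GPU request, and the pod-group name label are assumptions, how the Job is associated with the JobNetworkTopology object is not shown, and the real manifests should be taken from the product documentation.

apiVersion: batch/v1
kind: Job
metadata:
  name: pi
spec:
  parallelism: 2        # number of worker pods; kept in sync with workerNum in the JobNetworkTopology
  completions: 2
  template:
    metadata:
      labels:
        # Gang-scheduling labels. min-available is the parameter referenced later in this topic;
        # the pod-group name label and its value are assumptions.
        pod-group.scheduling.sigs.k8s.io/name: pi
        pod-group.scheduling.sigs.k8s.io/min-available: "2"
    spec:
      restartPolicy: Never
      containers:
      - name: pi
        image: perl:5.34.0        # placeholder image; a real task would run the actual workload
        command: ["perl", "-Mbignum=bpi", "-wle", "print bpi(2000)"]
        resources:
          limits:
            nvidia.com/gpu: 1     # assumed; inferred from the "Insufficient nvidia.com/gpu" message later in this topic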
Scheduling result demonstration
Use kubectl to connect to the cluster and retrieve the network topology information.
In Lingjun clusters, the lingjun-networktopology-collector component collects this information and attaches it to the nodes as labels.
For other nodes or other types of clusters, you must add the labels manually and make sure that the label keys match the labelKey values specified in the ClusterNetworkTopology in the preceding YAML template.

# Network topology is like:
#         test-pod-1               test-pod-2
#        /    |     \                  |
#   test-1  test-2  test-3          test-4
#    /  \      |       |               |
#  0.12 0.14  0.15    0.16            0.17
➜ kubectl get no -l alibabacloud.com/asw-id,alibabacloud.com/point-of-delivery -ojson | jq '.items[] | {"Name":.metadata.name, "ASW":.metadata.labels."alibabacloud.com/asw-id", "POD":.metadata.labels."alibabacloud.com/point-of-delivery"}'
{
  "Name": "cn-hongkong.10.1.0.12",
  "ASW": "test-1",
  "POD": "test-pod-1"
}
{
  "Name": "cn-hongkong.10.1.0.14",
  "ASW": "test-1",
  "POD": "test-pod-1"
}
{
  "Name": "cn-hongkong.10.1.0.15",
  "ASW": "test-2",
  "POD": "test-pod-1"
}
{
  "Name": "cn-hongkong.10.1.0.16",
  "ASW": "test-3",
  "POD": "test-pod-1"
}
{
  "Name": "cn-hongkong.10.1.0.17",
  "ASW": "test-4",
  "POD": "test-pod-2"
}
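On Lingjun nodes these labels are applied automatically by the collector. On other nodes you can add them with kubectl label; the node name and label values below are illustrative only:

kubectl label node cn-hongkong.10.1.0.18 \
  alibabacloud.com/asw-id=test-4 \
  alibabacloud.com/point-of-delivery=test-pod-2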
Submit the configured Job and run the following command. The Job runs on two nodes under test-1:
➜ kubectl get pod -owide
NAME       READY   STATUS              RESTARTS   AGE   IP               NODE                    NOMINATED NODE   READINESS GATES
pi-8p89l   1/1     Running             0          4s    172.30.240.197   cn-hongkong.10.1.0.14   <none>           <none>
pi-p8swv   0/1     ContainerCreating   0          4s    <none>           cn-hongkong.10.1.0.12   <none>           <none>
If you set the number of pods in the task to 4, update the parallelism and pod-group.scheduling.sigs.k8s.io/min-available parameters in the preceding Job YAML file and the workerNum parameter in the preceding JobNetworkTopology YAML file. After these modifications, all four pods are scheduled to nodes under test-pod-1, and only the node under test-pod-2 is left unused.
➜ kubectl get pod -owide
NAME       READY   STATUS              RESTARTS   AGE   IP               NODE                    NOMINATED NODE   READINESS GATES
pi-2kwq9   1/1     Running             0          4s    172.30.241.123   cn-hongkong.10.1.0.12   <none>           <none>
pi-87hm5   0/1     ContainerCreating   0          4s    <none>           cn-hongkong.10.1.0.16   <none>           <none>
pi-bsvx8   1/1     Running             0          4s    172.30.240.198   cn-hongkong.10.1.0.14   <none>           <none>
pi-dvwhl   0/1     ContainerCreating   0          4s    <none>           cn-hongkong.10.1.0.15   <none>           <none>
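For reference, the only values that change between the two-node run and the four-node run are the three parameters named above. A fragmentary excerpt: the surrounding structure of each file is assumed, and only the field and label names come from this procedure.

# pi.yaml (Job): run four workers and gang-schedule all of them
spec:
  parallelism: 4
  template:
    metadata:
      labels:
        pod-group.scheduling.sigs.k8s.io/min-available: "4"

# sample-network-topology.yaml (JobNetworkTopology): require four workers in one topology domain
spec:
  workerNum: 4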
If you set the number of pods in the task to 5, update the parallelism and pod-group.scheduling.sigs.k8s.io/min-available parameters in the preceding Job YAML file and the workerNum parameter in the preceding JobNetworkTopology YAML file. After these modifications, scheduling for the task fails. The first pod to be scheduled carries a scheduling failure message that contains the following text:

all fail topology paths by MustGather reason: [path:RootNode->test-pod-1, freeSlotNum:4], [path:RootNode->DefaultTopologyName, freeSlotNum:0], [path:RootNode->test-pod-2, freeSlotNum:1]
This message indicates that scheduling failed because cross-Pod scheduling is not permitted: no single Pod provides five free nodes (test-pod-1 has four free slots and test-pod-2 has one).

➜ kubectl get pod
NAME       READY   STATUS    RESTARTS   AGE
pi-75qf5   0/1     Pending   0          2s
pi-8k4nd   0/1     Pending   0          2s
pi-b2pmc   0/1     Pending   0          2s
pi-n7c2b   0/1     Pending   0          2s
pi-wf4zn   0/1     Pending   0          2s
➜ kubectl get pod -ojson | jq '.items[].status'
{
  "conditions": [
    {
      "lastProbeTime": null,
      "lastTransitionTime": "2024-05-29T07:46:27Z",
      "message": "0/6 nodes are available: 1 Insufficient nvidia.com/gpu, 1 [NetworkTopology begin] cluster total nodes:6, 5 node provide 5 freeSlot, 1 node unavailable cause Insufficient nvidia.com/gpu, job desireNum:5, all fail topology paths by MustGather reason: [path:RootNode->test-pod-1, freeSlotNum:4], [path:RootNode->DefaultTopologyName, freeSlotNum:0], [path:RootNode->test-pod-2, freeSlotNum:1] [NetworkTopology end], 4 NetworkTopology bestPlan empty. network topology job sample-network-topology/sample-network-topology gets rejected due to pod is unschedulable, preemption: 0/6 nodes are available: 1 No victims found on node cn-hongkong.10.1.0.10 for preemptor pod pi-75qf5, 5 Preemption is not helpful for scheduling..",
      "reason": "Unschedulable",
      "status": "False",
      "type": "PodScheduled"
    }
  ],
  "phase": "Pending",
  "qosClass": "BestEffort"
}
{
  "phase": "Pending",
  "qosClass": "BestEffort"
}
{
  "phase": "Pending",
  "qosClass": "BestEffort"
}
{
  "phase": "Pending",
  "qosClass": "BestEffort"
}
{
  "phase": "Pending",
  "qosClass": "BestEffort"
}