Ray provides the Ray autoscaler, which dynamically adjusts the compute resources of a Ray cluster based on workload. Container Service for Kubernetes (ACK) provides the ACK autoscaler, which automatically adjusts the number of nodes in the cluster based on workload. Together, the Ray autoscaler and the ACK autoscaler fully leverage the elasticity of cloud computing and improve the efficiency and cost-effectiveness of computing resources.
Prerequisites
(Optional) You have learned how to submit a job in a Ray cluster. For more information, see Submit a Ray job.
The node auto scaling feature is enabled for the default node pool of the ACK cluster.
Elastic scaling based on the Ray autoscaler and ACK autoscaler
Run the following commands to deploy a Ray cluster in the ACK cluster by using Helm:
# Remove any previous release with the same name, then install the chart.
helm uninstall ${RAY_CLUSTER_NAME} -n ${RAY_CLUSTER_NS}
helm install ${RAY_CLUSTER_NAME} aliyunhub/ack-ray-cluster -n ${RAY_CLUSTER_NS}
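To verify that the release was installed before continuing, you can list the Helm releases in the namespace:

# Check that the release status is "deployed".
helm list -n ${RAY_CLUSTER_NS}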
Run the following command to view the status of resources in the Ray cluster:
kubectl get pod -n ${RAY_CLUSTER_NS}

# Expected output:
NAME                             READY   STATUS    RESTARTS   AGE
myfirst-ray-cluster-head-kvvdf   2/2     Running   0          22m
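The head pod reports 2/2 because, when autoscaling is enabled, the head pod typically runs an autoscaler sidecar container next to the Ray head container. As a sketch, assuming the KubeRay convention of naming this container autoscaler, you can follow its logs to observe scaling decisions:

# Tail the autoscaler sidecar logs on the head pod. The container name
# "autoscaler" is an assumption based on the KubeRay convention.
kubectl -n ${RAY_CLUSTER_NS} logs -f myfirst-ray-cluster-head-kvvdf -c autoscaler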
Run the following command to log on to the head node and view the cluster status:
Replace myfirst-ray-cluster-head-kvvdf with the actual name of the head pod in your Ray cluster.
kubectl -n ${RAY_CLUSTER_NS} exec -it myfirst-ray-cluster-head-kvvdf -- bash
(base) ray@myfirst-ray-cluster-head-kvvdf:~$ ray status
Expected output:
======== Autoscaler status: 2024-01-25 00:00:19.879963 ========
Node status
---------------------------------------------------------------
Healthy:
 1 head-group
Pending:
 (no pending nodes)
Recent failures:
 (no failures)

Resources
---------------------------------------------------------------
Usage:
 0B/1.86GiB memory
 0B/452.00MiB object_store_memory

Demands:
 (no resource demands)
Submit and run the following job in the Ray cluster:
The following code starts 15 tasks, each of which requires one vCPU. By default, --num-cpus is set to 0 for the head pod, which means that no tasks are scheduled on the head pod. Each worker pod is configured with 1 vCPU and 1 GB of memory by default. Therefore, the Ray cluster automatically creates 15 worker pods to run the tasks. Because node resources in the ACK cluster are insufficient, the pending pods automatically trigger the node auto scaling feature.

import time
import ray
import socket

ray.init()

@ray.remote(num_cpus=1)
def get_task_hostname():
    # Each task occupies one vCPU for two minutes, then reports the IP
    # address of the host it ran on.
    time.sleep(120)
    host = socket.gethostbyname(socket.gethostname())
    return host

object_refs = []
for _ in range(15):
    object_refs.append(get_task_hostname.remote())

ray.wait(object_refs)

for t in object_refs:
    print(ray.get(t))
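One way to run this code in the cluster is through the Ray Jobs CLI, as described in Submit a Ray job. The following sketch assumes the code is saved locally as elastic_test.py (a hypothetical file name), that the head service follows the ${RAY_CLUSTER_NAME}-head-svc naming convention, and that Ray Dashboard listens on its default port 8265:

# Forward the Ray Dashboard port of the head service to the local machine.
kubectl -n ${RAY_CLUSTER_NS} port-forward service/${RAY_CLUSTER_NAME}-head-svc 8265:8265

# In another terminal, submit the job against the forwarded address.
ray job submit --address http://127.0.0.1:8265 --working-dir . -- python elastic_test.py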
Run the following command to query the status of pods in the Ray cluster:
kubectl get pod -n ${RAY_CLUSTER_NS} -w

# Expected output:
NAME                                           READY   STATUS    RESTARTS   AGE
myfirst-ray-cluster-head-kvvdf                 2/2     Running   0          47m
myfirst-ray-cluster-worker-workergroup-btgmm   1/1     Running   0          30s
myfirst-ray-cluster-worker-workergroup-c2lmq   0/1     Pending   0          30s
myfirst-ray-cluster-worker-workergroup-gstcc   0/1     Pending   0          30s
myfirst-ray-cluster-worker-workergroup-hfshs   0/1     Pending   0          30s
myfirst-ray-cluster-worker-workergroup-nrfh8   1/1     Running   0          30s
myfirst-ray-cluster-worker-workergroup-pjbdw   0/1     Pending   0          29s
myfirst-ray-cluster-worker-workergroup-qxq7v   0/1     Pending   0          30s
myfirst-ray-cluster-worker-workergroup-sm8mt   1/1     Running   0          30s
myfirst-ray-cluster-worker-workergroup-wr87d   0/1     Pending   0          30s
myfirst-ray-cluster-worker-workergroup-xc4kn   1/1     Running   0          30s
...
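To confirm that the pending worker pods triggered node auto scaling, you can describe one of them; if the ACK autoscaler has picked the pod up, its Events section typically contains a TriggeredScaleUp event. A sketch using one of the pod names from the output above:

# Inspect the scheduling events of a pending worker pod.
kubectl -n ${RAY_CLUSTER_NS} describe pod myfirst-ray-cluster-worker-workergroup-c2lmq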
Run the following command to query the node status:
kubectl get node -w

# Expected output:
NAME                       STATUS     ROLES    AGE   VERSION
cn-hangzhou.172.16.0.204   Ready      <none>   44h   v1.24.6-aliyun.1
cn-hangzhou.172.16.0.17    NotReady   <none>   0s    v1.24.6-aliyun.1
cn-hangzhou.172.16.0.17    NotReady   <none>   0s    v1.24.6-aliyun.1
cn-hangzhou.172.16.0.17    NotReady   <none>   0s    v1.24.6-aliyun.1
cn-hangzhou.172.16.0.17    NotReady   <none>   1s    v1.24.6-aliyun.1
cn-hangzhou.172.16.0.17    NotReady   <none>   11s   v1.24.6-aliyun.1
cn-hangzhou.172.16.0.16    NotReady   <none>   10s   v1.24.6-aliyun.1
cn-hangzhou.172.16.0.16    NotReady   <none>   14s   v1.24.6-aliyun.1
cn-hangzhou.172.16.0.17    NotReady   <none>   31s   v1.24.6-aliyun.1
cn-hangzhou.172.16.0.17    NotReady   <none>   60s   v1.24.6-aliyun.1
cn-hangzhou.172.16.0.17    Ready      <none>   61s   v1.24.6-aliyun.1
cn-hangzhou.172.16.0.16    Ready      <none>   64s   v1.24.6-aliyun.1
...
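Elastic scaling also works in the other direction: after the tasks finish, the Ray autoscaler removes worker pods that remain idle beyond the idle timeout (configurable in the cluster's autoscaler options), and the ACK autoscaler then reclaims the nodes that become empty. You can observe this with the same watch commands:

# Watch worker pods terminate after the tasks complete ...
kubectl get pod -n ${RAY_CLUSTER_NS} -w
# ... and watch the idle nodes being removed afterwards.
kubectl get node -w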
References
You can also implement elastic scaling based on elastic container instances. For more information, see Elastic scaling of Elastic Container Instance nodes based on the Ray autoscaler.
You can access Ray Dashboard from the local network. For more information, see Access Ray Dashboard from the local network.