Efficient deployment and optimization of Ray in ACK clusters

Ray is an open-source unified framework for scaling AI and Python applications. Ray is widely adopted in the machine learning sector. You can quickly create a Ray cluster in a Container Service for Kubernetes (ACK) cluster and integrate the Ray cluster with Simple Log Service, Managed Service for Prometheus, and ApsaraDB for Redis to optimize log management, observability, and availability. The Ray autoscaler can work with the ACK autoscaler to scale computing resources more efficiently and increase resource utilization.

Ray introduction

Ray is an open-source unified framework for scaling AI and Python applications. It provides an API to simplify distributed computing to help you efficiently develop parallel processing and distributed Python applications. Ray is widely adopted in the machine learning sector. The unified computing framework of Ray consists of the Ray AI Libraries, Ray Core, and Ray Clusters layers.

View the details of the computing framework

Ray AI Libraries

A set of open-source, domain-specific Python libraries that provide ML engineers, data scientists, and researchers with a scalable and unified toolkit for machine learning applications.

Ray Core

An open-source, general-purpose Python library for distributed computing that allows machine learning engineers and Python developers to scale Python applications and accelerate machine learning workloads. Ray Core provides three core primitives: tasks, actors, and objects. Tasks are Ray functions that are executed asynchronously on workers. Actors extend the Ray API from functions (tasks) to classes, which allows workers to hold state. Tasks and actors create and compute on objects. For more information, see Ray Framework and What is Ray Core?.

Ray Clusters

A Ray cluster consists of a head node and multiple worker nodes. The head node manages the Ray cluster and the worker nodes run computing tasks. Worker nodes communicate and collaborate with the head node through network connections. You can deploy a Ray cluster on physical machines, virtual machines, Kubernetes, or cloud computing platforms. For more information, see Ray Cluster Overview and Key Concepts.

To deploy a Ray application across multiple machines in a production environment, you must first create a Ray cluster that consists of a head node and worker nodes. Ray nodes run as pods in Kubernetes. You can use the Ray autoscaler to perform auto scaling.

The following figure shows the architecture of a Ray cluster.

Ray on Kubernetes

The KubeRay operator provides a Kubernetes-native way to manage Ray clusters. You can use the KubeRay operator to deploy Ray clusters in Kubernetes environments, including ACK clusters. When you install the KubeRay operator, you need to deploy the operator Deployment and the RayCluster, RayJob, and RayService CustomResourceDefinitions (CRDs).

Running Ray on Kubernetes greatly simplifies the deployment and management of distributed applications and provides the following benefits. For more information, see Ray on Kubernetes.

Auto scaling: Kubernetes can automatically scale the number of nodes based on workloads. After you deploy the Ray autoscaler in Kubernetes, Kubernetes can dynamically scale a Ray cluster based on workloads, optimize the resource utilization, and simplify the management of distributed applications.
Fault tolerance: Ray is designed to be fault tolerant. This capability is enhanced when Ray runs on Kubernetes. When a Ray node fails, Kubernetes automatically replaces the faulty node to ensure the stability and availability of the Ray cluster.
Resource management: In Kubernetes, you can create resource requests and limits to control and manage resources, such as CPU and memory resources, used by Ray nodes in a fine-grained manner. This helps improve resource utilization and avoid resource waste.
Simple deployment: Kubernetes provides a unified system for deploying, managing, and monitoring containerized applications. Ray on Kubernetes provides a consistent experience for configuring and managing Ray clusters in development, staging, and production environments.
Service discovery and load balancing: Kubernetes supports service discovery and load balancing. You can use Kubernetes to automatically manage connections between Ray nodes and connections between clients and a Ray cluster. This helps simplify network configuration and improve network performance.
Multi-tenancy: You can use namespaces in Kubernetes to isolate Ray clusters that belong to different users or teams and share resources in a Kubernetes cluster.
Monitoring and logging: Kubernetes provides observability capabilities for monitoring and logging, which allow you to trace the status and performance of your Ray cluster. For example, you can use Prometheus and Grafana to collect the performance metrics of Ray clusters.
Compatibility: Kubernetes is the core of cloud-native ecosystems. It is compatible with multiple cloud service providers and technology stacks. After you deploy a Ray cluster in Kubernetes, you can migrate or scale the cluster across different cloud computing platforms or hybrid cloud environments.

Ray on ACK

Container Service for Kubernetes (ACK) is one of the first services in the world to participate in the Certified Kubernetes Conformance Program. ACK provides high-performance containerized application management services and supports lifecycle management for enterprise-class containerized applications. You can use KubeRay to create Ray clusters in ACK clusters in the same way you create ACK clusters in the cloud.

Your Ray cluster can work with Simple Log Service, Managed Service for Prometheus, and ApsaraDB for Redis to improve log management, observability, and availability.
You can use the Ray autoscaler together with the ACK autoscaler to scale computing resources on demand.

Billing

After you create a Ray cluster in an ACK cluster, you can use Simple Log Service, Managed Service for Prometheus, and ApsaraDB for Redis to improve log management, observability, and availability. In addition to the fees incurred by ACK, you are charged for these services. For more information about billing, see the following topics:

Managed Service for Prometheus: Billing overview
Simple Log Service: Billing overview
ApsaraDB for Redis: Billable items

1 Prepare the environment

Step 1: Create a cluster

An ACK Pro cluster is created and meets the following requirements. For more information about how to create an ACK cluster, see Create an ACK managed cluster. For more information about how to upgrade an ACK cluster, see Manually upgrade ACK clusters.

The Kubernetes version of the cluster is v1.24 or later.
Node specifications: A node that provides at least 8 vCPUs and 32 GB of memory is created.
You can use the recommended minimum specifications in a staging environment. You can configure GPU-accelerated nodes on demand.
For more information about Elastic Compute Service (ECS) instance types, see Overview of instance families.
Simple Log Service is enabled for the cluster.
Managed Service for Prometheus is enabled for the cluster.
A kubectl client is connected to the cluster. For more information, see Obtain the kubeconfig file of a cluster and use kubectl to connect to the cluster.

(Optional) A Tair instance

To deploy a Ray cluster that supports high availability and fault tolerance, this example uses an ApsaraDB for Redis instance. You can create an instance on demand. A Tair (Redis OSS-compatible) instance is created and meets the following requirements.

The ApsaraDB for Redis instance must be deployed in the same region and virtual private cloud (VPC) as the ACK Pro cluster. For more information, see Step 1: Create an instance.
Add a whitelist to allow access from the VPC CIDR block. For more information, see Step 2: Configure whitelists.
Obtain the endpoint of the ApsaraDB for Redis instance. We recommend that you use the VPC endpoint. For more information, see View endpoints.
Obtain the password of the ApsaraDB for Redis instance. For more information, see Change or reset the password.

Step 2: Install Kuberay-Operator

Important

Kuberay-Operator provided by ACK clusters is in invitational preview. To use this component, submit a ticket and apply for it.

Log on to the ACK console. In the left-side navigation pane, click Clusters.
Click the name of the cluster you created. On the cluster details page, click Operations > Add-ons > Manage Applications as indicated in the following figure to install Kuberay-Operator.

Step 3: Deploy a Ray cluster

Run the following commands to create a Ray cluster named myfirst-ray-cluster and view the deployment status.

Run the following command to create a Ray cluster:

cat <<EOF | kubectl apply -f -
apiVersion: ray.io/v1
kind: RayCluster
metadata:
  name: myfirst-ray-cluster
  namespace: default
spec:
  suspend: false
  autoscalerOptions:
    env: []
    envFrom: []
    idleTimeoutSeconds: 60
    imagePullPolicy: Always
    resources:
      limits:
        cpu: 2000m
        memory: 2024Mi
      requests:
        cpu: 2000m
        memory: 2024Mi
    securityContext: {}
    upscalingMode: Default
  enableInTreeAutoscaling: false
  headGroupSpec:
    rayStartParams:
      dashboard-host: 0.0.0.0
      num-cpus: "0"
    serviceType: ClusterIP
    template:
      spec:
        containers:
        - image: rayproject/ray:2.36.1
          imagePullPolicy: Always
          name: ray-head
          resources:
            limits:
              cpu: "4"
              memory: 4G
            requests:
              cpu: "1"
              memory: 1G
  workerGroupSpecs:
  - groupName: work1
    maxReplicas: 1000
    minReplicas: 0
    numOfHosts: 1
    rayStartParams: {}
    replicas: 1
    template:
      spec:
        containers:
        - image: rayproject/ray:2.36.1
          imagePullPolicy: Always
          name: ray-worker
          resources:
            limits:
              cpu: "4"
              memory: 4G
            requests:
              cpu: "4"
              memory: 4G
EOF

Run the following command to query the deployment status of the Ray cluster:

kubectl get raycluster

NAME                  DESIRED WORKERS   AVAILABLE WORKERS   CPUS   MEMORY   GPUS   STATUS   AGE
myfirst-ray-cluster   1                 1                   5      5G       0      ready    4m19s
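
The CPUS and MEMORY values in the sample output match the sum of the container resource requests in the manifest above (1 + 4 CPUs and 1G + 4G of memory), not the limits. The following Python sketch reproduces that arithmetic. It assumes that the totals are derived from requests multiplied by the number of replicas; the helper names are illustrative and are not part of KubeRay.

```python
"""Sum the resource requests declared in the RayCluster manifest above."""

# Requests copied from the manifest. "replicas" is 1 for the head pod and
# the value of workerGroupSpecs[0].replicas for the worker group.
GROUPS = [
    {"name": "head", "replicas": 1, "cpu": "1", "memory": "1G"},
    {"name": "work1", "replicas": 1, "cpu": "4", "memory": "4G"},
]

# Decimal and binary suffixes used by Kubernetes memory quantities.
_MEMORY_UNITS = {
    "K": 10**3, "M": 10**6, "G": 10**9,
    "Ki": 2**10, "Mi": 2**20, "Gi": 2**30,
}


def parse_cpu(value: str) -> float:
    """Convert a Kubernetes CPU quantity such as "2000m" or "4" to cores."""
    if value.endswith("m"):
        return int(value[:-1]) / 1000
    return float(value)


def parse_memory(value: str) -> int:
    """Convert a Kubernetes memory quantity such as "4G" or "2024Mi" to bytes."""
    # Check two-letter suffixes first so that "Mi" is not mistaken for "M".
    for suffix in sorted(_MEMORY_UNITS, key=len, reverse=True):
        if value.endswith(suffix):
            return int(value[: -len(suffix)]) * _MEMORY_UNITS[suffix]
    return int(value)


def total_requests(groups):
    """Return (total CPU cores, total memory in bytes) across all groups."""
    cpus = sum(parse_cpu(g["cpu"]) * g["replicas"] for g in groups)
    memory = sum(parse_memory(g["memory"]) * g["replicas"] for g in groups)
    return cpus, memory


cpus, memory = total_requests(GROUPS)
print(f"CPUS={cpus:g} MEMORY={memory // 10**9}G")  # prints "CPUS=5 MEMORY=5G"
```

If you change replicas or the requests of a worker group, you can rerun the sketch to estimate the values that you would expect to see in the kubectl get raycluster output.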

Query the pods of the Ray cluster:

kubectl get pod

NAME                                     READY   STATUS    RESTARTS   AGE
myfirst-ray-cluster-head-5q2hk           1/1     Running   0          4m37s
myfirst-ray-cluster-work1-worker-zkjgq   1/1     Running   0          4m31s

Query the Services associated with the Ray cluster:

kubectl get svc

NAME                           TYPE        CLUSTER-IP    EXTERNAL-IP   PORT(S)                                         AGE
kubernetes                     ClusterIP   192.168.0.1   <none>        443/TCP                                         21d
myfirst-ray-cluster-head-svc   ClusterIP   None          <none>        10001/TCP,8265/TCP,8080/TCP,6379/TCP,8000/TCP   6m57s
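
The head Service in the sample output is headless (CLUSTER-IP is None), so workloads inside the cluster reach the head node by DNS name instead of by a virtual IP address. The following Python sketch builds the in-cluster addresses for that Service. It rests on assumptions that are not stated in this topic: the cluster uses the default cluster.local DNS domain, port 10001 serves the Ray client, and port 8265 serves Ray Dashboard. The function names are illustrative.

```python
"""Build in-cluster addresses for the head Service of a Ray cluster."""

# Assumed port roles; verify them against the port names of your Service.
RAY_CLIENT_PORT = 10001
DASHBOARD_PORT = 8265


def head_service_host(cluster_name: str, namespace: str = "default",
                      cluster_domain: str = "cluster.local") -> str:
    """Return the DNS name of the <cluster_name>-head-svc Service."""
    return f"{cluster_name}-head-svc.{namespace}.svc.{cluster_domain}"


def ray_client_address(cluster_name: str, namespace: str = "default") -> str:
    """Return an address in the ray://host:port form used by the Ray client."""
    host = head_service_host(cluster_name, namespace)
    return f"ray://{host}:{RAY_CLIENT_PORT}"


def dashboard_url(cluster_name: str, namespace: str = "default") -> str:
    """Return the HTTP URL of Ray Dashboard inside the cluster."""
    host = head_service_host(cluster_name, namespace)
    return f"http://{host}:{DASHBOARD_PORT}"


print(ray_client_address("myfirst-ray-cluster"))
print(dashboard_url("myfirst-ray-cluster"))
```

These addresses resolve only from pods inside the ACK cluster. To reach the dashboard from a local machine, see the topic listed in References.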

2 Integrate with Simple Log Service

You can integrate Simple Log Service with a Ray cluster to persist logs.

Run the following commands to create a global AliyunLogConfig object to enable the Logtail component in the ACK cluster to collect logs generated by the pods of Ray clusters and deliver the logs to a Simple Log Service project.

cat <<EOF | kubectl apply -f -
apiVersion: log.alibabacloud.com/v1alpha1
kind: AliyunLogConfig
metadata:
  name: rayclusters
  namespace: kube-system
spec:
  # The name of the Logstore. If the specified Logstore does not exist, Simple Log Service automatically creates a Logstore.
  logstore: rayclusters
  # Configure Logtail. 
  logtailConfig:
    # The type of data source. If you want to collect text logs, you must set the value to file. 
    inputType: file
    # The name of the Logtail configuration. The name must be the same as the resource name that is specified in metadata.name. 
    configName: rayclusters
    inputDetail:
      # Configure Logtail to collect text logs in simple mode. 
      logType: common_reg_log
      # The path of the log file. 
      logPath: /tmp/ray/session_*-*-*_*/logs
      # The name of the log file. You can use wildcard characters such as asterisks (*) and question marks (?) when you specify the log file name. Example: log_*.log. 
      filePattern: "*.*"
      # If you want to collect container text logs, you must set dockerFile to true. 
      dockerFile: true
      # The conditions that are used to filter containers. 
      advanced:
        k8s:
          IncludeK8sLabel:
            ray.io/is-ray-node: "yes"
          ExternalK8sLabelTag:
            ray.io/cluster: "_raycluster_name_"
            ray.io/node-type: "_node_type_"
EOF

Parameter	Description
`logPath`	Collect all logs in the `/tmp/ray/session_*-*-*_*/logs` directory of the pods. You can specify a custom path.
`advanced.k8s.ExternalK8sLabelTag`	Add tags to the collected logs for log retrieval. By default, the `_raycluster_name_` and `_node_type_` tags are added.

For more information about the AliyunLogConfig parameters, see Use CRDs to collect container logs in DaemonSet mode. Simple Log Service is a paid service. For more information, see Billing overview.
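
To check which directories a wildcard path such as the logPath value selects, you can test candidate paths locally. The following Python sketch uses the standard fnmatch module as a stand-in for the matcher. This is an approximation: Logtail may treat wildcards differently (for example, fnmatch allows * to match a slash). The session directory names in the sample list are hypothetical.

```python
"""Check candidate directories against the logPath wildcard pattern."""
from fnmatch import fnmatchcase

# The logPath value from the AliyunLogConfig object above.
LOG_PATH_PATTERN = "/tmp/ray/session_*-*-*_*/logs"


def matches_log_path(directory: str, pattern: str = LOG_PATH_PATTERN) -> bool:
    """Return True if the directory matches the wildcard pattern."""
    return fnmatchcase(directory, pattern)


# Hypothetical directories inside a Ray pod.
samples = [
    "/tmp/ray/session_2024-05-01_10-30-00_123456_1/logs",
    "/tmp/ray/session_latest/logs",
    "/tmp/ray/logs",
]

for path in samples:
    print(f"{path}: {matches_log_path(path)}")
```

Under these assumptions, only the timestamped session directory matches. A name without the date and time segments, such as session_latest, does not.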

View the log information of the Ray cluster.
Log on to the ACK console. In the left-side navigation pane, click Clusters. Click the name of the cluster you created. On the cluster details page, follow Cluster Information > Basic Information > Cluster Resources as indicated in the following figure, then click the hyperlink to the right of Log Service Project to access the Simple Log Service project.
Select the Logstore that corresponds to rayclusters and view the log content.
You can view the logs of different Ray clusters based on tags, such as _raycluster_name_.

3 Integrate with Managed Service for Prometheus

You can use Prometheus monitoring services in the Ray cluster. For more information about Prometheus monitoring services, see Managed Service for Prometheus.

Run the following commands to deploy the Pod Monitor and Service Monitor to collect metrics data for the Ray cluster.

Run the following commands to deploy the Pod Monitor:

cat <<EOF | kubectl apply -f -
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  annotations:
    arms.prometheus.io/discovery: 'true'
    arms.prometheus.io/resource: arms
  name: ray-workers-monitor
  namespace: arms-prom
  labels:
    # `release: $HELM_RELEASE`: Prometheus can only detect PodMonitor with this label.
    release: prometheus
    #ray.io/cluster: raycluster-kuberay # $RAY_CLUSTER_NAME: "kubectl get rayclusters.ray.io"
spec:
  namespaceSelector:
    any: true
  jobLabel: ray-workers
  # Only select Kubernetes Pods with "matchLabels".
  selector:
    matchLabels:
      ray.io/node-type: worker
  # A list of endpoints allowed as part of this PodMonitor.
  podMetricsEndpoints:
  - port: metrics
    relabelings:
    - action: replace
      regex: (.+)
      replacement: $1
      separator: ;
      sourceLabels:
        - __meta_kubernetes_pod_label_ray_io_cluster
      targetLabel: ray_io_cluster
      
EOF
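
The relabelings entry in the PodMonitor copies the pod label ray.io/cluster to a ray_io_cluster label on every scraped metric. The following Python sketch imitates the replace action to show the effect. It is a simplified model and not the Prometheus implementation: it uses Python regular expressions instead of RE2, and the function names are illustrative.

```python
"""Simplified model of the Prometheus "replace" relabeling action."""
import re


def sanitize_label_name(name: str) -> str:
    """Replace characters that are not allowed in label names with "_"."""
    return re.sub(r"[^a-zA-Z0-9_]", "_", name)


def relabel_replace(labels, source_labels, target_label,
                    regex="(.+)", replacement="$1", separator=";"):
    """Return a copy of labels with target_label set if the regex matches."""
    # Join the source label values, as the separator field specifies.
    value = separator.join(labels.get(name, "") for name in source_labels)
    # The expression must match the whole joined value.
    match = re.fullmatch(regex, value)
    result = dict(labels)
    if match is None:
        return result  # No match: the labels stay unchanged.
    # Translate "$1"-style references into the "\g<1>" form used by Python.
    template = re.sub(r"\$(\d+)", r"\\g<\1>", replacement)
    result[target_label] = match.expand(template)
    return result


# Pod label as set by KubeRay; discovery exposes it under a sanitized name.
pod_labels = {"ray.io/cluster": "myfirst-ray-cluster"}
discovered = {
    "__meta_kubernetes_pod_label_" + sanitize_label_name(key): value
    for key, value in pod_labels.items()
}

relabeled = relabel_replace(
    discovered,
    source_labels=["__meta_kubernetes_pod_label_ray_io_cluster"],
    target_label="ray_io_cluster",
)
print(relabeled["ray_io_cluster"])  # prints "myfirst-ray-cluster"
```

Because the expression (.+) requires at least one character, a pod without the ray.io/cluster label keeps its labels unchanged and receives no ray_io_cluster label in this model.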

Run the following commands to deploy the Service Monitor:

cat <<EOF | kubectl apply -f -
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  annotations:
    arms.prometheus.io/discovery: 'true'
    arms.prometheus.io/resource: arms
  name: ray-head-monitor
  namespace: arms-prom
  labels:
    # `release: $HELM_RELEASE`: Prometheus can only detect ServiceMonitor with this label.
    release: prometheus
spec:
  namespaceSelector:
    any: true
  jobLabel: ray-head
  # Only select Kubernetes Services with "matchLabels".
  selector:
    matchLabels:
      ray.io/node-type: head
  # A list of endpoints allowed as part of this ServiceMonitor.
  endpoints:
    - port: metrics
      path: /metrics
  targetLabels:
  - ray.io/cluster
  
EOF

Log on to the console to check the status of resource deployment and integration.
1. Log on to the ARMS console. In the left-side navigation pane, choose Integration Center, enter Ray in the search box, then find and select Ray. In the Ray panel, select the ACK cluster that you created and click OK.
2. After the ACK cluster is integrated with Managed Service for Prometheus, choose Integration Management to go to the Integration Management page. On the Component Management tab, click Dashboards in the Component Type section and click Ray Cluster.
3. Specify Namespace, RayClusterName, and SessionName to filter the monitoring data of tasks that run in the Ray clusters.

References

You can access Ray Dashboard from the local network. For more information, see Access Ray Dashboard from the local network.
For more information about how to submit jobs in a Ray cluster, see Submit a Ray job.
For more information about how to use the Ray autoscaler to automatically scale ECS nodes, see Elastic scaling based on the Ray autoscaler and ACK autoscaler.
For more information about how to use the Ray autoscaler to automatically scale virtual Elastic Container Instance nodes, see Elastic scaling of Elastic Container Instance nodes based on the Ray autoscaler.