By Yu Zhuang
In the realms of modern software development and data processing, batch processing jobs play a vital role. They are commonly utilized in areas that demand extensive computing resources, such as data processing, simulation, and scientific computing. With the advent of cloud computing, services like Alibaba Cloud Batch Compute offer platforms to manage and execute these batch jobs.
As the cloud-native movement and the Kubernetes ecosystem evolve, an increasing number of applications are being hosted on Kubernetes, including online applications, middleware, and databases. The question arises: can offline tasks and batch computing also run on this unified Kubernetes platform? The answer is yes. Kubernetes clusters for distributed Argo Workflows[1], which are built on the open-source Argo Workflows[2] project and adhere to open-source workflow standards, can orchestrate offline tasks and batch computing tasks and run them in a serverless fashion, which simplifies operations and maintenance while reducing costs.
With Kubernetes clusters for distributed Argo Workflows, orchestrating workflows becomes a breeze. Each step of a workflow is executed in a container, allowing for efficient handling of compute-intensive jobs such as large-scale machine learning, simulation, and data processing within a brief time frame. These clusters also facilitate rapid execution of CI/CD pipelines.
This article explores the distinctions between mainstream batch computing systems and Kubernetes clusters for distributed Argo Workflows. It also examines how to transition offline tasks and batch computing to these Kubernetes clusters for distributed Argo Workflows.
In a mainstream batch computing system, a job is a task unit, such as a shell script, a Linux executable, or a Docker container image, that you submit to the system. The system allocates computing resources within a compute environment to execute the job.
Array jobs are a collection of similar or identical jobs submitted and executed as a group. Though sharing the same job definition, each job can be identified by an index and may process a different dataset or perform a variation of the task.
A job definition outlines how to run a job. It is created before executing a job and typically includes the image for running the job, specific commands and parameters, required CPU and memory, environment variables, and disk storage.
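For comparison, the information a job definition carries maps onto the container section of an Argo Workflows template. The following minimal sketch is illustrative only; the image, command, environment variable, and resource values are placeholders, not settings from any particular batch computing system:

# Illustrative only: an Argo workflow whose template carries the information a
# job definition typically specifies (image, command, resources, environment).
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: job-definition-example-
spec:
  entrypoint: sample-job
  templates:
    - name: sample-job
      container:
        image: my-registry/my-batch-image:latest   # placeholder image
        command: ["sh", "-c"]
        args: ["./run-job.sh"]                     # placeholder entry script
        env:
          - name: INPUT_PATH                       # placeholder environment variable
            value: /data/input
        resources:
          requests:
            cpu: "2"          # required vCPUs
            memory: 4Gi       # required memory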
Jobs are submitted to a specific job queue in the batch computing system, where they wait until scheduled for execution. Job queues can have assigned priorities and be linked to specific computing environments.
A compute environment consists of the computing resources allocated for running jobs. It requires specifications such as the virtual machine model, the maximum and minimum vCPU numbers, and the pricing for spot instances.
In Argo Workflows, a template defines a task (or job) and is an integral component of a workflow. Each workflow must include at least one template, which specifies the Kubernetes containers to be executed and their corresponding input and output parameters.
A workflow comprises one or multiple tasks (or templates) that can be orchestrated in various ways, such as serializing tasks, running tasks in parallel, or executing specific tasks when conditions are met. Once a workflow is established, the tasks within it are executed as pods within a Kubernetes cluster.
Workflow templates are reusable, static workflow definitions akin to functions. They can be referenced and executed across multiple workflows, allowing you to leverage existing templates when defining complex workflows to minimize redundant definitions.
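As a sketch of how this reuse works, the following manifests define a WorkflowTemplate and a workflow that invokes one of its templates through templateRef; the resource names and parameter values are illustrative:

# A reusable WorkflowTemplate and a workflow that references it.
apiVersion: argoproj.io/v1alpha1
kind: WorkflowTemplate
metadata:
  name: echo-template
spec:
  templates:
    - name: echo
      inputs:
        parameters:
          - name: message
      container:
        image: alpine:3.18
        command: [echo, "{{inputs.parameters.message}}"]
---
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: use-template-
spec:
  entrypoint: main
  templates:
    - name: main
      steps:
        - - name: call-echo
            templateRef:              # reference the template defined above
              name: echo-template
              template: echo
            arguments:
              parameters:
                - name: message
                  value: "hello from a reusable template"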
Kubernetes clusters for distributed Argo Workflows come with a built-in compute environment, so you do not need to create or manage one manually. Submitted workflows run serverlessly on Alibaba Cloud Elastic Container Instances (ECIs), which eliminates Kubernetes node maintenance. Alibaba Cloud's elastic capabilities enable large-scale workflows with tens of thousands of pods and hundreds of thousands of CPU cores, and the resources are automatically released when a workflow completes. This accelerates workflow execution and reduces costs.
For example, the following commands create a hello world workflow and submit it with the Argo CLI:

cat > helloworld.yaml << EOF
apiVersion: argoproj.io/v1alpha1
kind: Workflow                  # new type of k8s spec
metadata:
  generateName: hello-world-    # name of the workflow spec
spec:
  entrypoint: main              # invoke the main template
  templates:
    - name: main                # name of the template
      container:
        image: registry.cn-hangzhou.aliyuncs.com/acs/alpine:3.18-update
        command: [ "sh", "-c" ]
        args: [ "echo helloworld" ]
EOF
argo submit helloworld.yaml
This workflow creates a pod that uses the alpine image to run the shell command echo helloworld. You can modify the workflow to run other shell commands, or to run commands in a custom image.
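After submitting the workflow, you can follow its progress with the Argo CLI. The argo namespace used below is an assumption and may differ in your cluster:

# Submit the workflow and watch it until it finishes (namespace is illustrative).
argo submit -n argo helloworld.yaml --watch

# List workflows and inspect the most recent one.
argo list -n argo
argo get -n argo @latest

# Print the logs of the most recent workflow.
argo logs -n argo @latest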
Array jobs in a batch computing system correspond to loops in Argo Workflows. The following workflow uses withSequence to run the same template with a different job-index value in each iteration:

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: loops-
spec:
  entrypoint: loop-example
  templates:
    - name: loop-example
      steps:
        - - name: print-pet
            template: print-pet
            arguments:
              parameters:
                - name: job-index
                  value: "{{item}}"
            withSequence:       # run the print-pet template with job-index 1 to 5
              start: "1"
              end: "5"
    - name: print-pet
      inputs:
        parameters:
          - name: job-index
      container:
        image: acr-multiple-clusters-registry.cn-hangzhou.cr.aliyuncs.com/ack-multiple-clusters/print-pet
        command: [/tmp/print-pet.sh]
        args: ["{{inputs.parameters.job-index}}"]   # pass job-index as the container argument
In this workflow, a text file named pets.input and a script named print-pet.sh are packaged in the image named print-pet. The print-pet.sh script takes job-index as its input parameter and prints the pet on the corresponding row of the pets.input file. For more information, visit the GitHub repository[3].
The loop creates five pods in parallel and passes a different input parameter (job-index 1 through 5) to each pod, so each pod prints the pet on its own row. Loops can be used to quickly process large amounts of data in sharding and parallel computing scenarios. For more sample loops, see Argo Workflows - Loops[4].
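Besides withSequence, loops can also iterate over an explicit list of values with withItems. The following is a minimal sketch with illustrative item values; it is separate from the print-pet example above:

# Run one pod per listed item instead of per number in a sequence.
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: loops-items-
spec:
  entrypoint: loop-items-example
  templates:
    - name: loop-items-example
      steps:
        - - name: print-item
            template: echo-item
            arguments:
              parameters:
                - name: item-value
                  value: "{{item}}"
            withItems: ["cat", "dog", "fish"]   # illustrative item list
    - name: echo-item
      inputs:
        parameters:
          - name: item-value
      container:
        image: alpine:3.18
        command: [echo, "{{inputs.parameters.item-value}}"]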
In batch computing scenarios, multiple jobs often need to work together, and a directed acyclic graph (DAG) is the best way to express the dependencies between them. However, in a mainstream batch computing system, a job ID is returned only after the job is submitted, so you have to write a script that wires the dependencies together, as shown in the following sample code. As the number of jobs grows, the dependencies in the script become complex and the script becomes costly to maintain.
# Dependencies between jobs in a batch computing system: Job B depends on Job A
# and is started only after Job A is complete.
batch submit JobA | get job-id
batch submit JobB --dependency job-id (JobA)
Argo Workflows allows you to define a DAG that specifies the dependencies between tasks, as shown in the following example:
# The following workflow executes a diamond workflow
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: dag-diamond-
spec:
  entrypoint: diamond
  templates:
    - name: diamond
      dag:
        tasks:
          - name: A
            template: echo
            arguments:
              parameters: [{name: message, value: A}]
          - name: B
            depends: "A"
            template: echo
            arguments:
              parameters: [{name: message, value: B}]
          - name: C
            depends: "A"
            template: echo
            arguments:
              parameters: [{name: message, value: C}]
          - name: D
            depends: "B && C"
            template: echo
            arguments:
              parameters: [{name: message, value: D}]
    - name: echo
      inputs:
        parameters:
          - name: message
      container:
        image: alpine:3.7
        command: [echo, "{{inputs.parameters.message}}"]
In the Git repository[5], we also provide a sample MapReduce workflow that creates shards and aggregates the computing results.
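The repository example is more complete, but the basic fan-out/fan-in shape looks roughly like the sketch below: a script task emits a JSON list of shards on stdout, withParam fans one task out per shard, and a final task receives the aggregated results. All names and values here are illustrative:

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: fan-out-fan-in-
spec:
  entrypoint: main
  templates:
    - name: main
      dag:
        tasks:
          - name: split                       # produce a JSON list of shards
            template: gen-shards
          - name: map                         # one task per shard
            template: process-shard
            depends: "split"
            arguments:
              parameters:
                - name: shard
                  value: "{{item}}"
            withParam: "{{tasks.split.outputs.result}}"
          - name: reduce                      # receive the aggregated results
            template: aggregate
            depends: "map"
            arguments:
              parameters:
                - name: results
                  value: "{{tasks.map.outputs.result}}"
    - name: gen-shards
      script:
        image: python:3.9-alpine
        command: [python]
        source: |
          import json
          print(json.dumps([0, 1, 2]))        # stdout becomes outputs.result
    - name: process-shard
      inputs:
        parameters:
          - name: shard
      script:
        image: python:3.9-alpine
        command: [python]
        source: |
          print({{inputs.parameters.shard}} * 10)   # stand-in for real work
    - name: aggregate
      inputs:
        parameters:
          - name: results
      container:
        image: alpine:3.18
        command: [echo, "collected: {{inputs.parameters.results}}"]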
To migrate batch computing jobs to Kubernetes clusters for distributed Argo Workflows, perform the following steps:
1. Assess the existing batch jobs, including their dependencies, resource requests, and parameters. Learn the features and best practices of Argo Workflows, and choose the appropriate Argo Workflows features to replace those used in the batch computing system. You can skip the steps for designing compute environments and configuring job priorities because Kubernetes clusters for distributed Argo Workflows use serverless ECIs.
2. Create a Kubernetes cluster for distributed Argo Workflows. For more information, see Workflow cluster quickstart[6].
3. Convert the batch computing jobs into Argo workflows based on the feature mappings between batch computing systems and Argo Workflows. You can also call the Argo Workflows SDK[7] to automate workflow creation and integration.
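The SDK clients[7] wrap the Argo Server REST API, so workflow submission can also be scripted directly against that API. The following is a minimal sketch; the server address, namespace, and token are placeholders for your own environment:

# Placeholders: replace ARGO_SERVER and ARGO_TOKEN with your Argo Server
# address and access token; "argo" is an assumed namespace.
ARGO_SERVER=https://argo-server.example.com:2746
curl -sk "$ARGO_SERVER/api/v1/workflows/argo" \
  -H "Authorization: Bearer $ARGO_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
        "workflow": {
          "metadata": {"generateName": "hello-world-"},
          "spec": {
            "entrypoint": "main",
            "templates": [{
              "name": "main",
              "container": {
                "image": "registry.cn-hangzhou.aliyuncs.com/acs/alpine:3.18-update",
                "command": ["sh", "-c"],
                "args": ["echo helloworld"]
              }
            }]
          }
        }
      }'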
4. Ensure that the Kubernetes cluster for distributed Argo Workflows can access the data required to run the workflows. The cluster can mount and access Alibaba Cloud OSS, NAS, CPFS, cloud disks, and other storage resources. For more information, see Use volumes[8].
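For example, a workflow can mount an existing PersistentVolumeClaim so that its steps read from and write to shared storage. The following is a minimal sketch, assuming a claim named data-pvc already exists in the cluster (for example, one backed by an OSS or NAS volume):

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: volume-example-
spec:
  entrypoint: main
  volumes:
    - name: workdir
      persistentVolumeClaim:
        claimName: data-pvc            # placeholder claim name
  templates:
    - name: main
      container:
        image: alpine:3.18
        command: [sh, -c]
        args: ["ls /mnt/data"]         # read the mounted data
        volumeMounts:
          - name: workdir
            mountPath: /mnt/data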
5. Verify that the workflows, data access, result output, and resource usage are normal and meet expectations.
6. Enable the observability features of Kubernetes clusters for distributed Argo Workflows[9] to view workflow status and logs.
[1] Kubernetes clusters for distributed Argo workflows
https://www.alibabacloud.com/help/en/ack/distributed-cloud-container-platform-for-kubernetes/user-guide/overview-12
[2] Argo Workflows
https://argoproj.github.io/argo-workflows/
[3] GitHub repository
https://github.com/AliyunContainerService/argo-workflow-examples/tree/main/loops
[4] Argo Workflows - Loops
https://argo-workflows.readthedocs.io/en/latest/walk-through/loops/
[5] Git repository
https://github.com/AliyunContainerService/argo-workflow-examples/tree/main/map-reduce
[6] Getting started with workflow clusters
https://www.alibabacloud.com/help/en/ack/distributed-cloud-container-platform-for-kubernetes/user-guide/workflow-cluster-quickstart
[7] SDK
https://argoproj.github.io/argo-workflows/client-libraries/
[8] Use volumes
https://www.alibabacloud.com/help/en/ack/distributed-cloud-container-platform-for-kubernetes/user-guide/use-volumes
[9] Observability
https://www.alibabacloud.com/help/en/ack/distributed-cloud-container-platform-for-kubernetes/user-guide/observability/