By Yu Zhuang
In the realms of modern software development and data processing, batch processing jobs play a vital role. They are commonly utilized in areas that demand extensive computing resources, such as data processing, simulation, and scientific computing. With the advent of cloud computing, services like Alibaba Cloud Batch Compute offer platforms to manage and execute these batch jobs.
As the cloud-native movement and Kubernetes ecosystem evolve, an increasing number of applications are being hosted on Kubernetes, including online applications, middleware, and databases. The question arises: Can offline tasks and batch computing also be executed on this unified Kubernetes platform? The answer is affirmative. Kubernetes clusters equipped with distributed Argo Workflows[1], which are built on the open-source Argo Workflows[2] project, adhere to open-source workflow standards. They are capable of orchestrating offline tasks and batch computing tasks, running them in a serverless fashion, which simplifies operations and maintenance while reducing operational costs.
With Kubernetes clusters for distributed Argo Workflows, orchestrating workflows becomes a breeze. Each step of a workflow is executed in a container, allowing for efficient handling of compute-intensive jobs such as large-scale machine learning, simulation, and data processing within a brief time frame. These clusters also facilitate rapid execution of CI/CD pipelines.
This article explores the distinctions between mainstream batch computing systems and Kubernetes clusters for distributed Argo Workflows. It also examines how to transition offline tasks and batch computing to these Kubernetes clusters for distributed Argo Workflows.
In a mainstream batch computing system, a job is a task unit, such as a shell script, a Linux executable, or a Docker container image, that you submit to the system. The system allocates computing resources within a compute environment to execute the job.
Array jobs are a collection of similar or identical jobs submitted and executed as a group. Though sharing the same job definition, each job can be identified by an index and may process a different dataset or perform a variation of the task.
A job definition outlines how to run a job. It is created before executing a job and typically includes the image for running the job, specific commands and parameters, required CPU and memory, environment variables, and disk storage.
Jobs are submitted to a specific job queue in the batch computing system, where they wait until scheduled for execution. Job queues can have assigned priorities and be linked to specific computing environments.
A compute environment consists of the computing resources allocated for running jobs. It requires specifications such as the virtual machine model, the maximum and minimum vCPU numbers, and the pricing for spot instances.
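As a rough illustration of how these terms fit together, a typical interaction with a batch computing system looks like the following sketch. The command names and flags here are hypothetical and vary by vendor; they are shown only to anchor the terms above.

# Hypothetical batch CLI flow, for illustration only (syntax differs between products)
batch create-compute-environment --name on-demand-env --min-vcpus 0 --max-vcpus 256
batch create-job-queue --name default-queue --compute-environment on-demand-env --priority 1
batch register-job-definition --name print-pet --image print-pet:latest --vcpus 1 --memory 2048
batch submit-job --job-definition print-pet --job-queue default-queue --array-size 5   # array job with 5 indexed jobs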
In Argo Workflows, a template defines a task (or job) and is an integral component of a workflow. Each workflow must include at least one template, which specifies the Kubernetes containers to be executed and their corresponding input and output parameters.
A workflow comprises one or multiple tasks (or templates) that can be orchestrated in various ways, such as serializing tasks, running tasks in parallel, or executing specific tasks when conditions are met. Once a workflow is established, the tasks within it are executed as pods within a Kubernetes cluster.
Workflow templates are reusable, static workflow definitions akin to functions. They can be referenced and executed across multiple workflows, allowing you to leverage existing templates when defining complex workflows to minimize redundant definitions.
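For example, a minimal sketch of a reusable WorkflowTemplate and a Workflow that references it through templateRef could look like the following; the name print-message is illustrative and not part of the article's examples.

apiVersion: argoproj.io/v1alpha1
kind: WorkflowTemplate   # reusable, static definition stored in the cluster
metadata:
  name: print-message
spec:
  templates:
    - name: main
      inputs:
        parameters:
          - name: message
      container:
        image: registry.cn-hangzhou.aliyuncs.com/acs/alpine:3.18-update
        command: [echo, "{{inputs.parameters.message}}"]
---
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: use-template-
spec:
  entrypoint: main
  templates:
    - name: main
      steps:
        - - name: call-template
            templateRef:   # reference the template defined in the WorkflowTemplate above
              name: print-message
              template: main
            arguments:
              parameters:
                - name: message
                  value: hello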
Distributed Argo Workflows on Kubernetes come with a built-in compute environment, eliminating the need for manual creation and management. Workflows submitted are run serverlessly using Alibaba Cloud Elastic Container Instances (ECIs), which means there is no need for Kubernetes node maintenance. Alibaba Cloud's elastic capabilities enable the execution of large-scale workflows with tens of thousands of pods and the use of hundreds of thousands of CPU cores, which are automatically released upon workflow completion. This results in accelerated workflow execution and cost savings.
The following example creates and submits a hello world workflow:

cat > helloworld.yaml << EOF
apiVersion: argoproj.io/v1alpha1
kind: Workflow                 # new type of k8s spec
metadata:
  generateName: hello-world-   # name of the workflow spec
spec:
  entrypoint: main             # invoke the main template
  templates:
    - name: main               # name of the template
      container:
        image: registry.cn-hangzhou.aliyuncs.com/acs/alpine:3.18-update
        command: [ "sh", "-c" ]
        args: [ "echo helloworld" ]
EOF
argo submit helloworld.yaml
The preceding workflow creates a pod that uses the alpine image and runs the shell command echo helloworld. You can modify the workflow to run other shell commands or to run commands in a custom image.
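Assuming the argo CLI is connected to the workflow cluster, the following commands are handy for checking the result:

argo submit helloworld.yaml --watch   # submit the workflow and watch it until it completes
argo list                             # list recent workflows
argo logs @latest                     # print the logs of the most recently started workflow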
The following workflow uses a loop to run the same template multiple times with different parameters, similar to an array job in a batch computing system:

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: loops-
spec:
  entrypoint: loop-example
  templates:
    - name: loop-example
      steps:
        - - name: print-pet
            template: print-pet
            arguments:
              parameters:
                - name: job-index
                  value: "{{item}}"
            withSequence:   # loop that runs the print-pet template with job-index 1 to 5
              start: "1"
              end: "5"
    - name: print-pet
      inputs:
        parameters:
          - name: job-index
      container:
        image: acr-multiple-clusters-registry.cn-hangzhou.cr.aliyuncs.com/ack-multiple-clusters/print-pet
        command: [/tmp/print-pet.sh]
        args: ["{{inputs.parameters.job-index}}"]   # pass the job-index input parameter as the container argument
In the preceding loop, a text file named pets.input and a script named print-pet.sh are packaged in the image named print-pet. The print-pet.sh script takes job-index as its input parameter and prints the pet in the job-index row of the pets.input file. For more information, visit the GitHub repository[3].
The loop creates five pods at a time and passes an input parameter (from job-index 1 to job-index 5) to each pod. Each pod prints the pet in the job-index row. Loops can be used to quickly process large amounts of data in sharding and parallel computing scenarios. For more sample loops, see Argo Workflows - Loops[4].
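Besides withSequence, a loop can also iterate over an explicit list with withItems, which is convenient when the shards to process are known in advance. The following is a minimal sketch; the file names are placeholders and not from the article's sample repository.

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: loops-items-
spec:
  entrypoint: main
  templates:
    - name: main
      steps:
        - - name: process-shard
            template: process
            arguments:
              parameters:
                - name: file
                  value: "{{item}}"
            withItems:   # one pod per listed shard, all running in parallel
              - part-000.csv
              - part-001.csv
              - part-002.csv
    - name: process
      inputs:
        parameters:
          - name: file
      container:
        image: registry.cn-hangzhou.aliyuncs.com/acs/alpine:3.18-update
        command: [sh, -c]
        args: ["echo processing {{inputs.parameters.file}}"]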
In batch computing scenarios, multiple jobs often need to work together, and a DAG (directed acyclic graph) is the natural way to express the dependencies between them. However, in a mainstream batch computing system, a job ID is returned only after the job is submitted, so you have to write a submission script that wires up the dependencies, as shown in the following sample code. As the number of jobs grows, the dependencies in the script become complex and the script becomes costly to maintain.
// The dependencies of each job in a batch computing system. Job B depends on Job A and is started only after Job A is complete.
batch submit JobA | get job-id
batch submit JobB --dependency job-id (JobA)
Argo Workflows allows you to define a DAG that specifies the dependencies between tasks, as shown in the following example:
# The following workflow executes a diamond workflow
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: dag-diamond-
spec:
  entrypoint: diamond
  templates:
    - name: diamond
      dag:
        tasks:
          - name: A
            template: echo
            arguments:
              parameters: [{name: message, value: A}]
          - name: B
            depends: "A"
            template: echo
            arguments:
              parameters: [{name: message, value: B}]
          - name: C
            depends: "A"
            template: echo
            arguments:
              parameters: [{name: message, value: C}]
          - name: D
            depends: "B && C"
            template: echo
            arguments:
              parameters: [{name: message, value: D}]
    - name: echo
      inputs:
        parameters:
          - name: message
      container:
        image: alpine:3.7
        command: [echo, "{{inputs.parameters.message}}"]
In the Git repository[5], we also provide a sample MapReduce workflow which can be used to create shards and aggregate computing results.
To migrate from a batch computing system to Kubernetes clusters for distributed Argo Workflows, first assess the existing batch jobs, including their dependencies, resource requests, and parameters. Learn the features and best practices of Argo Workflows, and choose the appropriate Argo Workflows features to replace those used in the batch computing system. You can skip the steps for designing compute environments and configuring job priorities because Kubernetes clusters for distributed Argo Workflows use serverless ECIs.
Then, create a Kubernetes cluster for distributed Argo Workflows. For more information, see Workflow cluster quickstart[6].
Next, convert the batch computing jobs to Argo Workflows based on the feature mappings between the batch computing system and Argo Workflows. You can also call the Argo Workflows SDK[7] to automate workflow creation and integration.
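In addition to the SDKs, the Argo Server exposes a REST API that external systems can call to submit workflows. The following is a minimal sketch, assuming the server address is stored in $ARGO_SERVER, $ARGO_TOKEN holds a valid bearer token, and workflows are submitted to the default namespace:

# Submit the hello world workflow through the Argo Server REST API (sketch only)
curl -sk "https://$ARGO_SERVER/api/v1/workflows/default" \
  -H "Authorization: $ARGO_TOKEN" \
  -H "Content-Type: application/json" \
  -d @- << 'EOF'
{
  "workflow": {
    "metadata": { "generateName": "hello-world-" },
    "spec": {
      "entrypoint": "main",
      "templates": [
        {
          "name": "main",
          "container": {
            "image": "registry.cn-hangzhou.aliyuncs.com/acs/alpine:3.18-update",
            "command": ["sh", "-c"],
            "args": ["echo helloworld"]
          }
        }
      ]
    }
  }
}
EOF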
Ensure that the Kubernetes clusters for distributed Argo Workflows can access the data required by the workflows. The clusters can mount and access Alibaba Cloud OSS, NAS, CPFS, cloud disks, and other storage resources. For more information, see Use volumes[8].
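For example, a workflow can mount a pre-created persistent volume claim so that every step reads and writes the same data. The following is a minimal sketch, assuming a PVC named batch-data (hypothetical) backed by one of the storage services above:

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: volume-example-
spec:
  entrypoint: main
  volumes:
    - name: data
      persistentVolumeClaim:
        claimName: batch-data   # hypothetical PVC pointing to OSS, NAS, CPFS, or a cloud disk
  templates:
    - name: main
      container:
        image: registry.cn-hangzhou.aliyuncs.com/acs/alpine:3.18-update
        command: [sh, -c]
        args: ["ls /mnt/data"]   # the mounted data is visible inside the step's container
        volumeMounts:
          - name: data
            mountPath: /mnt/data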
Verify that the workflows, data access, result output, and resource usage are normal and meet expectations.
Enable the observability capability of Kubernetes clusters for distributed Argo Workflows[9] to view the workflow status and logs.
[1] Kubernetes clusters for distributed Argo workflows
https://www.alibabacloud.com/help/en/ack/distributed-cloud-container-platform-for-kubernetes/user-guide/overview-12
[2] Argo Workflows
https://argoproj.github.io/argo-workflows/
[3] GitHub repository
https://github.com/AliyunContainerService/argo-workflow-examples/tree/main/loops
[4] Argo Workflows - Loops
https://argo-workflows.readthedocs.io/en/latest/walk-through/loops/
[5] Git repository
https://github.com/AliyunContainerService/argo-workflow-examples/tree/main/map-reduce
[6] Getting started with workflow clusters
https://www.alibabacloud.com/help/en/ack/distributed-cloud-container-platform-for-kubernetes/user-guide/workflow-cluster-quickstart
[7] SDK
https://argoproj.github.io/argo-workflows/client-libraries/
[8] Use volumes
https://www.alibabacloud.com/help/en/ack/distributed-cloud-container-platform-for-kubernetes/user-guide/use-volumes
[9] Observability
https://www.alibabacloud.com/help/en/ack/distributed-cloud-container-platform-for-kubernetes/user-guide/observability/