Argo Workflows is a workflow engine that is widely used for scheduled tasks, machine learning tasks, and extract, transform, load (ETL) tasks. Orchestrating workflows in YAML files can be challenging. Hera is an Argo Workflows SDK for Python that offers an alternative to YAML: you orchestrate and test complex workflows in Python code. Hera also integrates with the Python ecosystem, which simplifies workflow design. This topic describes how to use Hera to create large-scale workflows.
Background information
Argo Workflows is an open source workflow engine for automating complex workflow orchestration on Kubernetes. You can use Argo Workflows to create a collection of tasks and configure the execution sequence and dependencies for the tasks. This helps you efficiently create and manage custom automated workflows.
Argo Workflows is widely used in scenarios such as scheduled tasks, machine learning, simulation, scientific computing, extract, transform, load (ETL) tasks, model training, and continuous integration/continuous delivery (CI/CD) pipelines. Argo Workflows uses YAML files to configure workflows for clarity and simplicity. However, YAML relies on strict indentation to express hierarchy, which can mean a steep learning curve and error-prone configuration for users who are new to the syntax.
Hera is a Python SDK for creating and submitting workflows to Argo Workflows. Hera aims to simplify workflow creation and submission and is suitable for data scientists who are more familiar with Python than with YAML. Hera provides the following advantages:
Simplicity: Hera provides intuitive and easy-to-use code to greatly improve development efficiency.
Support for complex workflows: Because workflows are written in Python, Hera helps you avoid the YAML syntax errors that are common in complex workflow orchestration.
Integration with the Python ecosystem: Each Python function can be defined as a template, and Hera works with existing Python frameworks.
Testability: Hera supports Python testing frameworks to help improve code quality and maintainability.
Workflow clusters of Distributed Cloud Container Platform for Kubernetes (ACK One) run in a serverless mode. Argo Workflows is a managed component of workflow clusters. The following figure shows the architecture of Argo Workflows in workflow clusters.
Step 1: Create a workflow cluster and obtain an access token
Use one of the following methods to enable Argo Server for the workflow cluster:
Enable Internet access for Argo Server. This method is optional for users who use Express Connect circuits.
Run the following command to generate and obtain an access token of the cluster:
kubectl create token default -n default
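Because the token grants access to the cluster, it may be preferable to keep it out of source files. The following sketch shows one possible approach and is not part of the original procedure: it reads the token and the cluster identifiers from environment variables and builds the Argo Server endpoint in the form used later in this topic. The variable names ARGO_TOKEN, ARGO_CLUSTER_ID, and ARGO_REGION_ID and the helper load_argo_settings are hypothetical.

```python
import os
from typing import Mapping


def load_argo_settings(env: Mapping[str, str] = os.environ) -> dict:
    """Build the Argo Server host and token from environment variables.

    All variable names are hypothetical; adapt them to your environment.
    """
    required = ("ARGO_TOKEN", "ARGO_CLUSTER_ID", "ARGO_REGION_ID")
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")

    host = (
        f"https://argo.{env['ARGO_CLUSTER_ID']}.{env['ARGO_REGION_ID']}"
        ".alicontainer.com:2746"
    )
    # Strip stray whitespace, for example a trailing newline from a copied token.
    return {"host": host, "token": env["ARGO_TOKEN"].strip()}


if __name__ == "__main__":
    # Placeholder values for illustration only.
    demo_env = {
        "ARGO_TOKEN": "abcdefgxxxxxx\n",
        "ARGO_CLUSTER_ID": "c123",
        "ARGO_REGION_ID": "cn-hangzhou",
    }
    print(load_argo_settings(demo_env))
```

The returned values could then be assigned to global_config.host and global_config.token in place of string literals.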
Step 2: Get started with Hera
Run the following command to install Hera:
pip install hera-workflows
Orchestrate and submit a workflow.
Scenario 1: Simple DAG Diamond
Argo Workflows uses directed acyclic graphs (DAGs) to define complex dependencies for tasks in a workflow. The Diamond structure is commonly adopted by workflows. In a Diamond workflow, the execution results of multiple parallel tasks are aggregated into the input of a subsequent task. The Diamond structure can efficiently aggregate data flows and execution results. The following sample code provides an example on how to use Hera to orchestrate a Diamond workflow where Task B and Task C run in parallel after Task A, and the execution results of Task B and Task C are aggregated into the input of Task D.
Create a file named simpleDAG.py and copy the following content to the file:
# Import the required packages.
from hera.workflows import DAG, Workflow, script
from hera.shared import global_config
import urllib3

urllib3.disable_warnings()

# Specify the endpoint and token of the workflow cluster.
global_config.host = "https://argo.{{clusterid}}.{{region-id}}.alicontainer.com:2746"
global_config.token = "abcdefgxxxxxx"  # Enter the token you obtained.
global_config.verify_ssl = False  # Disable TLS certificate verification.

# The script decorator is the key to enabling Python-like function orchestration by using Hera.
# You can call the decorated function under a Hera context manager such as a Workflow or Steps context.
# The function still runs as normal outside Hera contexts, which means that you can write unit tests on the given function.
# The following function is a sample that prints its input.
@script()
def echo(message: str):
    print(message)

# Orchestrate a workflow. The Workflow is the main resource in Argo and a key class of Hera.
# The Workflow is responsible for storing templates, setting entry points, and running templates.
with Workflow(
    generate_name="dag-diamond-",
    entrypoint="diamond",
) as w:
    with DAG(name="diamond"):
        A = echo(name="A", arguments={"message": "A"})  # Create a template.
        B = echo(name="B", arguments={"message": "B"})
        C = echo(name="C", arguments={"message": "C"})
        D = echo(name="D", arguments={"message": "D"})
        # Define dependencies. In this example, Task A is the dependency of Task B and Task C.
        # Task B and Task C are the dependencies of Task D.
        A >> [B, C] >> D

# Create the workflow.
w.create()
Run the following command to submit the workflow:
python simpleDAG.py
After the workflow starts running, you can go to the Workflow Console (Argo) to view the DAG process and the result.
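To preview the order that the Diamond dependencies imply without a cluster, the dependency expression A >> [B, C] >> D can be modeled with the graphlib module from the Python standard library (Python 3.9 or later). This is only a sketch of the ordering, not of how Argo Workflows schedules pods; the dependencies dictionary and the execution_waves helper are assumptions made for illustration.

```python
from graphlib import TopologicalSorter

# Map each task to the set of tasks it depends on, mirroring A >> [B, C] >> D.
dependencies = {
    "A": set(),
    "B": {"A"},
    "C": {"A"},
    "D": {"B", "C"},
}


def execution_waves(deps):
    """Return groups of tasks that may run in parallel, in dependency order."""
    sorter = TopologicalSorter(deps)
    sorter.prepare()
    waves = []
    while sorter.is_active():
        # Tasks whose dependencies have all completed.
        ready = sorted(sorter.get_ready())
        waves.append(ready)
        sorter.done(*ready)
    return waves


print(execution_waves(dependencies))  # [['A'], ['B', 'C'], ['D']]
```

Each inner list is a group of tasks with no dependencies on one another, which matches the parallel middle layer of the Diamond.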
Scenario 2: MapReduce
In Argo Workflows, the key to processing data in the MapReduce style is to use DAG templates to organize and coordinate multiple tasks in order to simulate the Map and Reduce phases. The following sample code provides a detailed example on how to use Hera to orchestrate a sample MapReduce workflow that is used to count words in text files. Each step is defined in a Python function to integrate with the Python ecosystem.
Create a file named map-reduce.py and copy the following content to the file:
Run the following command to submit the workflow:
python map-reduce.py
After the workflow starts running, you can go to the Workflow Console (Argo) to view the DAG process and the result.
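As a rough, cluster-free illustration of the Map and Reduce phases in this scenario, the following sketch counts words with plain Python functions. It is not the map-reduce.py workflow and does not use Hera; the function names split, map_count, and reduce_counts and the sample text are assumptions. In a workflow, each function would correspond to one step, with the map step fanned out over the chunks in parallel.

```python
from collections import Counter
from typing import Dict, List


def split(lines: List[str], num_parts: int) -> List[List[str]]:
    """Split the input lines into num_parts chunks, one per map task."""
    return [lines[i::num_parts] for i in range(num_parts)]


def map_count(chunk: List[str]) -> Dict[str, int]:
    """Map phase: count the words in a single chunk."""
    counts = Counter()
    for line in chunk:
        counts.update(line.lower().split())
    return dict(counts)


def reduce_counts(partials: List[Dict[str, int]]) -> Dict[str, int]:
    """Reduce phase: merge the per-chunk counts into one total."""
    total = Counter()
    for partial in partials:
        total.update(partial)  # Adds counts for keys that already exist.
    return dict(total)


text = ["the quick brown fox", "the lazy dog", "the fox"]
partials = [map_count(chunk) for chunk in split(text, num_parts=2)]
word_counts = reduce_counts(partials)
print(word_counts)
```

Because the map step has no shared state, the chunks can be processed in any order, which is what allows the phase to run as parallel tasks.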
Configuration method comparison
Argo Workflows supports two configuration methods: YAML and Hera Framework. The following table compares the two methods.
Feature | YAML | Hera Framework |
Simplicity | Relatively high | High. This method is low-code. |
Workflow orchestration complexity | High | Low |
Integration with the Python ecosystem | Low | High. This method is integrated with rich Python libraries. |
Testability | Low. This method is prone to syntax errors. | High. This method supports testing frameworks. |
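The testability difference can be illustrated with a small test written in the style of common Python testing frameworks. This is a minimal sketch: the echo function has the same body as the one in the Diamond scenario, but the script decorator is omitted so that the example runs without Hera installed. According to the comments in that sample, a decorated function also runs as normal outside Hera contexts. The test name is hypothetical.

```python
import io
from contextlib import redirect_stdout


def echo(message: str):
    # Same body as the echo function in the Diamond scenario, without the decorator.
    print(message)


def test_echo_prints_message():
    # Capture standard output so that the printed text can be checked.
    buffer = io.StringIO()
    with redirect_stdout(buffer):
        echo("A")
    assert buffer.getvalue() == "A\n"


# A test runner such as pytest would discover the test by name; it is also called directly here.
test_echo_prints_message()
```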
Hera Framework integrates the Python ecosystem with Argo Workflows and reduces the complexity of workflow orchestration. Compared with YAML, Hera Framework offers a simpler way to orchestrate large-scale workflows and allows data engineers to work in Python, a language that is familiar to them. In machine learning scenarios, this makes it easier to iterate on workflows and move ideas from prototype to deployment.
If you have any questions about ACK One, join the DingTalk group 35688562.
References
The Hera documentation:
For more information about Hera, see Hera overview.
For more information about how to use Hera to train large language models (LLMs), see Train LLM with Hera.
Sample YAML deployment configurations:
For more information about how to use YAML files to deploy simple-diamond, see dag-diamond.yaml.
For more information about how to use YAML files to deploy map-reduce, see map-reduce.yaml.