By Su Yashi and Caishu
Argo Workflows is an open-source, cloud-native workflow engine and a CNCF graduated project. It simplifies the automation and management of complex workflows on Kubernetes, making it suitable for various scenarios, including scheduled tasks, machine learning, ETL and data analysis, model training, data flow pipelines, and CI/CD.
When we use Argo Workflows to orchestrate tasks, especially in scenarios that involve large amounts of data, such as model training, data processing, and bioinformatics analysis, efficient management of Artifacts (usually stored in OSS in the Alibaba Cloud environment) is critical. However, users who adopt the open-source solution may encounter several challenges, including:
Upload failure of oversized files: If the size of a file exceeds 5Gi, the upload will fail due to the upload limit on the client.
Lack of a file cleanup mechanism: If the temporary files generated during the workflow or the output results of completed tasks are not cleaned up in time, it will lead to unnecessary consumption of OSS storage space.
High disk usage of Argo Server: When Argo Server is used to download files, it must persist the data to disk before transferring it. This results in high disk usage, which not only affects server performance but may also cause service interruption or data loss.
As a fully managed Argo Workflows service that completely adheres to community standards, ACK One Serverless Argo Workflows is designed to address the challenges of large-scale and high-security file management tasks. This article introduces a series of enhancements to the service in this regard, including multipart upload of oversized files, automatic garbage collection (GC) of Artifacts, and Artifact streaming. These features are designed to help users manage OSS files in an efficient, secure, and fine-grained manner in the Alibaba Cloud environment.
When we orchestrate tasks with Argo Workflows, we use Artifacts to upload data such as intermediate outputs, execution results, and process logs to OSS. This persists and shares the data, relieves pressure on the temporary storage of Pods, and supports disaster recovery and backup. In scenarios such as model training, data processing, bioinformatics analysis, and audio and video processing, we often need to upload many large files.
The open-source solution does not support uploading oversized objects, which is a significant inconvenience for users. ACK One Serverless Argo Workflows optimizes the logic for uploading oversized objects to OSS and supports multipart and resumable uploads. This improves the efficiency and reliability of large file processing, especially for data-intensive tasks and distributed computing, and makes better use of resources when handling large data sets. In addition, the integrity of each part is verified independently, which better guarantees data integrity and improves fault tolerance and data security.
Sample code:
This feature is enabled by default in ACK One Serverless Argo Workflows. After configuring Artifacts, we can submit the following sample workflow, which generates a 20Gi file named testfile.txt and uploads it to OSS as an output Artifact. If the file then appears in OSS, the oversized object has been uploaded successfully.
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: artifact-
spec:
  entrypoint: main
  templates:
    - name: main
      metadata:
        annotations:
          k8s.aliyun.com/eci-extra-ephemeral-storage: "20Gi" # Specify the capacity that you want to scale up for the temporary storage space.
          k8s.aliyun.com/eci-use-specs: "ecs.g7.xlarge"
      container:
        image: alpine:latest
        command:
          - sh
          - -c
        args:
          - |
            mkdir -p /out
            dd if=/dev/random of=/out/testfile.txt bs=20M count=1024 # Generate a 20Gi file
            echo "created files!"
      outputs: # Trigger the upload of a file to OSS.
        artifacts:
          - name: out
            path: /out/testfile.txt
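The multipart and resumable upload idea described above can be illustrated with a minimal Python sketch. It is not the service's implementation: the in-memory `store` dictionary is a hypothetical stand-in for the remote side, the 8-byte part size is chosen only to keep the example readable, and SHA-256 is an assumed choice of per-part checksum.

```python
import hashlib
import io
from typing import BinaryIO, Dict, Iterator, Tuple

PART_SIZE = 8  # bytes; tiny on purpose so the example is easy to follow


def iter_parts(stream: BinaryIO, part_size: int) -> Iterator[Tuple[int, bytes]]:
    """Yield (part_number, data) pairs, numbering parts from 1."""
    part_number = 1
    while True:
        data = stream.read(part_size)
        if not data:
            return
        yield part_number, data
        part_number += 1


def upload_parts(stream: BinaryIO, store: Dict[int, Tuple[str, bytes]],
                 part_size: int = PART_SIZE) -> int:
    """Send every part that is missing from `store`; return how many were sent.

    `store` (hypothetical) maps a part number to (sha256 hex digest, data).
    Parts already present with a matching digest are skipped, which is what
    makes a retried upload resumable.
    """
    sent = 0
    for number, data in iter_parts(stream, part_size):
        digest = hashlib.sha256(data).hexdigest()
        if number in store and store[number][0] == digest:
            continue  # this part survived an earlier attempt
        store[number] = (digest, data)
        sent += 1
    return sent


def assemble(store: Dict[int, Tuple[str, bytes]]) -> bytes:
    """Verify each part independently, then join the parts in order."""
    chunks = []
    for number in sorted(store):
        digest, data = store[number]
        if hashlib.sha256(data).hexdigest() != digest:
            raise ValueError(f"part {number} failed its integrity check")
        chunks.append(data)
    return b"".join(chunks)


if __name__ == "__main__":
    payload = b"0123456789abcdefghij"  # 20 bytes: parts of 8, 8, and 4 bytes
    store: Dict[int, Tuple[str, bytes]] = {}
    print(upload_parts(io.BytesIO(payload), store))  # 3 parts sent
    del store[2]  # simulate a part lost in an interrupted upload
    print(upload_parts(io.BytesIO(payload), store))  # only 1 part re-sent
    print(assemble(store) == payload)  # True
```

Because each part carries its own checksum, a retry only needs to resend the parts that are missing or corrupted, rather than the whole file.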
The Artifact garbage collection (GC) mechanism of Argo Workflows is mainly used to delete no-longer-needed files generated by the workflow (such as intermediate results and logs) after the workflow ends, which can save storage space and costs and prevent unlimited consumption of storage resources.
The open-source solution cannot automatically reclaim files from OSS, which increases usage and O&M costs. ACK One Serverless Argo Workflows therefore optimizes file cleanup on OSS. With a simple reclaim configuration, we can do the following:
When the workflow completes or an administrator manually deletes the workflow-related resources in the cluster, the files uploaded to OSS are automatically reclaimed after a certain period of time.
Reclaim files only for successful workflow tasks, so that the logs of failed tasks are kept for tracing. Alternatively, reclaim files only for failed workflow tasks to clean up invalid intermediate output.
Use the lifecycle management policies provided by OSS to automatically delete old Artifacts based on parameters such as time and prefix, or archive older Artifacts to cold storage to reduce costs while ensuring data integrity.
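As a rough illustration of a rule based on time and prefix, the following Python sketch selects the objects that such a rule would match. It only models the selection logic. The `expired_keys` function, the object keys, and the 7-day threshold are hypothetical, and a real lifecycle rule is configured on the OSS bucket rather than evaluated by client code.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict, List


def expired_keys(objects: Dict[str, datetime], prefix: str,
                 max_age_days: int, now: datetime) -> List[str]:
    """Return keys under `prefix` last modified more than `max_age_days` ago."""
    cutoff = now - timedelta(days=max_age_days)
    return sorted(
        key for key, modified in objects.items()
        if key.startswith(prefix) and modified < cutoff
    )


if __name__ == "__main__":
    now = datetime(2024, 6, 30, tzinfo=timezone.utc)
    objects = {  # hypothetical object keys and last-modified times
        "artifacts/wf-a/out.txt": datetime(2024, 6, 1, tzinfo=timezone.utc),
        "artifacts/wf-b/out.txt": datetime(2024, 6, 29, tzinfo=timezone.utc),
        "logs/wf-a/main.log": datetime(2024, 5, 1, tzinfo=timezone.utc),
    }
    # Only the old object under the "artifacts/" prefix is selected.
    print(expired_keys(objects, "artifacts/", 7, now))
```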
Sample code:
To use this feature, configure an artifactGC strategy. In the following example, the workflow-level strategy is OnWorkflowDeletion (reclaim after the workflow is deleted), and the on-completion Artifact overrides it with OnWorkflowCompletion (reclaim when the workflow completes). After the workflow is submitted, we can observe in OSS that on-completion.txt is reclaimed when the workflow completes and on-deletion.txt is reclaimed after the workflow is deleted.
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: artifact-gc-
spec:
  entrypoint: main
  artifactGC:
    strategy: OnWorkflowDeletion # The global reclaim policy: reclaim the Artifact when the workflow is deleted. It can be overridden per Artifact.
  templates:
    - name: main
      container:
        image: argoproj/argosay:v2
        command:
          - sh
          - -c
        args:
          - |
            echo "hello world" > /tmp/on-completion.txt
            echo "hello world" > /tmp/on-deletion.txt
      outputs: # Upload the objects to OSS
        artifacts:
          - name: on-completion
            path: /tmp/on-completion.txt
            artifactGC:
              strategy: OnWorkflowCompletion # Override the global reclaim policy: reclaim the Artifact when the workflow is completed
          - name: on-deletion
            path: /tmp/on-deletion.txt
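The override behavior in this example can be modeled with a short Python sketch. It is only an illustration of the precedence rule, not controller code: the function names are hypothetical, and falling back to "Never" when no strategy is set at either level is an assumption made for the sketch.

```python
from typing import Dict, List, Optional


def effective_strategy(workflow_level: Optional[str],
                       artifact_level: Optional[str]) -> str:
    """An artifact-level strategy, when set, overrides the workflow-level one."""
    # Assumption: with no strategy at either level, nothing is reclaimed.
    return artifact_level or workflow_level or "Never"


def artifacts_to_reclaim(event: str, workflow_level: Optional[str],
                         artifacts: Dict[str, Optional[str]]) -> List[str]:
    """List the artifact names whose effective strategy matches `event`."""
    return [
        name for name, own in artifacts.items()
        if effective_strategy(workflow_level, own) == event
    ]


if __name__ == "__main__":
    # Mirrors the workflow above: one artifact overrides the global strategy.
    artifacts = {"on-completion": "OnWorkflowCompletion", "on-deletion": None}
    print(artifacts_to_reclaim("OnWorkflowCompletion", "OnWorkflowDeletion", artifacts))
    # ['on-completion']
    print(artifacts_to_reclaim("OnWorkflowDeletion", "OnWorkflowDeletion", artifacts))
    # ['on-deletion']
```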
When the open-source solution uses Argo Server to download files, it must persist the data to disk before transferring it. This results in high disk usage, which not only affects server performance but may also lead to service interruption or data loss.
ACK One Serverless Argo Workflows implements the OpenStream interface for OSS. When a user clicks to download a file in the Argo Workflows UI, Argo Server streams the file directly from the OSS server to the user instead of first downloading the file to the server and then serving it. This streaming mechanism is especially suitable for workflow tasks that transfer and store large amounts of data:
Improve download performance: Streaming transfers a file from the OSS server without waiting for the entire file to be downloaded to the Argo Server first. This means that the download starts with a smaller delay so that it can provide a faster response and a smoother experience.
Reduce resource usage to improve concurrency: Streaming reduces the memory and disk requirements of Argo Server, which enables it to handle more parallel file transfer requests with the same hardware resources and improves the system's concurrent processing capabilities. As the number of users or the size of files grows, direct streaming allows the service to scale with that growth without worrying about disk space limitations on Argo Server.
Improve security compliance: Streaming avoids the temporary storage of data in Argo Server space, reduces security risks and data leakage risks, and helps comply with data protection and compliance requirements.
Streaming maximizes the performance of UI file downloads while minimizing the pressure on Argo Server as a single point. With Artifact streaming, Argo Server becomes a lightweight data flow center rather than a heavily loaded center for storage and computing.
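The difference between streaming and persisting first can be sketched in a few lines of Python. This is a generic illustration under stated assumptions, not Argo Server code: `io.BytesIO` objects stand in for the OSS object and the client connection, and the 4-byte chunk size is deliberately tiny.

```python
import io
from typing import BinaryIO, Iterator

CHUNK_SIZE = 4  # bytes; deliberately tiny for illustration


def stream_chunks(source: BinaryIO, chunk_size: int = CHUNK_SIZE) -> Iterator[bytes]:
    """Yield the source in fixed-size chunks without buffering the whole object."""
    while True:
        chunk = source.read(chunk_size)
        if not chunk:
            return
        yield chunk


def relay(source: BinaryIO, sink: BinaryIO, chunk_size: int = CHUNK_SIZE) -> int:
    """Forward chunks from source to sink; return the largest chunk held at once."""
    peak = 0
    for chunk in stream_chunks(source, chunk_size):
        sink.write(chunk)
        peak = max(peak, len(chunk))
    return peak


if __name__ == "__main__":
    source = io.BytesIO(b"artifact-bytes")  # stand-in for the object in OSS
    sink = io.BytesIO()                     # stand-in for the client connection
    peak = relay(source, sink)
    print(sink.getvalue() == b"artifact-bytes")  # True
    print(peak)  # 4: the relay never holds more than one chunk
```

The relay holds at most one chunk at a time regardless of the object size, which is why a streaming server needs no local disk space proportional to the files it serves.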
As a fully managed Argo Workflows service, ACK One Serverless Argo Workflows has the following advantages over the open-source solution in Artifact file management:
| OSS file management capability | Open-source Argo Workflows | ACK One Serverless Argo Workflows |
| --- | --- | --- |
| File upload | Only files less than 5Gi are supported, and oversized files are not supported | Support multipart upload of oversized files |
| File reclaim | Not supported | Support Artifacts GC |
| File download | Data persistence is required | Support streaming |
In the future, Serverless Argo will feed these enhancements back into the community and grow with the community. It will continue to integrate the practices and experiences from the community, and further improve the stability and usability to provide users with a high-performance and scalable workflow platform.