Use Argo Workflows to Orchestrate Genetic Computing Workflows

By Shuangkun Tian

In the complex, data-intensive field of genetic computing, researchers and bioinformatics analysts face two major challenges: the explosive growth of data, and the need to integrate and analyze that data efficiently and accurately. Automated workflow orchestration has become a key technology for addressing these challenges. Argo Workflows, a container-native, flexible, and easy-to-use workflow engine, is well suited to connecting the steps of a genetic computing pipeline. This article describes how to use Argo Workflows to orchestrate genetic computing workflows.

Genetic Computing Workflows

In genomics research, a genetic computing workflow is a series of interdependent computing tasks and data processing steps, organized around a specific analysis goal, that can be executed in a defined order. These workflows typically include multiple complex steps such as data pre-processing, sequence alignment, mutation detection, gene expression analysis, and phylogenetic tree building.

Benefits of Using Argo Workflows for Genetic Computing Workflows

Argo Workflows is an open-source, Kubernetes-native workflow engine designed to orchestrate complex workflows flexibly and efficiently in containerized environments. In the genetic computing scenario, its advantages are particularly significant:

• Containerization and Environment Consistency: Gene analysis involves many software tools and dependency libraries. Encapsulating each analysis step in a Docker container ensures cross-platform consistency and reproducibility, and removes the need to run a step on a specific machine.

• Flexible Orchestration Capabilities: Workflows in genomics research often require multiple steps, conditional branches, and parallel processing. Argo Workflows supports complex logic and conditional control, which makes it simple to customize workflows on demand.
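
As an illustration of the conditional control mentioned above, the following is a minimal sketch of a DAG in which a later task runs only if an earlier quality check reports success. The workflow name, template names, messages, and the alpine image are assumptions made for this example and are not part of the BWA workflow shown later in this article.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: qc-branch-        # hypothetical name
spec:
  entrypoint: main
  templates:
  - name: main
    dag:
      tasks:
      - name: qc                  # placeholder quality check
        template: qc-check
      - name: align               # runs only when the check prints "pass"
        template: echo
        dependencies: [qc]
        when: "{{tasks.qc.outputs.result}} == pass"
        arguments:
          parameters:
          - name: msg
            value: "QC passed, start alignment"
      - name: report              # runs only when the check prints anything else
        template: echo
        dependencies: [qc]
        when: "{{tasks.qc.outputs.result}} != pass"
        arguments:
          parameters:
          - name: msg
            value: "QC failed, skip alignment"
  - name: qc-check
    script:                       # stdout of a script template becomes outputs.result
      image: alpine:3.19          # assumed image, any image with sh works
      command: [sh]
      source: |
        echo pass
  - name: echo
    inputs:
      parameters:
      - name: msg
    container:
      image: alpine:3.19
      command: [echo, "{{inputs.parameters.msg}}"]
```

In a real pipeline, the qc-check template would run an actual quality-control tool and print a verdict. The task whose condition is not met is marked as skipped rather than failed.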

Although open-source Argo Workflows is well suited to orchestrating genetic computing workflows, several issues still need to be addressed in practice:

• Large-Scale O&M: Workloads are large, and users with a research background may lack in-depth cluster O&M experience, which makes it difficult to implement efficient cluster optimization and maintenance strategies.

• Complex Workflow Orchestration: Scientific experiments involve many parameters and process steps, so a single workflow often contains thousands of jobs. Open-source workflow engines in their default configuration can struggle to orchestrate workflows at this scale.

• Resource Optimization and Automatic Scaling: Genetic data analysis often consumes a large amount of computing resources. Users expect resources to be scheduled intelligently based on workload so that they are used efficiently, and they expect computing capacity to scale automatically on demand. Open-source solutions may struggle to meet these requirements.

To meet the challenges of large-scale O&M, complex workflow orchestration, resource optimization, and automatic scaling in genetic computing scenarios, the Alibaba Cloud ACK One team launched the Kubernetes cluster for distributed Argo workflows.

Kubernetes Cluster for Distributed Argo Workflows

Kubernetes clusters for distributed Argo workflows (workflow clusters) are built on open-source Argo Workflows and deployed on a serverless architecture. A workflow cluster runs Argo workflows on elastic container instances (ECIs) and optimizes cluster parameters so that large-scale workflows are scheduled efficiently, elastically, and cost-effectively. It supports the execution strategies typical of genetic computing processes, such as concurrency, looping, and retry, and can orchestrate highly complex tasks.
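
The three execution strategies named above map to standard fields of the open-source Argo Workflows specification. The following is a minimal sketch, assuming hypothetical sample names, a placeholder alpine image, and illustrative limit values. It does not show any setting that is specific to workflow clusters.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: strategy-demo-    # hypothetical name
spec:
  entrypoint: main
  parallelism: 2                  # concurrency: at most two pods of this workflow run at once
  templates:
  - name: main
    steps:
    - - name: process
        template: process-sample
        arguments:
          parameters:
          - name: sample
            value: "{{item}}"
        withItems: ["s1", "s2", "s3", "s4"]   # looping: one task per list item
  - name: process-sample
    inputs:
      parameters:
      - name: sample
    retryStrategy:                # retry: re-run a failed pod with exponential backoff
      limit: 3
      retryPolicy: OnFailure
      backoff:
        duration: "10s"
        factor: 2
        maxDuration: "5m"
    container:
      image: alpine:3.19          # assumed image, stands in for a real analysis tool
      command: [sh, -c]
      args: ["echo processing {{inputs.parameters.sample}}"]
```

With these values, the four items are processed two at a time, and each failed item is retried up to three times with waits that start at 10 seconds and double after each attempt.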

Use Argo Workflows to Orchestrate Genetic Computing Workflows

A classic BWA sequence alignment workflow is used to demonstrate how to define and run a genetic computing workflow with Argo Workflows:

1. Create a Kubernetes cluster for distributed Argo workflows.

2. Mount Alibaba Cloud OSS volumes so that workflows can use OSS files in the same way as local files.

For more information, see Use volumes.
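
The workflow in the next step refers to the mounted storage through a persistent volume claim named pvc-oss. As a rough sketch of what that claim could look like, the following manifest binds to a pre-created persistent volume. The volume name pv-oss and the storage size are assumptions made for this example, and the OSS-specific configuration of the persistent volume itself is covered by the documentation referenced above and is not reproduced here.

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: pvc-oss            # must match claimName in the workflow's volumes section
spec:
  accessModes:
  - ReadWriteMany          # several workflow pods mount the volume at the same time
  storageClassName: ""     # empty string: bind to an existing PV, no dynamic provisioning
  volumeName: pv-oss       # hypothetical PV backed by an OSS bucket
  resources:
    requests:
      storage: 5Gi         # assumed value for illustration
```

The claim must be created in the same namespace as the workflow that uses it.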

3. Use the following YAML to create a workflow. For more information, see Create a workflow.

The workflow consists of three stages:

1) bwaprepare: Data preparation. Download and decompress the FASTQ and reference files, then index the reference genome.

2) bwamap: Align the sequencing reads to the reference genome to generate alignment results. Multiple files are processed in parallel.

3) bwaindex: Combine the paired-end alignment results into a SAM file, convert it to BAM, sort and index the BAM file, and then view the alignment results.

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: bwa-oss-
spec:
  entrypoint: bwa-oss
  arguments:
    parameters:
    - name: fastqFolder # The path where the downloaded file is saved
      value: /gene
    - name: reference # Reference genome file
      value: https://ags-public.oss-cn-beijing.aliyuncs.com/alignment/subset_assembly.fa.gz
    - name: fastq1 # Raw sequencing data
      value: https://ags-public.oss-cn-beijing.aliyuncs.com/alignment/SRR1976948_1.fastq.gz
    - name: fastq2
      value: https://ags-public.oss-cn-beijing.aliyuncs.com/alignment/SRR1976948_2.fastq.gz
  volumes: # Mount the remote storage path
  - name: ossdir
    persistentVolumeClaim:
      claimName: pvc-oss
  templates:
  - name: bwaprepare # In the data preparation phase, download and decompress the FASTQ and reference files to index the reference genome.
    container:
      image: registry.cn-beijing.aliyuncs.com/geno/alltools:v0.2
      imagePullPolicy: Always
      command: [sh,-c]
      args:
      - mkdir -p /bwa{{workflow.parameters.fastqFolder}}; cd /bwa{{workflow.parameters.fastqFolder}}; rm -rf SRR1976948*;
        wget {{workflow.parameters.reference}};
        wget {{workflow.parameters.fastq1}};
        wget {{workflow.parameters.fastq2}};
        gzip -d subset_assembly.fa.gz;
        gunzip -c SRR1976948_1.fastq.gz | head -n 800000 > SRR1976948.1;
        gunzip -c SRR1976948_2.fastq.gz | head -n 800000 > SRR1976948.2;
        bwa index subset_assembly.fa;
      volumeMounts:
      - name: ossdir
        mountPath: /bwa
    retryStrategy: # Retry mechanism
      limit: 3
  - name: bwamap # Align the sequencing reads to the reference genome to generate alignment results.
    inputs:
      parameters:
      - name: object
    container:
      image: registry.cn-beijing.aliyuncs.com/geno/alltools:v0.2
      imagePullPolicy: Always
      command:
      - sh
      - -c
      args:
      - cd /bwa{{workflow.parameters.fastqFolder}};
        bwa aln subset_assembly.fa {{inputs.parameters.object}} > {{inputs.parameters.object}}.untrimmed.sai;
      volumeMounts:
      - name: ossdir
        mountPath: /bwa
    retryStrategy:
      limit: 3
  - name: bwaindex # Combine the paired-end alignments into a SAM file, convert it to BAM, sort and index the BAM file, and then view the alignment results.
    container:
      args:
      - cd /bwa{{workflow.parameters.fastqFolder}};
        bwa sampe subset_assembly.fa SRR1976948.1.untrimmed.sai SRR1976948.2.untrimmed.sai SRR1976948.1 SRR1976948.2 > SRR1976948.untrimmed.sam;
        samtools import subset_assembly.fa SRR1976948.untrimmed.sam SRR1976948.untrimmed.sam.bam;
        samtools sort SRR1976948.untrimmed.sam.bam -o SRR1976948.untrimmed.sam.bam.sorted.bam;
        samtools index SRR1976948.untrimmed.sam.bam.sorted.bam;
        samtools tview SRR1976948.untrimmed.sam.bam.sorted.bam subset_assembly.fa -p k99_13588:1000 -d T;
      command:
      - sh
      - -c
      image: registry.cn-beijing.aliyuncs.com/geno/alltools:v0.2
      imagePullPolicy: Always
      volumeMounts:
      - mountPath: /bwa/
        name: ossdir
    retryStrategy:
      limit: 3
  - name: bwa-oss # Workflow orchestration for each stage
    dag:
      tasks:
      - name: bwaprepare # Perform data preparation first
        template: bwaprepare
      - name: bwamap # Align each FASTQ file to the reference genome
        template: bwamap
        dependencies: [bwaprepare] # Run after data preparation completes
        arguments:
          parameters:
          - name: object
            value: "{{item}}"
        withItems: ["SRR1976948.1","SRR1976948.2"] # Process files in parallel
      - name: bwaindex # Build, sort, index, and view the final alignment
        template: bwaindex
        dependencies: [bwamap] # Run after all bwamap tasks complete

4. View the workflow status:

When the workflow status shows that the run succeeded, the alignment result files can be found in the corresponding OSS folder.

Summary

Argo Workflows offers significant advantages in genetic computing and other data-intensive scientific research areas thanks to its containerization, flexible orchestration, and ease of use. It improves automation, resource utilization, and analysis efficiency in genetic computing.
