Use AGS to process WGS tasks - - Alibaba Cloud Documentation Center

Alibaba Cloud Genomics Compute Service (AGS) enables rapid processing of whole genome sequencing (WGS) tasks, including gene comparison, sequencing, deduplication, and variant detection. This topic describes how to manage WGS workflows by using AGS CLI.

Prerequisites

You have applied for the public preview of AGS and are added to the whitelist.

Note

If your account is a Resource Access Management (RAM) user, you must provide the UID of the RAM user when you apply for the public preview. You can log on to the Account Management console to obtain the UID of the RAM user.

Preparations

Grant permissions and prepare data.

Set up AGS.
For more information about how to download and install AGS CLI, see Introduction to AGS CLI.
```
ags config init
```
Prepare an Object Storage Bucket (OSS) bucket and grant AGS the read and write permissions on the bucket and the permissions to call GetBucketInfo.
Note
- Make sure that the OSS bucket that you want to use is owned by the current account. Otherwise, we recommend that you create a new OSS bucket within the account and grant the permissions to call GetBucketInfo.
- If your account is a RAM user, you must attach the AliyunOSSFullAccess policy to the RAM user. For more information, see Grant permissions to a RAM user.
- If your account is a RAM user, we recommend that you create a new OSS bucket within the account to ensure that the OSS bucket is owned by the RAM user. You can run the following command to check the owner of the OSS bucket:
```
ossutil stat oss://<your new bucket name>
```
```
Usage:
ags config oss <your bucket name>

e.g.
ags config oss my-test-shenzhen
```
Upload FASTQ data to the OSS bucket by using ossutil.
For more information about how to download and install ossutil, see Install ossutil.
Note
AGS supports only WGS and whole exome sequencing (WES) of human genome data. The comparisons of methylated genome data, plant genome data, and animal genome data are not supported.
- Upload data in simple mode.
```
Usage:

ossutil cp -r <local dir of fastq> <path of  oss bucket >

e. g.
ossutil cp -r ./MGISEQ oss://my-test-shenzhen/MGISEQ
```
- Upload a large amount of data.
  Refer to the following example to organize samples in pairs and store them in different directories.
```
./samples
├── sample1
│   ├── fastq_L1.tgz
│   └── fastq_L2.tgz
└── sample2    
       ├── fastq_L1.tgz    
       └── fastq_L2.tgz
```
  Run the following command to upload data in batches. ossutil commands support resumable upload.
```
Usage:

ossutil--recursivecp-r-u--parallel<numberofconcurrenttasks><localdiroffastqornfspath><pathofossbucket>

e. g.
ossutil --recursive cp -r -u  --parallel 16  ./samples  oss://mybucket/samples
```

Start a WGS workflow

We recommend that you use version hs37d5 of human reference genome hg19. This is also the default genome.

Note

Version hs37d5 of human reference genome hg19 has the following characteristics:

Excludes alternative (ALT) contigs.
Hard masks pseudoautosomal regions (PARs) on the Y chromosome (chrY).
Includes decoy contigs.

AGS is ALT-aware, which enables AGS to identify and process ALT contigs. Genome UCSC hg19 includes ALT contigs but does not have the other two characteristics, which decreases the accuracy of variant detection. For more information, click Which human reference genome to use?.

Usage:

ags remote run wgs \
--region cn-shenzhen # region of oss, e.g. cn-shenzhen, cn-beijing and etc\
--fastq1 MGISEQ/MGISEQ2000_PCR-free_NA12878_1_V100003043_L01_1.fq.gz # filename of fastq pair 2, fastq-path\filename \
--fastq2 MGISEQ/MGISEQ2000_PCR-free_NA12878_1_V100003043_L01_2.fq.gz  # filename of fastq pair 1\
--bucket my-test-shenzhen # Bucket name\
--output-bam bam/MGISEQ_NA12878_hs37d5.bam, # Output BAM to bucket,  By default empty, non output of BAM \
--output-vcf vcf/MGISEQ_NA12878_hs37d5_5.vcf # Output filename \
--service "g" #SLA: [n:normal|s:silver|g:gold|p:platinum]\
--reference [hg19|hg38|<reference path on OSS>] # hg19: it is hs37d5 version, GRCh37/hg19 include decoy contig, no support for UCSC hg19. hg38: GRCh38/hg38 include decoy
--reference-group "\"@RG\\tID:TEST\\tSM:12878\\tPL:MGISEQ2000\"" # allow to specify reference groups for PL/SM/ID and etc
e.g.
ags remote run wgs \
--region cn-shenzhen \
--fastq1 MGISEQ/MGISEQ2000_PCR-free_NA12878_1_V100003043_L01_1.fq.gz  \
--fastq2 MGISEQ/MGISEQ2000_PCR-free_NA12878_1_V100003043_L01_2.fq.gz  \
--bucket my-test-shenzhen \
--output-vcf vcf/MGISEQ_NA12878_hs37d5_5.vcf \
--output-bam bam/MGISEQ_NA12878_hs37d5_5.bam \
--service "s" \
--reference hg19

### Process FASTQ files that include multiple lanes and samples in batches
MGISAMPLE001 is a set of WGS sequencing samples of multiple lanes. You can combine and compute the sequencing results of multiple lanes by specifying the sample directory --fastq1 MGISAMPLE001 or --fastq2 MGISAMPLE001. 
oss://my-test-shenzhen/MGISAMPLE001/L1/MGISEQ2000_PCR-free_NA12878_1_V100003043_L01_1.fq.gz
oss://my-test-shenzhen/MGISAMPLE001/L2/MGISEQ2000_PCR-free_NA12878_1_V100003043_L02_1.fq.gz
oss://my-test-shenzhen/MGISAMPLE001/L1/MGISEQ2000_PCR-free_NA12878_1_V100003043_L01_2.fq.gz
oss://my-test-shenzhen/MGISAMPLE001/L2/MGISEQ2000_PCR-free_NA12878_1_V100003043_L02_2.fq.gz

ags remote run wgs \
--region cn-shenzhen \
--fastq1 MGISAMPLE001 \
--fastq2 MGISAMPLE001 \
--bucket my-test-shenzhen \
--output-vcf vcf/MGISEQ_NA12878_hs37d5_6.vcf \
--output-bam bam/MGISEQ_NA12878_hs37d5_6.bam \
--service "g" \
--reference hg19

ags remote run wgs \
--region cn-shenzhen \
--fastq1 MGISAMPLE002 \
--fastq2 MGISAMPLE002 \
--bucket my-test-shenzhen \
--output-vcf vcf/MGISEQ_NA12878_hs37d5_7.vcf \
--output-bam bam/MGISEQ_NA12878_hs37d5_7.bam \
--service "g" \
--reference hg19

The following shows how to use AGS CLI to run a WGS workflow.

Start a Mapping workflow

Use --fastq1 and --fastq2 to specify FASTQ files and use --output to specify the output path of bam.

Usage:

ags remote run mapping \
--region cn-shenzhen # region of oss, e.g. cn-shenzhen, cn-beijing and etc\
--fastq1 MGISEQ/MGISEQ2000_PCR-free_NA12878_1_V100003043_L01_1.fq.gz # filename of fastq pair 2, fastq-path\filename \
--fastq2 MGISEQ/MGISEQ2000_PCR-free_NA12878_1_V100003043_L01_2.fq.gz  # filename of fastq pair 1\
--bucket my-test-shenzhen # Bucket name\
--output-bam bam/MGISEQ_NA12878_hs37d5.bam # Output filename of BAM \
--service "g" #SLA: [n:normal|s:silver|g:gold|p:platinum]\
--markdup [true|false|default true] #Mark Duplicated, by default true
--reference [hg19|hg38|<reference path on OSS>]

e.g.

ags remote run mapping \
--region cn-shenzhen \
--fastq1 MGISEQ/MGISEQ2000_PCR-free_NA12878_1_V100003043_L01_1.fq.gz  \
--fastq2 MGISEQ/MGISEQ2000_PCR-free_NA12878_1_V100003043_L01_2.fq.gz  \
--bucket my-test-shenzhen \
--output-bam bam/MGISEQ_NA12878_hs37d5.bam # Output filename of BAM \
--service "g" \
--markdup "true" \
--reference hg19

The following shows how to use AGS CLI to run a Mapping workflow.

List remote workflows

Usage:
ags remote list

e.g.
ags remtoe list
+---------------+-------------------------------+
|   JOB NAME    |          CREATE TIME          |
+---------------+-------------------------------+
| wgs-gpu-ckw96 | 2020-01-07 19:08:32 +0000 UTC |
| wgs-gpu-djzws | 2020-01-07 18:31:22 +0000 UTC |
| wgs-gpu-pd659 | 2020-01-03 20:34:09 +0000 UTC |
+---------------+-------------------------------+

Obtain workflow details

Usage:
ags remote get <workflow id> --show
--show show detail of input parameters of workflow

e.g.
ags remote get wgs-gpu-sjtlw
+---------------+------------------+-----------+-------------------------------+----------+-------------------------------+
|   JOB NAME    |  JOB NAMESPACE   |  STATUS   |          CREATE TIME          | DURATION |          FINISH TIME          |
+---------------+------------------+-----------+-------------------------------+----------+-------------------------------+
| wgs-gpu-sjtlw | XXXXXXXXXXXXXXXX | Succeeded | 2020-01-07 21:38:05 +0800 CST | 12m25s   | 2020-01-07 21:50:30 +0800 CST |
+---------------+------------------+-----------+-------------------------------+----------+-------------------------------+

ags remote get wgs-gpu-97xfn --show

+---------------+------------------+-----------+-------------------------------+----------+-------------------------------+
|   JOB NAME    |  JOB NAMESPACE   |  STATUS   |          CREATE TIME          | DURATION |          FINISH TIME          |
+---------------+------------------+-----------+-------------------------------+----------+-------------------------------+
| wgs-gpu-sjtlw | XXXXXXXXXXXXXXXX | Succeeded | 2020-01-07 21:38:05 +0800 CST | 12m25s   | 2020-01-07 21:50:30 +0800 CST |
+---------------+------------------+-----------+-------------------------------+----------+-------------------------------+


+-----------------------+---------------------------------+
|      JOB DETAIL       |                                 |
+-----------------------+---------------------------------+
| wgs_reference_file    | hg19                          |
| wgs_service           | g                               |
| wgs_oss_region        | cn-shenzhen                     |
| wgs_fastq_first_name  | MGISAMPLE001                    |
| wgs_fastq_second_name | MGISAMPLE001                    |
| wgs_bucket_name       | my-test-shenzhen                |
| wgs_vcf_file_name     | vcf/MGISEQ_NA12878_hs37d5_6.vcf |
| wgs_bam_file_name     | bam/MGISEQ_NA12878_hs37d5_6.bam |
+-----------------------+---------------------------------+

Cancel a running workflow

Usage:

ags remote cancel  <workflow id>

e.g.

ags remote cancel wgs-gpu-zls6r
INFO[0000] Successed to cancel wgs-gpu-zls6r

Remove a finished workflow

You can remove successful and failed workflows. However, you cannot remove running workflows.

Usage: 

ags remote remove <workflow id>

e.g.

ags remote remove wgs-gpu-zls6r
INFO[0000] Successed to remove wgs-gpu-zls6r