gtx-fpga
Overview
GTX-FPGA, developed by GTX-Laboratory, is a tool that accelerates whole genome sequencing analysis by running on CPUs and field-programmable gate arrays (FPGAs) in a heterogeneous manner, using the strengths of each to deliver high-performance computing for genetic data. GTX-FPGA can shorten the time to analyze 30X whole genome sequencing data from 30 hours to 30 minutes, and 100X whole exome sequencing data from 6 hours to 5 minutes.
GTX-FPGA analysis focuses on index building (index), genome alignment (align), variant calling (vc), and whole genome sequencing (wgs) that integrates genome alignment and variant calling. The GTX one process mentioned in the following section is also the whole genome sequencing process.
This topic describes how to use GTX-FPGA in Alibaba Cloud Batch Compute to run analysis jobs of whole genome sequencing data and whole exome sequencing data with a few clicks.
Constraints
GTX-FPGA supports only instances of the f3 instance family in Alibaba Cloud Elastic Compute Service (ECS). Each instance must be equipped with an SSD. The SSD capacity is determined by the FASTA file size. The SSD capacity required by genome alignment (align) is the sum of two FASTQ file sizes multiplied by 2. For example, if the size of File FASTQ1 is 40 GB, and the size of File FASTQ2 is 42 GB, the required SSD capacity is 164 GB. The SSD capacity required by whole genome sequencing (wgs) is the sum of the original data size and the calculation result. For example, if the original data size is 100 GB for 30X whole genome sequencing data and the calculation result is 150 GB, then the required SSD capacity for whole genome sequencing (wgs) is 250 GB. If you want to calculate the data disk size for human genomes, you can use the default values shown in the following demo example.
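The two sizing rules above can be written as a short calculation. This is a minimal sketch that only restates the arithmetic from this section; the function names are illustrative and are not part of GTX-FPGA or Batch Compute.

```python
def align_ssd_gb(fastq1_gb, fastq2_gb):
    """SSD capacity for align: the sum of both FASTQ file sizes, multiplied by 2."""
    return (fastq1_gb + fastq2_gb) * 2


def wgs_ssd_gb(original_gb, result_gb):
    """SSD capacity for wgs: original data size plus the size of the calculation result."""
    return original_gb + result_gb


# The two examples from this section.
print(align_ssd_gb(40, 42))   # 164
print(wgs_ssd_gb(100, 150))   # 250
```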
GTX-FPGA is available for testing only in the China (Beijing) region.
GTX-FPGA is in public preview. During the public preview, GTX-FPGA is free of charge. You are charged only for the instances that are required for jobs and resource storage.
Prerequisites
You are logged on to the Alibaba Cloud Management Console, and the account balance is sufficient.
The Batch Compute service is activated to analyze data.
The Object Storage Service (OSS) service is activated to upload your sequencing data and save the analysis results. A bucket is created. For example, you created a bucket named gtx-wgs-demo.
The AccessKey pair of your Alibaba Cloud account is created and can be viewed. If you use a RAM user, make sure that the RAM user has the permissions on Batch Compute and OSS. For more information, see Quick start. The AccessKey ID and AccessKey secret can be copied for subsequent use. In this example, the AccessKey ID LTAI8xxxxx and the AccessKey secret vVGZVE8qUNjxxxxxxxx are used.
Procedure
GTX-FPGA supports the running of jobs in the workflow description language (WDL) mode and directed acyclic graph (DAG) mode. The following table describes the required parameters.
1 GTX command format
Command | Parameter | Description
index | -f | Forcibly overwrite an existing index file
index | -h | Print the help documentation
index | -m | Path for intermediate temporary files. Default: /ssd-cache
index | --disable-gtx-index | Disable the index for gtx
index | --disable-bwa-index | Disable the index for bwa
index | --enable-bwa2-index | Enable the index for mem2
align | -o | Output BAM file
align | -R | Read group header. Default: "'@RG\\tID:foo\\tSM:bar'\n"
align | -A | Match score. Default: 1
align | -B | Mismatch penalty. Default: 4
align | -E | Gap extension penalty. Default: 1
align | -t | Number of threads. Default: 32 (best performance in all-in-one)
align | --bwa | Produce alignment results whose accuracy is comparable to that of BWA-MEM
align | --disable-mark-duplicate | Disable duplicate marking
wgs | -o | Output VCF file
wgs | -b | Output BAM file
wgs | -R | Read group header. Default: "'@RG\\tID:foo\\tSM:bar'\n"
wgs | -A | Match score. Default: 1
wgs | -B | Mismatch penalty. Default: 4
wgs | -E | Gap extension penalty. Default: 1
wgs | -t | Number of threads. Default: 32 (best performance in all-in-one)
wgs | -L | Restrict the calculation to one chromosome region (for example, chr1:1-200) or to multiple chromosomes (BED file)
wgs | -g | Output a file in gVCF format
wgs | --bwa | Produce alignment results whose accuracy is comparable to that of BWA-MEM
wgs | --disable-mark-duplicate | Disable duplicate marking
wgs | --metrics | Output the metrics of the deduplication process
vc | -o | Output VCF file
vc | -r | FASTA file
vc | -i | Input BAM file, sorted and deduplicated
vc | -t | Number of threads. Default: 32 (best performance in all-in-one)
vc | -L | Restrict the calculation to one chromosome region (for example, chr1:1-200) or to multiple chromosomes (BED file)
vc | -g | Output a file in gVCF format
vc | --gtz-rbin1 | The rbin file used for decompression when the input file FASTQ1 is a gtz file
vc | --gtz-rbin2 | The rbin file used for decompression when the input file FASTQ2 is a gtz file. For more information about rbin files, see the official gtz documentation at https://github.com/Genetalks/gtz
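To show how the flags in the table combine, the following sketch assembles an argument list for one command. It is an illustration under assumptions: the executable name gtx, the helper build_gtx_args, and the file names out.vcf and out.bam are hypothetical, and positional inputs such as the reference and FASTQ files are left out because the table does not describe them.

```python
import shlex


def build_gtx_args(command, options, executable="gtx"):
    """Build an argument list from a command name and a flag-to-value mapping.

    Use True for flags that take no value (for example -g or --bwa).
    Flags set to False or None are skipped.
    The executable name "gtx" is an assumption made for this sketch.
    """
    args = [executable, command]
    for flag, value in options.items():
        if value is True:
            args.append(flag)
        elif value is not None and value is not False:
            args.extend([flag, str(value)])
    return args


# Hypothetical wgs invocation: VCF and BAM outputs, 32 threads, gVCF output.
wgs_args = build_gtx_args(
    "wgs",
    {"-o": "out.vcf", "-b": "out.bam", "-t": 32, "-g": True},
)
print(shlex.join(wgs_args))
```

Keeping the flags in a mapping makes it easy to start from the defaults in the table and override only the values that differ for a given job.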
2 WDL mode
For more information about the WDL mode, see related documents.
3 DAG mode
3.1 Sample scripts
Download the sample code of a DAG job.
When you use the sample code, take note of the following items:
genGtxIndexCmd is the command to build an index, genGtxWgsCmd is the command of GTX one, genGtxAlignCmd is the command to align genomes, and genGtxVcCmd is the command to call variants. For more information about how to run each command, see the help information in the code.
You can configure custom values for each GTX parameter in the preceding commands or use the default values.
Building an index is optional in this topic because the demo uses an index that is already built by default. If you want to build an index yourself, you must specify the isNeedIndex parameter when you run the script.
You can pass the value of the read_group_header parameter on the command line, or you can use the default value.
By default, the sample code runs the GTX one (alignment and variant calling) process. To run the steps separately, you must configure the related parameters.
You can run the pip install --upgrade batchcompute command to update the Batch Compute SDK for Python to the latest version.
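When the steps are run separately instead of as GTX one, they form a small dependency graph. The sketch below only illustrates that ordering with the Python standard library; the step names are labels rather than Batch Compute task definitions, and the assumption that each step depends on the previous one follows from vc taking a sorted and deduplicated BAM file as input.

```python
from graphlib import TopologicalSorter  # Python 3.9 or later

# Each key lists the steps that must finish before it can start.
# This ordering is an assumption made for illustration.
step_dependencies = {
    "index": [],
    "align": ["index"],
    "vc": ["align"],
}

order = list(TopologicalSorter(step_dependencies).static_order())
print(order)  # ['index', 'align', 'vc']
```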
3.2 Commands
python test.py --reference oss://xxx/ref/hg19.fa --fastq1 oss://xxx/input/human30x_10m_1.fastq --fastq2 oss://xxxx/_input/human30x_10m_2.fastq --output oss://xxx/testoutput/
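The command above passes four OSS paths. The following sketch shows one way a launcher script might parse and check such arguments. It is not the code of test.py; the parse_args helper and the rule that every path must start with oss:// are assumptions made for illustration.

```python
import argparse


def parse_args(argv):
    """Parse the four path arguments and reject values that are not OSS paths."""
    parser = argparse.ArgumentParser(description="Hypothetical launcher arguments")
    for name in ("reference", "fastq1", "fastq2", "output"):
        parser.add_argument(f"--{name}", required=True)
    args = parser.parse_args(argv)
    for name, value in vars(args).items():
        # Assumption: all inputs and outputs are stored in OSS.
        if not value.startswith("oss://"):
            parser.error(f"--{name} must be an oss:// path, got {value!r}")
    return args


# Same placeholder paths as in the command above.
args = parse_args([
    "--reference", "oss://xxx/ref/hg19.fa",
    "--fastq1", "oss://xxx/input/human30x_10m_1.fastq",
    "--fastq2", "oss://xxxx/_input/human30x_10m_2.fastq",
    "--output", "oss://xxx/testoutput/",
])
print(args.output)
```

Checking the paths before a job is submitted gives an immediate error message instead of a failure after the instance has started.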
3.3 Results