Parabricks

Parabricks is a GPU-accelerated software suite for secondary analysis of next-generation sequencing (NGS) DNA data. It is designed to deliver results quickly and at low cost: Parabricks can analyze 30x whole-genome sequencing (WGS) data for a human genome in about 45 minutes, compared to about 30 hours for the same analysis without GPU acceleration. Its output exactly matches that of the commonly used software, so the accuracy of the results is simple to verify.

Example Job

An example input for a Parabricks run is available at /share/apps/singularity/images/parabricks/parabricks_sample.tar.gz (also available for download: wget https://s3.amazonaws.com/parabricks.sample/parabricks_sample.tar.gz). To uncompress the archive in your home directory, run:
<code>
pinnacle-l1:pwolinsk:~$ tar -xzvf /share/apps/singularity/images/parabricks/parabricks_sample.tar.gz
parabricks_sample/
parabricks_sample/Data/
parabricks_sample/Data/sample_2.fq.gz
parabricks_sample/Data/sample_1.fq.gz
parabricks_sample/Ref/
parabricks_sample/Ref/Homo_sapiens_assembly38.fasta
parabricks_sample/Ref/Homo_sapiens_assembly38.fasta.pac
parabricks_sample/Ref/Homo_sapiens_assembly38.fasta.ann
parabricks_sample/Ref/Homo_sapiens_assembly38.known_indels.vcf.gz.tbi
parabricks_sample/Ref/Homo_sapiens_assembly38.fasta.amb
parabricks_sample/Ref/Homo_sapiens_assembly38.dict
parabricks_sample/Ref/Homo_sapiens_assembly38.fasta.fai
parabricks_sample/Ref/Homo_sapiens_assembly38.known_indels.vcf.gz
parabricks_sample/Ref/Homo_sapiens_assembly38.fasta.bwt
parabricks_sample/Ref/Homo_sapiens_assembly38.fasta.sa
c1612:pwolinsk:~$
</code>
Create a Slurm script to submit the job to the GPU queue:
<code>
pinnacle-l9:pwolinsk:~$ cat parabricks.slurm
#!/bin/bash
#SBATCH -p gpu06
#SBATCH -N1
#SBATCH -n32
#SBATCH -t 1:00:00

module load singularity parabricks

pbrun fq2bam --ref parabricks_sample/Ref/Homo_sapiens_assembly38.fasta --in-fq parabricks_sample/Data/sample_1.fq.gz parabricks_sample/Data/sample_2.fq.gz --out-bam output.bam
</code>
The script above requests 1 node with all 32 cores in the gpu06 partition for 1 hour.
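Before submitting, it can help to confirm that the reference FASTA and the BWA index files listed in the archive sit where the script expects them. The following is a minimal sketch, not part of Parabricks or Slurm: the function name missing_ref_files is hypothetical, and the list of extensions is simply taken from the files shipped next to the sample reference.

```shell
#!/bin/bash
# Hypothetical helper (not a Parabricks or Slurm command): print every
# reference file that is missing, one path per line. Prints nothing when
# the FASTA and all of its BWA index files are present.
# Usage: missing_ref_files /path/to/reference.fasta
missing_ref_files() {
    local ref=$1 ext
    # The reference FASTA itself.
    [ -f "$ref" ] || echo "$ref"
    # BWA index files, assumed to sit next to the FASTA as in the sample archive.
    for ext in amb ann bwt pac sa; do
        [ -f "$ref.$ext" ] || echo "$ref.$ext"
    done
    return 0
}
```

For the sample data, the call would be missing_ref_files parabricks_sample/Ref/Homo_sapiens_assembly38.fasta; empty output means every file was found.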
Submit the job:
<code>
pinnacle-l9:pwolinsk:~$ sbatch parabricks.slurm
Submitted batch job 56682
pinnacle-l9:pwolinsk:~$ squeue -u pwolinsk
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
             56682     gpu06 parabric pwolinsk  R       0:05      1 c1612
pinnacle-l9:pwolinsk:~$
</code>
While the job is running, you can verify that it is using the GPU by running the nvidia-smi command remotely over ssh on the compute node running your job. In this case it is c1612, as listed in the 'NODELIST' column above:
<code>
pinnacle-l9:pwolinsk:~$ ssh c1612 "nvidia-smi"
Mon Mar 23 11:28:08 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.64.00    Driver Version: 440.64.00    CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-PCIE...  Off  | 00000000:3B:00.0 Off |                    0 |
| N/A   35C    P0    39W / 250W |  11941MiB / 32510MiB |     34%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      7209      C   PARABRICKS                                 11921MiB |
+-----------------------------------------------------------------------------+
</code>
And follow progress using
tail -f on the job standard output file:
<code>
pinnacle-l9:pwolinsk:~$ tail -f slurm-56682.out
|| Parabricks accelerated Genomics Pipeline ||
|| Version v2.5.0 ||
|| GPU-BWA mem, Sorting, Marking Duplicates, BQSR ||
|| Contact: Parabricks-Support@nvidia.com ||
------------------------------------------------------------------------------
[M::bwa_idx_load_from_disk] read 0 ALT contigs

GPU-BWA mem
ProgressMeter   Reads       Base Pairs Aligned
[16:23:11]      5043564     580000000
[16:23:38]      10087128    1160000000
[16:24:04]      15130692    1740000000
[16:24:32]      20174256    2320000000
[16:24:59]      25217820    2900000000
[16:25:25]      30261384    3480000000
[16:25:53]      35304948    4060000000
[16:26:20]      40348512    4640000000
[16:26:47]      45392076    5220000000
[16:27:14]      50435640    5800000000

GPU-BWA Mem time: 297.342929 seconds
GPU-BWA Mem is finished.

GPU Sorting, Marking Dups, BQSR
ProgressMeter   SAM Entries Completed
[16:27:46]      5000000
[16:27:52]      10000000
[16:27:58]      15000000
[16:28:05]      20000000
[16:28:12]      25000000
[16:28:19]      30000000
[16:28:26]      35000000
[16:28:32]      40000000
[16:28:38]      45000000
[16:28:43]      50000000

Total GPU-BWA Mem + Sorting + MarkingDups + BQSR Generation + BAM writing
Processing time: 387.945936 seconds

[main] CMD: PARABRICKS mem -Z ./pbOpts.txt /scrfs/storage/pwolinsk/home/parabricks_sample/Ref/Homo_sapiens_assembly38.fasta /scrfs/storage/pwolinsk/home/parabricks_sample/Data/sample_1.fq.gz /scrfs/storage/pwolinsk/home/parabricks_sample/Data/sample_2.fq.gz @RG\tID:HK3TJBCX2.1\tLB:lib1\tPL:bar\tSM:sample\tPU:HK3TJBCX2.1
[main] Real time: 392.828 sec; CPU: 3587.779 sec
------------------------------------------------------------------------------
|| Program: GPU-BWA mem, Sorting, Marking Duplicates, BQSR ||
|| Version: v2.5.0 ||
|| Start Time: Mon Mar 23 16:22:34 2020 ||
|| End Time: Mon Mar 23 16:29:07 2020 ||
|| Total Time: 6 minutes 33 seconds ||
------------------------------------------------------------------------------
^C
</code>
Then check for new output files:
<code>
pinnacle-l9:pwolinsk:~$ ls -ltr
...
-rw-rw-r--. 1 pwolinsk pwolinsk        288 Mar 23 11:22 parabricks.slurm
-rw-r--r--. 1 pwolinsk pwolinsk    6882792 Mar 23 11:28 output.bam.bai
-rw-r--r--. 1 pwolinsk pwolinsk 4728882999 Mar 23 11:28 output.bam
-rw-rw-r--. 1 pwolinsk pwolinsk       2659 Mar 23 11:29 slurm-56682.out
-rw-r--r--. 1 pwolinsk pwolinsk      87690 Mar 23 11:29 output_chrs.txt
pinnacle-l9:pwolinsk:~$
</code>
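The submit-and-monitor steps above can be partly scripted. The sketch below is illustrative only: the three function names are hypothetical, and it assumes the "Submitted batch job" message, the default slurm-<jobid>.out log name, and the "Total Time" banner line look as they do in the session shown on this page. Other Slurm configurations or Parabricks versions may format these differently.

```shell
#!/bin/bash
# Hypothetical helpers; none of these names are part of Slurm or Parabricks.

# Extract the numeric job ID from sbatch's "Submitted batch job <id>" message.
job_id_from_sbatch() {
    awk '/^Submitted batch job/ {print $4}' <<< "$1"
}

# Build the default Slurm standard output file name for a job ID.
log_for_job() {
    printf 'slurm-%s.out\n' "$1"
}

# Print the run time reported on the "Total Time" line of the summary banner
# in a job log; prints nothing if the job has not finished yet.
total_time_from_log() {
    sed -n 's/^|| *Total Time: *\(.*[^ ]\) *||$/\1/p' "$1"
}
```

A possible use, under the same assumptions: jobid=$(job_id_from_sbatch "$(sbatch parabricks.slurm)") followed by tail -f "$(log_for_job "$jobid")", and total_time_from_log "$(log_for_job "$jobid")" once the job has finished.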