NCBI Blast+ is a shared-memory program that runs on a single node with multiple threads. The Intel processors on the razor cluster run blast about three times as fast as the AMD processors on trestles (though trestles has twice as many cores per node). Razor's 12-core nodes are sufficient since blast+ scales to only about 8 threads, as the user/real time ratios in the examples show; even so, each example sets the threads variable to the number of cores actually present on the node.
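To see where scaling flattens out on a particular node, one simple approach is to time the same query at several thread counts and compare user and real time. A minimal sketch, assuming the nt database has already been copied to the current directory as in the examples below (the thread counts are illustrative):

module purge; module load blast/2.3.0+
# once the useful thread count is exceeded, real time stops dropping
# while user time keeps growing
for t in 1 2 4 8 12 16; do
    echo "threads=$t"
    time blastn -num_threads "$t" -db nt/nt \
        -query /share/apps/blast/queries/blastn/NM_001005648 >/dev/null
done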
Blast works better with a database located on a local file system, so if you are doing a number of runs it may be worth the couple of minutes it takes to copy the database to your area of the local scratch disk, as shown below. For a single run it is probably faster overall to blast directly against the database on the parallel filesystem. If you do copy the database, please remember to remove it at the end of the job.
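In batch work the copy/run/cleanup pattern belongs in the job script itself, so the cleanup happens even if you forget. A minimal sketch, assuming a PBS-style scheduler (the directives will differ under other schedulers) and the database and query paths from the trestles example below:

#!/bin/bash
#PBS -N blastn
#PBS -l nodes=1:ppn=32
#PBS -l walltime=1:00:00
# copy the database to node-local scratch, run against it, then remove it
cd /local_scratch/$USER
rsync -a /share/apps/bioinformatics/blast/db20150912/nt .
module purge
module load blast/2.3.0+
blastn -num_threads 32 -db nt/nt \
    -query $HOME/NM_001005648 > $HOME/blast-2.3.0.out
rm -rf ./nt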
Unfortunately, version 2.4.0+ has a significant performance regression on AMD (compare the 2.3.0+ and 2.4.0+ real times in the trestles run below), and blast+ overall runs better on Intel. Time-to-solution may still depend on cluster load.
/home/rfeynman$ cd /local_scratch/rfeynman
/local_scratch/rfeynman$ rsync -a /share/apps/bioinformatics/blast/db20150912/nt .
/local_scratch/rfeynman$ module purge;module load blast/2.3.0+
/local_scratch/rfeynman$ time blastn -num_threads 32 -db nt/nt -query \
/home/rfeynman/NM_001005648 >blast-2.3.0.out

real    0m11.146s
user    1m33.556s
sys     0m10.938s

/local_scratch/rfeynman$ module purge;module load blast/2.4.0+
/local_scratch/rfeynman$ time blastn -num_threads 32 -db nt/nt -query \
/home/rfeynman/NM_001005648 >blast-2.4.0.out

real    0m18.026s
user    1m48.788s
sys     0m11.792s

/local_scratch/rfeynman$ rm -rf ./nt
Examples are shown on a 16-core node for the three most recent versions of blastn and for the blastall shipped with qiime. In this case 2.2.29 and 2.3.0 give the same output apart from the version banner, while 2.2.28 and blastall differ, as the output file sizes and the diff at the end confirm.
/home/rfeynman$ cd /local_scratch/rfeynman
/local_scratch/rfeynman$ rsync -a /share/apps/blast/db/nt .
/local_scratch/rfeynman$ module purge;module load blast/2.2.28+
/local_scratch/rfeynman$ time blastn -num_threads 16 -db nt/nt \
-query /share/apps/blast/queries/blastn/NM_001005648 >blast-2.2.28.out

real    0m3.273s
user    0m2.204s
sys     0m1.530s

/local_scratch/rfeynman$ module purge;module load blast/2.2.29+
/local_scratch/rfeynman$ time blastn -num_threads 16 -db nt/nt \
-query /share/apps/blast/queries/blastn/NM_001005648 >blast-2.2.29.out

real    0m3.817s
user    0m38.202s
sys     0m3.077s

/local_scratch/rfeynman$ module purge;module load blast/2.3.0+
/local_scratch/rfeynman$ time blastn -num_threads 16 -db nt/nt \
-query /share/apps/blast/queries/blastn/NM_001005648 >blast-2.3.0.out

real    0m3.327s
user    0m33.666s
sys     0m3.284s

/local_scratch/rfeynman$ module purge
/local_scratch/rfeynman$ module load gcc/4.6.3 mkl/13.1.0 python/2.7.5 R/3.1.2-mkl qiime/1.9.1
/local_scratch/rfeynman$ time blastall -p blastn -a 16 -d nt/nt \
-i /share/apps/blast/queries/blastn/NM_001005648 -o blastall.out

real    0m6.231s
user    1m20.847s
sys     0m3.195s

/local_scratch/rfeynman$ ls -al *out
-rw-r--r-- 1 rfeynman rfeynman  119520 Feb 29 13:10 blast-2.2.28.out
-rw-r--r-- 1 rfeynman rfeynman  498945 Feb 29 13:11 blast-2.2.29.out
-rw-r--r-- 1 rfeynman rfeynman  498538 Feb 29 13:11 blast-2.3.0.out
-rw-r--r-- 1 rfeynman rfeynman 1422859 Feb 29 13:14 blastall.out
/local_scratch/rfeynman$ diff -ibw blast-2.2.29.out blast-2.3.0.out
1c1
< BLASTN 2.2.29+
---
> BLASTN 2.3.0+
/local_scratch/rfeynman$ rm -rf ./nt
On Intel razor, 2.4.0+ timed comparably to or better than 2.3.0+. The comparison below uses a longer query, which times more repeatably:
/local_scratch/rfeynman$ module purge;module load blast/2.3.0+
/local_scratch/rfeynman$ time blastn -num_threads 16 -db nt/nt \
-query /share/apps/blast/queries/blastn/NM_010585 >blast-2.3.0.out

real    0m7.097s
user    0m25.746s
sys     0m2.325s

/local_scratch/rfeynman$ module purge;module load blast/2.4.0+
/local_scratch/rfeynman$ time blastn -num_threads 16 -db nt/nt \
-query /share/apps/blast/queries/blastn/NM_010585 >blast-2.4.0.out

real    0m6.667s
user    0m22.723s
sys     0m2.219s
Please recall that the shared parallel scratch disks on both systems have about 5,000 MB/s of bandwidth, while the local scratch disks deliver about 150 MB/s (razor, hard disks) or 300 MB/s (trestles, flash drives). So a single Blast job may run faster on the shared disk, depending on load. But distributed Blast runs on every trestles node would have about 15 times the aggregate bandwidth of the shared disk (256 nodes * 300 MB/s = 76,800 MB/s) if using the local disks.
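If you want a rough sanity check of the local-disk figure on a node you are holding, timing a large streaming write is enough. A minimal sketch; the file size, path, and dd options here are illustrative, and results will vary with disk load:

# write ~4 GB to local scratch; dd prints the achieved bandwidth when done
dd if=/dev/zero of=/local_scratch/$USER/ddtest bs=1M count=4096 conv=fdatasync
rm -f /local_scratch/$USER/ddtest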