==== blast+ , blastall ====

NCBI Blast+ is a shared-memory program that runs on a single node with multiple threads. The Intel processors on the razor cluster run blast about three times as fast as the AMD processors on trestles (though trestles has twice as many cores per node). Razor 12-core nodes are sufficient, since blast+ scales to only about 8 threads, as shown by the ratio of user time to real time in the examples below; nevertheless, each example uses the number of cores actually present as the thread count. Blast works better with the database on a local file system, so if you are doing a number of runs it may be worth the couple of minutes needed to copy the database to your area of the local scratch disk, as shown. For a single run it is probably faster overall to blast directly against the database on the parallel file system. If you do copy the database, please remember to remove it at the end of the job.

==trestles==

Unfortunately the 2.4.0+ version has a significant performance regression on AMD, and blast+ overall runs better on Intel. Time-to-solution may still depend on cluster load.

 /home/rfeynman$ cd /local_scratch/rfeynman
 /local_scratch/rfeynman$ rsync -a /share/apps/bioinformatics/blast/db20150912/nt .
 /local_scratch/rfeynman$ module purge;module load blast/2.3.0+
 /local_scratch/rfeynman$ time blastn -num_threads 32 -db nt/nt -query \
 /home/rfeynman/NM_001005648 >blast-2.3.0.out
 
 real    0m11.146s
 user    1m33.556s
 sys     0m10.938s
 
 /local_scratch/rfeynman$ module purge;module load blast/2.4.0+
 /local_scratch/rfeynman$ time blastn -num_threads 32 -db nt/nt -query \
 /home/rfeynman/NM_001005648 >blast-2.4.0.out
 
 real    0m18.026s
 user    1m48.788s
 sys     0m11.792s
 
 /local_scratch/rfeynman$ rm -rf ./nt

==razor==

Examples are shown on a 16-core node for the last three versions of blastn and for the blastall used with qiime. In this case 2.2.29 and 2.3.0 give the same output, while 2.2.28 and blastall differ.

 /home/rfeynman$ cd /local_scratch/rfeynman
 /local_scratch/rfeynman$ rsync -a /share/apps/blast/db/nt .
 /local_scratch/rfeynman$ module purge;module load blast/2.2.28+
 /local_scratch/rfeynman$ time blastn -num_threads 16 -db nt/nt \
 -query /share/apps/blast/queries/blastn/NM_001005648 >blast-2.2.28.out
 
 real    0m3.273s
 user    0m2.204s
 sys     0m1.530s
 
 /local_scratch/rfeynman$ module purge;module load blast/2.2.29+
 /local_scratch/rfeynman$ time blastn -num_threads 16 -db nt/nt \
 -query /share/apps/blast/queries/blastn/NM_001005648 >blast-2.2.29.out
 
 real    0m3.817s
 user    0m38.202s
 sys     0m3.077s
 
 /local_scratch/rfeynman$ module purge;module load blast/2.3.0+
 /local_scratch/rfeynman$ time blastn -num_threads 16 -db nt/nt \
 -query /share/apps/blast/queries/blastn/NM_001005648 >blast-2.3.0.out
 
 real    0m3.327s
 user    0m33.666s
 sys     0m3.284s
 
 /local_scratch/rfeynman$ module purge
 /local_scratch/rfeynman$ module load gcc/4.6.3 mkl/13.1.0 python/2.7.5 R/3.1.2-mkl qiime/1.9.1
 /local_scratch/rfeynman$ time blastall -p blastn -a 16 -d nt/nt \
 -i /share/apps/blast/queries/blastn/NM_001005648 -o blastall.out
 
 real    0m6.231s
 user    1m20.847s
 sys     0m3.195s
 
 /local_scratch/rfeynman$ ls -al *out
 -rw-r--r-- 1 rfeynman rfeynman  119520 Feb 29 13:10 blast-2.2.28.out
 -rw-r--r-- 1 rfeynman rfeynman  498945 Feb 29 13:11 blast-2.2.29.out
 -rw-r--r-- 1 rfeynman rfeynman  498538 Feb 29 13:11 blast-2.3.0.out
 -rw-r--r-- 1 rfeynman rfeynman 1422859 Feb 29 13:14 blastall.out
 /local_scratch/rfeynman$ diff -ibw blast-2.2.29.out blast-2.3.0.out
 1c1
 < BLASTN 2.2.29+
 ---
 > BLASTN 2.3.0+
 /local_scratch/rfeynman$ rm -rf ./nt

On Intel razor, 2.4.0+ timed comparably to or better than 2.3.0+. This is timed with a longer query that times more repeatably:

 /local_scratch/rfeynman$ module purge;module load blast/2.3.0+
 /local_scratch/rfeynman$ time blastn -num_threads 16 -db nt/nt \
 -query /share/apps/blast/queries/blastn/NM_010585 >blast-2.3.0.out
 
 real    0m7.097s
 user    0m25.746s
 sys     0m2.325s
 
 /local_scratch/rfeynman$ module purge;module load blast/2.4.0+
 /local_scratch/rfeynman$ time blastn -num_threads 16 -db nt/nt \
 -query /share/apps/blast/queries/blastn/NM_010585 >blast-2.4.0.out
 
 real    0m6.667s
 user    0m22.723s
 sys     0m2.219s
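For production work, the interactive steps above (copy the database to local scratch, run, remove the copy) can be collected into a batch job. The following is a minimal sketch for a razor 16-core node, assuming a PBS/Torque scheduler; the job name, queue name, and walltime are hypothetical placeholders, while the paths and module name are those used in the examples above.

 #!/bin/bash
 #PBS -N blastn-nt           # hypothetical job name
 #PBS -q q16c                # hypothetical queue name; substitute a valid one
 #PBS -l nodes=1:ppn=16
 #PBS -l walltime=1:00:00
 
 # Work from node-local scratch, as in the interactive examples.
 cd /local_scratch/$USER
 
 # Copy the nt database to local scratch; worthwhile when doing several runs.
 rsync -a /share/apps/blast/db/nt .
 
 module purge
 module load blast/2.4.0+
 
 # Match the thread count to the cores requested.
 blastn -num_threads 16 -db nt/nt \
     -query /share/apps/blast/queries/blastn/NM_010585 \
     > $PBS_O_WORKDIR/blastn.out
 
 # Remove the database copy at the end of the job.
 rm -rf /local_scratch/$USER/nt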
==Disk considerations==

Please recall that the shared parallel scratch disks on both systems have a bandwidth of ~5,000 MB/s, while the local scratch disks have a bandwidth of ~150 MB/s (razor, hard disks) or ~300 MB/s (trestles, flash drives). So a single Blast job may run faster from the shared disk, depending on load. But distributed Blast runs across every trestles node have about 15 times more aggregate bandwidth (256 nodes * 300 MB/s = 76,800 MB/s) when using the local disks.
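To realize that aggregate bandwidth, each node must read the database from its own local disk. One way to distribute the work is sketched below, assuming a PBS/Torque job array on trestles and a query set already split into per-task files; the array size, walltime, and the queries.$PBS_ARRAYID naming convention are hypothetical.

 #!/bin/bash
 #PBS -l nodes=1:ppn=32
 #PBS -l walltime=6:00:00
 #PBS -t 1-256               # hypothetical array size: one task per node
 
 cd /local_scratch/$USER
 
 # Each task copies the database to its node's local flash drive, so the
 # blast reads below come from local disk on every node instead of all
 # contending for the shared parallel filesystem.
 rsync -a /share/apps/bioinformatics/blast/db20150912/nt .
 
 module purge
 module load blast/2.3.0+
 
 # Each task searches its own chunk of the query set.
 blastn -num_threads 32 -db nt/nt \
     -query $PBS_O_WORKDIR/queries.$PBS_ARRAYID \
     > $PBS_O_WORKDIR/blast.$PBS_ARRAYID.out
 
 # Remove the local database copy when done.
 rm -rf /local_scratch/$USER/nt

Note that the initial rsync in each task still reads from the shared filesystem, so staging the copies costs some startup time; the payoff comes from the repeated database reads during the searches themselves.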