==== MPI types, MPI examples, MPI-threaded hybrid, HPL ====

Here are examples for several MPI flavors. Each also illustrates (a) loading modules and setting environment variables in the batch file, and (b) running hybrid MPI/MKL threads. Hybrid MPI/OpenMP is run the same way as MPI/MKL, except that the relevant environment variable is ''OMP_NUM_THREADS''.  Performance of the MPI flavors is very similar for this small example, but there may be a big difference for larger jobs.

The test program is HPL (High Performance Linpack) on (a) two E5-2650 v2/Mellanox 16-core nodes or (b) two E5-2670/QLogic 16-core nodes, which are older but higher-clocked.  The job uses a relatively small matrix, so performance is well short of maximum.  The best HPL layout we have tested is 4 MPI processes per two-socket node, with 3 MKL threads per MPI process on 12-core nodes or 4 on 16-core nodes.  HPL is usually compiled with the ''gcc'' compiler.  Better compilers don't help, because the great majority of the execution time is spent in the BLAS library.
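
The layout arithmetic above can be scripted rather than hard-coded. The following is a minimal sketch, assuming a PBS-style nodefile with one line per allocated core and identical nodes; the function name ''hpl_layout'' and its default of 4 MPI processes per node are illustrative and not part of any MPI distribution.

```shell
# hpl_layout NODEFILE [RANKS_PER_NODE]
# Prints "np=<MPI processes> threads=<threads per MPI process>".
# Assumes NODEFILE lists each host once per core and that every node
# has the same core count. hpl_layout is a hypothetical helper name.
hpl_layout() {
    nodefile=$1
    ranks_per_node=${2:-4}
    # One line per core in the whole allocation.
    total_cores=$(wc -l < "$nodefile")
    # Unique hostnames give the node count.
    nodes=$(sort -u "$nodefile" | wc -l)
    cores_per_node=$((total_cores / nodes))
    np=$((nodes * ranks_per_node))
    threads=$((cores_per_node / ranks_per_node))
    echo "np=$np threads=$threads"
}
```

Called as ''hpl_layout $PBS_NODEFILE'' inside a job on two 16-core nodes, this would print ''np=8 threads=4'', which matches the ''-np 8'' and 4 MKL threads used in the examples on this page.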

We have found that MPI software has mostly improved over time, so the best choices are recent versions, as shown in the examples.  Newer versions also have better support for selecting modules in the job file.
 
== Intel MPI ==

If you load the ''impi'' module at the top of your batch file, the paths (program ''$PATH'' and shared library ''$LD_LIBRARY_PATH'') are passed to the slave nodes and a multiple-node job will run.  Other environment variables set in the batch file are not passed to the slave nodes; they must be set either (a) in ''.bashrc'' or (b) in the ''mpirun'' statement, as with the number of MKL threads below. Note the csh-style syntax, with no equals sign, in the ''-genv MKL_NUM_THREADS 4'' assignment.

<code>
#PBS ...
#PBS -l nodes=2:ppn=16
module purge
module load gcc/4.9.1 mkl/14.0.3 impi/5.1.2
cd $PBS_O_WORKDIR
sort -u $PBS_NODEFILE >nodefile.$PBS_JOBID
mpirun -np 8 -machinefile nodefile.$PBS_JOBID -genv MKL_NUM_THREADS 4  ./xhpl >logfile
================================================================================
T/V                N    NB     P     Q               Time                 Gflops
--------------------------------------------------------------------------------
WC32L2C4       34560   180     2     4              48.90              5.628e+02 (E5-2650 v2)
================================================================================
WC32L2C4       34560   180     2     4              47.65              5.776e+02 (E5-2670)
================================================================================
</code>

== MVAPICH2 ==

MVAPICH2 is similar to Intel MPI in that ''$PATH'' and ''$LD_LIBRARY_PATH'' are passed correctly to the slave nodes, but other environment variables are not, and have to be set in ''.bashrc'' or passed explicitly by ''mpirun'' or ''mpirun_rsh''.  For hybrid runs with OpenMP or MKL threads, MVAPICH2 has very bad CPU affinity by default (the default affinity is good when each core has its own MPI process).  The long list of ''MV2_'' options below is needed to get reasonable hybrid performance.

<code>
#PBS ...
#PBS -l nodes=2:ppn=16
module purge
module load gcc/4.9.1 mkl/14.0.3 mvapich2/2.2
cd $PBS_O_WORKDIR
sort -u $PBS_NODEFILE >nodefile.$PBS_JOBID
mpirun_rsh -ssh -np 8 -hostfile nodefile.$PBS_JOBID MKL_NUM_THREADS=4 MV2_ENABLE_AFFINITY=1 \
MV2_USE_AFFINITY=1 MV2_USE_SHARED_MEM=1 MV2_CPU_BINDING_LEVEL=numanode \
MV2_CPU_BINDING_POLICY=scatter ./xhpl >logfile
================================================================================
T/V                N    NB     P     Q               Time                 Gflops
--------------------------------------------------------------------------------
WC32L2C4       34560   180     2     4              48.53              5.671e+02 (E5-2650 v2)
================================================================================
WC32L2C4       34560   180     2     4              48.17              5.714e+02 (E5-2670)
================================================================================
</code>

When the run was repeated without the ''MV2_'' affinity variables, MKL threading was unusable: this 49-second job was stopped after 21 minutes.
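
One way to look for an affinity problem before committing to a long run is to have each process report which cores it is allowed to use. The sketch below assumes Linux compute nodes, where ''/proc/self/status'' contains a ''Cpus_allowed_list'' line; the function name ''allowed_cpus'' and its optional file argument are illustrative only.

```shell
# allowed_cpus [STATUS_FILE]
# Prints the list of cores the calling process may run on, e.g. "0-7".
# STATUS_FILE defaults to /proc/self/status; awk runs as a child of
# the calling shell and inherits its affinity mask, so the value
# reported is the caller's. allowed_cpus is a hypothetical helper.
allowed_cpus() {
    status_file=${1:-/proc/self/status}
    awk '/^Cpus_allowed_list:/ { print $2 }' "$status_file"
}
```

If this is wrapped in a small script and started by the job launcher in place of ''./xhpl'', a process that reports a single core while it is expected to run 4 threads would be consistent with the slowdown described above.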

The MVAPICH2 ''mpirun'' job starter is similar in performance to ''mpirun_rsh'', but the syntax is different:

<code>
#PBS ...
#PBS -l nodes=2:ppn=16
module purge
module load gcc/4.9.1 mkl/14.0.3 mvapich2/2.2
cd $PBS_O_WORKDIR
sort -u $PBS_NODEFILE >nodefile.$PBS_JOBID
mpirun -np 8 -machinefile nodefile.$PBS_JOBID -genv MKL_NUM_THREADS 4 -genv MV2_ENABLE_AFFINITY 1 \
-genv MV2_USE_AFFINITY 1 -genv MV2_USE_SHARED_MEM 1 -genv MV2_CPU_BINDING_LEVEL numanode \
-genv MV2_CPU_BINDING_POLICY scatter ./xhpl >logfile
================================================================================
T/V                N    NB     P     Q               Time                 Gflops
--------------------------------------------------------------------------------
WC32L2C4       34560   180     2     4              48.71              5.649e+02 (E5-2650 v2)
================================================================================
</code>
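
The two MVAPICH2 job starters take the same settings in different spellings: ''NAME=value'' for ''mpirun_rsh'' and ''-genv NAME value'' for ''mpirun''. A small shell function can keep one list and emit the second form. This is a sketch only; ''to_genv'' is a hypothetical helper, and values containing spaces are not handled.

```shell
# to_genv NAME=value ...
# Prints the same settings as "-genv NAME value" pairs on one line.
# to_genv is a hypothetical helper, not part of MVAPICH2.
to_genv() {
    out=""
    for pair in "$@"; do
        name=${pair%%=*}    # text before the first "="
        value=${pair#*=}    # text after the first "="
        out="$out -genv $name $value"
    done
    # Drop the leading space before printing.
    echo "${out# }"
}
```

For example, ''mpirun -np 8 -machinefile nodefile.$PBS_JOBID $(to_genv MKL_NUM_THREADS=4 MV2_ENABLE_AFFINITY=1) ./xhpl'' would expand to the ''-genv'' form used in the block above for those two variables.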

== Open MPI ==

Later versions of Open MPI correctly pass ''$PATH'' but not ''$LD_LIBRARY_PATH'' to the slave compute nodes, so the ''-x LD_LIBRARY_PATH'' below is required; here the optional environment variable ''MKL_NUM_THREADS'' is also set.  Again the syntax differs slightly from the other ''mpirun'' commands: ''-x NAME'' with no equals sign and no value passes the existing value, and ''-x NAME=value'' sets the value and passes it.

This only works reliably in our setup with Open MPI 1.8.8 or later.  Earlier versions need the ''openmpi'' module loaded in ''.bashrc''.  Recent Open MPI versions are much faster, so programs built with early versions should be recompiled anyway.  Small version changes such as openmpi/1.8.6 to openmpi/1.8.8 will usually work without recompiling.

For hybrid MPI/MKL threads (or any case where the number of MPI processes is not equal to nodes * cores per node), the option ''--bynode'' is needed so that, with the reduced nodefile (''sort -u'' below), MPI processes are allocated round-robin across the nodes instead of all on the first node.

<code>
#PBS ...
#PBS -l nodes=2:ppn=16
module purge
module load gcc/4.9.1 mkl/14.0.3 openmpi/2.0.1
cd $PBS_O_WORKDIR
sort -u $PBS_NODEFILE >nodefile.$PBS_JOBID
mpirun -np 8 -machinefile nodefile.$PBS_JOBID --bynode -x LD_LIBRARY_PATH -x MKL_NUM_THREADS=4 ./xhpl >logfile
================================================================================
T/V                N    NB     P     Q               Time                 Gflops
--------------------------------------------------------------------------------
WC32L2C4       34560   180     2     4              48.71              5.650e+02 (E5-2650 v2)
================================================================================
WC32L2C4       34560   180     2     4              48.08              5.723e+02 (E5-2670)
================================================================================
</code>
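
To make the intended round-robin placement concrete, the sketch below prints which host each MPI rank would land on when ranks are dealt out one per node in turn. It only illustrates the placement described above; it does not call or reproduce Open MPI's actual mapper, and ''place_bynode'' is a hypothetical name.

```shell
# place_bynode NP HOST...
# Prints one "rank host" line per MPI rank, assigning ranks to the
# listed hosts round-robin. Illustration only; hypothetical helper.
place_bynode() {
    np=$1
    shift
    nhosts=$#
    rank=0
    while [ "$rank" -lt "$np" ]; do
        # Host index is (rank mod nhosts), counted from 1 to match
        # the positional parameters.
        idx=$((rank % nhosts + 1))
        eval "host=\${$idx}"
        echo "$rank $host"
        rank=$((rank + 1))
    done
}
```

With 8 ranks and two hosts, ''place_bynode 8 nodeA nodeB'' lists the even ranks on the first host and the odd ranks on the second, which gives the 4 MPI processes per node used in the HPL runs.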

If the number of MPI processes equals nodes * cores per node, it's not necessary to process the nodefile.  Here the number of MPI processes is set automatically to the number of lines in ''$PBS_NODEFILE'', which is nodes * cores per node:

<code>
#PBS ...
#PBS -l nodes=2:ppn=16
module purge
module load gcc/4.9.1 mkl/14.0.3 openmpi/2.0.1
cd $PBS_O_WORKDIR
# Redirect into wc so it prints only the line count, not the filename
NP=$(wc -l < $PBS_NODEFILE)
mpirun -np $NP -machinefile $PBS_NODEFILE -x LD_LIBRARY_PATH ./mympiprogram >logfile
</code>

== Platform MPI ==

IBM [[http://www.ibm.com/developerworks/downloads/im/mpi/|Platform MPI Community Edition]] 9.1.2 (formerly HP MPI) is installed on razor.

<code>
#PBS ...
#PBS -l nodes=2:ppn=16
module purge
module load gcc/4.9.1 mkl/14.0.3 platform_mpi/9.1.2
cd $PBS_O_WORKDIR
sort -u $PBS_NODEFILE >nodefile.$PBS_JOBID
mpirun -np 8 -machinefile nodefile.$PBS_JOBID -e MKL_NUM_THREADS=4 \
-e LD_LIBRARY_PATH=$LD_LIBRARY_PATH ./xhpl >logfile
================================================================================
T/V                N    NB     P     Q               Time                 Gflops
--------------------------------------------------------------------------------
WC32L2C4       34560   180     2     4              49.15              5.599e+02 (E5-2650 v2)
================================================================================
WC32L2C4       34560   180     2     4              48.61              5.661e+02 (E5-2670, add -psm)
================================================================================
</code>