mpi


Here are some MPI examples for different flavors. Each also illustrates (a) setting modules and environment variables in the batch file, and (b) hybrid MPI/MKL threads. Hybrid MPI/OpenMP is run in the same way as MPI/MKL, except that the relevant environment variable is OMP_NUM_THREADS. The MPI flavors perform very similarly on this small example, but there may be a big difference for larger jobs.

The test program is HPL (High Performance Linpack) on two 16-core nodes with a relatively small matrix, so performance is well short of maximum. The best HPL layout we have tested is 4 MPI processes per two-socket node, with 3 or 4 MKL threads per MPI process for 12- or 16-core nodes respectively.
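
The layout arithmetic above can be sketched in shell. The variable names NODES, CORES_PER_NODE and RANKS_PER_NODE are illustrative assumptions, not names used elsewhere on this page; the values match the two 16-core-node example.

```shell
#!/bin/bash
# Hypothetical variable names; values match the 2 x 16-core example.
NODES=2
CORES_PER_NODE=16
RANKS_PER_NODE=4          # 4 MPI processes per two-socket node

# Total MPI processes to request with mpirun -np
NP=$((NODES * RANKS_PER_NODE))
# MKL threads per MPI process so that processes * threads fills each node
THREADS=$((CORES_PER_NODE / RANKS_PER_NODE))

echo "mpirun -np $NP with MKL_NUM_THREADS=$THREADS"
```

Changing CORES_PER_NODE to 12 gives 3 threads per process, which matches the 12-core layout mentioned above.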

We have found that MPI software has improved over time, so the best choices are recent versions as shown in the examples.

Intel MPI

If you load the impi module at the top of your batch file, the paths (program $PATH and shared library $LD_LIBRARY_PATH) will be passed to slave nodes and a multiple-node job will run. Other environment variables set in the batch file won't be passed to slave nodes and must be set either (a) in .bashrc or (b) in the mpirun statement, as with the number of MKL threads below. Notice that -genv takes the name and the value with no equals sign between them.

#PBS ...
#PBS -l nodes=2:ppn=16
module load gcc/4.9.1 mkl/14.0.3 impi/5.1.2
cd $PBS_O_WORKDIR
sort -u $PBS_NODEFILE >nodefile.$PBS_JOBID
mpirun -np 8 -machinefile nodefile.$PBS_JOBID -genv MKL_NUM_THREADS 4  ./xhpl >logfile
================================================================================
T/V                N    NB     P     Q               Time                 Gflops
--------------------------------------------------------------------------------
WC32L2C4       34560   180     2     4              48.90              5.628e+02
================================================================================
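
The sort -u step in the script above matters because Torque/PBS normally writes one line per allocated core into $PBS_NODEFILE, so a 2-node, ppn=16 job lists each host 16 times. A minimal sketch of that reduction follows, using a temporary file in place of $PBS_NODEFILE; the hostnames node01 and node02 are made up for illustration.

```shell
#!/bin/bash
# Simulate a $PBS_NODEFILE for 2 nodes x 16 cores.
# Hostnames node01/node02 are made up for illustration.
FAKE_NODEFILE=$(mktemp)
for host in node01 node02; do
    for i in $(seq 16); do
        echo "$host"
    done
done > "$FAKE_NODEFILE"

# The full file has one line per allocated core
TOTAL=$(( $(wc -l < "$FAKE_NODEFILE") ))

# After sort -u there is one line per node
sort -u "$FAKE_NODEFILE" > "$FAKE_NODEFILE.uniq"
UNIQUE=$(( $(wc -l < "$FAKE_NODEFILE.uniq") ))

echo "slots=$TOTAL nodes=$UNIQUE"
rm -f "$FAKE_NODEFILE" "$FAKE_NODEFILE.uniq"
```

With the reduced file, the number of MPI processes is chosen by -np rather than by the number of lines in the node file.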
MVAPICH2

MVAPICH2 is similar to Intel MPI in that $PATH and $LD_LIBRARY_PATH will correctly pass to slave nodes, but other environment variables won't, and have to be set in .bashrc or deliberately passed by mpirun or mpirun_rsh. MVAPICH2 hybrid with OpenMP or MKL threads has very poor CPU affinity by default (although the default affinity is good when each core has its own MPI process). The long MV2 option list below is needed to get reasonable performance for hybrid jobs.

#PBS ...
#PBS -l nodes=2:ppn=16
module load gcc/4.9.1 intel/14.0.3 mvapich2/2.2
cd $PBS_O_WORKDIR
sort -u $PBS_NODEFILE >nodefile.$PBS_JOBID
mpirun_rsh -ssh -np 8 -hostfile nodefile.$PBS_JOBID MKL_NUM_THREADS=4 MV2_ENABLE_AFFINITY=1 \
MV2_USE_AFFINITY=1 MV2_USE_SHARED_MEM=1 MV2_CPU_BINDING_LEVEL=numanode \
MV2_CPU_BINDING_POLICY=scatter ./xhpl >logfile
================================================================================
T/V                N    NB     P     Q               Time                 Gflops
--------------------------------------------------------------------------------
WC32L2C4       34560   180     2     4              48.53              5.671e+02
================================================================================

When the run was repeated without the MV2 affinity variables, MKL threads were unusable: this 49-second job was stopped after 21 minutes.
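
One way to investigate a slowdown like this is to look at which CPUs a process is allowed to run on. The sketch below is an assumption-laden illustration, not part of the tested setup: it relies on the Linux /proc filesystem, and the function name show_affinity is hypothetical. If every MPI process reported the same single core, its MKL threads would all be competing for that core.

```shell
#!/bin/bash
# Print the CPUs the current process may run on (Linux /proc only).
# show_affinity is a hypothetical helper name, not part of any MPI package.
show_affinity() {
    if [ -r /proc/self/status ]; then
        # Cpus_allowed_list is the kernel's view of the affinity mask
        grep Cpus_allowed_list /proc/self/status
    else
        echo "Cpus_allowed_list: unavailable (no /proc on this system)"
    fi
}

show_affinity
```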

The MVAPICH2 mpirun job starter is similar in performance to mpirun_rsh, but the syntax is different:

#PBS ...
#PBS -l nodes=2:ppn=16
module load gcc/4.9.1 intel/14.0.3 mvapich2/2.2
cd $PBS_O_WORKDIR
sort -u $PBS_NODEFILE >nodefile.$PBS_JOBID
mpirun -np 8 -machinefile nodefile.$PBS_JOBID -genv MKL_NUM_THREADS 4 -genv MV2_ENABLE_AFFINITY 1 \
-genv MV2_USE_AFFINITY 1 -genv MV2_USE_SHARED_MEM 1 -genv MV2_CPU_BINDING_LEVEL numanode \
-genv MV2_CPU_BINDING_POLICY scatter ./xhpl >logfile
================================================================================
T/V                N    NB     P     Q               Time                 Gflops
--------------------------------------------------------------------------------
WC32L2C4       34560   180     2     4              48.71              5.649e+02
================================================================================
Open MPI

Later versions of Open MPI will correctly pass $PATH but not $LD_LIBRARY_PATH to slave compute nodes, so the -x LD_LIBRARY_PATH below is required. Here the optional environment variable MKL_NUM_THREADS is also set. Again the syntax is slightly different from the other mpiruns: a name with no equals sign and no value means pass the existing value, and name=value means set and pass the value.

This only works reliably in our setup for Open MPI version 1.8.8 or later. Earlier versions need the openmpi module set in .bashrc. Recent Open MPI versions are much faster, so programs using early versions should be recompiled anyway. Small version changes like openmpi/1.8.6 to openmpi/1.8.8 will usually work without recompiling.

For hybrid MPI/MKL threads (or any case where the number of MPI processes is not equal to nodes*cores per node), the option --bynode is needed so that, with the reduced (sort -u below) nodefile, MPI processes are allocated round-robin across the nodes instead of all on the first node.

#PBS ...
#PBS -l nodes=2:ppn=16
module load gcc/4.9.1 mkl/14.0.3 openmpi/2.0.1
cd $PBS_O_WORKDIR
sort -u $PBS_NODEFILE >nodefile.$PBS_JOBID
mpirun -np 8 -machinefile nodefile.$PBS_JOBID --bynode -x LD_LIBRARY_PATH -x MKL_NUM_THREADS=4 ./xhpl >logfile
================================================================================
T/V                N    NB     P     Q               Time                 Gflops
--------------------------------------------------------------------------------
WC32L2C4       34560   180     2     4              48.71              5.650e+02
================================================================================

If MPI processes=nodes*cores per node, it's not necessary to process the nodefile. Here the number of MPI processes is set automatically to nodes*cores per node (it's the number of lines in $PBS_NODEFILE):

module load gcc/4.9.1 mkl/14.0.3 openmpi/2.0.1
cd $PBS_O_WORKDIR
NP=$(wc -l < $PBS_NODEFILE)
mpirun -np $NP -machinefile $PBS_NODEFILE -x LD_LIBRARY_PATH ./mympiprogram >logfile
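
A detail worth knowing when counting lines for -np: wc -l prints the file name after the count when it is given a file argument, so reading the file from standard input is a simple way to get a bare number. A small sketch, using a temporary file with made-up hostnames in place of $PBS_NODEFILE:

```shell
#!/bin/bash
# Stand-in for $PBS_NODEFILE; hostnames are made up for illustration.
NODEFILE=$(mktemp)
printf 'node01\nnode01\nnode02\nnode02\n' > "$NODEFILE"

WITH_NAME=$(wc -l "$NODEFILE")      # count followed by the file name
NP=$(wc -l < "$NODEFILE")           # count only
NP=$((NP))                          # strip any padding some wc versions add

echo "with file argument: $WITH_NAME"
echo "from stdin:         $NP"
rm -f "$NODEFILE"
```

Only the second form yields a value that can be passed directly as the argument to -np.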
Platform MPI

IBM Platform MPI Community Edition 9.1.2 (formerly HP MPI) is installed on razor.

module purge
module load gcc/4.9.1 mkl/14.0.3 platformmpi/9.1.2
sort -u $PBS_NODEFILE >nodefile.$PBS_JOBID
mpirun -np 8 -machinefile nodefile.$PBS_JOBID -e MKL_NUM_THREADS=4 \
-e LD_LIBRARY_PATH=$LD_LIBRARY_PATH ./xhpl >logfile
================================================================================
T/V                N    NB     P     Q               Time                 Gflops
--------------------------------------------------------------------------------
WC32L2C4       34560   180     2     4              49.15              5.599e+02
================================================================================
mpi.1474570216.txt.gz · Last modified: 2016/09/22 18:50 by root