==== MPI types, MPI examples, MPI-threaded hybrid, HPL ====

Here are some MPI examples for different flavors. Each also illustrates (a) setting modules and environment variables in the batch file, and (b) hybrid MPI/MKL threads. Hybrid MPI/OpenMP is run in the same way as MPI/MKL, except the relevant environment variable is ''OMP_NUM_THREADS''.  Performance of each MPI is very similar for this small example, but there may be a big difference for larger jobs.
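For instance, a hybrid MPI/OpenMP launch with Intel MPI or MVAPICH2 (both accept ''-genv'', as in the examples below) would look like the following sketch; the binary ''./hybrid_app'' is only a placeholder, not one of the programs used on this page:
<code>
# Hypothetical hybrid MPI/OpenMP launch: 8 MPI ranks, 4 OpenMP threads per rank.
# ./hybrid_app is a placeholder binary name.
mpirun -np 8 -machinefile nodefile.$PBS_JOBID -genv OMP_NUM_THREADS 4 ./hybrid_app >logfile
</code>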
  
The test program is HPL (High Performance Linpack) on (a) two E5-2650 v2/Mellanox 16-core nodes or (b) two E5-2670/QLogic 16-core nodes, which are older but higher-clocked.  This job uses a relatively small matrix, so performance is well short of maximum.  The best HPL layout we have tested is 4 MPI processes per two-socket node, with 3 or 4 MKL threads per MPI process for 12- or 16-core nodes respectively.  HPL is usually compiled with the ''gcc'' compiler; better compilers don't help, as the great majority of the execution time is spent in the BLAS library.
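For reference, the runs below (N=34560, NB=180, P=2, Q=4, visible in the output tables) correspond to ''HPL.dat'' entries like the following; this is a sketch of only the relevant lines, not the complete input file:
<code>
1            # of problems sizes (N)
34560        Ns
1            # of NBs
180          NBs
1            # of process grids (P x Q)
2            Ps
4            Qs
</code>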
  
We have found that MPI software has mostly improved over time, so the best choices are the recent versions shown in the examples.  Newer versions also have better support for module selection in the job file, as shown in these examples.
    
== Intel MPI ==
  
If you set the ''impi'' module at the top of your batch file, paths (program ''$PATH'' and shared library ''$LD_LIBRARY_PATH'') will be passed to slave nodes and a multiple-node job will run.  If you need other environment variables in the batch file, they won't be passed to slave nodes and must be set either (a) in ''.bashrc'' or (b) in the ''mpirun'' statement, for instance the number of MKL threads below. Note the csh-style assignment (no equals sign) for ''MKL_NUM_THREADS''.
  
<code>
#PBS ...
#PBS -l node=2:ppn=16
module purge
module load gcc/4.9.1 mkl/14.0.3 impi/5.1.2
cd $PBS_O_WORKDIR
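# The launch line below is a sketch, not taken from the original page: it follows
# the same pattern as the other examples, 8 ranks over 2 nodes with 4 MKL threads
# each, passed with Intel MPI's -genv (note: no equals sign).
sort -u $PBS_NODEFILE >nodefile.$PBS_JOBID
mpirun -np 8 -machinefile nodefile.$PBS_JOBID -perhost 4 -genv MKL_NUM_THREADS 4 ./xhpl >logfile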
T/V                N    NB     P     Q               Time                 Gflops
--------------------------------------------------------------------------------
WC32L2C4       34560   180     2     4              48.90              5.628e+02 (E5-2650 v2)
================================================================================
WC32L2C4       34560   180     2     4              47.65              5.776e+02 (E5-2670)
================================================================================
</code>

== MVAPICH2 ==

<code>
#PBS ...
#PBS -l node=2:ppn=16
module purge
module load gcc/4.9.1 mkl/14.0.3 mvapich2/2.2
cd $PBS_O_WORKDIR
sort -u $PBS_NODEFILE >nodefile.$PBS_JOBID
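# The launch line below is a sketch, not taken from the original page: MVAPICH2's
# mpirun with its default CPU binding, using the same -machinefile/-genv pattern
# as the explicit-binding example further down.
mpirun -np 8 -machinefile nodefile.$PBS_JOBID -genv MKL_NUM_THREADS 4 ./xhpl >logfile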
T/V                N    NB     P     Q               Time                 Gflops
--------------------------------------------------------------------------------
WC32L2C4       34560   180     2     4              48.53              5.671e+02 (E5-2650 v2)
================================================================================
WC32L2C4       34560   180     2     4              48.17              5.714e+02 (E5-2670)
================================================================================
</code>

The same MVAPICH2 job with CPU affinity and binding set explicitly:
<code>
#PBS ...
#PBS -l node=2:ppn=16
module purge
module load gcc/4.9.1 mkl/14.0.3 mvapich2/2.2
cd $PBS_O_WORKDIR
sort -u $PBS_NODEFILE >nodefile.$PBS_JOBID
mpirun -np 8 -machinefile nodefile.$PBS_JOBID -genv MKL_NUM_THREADS 4 -genv MV2_ENABLE_AFFINITY 1 \
-genv MV2_USE_AFFINITY 1 -genv MV2_USE_SHARED_MEM 1 -genv MV2_CPU_BINDING_LEVEL numanode \
-genv MV2_CPU_BINDING_POLICY scatter ./xhpl >logfile
T/V                N    NB     P     Q               Time                 Gflops
--------------------------------------------------------------------------------
WC32L2C4       34560   180     2     4              48.71              5.649e+02 (E5-2650 v2)
================================================================================
</code>

== Open MPI ==

<code>
#PBS ...
#PBS -l node=2:ppn=16
module purge
module load gcc/4.9.1 mkl/14.0.3 openmpi/2.0.1
cd $PBS_O_WORKDIR
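# The launch lines below are a sketch, not taken from the original page: build a
# hostfile with 4 slots per node so that the 8 ranks land 4 per node, and pass the
# MKL thread count with Open MPI's -x option.
sort -u $PBS_NODEFILE | awk '{print $1" slots=4"}' >nodefile.$PBS_JOBID
mpirun -np 8 -hostfile nodefile.$PBS_JOBID -x MKL_NUM_THREADS=4 ./xhpl >logfile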
T/V                N    NB     P     Q               Time                 Gflops
--------------------------------------------------------------------------------
WC32L2C4       34560   180     2     4              48.71              5.650e+02 (E5-2650 v2)
================================================================================
WC32L2C4       34560   180     2     4              48.08              5.723e+02 (E5-2670)
================================================================================
</code>
If MPI processes = nodes*cores per node, it's not necessary to process the nodefile.  Here the number of MPI processes is set automatically to nodes*cores per node (it's the number of lines in ''$PBS_NODEFILE''):
<code>
#PBS ...
#PBS -l node=2:ppn=16
module purge
module load gcc/4.9.1 mkl/14.0.3 openmpi/2.0.1
cd $PBS_O_WORKDIR
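# The launch line below is a sketch, not taken from the original page: with no -np
# and no hostfile, Open MPI built with Torque/PBS support starts one rank per
# allocated slot (here 32, the number of lines in $PBS_NODEFILE).  With one rank
# per core, use 1 MKL thread per rank, and the P x Q grid in HPL.dat must be
# changed to use all 32 ranks (for example 4 x 8).
mpirun -x MKL_NUM_THREADS=1 ./xhpl >logfile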
</code>

== Platform MPI ==

IBM [[http://www.ibm.com/developerworks/downloads/im/mpi/|Platform MPI Community Edition]] 9.1.2 (formerly HP MPI) is installed on razor.
<code>
#PBS ...
#PBS -l node=2:ppn=16
module purge
module load gcc/4.9.1 mkl/14.0.3 platform_mpi/9.1.2
cd $PBS_O_WORKDIR
sort -u $PBS_NODEFILE >nodefile.$PBS_JOBID
mpirun -np 8 -machinefile nodefile.$PBS_JOBID -e MKL_NUM_THREADS=4 \
-e LD_LIBRARY_PATH=$LD_LIBRARY_PATH ./xhpl >logfile
================================================================================
T/V                N    NB     P     Q               Time                 Gflops
--------------------------------------------------------------------------------
WC32L2C4       34560   180     2     4              49.15              5.599e+02 (E5-2650 v2)
================================================================================
WC32L2C4       34560   180     2     4              48.61              5.661e+02 (E5-2670, add -psm)
================================================================================
</code>
  