==== MPI ====
  
The great majority of multi-node programs in HPC (as well as many single-node programs) use the MPI (Message Passing Interface) parallel software [[https://www.mcs.anl.gov/research/projects/mpi/]].  There are many possible options in configuring MPI for a particular set of hardware (the help output of ''./configure --help'' for Open MPI is about 700 lines), so a very particular setup is needed for best performance.  Fortunately a single setup can be used for many applications.
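
A sketch of what such a build-time configuration can look like is below; the install prefix and the option list are hypothetical, not our exact setup.

<code>
# hypothetical Open MPI build: options depend on site hardware and scheduler
./configure --prefix=/share/apps/openmpi/4.1.4 \
            --with-slurm --with-verbs \
            CC=gcc CXX=g++ FC=gfortran
make all install
</code>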
  
The most common MPI variants in Linux are Open MPI, MVAPICH, and Intel MPI.  The last two are both derived from the earlier ''mpich'' and are ABI compatible, so that ''mpirun''/''mpiexec'' from either MVAPICH or Intel MPI can be used to execute programs compiled with either MVAPICH or Intel MPI.  Open MPI is not ABI compatible with either, but in most cases any of the three toolsets can be used (all the way through, compile and run) with a standard-conforming source file.  Usually one MPI variant will work a little better on a particular program.
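
As a sketch of that portability, assuming a standard-conforming ''hello.c'' (a hypothetical file name), the same source can be compiled and run with each toolset, and the mpich-derived pair can even exchange binaries:

<code>
# compile and run the same source with each variant
module purge; module load gcc/11.2.1 openmpi/4.1.4
mpicc hello.c -o hello_openmpi && mpiexec -np 2 ./hello_openmpi

module purge; module load gcc/11.2.1 mvapich2/2.3.7
mpicc hello.c -o hello_mvapich && mpiexec -np 2 ./hello_mvapich

# ABI compatibility: run the MVAPICH2-compiled binary with Intel MPI
module purge; module load gcc/11.2.1 impi/19.0.5
mpiexec -np 2 ./hello_mvapich
</code>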
  
Modules for currently supported versions are ''openmpi/4.1.4, mvapich2/2.3.7, impi/19.0.5''.  In our configuration, the open-source MPI versions openmpi and mvapich2 are compiled for a particular compiler, so a compiler module should be specified before the MPI module to let the module know which code path to use.  Example compiler modules are ''gcc/11.2.1, intel/19.0.5, nvhpc/22.7'' for Gnu, Intel proprietary, and PGI compilers respectively.  Each has C/Fortran/C++ compilers; see [[compilers]].  The Intel proprietary compiler (as opposed to the newer Intel clang compiler) has run-time scripts to match Intel MPI to a given compiler, but we ask for (compiler) (mpi version) at module load time the same way as for openmpi and mvapich.
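
A minimal sketch of that load order, using the module names above:

<code>
# compiler first, then the MPI built against it
module purge
module load gcc/11.2.1 openmpi/4.1.4     # gcc-compiled Open MPI

module purge
module load intel/19.0.5 openmpi/4.1.4   # Intel-compiled Open MPI, same MPI version
</code>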
  
Many HPC centers pass slurm directives (with default modules) directly to MPI like this, which has the virtue of simplicity though lacking in control.  In these examples only the slurm commands relating directly to MPI are included, though others are also required.
  
<code>
#SBATCH --nodes=4
#SBATCH --tasks-per-node=16
#SBATCH --cpus-per-task=2
srun ./my_MPI_executable
</code>
  
These slurm directives request that the program be executed on 4 nodes, with 16 MPI processes per node and 2 available threads ("cpus" in slurm-speak) per MPI process.  In most cases, to maximize CPU utilization, (tasks-per-node x cpus-per-task) should equal the number of physical cores in each node (mostly 32 or 64 at AHPCC), as in the sketch below.  Sometimes this can't be maintained, usually because either 1) that many MPI tasks would require more memory than the node has, or 2) the computational grid is tied to a certain number of MPI tasks that doesn't equal what's available on the node.
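
For instance, on a (hypothetical) 64-core node, any one of the following layouts fills the cores:

<code>
# three alternative layouts for a 64-core node; use one pair
#SBATCH --tasks-per-node=64   # pure MPI: 64 ranks x 1 thread
#SBATCH --cpus-per-task=1

#SBATCH --tasks-per-node=32   # hybrid: 32 ranks x 2 threads
#SBATCH --cpus-per-task=2

#SBATCH --tasks-per-node=8    # hybrid: 8 ranks x 8 threads
#SBATCH --cpus-per-task=8
</code>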
  
AHPCC leaves a little more manual control in the process, as shown below.  A program has been compiled using both MPI (explicit) and OpenMP (multithreaded) parallelization, and we will run all three MPI variants one after the other.  Unfortunately the exec/run commands are just a little different for each variant.  As is the usual case, we are trying to make sure that MPI processes are spread evenly across the nodes, using ''-ppn'' (processes-per-node) in mvapich and impi, and ''-np'' (total tasks) plus ''--map-by node'' in openmpi.  We are also trying to make each MPI task execute with two OpenMP threads, which requires each MPI task to see the environment variable ''OMP_NUM_THREADS''.  Each version specifies this differently: ''OMP_NUM_THREADS=2'' in mvapich, ''-x OMP_NUM_THREADS=2'' in openmpi, and ''-genv OMP_NUM_THREADS 2'' (no equals sign) in impi.  In addition, openmpi often needs the environment variables $PATH and $LD_LIBRARY_PATH to be passed to processes with ''-x PATH -x LD_LIBRARY_PATH''.  It's easier just to specify them than to figure out when you need them.

Every slurm job generates a hostfile (a list of the hosts allocated for the job) in the job scratch directory, as shown.  You don't need it unless it's a multi-node job.
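
The file is just a list of the allocated hosts, something like the sketch below (the hostnames are hypothetical, and the exact layout depends on the allocation):

<code>
$ cat /scratch/${SLURM_JOB_ID}/machinefile_${SLURM_JOB_ID}
c1401
c1402
c1403
c1404
</code>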
  
<code>
#SBATCH --nodes=4
#SBATCH --tasks-per-node=16
#SBATCH --cpus-per-task=2
#
omp_threads=$SLURM_CPUS_PER_TASK
mpi_pernode_tasks=$SLURM_NTASKS_PER_NODE
mpi_total_tasks=$SLURM_NTASKS
#
#mvapich
module purge;module load gcc/11.2.1 mvapich2/2.3.7
mpiexec -ppn $mpi_pernode_tasks -hostfile /scratch/${SLURM_JOB_ID}/machinefile_${SLURM_JOB_ID} \
OMP_NUM_THREADS=$omp_threads ./my_mvapich_omp_executable
#
#openmpi
module purge;module load gcc/11.2.1 openmpi/4.1.4
mpiexec -np $mpi_total_tasks --map-by node -hostfile /scratch/${SLURM_JOB_ID}/machinefile_${SLURM_JOB_ID} \
-x LD_LIBRARY_PATH -x PATH -x OMP_NUM_THREADS=$omp_threads ./my_openmpi_omp_executable
#
#impi
module purge;module load gcc/11.2.1 impi/19.0.5
mpiexec -ppn $mpi_pernode_tasks -hostfile /scratch/${SLURM_JOB_ID}/machinefile_${SLURM_JOB_ID} \
-genv OMP_NUM_THREADS $omp_threads ./my_impi_omp_executable
</code>
  
Single-node MPI runs are easier to specify, as you usually just need mpiexec/mpirun, the number of processes, and the name of the executable.
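
A minimal single-node sketch (the executable name is a placeholder):

<code>
module purge; module load gcc/11.2.1 openmpi/4.1.4
mpiexec -np 32 ./my_executable
</code>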
  
For a little more realistic example we will use "osu_bw.c", which measures the bandwidth of the system interconnect.  The output only makes sense for two MPI tasks across either one or two nodes.  In real computation, shared memory is faster than the network, so you almost always want to fill local cores before allocating a second node.

In the first ''mpiexec'' run, with no hostfile specified, MPI will put all the tasks asked for on the first (or the current, if interactive) node.  The resulting bandwidth of about 19 GB/s is what this particular hardware measures for shared memory.  In the second run, we force it to distribute the processes across the nodes (with openmpi's ''--map-by node''), so the resulting bandwidth is that of the EDR InfiniBand network (about 100 Gb/s, or 12 GB/s).  In the third case, we specify a hostfile but don't force it to spread the tasks across nodes.  You can see by the measured shared-memory bandwidth that it did not spread the tasks.  In an actual run, this usually puts all the tasks on the first node and drastically reduces performance, so spreading by host should usually be done.
  
<code>
#SBATCH --nodes=2
#SBATCH --tasks-per-node=1
#SBATCH --cpus-per-task=1
hostfile=/scratch/${SLURM_JOB_ID}/machinefile_${SLURM_JOB_ID}
#
module load gcc/11.2.1 openmpi/4.1.4
mpicc osu_bw.c
#
mpiexec -np 2 ./a.out
# OSU MPI Bandwidth Test (Version 2.2)
# Size Bandwidth (MB/s)
1048576 19409.270824
#
mpiexec -np 2 -hostfile $hostfile --map-by node ./a.out
# OSU MPI Bandwidth Test (Version 2.2)
# Size Bandwidth (MB/s)
1048576 12167.945917
#
mpiexec -np 2 -hostfile $hostfile ./a.out
# OSU MPI Bandwidth Test (Version 2.2)
# Size Bandwidth (MB/s)
1048576 18383.474371
</code>
  
Here we make a similar run with mvapich2 and a newer 64-core node.  The shared-memory bandwidth is considerably higher, though the network bandwidth is about the same (a cost-saving decision made at acquisition time).  Repeating with openmpi, it benchmarks considerably better than mvapich for this problem, though you don't usually see very much difference in full applications.
  
<code>
#SBATCH --nodes=2
#SBATCH --tasks-per-node=2
#SBATCH --cpus-per-task=1
hostfile=/scratch/${SLURM_JOB_ID}/machinefile_${SLURM_JOB_ID}
#
module purge
module load gcc/11.2.1 mvapich2/2.3.7
mpicc osu_bw.c
#
mpiexec -ppn 2 ./a.out
# OSU MPI Bandwidth Test (Version 2.2)
# Size Bandwidth (MB/s)
1048576 22325.640419
#
mpiexec -ppn 1 -hostfile $hostfile ./a.out
# OSU MPI Bandwidth Test (Version 2.2)
# Size Bandwidth (MB/s)
1048576 12208.926829
#
module purge
module load gcc/11.2.1 openmpi/4.1.4
mpicc osu_bw.c
#
mpiexec -np 2 ./a.out
# OSU MPI Bandwidth Test (Version 2.2)
# Size Bandwidth (MB/s)
1048576 44016.948025
#
mpiexec -np 2 --map-by node -hostfile $hostfile ./a.out
# OSU MPI Bandwidth Test (Version 2.2)
# Size Bandwidth (MB/s)
1048576 11316.488573
</code>
  
  