MPI

The great majority of multi-node programs in HPC (as well as many single-node programs) use the MPI parallel programming software (https://www.mcs.anl.gov/research/projects/mpi/). There are many possible options when configuring MPI for a particular set of hardware (the output of ./configure --help for Open MPI is about 700 lines), so a very particular setup is needed. Fortunately, a single setup can be used for many applications.

The most common MPI variants on Linux are Open MPI, MVAPICH, and Intel MPI. The last two are ABI compatible, so mpirun/mpiexec from either MVAPICH or Intel MPI can be used to execute programs compiled with either MVAPICH or Intel MPI. Open MPI is not ABI compatible with either, but in most cases any of the three toolsets can build and run a standard-conforming source file. Usually one MPI variant will work a little better than the others on a particular program.
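
For example (a sketch only, using the module names introduced in the next paragraph, with my_program.c as a placeholder source file), ABI compatibility means a binary built with MVAPICH2 can normally be launched with Intel MPI:

# Build with MVAPICH2
module purge;module load gcc/11.2.1 mvapich2/2.3.7
mpicc -o my_program my_program.c
# Launch the same binary with Intel MPI, which shares the MVAPICH2/MPICH ABI
module purge;module load gcc/11.2.1 impi/19.0.5
mpiexec -np 4 ./my_program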

Modules for the currently supported versions are openmpi/4.1.4, mvapich2/2.3.7, and impi/19.0.5. In our configuration, the open-source MPI builds (openmpi and mvapich2) are compiled for a particular compiler, so a compiler module should be loaded before the MPI module so that the MPI module knows which build to use. Example compiler modules are gcc/11.2.1, intel/19.0.5, and nvhpc/22.7 for the GNU, Intel proprietary, and PGI (NVIDIA) compilers respectively. See compilers.
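
The intended ordering looks like this (a sketch; my_program.c is again a placeholder, and the osu_bw.c compiles later on this page follow the same pattern):

# Load the compiler first, then the MPI build that matches it
module purge
module load gcc/11.2.1
module load openmpi/4.1.4
# mpicc now wraps gcc 11.2.1
mpicc -o my_program my_program.c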

Many HPC centers pass the slurm directives (with default modules) directly to MPI like this, which has the virtue of simplicity though it gives up some control.

#SBATCH --nodes=4
#SBATCH --tasks-per-node=16
#SBATCH --cpus-per-task=2
srun ./my_MPI_executable

Here the intent is that the program runs on 4 nodes, with 16 MPI processes per node and 2 available threads (cpus in slurm-speak) per MPI process. In most cases, to maximize CPU utilization, tasks-per-node x cpus-per-task should equal the number of physical cores in each node. Sometimes this can't be maintained, usually because either 1) it would require more memory than the node has, or 2) the computational grid is tied to a number of MPI processes that doesn't match what's available on the node.
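
As an illustration of that arithmetic (a sketch only, assuming a hypothetical 64-core node), fully packing the cores would look like:

# 32 MPI processes x 2 threads per process = 64 physical cores per node
#SBATCH --nodes=4
#SBATCH --tasks-per-node=32
#SBATCH --cpus-per-task=2
srun ./my_MPI_executable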

We leave a little more manual control in the process, as shown below. A program has been compiled with both MPI (explicit) and OpenMP (multithreaded) parallelization, and we run all three MPI variants one after the other. Unfortunately the commands are just a little different for each variant. We are trying to make sure that MPI processes are spread evenly across the nodes, using -ppn (processes per node) in mvapich and impi, and -np (total tasks) with --map-by node in openmpi. We are also trying to make each MPI task execute with two OpenMP threads, which requires each MPI task to see the environment variable OMP_NUM_THREADS. Each variant passes it differently: OMP_NUM_THREADS=2 before the executable in mvapich, -x OMP_NUM_THREADS=2 in openmpi, and -genv OMP_NUM_THREADS 2 (no equals sign) in impi. In addition, openmpi often needs the environment variables $PATH and $LD_LIBRARY_PATH passed to its processes.

Every slurm job generates a hostfile (a list of the hosts allocated to the job) in the job scratch directory, as shown below. You don't need it unless it's a multi-node job.
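
As a quick check from inside a job script (a sketch; the path matches the machinefile naming used below), you can print the hostfile directly:

# Show the hosts slurm allocated to this job
cat /scratch/${SLURM_JOB_ID}/machinefile_${SLURM_JOB_ID}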

#SBATCH --nodes=4
#SBATCH --tasks-per-node=16
#SBATCH --cpus-per-task=2
#
omp_threads=$SLURM_CPUS_PER_TASK
mpi_pernode_tasks=$SLURM_NTASKS_PER_NODE
mpi_total_tasks=$SLURM_NTASKS
#
#mvapich 
module purge;module load gcc/11.2.1 mvapich2/2.3.7
mpiexec -ppn $mpi_pernode_tasks -hostfile /scratch/${SLURM_JOB_ID}/machinefile_${SLURM_JOB_ID} \
OMP_NUM_THREADS=$omp_threads ./my_mvapich_omp_executable
#
#openmpi
module purge;module load gcc/11.2.1 openmpi/4.1.4
mpiexec -np $mpi_total_tasks --map-by node -hostfile /scratch/${SLURM_JOB_ID}/machinefile_${SLURM_JOB_ID} \
-x LD_LIBRARY_PATH -x PATH -x OMP_NUM_THREADS=$omp_threads  ./my_openmpi_omp_executable
#
#impi
module purge;module load gcc/11.2.1 impi/19.0.5
mpiexec -ppn $mpi_pernode_tasks -hostfile /scratch/${SLURM_JOB_ID}/machinefile_${SLURM_JOB_ID} \
-genv OMP_NUM_THREADS $omp_threads ./my_impi_omp_executable

Single-node MPI runs are easier to specify: you usually need only mpiexec/mpirun, the number of processes, and the name of the executable.
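
A minimal single-node job might look like this (a sketch, assuming openmpi and a placeholder executable name):

#SBATCH --nodes=1
#SBATCH --tasks-per-node=16
#SBATCH --cpus-per-task=1
module purge;module load gcc/11.2.1 openmpi/4.1.4
# All 16 ranks run on the single allocated node, so no hostfile or mapping flags are needed
mpiexec -np 16 ./my_single_node_executable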

For a somewhat more realistic example we will use osu_bw.c, which measures the bandwidth of the system interconnect. The output only makes sense for two MPI tasks, placed on either one node or two. In real computation shared memory is faster than the network, so you almost always want to fill the local cores before allocating a second node.

In the first mpiexec run, with no hostfile specified, MPI puts all of the requested tasks on the first (or, if interactive, the current) node. The measured bandwidth of about 19 GB/s is therefore the shared-memory bandwidth of this particular hardware. In the second run we force the processes to be distributed across the nodes (with openmpi's --map-by node), so the measured bandwidth is that of the EDR InfiniBand network (about 100 Gb/s, or roughly 12 GB/s). In the third case we specify a hostfile but do not force the tasks to spread across nodes; the shared-memory bandwidth shows that they were not spread. In an actual run this usually puts all the tasks on the first node and drastically reduces performance. A quick way to check placement is sketched after the listing below.

#SBATCH --nodes=2
#SBATCH --tasks-per-node=1
#SBATCH --cpus-per-task=1
hostfile=/scratch/${SLURM_JOB_ID}/machinefile_${SLURM_JOB_ID}
#
module load gcc/11.2.1 openmpi/4.1.4
mpicc osu_bw.c
#
mpiexec -np 2 ./a.out
# OSU MPI Bandwidth Test (Version 2.2)
# Size		Bandwidth (MB/s) 
1048576		19409.270824
#
mpiexec -np 2 -hostfile $hostfile --map-by node ./a.out
# OSU MPI Bandwidth Test (Version 2.2)
# Size		Bandwidth (MB/s) 
1048576		12167.945917
#
mpiexec -np 2 -hostfile $hostfile ./a.out
# OSU MPI Bandwidth Test (Version 2.2)
# Size		Bandwidth (MB/s) 
1048576		18383.474371
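
One quick way to confirm where the ranks landed (a sketch, assuming the openmpi module loaded above) is to launch hostname the same way as the benchmark:

# Each rank prints its host; two different hostnames means the tasks were spread across nodes
mpiexec -np 2 --map-by node -hostfile $hostfile hostname
# Without --map-by node, both ranks typically report the first node
mpiexec -np 2 -hostfile $hostfile hostname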

Here we make a similar run with mvapich2 on a newer 64-core node. The shared-memory bandwidth is considerably higher, though the network bandwidth is about the same (a cost-saving decision made at acquisition time). Repeating with openmpi, it benchmarks considerably better than mvapich on this problem, though you don't usually see much difference in full applications.

#SBATCH --nodes=2
#SBATCH --tasks-per-node=1
#SBATCH --cpus-per-task=1
hostfile=/scratch/${SLURM_JOB_ID}/machinefile_${SLURM_JOB_ID}
#
module purge
module load gcc/11.2.1 mvapich2/2.3.7
mpicc osu_bw.c
#
mpiexec -ppn 2 ./a.out
# OSU MPI Bandwidth Test (Version 2.2)
# Size		Bandwidth (MB/s) 
1048576		22325.640419

mpiexec -ppn 1 -hostfile $hostfile ./a.out
# OSU MPI Bandwidth Test (Version 2.2)
# Size		Bandwidth (MB/s) 
1048576		12208.926829
#
module purge
module load gcc/11.2.1 openmpi/4.1.4
mpicc osu_bw.c
#
mpiexec -np 2 ./a.out
# OSU MPI Bandwidth Test (Version 2.2)
# Size		Bandwidth (MB/s) 
1048576		44016.948025
#
mpiexec -np 2 --map-by node -hostfile $hostfile ./a.out
# OSU MPI Bandwidth Test (Version 2.2)
# Size		Bandwidth (MB/s) 
1048576		11316.488573