MPI

The great majority of multi-node programs in HPC (as well as many single-node programs) use the MPI (Message Passing Interface) parallel programming model: https://www.mcs.anl.gov/research/projects/mpi/. There are many possible options when configuring MPI for a particular set of hardware (the output of ./configure --help for Open MPI runs to about 700 lines), so a carefully chosen setup is needed for best performance. Fortunately, a single setup can be used for many applications.

The most common MPI variants in Linux are Open MPI, MVAPICH, and Intel MPI. The last two are both derived from the earlier MPICH and are ABI compatible, so that mpirun/mpiexec from either MVAPICH or Intel MPI can be used to execute programs compiled with either MVAPICH or Intel MPI. Open MPI is not ABI compatible with either, but in most cases any of the three toolsets can be used end to end (compile and run) with a standard-conforming source file. Usually one MPI variant will work a little better on a particular program.
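
As an illustration of that ABI compatibility, the following sketch compiles with MVAPICH2 and launches the same binary with Intel MPI. The source file name hello_mpi.c is a placeholder; the module versions are the ones listed below.

# compile with MVAPICH2, then launch the same binary with Intel MPI (hello_mpi.c is hypothetical)
module purge; module load gcc/11.2.1 mvapich2/2.3.7
mpicc hello_mpi.c -o hello_mpi
module purge; module load gcc/11.2.1 impi/19.0.5
mpiexec -np 4 ./hello_mpi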

Modules for currently supported versions are openmpi/4.1.4, mvapich2/2.3.7, and impi/19.0.5. In our configuration, the open-source MPI versions openmpi and mvapich2 are compiled separately for each compiler, so a compiler module should be loaded before the MPI module to tell the MPI module which build to use. Example compiler modules are gcc/11.2.1, intel/19.0.5, and nvhpc/22.7 for the GNU, Intel proprietary, and NVIDIA (formerly PGI) compilers respectively. Each provides C, Fortran, and C++ compilers; see compilers. The Intel proprietary compiler (as opposed to the newer Intel clang-based compiler) has run-time scripts that adapt Intel MPI to a given compiler, but we still ask for the modules in the order (compiler) (MPI version) at load time, the same way as for openmpi and mvapich2.
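
As a concrete sketch of that load order, using module combinations that appear elsewhere on this page (the wrapper-query flags are standard to each MPI family, not AHPCC-specific):

# GNU compilers with Open MPI
module purge; module load gcc/11.2.1 openmpi/4.1.4
mpicc --showme          # show the underlying compiler and flags (Open MPI)
# GNU compilers with MVAPICH2
module purge; module load gcc/11.2.1 mvapich2/2.3.7
mpicc -show             # same information for the MPICH-derived variants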

Many HPC centers pass Slurm directives (with default modules) directly to MPI like this, which has the virtue of simplicity, though it lacks control. In these examples only the Slurm directives relating directly to MPI are included; others are also required.

#SBATCH --nodes=4
#SBATCH --tasks-per-node=16
#SBATCH --cpus-per-task=2
srun ./my_MPI_executable

The Slurm directives ask for this program to be executed on 4 nodes, with 16 MPI processes per node and 2 available threads (cpus in Slurm terms) per MPI process. In most cases, to maximize CPU utilization, ( tasks-per-node x cpus-per-task ) should equal the number of physical cores in each node (mostly 32 or 64 at AHPCC); here 16 x 2 = 32. Sometimes this can't be maintained, usually because either 1) that many MPI tasks would require more memory than the node has, or 2) the computational grid is tied to a certain number of MPI tasks that doesn't match what's available on the node.
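
For instance, on a hypothetical 64-core node the product can be kept at 64 in more than one way; a sketch, not a recommendation for any particular code:

# 64 physical cores per node: 32 MPI tasks x 2 OpenMP threads = 64
#SBATCH --nodes=4
#SBATCH --tasks-per-node=32
#SBATCH --cpus-per-task=2
# or equivalently 64 tasks x 1 cpu for a pure-MPI program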

AHPCC leaves a little more manual control in the process, as shown below. A program has been compiled using both MPI (explicit message passing) and OpenMP (multithreaded) parallelization, and we will run all three MPI variants one after the other. Unfortunately the exec/run commands are slightly different for each variant. As is usually the case, we are trying to make sure that MPI processes are spread evenly across the nodes, using -ppn (processes per node) in mvapich2 and impi, and -np (total tasks) plus --map-by node in openmpi. We are also trying to make each MPI task execute with two OpenMP threads, which requires each MPI task to see the environment variable OMP_NUM_THREADS. Each variant specifies this differently: OMP_NUM_THREADS=2 before the executable in mvapich2, -x OMP_NUM_THREADS=2 in openmpi, and -genv OMP_NUM_THREADS 2 (no equals sign) in impi. In addition, openmpi often needs the environment variables $PATH and $LD_LIBRARY_PATH to be passed to the remote processes with -x. It's easier just to specify them than to figure out when you need them.

Every Slurm job generates a hostfile (a list of the hosts allocated to the job) in the job scratch directory, as shown below, but you don't need it unless it's a multi-node job.
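
If you want to see what the hostfile contains, you can simply print it inside the job script; the host names you will see depend on the allocation.

# print the list of hosts allocated to this job, one per line
cat /scratch/${SLURM_JOB_ID}/machinefile_${SLURM_JOB_ID}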

#SBATCH --nodes=4
#SBATCH --tasks-per-node=16
#SBATCH --cpus-per-task=2
#
omp_threads=$SLURM_CPUS_PER_TASK
mpi_pernode_tasks=$SLURM_NTASKS_PER_NODE
mpi_total_tasks=$SLURM_NTASKS
#
#mvapich2
module purge;module load gcc/11.2.1 mvapich2/2.3.7
mpiexec -ppn $mpi_pernode_tasks -hostfile /scratch/${SLURM_JOB_ID}/machinefile_${SLURM_JOB_ID} \
OMP_NUM_THREADS=$omp_threads ./my_mvapich_omp_executable
#
#openmpi
module purge;module load gcc/11.2.1 openmpi/4.1.4
mpiexec -np $mpi_total_tasks --map-by node -hostfile /scratch/${SLURM_JOB_ID}/machinefile_${SLURM_JOB_ID} \
-x LD_LIBRARY_PATH -x PATH -x OMP_NUM_THREADS=$omp_threads ./my_openmpi_omp_executable
#
#impi
module purge;module load gcc/11.2.1 impi/19.0.5
mpiexec -ppn $mpi_pernode_tasks -hostfile /scratch/${SLURM_JOB_ID}/machinefile_${SLURM_JOB_ID} \
-genv OMP_NUM_THREADS $omp_threads ./my_impi_omp_executable

Single-node MPI runs are easier to specify, as you usually just need mpiexec/mpirun, the number of processes, and the name of the executable.
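
A minimal single-node sketch, assuming a 32-core node and a placeholder executable name:

#SBATCH --nodes=1
#SBATCH --tasks-per-node=32
#SBATCH --cpus-per-task=1
module purge;module load gcc/11.2.1 openmpi/4.1.4
# no hostfile or --map-by needed: all 32 tasks run on this node
mpiexec -np $SLURM_NTASKS ./my_single_node_executable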

For a slightly more realistic example we will use osu_bw.c, which measures the bandwidth of the system interconnect. The output only makes sense for two MPI tasks, across either one or two nodes. In real computation, shared memory is faster than the network, so you almost always want to fill local cores before allocating a second node.

In the first mpiexec run, with no hostfile specified, MPI puts all the requested tasks on the first (or, if interactive, the current) node. The resulting bandwidth of about 19 GB/s is the shared-memory bandwidth of this particular hardware. In the second run, we force the processes to be distributed across the nodes (with openmpi's --map-by node), so the resulting bandwidth is that of the EDR InfiniBand network (about 100 Gb/s, or roughly 12 GB/s). In the third case, we specify a hostfile but don't force the tasks to be spread across nodes. You can see from the measured shared-memory bandwidth that it did not spread the tasks. In an actual run, this usually puts all the tasks on the first node and drastically reduces performance, so spreading by host should usually be done.

#SBATCH --nodes=2
#SBATCH --tasks-per-node=1
#SBATCH --cpus-per-task=1
hostfile=/scratch/${SLURM_JOB_ID}/machinefile_${SLURM_JOB_ID}
#
module load gcc/11.2.1 openmpi/4.1.4
mpicc osu_bw.c
#
mpiexec -np 2 ./a.out
# OSU MPI Bandwidth Test (Version 2.2)
# Size		Bandwidth (MB/s) 
1048576		19409.270824
#
mpiexec -np 2 -hostfile $hostfile --map-by node ./a.out
# OSU MPI Bandwidth Test (Version 2.2)
# Size		Bandwidth (MB/s) 
1048576		12167.945917
#
mpiexec -np 2 -hostfile $hostfile ./a.out
# OSU MPI Bandwidth Test (Version 2.2)
# Size		Bandwidth (MB/s) 
1048576		18383.474371

Here we make a similar run with mvapich2 on a newer 64-core node. The shared-memory bandwidth is considerably higher, though the network bandwidth is about the same (a cost-saving decision made at acquisition time). Repeating with openmpi, it benchmarks considerably better than mvapich2 on this problem, though you don't usually see much difference in full applications.

#SBATCH --nodes=2
#SBATCH --tasks-per-node=1
#SBATCH --cpus-per-task=1
hostfile=/scratch/${SLURM_JOB_ID}/machinefile_${SLURM_JOB_ID}
#
module purge
module load gcc/11.2.1 mvapich2/2.3.7
mpicc osu_bw.c
#
mpiexec -ppn 2 ./a.out
# OSU MPI Bandwidth Test (Version 2.2)
# Size		Bandwidth (MB/s) 
1048576		22325.640419

mpiexec -ppn 1 -hostfile $hostfile ./a.out
# OSU MPI Bandwidth Test (Version 2.2)
# Size		Bandwidth (MB/s) 
1048576		12208.926829
#
module purge
module load gcc/11.2.1 openmpi/4.1.4
mpicc osu_bw.c
#
mpiexec -np 2 ./a.out
# OSU MPI Bandwidth Test (Version 2.2)
# Size		Bandwidth (MB/s) 
1048576		44016.948025
#
mpiexec -np 2 --map-by node -hostfile $hostfile ./a.out
# OSU MPI Bandwidth Test (Version 2.2)
# Size		Bandwidth (MB/s) 
1048576		11316.488573