==== MPI ====
  
The great majority of multi-node programs in HPC (as well as many single-node programs) use the [[https://www.mcs.anl.gov/research/projects/mpi/|MPI (Message Passing Interface)]] parallel programming software.  There are many possible options when configuring MPI for a particular set of hardware (the output of ''./configure --help'' for Open MPI is about 700 lines), so a carefully chosen setup is needed for best performance. Fortunately, a single setup can be used for many applications.
  
The most common MPI variants on Linux are Open MPI, MVAPICH, and Intel MPI.  The last two are both derived from the earlier ''mpich'' and are ABI compatible, so ''mpirun''/''mpiexec'' from either MVAPICH or Intel MPI can be used to execute programs compiled with either of them.  Open MPI is not ABI compatible with either, but in most cases any of the three toolsets can be used (all the way through, compile and run) with a standard-conforming source file.  Usually one MPI variant will work a little better on a particular program.
  
Modules for currently supported versions are ''openmpi/4.1.4, mvapich2/7.3.2, impi/19.0.5''. In our configuration, the open-source MPI versions openmpi and mvapich2 are compiled for a particular compiler.
So a compiler module should be specified before the MPI module, to tell the MPI module which code path to use.  Example compiler modules are ''gcc/11.2.1, intel/19.0.5, nvhpc/22.7'' for the GNU, Intel proprietary, and PGI compilers respectively; each provides C, Fortran, and C++ compilers. See [[compilers]].  The Intel proprietary compiler (as opposed to the newer Intel clang-based compiler) has run-time scripts to match itself to a given compiler, but we ask for (compiler) (MPI version) at module load time in the same way as for openmpi and mvapich.
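
For example, to compile an MPI program against Open MPI built for the GNU compilers, load the compiler module first and then the MPI module. This is only a sketch; ''mycode.c'' stands for your own source file.

<code>
# load the compiler first, then the MPI build that matches it
module load gcc/11.2.1
module load openmpi/4.1.4
# the MPI wrapper compilers (mpicc, mpif90, mpicxx) then use the selected compiler
mpicc -O2 -o mycode mycode.c
</code>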
  
Many HPC centers pass slurm directives (with default modules) directly to MPI like this, which has the virtue of simplicity though it lacks control. In these examples only the slurm commands relating directly to MPI are included, though others are also required.
  
<code>
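# a minimal sketch of such a batch script; ./mycode is a placeholder executable
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=16
#SBATCH --cpus-per-task=2
srun ./mycode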
</code>
  
These slurm directives ask for the program to be executed on 4 nodes, with 16 MPI processes per node and 2 available threads (cpus in slurm-speak) per MPI process.  In most cases, to maximize CPU utilization, tasks-per-node x cpus-per-task should equal the number of physical cores in each node (mostly 32 or 64 at AHPCC); here 16 x 2 = 32. Sometimes this can't be maintained, usually because either 1) that many MPI tasks would require more memory than the node has, or 2) the computational grid is tied to a certain number of MPI tasks that doesn't match what's available on the node.
  
AHPCC leaves a little more manual control in the process, as shown below. A program has been compiled with both MPI (explicit) and OpenMP (multithreaded) parallelization, and we will run all three MPI variants one after the other. Unfortunately the exec/run commands are just a little different for each variant. As is usually the case, we are trying to make sure that MPI processes are spread evenly across the nodes, using ''-ppn'' (processes per node) in mvapich and impi, and ''-np'' (total tasks) plus ''--map-by node'' in openmpi.  We are also trying to make each MPI task execute with two OpenMP threads, which requires each MPI task to see the environment variable ''OMP_NUM_THREADS''.  Each variant specifies this differently: ''OMP_NUM_THREADS=2'' in the environment for mvapich, ''-x OMP_NUM_THREADS=2'' in openmpi, and ''-genv OMP_NUM_THREADS 2'' (no equals sign) in impi. In addition, openmpi often needs the environment variables $PATH and $LD_LIBRARY_PATH to be passed to the processes with ''-x''; it's easier just to pass them than to figure out when you need them.
  
Every slurm job generates a hostfile (a list of the hosts allocated for the job) in the job scratch directory as shown.  But you don't need it unless it's a multi-node job.
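
A sketch of what the three launchers might look like for the 4-node, 16-task-per-node job above; ''$HOSTFILE'' stands for the path of the job's hostfile and ''./mycode'' is a placeholder executable.

<code>
# mvapich2: -ppn sets processes per node; OMP_NUM_THREADS is taken from the environment
OMP_NUM_THREADS=2 mpiexec -n 64 -ppn 16 -f $HOSTFILE ./mycode

# openmpi: total task count with -np, spread across nodes with --map-by node,
# environment variables passed explicitly with -x
mpirun -np 64 --map-by node -hostfile $HOSTFILE -x OMP_NUM_THREADS=2 -x PATH -x LD_LIBRARY_PATH ./mycode

# intel mpi: -ppn as in mvapich2, environment variables passed with -genv (no equals sign)
mpiexec -n 64 -ppn 16 -hostfile $HOSTFILE -genv OMP_NUM_THREADS 2 ./mycode
</code>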
Single-node MPI runs are easier to specify as you usually just need mpiexec/mpirun, the number of processes, and the name of the executable.
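
For instance, a sketch of running a hypothetical ''./mycode'' on all 32 cores of a single node:

<code>
mpirun -np 32 ./mycode
</code>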
  
For a slightly more realistic example we will use ''osu_bw.c'', which measures the bandwidth of the system interconnect.  The output only makes sense for two MPI tasks across either one or two nodes. In real computation, shared memory is faster than the network, so you almost always want to fill local cores before allocating a second node.
  
In the first ''mpiexec'' run, with no hostfile specified, MPI puts all the requested tasks on the first (or the current, if interactive) node.  The resulting bandwidth of about 19 GB/s is the shared-memory bandwidth of this particular hardware.  In the second run, we force it to distribute the processes across the nodes (with openmpi ''--map-by node''), so the resulting bandwidth is that of the EDR InfiniBand network (about 100 Gb/s, or 12 GB/s).  In the third case, we specify a hostfile but don't force it to spread the tasks across nodes.  You can see from the measured shared-memory bandwidth that it did not spread the tasks. In an actual run, this usually puts all the tasks on the first node and drastically reduces performance, so spreading by host should usually be done.
  
<code>
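# illustrative commands only; the bandwidths quoted are from the runs described above
# run 1: no hostfile, both tasks on one node -> shared memory, about 19 GB/s
mpiexec -np 2 ./osu_bw
# run 2: one task per node -> EDR InfiniBand, about 12 GB/s
mpiexec -np 2 -hostfile $HOSTFILE --map-by node ./osu_bw
# run 3: hostfile given but no mapping -> tasks stay on the first node, shared memory again
mpiexec -np 2 -hostfile $HOSTFILE ./osu_bw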
</code>
  
Here we make a similar run with mvapich2 and a newer 64-core node.  The shared-memory bandwidth is considerably higher, though the network bandwidth is about the same (a cost-saving decision made at acquisition time).  Repeating with openmpi, it benchmarks considerably better than mvapich on this problem, though you don't usually see very much difference in full applications.
  
<code>
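# a sketch of the comparison runs (output omitted); $HOSTFILE is a placeholder hostfile path
# mvapich2 on a 64-core node: shared memory first, then one task per node over the network
mpiexec -n 2 ./osu_bw
mpiexec -n 2 -ppn 1 -f $HOSTFILE ./osu_bw
# the same two measurements repeated with openmpi
mpirun -np 2 ./osu_bw
mpirun -np 2 --map-by node -hostfile $HOSTFILE ./osu_bw
</code>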