Version 5.1 Compilation
With the Intel compiler and either OpenMPI or MVAPICH2:
OpenMPI:
```
DFLAGS         = -D__INTEL -D__FFTW3 -D__MPI -D__PARA -D__SCALAPACK $(MANUAL_DFLAGS)
IFLAGS         = -I../include
MPIF90         = mpif90
CFLAGS         = -O3 -xSSE2 -axavx $(DFLAGS) $(IFLAGS)
F90FLAGS       = $(FFLAGS) -nomodule -fpp $(FDFLAGS) $(IFLAGS) $(MODFLAGS)
FFLAGS         = -O2 -xSSE2 -axavx -assume byterecl -g -traceback -par-report0 -vec-report0
FFLAGS_NOOPT   = -O0 -assume byterecl -g -traceback
FFLAGS_NOMAIN  = -nofor_main
LD             = mpif90
LDFLAGS        = -static-intel
SCALAPACK_LIBS = -lmkl_scalapack_lp64 -lmkl_blacs_openmpi_lp64
FFT_LIBS       = -L ${MKL_ROOT}/interfaces/fftw3xf -lfftw3xf_intel
```
MVAPICH2: same except
```
SCALAPACK_LIBS = -lmkl_scalapack_lp64 -lmkl_blacs_intelmpi_lp64
```
Trestles: same except no `-axavx` (though an "optional" code path, it makes the program fail on AMD).
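For context, a minimal sketch of how these settings are applied in a QE 5.1 build, assuming a standard source tree in which `./configure` writes `make.sys` and that file is then edited to the values above (the archive name is illustrative):

```bash
# Hypothetical QE 5.1 build sequence; module versions follow the runs below.
module load intel/14.0.3 mkl/14.0.3 openmpi/1.8.8
tar xzf espresso-5.1.tar.gz
cd espresso-5.1
./configure --enable-parallel    # writes make.sys
# edit make.sys so DFLAGS, FFLAGS, SCALAPACK_LIBS, FFT_LIBS match the values above
make pw                          # builds bin/pw.x
```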
Benchmarks
We run AUSURF112 from Espresso Benchmarks and compare with Glenn Lockwood, who ran the AUSURF112 benchmark on SDSC Gordon and on the Trestles system when it was at SDSC. Unfortunately, the AUSURF112 benchmark generally ends the simulation with `convergence NOT achieved after 2 iterations: stopping`, but it does so fairly repeatably, so it can still be timed.
OpenMPI:
```
module load intel/14.0.3 mkl/14.0.3 openmpi/1.8.8
mpirun -np 64 -machinefile $PBS_NODEFILE -x LD_LIBRARY_PATH \
  /share/apps/espresso/espresso-5.1-intel-openmpi/bin/pw.x -npools 1 <ausurf.in
```
MVAPICH2:
```
module load intel/14.0.3 mkl/14.0.3 mvapich2/2.1
mpirun -np 64 -machinefile $PBS_NODEFILE \
  /share/apps/espresso/espresso-5.1-intel-mvapich2/bin/pw.x -npools 1 <ausurf.in
```
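The `$PBS_NODEFILE` reference implies these commands ran inside a PBS batch job; a hedged sketch of such a wrapper for the OpenMPI case, where the job name, resource request, walltime, and log name are placeholders for the local queue setup:

```bash
#!/bin/bash
#PBS -N ausurf112              # placeholder job name
#PBS -l nodes=4:ppn=16         # 16x4 cores as in the table below
#PBS -l walltime=01:00:00      # placeholder walltime
cd $PBS_O_WORKDIR
module load intel/14.0.3 mkl/14.0.3 openmpi/1.8.8
mpirun -np 64 -machinefile $PBS_NODEFILE -x LD_LIBRARY_PATH \
  /share/apps/espresso/espresso-5.1-intel-openmpi/bin/pw.x -npools 1 <ausurf.in >ausurf.log
```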
The table shows Lockwood's times and ours. We add 32×4-core runs for Trestles, as we think node-to-node is the more representative comparison. Our newer version of OpenMPI shows better results than Lockwood's on almost identical hardware.
System | Cores x Nodes | Walltime (s), Intel/MVAPICH2 | Walltime (s), Intel/OpenMPI |
---|---|---|---|
Lockwood Gordon E5-2670 | 16x4 | 470 | 580 |
Lockwood Trestles AMD 6136 | 32x2 | 1060 | 1440 |
Our E5-2650 v2 | 16x4 | n/a | 475 |
Our E5-2670 | 16x4 | 456 | 488 |
Our Trestles AMD 6136 | 32x2 | (1) | 1007 |
Our Trestles AMD 6136 | 32x4 | 642 | 762 |
(1) Fails with the error "charge is wrong".
Notes
Each run fails with error messages (which depend on the MPI type) and return code 1 after terminating normally according to the log. This appears to be harmless:
```
   This run was terminated on:  13: 2:44  11Nov2015

=------------------------------------------------------------------------------=
   JOB DONE.
=------------------------------------------------------------------------------=
-------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code.. Per user-direction, the job has
been aborted.
-------------------------------------------------------
------------------------------------------------------------
A process or daemon was unable to complete a TCP connection
to another process:
etc.
```
Continuing Work
- ELPA in newer versions of Espresso is reportedly faster than ScaLAPACK.
- OpenMPI threading.
- MKL threading (a thread-sweep sketch follows this list).
- FFTW FFT vs. Intel FFT on AMD.
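As a starting point for the two threading items, a sweep over OMP and MKL thread counts might look like the sketch below; the loop values and log naming are illustrative only, and the binary, input, and modules are the 5.1 OpenMPI ones used above.

```bash
#!/bin/bash
# Illustrative OMP/MKL thread sweep (values are examples, not tested settings).
module load intel/14.0.3 mkl/14.0.3 openmpi/1.8.8
for OMP in 1 2 4; do
  for MKL in 1 2; do
    export OMP_NUM_THREADS=$OMP MKL_NUM_THREADS=$MKL
    NP=$((64 / OMP))            # keep MPI ranks x OMP threads at 64 cores total
    echo "== np=$NP OMP=$OMP MKL=$MKL =="
    time mpirun -np $NP -machinefile $PBS_NODEFILE \
      -x LD_LIBRARY_PATH -x OMP_NUM_THREADS -x MKL_NUM_THREADS \
      /share/apps/espresso/espresso-5.1-intel-openmpi/bin/pw.x -npools 1 \
      <ausurf.in >log.np$NP.omp$OMP.mkl$MKL
  done
done
```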
On Trestles with Intel tools, it's difficult to find a combination of versions that is new enough to compile qe-6.6 (compiler > 18), still produces a binary that runs on AMD Bulldozer (MKL < 20), and avoids most MKL bugs (MKL > 18).
```
module load intel/18.0.2 mkl/19.0.5 impi/17.0.4
MKL_NUM_THREADS=# OMP_NUM_THREADS=# mpirun -np ## -machinefile machinefile \
  /share/apps/espresso/espresso-6.6-intel-impi-mkl-trestles/bin/pw.x <ausurf.in
```
QE appears to be a code that does not like (MPI ranks × OMP threads) greater than the number of physical cores. Performance on two Trestles nodes is better than the previous 5.1 benchmarks, but it doesn't scale past two nodes on this small problem.
Cores/node | Node type | #MPI | #nodes | #OMP | #MKL | Wall |
---|---|---|---|---|---|---|
32 | Trestles AMD | 32 | 1 | 1 | 1 | 12m11s |
32 | Trestles AMD | 32 | 1 | 2 | 1 | 50m41s |
32 | Trestles AMD | 32 | 1 | 1 | 2 | >14m |
32 | Trestles AMD | 16 | 1 | 2 | 1 | >14m |
32 | Trestles AMD | 64 | 2 | 1 | 1 | 8m33s |
32 | Trestles AMD | 128 | 4 | 1 | 1 | 8m42s |
32 | 6130 Intel | 32 | 1 | 1 | 1 | 2m35s |
32 | 6130 Intel | 32 | 1 | 1 | 1 | 2m33s * |
32 | 6130 Intel | 32 | 1 | 1 | 1 | 2m58s *** |
48 | 7402 AMD | 48 | 1 | 1 | 1 | 1m58s |
48 | 7402 AMD | 24 | 1 | 2 | 2 | 3m32s |
48 | 7402 AMD | 96 | 1 | 2 | 2 | 2m20s ** |
(*) QE 6.5 version. (**) Tested with one MPI rank per hardware thread, which is negative for performance versus one rank per physical core.
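For reference, the best Trestles line in the table (64 ranks over 2 nodes, single-threaded OMP and MKL) corresponds to a launch roughly like the following; the machinefile contents and log name are site-specific placeholders.

```bash
# Illustrative launch of the best Trestles configuration above (8m33s row).
module load intel/18.0.2 mkl/19.0.5 impi/17.0.4
export OMP_NUM_THREADS=1 MKL_NUM_THREADS=1
mpirun -np 64 -machinefile machinefile \
  /share/apps/espresso/espresso-6.6-intel-impi-mkl-trestles/bin/pw.x <ausurf.in >log
```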
Install script
```
OMP="--enable-openmp"
VERSION=6.6
./install/configure MPIF90=mpiifort F90=ifort F77=ifort FC=ifort CC=icc \
  SCALAPACK_LIBS="-L$MKLROOT/lib/intel64 -lmkl_scalapack_lp64 -lmkl_blacs_intelmpi_lp64" \
  LAPACK_LIBS="-L$MKLROOT/lib/intel64 -lmkl_lapack95_lp64 -lmkl_blas95_lp64" \
  BLAS_LIBS="-lmkl_intel_lp64 -lmkl_intel_thread -lmkl_core -liomp5 -thread" \
  FFT_LIBS="-L$MKLROOT/interfaces/fftw3xf -lfftw3xf_intel" \
  FFLAGS="-O3 -xHOST -D__INTEL -D__GNUC__ -D__FFTW3 -D__MPI -D__PARA -D__SCALAPACK \
    -assume byterecl -I$MKLROOT/include/fftw" \
  CFLAGS="-O3 -xHOST -D__INTEL -D__GNUC__ -D__FFTW3 -D__MPI -D__PARA -D__SCALAPACK" \
  --with-hdf5=/share/apps/hdf5/1.10.5/intel/impi --with-scalapack=intel \
  --enable-parallel $OMP --prefix=/share/apps/espresso/espresso-$VERSION-intel-impi-mkl-trestles
```
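The configure step is followed by a build and install in the usual way; a sketch assuming the stock QE make targets and the module set shown earlier for the 6.6 Trestles runs:

```bash
# Hypothetical build/install step after the configure line above.
module load intel/18.0.2 mkl/19.0.5 impi/17.0.4
make pw          # build pw.x; "make all" for the full suite
make install     # copies binaries into the --prefix given to configure
```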
QE 6.8 and 7.1 are installed, each in two versions compiled with the Intel compiler (“skylake” for Intel and “bulldozer” for AMD).
“skylake” uses `-xHOST` and is compiled on Pinnacle I. “bulldozer” uses `-msse3 -axsse3,sse4.2,AVX,core-AVX2,CORE-AVX512` and is compiled on Trestles, so it should work at some speed on all systems. Both use `module load intel/19.0.5 mkl/19.0.5 impi/17.0.4`.
Notes on the AMD platforms:
- The `impi/19.0.5` module causes a fault on the AMD platforms for unknown reasons.
- The “skylake” binary causes a fault on the AMD platforms because of a single AVX512 code path.
- `export MKL_DEBUG_CPU_TYPE=5`, as set on AMD by the `mkl` modules older than version 20, causes a fault on the AMD platforms (failure on Trestles, wrong answer on Pinnacle II).
- For AMD, explicitly set `export MKL_DEBUG_CPU_TYPE=0` after the module load (see the setup sketch below).
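Putting these notes together, the AMD environment setup before a run looks roughly like this; the override must come after the module load so it takes effect:

```bash
# AMD (Trestles / Pinnacle II) environment for the "bulldozer" binaries.
module load intel/19.0.5 mkl/19.0.5 impi/17.0.4   # not impi/19.0.5, which faults on AMD
export MKL_DEBUG_CPU_TYPE=0                       # override the =5 set by mkl modules < 20
```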
A small pw.x input was used to allow some parameter sweeps.
Single Node Results
```
{OMP_NUM_THREADS=1|2} time mpirun -np {16|32|64} \
  /share/apps/espresso/espresso-{6.8|7.1}-intel-impi-mkl-{skylake|bulldozer}/bin/pw.x \
  -nk {1|4|8|16} <scf.in >log
```
System | Cores | CPU | QE version | compile | #MPI | #OMP | #MKL | -nk | Wall Time (s) |
---|---|---|---|---|---|---|---|---|---|
Pinnacle I | 32 | Intel 6130 | 6.8 | skylake | 32 | 1 | 1 | 1 | 20.5 |
Pinnacle I | 32 | Intel 6130 | 6.8 | skylake | 32 | 1 | 1 | 4 | 13.6 |
Pinnacle I | 32 | Intel 6130 | 6.8 | skylake | 32 | 1 | 1 | 8 | 9.5 |
Pinnacle I | 32 | Intel 6130 | 6.8 | skylake | 32 | 1 | 1 | 16 | 14.3 |
Pinnacle I | 32 | Intel 6130 | 6.8 | skylake | 16 | 2 | 1 | 4 | >360 |
Pinnacle I | 32 | Intel 6130 | 6.8 | bulldozer | 32 | 1 | 1 | 8 | 22.0 |
Trestles | 32 | AMD 6136 | 6.8 | bulldozer | 32 | 1 | 1 | 8 | 17.5 |
Trestles | 32 | AMD 6136 | 6.8 | bulldozer | 32 | 1 | 1 | 4 | 16.2 |
Trestles | 32 | AMD 6136 | 7.1 | bulldozer | 32 | 1 | 1 | 4 | 15.3 |
Pinnacle II | 64 | AMD 7543 | 6.8 | bulldozer | 64 | 1 | 1 | 4 | 4.9 |
Pinnacle II | 64 | AMD 7543 | 6.8 | bulldozer | 64 | 1 | 1 | 8 | 4.0 |
Pinnacle II | 64 | AMD 7543 | 6.8 | bulldozer | 64 | 1 | 1 | 16 | 4.5 |
Pinnacle II | 64 | AMD 7543 | 6.8 | bulldozer | 32 | 1 | 1 | 8 | 6.0 |
Pinnacle II | 64 | AMD 7543 | 7.1 | bulldozer | 64 | 1 | 1 | 8 | 4.2 |
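The braced alternatives in the command above were run as simple sweeps; a sketch of the single-node `-nk` scan for the skylake 6.8 binary on Pinnacle I (paths and values as listed above, log names illustrative):

```bash
#!/bin/bash
# Illustrative -nk sweep for the small scf.in input on one 32-core node.
module load intel/19.0.5 mkl/19.0.5 impi/17.0.4
export OMP_NUM_THREADS=1
for NK in 1 4 8 16; do
  echo "== -nk $NK =="
  time mpirun -np 32 \
    /share/apps/espresso/espresso-6.8-intel-impi-mkl-skylake/bin/pw.x \
    -nk $NK <scf.in >log.nk$NK
done
```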
Conclusions for this sample program:
- `-nk 8` is best on the Pinnacle I & II platforms, `-nk 4` on Trestles.
- QE 7.1 is slightly slower than 6.8 on Pinnacle I & II and slightly faster on Trestles.
- `OMP_NUM_THREADS>1` is a very substantial slowdown versus 1 or unset.
- The bulldozer version runs on Intel but is significantly slower than the skylake version (other, older platforms such as the E5 condo nodes won't run AVX512 code and would probably benefit from their own version).
- The usually zero-wait Trestles system has relatively good performance on QE if its shared memory (64 GB) allows: roughly 1/2 of Pinnacle I performance, whereas for some programs it is ~1/5.