Version 5.1 Compilation
With the Intel compiler and either OpenMPI or MVAPICH2.

OpenMPI:

```
DFLAGS         = -D__INTEL -D__FFTW3 -D__MPI -D__PARA -D__SCALAPACK $(MANUAL_DFLAGS)
IFLAGS         = -I../include
MPIF90         = mpif90
CFLAGS         = -O3 -xSSE2 -axavx $(DFLAGS) $(IFLAGS)
F90FLAGS       = $(FFLAGS) -nomodule -fpp $(FDFLAGS) $(IFLAGS) $(MODFLAGS)
FFLAGS         = -O2 -xSSE2 -axavx -assume byterecl -g -traceback -par-report0 -vec-report0
FFLAGS_NOOPT   = -O0 -assume byterecl -g -traceback
FFLAGS_NOMAIN  = -nofor_main
LD             = mpif90
LDFLAGS        = -static-intel
SCALAPACK_LIBS = -lmkl_scalapack_lp64 -lmkl_blacs_openmpi_lp64
FFT_LIBS       = -L${MKL_ROOT}/interfaces/fftw3xf -lfftw3xf_intel
```

MVAPICH2: the same, except

```
SCALAPACK_LIBS = -lmkl_scalapack_lp64 -lmkl_blacs_intelmpi_lp64
```

Trestles: the same, except without `-axavx` (although it is an "optional" code path, it makes the program fail on AMD).
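Since the `-axavx` code path makes the binary fail on the AMD nodes, a build script can branch on the CPU's advertised features before choosing flags. A minimal sketch (the helper name is ours; on Linux the feature list is the `flags` line of `/proc/cpuinfo`):

```shell
# has_avx FLAGS_LINE -> true if the avx feature flag is present.
# (Hypothetical helper; feed it the "flags" line from /proc/cpuinfo.)
has_avx() {
    case " $1 " in
        *" avx "*) return 0 ;;
        *)         return 1 ;;
    esac
}

# On a build node you would call something like:
#   if has_avx "$(grep -m1 '^flags' /proc/cpuinfo)"; then
#       FFLAGS="-O2 -xSSE2 -axavx -assume byterecl ..."
#   else
#       FFLAGS="-O2 -xSSE2 -assume byterecl ..."   # Trestles AMD: no -axavx
#   fi
```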
Benchmarks
We run AUSURF112 from the Espresso benchmarks and compare with Glenn Lockwood, who ran the AUSURF112 benchmark on SDSC Gordon and on the Trestles system while it was at SDSC. Unfortunately, the AUSURF112 benchmark generally ends the simulation with

```
convergence NOT achieved after 2 iterations: stopping
```

but it does so fairly repeatably, so the runs can still be timed.
OpenMPI:

```
module load intel/14.0.3 mkl/14.0.3 openmpi/1.8.8
mpirun -np 64 -machinefile $PBS_NODEFILE -x LD_LIBRARY_PATH \
    /share/apps/espresso/espresso-5.1-intel-openmpi/bin/pw.x -npools 1 < ausurf.in
```

MVAPICH2:

```
module load intel/14.0.3 mkl/14.0.3 mvapich2/2.1
mpirun -np 64 -machinefile $PBS_NODEFILE \
    /share/apps/espresso/espresso-5.1-intel-mvapich2/bin/pw.x -npools 1 < ausurf.in
```
The table shows Lockwood's times and ours. We add 32-core × 4-node runs for Trestles because we think node-to-node is the more representative comparison. On almost identical hardware, our newer version of OpenMPI gives better results than Lockwood's.
| System | Cores × Nodes | Intel/MVAPICH2 walltime (s) | Intel/OpenMPI walltime (s) |
|---|---|---|---|
| Lockwood Gordon E5-2670 | 16×4 | 470 | 580 |
| Lockwood Trestles AMD 6136 | 32×2 | 1060 | 1440 |
| Our E5-2650 v2 | 16×4 | n/a | 475 |
| Our E5-2670 | 16×4 | 456 | 488 |
| Our Trestles AMD 6136 | 32×2 | (1) | 1007 |
| Our Trestles AMD 6136 | 32×4 | 642 | 762 |

(1) Fails with the error "charge is wrong".
Notes
Each run fails with error messages (depending on MPI type) and return code 1 after terminating normally according to the log. This appears harmless:

```
   This run was terminated on:  13: 2:44  11Nov2015

=------------------------------------------------------------------------------=
   JOB DONE.
=------------------------------------------------------------------------------=
-------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has
been aborted.
-------------------------------------------------------
------------------------------------------------------------
A process or daemon was unable to complete a TCP connection
to another process:
etc.
```
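Because pw.x can return a nonzero exit code even after terminating normally, a job script that checks `$?` will flag good runs as failed. One workaround (a sketch; the helper name and log-file name are ours) is to treat the `JOB DONE.` banner in the output as the success criterion instead:

```shell
# pw_run_ok LOGFILE -> success when the Quantum ESPRESSO log shows the
# normal-termination banner, regardless of the MPI exit code.
pw_run_ok() {
    grep -q "JOB DONE." "$1"
}

# Typical use in a job script (mpirun line as in the runs above):
#   mpirun ... pw.x -npools 1 < ausurf.in > ausurf.out
#   pw_run_ok ausurf.out && echo "run completed (ignoring MPI teardown RC)"
```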
Continuing Work
- ELPA in newer versions of Espresso is reportedly faster than ScaLAPACK.
- OpenMPI threading.
- MKL threading.
- FFTW FFT vs. Intel FFT on AMD.
On Trestles with the Intel tools, it is difficult to find a combination of versions that is new enough to compile qe-6.6 (compiler > 18) yet still produces a binary that runs on AMD Bulldozer (MKL < 20) while avoiding most MKL bugs (MKL > 18).
```
module load intel/18.0.2 mkl/19.0.5 impi/17.0.4
MKL_NUM_THREADS=# OMP_NUM_THREADS=# mpirun -np ## -machinefile machinefile \
    /share/apps/espresso/espresso-6.6-intel-impi-mkl-trestles/bin/pw.x < ausurf.in
```
q-e appears to be a code that does not like MPI ranks × OMP threads > physical cores. Performance on two Trestles nodes is better than in the previous 5.1 benchmarks, but this small problem does not scale past two nodes.
| Cores | Node type | #MPI | #nodes | #OMP | #MKL | Walltime |
|---|---|---|---|---|---|---|
| 32 | Trestles AMD | 32 | 1 | 1 | 1 | 12m11s |
| 32 | Trestles AMD | 32 | 1 | 2 | 1 | 50m41s |
| 32 | Trestles AMD | 32 | 1 | 1 | 2 | >14m |
| 32 | Trestles AMD | 16 | 1 | 2 | 1 | >14m |
| 32 | Trestles AMD | 64 | 2 | 1 | 1 | 8m33s |
| 32 | Trestles AMD | 128 | 4 | 1 | 1 | 8m42s |
| 32 | 6130 Intel | 32 | 1 | 1 | 1 | 2m35s |
| 32 | 6130 Intel | 32 | 1 | 1 | 1 | 2m33s \* |
| 32 | 6130 Intel | 32 | 1 | 1 | 1 | 2m58s \*\*\* |
| 48 | 7402 AMD | 48 | 1 | 1 | 1 | 1m58s |
| 48 | 7402 AMD | 24 | 1 | 2 | 2 | 3m32s |
| 48 | 7402 AMD | 96 | 1 | 2 | 2 | 2m20s \*\* |

\* Version 6.5.
\*\* Tested with the number of hardware threads, which is negative for performance vs. the number of physical cores.
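The pattern in the table — MPI ranks × OMP threads beyond the physical core count slows the run down badly — can be guarded against in a job script. A minimal sketch (the helper name is ours):

```shell
# check_layout RANKS_PER_NODE OMP_THREADS PHYS_CORES
# Fails when MPI ranks x OMP threads would oversubscribe the physical
# cores of a node, which the benchmarks above show hurts q-e badly.
check_layout() {
    if [ $(( $1 * $2 )) -gt "$3" ]; then
        echo "oversubscribed: $1 ranks x $2 threads > $3 physical cores" >&2
        return 1
    fi
}

# e.g. on a 32-core Trestles node:
#   check_layout 32 2 32 || exit 1   # rejects the layout that took 50m41s
```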
Install script
```
OMP="--enable-openmp"
VERSION=6.6
./install/configure MPIF90=mpiifort F90=ifort F77=ifort FC=ifort CC=icc \
    SCALAPACK_LIBS="-L$MKLROOT/lib/intel64 -lmkl_scalapack_lp64 -lmkl_blacs_intelmpi_lp64" \
    LAPACK_LIBS="-L$MKLROOT/lib/intel64 -lmkl_lapack95_lp64 -lmkl_blas95_lp64" \
    BLAS_LIBS="-lmkl_intel_lp64 -lmkl_intel_thread -lmkl_core -liomp5 -thread" \
    FFT_LIBS="-L$MKLROOT/interfaces/fftw3xf -lfftw3xf_intel" \
    FFLAGS="-O3 -xHOST -D__INTEL -D__GNUC__ -D__FFTW3 -D__MPI -D__PARA -D__SCALAPACK \
        -assume byterecl -I$MKLROOT/include/fftw" \
    CFLAGS="-O3 -xHOST -D__INTEL -D__GNUC__ -D__FFTW3 -D__MPI -D__PARA -D__SCALAPACK" \
    --with-hdf5=/share/apps/hdf5/1.10.5/intel/impi --with-scalapack=intel \
    --enable-parallel $OMP --prefix=/share/apps/espresso/espresso-$VERSION-intel-impi-mkl-trestles
```