quantum_espresso

Differences

This shows you the differences between two versions of the page.

quantum_espresso [2022/06/20 18:40]
root
quantum_espresso [2022/07/01 20:57] (current)
root
Line 1: Line 1:
 ===== Quantum Espresso =====
-Version 5.1 
-** Compilation ** 
  
-With Intel compiler and either OpenMPI or MVAPICH2: +Versions 6.8/7.1
-<code> +
-OpenMPI: +
-DFLAGS         = -D__INTEL -D__FFTW3 -D__MPI -D__PARA -D__SCALAPACK $(MANUAL_DFLAGS) +
-IFLAGS         = -I../include +
-MPIF90         = mpif90 +
-CFLAGS         = -O3 -xSSE2 -axavx $(DFLAGS) $(IFLAGS) +
-F90FLAGS       = $(FFLAGS) -nomodule -fpp $(FDFLAGS) $(IFLAGS) $(MODFLAGS) +
-FFLAGS         = -O2 -xSSE2 -axavx -assume byterecl -g -traceback -par-report0 -vec-report0 +
-FFLAGS_NOOPT   = -O0 -assume byterecl -g -traceback +
-FFLAGS_NOMAIN  = -nofor_main +
-LD             = mpif90 +
-LDFLAGS        = -static-intel  +
-SCALAPACK_LIBS = -lmkl_scalapack_lp64 -lmkl_blacs_openmpi_lp64 +
-FFT_LIBS       = -L ${MKL_ROOT}/interfaces/fftw3xf -lfftw3xf_intel+
  
-MVAPICH2: same except +** Compilation ** 
-SCALAPACK_LIBS = -lmkl_scalapack_lp64 -lmkl_blacs_intelmpi_lp64+With Intel compiler, Intel MPI, and MKL
  
-trestles: same except 
-no -axavx (though an "optional" code path, it makes the program fail on AMD) 
-</code> 
- 
-** Benchmarks ** 
- 
-We run AUSURF112 from [[http://qe-forge.org/gf/project/q-e/frs/?action=FrsReleaseBrowse&frs_package_id=36|Espresso Benchmarks]] and compare with [[http://glennklockwood.blogspot.com/2014/02/quantum-espresso-performance-benefits.html|Glenn Lockwood]] who ran the AUSURF112 benchmark on SDSC Comet and on the Trestles system when it was at SDSC.  Unfortunately, the AUSURF112 benchmark generally ends the simulation with  
-''convergence NOT achieved after   2 iterations: stopping'',  
-but it does so fairly repeatably so may be timed. 
-<code> 
-OpenMPI: 
-module load intel/14.0.3 mkl/14.0.3 openmpi/1.8.8  
-mpirun -np 64  -machinefile $PBS_NODEFILE -x LD_LIBRARY_PATH \ 
-/share/apps/espresso/espresso-5.1-intel-openmpi/bin/pw.x -npools 1 <ausurf.in 
-MVAPICH2: 
-module load intel/14.0.3 mkl/14.0.3 mvapich2/2.1 
-mpirun -np 64  -machinefile $PBS_NODEFILE \ 
-/share/apps/espresso/espresso-5.1-intel-mvapich2/bin/pw.x -npools 1 <ausurf.in 
-</code> 
-The table shows Lockwood's and our times.  We add 32x4 core runs for Trestles as we think node-to-node is the more representative comparison.  Our newer versions of OpenMPI show better results on the almost-identical hardware than Lockwood's.
-<csv> 
-Walltime,CoresxNodes,Intel/Mvapich, Intel/OpenMPI 
-Lockwood Gordon E5-2670,16x4,470,580 
-Lockwood Trestles AMD6136,32x2,1060,1440 
-Our E5-2650V2,16x4,na,475 
-Our E5-2670,16x4,456,488 
-Our Trestles AMD6136,32x2,(1),1007 
-Our Trestles AMD6136,32x4,642,762 
-</csv> 
-(1) Fails with error [[http://www.quantum-espresso.org/faq/frequent-errors-during-execution/#5.6|charge is wrong]]. 
- 
-** Notes ** 
- 
-Each run fails with error messages (depending on MPI type) and RC 1 after terminating normally according to the log. This appears harmless: 
- 
-<code> 
-   This run was terminated on:  13: 2:44  11Nov2015             
-=------------------------------------------------------------------------------= 
-   JOB DONE. 
-=------------------------------------------------------------------------------= 
-------------------------------------------------------- 
-Primary job  terminated normally, but 1 process returned 
-a non-zero exit code.. Per user-direction, the job has been aborted. 
-------------------------------------------------------- 
------------------------------------------------------------- 
-A process or daemon was unable to complete a TCP connection 
-to another process: 
-etc. 
-</code> 
- 
-** Continuing Work ** 
- 
-ELPA in newer versions of Espresso is reportedly faster than Scalapack. 
- 
-OpenMPI threading. 
- 
-MKL threading. 
- 
-FFTW fft vs. Intel fft on AMD. 
- 
-=== 2020 Update q-e 6.6=== 
-On Trestles with Intel tools.  It's difficult to find a combination of versions that is new enough to compile qe-6.6 (compiler > 18) yet will still produce a binary that runs on AMD Bulldozer (mkl <20) while avoiding most mkl bugs (mkl > 18). 
- 
- 
-<code> 
-module load intel/18.0.2 mkl/19.0.5 impi/17.0.4 
-MKL_NUM_THREADS=# OMP_NUM_THREADS=# mpirun -np ## -machinefile machinefile /share/apps/espresso/espresso-6.6-intel-impi-mkl-trestles/bin/pw.x <ausurf.in 
-</code> 
- 
-q-e appears to be a code that does not like mpi threads x OMP threads > physical cores. 
-Performance on two trestles nodes is better than the previous 5.1 benchmarks, but it doesn't scale past two nodes on this small problem. 
- 
-<code> 
-Cores Node  type  #mpi  #nodes   #OMP #MKL  WALL 
-32  Trestles AMD    32            1    1  12m11s 
-32  Trestles AMD    32            2    1  50m41s 
-32  Trestles AMD    32            1    2  >14m 
-32  Trestles AMD    16            2    1  >14m 
-32  Trestles AMD    64            1    1   8m33s 
-32  Trestles AMD   128            1    1   8m42s 
-32  6130   Intel    32            1    1   2m35s 
-32  6130   Intel    32            1    1   2m33s * 
-32  6130   Intel    32            1    1   2m58s *** 
-48  7402     AMD    48            1    1   1m58s 
-48  7402     AMD    24            2    2   3m32s 
-48  7402     AMD    96            2    2   2m20s  ** 
-</code> 
- 
-* using a better optimized version for more modern machines, which doesn't seem to help much. 
- 
-** tested for number of hardware threads, which is negative for performance vs. number of physical cores 
- 
-*** 6.5 version. 
- 
-Install script 
 <code> <code>
 +#COMPUTER=skylake
 +#OPT="-xHOST"
 +COMPUTER=bulldozer
 +OPT="-msse3 -axsse3,sse4.2,AVX,core-AVX2,CORE-AVX512"
 +VERSION=7.1
 +HDF5=1.12.0
 +module purge
 +module load intel/19.0.5 mkl/20.0.4 impi/17.0.4
 OMP="--enable-openmp"
-VERSION=6.6
+make clean
 ./install/configure MPIF90=mpiifort F90=ifort F77=ifort FC=ifort CC=icc \
 SCALAPACK_LIBS="-L$MKLROOT/lib/intel64 -lmkl_scalapack_lp64 -lmkl_blacs_intelmpi_lp64" \
Line 125: Line 22:
 BLAS_LIBS="-lmkl_intel_lp64  -lmkl_intel_thread -lmkl_core -liomp5 -thread" \
 FFT_LIBS="-L$MKLROOT/interfaces/fftw3xf -lfftw3xf_intel" \
-FFLAGS="-O3 -xHOST -D__INTEL -D__GNUC__ -D__FFTW3 -D__MPI -D__PARA -D__SCALAPACK
--assume byterecl -I$MKLROOT/include/fftw"
-CFLAGS="-O3 -xHOST -D__INTEL -D__GNUC__ -D__FFTW3 -D__MPI -D__PARA -D__SCALAPACK"
---with-hdf5=/share/apps/hdf5/1.10.5/intel/impi -with-scalapack=intel
---enable-parallel $OMP --prefix=/share/apps/espresso/espresso-$VERSION-intel-impi-mkl-trestles
+FFLAGS="-O3 $OPT -D__INTEL -D__GNUC__ -D__FFTW3 -D__MPI -D__PARA -D__SCALAPACK -assume byterecl \-I$MKLROOT/include/fftw"
+CFLAGS="-O3 $OPT -D__INTEL -D__GNUC__ -D__FFTW3 -D__MPI -D__PARA -D__SCALAPACK"
+--with-hdf5=/share/apps/hdf5/$HDF5/intel/impi -with-scalapack=intel --enable-parallel
+$OMP --prefix=/share/apps/espresso/espresso-$VERSION-intel-impi-mkl-$COMPUTER
+make depends
+make all
+make install
 </code> </code>
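
The new install script switches targets by commenting out one ''COMPUTER''/''OPT'' pair. The same choice could be made with a case statement. This is only a minimal sketch, assuming the two target names and flag sets shown in the script; the function name ''qe_opt_flags'' is hypothetical and is not part of the page's script.

```shell
#!/bin/sh
# Hypothetical helper (not from the original script): map a build target
# name to the Intel compiler optimization flags listed in the install script.
qe_opt_flags() {
    case "$1" in
        skylake)   printf '%s\n' "-xHOST" ;;
        bulldozer) printf '%s\n' "-msse3 -axsse3,sse4.2,AVX,core-AVX2,CORE-AVX512" ;;
        *)         echo "unknown target: $1" >&2; return 1 ;;
    esac
}

# Default to the portable "bulldozer" build unless COMPUTER is already set.
COMPUTER=${COMPUTER:-bulldozer}
OPT=$(qe_opt_flags "$COMPUTER")
echo "COMPUTER=$COMPUTER OPT=$OPT"
```

With something like this, ''COMPUTER=skylake sh script.sh'' would pick ''-xHOST'' without editing the file; the rest of the configure invocation would stay as shown.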
  
-==Update 2022== +Runtime:
- +
-QE 6.8 and 7.1 are installed, each in two versions compiled with the Intel compiler ("skylake" for Intel and "bulldozer" for AMD).  +
-"skylake" uses ''-xHOST'' and is compiled on Pinnacle I. +
-"bulldozer" uses ''-msse3 -axsse3,sse4.2,AVX,core-AVX2,CORE-AVX512'' and is compiled on Trestles, so should work at some speed on all systems. +
-Both use ''module load intel/19.0.5 mkl/19.0.5 impi/17.0.4''+
-The ''impi/19.0.5'' module causes a fault on the AMD platforms for unknown reasons. +
-The "skylake" binary causes a fault on the AMD platforms because of a single AVX512 code path. +
-''export MKL_DEBUG_CPU_TYPE=5'' as set on AMD by module ''mkl<20'' causes a fault on the AMD platforms (failure on Trestles, wrong answer on Pinnacle II). +
-For AMD, explicitly set after the module load:+
 <code> <code>
-export MKL_DEBUG_CPU_TYPE=0+module load intel/18.0.2 impi/17.0.4 mkl/20.0.4 {qe/7.1 or qe/6.8} 
 +trestles:module load intel/18.0.2 impi/17.0.4 mkl/20.0.1 {qe/7.1 or qe/6.8}
 </code> </code>
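
The two runtime lines differ only in the MKL module. A job script shared between systems might choose it from a platform name. This is a hedged sketch: the function ''qe_mkl_module'' and the ''PLATFORM'' variable are hypothetical, and the command is printed rather than executed so that it also runs where no module system is present.

```shell
#!/bin/sh
# Hypothetical helper: pick the MKL module for a platform, following the
# two module-load lines above (20.0.1 on trestles, 20.0.4 elsewhere).
qe_mkl_module() {
    case "$1" in
        trestles) echo "mkl/20.0.1" ;;
        *)        echo "mkl/20.0.4" ;;
    esac
}

# PLATFORM is an assumed variable name; set it however the site identifies hosts.
PLATFORM=${PLATFORM:-trestles}
echo "module load intel/18.0.2 impi/17.0.4 $(qe_mkl_module "$PLATFORM") qe/7.1"
```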
  
-A small pw.x input was used to allow some parameter sweeps. +The performance is not sensitive to QE version between 6.8 and 7.1, but is quite sensitive to MKL version.  The newest MKL (20.0.4) is best on all platforms except trestles, where 20.0.1 is best.  There are two executable sets selected by the module at runtime ("skylake" for Pinnacle-I and "bulldozer" for all other platforms).  Performance with OpenMP is slightly slower.
-<code> +
-Single Node Results+
  
-{OMP_NUM_THREADS=1|2} time mpirun -np {16|32|64} \ +The AUSURF112 benchmark is used for comparison, run with "-nk 2" on both CPUs of one node.
-/share/apps/espresso/espresso-{6.8|7.1}-intel-impi-mkl-{skylake|bulldozer}/bin/pw.x \ +
--nk {1|4|8|16} <scf.in >log+
  
-System   Cores        CPU  QE version compile #mpi #OMP #MKL  -nk Wall Time(s)  +<code> 
- +System     QE version cores OMP  time  
-Pinnacle  32 Intel 6130      6.8    skylake   32    1    1    1   20.5 +Pinnacle II-AMD7543  7.1 64   1    86 
-Pinnacle  32 Intel 6130      6.8    skylake   32          4   13.6 +Pinnacle II-AMD7543  7.1 32      89 
-Pinnacle I  32 Intel 6130      6.8    skylake   32    1    1    8    9.5 +Pinnacle I-Intel6130 7.1 32     133 
-Pinnacle I  32 Intel 6130      6.8    skylake   32         16   14.3 +Pinnacle I-Intel6130 7.1 16   2   137 
- +Trestles-AMD6136     7.1 32     718 
-Pinnacle I  32 Intel 6130      6.8    skylake   16    2         >360 +Trestles-AMD6136     7.1 16     858
-Pinnacle I  32 Intel 6130      6.8  bulldozer   32          8   22.0  +
- +
-Trestles    32   AMD 6136      6.8  bulldozer   32          8   17.5  +
-Trestles    32   AMD 6136      6.8  bulldozer   32          4   16.2  +
-Trestles    32   AMD 6136      7.1  bulldozer   32    1    1    4   15.3  +
- +
-Pinnacle II 64   AMD 7543      6.8  bulldozer   64    1    1    4    4.9  +
-Pinnacle II 64   AMD 7543      6.8  bulldozer   64    1    1    8    4.0  +
-Pinnacle II 64   AMD 7543      6.8  bulldozer   64    1    1   16    4.5  +
-Pinnacle II 64   AMD 7543      6.8  bulldozer   32    1    1    8    6.0  +
-Pinnacle II 64   AMD 7543      7.1  bulldozer   64    1    1    8    4.2 +
 </code> </code>
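
In the new table each row appears to pair a rank count with an OMP value so that their product equals the node's core count (for example 64 x 1 and 32 x 2 on Pinnacle II). The sketch below assumes that reading, i.e. that the "cores" column counts MPI ranks. It also assumes ''pw.x'' is on the PATH after the ''qe'' module is loaded and that the input file is named ''ausurf.in'' as in the older runs. The function names are hypothetical, and the launch line is printed, not executed.

```shell
#!/bin/sh
# Hypothetical helper: MPI rank count such that ranks x OpenMP threads
# equals the number of physical cores on the node.
qe_ranks() {
    cores=$1
    omp=$2
    echo $(( cores / omp ))
}

# Print (do not run) a single-node pw.x launch line using "-nk 2".
qe_launch_line() {
    cores=$1
    omp=$2
    echo "OMP_NUM_THREADS=$omp mpirun -np $(qe_ranks "$cores" "$omp") pw.x -nk 2 <ausurf.in"
}

qe_launch_line 64 1
qe_launch_line 64 2
```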
  
-Conclusions for this sample program: 
- 
-''-nk 8'' is best on Pinnacle I & II platforms, ''-nk 4'' on Trestles. 
- 
-QE 7.1 is slightly slower than 6.8 on Pinnacle I & II and slightly faster on Trestles. 
- 
-''OMP_NUM_THREADS>1'' causes a very substantial slowdown versus 1 or unset.
- 
-The bulldozer version runs on Intel but is significantly slower than the skylake version (other older platforms such as E5 condo nodes won't run AVX512 codes and would probably benefit from their own version). 
- 
-The usually zero-wait Trestles system performs relatively well on QE if its shared memory (64 GB) is sufficient: about 1/2 of Pinnacle I performance, whereas for some programs it is about 1/5.
quantum_espresso.txt · Last modified: 2022/07/01 20:57 by root