Compilation
With the Intel compiler and either OpenMPI or MVAPICH2:
OpenMPI:
DFLAGS         = -D__INTEL -D__FFTW3 -D__MPI -D__PARA -D__SCALAPACK $(MANUAL_DFLAGS)
IFLAGS         = -I../include
MPIF90         = mpif90
CFLAGS         = -O3 -xSSE2 -axavx $(DFLAGS) $(IFLAGS)
F90FLAGS       = $(FFLAGS) -nomodule -fpp $(FDFLAGS) $(IFLAGS) $(MODFLAGS)
FFLAGS         = -O2 -xSSE2 -axavx -assume byterecl -g -traceback -par-report0 -vec-report0
FFLAGS_NOOPT   = -O0 -assume byterecl -g -traceback
FFLAGS_NOMAIN  = -nofor_main
LD             = mpif90
LDFLAGS        = -static-intel 
SCALAPACK_LIBS = -lmkl_scalapack_lp64 -lmkl_blacs_openmpi_lp64
FFT_LIBS       = -L ${MKL_ROOT}/interfaces/fftw3xf -lfftw3xf_intel
MVAPICH2: same except
SCALAPACK_LIBS = -lmkl_scalapack_lp64 -lmkl_blacs_intelmpi_lp64
Trestles: same except
no -axavx (although it is nominally an "optional" code path, it makes the program fail on AMD)
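The "same except" variants above can be spelled out with a small helper that prints only the make.sys lines that differ between builds. This is a minimal sketch: `qe_variant_flags` and its `intel`/`amd` arguments are hypothetical names, not part of Quantum ESPRESSO, and it assumes the Trestles change (dropping -axavx) applies on top of either MPI variant.

```shell
#!/bin/sh
# Hypothetical helper: print the make.sys lines that vary between builds.
# Usage: qe_variant_flags <openmpi|mvapich2> <intel|amd>
qe_variant_flags() {
    mpi=$1
    cpu=$2
    # The BLACS layer must match the MPI library (values taken from the settings above).
    case "$mpi" in
        openmpi)  blacs=openmpi ;;
        mvapich2) blacs=intelmpi ;;
        *) echo "unknown MPI: $mpi" >&2; return 1 ;;
    esac
    # Assumption: "amd" stands for the Trestles build, which omits -axavx.
    case "$cpu" in
        intel) avx=" -axavx" ;;
        amd)   avx="" ;;
        *) echo "unknown CPU: $cpu" >&2; return 1 ;;
    esac
    printf 'CFLAGS         = -O3 -xSSE2%s $(DFLAGS) $(IFLAGS)\n' "$avx"
    printf 'FFLAGS         = -O2 -xSSE2%s -assume byterecl -g -traceback -par-report0 -vec-report0\n' "$avx"
    printf 'SCALAPACK_LIBS = -lmkl_scalapack_lp64 -lmkl_blacs_%s_lp64\n' "$blacs"
}
```

For example, `qe_variant_flags mvapich2 amd` prints the three lines to paste into make.sys for an MVAPICH2 build on Trestles.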
Benchmarks
We run the AUSURF112 case from the Espresso Benchmarks and compare our timings with those of Glenn Lockwood, who ran the same benchmark on SDSC Gordon and on the Trestles system when it was at SDSC. Unfortunately, AUSURF112 generally ends the simulation with
convergence NOT achieved after   2 iterations: stopping,
but it does so repeatably enough that the run can still be timed.
OpenMPI:
module load intel/14.0.3 mkl/14.0.3 openmpi/1.8.8
mpirun -np 64 -machinefile $PBS_NODEFILE -x LD_LIBRARY_PATH \
    /share/apps/espresso/espresso-5.1-intel-openmpi/bin/pw.x -npools 1 <ausurf.in
MVAPICH2:
module load intel/14.0.3 mkl/14.0.3 mvapich2/2.1
mpirun -np 64 -machinefile $PBS_NODEFILE \
    /share/apps/espresso/espresso-5.1-intel-mvapich2/bin/pw.x -npools 1 <ausurf.in
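One way to collect a walltime for runs like these is a thin wrapper around the launch command. The sketch below is not necessarily how the times on this page were measured; `timed_run` is a hypothetical helper, and it assumes a `date` that supports `+%s` (GNU and BSD both do).

```shell
#!/bin/sh
# Hypothetical helper: run a command, save its output to a log,
# and report elapsed wall seconds plus the command's exit code.
# Usage: timed_run <logfile> <command> [args...]
timed_run() {
    log=$1
    shift
    start=$(date +%s)
    "$@" >"$log" 2>&1      # stdin is inherited, so "<ausurf.in" on the call still works
    rc=$?
    end=$(date +%s)
    echo "walltime_s=$((end - start)) rc=$rc"
}
```

It would be called as `timed_run ausurf.out mpirun -np 64 ... pw.x -npools 1 <ausurf.in`, with the `mpirun` arguments exactly as in the commands above. The resolution is one second, which is adequate for runs that take several minutes.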
The table shows Lockwood's times and ours. We add 32×4 (four-node) runs for Trestles because we think a node-for-node comparison is more representative. Our builds with a newer version of OpenMPI run faster than Lockwood's on almost identical hardware (488 vs. 580 on the E5-2670, and 1007 vs. 1440 on Trestles at 32×2).
| System | Cores × Nodes | Walltime, Intel/MVAPICH2 | Walltime, Intel/OpenMPI |
|---|---|---|---|
| Lockwood Gordon E5-2670 | 16x4 | 470 | 580 | 
| Lockwood Trestles AMD6136 | 32x2 | 1060 | 1440 | 
| Our E5-2650V2 | 16x4 | na | 475 | 
| Our E5-2670 | 16x4 | 456 | 488 | 
| Our Trestles AMD6136 | 32x2 | (1) | 1007 | 
| Our Trestles AMD6136 | 32x4 | 642 | 762 | 
(1) Fails with the error "charge is wrong".
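To put numbers on the comparison, the ratios between table entries can be computed directly. This is only arithmetic on the walltimes listed above; it says nothing about run-to-run variation, and `ratio` is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: print a/b rounded to two decimals.
ratio() {
    awk -v a="$1" -v b="$2" 'BEGIN { printf "%.2f\n", a / b }'
}

ratio 580 488     # Lockwood vs. ours, E5-2670, OpenMPI        -> 1.19
ratio 1440 1007   # Lockwood vs. ours, Trestles 32x2, OpenMPI  -> 1.43
ratio 470 456     # Lockwood vs. ours, E5-2670, MVAPICH2       -> 1.03
ratio 488 456     # ours, E5-2670: OpenMPI vs. MVAPICH2        -> 1.07
ratio 762 642     # ours, Trestles 32x4: OpenMPI vs. MVAPICH2  -> 1.19
```

A ratio above 1 means the second number is the faster run.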
Notes
Each run exits with return code 1 and error messages that depend on the MPI type, even though the log shows that the program terminated normally. This appears harmless:
This run was terminated on: 13: 2:44 11Nov2015
=------------------------------------------------------------------------------=
JOB DONE.
=------------------------------------------------------------------------------=
-------------------------------------------------------
Primary job terminated normally, but 1 process returned a non-zero exit code.. Per user-direction, the job has been aborted.
-------------------------------------------------------
------------------------------------------------------------
A process or daemon was unable to complete a TCP connection to another process:
etc.
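Because the launcher's return code is unreliable here, a job script can judge success from the log instead. The following is a minimal sketch under that assumption; `qe_status` is a hypothetical helper, and it relies only on the "JOB DONE" marker shown in the log above.

```shell
#!/bin/sh
# Hypothetical helper: classify a run from its log file and the launcher's exit code.
# Usage: qe_status <logfile> <exit_code>
qe_status() {
    log=$1
    rc=$2
    if grep -q 'JOB DONE' "$log"; then
        if [ "$rc" -eq 0 ]; then
            echo "ok"
        else
            # The program finished; the non-zero code came from MPI teardown.
            echo "ok-despite-rc=$rc"
        fi
    else
        echo "failed rc=$rc"
        return 1
    fi
}
```

A run that prints "JOB DONE" but returns 1 is reported as `ok-despite-rc=1`, so a batch script can continue, while a log without the marker still counts as a failure.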
Continuing Work
ELPA in newer versions of Espresso is reportedly faster than ScaLAPACK.
OpenMPI threading.
MKL threading.
FFTW fft vs. Intel fft on AMD.