====Optimization/Making your code faster====
Here we focus on compiling someone else's code in Linux for scientific computing. Writing your own code expands the problem considerably.
About 2015 this was a simpler exercise.

Now there are several compiler families to choose from:
  * Intel proprietary: icc/ifort
  * Intel oneAPI Clang/LLVM based: icx/ifx
  * AMD Clang/LLVM based: clang/flang (AOCC)
  * NVIDIA PGI based: pgcc/pgfortran
  * GNU: gcc/gfortran
  * Also base Clang/LLVM, but it is not necessary given the two optimized Clang/LLVM-based versions above
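
If you are not sure which of these are installed on the machine you are using, the driver names above can be queried directly. A minimal sketch (on most clusters the compilers are provided through environment modules, so a module may need to be loaded before a command is found):
<code bash>
# Report which compiler drivers are on the current PATH, and their versions
gcc --version
gfortran --version
icc --version      # Intel classic
icx --version      # Intel oneAPI
clang --version    # AOCC or base LLVM
pgcc -V            # PGI/NVIDIA compilers print their version with -V
</code>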

For each of these you need to find the right options to enable your compute hardware. The most important options are:

==Optimization Levels==
Fortunately these are usually the same for every compiler.

  * -O0 no optimization, fastest compile, most useful for debugging
  * -O1 light optimization
  * -O2 more optimization
  * -O3 still more aggressive optimization
  * -Ofast usually -O3 with reduced numerical precision
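
A quick way to see what the optimizer buys you is to build the same source at several levels and time the results. This is only a sketch; mycode.c stands in for whatever source you are actually building:
<code bash>
# Build the same program at increasing optimization levels
gcc -O0 -o mycode_O0 mycode.c -lm
gcc -O2 -o mycode_O2 mycode.c -lm
gcc -O3 -o mycode_O3 mycode.c -lm
gcc -Ofast -o mycode_Ofast mycode.c -lm   # relaxes IEEE floating-point rules

# Time each binary on the same input
for exe in mycode_O0 mycode_O2 mycode_O3 mycode_Ofast; do
    /usr/bin/time -p ./$exe
done
</code>
Always check that -Ofast (or any reduced-precision option) still produces acceptable numerical results before using it for production runs.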

==Target Architectures==

Each compiler has its own flags, with examples here for AHPCC hardware (e.g. trestles=bulldozer):

  * icc -x{sandybridge|ivybridge|haswell|skylake-avx512|HOST (compile host)}, core-avx2 for Zen, SSSE3 for Trestles
  * icx -x{mostly the same as icc}
  * clang -march=znver{1|2|3}
  * pgcc -tp={bulldozer|sandybridge|ivybridge|haswell|skylake|zen|zen2|zen3|native (compile host)}
  * gcc -march={bdver1|nehalem|sandybridge|ivybridge|haswell|skylake-avx512|znver1|znver2|znver3|native}
  * gcc -mtune={bdver1|nehalem|sandybridge|haswell|skylake-avx512|znver1|znver2|znver3}
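
A sketch of how the architecture flags are used in practice (mycode.c is again a placeholder). Note that native and HOST describe the machine you compile on, so if the login node differs from the compute nodes, name the compute-node architecture explicitly:
<code bash>
# Build for the node you are compiling on
gcc -O3 -march=native -o mycode mycode.c

# Show which architecture "native" resolves to on this node
gcc -march=native -Q --help=target | grep -- '-march='

# Build on a login node for different compute nodes, e.g. AMD Zen
gcc -O3 -march=znver1 -mtune=znver1 -o mycode mycode.c
</code>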

PRACE has a good document with recommended flags for each compiler, for example for AMD Zen:

  * icc -O3 -march=core-avx2 -fma -ftz -fomit-frame-pointer
  * icx: not included in the PRACE document
  * clang -O3 -march=znver1 -mfma -fvectorize -mavx2 -m3dnow -floop-unswitch-aggressive -fuse-ld=lld
  * pgcc -O3 -tp zen -Mvect=simd -Mcache_align -Mprefetch -Munroll
  * gcc -O3 -march=znver1 -mtune=znver1 -mfma -mavx2 -m3dnow -fomit-frame-pointer
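
To apply flags like these when building someone else's code, the usual route is the build system's flag variables. A sketch for an autotools-style package headed for Zen nodes (the package name and install prefix are made up):
<code bash>
# Export the flags so configure and make pick them up
export CFLAGS="-O3 -march=znver1 -mtune=znver1 -mfma -mavx2 -fomit-frame-pointer"
export CXXFLAGS="$CFLAGS"
export FCFLAGS="$CFLAGS"

./configure --prefix=$HOME/software/mypackage
make -j 8
make check      # run the package's test suite if it has one
make install
</code>
CMake-based packages take the same flags through -DCMAKE_C_FLAGS, -DCMAKE_CXX_FLAGS, and -DCMAKE_Fortran_FLAGS.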

== OpenMP ==

Automatic parallelization by the compiler is usually not very good, so good performance requires OpenMP directives in the code. The flags that enable OpenMP (a compile-and-run sketch follows the list):

  * icc -qopenmp -parallel
  * icx -qopenmp
  * clang -fopenmp
  * pgcc -mp
  * gcc -fopenmp
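
A minimal compile-and-run sketch with gcc (swap in the matching flag from the list for the other compilers). The affinity variables are optional but usually help on multi-core nodes:
<code bash>
# Compile with OpenMP enabled
gcc -O3 -march=native -fopenmp -o mycode_omp mycode.c

# Set the thread count explicitly and pin threads to cores
export OMP_NUM_THREADS=8
export OMP_PLACES=cores
export OMP_PROC_BIND=close
./mycode_omp
</code>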

== Optimized Libraries ==

It is best where possible to use standard libraries for low-level numerical calculations. These include:

  * BLAS and LAPACK: Intel MKL, AMD AOCL, OpenBLAS
  * FFT: FFTW, MKL, AOCL
  * Solvers: AOCL, MKL, ScaLAPACK, ELPA, PETSc, and others
  * Random Numbers: AOCL, MKL
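
A few representative link lines, assuming the corresponding library module is loaded (OPENBLAS_ROOT is a placeholder for wherever the module installs the library):
<code bash>
# Distribution or reference BLAS/LAPACK
gcc -O3 -o mycode mycode.c -llapack -lblas

# OpenBLAS
gcc -O3 -o mycode mycode.c -L$OPENBLAS_ROOT/lib -lopenblas

# Intel MKL with the Intel compilers, via the -qmkl convenience flag
icx -O3 -qmkl -o mycode mycode.c

# FFTW
gcc -O3 -o mycode mycode.c -lfftw3 -lm
</code>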

==MPI Versions==

  * Intel MPI: usually the easiest, as it has run-time interfaces for multiple compilers
  * Open MPI: often the fastest, but must be compiled with the compiler in use
  * MVAPICH (MPICH for InfiniBand)
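
Whichever MPI is loaded, its compiler wrappers add the include and link paths. A sketch with common wrapper names (how jobs are actually launched depends on the cluster's scheduler):
<code bash>
# Open MPI / MVAPICH wrap whatever compiler they were built with
mpicc -O3 -march=native -o mycode_mpi mycode_mpi.c

# Intel MPI provides per-compiler wrappers, e.g. for icc
mpiicc -O3 -o mycode_mpi mycode_mpi.c

# Run with an explicit number of ranks
mpirun -np 32 ./mycode_mpi
</code>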
