====Optimization/Making your code faster====

Here we focus on compiling someone else's code in Linux for scientific computing. Writing your own code expands the problem considerably. About 2015 this was a simpler exercise.
The compiler choices include:

  * Intel proprietary: icc/icpc/ifort
  * Intel oneAPI Clang/LLVM based: icx/icpx/ifx
  * AMD Clang/LLVM based (AOCC): clang/flang
  * NVidia PGI based: pgcc/pgfortran
  * GNU: gcc/g++/gfortran
  * Also base Clang/LLVM, but it is not usually necessary given the two optimized Clang/LLVM versions above

For each of these you need to find the right options to enable your compute hardware. The most important options are:
==Optimization Level==

Fortunately these are usually the same with every compiler.

  * -O0 (no optimization)
  * -O1 light optimization
  * -O2 more optimization
  * -O3 still more optimization
  * -Ofast usually -O3 with reduced numerical precision
==Target Architectures==

Set the target architecture, with examples for AHPCC hardware (trestles=bulldozer, …):

  * icc -x{sandybridge|ivybridge|haswell|skylake-avx512|HOST (compile host)}; use core-avx2 for Zen, SSSE3 for Trestles
  * icx -x{mostly the same as icc}
  * clang -march=znver{1|2|3}
  * pgcc -tp={bulldozer|sandybridge|ivybridge|haswell|skylake|zen|zen2|zen3|native (compile host)}
  * gcc -march={bdver1|nehalem|sandybridge|ivybridge|haswell|skylake-avx512|znver1|znver2|znver3|native}
  * gcc -mtune={bdver1|nehalem|sandybridge|haswell|skylake-avx512|znver1|znver2|znver3}
PRACE has a good document [[ https:// ]] with suggested flags; the examples below are for AMD Zen:

  * icc -O3 -march=core-avx2 -fma -ftz -fomit-frame-pointer
  * icx not included
  * clang -O3 -march=znver1 -mfma -mavx2 -m3dnow -fvectorize -floop-unswitch-aggressive -fuse-ld=lld
  * pgcc -O3 -tp zen -Mvect=simd -Mcache_align -Mprefetch -Munroll
  * gcc -O3 -march=znver1 -mtune=znver1 -mfma -mavx2 -m3dnow -fomit-frame-pointer
== OpenMP ==

The automated parallelization is usually not very good, so good performance requires OpenMP directives in the code. The flags to enable OpenMP are:

  * icc -qopenmp -parallel
  * icx -qopenmp
  * clang -fopenmp
  * pgcc -mp
  * gcc -fopenmp
== Optimized Libraries ==

It is best where possible to use standard libraries for low-level numerical calculations. These include:

  * BLAS and LAPACK: Intel MKL, AMD AOCL, OpenBLAS
  * FFT: FFTW, MKL, AOCL
  * Solvers: AOCL, MKL, ScaLAPACK, ELPA, PETSc, and others
  * Random numbers: AOCL, MKL
==MPI Versions==

  * Intel MPI: usually the easiest, as it has run-time interfaces for multiple compilers
  * Open MPI: often the fastest; must be compiled with the compiler in use
  * MVAPICH (MPICH for InfiniBand):