====Optimization/Making your code faster====
Here we focus on compiling someone else's code in Linux for scientific computing. Writing your own code expands the problem considerably.
About 2015 this was a simpler exercise.

Now there are several compiler families to choose from:
  * Intel proprietary: icc/ifort
  * Intel oneAPI Clang/LLVM based: icx/ifx
  * AMD Clang/LLVM based: clang/flang (AOCC)
  * NVIDIA PGI based: pgcc/pgfortran
  * GNU: gcc/gfortran
  * Also base Clang/LLVM, but it is not necessary given the two optimized Clang/LLVM-based versions above
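
If you are not sure which of these are installed on the machine you are using, the driver names above can be queried directly. A minimal sketch (on most clusters the compilers are provided through environment modules, so a module may need to be loaded before a command is found):
<code bash>
# Report which compiler drivers are on the current PATH, and their versions
gcc --version
gfortran --version
icc --version      # Intel classic
icx --version      # Intel oneAPI
clang --version    # AOCC or base LLVM
pgcc -V            # PGI/NVIDIA compilers print their version with -V
</code>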

For each of these you need to find the right options to enable your compute hardware. The most important options are:

==Optimization Levels==
Fortunately these are usually the same for every compiler.

  * -O0 no optimization, fastest compile, most useful for debugging
  * -O1 light optimization
  * -O2 more optimization
  * -O3 still more aggressive optimization
  * -Ofast usually -O3 with reduced numerical precision
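
A quick way to see what the optimizer buys you is to build the same source at several levels and time the results. This is only a sketch; mycode.c stands in for whatever source you are actually building:
<code bash>
# Build the same program at increasing optimization levels
gcc -O0 -o mycode_O0 mycode.c -lm
gcc -O2 -o mycode_O2 mycode.c -lm
gcc -O3 -o mycode_O3 mycode.c -lm
gcc -Ofast -o mycode_Ofast mycode.c -lm   # relaxes IEEE floating-point rules

# Time each binary on the same input
for exe in mycode_O0 mycode_O2 mycode_O3 mycode_Ofast; do
    /usr/bin/time -p ./$exe
done
</code>
Always check that -Ofast (or any reduced-precision option) still produces acceptable numerical results before using it for production runs.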

==Target Architectures==

Each compiler has its own flags, with examples here for AHPCC hardware (e.g. trestles=bulldozer):

  * icc -x{sandybridge|ivybridge|haswell|skylake-avx512|HOST (compile host)}, core-avx2 for Zen, SSSE3 for Trestles
  * icx -x{mostly the same as icc}
  * clang -march=znver{1|2|3}
  * pgcc -tp={bulldozer|sandybridge|ivybridge|haswell|skylake|zen|zen2|zen3|native (compile host)}
  * gcc -march={bdver1|nehalem|sandybridge|ivybridge|haswell|skylake-avx512|znver1|znver2|znver3|native}
  * gcc -mtune={bdver1|nehalem|sandybridge|haswell|skylake-avx512|znver1|znver2|znver3}
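
A sketch of how the architecture flags are used in practice (mycode.c is again a placeholder). Note that native and HOST describe the machine you compile on, so if the login node differs from the compute nodes, name the compute-node architecture explicitly:
<code bash>
# Build for the node you are compiling on
gcc -O3 -march=native -o mycode mycode.c

# Show which architecture "native" resolves to on this node
gcc -march=native -Q --help=target | grep -- '-march='

# Build on a login node for different compute nodes, e.g. AMD Zen
gcc -O3 -march=znver1 -mtune=znver1 -o mycode mycode.c
</code>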

PRACE has a good document with recommended flags for each compiler, for example for AMD Zen:

  * icc -O3 -march=core-avx2 -fma -ftz -fomit-frame-pointer
  * icx: not included in the PRACE document
  * clang -O3 -march=znver1 -mfma -fvectorize -mavx2 -m3dnow -floop-unswitch-aggressive -fuse-ld=lld
  * pgcc -O3 -tp zen -Mvect=simd -Mcache_align -Mprefetch -Munroll
  * gcc -O3 -march=znver1 -mtune=znver1 -mfma -mavx2 -m3dnow -fomit-frame-pointer
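
To apply flags like these when building someone else's code, the usual route is the build system's flag variables. A sketch for an autotools-style package headed for Zen nodes (the package name and install prefix are made up):
<code bash>
# Export the flags so configure and make pick them up
export CFLAGS="-O3 -march=znver1 -mtune=znver1 -mfma -mavx2 -fomit-frame-pointer"
export CXXFLAGS="$CFLAGS"
export FCFLAGS="$CFLAGS"

./configure --prefix=$HOME/software/mypackage
make -j 8
make check      # run the package's test suite if it has one
make install
</code>
CMake-based packages take the same flags through -DCMAKE_C_FLAGS, -DCMAKE_CXX_FLAGS, and -DCMAKE_Fortran_FLAGS.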

== OpenMP ==

Automatic parallelization by the compiler is usually not very good, so good performance requires OpenMP directives in the code. The flags that enable OpenMP (a compile-and-run sketch follows the list):

  * icc -qopenmp -parallel
  * icx -qopenmp
  * clang -fopenmp
  * pgcc -mp
  * gcc -fopenmp
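
A minimal compile-and-run sketch with gcc (swap in the matching flag from the list for the other compilers). The affinity variables are optional but usually help on multi-core nodes:
<code bash>
# Compile with OpenMP enabled
gcc -O3 -march=native -fopenmp -o mycode_omp mycode.c

# Set the thread count explicitly and pin threads to cores
export OMP_NUM_THREADS=8
export OMP_PLACES=cores
export OMP_PROC_BIND=close
./mycode_omp
</code>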

== Optimized Libraries ==

It is best where possible to use standard libraries for low-level numerical calculations. These include:

  * BLAS and LAPACK: Intel MKL, AMD AOCL, OpenBLAS
  * FFT: FFTW, MKL, AOCL
  * Solvers: AOCL, MKL, ScaLAPACK, ELPA, PETSc, and others
  * Random Numbers: AOCL, MKL
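
A few representative link lines, assuming the corresponding library module is loaded (OPENBLAS_ROOT is a placeholder for wherever the module installs the library):
<code bash>
# Distribution or reference BLAS/LAPACK
gcc -O3 -o mycode mycode.c -llapack -lblas

# OpenBLAS
gcc -O3 -o mycode mycode.c -L$OPENBLAS_ROOT/lib -lopenblas

# Intel MKL with the Intel compilers, via the -qmkl convenience flag
icx -O3 -qmkl -o mycode mycode.c

# FFTW
gcc -O3 -o mycode mycode.c -lfftw3 -lm
</code>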

==MPI Versions==

  * Intel MPI: usually the easiest, as it has run-time interfaces for multiple compilers
  * Open MPI: often the fastest, but must be compiled with the compiler in use
  * MVAPICH (MPICH for InfiniBand)
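
Whichever MPI is loaded, its compiler wrappers add the include and link paths. A sketch with common wrapper names (how jobs are actually launched depends on the cluster's scheduler):
<code bash>
# Open MPI / MVAPICH wrap whatever compiler they were built with
mpicc -O3 -march=native -o mycode_mpi mycode_mpi.c

# Intel MPI provides per-compiler wrappers, e.g. for icc
mpiicc -O3 -o mycode_mpi mycode_mpi.c

# Run with an explicit number of ranks
mpirun -np 32 ./mycode_mpi
</code>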
