Here we test single-core vectorization on several CPUs with several versions of the Intel compiler. The test problem is a DGESV linear solve of order 4000, compiled against the reference Fortran BLAS rather than the optimized BLAS libraries that would be used in practice. Most of the work in DGESV is done by the DGEMM matrix-multiply routine. The compiler options tested are -O1, -O2, -O3, and -O3 combined with each of -xsse3, -xssse3, -xsse4.1, -xsse4.2, and -xavx. We have very few AVX2-capable machines, so AVX2 is not tested.
The CPUs tested are the X5670 from our Razor 12-core queues, the E2650v2 from the condo queues (soon to be available to all users for short jobs), and the AMD 6136 from Trestles. We hope to find a minimal set of options that works on as many CPUs as possible. A summary of results:
Intel 14 compiler with -O3 -axsse4.2 (don't set -x)
Intel 14+ compiler with -O3 -xsse4.2 -axavx
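
As a rough sketch, the two option sets above might be used on a command line as follows. The source file name `dgesv_test.f` and the output names are hypothetical, and `ifort` is assumed to be on the PATH (for example after loading a compiler module). In the Intel compilers, `-x` sets the instruction set that the whole binary requires, while `-ax` adds an alternate code path that is selected at run time on top of the baseline.

```shell
# Hypothetical source file name; substitute your own.
SRC=dgesv_test.f

# The two option sets from the summary above.
FLAGS_COMPAT="-O3 -axsse4.2"          # no -x: baseline code plus an SSE4.2 path
FLAGS_SSE42_AVX="-O3 -xsse4.2 -axavx" # SSE4.2 required, plus an AVX path

if command -v ifort >/dev/null 2>&1; then
    # The flag variables are left unquoted on purpose so that each
    # option is passed to the compiler as a separate argument.
    ifort $FLAGS_COMPAT -o test_compat "$SRC" || echo "compile failed" >&2
    ifort $FLAGS_SSE42_AVX -o test_sse42_avx "$SRC" || echo "compile failed" >&2
else
    echo "ifort not found; load an Intel compiler module first" >&2
fi
```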
Note: Razor frontends such as razor-l1 have only SSE4.1 available, so compiling there with -xsse4.2 may fail an internal test in the application. In that case, use an interactive job in a debug queue to compile on an SSE4.2-capable machine.
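
Before compiling with -xsse4.2, it can help to confirm what the current node supports. The sketch below assumes a Linux node, where /proc/cpuinfo lists SSE4.1, SSE4.2, and AVX as `sse4_1`, `sse4_2`, and `avx`. The helper name `has_cpu_flag` is made up for this example.

```shell
# has_cpu_flag FLAG [FILE]
# Succeeds (exit status 0) if FLAG is listed on the first "flags" line of
# FILE, which defaults to /proc/cpuinfo.
has_cpu_flag() {
    grep -m1 '^flags' "${2:-/proc/cpuinfo}" 2>/dev/null |
        tr ' \t' '\n\n' |
        grep -xF "$1" >/dev/null
}

# Report whether the node this runs on has SSE4.2.
if has_cpu_flag sse4_2; then
    echo "sse4_2 is present on this node"
else
    echo "sse4_2 is not present on this node"
fi
```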
Version | Opt | Vector | X5670 Time (s) | E2650 Time (s) | 6136 Time (s) |
---|---|---|---|---|---|
11.1 | 1 | none | 25.3 | 17.8 | 41.9 |
11.1 | 2 | none | 14.5 | 10.7 | 21.9 |
11.1 | 3 | none | 14.5 | 10.7 | 21.8 |
11.1 | 3 | sse3 | 14.5 | 10.8 | 21.9 |
11.1 | 3 | ssse3 | 14.8 | 10.8 | N/A |
11.1 | 3 | sse4.1 | 14.8 | 10.8 | N/A |
11.1 | 3 | sse4.2 | 14.7 | 10.8 | N/A |
11.1 | 3 | avx | N/A | 10.8 | N/A |
13.1 | 1 | none | 23.0 | 17.3 | 35.4 |
13.1 | 2 | none | 14.2 | 10.5 | 21.7 |
13.1 | 3 | none | 14.8 | 10.7 | 21.9 |
13.1 | 3 | sse3 | 6.1 | 4.9 | 8.9 |
13.1 | 3 | ssse3 | 6.1 | 4.9 | N/A |
13.1 | 3 | sse4.1 | 6.1 | 4.9 | N/A |
13.1 | 3 | sse4.2 | 5.5 | 4.5 | N/A |
13.1 | 3 | avx | N/A | 4.5 | N/A |
14.0 | 1 | none | 23.0 | 17.3 | 35.9 |
14.0 | 2 | none | 14.8 | 10.7 | 21.9 |
14.0 | 3 | none | 6.1 | 4.9 | 8.8 |
14.0 | 3 | sse3 | 6.1 | 4.9 | N/A |
14.0 | 3 | ssse3 | 6.1 | 4.9 | N/A |
14.0 | 3 | sse4.1 | 6.1 | 4.9 | N/A |
14.0 | 3 | sse4.2 | 5.5 | 4.4 | N/A |
14.0 | 3 | avx | N/A | 3.5 | N/A |
16.0 | 1 | none | 23.0 | 17.3 | 39.6 |
16.0 | 2 | none | 14.7 | 10.6 | 21.8 |
16.0 | 3 | none | 14.7 | 10.6 | 21.8 |
16.0 | 3 | sse3 | 6.2 | 4.8 | N/A |
16.0 | 3 | ssse3 | 6.1 | 4.9 | N/A |
16.0 | 3 | sse4.1 | 6.2 | 4.9 | N/A |
16.0 | 3 | sse4.2 | 5.5 | 4.4 | N/A |
16.0 | 3 | avx | N/A | 3.6 | N/A |
*patched by AuthenticAMD editor