====Vectorization====

Here we test single-core vectorization on several CPUs with several versions of the Intel compiler. The test problem is a linear solve (DGESV) of order 4000, compiled with the reference Fortran BLAS rather than the optimized BLAS libraries that would be used in practice. Most of the work in DGESV is done in the DGEMM matrix-multiply routine. Tested compiler options are -O1, -O2, -O3, and -O3 combined with each of -xsse3, -xssse3, -xsse4.1, -xsse4.2, and -xavx. We have very few avx2-capable machines, so avx2 is not tested.
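To put the timings in the table below on a common scale, the sketch below converts a run time into an approximate floating-point rate. It uses the standard leading-order operation count for an LU-based solve, about (2/3)n^3 for the factorization plus 2n^2 per right-hand side. The number of right-hand sides is not stated here, so one is assumed, and the function names are illustrative only.

```python
# Rough GFLOP/s estimate for a DGESV solve of order n.
# Assumption: one right-hand side (nrhs=1); the test description does not say.

def dgesv_flops(n, nrhs=1):
    """Leading-order flop count: LU factorization plus triangular solves."""
    return (2.0 / 3.0) * n**3 + 2.0 * n**2 * nrhs

def gflops(n, seconds, nrhs=1):
    """Approximate rate in GFLOP/s for a solve that took `seconds`."""
    return dgesv_flops(n, nrhs) / seconds / 1e9

if __name__ == "__main__":
    n = 4000
    # Two X5670 times taken from the results table: 14.5 s and 5.5 s.
    for seconds in (14.5, 5.5):
        print(f"{seconds:5.1f} s -> {gflops(n, seconds):.1f} GFLOP/s")
```

Because the reference BLAS is used, these rates are far below what an optimized BLAS would reach on the same hardware; they are only useful for comparing compiler options against each other.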

Tested CPUs are the Intel Xeon X5670 from our Razor 12-core queues, the Intel Xeon E5-2650 v2 (abbreviated E2650 below) from the condo queues, which will soon be available to all for short jobs, and the AMD Opteron 6136 from Trestles. We hope to find a minimal subset of options that will work on as many CPUs as possible. Summary of results:
  - For this code, Intel 11.1 and 12.1 did not vectorize at all (12.1 results are essentially the same as 11.1 and are not shown).
  - On Intel CPUs, Intel compilers 14.0 and 16.0 worked best with the highest vectorization level that the CPU supports, either -xsse4.2 or -xavx. Both can be combined in one executable as -xsse4.2 -axavx, where -ax adds an optional vectorization path that is used only on CPUs that support it.
  - Most versions of the Intel compiler don't vectorize this code at -O3 with no -x specified. The exception is 14.0, which apparently does, judging by execution time.
  - The Intel compiler on AMD CPUs is more complicated, but there is a workaround. The 6136 reports sse3 and sse4a capability, where sse4a is an AMD extension not recognized by the Intel compiler. The issue is that Intel-compiled code does not use sse3 on AMD: the executable checks the CPUID vendor string and then takes different code paths for different processors. The reason for this is disputed [[http://www.agner.org/optimize/blog/read.php?i=49#49]]. For Intel compiler versions through 13, the executable can be patched [[https://github.com/jimenezrick/patch-AuthenticAMD]] with a binary editor that changes the CPUID comparison string from GenuineIntel to AuthenticAMD. The patch doesn't work on 14+ binaries, and where it does work it cannot enable vectorization that the CPU lacks, so on the 6136 it enables sse3 only. Fortunately, judging by execution time, the 14.0 compiler with -O3 and no -x option appears to enable sse3 without the GenuineIntel check. Also judging by execution time, the 16.0 compiler with -O3 and no -x behaves like versions up to 13, that is, no vectorization.
  - **Recommended** best single executable that runs on AHPCC **AMD and Intel systems**: <code>Intel 14 compiler with -O3 -axsse4.2 (don't set -x)</code>
  - **Recommended** best single executable that runs on AHPCC **Intel systems only**: <code>Intel 14+ compiler with -O3 -xsse4.2 -axavx</code>
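Before choosing a -x option it helps to know what a given node actually supports. The following is a minimal sketch, assuming a Linux node where ''/proc/cpuinfo'' is readable; the function names are hypothetical and not part of any AHPCC tool. It reads the vendor string and feature flags and reports the highest of the -x levels tested here that the CPU provides. Linux lists sse3 as ''pni'', and the AMD-only ''sse4a'' flag is deliberately ignored because it has no matching -x option.

```python
import os

# /proc/cpuinfo flag names, lowest to highest, paired with the -x options tested here.
LEVELS = [
    ("pni", "-xsse3"),      # Linux reports SSE3 as "pni"
    ("ssse3", "-xssse3"),
    ("sse4_1", "-xsse4.1"),
    ("sse4_2", "-xsse4.2"),
    ("avx", "-xavx"),
]

def parse_cpuinfo(text):
    """Return (vendor_id, set of flags) from the first CPU entry in the text."""
    vendor, flags = None, set()
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue
        key = key.strip()
        if key == "vendor_id" and vendor is None:
            vendor = value.strip()
        elif key == "flags" and not flags:
            flags = set(value.split())
    return vendor, flags

def highest_x_option(text):
    """Return (vendor_id, highest supported -x option or None)."""
    vendor, flags = parse_cpuinfo(text)
    best = None
    for name, option in LEVELS:
        if name not in flags:
            break  # each level assumes all lower levels are present
        best = option
    return vendor, best

if __name__ == "__main__" and os.path.exists("/proc/cpuinfo"):
    with open("/proc/cpuinfo") as f:
        print(highest_x_option(f.read()))
```

This only reports what the hardware advertises. As described above, an Intel-compiled executable may still decline to use a vectorized path on a CPU whose vendor string is AuthenticAMD.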

Note: Razor frontends such as ''razor-l1'' support only sse4.1, so an application compiled there with ''-xsse4.2'' may fail its internal tests. In that case, compile on an sse4.2 machine using an interactive job in a debug queue.


^ Version ^ Opt ^ Vector ^ X5670 Time (s) ^ E2650 Time (s) ^ 6136 Time (s) ^
| 11.1 | 1 |    none|  25.3 |  17.8 |  41.9|
| 11.1 | 2 |    none|  14.5 |  10.7 |  21.9|
| 11.1 | 3 |    none|  14.5 |  10.7 |  21.8|
| 11.1 | 3 |    sse3|  14.5 |  10.8 |  21.9*|
| 11.1 | 3 |   ssse3|  14.8 |  10.8 |   N/A|
| 11.1 | 3 |  sse4.1|  14.8 |  10.8 |   N/A|
| 11.1 | 3 |  sse4.2|  14.7 |  10.8 |   N/A|
| 11.1 | 3 |     avx|   N/A |  10.8 |   N/A|
|      |   |         |       |      |      |
| 13.1 | 1 |    none|  23.0 |  17.3 |  35.4|
| 13.1 | 2 |    none|  14.2 |  10.5 |  21.7|
| 13.1 | 3 |    none|  14.8 |  10.7 |  21.9|
| 13.1 | 3 |    sse3|   6.1 |   4.9 |  8.9*|
| 13.1 | 3 |   ssse3|   6.1 |   4.9 |   N/A|
| 13.1 | 3 |  sse4.1|   6.1 |   4.9 |   N/A|
| 13.1 | 3 |  sse4.2|   5.5 |   4.5 |   N/A|
| 13.1 | 3 |     avx|   N/A |   4.5 |   N/A|
|      |   |         |       |      |      |
| 14.0 | 1 |    none|  23.0 |  17.3 |  35.9|
| 14.0 | 2 |    none|  14.8 |  10.7 |  21.9|
| 14.0 | 3 |    none|   6.1 |   4.9 |   8.8|
| 14.0 | 3 |    sse3|   6.1 |   4.9 |   N/A|
| 14.0 | 3 |   ssse3|   6.1 |   4.9 |   N/A|
| 14.0 | 3 |  sse4.1|   6.1 |   4.9 |   N/A|
| 14.0 | 3 |  sse4.2|   5.5 |   4.4 |   N/A|
| 14.0 | 3 |     avx|   N/A |   3.5 |   N/A|
|      |   |         |       |       |      |
| 16.0 | 1 |    none|  23.0 |  17.3 |  39.6|
| 16.0 | 2 |    none|  14.7 |  10.6 |  21.8|
| 16.0 | 3 |    none|  14.7 |  10.6 |  21.8|
| 16.0 | 3 |    sse3|   6.2 |   4.8 |   N/A|
| 16.0 | 3 |   ssse3|   6.1 |   4.9 |   N/A|
| 16.0 | 3 |  sse4.1|   6.2 |   4.9 |   N/A|
| 16.0 | 3 |  sse4.2|   5.5 |   4.4 |   N/A|
| 16.0 | 3 |     avx|   N/A |   3.6 |   N/A|
|      |   |         |       |       |      |

* Executable patched with the patch-AuthenticAMD binary editor.
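The gain from vectorization can be read off the table as a simple ratio of run times. As a small illustration, the sketch below computes the speedup of the best vectorized 13.1 run over the unvectorized 13.1 -O3 run for each CPU, using times copied from the table above. The dictionary layout and function name are only for this example.

```python
# Speedup of vectorized over unvectorized -O3 runs, Intel 13.1, times in seconds
# copied from the results table.

def speedup(baseline_s, optimized_s):
    """Ratio of baseline time to optimized time (larger is better)."""
    return baseline_s / optimized_s

runs_13_1 = {
    # cpu: (time at -O3 with no -x, best vectorized time, option used)
    "X5670": (14.8, 5.5, "sse4.2"),
    "E2650": (10.7, 4.5, "sse4.2"),
    "6136": (21.9, 8.9, "sse3, patched"),
}

if __name__ == "__main__":
    for cpu, (base, best, option) in runs_13_1.items():
        print(f"{cpu:6s} {option:14s} {speedup(base, best):.2f}x")
```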