#### NAMD

The `namd-verbs-smp` binary, version 2.11 or 2.12
(<https://web.archive.org/web/20181127065652/http://www.ks.uiuc.edu/Research/namd/benchmarks/>),
is installed in `/share/apps/NAMD` on [razor](razor) and
[trestles](trestles). It does not use MPI.

The following script is for multiple-node runs, with `charmrun` as the
distributed launcher and `namd2` running on each compute node. We have
found that most runs are faster with the `+setcpuaffinity +isomalloc_sync`
options. The charmrun `++ppn` value should match the PBS `ppn=` value.

***

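    module load namd/2.12   # or namd/2.11
    cd $PBS_O_WORKDIR
    # Total cores allocated: one line per core in the PBS nodefile
    NP=$(wc -l <$PBS_NODEFILE)
    # Build the charmrun nodelist, one "host" line per unique node
    rm -f nodelist
    for node in $(sort -u $PBS_NODEFILE)
    do
      echo "host ${node}" >> nodelist
    done
    # ++ppn must match the PBS ppn= request (16 here)
    charmrun ++remote-shell ssh ++nodelist nodelist ++ppn 16 $(which namd2) \
      +p $NP +setcpuaffinity +isomalloc_sync apoa1.namd >apoa1.logfile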
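
The `++ppn` and `+p` values can also be derived from the nodefile instead of being hard-coded. The following is a minimal sketch, assuming the usual PBS nodefile layout in which each hostname appears once per allocated core; `pbs_layout` is a hypothetical helper name, not part of NAMD or PBS.

```shell
# Hypothetical helper: derive the charmrun layout from a PBS-style nodefile.
# Assumes one line per allocated core, so a node with ppn=16 appears 16 times.
pbs_layout() {
  nodefile=$1
  np=$(( $(wc -l < "$nodefile") ))               # total cores (+p)
  nodes=$(( $(sort -u "$nodefile" | wc -l) ))    # unique hosts
  ppn=$(( np / nodes ))                          # cores per node (++ppn)
  echo "$np $nodes $ppn"
}

# Example use inside a job script (left commented out here):
#   set -- $(pbs_layout "$PBS_NODEFILE")
#   charmrun ++remote-shell ssh ++ppn "$3" $(which namd2) +p "$1" apoa1.namd
```

This only works when every node in the job has the same number of allocated cores; a mixed allocation would need a per-node count.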

***

This is for a single-node run using only the shared-memory program
`namd2`.

    module load namd/2.12   # or namd/2.11
    cd $PBS_O_WORKDIR
    # One line per allocated core in the PBS nodefile
    NP=$(wc -l <$PBS_NODEFILE)
    namd2 +p $NP apoa1.namd +setcpuaffinity +isomalloc_sync >apoa1.logfile

***

##### Benchmarks

The NAMD website has benchmarks run on Trestles while it was at UCSD
<http://www.ks.uiuc.edu/Research/namd/performance.html>, but it gives no
information on how the scores were obtained (namd2, charmrun, or MPI).
Scores are shown as benchmark time\*cores. The best charmrun results here
were obtained on multiple nodes with ppn=cores/node and p=total cores
(that is, ppn\*nodes). Single-node namd2 runs on both node types used
p=cores and are comparable with the published benchmarks. Version 2.12 is
substantially faster than 2.11. The module selects the downloaded
verbs-smp version because it is faster than the ibverbs-smp version. On
this problem, the Intel nodes showed no useful scaling beyond 2 nodes,
and the AMD nodes showed little useful scaling beyond 3 nodes.
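
A score of the form benchmark time\*cores can be pulled out of a NAMD log with a short filter. This is a sketch under two assumptions: that "benchmark time" means the s/step figure, and that the log contains lines of the form `Info: Benchmark time: <N> CPUs <t> s/step ...`. The helper name `bench_score` is hypothetical.

```shell
# Hypothetical helper: print (s/step * CPUs) for each "Benchmark time:" line.
# Assumed log line layout (values illustrative only):
#   Info: Benchmark time: 16 CPUs 0.05 s/step 0.58 days/ns 512 MB memory
# Field 4 is the CPU count and field 6 is seconds per step.
bench_score() {
  awk '/Benchmark time:/ { printf "%.2f\n", $4 * $6 }' "$1"
}

# Example use (left commented out here):
#   bench_score apoa1.logfile
```

If the log format differs between NAMD versions, the field numbers in the `awk` expression would need adjusting.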

***

    Node Type      ppn  Version  p   Nodes  Bench  WallClock  UCSD Bench
    16-core Intel  16   2.11     16  1      1.21   383        n/a
    16-core Intel  16   2.12     16  1      0.76   256        n/a
    16-core Intel  16   2.12     16  2      0.90   146        n/a
    16-core Intel  16   2.12     16  3      1.32   146        n/a
    32-core AMD    32   2.12     32  1      1.95   317        1.9
    32-core AMD    32   2.12     32  2      2.22   185        2.0
    32-core AMD    32   2.12     32  3      2.29   127        n/a
    32-core AMD    32   2.12     32  4      2.56   104        n/a
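
The scaling remarks can be checked against the WallClock column by computing parallel efficiency, meaning single-node time divided by N-node time, divided by N. The sketch below uses the 2.12 Intel figures from the table; `efficiency` is a hypothetical helper name.

```shell
# Hypothetical helper: parallel efficiency = (T1 / TN) / N
# usage: efficiency T1 TN NODES
efficiency() {
  awk -v t1="$1" -v tn="$2" -v n="$3" 'BEGIN { printf "%.2f\n", t1 / tn / n }'
}

efficiency 256 146 2   # 16-core Intel, 2 nodes: prints 0.88
efficiency 256 146 3   # 16-core Intel, 3 nodes: prints 0.58
```

An efficiency well below 1 means the extra nodes are mostly idle, which is the case for the third Intel node here.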

***

##### 2020 Update

We replicated a couple of the old benchmarks and added some new versions
and machines.

    Cores  Node Type      ppn  GPU   Version        Nodes  WallClock
    16     Intel Razor    16         2.12           1      242
    32     AMD Trestles   32         2.12           1      315
    32     Intel G6130    32         2.12           1      127
    32     Intel G6130    32         2.13           1      127
    48     AMD Epyc 7402  48         2.13           1      89
    32     Intel G6130    32         2.15a1-AVX512  1      76
    32     Intel G6130    32   V100  3.0a7-cuda     1      39
    48     AMD Epyc 7402  48         2.15a1-AVX2    needs recompilation