Here is an update on NAMD for the shared-memory one-node versions (namd2/namd3) and the multi-node version launched with charmrun++. The standard NAMD benchmark apoa1 is too small to show scaling on a reasonably modern system, so here we use a user's lipid simulation, run for 25k steps until NAMD prints its "benchmark" performance.
Most NAMD 3 builds and some of the newer NAMD 2 builds are not usable on the CentOS 7 OS because they are compiled against too new a glibc. The exceptions we have found are 2.15alpha1 (CPU AVX-512) and 3.0-alpha7-GPU, the best-performing runnable versions we have found, but both are only available as "multicore" (single-node shared-memory) namd2/namd3 builds. For charmrun++, a verbs-smp edition such as 2.14 is indicated. Newer versions will be available after we reimage the cluster with the Rocky 8 OS.
We are using the number of cores available on the node, either "+p32" or "+p64", with one exception: NAMD recommends running on one fewer core than the hardware provides. We find that to be beneficial on the Intel nodes (hence "+p31" on Pinnacle I) and not beneficial on the AMD nodes, as reflected in the examples.
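As a sketch (not the exact commands we ran), the core count can be chosen automatically inside a job script. This assumes nproc reports the cores of the allocated node and that checking /proc/cpuinfo for the vendor string is sufficient to tell Intel from AMD:

# sketch: leave one core free on Intel nodes, use all cores on AMD nodes
CORES=$(nproc)
if grep -qi "GenuineIntel" /proc/cpuinfo; then
    NP=$((CORES - 1))
else
    NP=$CORES
fi
namd2 +p${NP} +setcpuaffinity +isomalloc_sync step7.2_production_colvar.inp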
The 2.14 verbs-smp version can be used with both namd2 and charmrun++. Pinnacle I is over twice as fast as Trestles on this version, and Pinnacle II is over twice as fast as Pinnacle I.
module load namd/2.14
#Pinnacle I Intel 6130   2.09 days/ns
namd2 +p31 +setcpuaffinity +isomalloc_sync step7.2_production_colvar.inp
#Pinnacle II AMD 7543    0.81 days/ns
namd2 +p64 +setcpuaffinity +isomalloc_sync step7.2_production_colvar.inp
#Trestles AMD            4.51 days/ns
namd2 +p32 +setcpuaffinity +isomalloc_sync step7.2_production_colvar.inp
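For reference, a minimal single-node batch-script sketch for the 2.14 namd2 run above. The time limit is a placeholder and the partition line is omitted; adjust both for your jobs:

#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=32      # Pinnacle I node: 32 cores; namd2 uses 31 per the note above
#SBATCH --time=72:00:00
module load namd/2.14
namd2 +p31 +setcpuaffinity +isomalloc_sync step7.2_production_colvar.inp > namd2.log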
The 2.15a1 AVX-512 version of namd2 runs only on Pinnacle I here, but for that case it is very much faster than 2.14.
module load namd/2.15a1
#Pinnacle I Intel 6130   1.24 days/ns
namd2 +p31 +setcpuaffinity +isomalloc_sync step7.2_production_colvar.inp
Single-node 2.14 charmrun++ with "++np 1", and the core count given as "++ppn ##" on the charmrun (left) side of the command instead of namd2's "+p##", should run equivalently to the same namd2 with the same core count.
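For example, on a Pinnacle I node these two invocations should perform about the same (same input file as above):

# shared-memory binary directly
namd2 +p31 +setcpuaffinity +isomalloc_sync step7.2_production_colvar.inp
# single-node launch through charmrun: +p31 becomes ++ppn 31 to the left of the binary
charmrun ++remote-shell ssh ++np 1 ++ppn 31 `which namd2` +setcpuaffinity +isomalloc_sync step7.2_production_colvar.inp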
With two nodes, in a few cases charmrun++ scales fairly well, but because of better alternatives, the prospects for worthwhile charmrun++ runs are few with this set of compute nodes.
On Pinnacle I, 2.14 charmrun++ with "++np 2" scaled well but was still hardly faster than single-node 2.15a1 namd2. Three nodes didn't scale well at all, so there's not really a good use case for charmrun++.
module load namd/2.14
#Pinnacle I Intel 6130  1 node  2.09 days/ns
charmrun ++remote-shell ssh ++np 1 ++ppn 31 `which namd2` +setcpuaffinity +isomalloc_sync step7.2_production_colvar.inp
#Pinnacle I Intel 6130  2 node  1.17 days/ns
charmrun ++remote-shell ssh ++np 2 ++ppn 31 `which namd2` +setcpuaffinity +isomalloc_sync step7.2_production_colvar.inp
#Pinnacle I             3 node  0.88 days/ns
charmrun ++remote-shell ssh ++np 3 ++ppn 31 `which namd2` +setcpuaffinity +isomalloc_sync step7.2_production_colvar.inp
On Pinnacle II, two-node charmrun++ didn't scale well, so again there is little use case for charmrun++.
module load namd/2.14
#Pinnacle II AMD 7543  2 node  0.69 days/ns
charmrun ++remote-shell ssh ++np 2 ++ppn 64 `which namd2` +setcpuaffinity +isomalloc_sync step7.24_production.inp
On Trestles, two-node 2.14 scaled well to about the same speed as one-node 2.14 namd2 on Pinnacle I. Three nodes did not scale well. So here there may be a use case for using an uncrowded cluster.
module load namd/2.14
#Trestles AMD  2 node  2.81 days/ns
charmrun ++remote-shell ssh ++np 2 ++ppn 64 `which namd2` +setcpuaffinity +isomalloc_sync step7.24_production.inp
#Trestles AMD  3 node  1.99 days/ns
charmrun ++remote-shell ssh ++np 3 ++ppn 64 `which namd2` +setcpuaffinity +isomalloc_sync step7.24_production.inp
=nodelist=
Instead of an mpirun hostfile/machinefile as generated by slurm, charmrun++ expects a file called nodelist that resembles this:
host tres0931
host tres0929
host tres0928
To convert the machinefile (generated by the system for each job) into a nodelist in the PWD, try
sed "s/^/host /" /scratch/${SLURM_JOB_ID}/machinefile_${SLURM_JOB_ID} > nodelist
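Putting the pieces together, here is a minimal two-node Trestles batch-script sketch. It assumes the per-job machinefile path shown above and reuses the measured charmrun command; the time limit and the --exclusive whole-node request are placeholders to adjust for your jobs:

#!/bin/bash
#SBATCH --nodes=2
#SBATCH --exclusive
#SBATCH --time=72:00:00
module load namd/2.14
# build the nodelist that charmrun expects from the per-job machinefile
sed "s/^/host /" /scratch/${SLURM_JOB_ID}/machinefile_${SLURM_JOB_ID} > nodelist
charmrun ++remote-shell ssh ++np 2 ++ppn 64 `which namd2` +setcpuaffinity +isomalloc_sync step7.24_production.inp > namd2.log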
Overall there is not a good use case for charmrun++ because there are better alternatives, except for utilizing the underused capacity of Trestles.
=GPU namd3=
Here we are using the number of CPU cores available on the node (24/32/64) and one GPU (two or more GPUs, "+devices 0,1,2,3", scale poorly and are not recommended or approved for the AHPCC public-use partitions). This benchmark simulation scaled significantly with the number of CPU cores used, up to the number of cores present. A different test simulation didn't really scale at all and "+p4" was best. It's not apparent to us from the input files why there is a difference. Test runs would be useful if you are going to do a lot of them.
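One way to do such test runs is to make a reduced copy of the production input (here called short_test.inp, a hypothetical copy with the step count cut down so each trial finishes in minutes) and time it at a few core counts:

module load namd/3.0a7
# short_test.inp is a hypothetical few-thousand-step copy of your production input
for NP in 4 8 16 32; do
    namd3 +p${NP} +setcpuaffinity +isomalloc_sync +devices 0 short_test.inp > bench_p${NP}.log
    grep "Benchmark time" bench_p${NP}.log
done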
On the gpu72 nodes with an Intel 6130 and a single NVIDIA V100, GPU namd3 is about 5 times faster than the best CPU version, so these nodes are a good use case. On agpu72 nodes with an AMD 7543 and a single A100, it's only about 10% faster than the 6130/V100, so the more expensive AMD/A100 nodes are not a good use case unless GPU memory requires the newer GPU. The even more expensive multi-GPU qgpu72 nodes also don't scale well over a single GPU and are not a good use case.
#gpu72/v100:
module load namd/3.0a7
namd3 +p32 +setcpuaffinity +isomalloc_sync +devices 0 step7.2_production_colvar.inp
Info: Benchmark time: 32 CPUs 0.0393942 s/step 0.227976 days/ns 0 MB memory
#agpu72/a100: not recommended unless memory requires it
namd3 +p64 +setcpuaffinity +isomalloc_sync +devices 0 step7.24_production.inp
Info: Benchmark time: 64 CPUs 0.0344332 s/step 0.199266 days/ns 0 MB memory
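For completeness, a single-GPU batch-script sketch for the gpu72 nodes. The --gres syntax for requesting the GPU and the time limit are assumptions to check against the current site documentation:

#!/bin/bash
#SBATCH --partition=gpu72
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=32
#SBATCH --gres=gpu:1        # GPU request syntax is site-dependent; verify locally
#SBATCH --time=72:00:00
module load namd/3.0a7
namd3 +p32 +setcpuaffinity +isomalloc_sync +devices 0 step7.2_production_colvar.inp > namd3.log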