Here is an update on NAMD for the shared-memory one-node versions (namd2/namd3) and the multi-node version launched with charmrun++. The standard NAMD benchmark apoa1 is too small to show scaling on a reasonably modern system, so here we use a user's lipid simulation, run for 25k steps until NAMD prints its "benchmark" performance.
Most NAMD 3 builds and some of the newer NAMD 2 builds are not usable on the CentOS 7 OS because they are compiled against too new a glibc. The exceptions we have found are 2.15alpha1 (CPU AVX-512) and 3.0-alpha7-GPU, the best-performing runnable versions we have found, but both are only available as "multicore" (single-node shared-memory) namd2/namd3 builds. For charmrun++, a verbs-smp edition such as 2.14 is indicated. Newer versions will be available after we reimage the cluster with the Rocky 8 OS.
We are using the number of cores available on the node, either "+p32" or "+p64", with one exception: NAMD recommends running on one fewer core than the hardware provides. We find that to be beneficial on the Intel nodes (hence "+p31" on Pinnacle I) and not beneficial on the AMD nodes, as reflected in the examples.
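As a sketch (not the exact commands we ran), the core count can be chosen automatically inside a job script. This assumes nproc reports the cores of the allocated node and that checking /proc/cpuinfo for the vendor string is sufficient to tell Intel from AMD:

# sketch: leave one core free on Intel nodes, use all cores on AMD nodes
CORES=$(nproc)
if grep -qi "GenuineIntel" /proc/cpuinfo; then
    NP=$((CORES - 1))
else
    NP=$CORES
fi
namd2 +p${NP} +setcpuaffinity +isomalloc_sync step7.2_production_colvar.inp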
The 2.14 verbs-smp version can be used with both namd2 and charmrun++. Pinnacle I is over twice as fast as Trestles on this version, and Pinnacle II is over twice as fast as Pinnacle I.
module load namd/2.14
#Pinnacle I Intel 6130   2.09 days/ns
namd2 +p31 +setcpuaffinity +isomalloc_sync step7.2_production_colvar.inp
#Pinnacle II AMD 7543    0.81 days/ns
namd2 +p64 +setcpuaffinity +isomalloc_sync step7.2_production_colvar.inp
#Trestles AMD            4.51 days/ns
namd2 +p32 +setcpuaffinity +isomalloc_sync step7.2_production_colvar.inp
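For reference, a minimal single-node batch-script sketch for the 2.14 namd2 run above. The time limit is a placeholder and the partition line is omitted; adjust both for your jobs:

#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=32      # Pinnacle I node: 32 cores; namd2 uses 31 per the note above
#SBATCH --time=72:00:00
module load namd/2.14
namd2 +p31 +setcpuaffinity +isomalloc_sync step7.2_production_colvar.inp > namd2.log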
The 2.15a1 AVX-512 version of namd2 runs only on Pinnacle I here, but for that case it is very much faster than 2.14.
module load namd/2.15a1
#Pinnacle I Intel 6130   1.24 days/ns
namd2 +p31 +setcpuaffinity +isomalloc_sync step7.2_production_colvar.inp
Single-node 2.14 charmrun++ with "++np 1", and the core count given as "++ppn ##" on the charmrun (left) side of the command instead of namd2's "+p##", should run equivalently to the same namd2 with the same core count.
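For example, on a Pinnacle I node these two invocations should perform about the same (same input file as above):

# shared-memory binary directly
namd2 +p31 +setcpuaffinity +isomalloc_sync step7.2_production_colvar.inp
# single-node launch through charmrun: +p31 becomes ++ppn 31 to the left of the binary
charmrun ++remote-shell ssh ++np 1 ++ppn 31 `which namd2` +setcpuaffinity +isomalloc_sync step7.2_production_colvar.inp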
With two nodes, in a few cases charmrun++ scales fairly well, but because of better alternatives, the prospects for worthwhile charmrun++ runs are few with this set of compute nodes.
On Pinnacle I, 2.14 charmrun++ with "++np 2" scaled well but was still hardly faster than single-node 2.15a1 namd2. Three nodes didn't scale well at all, so there's not really a good use case for charmrun++.
module load namd/2.14
#Pinnacle I Intel 6130  1 node  2.09 days/ns
charmrun ++remote-shell ssh ++np 1 ++ppn 31 `which namd2` +setcpuaffinity +isomalloc_sync step7.2_production_colvar.inp
#Pinnacle I Intel 6130  2 node  1.17 days/ns
charmrun ++remote-shell ssh ++np 2 ++ppn 31 `which namd2` +setcpuaffinity +isomalloc_sync step7.2_production_colvar.inp
#Pinnacle I             3 node  0.88 days/ns
charmrun ++remote-shell ssh ++np 3 ++ppn 31 `which namd2` +setcpuaffinity +isomalloc_sync step7.2_production_colvar.inp
On Pinnacle II, two-node charmrun++ didn't scale well, so again there is little use case for charmrun++.
module load namd/2.14
#Pinnacle II AMD 7543  2 node  0.69 days/ns
charmrun ++remote-shell ssh ++np 2 ++ppn 64 `which namd2` +setcpuaffinity +isomalloc_sync step7.24_production.inp
On Trestles, two-node 2.14 scaled well to about the same speed as one-node 2.14 namd2 on Pinnacle I. Three nodes did not scale well. So here there may be a use case for using an uncrowded cluster.
module load namd/2.14
#Trestles AMD  2 node  2.81 days/ns
charmrun ++remote-shell ssh ++np 2 ++ppn 64 `which namd2` +setcpuaffinity +isomalloc_sync step7.24_production.inp
#Trestles AMD  3 node  1.99 days/ns
charmrun ++remote-shell ssh ++np 3 ++ppn 64 `which namd2` +setcpuaffinity +isomalloc_sync step7.24_production.inp
=nodelist=
Instead of an mpirun hostfile/machinefile as generated by slurm, charmrun++ expects a file called nodelist that resembles this:
host tres0931
host tres0929
host tres0928
To convert the machinefile (generated by the system for each job) into a nodelist in the PWD, try
sed "s/^/host /" /scratch/${SLURM_JOB_ID}/machinefile_${SLURM_JOB_ID} > nodelist
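Putting the pieces together, here is a minimal two-node Trestles batch-script sketch. It assumes the per-job machinefile path shown above and reuses the measured charmrun command; the time limit and the --exclusive whole-node request are placeholders to adjust for your jobs:

#!/bin/bash
#SBATCH --nodes=2
#SBATCH --exclusive
#SBATCH --time=72:00:00
module load namd/2.14
# build the nodelist that charmrun expects from the per-job machinefile
sed "s/^/host /" /scratch/${SLURM_JOB_ID}/machinefile_${SLURM_JOB_ID} > nodelist
charmrun ++remote-shell ssh ++np 2 ++ppn 64 `which namd2` +setcpuaffinity +isomalloc_sync step7.24_production.inp > namd2.log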
Overall there is not a good use case for charmrun++ because there are better alternatives, except for utilizing the underused capacity of Trestles.
=GPU namd3=
Here we are using the number of CPU cores available on the node (24/32/64) and one GPU (two or more GPUs, "+devices 0,1,2,3", scale poorly and are not recommended or approved for the AHPCC public-use partitions). This benchmark simulation scaled significantly with the number of CPU cores used, up to the number of cores present. A different test simulation didn't really scale at all and "+p4" was best. It's not apparent to us from the input files why there is a difference. Test runs would be useful if you are going to do a lot of them.
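One way to do such test runs is to make a reduced copy of the production input (here called short_test.inp, a hypothetical copy with the step count cut down so each trial finishes in minutes) and time it at a few core counts:

module load namd/3.0a7
# short_test.inp is a hypothetical few-thousand-step copy of your production input
for NP in 4 8 16 32; do
    namd3 +p${NP} +setcpuaffinity +isomalloc_sync +devices 0 short_test.inp > bench_p${NP}.log
    grep "Benchmark time" bench_p${NP}.log
done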
On the gpu72 nodes with an Intel 6130 and a single NVIDIA V100, GPU namd3 is about 5 times faster than the best CPU version, so these nodes are a good use case. On agpu72 nodes with an AMD 7543 and a single A100, it's only about 10% faster than the 6130/V100, so the more expensive AMD/A100 nodes are not a good use case unless GPU memory requires the newer GPU. The even more expensive multi-GPU qgpu72 nodes also don't scale well over a single GPU and are not a good use case.
#gpu72/v100:
module load namd/3.0a7
namd3 +p32 +setcpuaffinity +isomalloc_sync +devices 0 step7.2_production_colvar.inp
Info: Benchmark time: 32 CPUs 0.0393942 s/step 0.227976 days/ns 0 MB memory
#agpu72/a100: not recommended unless memory requires it
namd3 +p64 +setcpuaffinity +isomalloc_sync +devices 0 step7.24_production.inp
Info: Benchmark time: 64 CPUs 0.0344332 s/step 0.199266 days/ns 0 MB memory
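For completeness, a single-GPU batch-script sketch for the gpu72 nodes. The --gres syntax for requesting the GPU and the time limit are assumptions to check against the current site documentation:

#!/bin/bash
#SBATCH --partition=gpu72
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=32
#SBATCH --gres=gpu:1        # GPU request syntax is site-dependent; verify locally
#SBATCH --time=72:00:00
module load namd/3.0a7
namd3 +p32 +setcpuaffinity +isomalloc_sync +devices 0 step7.2_production_colvar.inp > namd3.log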