namd 2023

Here is an update on NAMD performance for the shared-memory one-node versions namd2/namd3 and the multi-node version run via charmrun++. The standard NAMD benchmark apoa1 is too small to show scaling on a reasonably modern system, so here we use a user's lipid simulation, run for 25k steps until it prints its “benchmark” performance. Times below are quoted in days/ns, so lower is better.

Versions

Most NAMD 3 builds and a few of the newer NAMD 2 builds are not usable with the CentOS 7 OS because they were compiled against too new a glibc. The exceptions we have found are 2.15alpha1 (CPU AVX512) and 3.0-alpha7-GPU, which are the best-performing runnable versions we have found, but both are available only as “multicore” (single-node shared-memory) namd2/namd3 builds. For charmrun++, a verbs-smp edition such as 2.14 is needed. Newer versions will be available after we reimage the cluster with Rocky Linux 8.
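
To see which NAMD builds are installed on the cluster image you are logged into, the modules system can list them. A minimal check, assuming the usual module command syntax:

module avail namd        # list installed namd modules; availability varies by cluster image
module load namd/2.14    # then load the version appropriate for your run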

CPU

We use the number of cores available on the node, either “+p32” or “+p64”, with one exception: NAMD recommends running on one fewer core than the hardware provides. We find that beneficial on the Intel nodes but not on the AMD nodes, as reflected in the examples.

namd2 shared memory

The 2.14 verbs-smp version can be used with both namd2 and charmrun++. Pinnacle I is over twice as fast as Trestles on this version, and Pinnacle II is over twice as fast as Pinnacle I.

module load namd/2.14
#Pinnacle I Intel 6130 2.09 days/ns
namd2 +p31 +setcpuaffinity +isomalloc_sync step7.2_production_colvar.inp
#Pinnacle II AMD 7543 0.81 days/ns
namd2 +p64 +setcpuaffinity +isomalloc_sync step7.2_production_colvar.inp
#Trestles AMD 4.51 days/ns
namd2 +p32 +setcpuaffinity +isomalloc_sync step7.2_production_colvar.inp
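
For reference, here is a minimal Slurm batch-script sketch wrapping the single-node Pinnacle I command above. The job name, partition, and time limit are placeholders to adjust for your allocation:

#!/bin/bash
#SBATCH --job-name=namd-lipid          # placeholder job name
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=32           # all cores on a Pinnacle I node
#SBATCH --partition=comp06             # placeholder partition name; adjust to your queue
#SBATCH --time=06:00:00                # placeholder time limit

module purge
module load namd/2.14
cd $SLURM_SUBMIT_DIR
# +p31 leaves one core free, which helps on the Intel nodes (see above)
namd2 +p31 +setcpuaffinity +isomalloc_sync step7.2_production_colvar.inp >namd2.log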

The 2.15a1 AVX512 version with namd2 runs only on Pinnacle I here, but for that case it is very much faster than 2.14.

module load namd/2.15a1
#Pinnacle I Intel 6130 1.24 days/ns
namd2 +p31 +setcpuaffinity +isomalloc_sync step7.2_production_colvar.inp

charmrun++ running namd2

Single-node 2.14 charmrun++ with ++np 1 and ++ppn ## (given as a charmrun option, to the left of the namd2 binary) should run equivalently to the same namd2 run directly with +p set to the same ##.
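
For example, on one Pinnacle I node these two invocations should give about the same performance, using the same input file as above:

namd2 +p31 +setcpuaffinity +isomalloc_sync step7.2_production_colvar.inp
charmrun ++remote-shell ssh ++np 1 ++ppn 31 `which namd2` +setcpuaffinity +isomalloc_sync step7.2_production_colvar.inp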

With two nodes, charmrun++ scales fairly well in a few cases, but because better alternatives exist, there are few worthwhile charmrun++ runs on this set of compute nodes.

On Pinnacle I, 2.14 charmrun++ ++np 2 scaled well but was still hardly faster than single-node 2.15a1 namd2. Three nodes didn't scale well at all, so there's not really a good use case for charmrun++.

module load namd/2.14
#Pinnacle I Intel 6130 1 node 2.09 days/ns
charmrun ++remote-shell ssh ++np 1 ++ppn 31 `which namd2` +setcpuaffinity +isomalloc_sync step7.2_production_colvar.inp
#Pinnacle I Intel 6130 2 node 1.17 days/ns
charmrun ++remote-shell ssh ++np 2 ++ppn 31 `which namd2` +setcpuaffinity +isomalloc_sync step7.2_production_colvar.inp
#Pinnacle I 3 node 0.88 days/ns
charmrun ++remote-shell ssh ++np 3 ++ppn 31 `which namd2` +setcpuaffinity +isomalloc_sync step7.2_production_colvar.inp

On Pinnacle II, two-node charmrun++ didn't scale well, so again there is little use case for charmrun++.

module load namd/2.14
#Pinnacle II AMD 7543 2 node 0.69 days/ns
charmrun ++remote-shell ssh ++np 2 ++ppn 64 `which namd2` +setcpuaffinity +isomalloc_sync  step7.24_production.inp

On Trestles, two-node 2.14 scaled well to about the same speed as one-node 2.14 namd2 on Pinnacle I. Three nodes did not scale well. So here there may be a use case for using an uncrowded cluster.

module load namd/2.14
#Trestles AMD 2 node 2.81 days/ns
charmrun ++remote-shell ssh ++np 2 ++ppn 64 `which namd2` +setcpuaffinity +isomalloc_sync  step7.24_production.inp
#Trestles AMD 2 node 1.99 days/ns
charmrun ++remote-shell ssh ++np 2 ++ppn 64 `which namd2` +setcpuaffinity +isomalloc_sync  step7.24_production.inp

nodelist

Instead of an mpirun hostfile/machinefile as generated by Slurm, charmrun++ expects a file called nodelist that resembles this:

host tres0931
host tres0929
host tres0928

To convert the machinefile (generated by the system for each job) into a nodelist in the current working directory, try

cat /scratch/${SLURM_JOB_ID}/machinefile_${SLURM_JOB_ID} | sed "s/^/host /" >nodelist
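
Putting the pieces together, here is a sketch of a two-node Trestles charmrun++ job script, assuming Slurm and the machinefile location shown above. The partition name and time limit are placeholders, and ++ppn is set to the 32 cores of a Trestles node per the CPU notes above:

#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=32           # all cores on a Trestles node
#SBATCH --partition=tres72             # placeholder partition name; adjust to your queue
#SBATCH --time=06:00:00                # placeholder time limit

module purge
module load namd/2.14
cd $SLURM_SUBMIT_DIR
# build the charmrun nodelist from the system-generated machinefile
cat /scratch/${SLURM_JOB_ID}/machinefile_${SLURM_JOB_ID} | sed "s/^/host /" >nodelist
# one charmrun process group per node, 32 worker threads each
charmrun ++remote-shell ssh ++np 2 ++ppn 32 `which namd2` +setcpuaffinity +isomalloc_sync step7.24_production.inp >namd2.log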

Overall there is not a good use case for charmrun++ because there are better alternatives, except for utilizing the underused capacity of Trestles.

GPU

Here we use the number of CPU cores available on the node (24/32/64) and one GPU. Two or more GPUs (+devices 0,1,2,3) scale poorly and are neither recommended nor approved for AHPCC public-use partitions. This simulation takes about 3.5 GB of GPU memory; with a smaller simulation that takes about 400 MB of GPU memory, performance did not scale with the number of CPU cores and was best around +p4.

On the gpu72 nodes with Intel 6130 and a single NVidia V100, it is about 5 times faster than the best CPU version, so these are a good use case. On the agpu72 nodes with AMD 7543 and a single A100, it is only about 10% faster than the 6130/V100, so the more expensive AMD/A100 nodes are not a good use case unless the GPU memory requirement demands the newer GPU. The even more expensive multi-GPU qgpu72 nodes also scale poorly over a single GPU and are not a good use case.

#gpu72/v100:
module load namd/3.0a7
namd3 +p32 +setcpuaffinity +isomalloc_sync +devices 0 step7.2_production_colvar.inp
Info: Benchmark time: 32 CPUs 0.0393942 s/step 0.227976 days/ns 0 MB memory
#agpu72/a100: not recommended unless GPU memory requires it
namd3 +p64 +setcpuaffinity +isomalloc_sync +devices 0 step7.24_production.inp
Info: Benchmark time: 64 CPUs 0.0344332 s/step 0.199266 days/ns 0 MB memory
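
A similar batch sketch for the single-GPU gpu72 case, again assuming Slurm. The GPU request line and time limit are placeholders, since the exact GPU request syntax depends on the site's Slurm configuration:

#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=32           # all cores on a gpu72 node
#SBATCH --partition=gpu72
#SBATCH --gres=gpu:1                   # placeholder GPU request; syntax may differ at this site
#SBATCH --time=06:00:00                # placeholder time limit

module purge
module load namd/3.0a7
cd $SLURM_SUBMIT_DIR
namd3 +p32 +setcpuaffinity +isomalloc_sync +devices 0 step7.2_production_colvar.inp >namd3.log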