=====namd 2023=====
  
Here is an update on [[namd]], covering the shared-memory single-node versions ``namd2``/``namd3`` and the multi-node launcher ``charmrun++``.  The standard NAMD benchmark apoa1 is too small to show scaling on a reasonably modern system, so here we use a user's lipid simulation, run for 25k steps until it prints its "benchmark" performance.
  
====Versions====
Most of NAMD 3 and a few of the newer NAMD 2 builds are not usable under the CentOS 7 OS because they were compiled against too new a glibc.  The exceptions we have found are 2.15alpha1 (CPU, AVX-512) and 3.0-alpha7 (GPU), which are the best-performing runnable versions we have found, but both are available only as "multicore" (single-node shared-memory) ``namd2``/``namd3`` builds.  For ``charmrun++``, a verbs-smp edition such as 2.14 is needed.  Newer versions will be available after we reimage the cluster with the Rocky 8 OS.
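
If you are unsure whether a particular NAMD binary will run on a CentOS 7 node, a quick glibc check like the sketch below can tell you before a job fails; the binary path is only an example, so substitute the build you want to test.
<code>
# glibc provided by the node (CentOS 7 ships glibc 2.17)
ldd --version | head -1
# list any glibc versions the binary needs that this node cannot provide;
# replace the path with the NAMD build you want to test
ldd /path/to/NAMD_3.0/namd3 | grep "not found"
</code>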

===CPU===
We use the number of cores available on the node, either "+p32" or "+p64", with one exception:  NAMD recommends running on one fewer core than the hardware provides, and we find that beneficial on the Intel nodes but not on the AMD nodes, as reflected in the examples.

==namd2 shared memory==

The 2.14 ``verbs-smp`` version can be used with both ``namd2`` and ``charmrun++``.  Pinnacle I is over twice as fast as Trestles on this version, and Pinnacle II is over twice as fast as Pinnacle I.
  
<code>
module load namd/2.14
#Pinnacle I Intel 6130 2.09 days/ns
namd2 +p31 +setcpuaffinity +isomalloc_sync step7.2_production_colvar.inp
#Pinnacle II AMD 7543 0.81 days/ns
namd2 +p64 +setcpuaffinity +isomalloc_sync step7.2_production_colvar.inp
#Trestles AMD 4.51 days/ns
namd2 +p32 +setcpuaffinity +isomalloc_sync step7.2_production_colvar.inp
</code>
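
These commands can be run interactively or from a batch job.  A minimal Slurm batch script for the Pinnacle I run above might look like the following sketch; the partition name, time limit, and output redirection are examples, so adjust them for your own allocation.
<code>
#!/bin/bash
#SBATCH --job-name=namd-cpu
#SBATCH --partition=comp72        #Pinnacle I Intel 6130 nodes (example)
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=32      #reserve the whole node; namd2 itself is started with +p31
#SBATCH --time=72:00:00

module load namd/2.14
cd $SLURM_SUBMIT_DIR
namd2 +p31 +setcpuaffinity +isomalloc_sync step7.2_production_colvar.inp >namd.${SLURM_JOB_ID}.log
</code>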
  
The 2.15a1 AVX-512 build with ``namd2`` runs only on Pinnacle I here (the AMD processors lack AVX-512), but for that case it is very much faster than 2.14.

<code>
module load namd/2.15a1
#Pinnacle I Intel 6130 1.24 days/ns
namd2 +p31 +setcpuaffinity +isomalloc_sync step7.2_production_colvar.inp
</code>
  
==charmrun++ running namd2==

A single-node 2.14 run with ``charmrun ++np 1``, giving the core count as the ``charmrun`` option ``++ppn ##`` (to the left of the ``namd2`` binary) instead of ``+p ##``, should perform the same as plain ``namd2`` with the same core count.
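
For example, on a Pinnacle I node these two commands should give essentially the same performance; both use the same benchmark input as above.
<code>
module load namd/2.14
#plain shared-memory run
namd2 +p31 +setcpuaffinity +isomalloc_sync step7.2_production_colvar.inp
#equivalent single-node charmrun launch
charmrun ++remote-shell ssh ++np 1 ++ppn 31 `which namd2` +setcpuaffinity +isomalloc_sync step7.2_production_colvar.inp
</code>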

With two nodes ``charmrun++`` scales fairly well in a few cases, but because of the better alternatives the prospects for worthwhile ``charmrun++`` runs on this set of compute nodes are few.

On Pinnacle I, 2.14 ``charmrun++ ++np 2`` scaled well but was still hardly faster than single-node 2.15a1 ``namd2``.  Three nodes didn't scale well at all, so there's not really a good use case for ``charmrun++`` here.
<code>
module load namd/2.14
#Pinnacle I Intel 6130 1 node 2.09 days/ns
charmrun ++remote-shell ssh ++np 1 ++ppn 31 `which namd2` +setcpuaffinity +isomalloc_sync step7.2_production_colvar.inp
#Pinnacle I Intel 6130 2 node 1.17 days/ns
charmrun ++remote-shell ssh ++np 2 ++ppn 31 `which namd2` +setcpuaffinity +isomalloc_sync step7.2_production_colvar.inp
#Pinnacle I Intel 6130 3 node 0.88 days/ns
charmrun ++remote-shell ssh ++np 3 ++ppn 31 `which namd2` +setcpuaffinity +isomalloc_sync step7.2_production_colvar.inp
</code>

On Pinnacle II, two-node ``charmrun++`` didn't scale well, so again there is little use case for ``charmrun++``.

<code>
module load namd/2.14
#Pinnacle II AMD 7543 2 node 0.69 days/ns
charmrun ++remote-shell ssh ++np 2 ++ppn 64 `which namd2` +setcpuaffinity +isomalloc_sync step7.24_production.inp
</code>

On Trestles, two-node 2.14 scaled well, to about the same speed as one-node 2.14 ``namd2`` on Pinnacle I, while three nodes did not scale well.  So there may be a use case here for taking advantage of an otherwise uncrowded cluster.

<code>
module load namd/2.14
#Trestles AMD 2 node 1.99 days/ns
charmrun ++remote-shell ssh ++np 2 ++ppn 64 `which namd2` +setcpuaffinity +isomalloc_sync step7.24_production.inp
#Trestles AMD 3 node 2.81 days/ns
charmrun ++remote-shell ssh ++np 3 ++ppn 64 `which namd2` +setcpuaffinity +isomalloc_sync step7.24_production.inp
</code>

=nodelist=

Instead of an ``mpirun``-style hostfile/machinefile as generated by Slurm, ``charmrun++`` expects a file called ``nodelist`` that resembles this:
<code>
host tres0931
host tres0929
host tres0928
</code>

To convert the machinefile (generated by the system for each job) into a ``nodelist`` in the current working directory, try
<code>
cat /scratch/${SLURM_JOB_ID}/machinefile_${SLURM_JOB_ID} | sed "s/^/host /" >nodelist
</code>
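
Putting the pieces together, a two-node Trestles job could build the ``nodelist`` and launch ``charmrun`` as in the sketch below; the partition name, time limit, and per-node core count are examples, so adjust them to the nodes you are actually using.
<code>
#!/bin/bash
#SBATCH --partition=tres72        #Trestles nodes (example)
#SBATCH --nodes=2
#SBATCH --exclusive               #whole nodes, so all cores are available
#SBATCH --time=72:00:00

module load namd/2.14
cd $SLURM_SUBMIT_DIR
#build the charmrun nodelist from the system-generated machinefile
cat /scratch/${SLURM_JOB_ID}/machinefile_${SLURM_JOB_ID} | sed "s/^/host /" >nodelist
#charmrun reads ./nodelist from the current directory by default
#++np = number of processes (one per node here), ++ppn = cores used per node
charmrun ++remote-shell ssh ++np 2 ++ppn 64 `which namd2` +setcpuaffinity +isomalloc_sync step7.24_production.inp
</code>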

Overall there is not a good use case for ``charmrun++`` because there are better alternatives, except for utilizing the underused capacity of Trestles.

===GPU===

Here we use the number of CPU cores available on the node (24/32/64) and one GPU.  Two or more GPUs (``+devices 0,1,2,3``) scale poorly and are not recommended or approved for AHPCC public-use partitions.  This benchmark simulation scaled significantly with the number of CPU cores used, up to the number of cores present; a different test simulation didn't scale with cores at all and ``+p4`` was best.  It's not apparent to us from the input files why there is a difference, so test runs would be useful if you are going to do a lot of production runs.

On the ``gpu72`` nodes with an Intel 6130 and a single NVidia V100, the GPU build is about 5 times faster than the best CPU version, so these nodes are a good use case.  On ``agpu72`` nodes with an AMD 7543 and a single A100, it is only about 10% faster than the 6130/V100, so the more expensive AMD/A100 nodes are not a good use case unless GPU memory requires the newer GPU.  The even more expensive multi-GPU ``qgpu72`` nodes also don't scale well over a single GPU and are not a good use case.

<code>
#gpu72/v100:
module load namd/3.0a7
namd3 +p32 +setcpuaffinity +isomalloc_sync +devices 0 step7.2_production_colvar.inp
Info: Benchmark time: 32 CPUs 0.0393942 s/step 0.227976 days/ns 0 MB memory
#agpu72/a100: not recommended unless memory requires it
namd3 +p64 +setcpuaffinity +isomalloc_sync +devices 0 step7.24_production.inp
Info: Benchmark time: 64 CPUs 0.0344332 s/step 0.199266 days/ns 0 MB memory
</code>
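
A minimal Slurm script for the single-GPU ``gpu72`` case above might look like the sketch below; the GPU request line and time limit are examples, so check the local documentation for the exact syntax on your partition.
<code>
#!/bin/bash
#SBATCH --partition=gpu72         #Intel 6130 + single V100 nodes
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=32
#SBATCH --gres=gpu:1              #example GPU request; confirm the local syntax
#SBATCH --time=72:00:00

module load namd/3.0a7
cd $SLURM_SUBMIT_DIR
namd3 +p32 +setcpuaffinity +isomalloc_sync +devices 0 step7.2_production_colvar.inp >namd.${SLURM_JOB_ID}.log
</code>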
  