=====namd 2023=====
  
Here is an update on [[namd]] for the shared-memory one-node version ``namd2/namd3`` and the multi-node version ``charmrun++``. The standard NAMD benchmark apoa1 is too small to show the scaling on a reasonably modern system, so here we use a user's lipid simulation, run for 25k steps until it prints its "benchmark" performance.
  
====Versions====
Most of NAMD 3 and a few of the newer NAMD 2 builds are not usable under the CentOS 7 OS because they are compiled against too new a glibc. The exceptions we have found are 2.15alpha1 (CPU, AVX-512) and 3.0-alpha7 (GPU), which are also the best-performing runnable versions, but both come only as "multicore" (single-node shared memory) ``namd2/3`` builds. For ``charmrun++``, a verbs-smp edition such as 2.14 is indicated. Newer versions will be available after we reimage the cluster with the Rocky 8 OS.

===CPU===
We use the number of cores available on the node, either ``+p32`` or ``+p64``, with one exception: NAMD recommends running on one fewer core than the hardware provides, and we find that beneficial on the Intel nodes but not on the AMD nodes, as reflected in the examples below.
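
If you prefer not to hard-code the core count, a sketch along these lines works inside a whole-node Slurm job, assuming ``SLURM_CPUS_ON_NODE`` then equals the hardware core count (the one-fewer-core adjustment follows the Intel-node advice above):
<code>
# whole-node job assumed, so SLURM_CPUS_ON_NODE equals the hardware core count
NCORES=${SLURM_CPUS_ON_NODE}
# on the Intel nodes, leave one core free (skip this line on the AMD nodes)
NCORES=$(( NCORES - 1 ))
namd2 +p${NCORES} +setcpuaffinity +isomalloc_sync step7.2_production_colvar.inp
</code>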

==namd2 shared memory==

The 2.14 ``verbs-smp`` version can be used with both ``namd2`` and ``charmrun++``. Pinnacle I is over twice as fast as Trestles on this version, and Pinnacle II is over twice as fast as Pinnacle I.
  
<code>
module load namd/2.14
#Pinnacle I Intel 6130 2.09 days/ns
namd2 +p31 +setcpuaffinity +isomalloc_sync step7.2_production_colvar.inp
#Pinnacle II AMD 7543 0.81 days/ns
namd2 +p64 +setcpuaffinity +isomalloc_sync step7.2_production_colvar.inp
#Trestles AMD 4.51 days/ns
namd2 +p32 +setcpuaffinity +isomalloc_sync step7.2_production_colvar.inp
</code>
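
For reference, a minimal Slurm batch sketch for the single-node ``namd2`` run above; the partition name ``comp72`` and the time limit are assumptions to adjust for your allocation:
<code>
#!/bin/bash
#SBATCH --job-name=namd2-cpu
#SBATCH --partition=comp72       # assumed partition name
#SBATCH --nodes=1
#SBATCH --cpus-per-task=32
#SBATCH --time=72:00:00

module load namd/2.14
# Intel 6130 node: run on one fewer core than the hardware provides
namd2 +p31 +setcpuaffinity +isomalloc_sync step7.2_production_colvar.inp > namd2.log
</code>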
  
The 2.15a1 AVX-512 version with ``namd2`` runs only on Pinnacle I here, but for that case it is very much faster than 2.14.
<code>
module load namd/2.15a1
#Pinnacle I Intel 6130 1.24 days/ns
namd2 +p31 +setcpuaffinity +isomalloc_sync step7.2_production_colvar.inp
</code>
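
The 2.15a1 build needs AVX-512, which the AMD and Trestles nodes lack; a quick check on a node (plain Linux, nothing NAMD-specific) is:
<code>
# prints "avx512f" once if the CPU supports the AVX-512 foundation instructions
grep -o -m1 avx512f /proc/cpuinfo
</code>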
  
==charmrun++ running namd2==

Single-node 2.14 ``charmrun++ ++np 1``, with the core count given as ``++ppn ##`` on the charmrun side of the command instead of ``+p##`` on the namd side, should run equivalently to plain ``namd2`` with the same core count.
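
For example, these two commands should give essentially the same throughput on a Pinnacle I node (``++local``, which keeps charmrun on the current node without ssh, is our assumption for a single-node test):
<code>
module load namd/2.14
# shared-memory binary, 31 worker threads
namd2 +p31 +setcpuaffinity +isomalloc_sync step7.2_production_colvar.inp
# the same run launched through charmrun: one process with 31 threads
charmrun ++local ++np 1 ++ppn 31 `which namd2` +setcpuaffinity +isomalloc_sync step7.2_production_colvar.inp
</code>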

With two nodes, ``charmrun++`` scales fairly well in a few cases, but because better alternatives exist, the prospects for worthwhile ``charmrun++`` runs with this set of compute nodes are few.

On Pinnacle I, 2.14 ``charmrun++ ++np 2`` scaled well but was still hardly faster than single-node 2.15a1 ``namd2``. Three nodes didn't scale well at all, so there's not really a good use case for ``charmrun++``.
<code>
module load namd/2.14
#Pinnacle I Intel 6130 1 node 2.09 days/ns
charmrun ++remote-shell ssh ++np 1 ++ppn 31 `which namd2` +setcpuaffinity +isomalloc_sync step7.2_production_colvar.inp
#Pinnacle I Intel 6130 2 node 1.17 days/ns
charmrun ++remote-shell ssh ++np 2 ++ppn 31 `which namd2` +setcpuaffinity +isomalloc_sync step7.2_production_colvar.inp
#Pinnacle I 3 node 0.88 days/ns
charmrun ++remote-shell ssh ++np 3 ++ppn 31 `which namd2` +setcpuaffinity +isomalloc_sync step7.2_production_colvar.inp
</code>
  
On Pinnacle II, two-node ``charmrun++`` didn't scale well, so again there is little use case for ``charmrun++``.

<code>
module load namd/2.14
#Pinnacle II AMD 7543 2 node 0.69 days/ns
charmrun ++remote-shell ssh ++np 2 ++ppn 64 `which namd2` +setcpuaffinity +isomalloc_sync step7.24_production.inp
</code>

On Trestles, two-node 2.14 scaled well, to about the same speed as one-node 2.14 ``namd2`` on Pinnacle I. Three nodes did not scale well. So here there may be a use case for using an uncrowded cluster.

<code>
module load namd/2.14
#Trestles AMD 2 node 2.81 days/ns
charmrun ++remote-shell ssh ++np 2 ++ppn 64 `which namd2` +setcpuaffinity +isomalloc_sync step7.24_production.inp
#Trestles AMD 3 node 1.99 days/ns
charmrun ++remote-shell ssh ++np 3 ++ppn 64 `which namd2` +setcpuaffinity +isomalloc_sync step7.24_production.inp
</code>

=nodelist=

Instead of an ``mpirun``-style hostfile/machinefile as generated by Slurm, ``charmrun++`` expects a file called ``nodelist`` in the working directory that resembles this:
<code>
host tres0931
host tres0929
host tres0928
</code>

To convert the machinefile (generated by the system for each job) into a ``nodelist`` in the current working directory, try
<code>
cat /scratch/${SLURM_JOB_ID}/machinefile_${SLURM_JOB_ID} | sed "s/^/host /" >nodelist
</code>
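
Putting the pieces together, a minimal two-node Trestles batch sketch using the commands above (the partition name ``tres72`` and the time limit are assumptions; ``charmrun`` should pick up ``./nodelist`` from the working directory):
<code>
#!/bin/bash
#SBATCH --job-name=namd2-charmrun
#SBATCH --partition=tres72       # assumed partition name
#SBATCH --nodes=2
#SBATCH --time=72:00:00

module load namd/2.14
# convert the system machinefile to the nodelist format charmrun expects
cat /scratch/${SLURM_JOB_ID}/machinefile_${SLURM_JOB_ID} | sed "s/^/host /" >nodelist
charmrun ++remote-shell ssh ++np 2 ++ppn 64 `which namd2` +setcpuaffinity +isomalloc_sync step7.24_production.inp
</code>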

Overall there is not a good use case for ``charmrun++`` because there are better alternatives, except for utilizing the underused capacity of Trestles.

==GPU==

Here we are using the number of CPU cores available on the node (24/32/64) and one GPU (two or more GPUs, ``+devices 0,1,2,3``, scale poorly and are not recommended or approved for AHPCC public-use partitions). This benchmark simulation scaled significantly with the number of CPU cores used, up to the number of cores present. A different test simulation didn't really scale at all, and ``+p4`` was best. It's not apparent to us from the input files why there is a difference. Test runs would be useful if you are going to do a lot of them.

On the ``gpu72`` nodes with an Intel 6130 and a single NVIDIA V100, this is about 5 times faster than the best CPU version, so these are a good use case. On ``agpu72`` nodes with an AMD 7543 and a single A100, it's only about 10% faster than the 6130/V100, so that's not a good use case for the more expensive AMD/A100 nodes unless GPU memory requires the newer GPU. The even more expensive multi-GPU ``qgpu72`` nodes also don't scale well over a single GPU and are not a good use case.

<code>
#gpu72/v100:
module load namd/3.0a7
namd3 +p32 +setcpuaffinity +isomalloc_sync +devices 0 step7.2_production_colvar.inp
Info: Benchmark time: 32 CPUs 0.0393942 s/step 0.227976 days/ns 0 MB memory
#agpu72/a100: not recommended unless memory requires it
namd3 +p64 +setcpuaffinity +isomalloc_sync +devices 0 step7.24_production.inp
Info: Benchmark time: 64 CPUs 0.0344332 s/step 0.199266 days/ns 0 MB memory
</code>
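
A minimal batch sketch for the recommended single-GPU case follows; the ``gpu72`` partition name is from above, while the ``--gres`` line is an assumption about how GPUs are requested:
<code>
#!/bin/bash
#SBATCH --job-name=namd3-gpu
#SBATCH --partition=gpu72
#SBATCH --nodes=1
#SBATCH --cpus-per-task=32
#SBATCH --gres=gpu:1             # assumed GPU request syntax
#SBATCH --time=72:00:00

module load namd/3.0a7
namd3 +p32 +setcpuaffinity +isomalloc_sync +devices 0 step7.2_production_colvar.inp > namd3.log
</code>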
  