
Resource Selection

Selecting an appropriate partition is required when using AHPCC resources. Some latitude is given when exact resource needs are not yet known because a similar job has not been run before. The purpose of this policy is to reserve expensive resources such as GPU nodes and high-memory nodes for programs that require those capabilities, and to run other programs on the less expensive resources for which they are suited, thus producing the most computing output per dollar.

comp partitions

The most numerous, and most crowded, computing resources are the comp01, comp06, and comp72 partitions, which are overlaid on mostly the same set of about 50 compute nodes. You can use the program shown below to search for idle nodes in every partition. All nodes in the comp and cloud partitions are identical: dual Intel Gold 6130 with no GPU, 32 cores, and 192 GB of memory.

If your code does not use a GPU and is either (1) MPI or shared-memory parallel and able to use 32 cores, or (2) uses over 100 GB of main memory with 1 to 32 cores, you may use the comp partitions. These partitions are popular and often full. comp01, comp06, and comp72 have time limits of 1, 6, and 72 hours respectively. The advantage of the shorter time limits is queue priority: comp06 has a higher priority than comp72, and comp01 a higher priority than comp06, so shorter jobs start faster. The partition time limits are hard limits: comp01 will terminate a job after 1 hour whether it has finished or not.

At the time this example was run, only one idle node was available:

/share/apps/bin/partition-status.sh comp01 | grep idle
c1401 idled public,0gpu,192gb,i6130,avx512,32c,intel
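A job matching these criteria might be submitted with a batch script along these lines (a sketch only; the job name, module setup, and program name `./my_mpi_program` are placeholders, not AHPCC-specific):

```shell
#!/bin/bash
#SBATCH --job-name=mpi_job       # illustrative name
#SBATCH --partition=comp01       # 1-hour hard limit, highest comp priority
#SBATCH --nodes=1
#SBATCH --ntasks=32              # use all 32 cores of a comp node
#SBATCH --time=01:00:00          # cannot exceed the partition limit

# launch a 32-rank MPI program (./my_mpi_program is a placeholder)
mpirun -np 32 ./my_mpi_program
```

If the job may run longer than one hour, submit to comp06 or comp72 instead and raise `--time` accordingly.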
acomp partition

There is a very small non-GPU partition, acomp06, currently containing one 64-core AMD machine and limited to 6 hours. Its machines are two to three times as fast as comp nodes for scalable programs and have six times the memory. If your program can't use more than a 32-core, 192 GB comp node offers, don't submit to acomp06.

/share/apps/bin/partition-status.sh acomp06
c2111 idled public,0gpu,1024gb,a7543,avx2,64c,amd
cloud partition

If your code does not use a GPU and uses 1-4 cores and 0-10 GB of main memory, use the cloud72 partition. This partition is usually immediately available. In the listing below, alloc indicates no cores are available, idled all cores available, and mixed some cores available. cloud72 is usually the best partition for simple tasks such as compiling, moving files, and non-compute-intensive, non-parallel R, MATLAB, and Python.
The frontend/login virtual machines are very low-powered and suitable only for submitting jobs, editing, and viewing output. Do not run any kind of computational task on them.

/share/apps/bin/partition-status.sh cloud72
c1331 mixed public,0gpu,192gb,i6130,avx512,32c,intel,cloud
c1601 alloc public,0gpu,192gb,i6130,avx512,32c,intel,cloud
c1602 idled public,0gpu,192gb,i6130,avx512,32c,intel,cloud
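The state field in this listing can also be checked from a script; a minimal sketch that counts idled nodes, using the sample lines above as input:

```shell
# Count nodes whose state field is "idled" (sample lines copied from above).
idle_count=$(printf '%s\n' \
  'c1331 mixed public,0gpu,192gb,i6130,avx512,32c,intel,cloud' \
  'c1601 alloc public,0gpu,192gb,i6130,avx512,32c,intel,cloud' \
  'c1602 idled public,0gpu,192gb,i6130,avx512,32c,intel,cloud' \
  | awk '$2 == "idled" {n++} END {print n+0}')
echo "$idle_count"
```

The same pipeline applied to live `partition-status.sh` output shows at a glance whether a partition has free nodes.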
himem partitions

There is a small partition of high-memory (768 GB) Intel computers called himem06 and himem72. If your program can run on a 192 GB comp machine, use that instead. When running new programs, you can use himem once to find out what the memory usage is. The himem nodes are the same architecture as the comp nodes, but have only 24 cores at a higher clock frequency, which better suits poorly scaling codes such as bioinformatics and COMSOL.

/share/apps/bin/partition-status.sh himem72
c1422 alloc public,0gpu,768gb,i6128,24c,intel
c1424 alloc public,0gpu,768gb,i6128,24c,intel
c1425 alloc public,0gpu,768gb,i6128,24c,intel
c1426 alloc public,0gpu,768gb,i6128,24c,intel
c1428 alloc public,0gpu,768gb,i6128,24c,intel
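After one trial run on himem, the job's actual peak memory can be checked through SLURM accounting (a sketch, assuming accounting is enabled on the cluster; JOBID is a placeholder for your job number):

```shell
# Show peak resident memory (MaxRSS) of a completed job.
sacct -j JOBID --format=JobID,MaxRSS,Elapsed,State
```

If MaxRSS comes in well under 192 GB, subsequent runs of that program belong on the comp partitions rather than himem.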
gpu partitions

If your code does use an NVIDIA GPU, you may use the gpu72, agpu72, or qgpu72 partitions. The most numerous are the gpu72 nodes, which are similar to the comp nodes except that each has a single V100 GPU.

/share/apps/bin/partition-status.sh gpu72 | grep idle
c1711 idled public,1v100,192gb,i6130,avx512,32c,intel
c1713 idled public,1v100,192gb,i6130,avx512,32c,intel

The next most numerous, agpu72 nodes, are similar to acomp nodes but have a single (PCIe) A100 GPU. The least numerous, qgpu72 nodes, are similar to agpu72 nodes except that they have four A100 GPUs connected by NVLink. qgpu72 nodes are primarily intended for data analysis and AI, and are not eligible to run programs that can only utilize a single GPU. They are also not eligible to run programs that utilize multiple GPUs poorly (as determined by AHPCC), such as VASP and NAMD.

/share/apps/bin/partition-status.sh agpu72 | grep idle
c2005 idled public,1a100,1024gb,a7543,avx2,64c,amd
c2006 idled public,1a100,1024gb,a7543,avx2,64c,amd
c2007 idled public,1a100,1024gb,a7543,avx2,64c,amd
c2110 idled public,1a100,1024gb,a7543,avx2,64c,amd
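A GPU is requested with a gres specification in the batch script; a hedged sketch for a single-GPU job on gpu72 (the program name `./my_gpu_program` is a placeholder):

```shell
#!/bin/bash
#SBATCH --partition=gpu72
#SBATCH --gres=gpu:1             # each gpu72 node has one V100
#SBATCH --ntasks=8
#SBATCH --time=12:00:00          # within the 72-hour limit

# run a GPU program (./my_gpu_program is a placeholder)
./my_gpu_program
```

A multi-GPU code that scales well could instead target qgpu72 with `--gres=gpu:4`, subject to the eligibility rules above.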
resource_selection.1706044376.txt.gz · Last modified: 2024/01/23 21:12 by root