===Resource_Selection===

Selecting an appropriate resource for your program is **required** when using AHPCC resources. The purpose of this policy is to reserve very expensive resources, such as GPU nodes and high-memory nodes, for programs that require those capabilities, and to run other programs on the less expensive resources for which they are suited, thus producing the most computing throughput per dollar. Some latitude is given when the exact resource requirements are not yet known because a similar job has not been run before.

This page details the ``public`` partitions, which are available to all researchers eligible to use AHPCC. In each ``public`` partition, every computer is identical, so the partition itself specifies the configuration. A **second page** [[condo_nodes]] details the (currently 164) researcher-funded compute nodes, which can be accessed for **either** (1) dedicated use by those researchers in the ``condo`` partition, **or** (2) limited-time ``public`` usage in the ``pcon06`` partition. Both uses require some additional specification to select the correct nodes out of the (currently 30) node configurations.

==login nodes==

When you first log in to AHPCC, you will be connected to a ``login node``, ``frontend node``, or ``portal node``, depending on whether you used ``ssh`` or the ``https`` OpenOnDemand portal. All are low-powered and suitable only for starting graphical sessions, submitting jobs, editing, and viewing output. **Do not run any kind of computational task** on them. Submit a portal job, a batch script, or an ``srun`` session to do intensive computing. Those sessions run via ``Slurm`` partitions on one or more of our approximately 400 compute nodes.

==comp partitions==

The most popular (and hence the slowest-starting) computing resources are the ``comp01``, ``comp06``, and ``comp72`` partitions, which are overlaid on mostly the same set of about 50 compute nodes and differ only by time limit. You can use the program shown below to search for idle nodes in any partition. All nodes in the ``comp`` and ``cloud`` partitions are identical: dual Intel Gold 6130 with no GPU, 32 cores, and 192 GB of memory.

If your code **does not use a GPU** and is **either** (1) **MPI or shared-memory parallel and able to use 32 cores**, or (2) **uses over 100 GB of main memory with 1 to 32 cores**, you **may use** the ``comp`` partitions. These partitions are popular and often full. ``comp01``, ``comp06``, and ``comp72`` have time limits of 1, 6, and 72 hours respectively. The advantage of the shorter time limits is that ``comp06`` has a higher queue priority than ``comp72``, and ``comp01`` has a higher queue priority than ``comp06``, enabling jobs to start sooner. The partition time limits are hard limits: ``comp01`` will terminate a job after 1 hour whether it is finished or not. At the time this program was run, only one idle node was available:

<code>
/share/apps/bin/partition-status.sh comp01 | grep idle
c1401 idled public,0gpu,192gb,i6130,avx512,32c,intel
</code>
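For reference, a batch script for a full ``comp`` node might look like the minimal sketch below. The module name and program name are placeholders rather than actual AHPCC module names; substitute whatever your code needs.

<code bash>
#!/bin/bash
#SBATCH --partition=comp72        # or comp01/comp06 for higher queue priority if the job is short
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=32      # comp nodes have 32 cores; use them all
#SBATCH --time=72:00:00           # hard limit for comp72

# "openmpi" and "./my_mpi_program" are placeholder names; load the modules your code actually needs
module load openmpi
srun ./my_mpi_program
</code>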
==acomp partition==

There is a very small non-GPU partition, ``acomp06``, with (currently) one 64-core AMD machine, limited to 6 hours. These machines are two to three times as fast as ``comp`` nodes for scalable programs and have six times the memory. If **your program can't use more than a 32-core, 192 GB** ``comp`` node, don't submit to ``acomp06``.

<code>
/share/apps/bin/partition-status.sh acomp06
c2111 idled public,0gpu,1024gb,a7543,avx2,64c,amd
</code>

==cloud partition==

If your code **does not use a GPU** and uses **1-4 cores and 0-10 GB** of main memory, use the ``cloud72`` partition. This partition is usually immediately available; in the output below, ``alloc`` indicates no cores are available, ``idled`` that all cores are available, and ``mixed`` that some cores are available. ``cloud72`` **is usually the best partition** for simple tasks such as compiling, moving files, and non-compute-intensive, non-parallel ``R``, ``matlab``, and ``python``.

<code>
/share/apps/bin/partition-status.sh cloud72
c1331 mixed public,0gpu,192gb,i6130,avx512,32c,intel,cloud
c1601 alloc public,0gpu,192gb,i6130,avx512,32c,intel,cloud
c1602 idled public,0gpu,192gb,i6130,avx512,32c,intel,cloud
</code>

==himem partitions==

There are two small partitions of high-memory (768 GB) Intel computers, ``himem06`` and ``himem72``. If your program can run on a 192 GB ``comp`` node, **use that instead**. When running new programs, you can use ``himem`` once to find out what the memory usage is. The ``himem`` nodes are the same architecture as the ``comp`` nodes, but have only 24 cores at a higher frequency, to better run poorly scaling codes that require large shared memory, such as bioinformatics programs and ``comsol``.

<code>
/share/apps/bin/partition-status.sh himem72
c1422 alloc public,0gpu,768gb,i6128,24c,intel
c1424 alloc public,0gpu,768gb,i6128,24c,intel
c1425 alloc public,0gpu,768gb,i6128,24c,intel
c1426 alloc public,0gpu,768gb,i6128,24c,intel
c1428 alloc public,0gpu,768gb,i6128,24c,intel
</code>

==gpu partitions==

If your code **uses an NVIDIA GPU**, you may use the ``gpu72``, ``agpu72``, or ``qgpu72`` partitions. The most numerous are the ``gpu72`` nodes, which are similar to ``comp`` nodes except that they also have a single V100 GPU. Non-GPU programs submitted to GPU partitions are subject to immediate cancellation.

<code>
/share/apps/bin/partition-status.sh gpu72 | grep idle
c1711 idled public,1v100,192gb,i6130,avx512,32c,intel
c1713 idled public,1v100,192gb,i6130,avx512,32c,intel
</code>

The next most numerous, the ``agpu72`` nodes, are similar to ``acomp`` nodes but have a single (PCI) A100 GPU. The least numerous, the ``qgpu72`` nodes, are similar to ``agpu72`` nodes except that they have four A100 GPUs connected by NVLink. ``qgpu72`` nodes are primarily intended for data analysis and AI, and are **not eligible** to run programs that can only utilize a single GPU. They are also **not eligible** to run programs that utilize multiple GPUs poorly (as determined by AHPCC), such as ``vasp`` and ``namd``.

<code>
/share/apps/bin/partition-status.sh agpu72 | grep idle
c2005 idled public,1a100,1024gb,a7543,avx2,64c,amd
c2006 idled public,1a100,1024gb,a7543,avx2,64c,amd
c2007 idled public,1a100,1024gb,a7543,avx2,64c,amd
c2110 idled public,1a100,1024gb,a7543,avx2,64c,amd
</code>
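As a minimal sketch of a single-GPU batch job on ``gpu72`` (the module name, program name, and the ``--gres`` line are assumptions about the local Slurm configuration, not confirmed AHPCC settings):

<code bash>
#!/bin/bash
#SBATCH --partition=gpu72
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=8
#SBATCH --gres=gpu:1              # assumption: needed only if GRES scheduling is enabled on this cluster
#SBATCH --time=72:00:00

# "cuda" and "./my_gpu_program" are placeholder names
module load cuda
./my_gpu_program
</code>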
==tres288 partition==

Trestles is a cluster of about 75 functionally obsolete compute nodes whose major virtue is that they are usually available immediately and in quantity. Trestles can be useful for test jobs and for parameter sweeps that are not individually compute-intensive. Each of the 32 cores per node runs at about half or less of the speed of a core on the more modern Intel compute nodes. Partition ``tres288`` can run for up to 288 hours, and the original partition ``tres72`` has been updated to the same maximum time, so it is functionally identical to ``tres288``, with the original name retained for compatibility. Programs that use ``sse4`` or ``avx`` floating point or above won't work on Trestles. We don't care about efficient usage of Trestles, so there are few limitations.

<code>
/share/apps/bin/partition-status.sh tres288 | tail -1
t0931 idled public,0gpu,64gb,a6136,sse4,32c,amd
</code>
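Since Trestles nodes are usually available in quantity, a Slurm job array is a convenient way to run a parameter sweep there. The sketch below assumes a hypothetical program ``./sweep`` that takes a case number as its only argument:

<code bash>
#!/bin/bash
#SBATCH --partition=tres288
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --time=288:00:00          # hard limit for tres288
#SBATCH --array=1-20              # 20 independent cases, one per array task

# ./sweep is a hypothetical program; each array task runs one case
./sweep ${SLURM_ARRAY_TASK_ID}
</code>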