Arkansas High Performance Computing Center [hpcwiki]

How to use the Pinnacle Cluster

This is a brief “how to” summary of usage for users of the Pinnacle cluster.

Pinnacle has 101 compute nodes. 30 GPU and GPU-ready nodes are Dell R740, 69 nodes are Dell R640, two nodes are Dell R7425. There is no user-side difference between R740 (GPU-ready) and R640 nodes.

All-user nodes number 76, of which 6 nodes have 768 GB of memory and no GPU (himem.. partition), 19 nodes have 192 GB and one V100 GPU (gpu.. partition), and 51 are standard compute nodes with 192 GB and no GPU (comp.. partition).

Standard nodes have two Gold 6130 CPUs with a total of 32 cores at 2.1 GHz. The 768 GB nodes have two Gold 6126 CPUs with a total of 24 cores at 2.6 GHz: fewer, faster cores for better performance on bioinformatics applications, which are often poorly threaded.

ssh to pinnacle.uark.edu redirects to one of two servers running 7 virtual login machines each, named pinnacle-l1 through pinnacle-l14. If there is a login problem you can try another ssh session and you will be assigned a different virtual machine, which may solve the problem.

Scheduler

All systems now use the slurm scheduler. Queues (slurm “partitions”) are:

comp72/06/01:     standard compute nodes, 72/6/1 hour limit, 42/46/48 nodes
gpu72/06:         gpu nodes: 72/6 hour limit, 19 nodes
agpu72/06:        a100 gpu nodes: 72/6 hour limit
himem72/06:       768 GB nodes, 72/6 hour limit, 6 nodes
pubcondo06:       condo nodes all-user use, 6 hour limit, various constraints required, 25 nodes
pcon06:           same as pubcondo06, shortened name for easier printout, use this going forward
cloud72:          virtual machines and containers, usually single processor, 72 hour limit, 3 nodes
condo:            condo nodes, no time limit, authorization required, various constraints required, 25 nodes
tres72/06:        reimaged trestles nodes, 72/06 hour limit, 126 nodes
razr72/06:        reimaged razor nodes, 72/6 hour limit, in progress

Transition from Torque/PBS

Basic slurm commands are shown, with transition from Torque/PBS/Maui. Compatibility commands are installed in slurm for qsub/qstat/qstat -u/qdel/qstat -q so those commands may still be used.

sbatch                      qsub                        submit <job file>
srun                        qsub -I                     submit interactive job
squeue                      qstat                       list all queued jobs
squeue -u rfeynman          qstat -u rfeynman           list queued jobs for user rfeynman
scancel                     qdel                        cancel <job#>
sinfo                       shownodes -l -n;qstat -q    node status;list of queues

See also [ https://hprc.tamu.edu/wiki/TAMU_Supercomputing_Facility:HPRC:Batch_Translation ] [ https://slurm.schedmd.com/rosetta.pdf ] [ https://www.sdsc.edu/~hocks/FG/PBS.slurm.html ]

We have a conversion script, /share/apps/bin/pbs2slurm.sh, which should handle about 95% of the script conversion. Please report any errors the script makes so we can improve it. Here is an example conversion from PBS to SLURM:

pinnacle-l1:$ cat pbsscript.sh
#PBS -N espresso
#PBS -j oe
#PBS -o zzz.$PBS_JOBID
#PBS -l nodes=4:ppn=32,walltime=00:00:10
#PBS -q q06h32c
module purge
module load intel/14.0.3 mkl/14.0.3 fftw/3.3.6 impi/5.1.2
cd $PBS_O_WORKDIR
cp *.in *UPF /scratch/$PBS_JOBID
cd /scratch/$PBS_JOBID
sort -u $PBS_NODEFILE >hostfile
mpirun -ppn 16 -hostfile hostfile -genv OMP_NUM_THREADS 4 -genv MKL_NUM_THREADS 4 /share/apps/espresso/qe-6.1-intel-mkl-impi/bin/pw.x -npools 1 <ausurf.in
mv ausurf.log *mix* *wfc* *igk* $PBS_O_WORKDIR/


pinnacle-l1:$ pbs2slurm.sh pbsscript.sh  >slurmscript.sh

pinnacle-l1:$ cat slurmscript.sh
#!/bin/bash
#SBATCH --job-name=espresso
#SBATCH --output=zzz.slurm
#SBATCH --nodes=4
#SBATCH --tasks-per-node=32
#SBATCH --time=00:00:10
#SBATCH --partition comp06
module purge
module load intel/14.0.3 mkl/14.0.3 fftw/3.3.6 impi/5.1.2
cd $SLURM_SUBMIT_DIR
cp *.in *UPF /scratch/$SLURM_JOB_ID
cd /scratch/$SLURM_JOB_ID
sort -u /scratch/${SLURM_JOB_ID}/machinefile_${SLURM_JOB_ID} >hostfile
mpirun -ppn 16 -hostfile hostfile -genv OMP_NUM_THREADS 4 -genv MKL_NUM_THREADS 4 /share/apps/espresso/qe-6.1-intel-mkl-impi/bin/pw.x -npools 1 <ausurf.in
mv ausurf.log *mix* *wfc* *igk* $SLURM_SUBMIT_DIR/
tres-l1:$

Another sample slurm script:

#!/bin/bash
#SBATCH --partition comp06
#SBATCH --nodes=2
#SBATCH --tasks-per-node=32
#SBATCH --time=6:00:00
cd $SLURM_SUBMIT_DIR
module load intel/18.0.1 impi/18.0.1 mkl/18.0.1
mpirun -np $SLURM_NTASKS -machinefile /scratch/${SLURM_JOB_ID}/machinefile_${SLURM_JOB_ID} ./mympiexe -inputfile MA4um.mph -outputfile MA4um-output.mph

Notes:

A leading hash-bang line (#!/bin/sh, #!/bin/bash, or #!/bin/tcsh) is optional in torque but required in slurm; pbs2slurm.sh inserts it if not present.

Slurm does not autogenerate an MPI hostfile/machinefile like torque. We have the prologue automatically generate this as:

/scratch/${SLURM_JOB_ID}/machinefile_${SLURM_JOB_ID}

The generated machinefile differs from the torque machinefile in that it has 1 entry per host instead of ncores entries per host. Slurm defines the variables $SLURM_NTASKS and $SLURM_CPUS_PER_TASK. These are normally set by the job request: the job runs $SLURM_NTASKS MPI tasks, each with $SLURM_CPUS_PER_TASK OpenMP threads, and the product of tasks per node and threads per task usually equals the number of cores in the node.
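
Some programs and older scripts expect the torque layout with one line per MPI task. The following is a minimal sketch of expanding the one-entry-per-host machinefile, assuming awk is available on the compute nodes. The function name expand_machinefile and the output file name hostfile.percore are illustrative and are not part of the cluster setup.

```shell
#!/bin/bash
# expand_machinefile FILE N
# Print every host listed in FILE N times, one per line (torque-style layout).
# Hypothetical helper for illustration; not installed on the cluster.
expand_machinefile() {
    awk -v n="$2" 'NF { for (i = 0; i < n; i++) print $1 }' "$1"
}

# Inside a job only: N is the number of MPI tasks wanted on each node.
# SLURM_NTASKS_PER_NODE is set when --ntasks-per-node is given in the request.
if [ -n "${SLURM_JOB_ID:-}" ]; then
    expand_machinefile "/scratch/${SLURM_JOB_ID}/machinefile_${SLURM_JOB_ID}" \
        "${SLURM_NTASKS_PER_NODE:-1}" > hostfile.percore
fi
```

The expanded file can then be passed to mpirun with -machinefile in place of the generated one, if the MPI launcher in use counts lines to decide how many tasks to place on each host.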

Interactive Jobs in SLURM

srun --nodes=1 --ntasks-per-node=1  --cpus-per-task=32 --partition gpu06 --time=6:00:00 --pty /bin/bash

Another script:

#!/bin/bash
#SBATCH --partition condo
#SBATCH --constraint=nvme
#SBATCH --nodes=1
#SBATCH --tasks-per-node=32
#SBATCH --time=144:00:00
#SBATCH --job-name=MOLPRO_lscr
cd $SLURM_SUBMIT_DIR
cp $SLURM_SUBMIT_DIR/mpr*inp /local_scratch/$SLURM_JOB_ID/
cd /local_scratch/$SLURM_JOB_ID
module load mkl/18.0.2 intel/18.0.2 impi/18.0.2
/home/trr007/molpro/molprop_2015_1_linux_x86_64_i8/bin/molpro -n 4/4:8 mpr_qm_region.inp -d /local_scratch/$SLURM_JOB_ID -W /local_scratch/$SLURM_JOB_ID
rm -f sf_*TMP* fort*
rsync -av m* $SLURM_SUBMIT_DIR/

Software

Modules are the same as on the Trestles cluster. We recommend the more recent versions of compiler and math libraries so that they will recognize the AVX512 floating-point instructions. Examples:

module load intel/20.0.1 mkl/20.0.1 impi/20.0.1
module load gcc/9.3.1
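
Because the same modules are shared with older clusters, it can help to confirm that the node a job lands on actually supports AVX512 before relying on it. The following is a minimal sketch, assuming a Linux node where /proc/cpuinfo lists the CPU feature flags. The function name has_avx512 is illustrative.

```shell
#!/bin/bash
# has_avx512 [FILE]
# Succeed if the cpuinfo FILE (default /proc/cpuinfo) lists the avx512f flag.
# Hypothetical helper for illustration; not installed on the cluster.
has_avx512() {
    grep -q 'avx512f' "${1:-/proc/cpuinfo}" 2>/dev/null
}

if has_avx512; then
    echo "AVX512 supported on $(uname -n)"
else
    echo "no AVX512 on $(uname -n)"
fi
```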

Selecting the right Queue/Partition among multiple clusters

Generally the nodes are reserved for the most efficient use, especially for expensive features such as GPU and extra memory. Pinnacle compute nodes (comp.. and himem.. partitions) are very busy and are reserved for scalable programs that can use all 32/24 cores (except for the cloud partition, and condo usage by the owner). Cores are allocated by the product of ntasks-per-node x cpus-per-task. Exceptions: (1) serial/single-core jobs that use more memory than is available on Razor/Trestles (64 to 192 GB); (2) multiple jobs submitted together that use a whole node, such as 4 x 8 cores; (3) two jobs on one high-memory node (2 x 12 cores) that each use more than 192 GB (and less than 384 GB, so that both can run on the himem node).

Single-core serial jobs should be run on the cloud.., tres.., or razr.. partitions (unless they require 64 to 192 GB, in which case run them in the comp.. partitions with 32 cores allocated).

GPU nodes are reserved for programs that use the GPU (usually through the cuda libraries).

Large memory nodes are reserved for programs that use more shared memory than the 192 GB available on standard nodes.

Condo jobs must have the id of the project PI/node owner as a constraint, plus unique node-identifying information where the PI has more than one type of node.

Pubcondo non-gpu jobs must have 0gpu as a constraint and the number of cores and memory as a constraint, with the memory reasonably related to the job. Options are 16c & 64gb (64 Intel nodes), 32c & 192gb (20 Intel nodes), 32c & 256gb (2 AMD nodes), 40c & 384gb (10 Intel nodes), 48c & 256gb (1 AMD node), 64c & 112gb (2 Intel Phi nodes), 64c & 256gb (5 AMD nodes), 64c & 512gb (5 AMD nodes), 64c & 1024gb (1 AMD node), 64c & 2048gb (1 AMD node). A slurm string would look like --partition pcon06 --constraint "0gpu&16c&64gb". Examples (with the same options available in sbatch scripts):

pinnacle-l1:rfeynman:$ srun --nodes=1 --ntasks-per-node=1 --cpus-per-task=16 --partition pcon06 --qos comp --time=6:00:00 --constraint="0gpu&16c&64gb" --pty /bin/bash
srun: job 706884 queued and waiting for resources
srun: job 706884 has been allocated resources
c3204:rfeynman:$

pinnacle-l3:rfeynman:$ srun --nodes=1 --ntasks-per-node=1 --cpus-per-task=24 --partition pcon06 --qos comp --time=6:00:00 --constraint="4titanv&24c" --pty /bin/bash
srun: job 706892 queued and waiting for resources
srun: job 706892 has been allocated resources
c1522:rfeynman:$

Pubcondo gpu jobs must have the gpu type as a constraint and use that many gpus. Options are 4titanv & 24c (1 node), 1v100 & 40c (1 node), 2v100 & 32c (1 node), 1a100 & 64c (2 nodes), 4a100 & 64c (9 nodes).
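
A mistyped constraint can leave a job waiting for nodes that do not exist. The following is a minimal sketch that builds the non-gpu constraint string only for the core and memory combinations listed above. The function name pcon_constraint is illustrative, and the list is copied from this page, so it will go stale if the node inventory changes.

```shell
#!/bin/bash
# pcon_constraint CORES MEMGB
# Print the --constraint string for a non-gpu pcon06 job if CORES and MEMGB
# match one of the combinations listed on this page; otherwise return 1.
# Hypothetical helper for illustration; not installed on the cluster.
pcon_constraint() {
    case "${1}c&${2}gb" in
        "16c&64gb"|"32c&192gb"|"32c&256gb"|"40c&384gb"|"48c&256gb"|\
        "64c&112gb"|"64c&256gb"|"64c&512gb"|"64c&1024gb"|"64c&2048gb")
            printf '0gpu&%sc&%sgb\n' "$1" "$2" ;;
        *)
            echo "no pubcondo nodes listed with ${1} cores and ${2} GB" >&2
            return 1 ;;
    esac
}

# Example use on a login node (left as a comment so nothing is submitted here):
#   srun --nodes=1 --ntasks-per-node=1 --cpus-per-task=16 --partition pcon06 \
#        --qos comp --time=6:00:00 \
#        --constraint="$(pcon_constraint 16 64)" --pty /bin/bash
```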

Selecting cores per node

To maintain throughput and avoid wasting capacity with partly-filled nodes, there are standards for selecting part-node/full-node jobs on the comp.., himem.., and gpu.. partitions. These three partitions are restricted to whole nodes, with a few exceptions. Whole nodes means the product ntasks-per-node x cpus-per-task = the number of cores per node: 32 for comp.. and gpu.., and 24 for himem..
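
The whole-node rule is simple arithmetic and can be checked before submitting. The following is a minimal sketch that uses the core counts stated above (32 for comp.. and gpu.., 24 for himem..). The function name whole_node_ok is illustrative, and the check does not cover the permitted exceptions.

```shell
#!/bin/bash
# whole_node_ok PARTITION NTASKS_PER_NODE CPUS_PER_TASK
# Status 0: the request fills a whole node. Status 1: it does not.
# Status 2: the partition is not one covered by the whole-node rule here.
# Hypothetical helper for illustration; not installed on the cluster.
whole_node_ok() {
    case "$1" in
        comp*|gpu*) cores=32 ;;
        himem*)     cores=24 ;;
        *)          return 2 ;;
    esac
    [ $(( $2 * $3 )) -eq "$cores" ]
}

whole_node_ok comp06 16 2 && echo "comp06 16x2: whole node"
whole_node_ok himem72 12 1 || echo "himem72 12x1: partial node"
```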

Permitted exceptions: multiple jobs submitted at once on comp.. such that jobs x cores per job = 32, such as 2 jobs x 16 cores, 4 x 8, 8 x 4, or 16 x 2; 2 jobs x 12 cores on himem.. with 192 GB < memory per job < 384 GB; and multiple gpu.. jobs meant to share the gpu (set cores per job to set the number of jobs per node, as for comp..).
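
The comp.. exception can be sketched as a small generator of sbatch command lines whose core counts add up to one 32-core node. This is an illustration only: the function name pack_cmds and the job script name job.sh are hypothetical, comp06 and the 6 hour limit are example choices, and the scheduler, not this script, decides where the jobs are placed.

```shell
#!/bin/bash
# pack_cmds NJOBS SCRIPT
# Print NJOBS sbatch command lines whose core counts add up to one 32-core
# comp.. node. NJOBS must be 2, 4, 8, or 16 (the splits listed on this page).
# Hypothetical helper for illustration; not installed on the cluster.
pack_cmds() {
    njobs=$1
    if [ "$njobs" -lt 2 ] || [ "$njobs" -gt 16 ] || [ $(( 32 % njobs )) -ne 0 ]; then
        echo "NJOBS must be 2, 4, 8, or 16" >&2
        return 1
    fi
    cores=$(( 32 / njobs ))
    i=1
    while [ "$i" -le "$njobs" ]; do
        echo "sbatch --partition comp06 --nodes=1 --ntasks-per-node=1 --cpus-per-task=$cores --time=6:00:00 $2"
        i=$(( i + 1 ))
    done
}

# Review the printed commands first, then pipe them to a shell to submit:
#   pack_cmds 4 job.sh | sh
pack_cmds 4 job.sh
```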

Jobs that don't meet these standards may be canceled without warning.