How to Use the Razor Cluster

This is a brief “how to” summary of usage for users of the Razor cluster.

Razor has several partitions: Razor I (128 12-core nodes, dual X5670 processors, 24GB memory except for 4 nodes with 96GB), Razor II (112 16-core nodes, dual E5-2670 processors, 32GB memory), and 10 large-memory four-socket nodes, some of them condo nodes, with 256GB-3024GB of memory. There is also a condo partition, Razor III (64 16-core nodes, dual E5-2650V2 processors, 64GB memory), which will move to the Trestles cluster. Each node has a local hard drive with 900-1800GB usable as temporary space in /local_scratch/$USER/. Nodes are interconnected with QLogic QDR InfiniBand. The operating system is CentOS 6.5.
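
For example, a batch job can stage its files onto the node-local disk and copy results back before it ends. This is only a sketch: it assumes a directory can be created under /local_scratch/$USER, the file names are illustrative, and /local_scratch is local to each node, so for multi-node jobs it only covers the node where the script runs.

SCRATCHDIR=/local_scratch/$USER/$PBS_JOBID
mkdir -p $SCRATCHDIR
cp $PBS_O_WORKDIR/input.dat $SCRATCHDIR/   # stage input onto the fast local disk
cd $SCRATCHDIR
# ... run the computation here ...
cp results.out $PBS_O_WORKDIR/             # copy results back to the submit directory
rm -rf $SCRATCHDIR                         # clean up local scratch before the job ends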

The parallel file systems are GPFS /storage/$USER and /scratch/$USER. Your home area is located at /gpfs_home/$USER and is also mounted as /home/$USER. The storage page explains storage policies and how to use the different storage areas for different purposes.
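
For example, to stage a data set from your home area into scratch before a run (assuming your /storage and /scratch directories already exist; the archive name is illustrative):

ls /storage/$USER /scratch/$USER            # confirm your directories are there
cp /gpfs_home/$USER/mydata.tar.gz /scratch/$USER/
cd /scratch/$USER && tar xzf mydata.tar.gz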

The login node is razor.uark.edu, which is a load balancer in front of identical login nodes with local names razor-l1 through razor-l3. If you have a UArk ID, log in with your UArk username and password. AHPCC doesn't see, know, reset, or email UArk passwords; use the UITS facility at password.uark.edu to reset yours. If you don't have a UArk ID, we will send you a login ID and password through a secure web server.
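
For example, to connect from a terminal (replace jdoe with your own login ID):

ssh jdoe@razor.uark.edu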

The main production queues, with 6-hour and 72-hour walltime limits respectively, are tiny12core and med12core on Razor I and tiny16core and med16core on Razor II. Both systems also have quick-turnaround, non-production debug queues (debug12core and debug16core) with a 30-minute limit. On Razor I only, there is a serial12core queue for single-core jobs of 2GB memory or less, which runs up to twelve such jobs on a single node. A number of special-purpose and condo queues are documented in queues.
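
The queues and their current limits can be listed on a login node with the standard Torque query commands:

qstat -q                  # summary of all queues and their limits
qstat -Q -f tiny12core    # full settings for a single queue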

A sample Maui/Torque job script follows. It requests two 12-core nodes for 6 hours (the format is walltime=DD:HH:MM:SS) in the tiny12core queue. The job starts and runs in the submit directory ($PBS_O_WORKDIR) and uses every core available on the requested nodes for MPI (NP=$(wc -l < $PBS_NODEFILE)), i.e. nodes*ppn = 24 cores in this case. The MPI executable was compiled beforehand with Intel compiler version 14.0.3 and Intel MPI version 5.1.1 (see modules).

#!/bin/bash
#PBS -N example1
#PBS -q tiny12core
#PBS -j oe
#PBS -m ae
#PBS -o zzz.$PBS_JOBID.tiny
#PBS -l nodes=2:ppn=12,walltime=6:00:00
# load the compiler and MPI stack the executable was built with
module purge
module load intel/14.0.3 impi/5.1.1
# run from the directory the job was submitted from
cd $PBS_O_WORKDIR
# one MPI rank per requested core (2 nodes x 12 cores = 24)
NP=$(wc -l < $PBS_NODEFILE)
mpirun -np $NP -machinefile $PBS_NODEFILE ./mympiexecutable
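
The executable referenced in the script would have been built beforehand on a login node with the same modules. A minimal sketch, assuming the Intel MPI compiler wrapper and a hypothetical source file mympi.c:

module purge
module load intel/14.0.3 impi/5.1.1
mpiicc -O2 -o mympiexecutable mympi.c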

Lines beginning with #PBS are meaningful only to the scheduler; lines that do not begin with #PBS are commands executed by the compute node's operating system.

In PBS scripts, the ppn parameter is very important. If the requested value of ppn is larger than the number of cores per node actually present on the hardware, the job will never run and will stay queued until it is deleted. In practice, you should almost always use ppn=8 on 8-core queues, ppn=12 on 12-core queues, and so on. If ppn is smaller than the correct value for the queue, the job will run but will not use all the cores, if you calculate the number of mpirun processes as in the example script above.
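
For example, the equivalent resource request on the Razor II 16-core queues would set ppn to match the 16-core nodes:

#PBS -q tiny16core
#PBS -l nodes=2:ppn=16,walltime=6:00:00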

Submitting and Checking on Your Job

The completed job script can be submitted to the scheduler with the qsub command. The name and extension of the script are arbitrary. If the script is called MPI-test-script.pbs, you will run qsub MPI-test-script.pbs. To check the status of jobs, use the interactive commands showq or checkjob. The showq command will print general information about the state of the job queue.
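
For example (the script name is the one above; the job ID is a placeholder):

qsub MPI-test-script.pbs      # prints the new job's ID
showq -u $USER                # list only your own jobs
checkjob 123456               # detailed status of one job, by job ID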

Cancelling a Job

If a job that has been submitted to the scheduler needs to be cancelled, use the qdel command with the job's ID.
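
For example, with a placeholder job ID:

qdel 123456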
