This is a brief “how to” summary of usage for users of the Razor cluster.
Razor has several partitions: Razor I (128 12-core nodes, dual X5670 processors, 24GB memory, except 4 nodes with 96GB), Razor II (112 16-core nodes, dual E5-2670 processors, 32GB memory), and 10 large-memory four-socket nodes, some of them condo, with 256GB-3024GB memory. There is also a condo partition, Razor III (64 16-core nodes, dual E5-2650V2 processors, 64GB memory), which will move to the Trestles cluster. Each node has a local hard drive with 900-1800GB usable for temporary space in
/local_scratch/$USER/. Nodes are interconnected with QLogic QDR InfiniBand. The operating system is CentOS 6.5.
Parallel file systems are GPFS
/scratch/$USER. Your home area is located at
/gpfs_home/$USER and is autohomed to
/home/$USER. The storage page explains storage policies and how to use different storage areas for different purposes.
The login node is
razor.uark.edu which is a load balancer to identical login nodes with local names
razor-l3. If you have a UArk ID, you will use your UArk login and password. AHPCC doesn't see, know, reset, or email UArk passwords: use the UITS facility at password.uark.edu to reset them. If you don't have a UArk ID, we will send you a login ID and password through a secure web server.
The main production queues, with 6 and 72 hour walltime limits respectively, are
med12core on Razor I and
med16core on Razor II. Both systems also have quick-turnaround, non-production, 30-minute-limit
debug16core. On Razor I only there is a
serial12core queue for single-core jobs of 2GB or less memory, which runs up to twelve jobs on a single node. There are a number of special-purpose and condo queues documented in queues.
A sample Maui/Torque job script follows. It requests two 12-core nodes for 6 hours (the format is walltime=DD:HH:MM:SS) in the tiny12core queue. The job starts and runs in the submit directory ($PBS_O_WORKDIR) and uses every core available on the requested nodes for MPI, with the process count computed as
NP=$(wc -l < $PBS_NODEFILE), or nodes*ppn=24 cores in this case. The MPI executable was previously compiled with Intel compiler version 14.0.3 and Intel MPI version 5.1.1 (see modules).
#PBS -N example1
#PBS -q tiny12core
#PBS -j oe
#PBS -m ae
#PBS -o zzz.$PBS_JOBID.tiny
#PBS -l nodes=2:ppn=12,walltime=6:00:00
module purge
module load intel/14.0.3 impi/5.1.1
cd $PBS_O_WORKDIR
NP=$(wc -l < $PBS_NODEFILE)
mpirun -np $NP -machinefile $PBS_NODEFILE ./mympiexecutable
Lines beginning with
#PBS are meaningful only to the scheduler, and lines that do not begin with
#PBS are commands to the compute node operating system.
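The walltime request can be sanity-checked before submission. The sketch below converts a colon-separated walltime, such as the 6:00:00 used in the sample script, into seconds. It assumes that missing leading fields (days, hours) count as zero. walltime_to_seconds is a hypothetical helper written for illustration, not a Torque or Maui command.

```shell
#!/bin/bash
# walltime_to_seconds: hypothetical helper, not part of Torque or Maui.
# Accepts 1 to 4 colon-separated fields, read right to left as
# seconds, minutes, hours, days (assumption: missing fields are zero).
walltime_to_seconds() {
    local IFS=:
    local -a parts weights=(1 60 3600 86400)
    read -r -a parts <<< "$1"
    local n=${#parts[@]} total=0 i
    if (( n < 1 || n > 4 )); then
        echo "bad walltime: $1" >&2
        return 1
    fi
    for (( i = 0; i < n; i++ )); do
        # 10# forces base 10 so a field like "08" is not read as octal
        total=$(( total + 10#${parts[n - 1 - i]} * weights[i] ))
    done
    echo "$total"
}

walltime_to_seconds 6:00:00      # 6 hours
walltime_to_seconds 3:00:00:00   # 3 days, i.e. 72 hours
```

Comparing the result against a queue's limit (6 hours is 21600 seconds, 72 hours is 259200 seconds) catches a request that the queue would reject or hold.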
In PBS scripts, the
ppn parameter is very important. If the requested value of
ppn is larger than the number of cores per node present on the hardware, the job will never run and will stay queued until deleted. In practice, you should almost always use
ppn=8 on 8core queues,
ppn=12 on 12core queues, and so on. If
ppn is smaller than the correct value for the queue, the job will run but will not use all the cores if you calculate the number of MPI processes as in the example script above.
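The effect of ppn can be checked from inside a running job. This is a minimal sketch, assuming the usual Torque convention that $PBS_NODEFILE lists each node's hostname once per allocated core. summarize_nodefile is a hypothetical helper name.

```shell
#!/bin/bash
# summarize_nodefile: hypothetical helper, not part of Torque or Maui.
# Assumes the node file holds one hostname per allocated core.
summarize_nodefile() {
    local nodefile="$1"
    local np nodes
    np=$(wc -l < "$nodefile")               # total cores = MPI processes
    nodes=$(sort -u "$nodefile" | wc -l)    # distinct hostnames
    if (( nodes == 0 )); then
        echo "empty node file: $nodefile" >&2
        return 1
    fi
    # $(( )) also strips any padding wc may add to its output
    echo "nodes=$((nodes)) cores=$((np)) ppn=$((np / nodes))"
}

# Inside a job, report what the scheduler actually allocated.
if [ -n "${PBS_NODEFILE:-}" ] && [ -r "$PBS_NODEFILE" ]; then
    summarize_nodefile "$PBS_NODEFILE"
fi
```

For the sample request of nodes=2:ppn=12, the reported ppn should match the 12 cores of the queue's nodes. A smaller value means cores on each node are left idle.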
The completed job script can be submitted to the scheduler with the
qsub command. The name and extension of the script are arbitrary. If the script is called MPI-test-script.pbs, you will run
qsub MPI-test-script.pbs. To check the status of jobs, use the interactive command
showq, which prints general information about the state of the job queue.
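The submit-and-check steps can be combined in a short wrapper. This is a sketch only: submit_and_check is a hypothetical function name, and it assumes that qsub prints the new job identifier on standard output and that the Torque qstat command is available alongside showq.

```shell
#!/bin/bash
# submit_and_check: hypothetical wrapper around the scheduler commands.
submit_and_check() {
    local script="$1"
    local jobid
    # qsub prints the identifier of the newly queued job
    jobid=$(qsub "$script") || return 1
    echo "submitted: $jobid"
    # Torque status line for this one job
    qstat "$jobid"
    # Maui overview of the whole job queue
    showq
}

# Only attempt a submission where the scheduler client is installed.
if command -v qsub >/dev/null 2>&1; then
    submit_and_check MPI-test-script.pbs
fi
```

Capturing the job identifier at submission time avoids having to search the queue listing for it later.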
If a job that has been submitted to the scheduler needs to be cancelled, use the