
How to Use the Trestles Cluster

This is a brief “how to” summary of Trestles for users familiar with the Razor cluster.

Equipment

Trestles has about 240 usable identical nodes, each with four AMD Opteron 6136 8-core 2.4 GHz processors (32 cores per node). Each node has 64 GB of memory and a flash drive with about 90 GB usable as temporary space in /local_scratch/$USER/; this is much less local storage than Razor nodes have, but it is faster. Nodes are interconnected with Mellanox QDR InfiniBand. Each 32-core node has about the same computational power as a 16-core Intel node in the Razor cluster, so an AMD core delivers roughly half the performance of an Intel core on these models. Because Trestles has more, less powerful cores, it is best suited to highly scalable codes (codes with good parallel performance) and to codes that need more than the 24 to 32 GB of memory on most Razor nodes. There is no serial queue on Trestles; use Razor's serial queue instead. If you have a serial job that needs more than 32 GB of memory, you can use a full Trestles node (nodes=1:ppn=32), though a Razor 96 GB node would be faster.

Usage

The login node is trestles.uark.edu, a load balancer in front of two identical login nodes with local names tres-l1 and tres-l2. You can also reach Trestles from Razor with ssh tbridge ; ssh tres-l2.
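For example, both access routes look like this (your cluster username is assumed to match your local $USER):

```shell
# direct login from an outside machine (load-balanced to tres-l1 or tres-l2)
ssh $USER@trestles.uark.edu

# or hop over from a Razor login node
ssh tbridge
ssh tres-l2
```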

Initially there are three queues with maximum runtimes of 30 minutes, 6 hours, and 72 hours: q30m32c, q06h32c, and q72h32c. All nodes are identical, with a few reserved for shorter jobs. Only whole-node access (nodes=N:ppn=32) is supported initially. Serial (single-core) jobs should be run on the Razor cluster unless you specifically need the 64 GB memory capacity of Trestles; in that case, allocate the whole node with ppn=32.
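As a sketch, a large-memory serial job could claim a full node interactively (the queue and walltime here are chosen for illustration):

```shell
# request one whole 32-core / 64 GB node in the 6-hour queue
qsub -I -q q06h32c -l nodes=1:ppn=32 -l walltime=6:00:00
```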

The user environment is as similar as possible to the Razor cluster. Most codes will need to be recompiled because (1) the different InfiniBand network may affect the low-level links for some MPI versions, and (2) anything compiled with the Intel compilers using processor vectorization newer than SSE2 won't run on an AMD processor. When recompiling, the compiler modules will handle the low-level links, and the Intel compiler option -xSSE2 is the highest vectorization level that will run on Trestles.
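An illustrative recompile sketch (the module versions match the batch example later on this page; mycode.f90 is a hypothetical source file):

```shell
module load intel/14.0.3 mkl/14.0.3 openmpi/1.8.8
# -xSSE2 restricts Intel vectorization to instructions the AMD 6136 supports
mpif90 -O2 -xSSE2 -o mycode.x mycode.f90
```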

File Systems

Parallel file systems are Lustre /storage/$USER and /scratch. Your home area is located at /storage/$USER/home and is symlinked to /home/$USER. For most applications this home setup will be transparent compared with the slightly different “autohome” setup of the Razor cluster. There are also reserved condo storage areas: /storaged (Douglas group), /storageb (Bellaiche group), and /storage2 (Track II).
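You can confirm the symlink arrangement from a login node, for example:

```shell
# /home/$USER should be a symlink that resolves to /storage/$USER/home
ls -ld /home/$USER
readlink -f /home/$USER
```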

UPDATE: Trestles /scratch and /local_scratch are very small, and there will be no directories corresponding to your userid. Existing directories /scratch/$USER containing data will be wiped Thursday 3/17/16. There will be only per-job directories, created by the job prologue script, which expire 14 days after each job ends. In your batch scripts you can use the defined environment variables below, or reconstruct the names of the already-created scratch areas at runtime as /scratch/$PBS_JOBID and /local_scratch/$PBS_JOBID on the head compute node. Here is an example for a particular job:

cd $PBS_O_WORKDIR
cp *.in *UPF /scratch/$PBS_JOBID
cd /scratch/$PBS_JOBID

Trestles /storage quota is 900 GiB soft / 1000 GiB hard. Condo storage partitions /storage[x] don't have quotas. Unlike Razor, your Trestles /home area is part of /storage (/home/username is a symlink to /storage/username/home) and does not have its own quota. Future backup schemes that pull from /home may limit that size. At this time nothing on Trestles is backed up.
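The per-job scratch paths are simple to rebuild from $PBS_JOBID; a minimal sketch (the job id shown is hypothetical; PBS sets the real one inside a job):

```shell
PBS_JOBID=123456.trestles             # hypothetical; set automatically by PBS in a real job
JOB_SCRATCH=/scratch/$PBS_JOBID
JOB_LOCAL_SCRATCH=/local_scratch/$PBS_JOBID
echo "$JOB_SCRATCH"
```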
To use this effectively, you need to know which files are your job's inputs and which output files you need to keep. A Quantum ESPRESSO job might look like this:

#!/bin/bash
#PBS -N espresso
#PBS -j oe
#PBS -m ae
#PBS -o zzz.$PBS_JOBID
#PBS -l nodes=4:ppn=32,walltime=6:00:00
#PBS -q q06h32c
cd $PBS_O_WORKDIR
# stage input files into the per-job scratch directory
cp *.in *UPF /scratch/$PBS_JOBID
cd /scratch/$PBS_JOBID
# total core count = number of lines in the PBS node file
NP=$(wc -l < $PBS_NODEFILE)
module load intel/14.0.3 mkl/14.0.3 openmpi/1.8.8
mpirun -np $NP -machinefile $PBS_NODEFILE -x LD_LIBRARY_PATH \
/share/apps/espresso/espresso-5.1-intel-openmpi/bin/pw.x -npools 1 <ausurf.in >ausurf.log
# copy results and restart files back to the submission directory
mv ausurf.log *mix* *wfc* *igk* $PBS_O_WORKDIR/

We expect to port these /scratch changes back to Razor to help with the space crisis there.

File Transfer to/from Razor

UPDATE: tbridge/rbridge have been renamed to bridge on both sides.

There is an interface node, bridge, for transferring files between Razor and Trestles. The interface node is only for moving data; it cannot submit jobs. Please log in to the interface node and use cp/mv to move files between the parallel file systems over InfiniBand instead of using scp, which sends files over the slower Ethernet network. You may also use rsync, which, when both paths are on the same node, simply performs a local copy. On the bridge node, Trestles file systems are mounted at /storage, /scratch, and /home, and Razor file systems are mounted at /razor/storage, /razor/scratch, and /razor/home. Trestles file systems are also mounted at /trestles/storage and so on. To copy /storage/$USER/mydir from Razor to Trestles, starting on Razor:

ssh bridge
cd /razor/storage/$USER
rsync -av mydir /storage/$USER/
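Copying in the other direction is symmetric; a sketch using the same example directory:

```shell
# on the bridge node: Trestles -> Razor
cd /storage/$USER
rsync -av mydir /razor/storage/$USER/
```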

File Transfer to/from World

The Trestles network does not yet have a file transfer node to the outside world; one is planned. Until then, please use the Razor transfer node tgv, then bridge. Please avoid sending huge files through the login nodes on either Trestles or Razor.

For more information about moving files please visit: Data Transfer to and from AHPCC Clusters

Queues

Public queues start with q##. q06h is the shared instance of unused private condo nodes. Please note that shared jobs that request memory far in excess of their requirements may be terminated.

Queue     Time Limit  Cores(ppn=)  Nodes     Notes
q10m32c     10 min        32       trestles  formerly qtraining
q30m32c     30 min        32       trestles
q06h32c      6 hr         32       trestles
q72h32c     72 hr         32       trestles
q06h         6 hr        16-48     shared condo; select by ppn and memory properties below

Condo nodes are selected in queue qcondo by PBS node properties. Only a sufficient property is required (e.g. m256gb is unique among the currently installed nodes). Nodes with Intel Broadwell E5-26xx v4 CPUs have the property “v4”.

Node                Cores(ppn=)  Number  Properties
ABI 3072 GB           48            1     abi:m3072gb
Bellaiche 64 GB       16            3     laurent:v4:m64gb
Douglas 768 GB        32            1     douglas:m768gb
Douglas 256 GB        16            3     douglas:v4:m256gb

Public queues: examples in single-line interactive form

#shared 6-hour
$ qsub -I -q q06h32c -l nodes=1:ppn=12 -l walltime=6:00:00
#shared 72-hour
$ qsub -I -q q72h32c -l nodes=1:ppn=12 -l walltime=72:00:00

Condo queues: examples in single-line interactive form

#condo 256gb
$ qsub -I -q qcondo -l nodes=3:ppn=16:m256gb -l walltime=8:00:00
#condo 768gb
$ qsub -I -q qcondo -l nodes=1:ppn=32:m768gb -l walltime=8:00:00
#condo 768gb, equivalent
$ qsub -I -q qcondo -l nodes=1:ppn=32:douglas -l walltime=8:00:00
#condo 64gb
$ qsub -I -q qcondo -l nodes=1:ppn=16:m64gb -l walltime=10:00:00
#shared 3072gb
$ qsub -I -q q06h -l nodes=1:ppn=48:m3072gb -l walltime=6:00:00
trestles_usage.txt · Last modified: 2016/06/28 11:16 by pwolinsk