This is a brief “how to” summary of Trestles for users familiar with the Razor cluster.
Trestles has about 240 usable identical nodes, each with four AMD 6136 8-core 2.4 GHz processors (32 cores per node). Each node has 64 GB of memory and a flash drive with about 90 GB usable for temporary space in
/local_scratch/$USER/. That is much less local storage than Razor nodes have, but it is faster. Nodes are interconnected with Mellanox QDR Infiniband. Each 32-core node has about the same computational power as a 16-core Intel node in the Razor cluster, so an AMD core provides roughly half the performance of an Intel core for these workloads. Because Trestles has more, less powerful cores, it is best suited to highly scalable codes, that is, codes with good parallel performance, and to codes that need more than the 24 to 32 GB of memory on most Razor nodes. There is no serial queue on Trestles; use Razor's serial queue. If you have a serial job that needs more than 32 GB of memory, you can use a full Trestles node (nodes=1:ppn=32), though a Razor 96 GB node would be faster.
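Since there is no serial queue, a large-memory serial job occupies a whole Trestles node. A minimal batch-script sketch under that assumption; the job name, walltime, and program are placeholders:

```shell
#PBS -N bigmem-serial
#PBS -j oe
#PBS -q q06h32c
#PBS -l nodes=1:ppn=32,walltime=6:00:00
cd $PBS_O_WORKDIR
# hypothetical serial program that needs most of the node's 64 GB
./my_serial_program > run.log
```

Even though only one core is used, requesting ppn=32 reserves the whole node and therefore all of its memory.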
The login node is trestles.uark.edu, which is a load balancer to identical login nodes with local names tres-l2. You can also access Trestles from Razor by ssh tbridge ; ssh tres-l2.
Initially there are three queues, with maximum runtimes of 30 minutes, 6 hours, and 72 hours: q30m32c, q06h32c, and q72h32c. All nodes are identical, with a few reserved for shorter jobs. Only whole-node access (nodes=N:ppn=32) is supported initially. Serial (single-core) jobs should be run on the Razor cluster unless you specifically need the 64 GB memory capacity of Trestles; in that case allocate the whole node with ppn=32.
The user environment is as similar as possible to the Razor cluster. Most codes will need to be recompiled because (1) the different Infiniband network may affect the low-level links for some MPI versions, and (2) anything compiled with the Intel compilers at a processor vectorization level higher than SSE2 won't run on an AMD processor. When recompiling, the compiler modules will handle the low-level links, and the Intel compiler option -xSSE2 is the highest vectorization level that will run on Trestles.
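A sketch of a recompile for Trestles; the module versions match those used in the example job elsewhere on this page, and the source file name is a placeholder:

```shell
# Load compiler/MPI modules; these handle the low-level Infiniband links.
# Check `module avail` for what is actually installed.
module load intel/14.0.3 mkl/14.0.3 openmpi/1.8.8
# -xSSE2 caps Intel vectorization at SSE2 so the binary runs on AMD cores
mpicc -O2 -xSSE2 -o mycode mycode.c
```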
Parallel file systems are Lustre, with scratch space at /scratch. Your home area is located at /storage/$USER/home and is symlinked to /home/$USER. For most applications this home setup will be transparent compared with the slightly different “autohome” setup of the Razor cluster. There are also additional reserved condo storage areas: /storaged (Douglas group), /storageb (Bellaiche group), and /storage2 (Track II).
The scratch areas /scratch and /local_scratch are very small, and there will be no directories corresponding to your userid. Existing directories /scratch/$USER with data will be wiped Thursday 3/17/16. There will be only per-job directories, created by the job prologue script, that will expire 14 days after each job ends. In your batch scripts you can use the defined environment variables below, or reconstruct the names of the already-created scratch areas at runtime, e.g. /local_scratch/$PBS_JOBID on the head compute node. Here is an example for a particular job.
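The per-job staging pattern can be sketched as below. The PBS job id is mocked and /tmp/scratch stands in for /scratch so the sketch runs outside the cluster; on Trestles the prologue script has already created /scratch/$PBS_JOBID for you, and the file names are hypothetical:

```shell
#!/bin/sh
# Locally runnable sketch of staging a job through per-job scratch.
PBS_JOBID=${PBS_JOBID:-12345.fake}      # mock job id for local testing
SCRATCH=${SCRATCH:-/tmp/scratch}        # stands in for /scratch
JOBDIR="$SCRATCH/$PBS_JOBID"
mkdir -p "$JOBDIR"                      # on the cluster the prologue did this
WORKDIR=$(pwd)
echo "input data" > "$WORKDIR/case.in"  # hypothetical input file
cp "$WORKDIR/case.in" "$JOBDIR/"        # stage inputs to fast scratch
cd "$JOBDIR"
tr 'a-z' 'A-Z' < case.in > case.out     # stands in for the real solver
cp case.out "$WORKDIR/"                 # keep only the outputs you need
cd "$WORKDIR"
cat case.out                            # -> INPUT DATA
```

The point of the pattern is that all heavy I/O happens in the job's scratch directory, and only the files you explicitly copy back survive the 14-day expiry.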
The /storage quota is 900 GiB soft / 1000 GiB hard. Condo storage partitions /storage[x] don't have quotas. Unlike Razor, your Trestles /home area is part of /storage (/home/username is a symlink to /storage/username/home) and does not have its own quota. Future backup schemes that pull from /home may limit that size. At this time nothing on Trestles is backed up.
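To see your current usage against these quotas, the standard Lustre client command should work from a login node. This is a sketch and assumes /storage is the Lustre mount point in question:

```shell
# report current usage and limits for your user on /storage
lfs quota -u $USER /storage
```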
cd $PBS_O_WORKDIR
cp *.in *UPF /scratch/$PBS_JOBID
cd /scratch/$PBS_JOBID
To use this effectively, you need to know what your job's input files are, and which output files you need to keep. A quantum_espresso job might look like:
#PBS -N espresso
#PBS -j oe
#PBS -m ae
#PBS -o zzz.$PBS_JOBID
#PBS -l nodes=4:ppn=32,walltime=6:00:00
#PBS -q q06h32c
cd $PBS_O_WORKDIR
cp *.in *UPF /scratch/$PBS_JOBID
cd /scratch/$PBS_JOBID
NP=$(wc -l < $PBS_NODEFILE)
module load intel/14.0.3 mkl/14.0.3 openmpi/1.8.8
mpirun -np $NP -machinefile $PBS_NODEFILE -x LD_LIBRARY_PATH \
  /share/apps/espresso/espresso-5.1-intel-openmpi/bin/pw.x -npools 1 <ausurf.in >ausurf.log
mv ausurf.log *mix* *wfc* *igk* $PBS_O_WORKDIR/
We expect to port these
/scratch changes back to Razor to help with the space crisis there.
There is an interface node, reachable as bridge from both sides, for transferring files between Razor and Trestles. The interface node is only for moving data; it can't submit jobs. Please log in to the interface node and use cp/mv to move files between the parallel file systems over Infiniband instead of using scp, which would send the files over the slower ethernet network. You may also use rsync, which, when run on a single node, does a local copy like cp. On the bridge node, Trestles file systems are mounted at /home, and Razor file systems are mounted at /razor/home. Trestles file systems are also mounted at /trestles/storage and so on. To copy /storage/$USER/mydir from Razor to Trestles, starting on Razor:
ssh bridge
cd /razor/storage/$USER
rsync -av mydir /storage/$USER/
The Trestles network doesn't yet have a file transfer node to the outside world; one is on the to-do list. Until then, please use the Razor bridge. Please avoid sending huge files through the login nodes on either Trestles or Razor.
For more information about moving files, please see: Data Transfer to and from AHPCC Clusters
Public queues start with q##. q06h is the shared instance of unused private condo nodes. Please note that shared jobs that request memory far in excess of their requirements may be terminated.
Queue     Time Limit  Cores(ppn=)  Nodes
q10m32c   10 min      32           trestles (formerly qtraining)
q30m32c   30 min      32           trestles
q06h32c   6 hr        32           trestles
q72h32c   72 hr       32           trestles
q06h      6 hr        16-48        shared condo; select by ppn and memory properties below
Condo nodes are selected in queue qcondo by PBS node properties. Only one sufficient property is required (for example, m256gb is unique among the currently installed nodes). Nodes with Intel Broadwell E5-26xx v4 CPUs have the property “v4”.
Node             Cores(ppn=)  Number  Properties
ABI 3072 GB      48           1       abi:m3072gb
Bellaiche 64 GB  16           3       laurent:v4:m64gb
Douglas 768 GB   32           1       douglas:m768gb
Douglas 256 GB   16           3       douglas:v4:m256gb
Public queues: examples in single-line interactive form
#shared 6-hour
$ qsub -I -q q06h32c -l nodes=1:ppn=12 -l walltime=6:00:00
#shared 72-hour
$ qsub -I -q q72h32c -l nodes=1:ppn=12 -l walltime=72:00:00
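The same requests can be made non-interactively with a batch script. A minimal sketch for the 72-hour public queue; the job name, script name, and program are placeholders:

```shell
#PBS -N myjob
#PBS -j oe
#PBS -q q72h32c
#PBS -l nodes=1:ppn=32,walltime=72:00:00
cd $PBS_O_WORKDIR
# replace with your actual program
./myprogram > myjob.log
```

Save the script as, say, myjob.pbs and submit it with qsub myjob.pbs.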
Condo queues: examples in single-line interactive form
#condo 256gb
$ qsub -I -q qcondo -l nodes=3:ppn=16:m256gb -l walltime=8:00:00
#condo 768gb
$ qsub -I -q qcondo -l nodes=1:ppn=32:m768gb -l walltime=8:00:00
#condo 768gb, equivalent
$ qsub -I -q qcondo -l nodes=1:ppn=32:douglas -l walltime=8:00:00
#condo 64gb
$ qsub -I -q qcondo -l nodes=1:ppn=16:m64gb -l walltime=10:00:00
#shared 3072gb
$ qsub -I -q q06h -l nodes=1:ppn=48:m3072gb -l walltime=6:00:00