Efficient Program Output to /scratch

“Large” program output should not be written by a job directly to main storage (/storage[x]/ or /[x]home/). Most computational programs write output very inefficiently (open file, write one line, close file, repeated millions of times), which puts a huge load on the output file system. System utilities such as cp and rsync write in large blocks and are much more efficient, so it is better to have the program write directly to a designated fast temporary area such as /scratch/, then use system utilities to copy any outputs to be saved back at the end of the job.
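
In outline, the pattern looks like this (a minimal sketch; the program name and output file are placeholders, and a complete, tested example appears at the end of this page):

cd /scratch/$PBS_JOBID                  # per-job fast temporary directory
/path/to/my_program > my_output.log     # the program writes its output here, not to /storage[x]/
cp -p my_output.log $PBS_O_WORKDIR/     # one efficient large-block copy back at the end of the job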

We define “large” as roughly 100MB over the course of a job (that is, the sum of all outputs reaching 100MB, for example a 20MB graphics output overwritten 5 times). This guidance may be extended to any job that is doing something that slows down the shared file systems.

Exceptions: Some programs read from large input-only files, process the data, and write to other large files. It is usually acceptable for shared performance to read large input files directly from main storage /storage[x]/, especially when the files are so large that copying them into /scratch/ would itself be a burden.
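
A job of this type might read its input in place and direct only its output to scratch, as in this sketch (the program name and file paths are hypothetical):

cd /scratch/$PBS_JOBID
/path/to/my_program /storage1/$USER/big_input.dat > filtered_output.dat   # read the large input in place
cp -p filtered_output.dat $PBS_O_WORKDIR/                                 # copy only the output back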

There are two temporary file systems on each cluster: (1) /scratch/, a parallel networked file system (GPFS or Lustre), and (2) /local_scratch/, a local disk array on each compute node. The parallel file system /scratch/ is better for (1) MPI-parallel output (rare even among MPI programs), (2) very large output files that are larger than the local disk array (see the example below), and (3) cases where, depending on the compute node and system state, the parallel file system has higher bandwidth than the local disk. The local array /local_scratch/ is better for (1) very small writes/reads at high rates, since the latency of each operation is lower than on the networked system, and (2) any writes/reads when the parallel file system is overloaded.
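
A quick way to see how much space is free on each before choosing (the full example below instead uses the cluster-provided helper /share/apps/bin/local_scratch_available.sh):

df -P -B1 /local_scratch | awk 'NR==2 {print $4}'   # bytes available on the node-local array
df -P -B1 /scratch | awk 'NR==2 {print $4}'         # bytes available on the parallel file system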

NOTE: As of summer 2017 the (now old, pending upgrade) parallel file systems are chronically overloaded, so we recommend using /local_scratch/ whenever feasible. Exception: on Razor-1 12-core nodes, the local disks are very slow (tens of MB/s). Standard Trestles node local disks are both moderately slow (~100 MB/s) and moderately small (~90 GB available).

PENDING UPGRADE: We expect a complete overhaul of storage by late summer 2017. A number of changes will be made, primarily to storage: (1) the razor and trestles clusters will be combined with common storage, and (2) there will be no user directories on /scratch/ or /local_scratch/, only per-job directories. Per-job scratch directories are already being created for each job but are not yet required. For job 532889.torque these directories are '/scratch/532889.torque/' and, on the first compute node, '/local_scratch/532889.torque/'. To facilitate data recovery, the directories will be retained for 10 days after the end of the job, unless they fill a significant fraction of the disk, in which case they may be purged after as little as 1 day. We recommend purging the temporary directory at the end of the job; see the example below.

Example: here is a trestles job that copies the entire source directory to one of the predefined scratch directories, either '/scratch/' or '/local_scratch/', selected via the $SCRATCH variable. It has an estimated output size of 90MB and writes to '/local_scratch/' unless that partition does not have enough available space at the start of the job. The entire directory and its subdirectories are copied except for a subdirectory called “outputs”. You can exclude other files that are located in the directory but not needed by the job with additional --exclude= options. At the end of the job, updated files are copied back to the source directory from which the job was submitted. Because of --remove-source-files, files are then deleted from /scratch/ or /local_scratch/ once they match the versions in the original source directory. Please be careful with this option and test with unimportant data before committing to production. The if test at the end attempts to skip the rsync and resulting delete if, for example, some file system error left the job working in the wrong subdirectory. We have tried to write this script to work even if you have blanks in your directory names, but test with unimportant data first.

#PBS -N espresso
#PBS -j oe
#PBS -m ae
#PBS -o zzz.$PBS_JOBID
#PBS -l nodes=1:ppn=32,walltime=6:00:00
#PBS -q q06h32c
module purge
module load intel/14.0.3 mkl/14.0.3 fftw/3.3.6 impi/5.1.2

#copy files to scratch
#estimated output size in bytes (here 90MB); run in /scratch/ if /local_scratch/ has less than this available
OUTPUT_SIZE=90000000
cd $PBS_O_WORKDIR
Zpwd=`pwd`                                          # full path of the submit directory
Zdir=`dirname "${Zpwd}"`                            # its parent directory
Zbas=`basename "${Zpwd}"`                           # its name without the path
Zlsa=`/share/apps/bin/local_scratch_available.sh`   # bytes available in /local_scratch/
if [ "$Zlsa" -gt "$OUTPUT_SIZE" ];then
SCRATCH="/local_scratch/$PBS_JOBID"
else
SCRATCH="/scratch/$PBS_JOBID"
fi
echo Zpwd="${Zpwd}" Zdir="${Zdir}" Zbas="${Zbas}" Zlsa="${Zlsa}" SCRATCH="${SCRATCH}"
cd ..
mkdir -p "${SCRATCH}/${Zbas}"
#copy the job directory (excluding the "outputs" subdirectory) into scratch
rsync -av --exclude=outputs "./${Zbas}/" "${SCRATCH}/${Zbas}/"
cd "${SCRATCH}/${Zbas}/"

#compute step
sort -u $PBS_NODEFILE >hostfile    # one hostfile entry per allocated node
mpirun -ppn 4 -hostfile hostfile -genv OMP_NUM_THREADS 4 -genv MKL_NUM_THREADS 4 /share/apps/espresso/qe-6.1-intel-mkl-impi/bin/pw.x -npools 1 <ausurf.in >ausurf.log

#copy files back
#skip the rsync/remove if an earlier cd failed and we are still in the original directory
cd ..
Zpwd=`pwd`
if [ "${Zpwd}" != "${Zdir}" ]; then
    rsync -av --remove-source-files "${Zbas}" "${Zdir}/"
fi
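
Note that --remove-source-files deletes the transferred files but leaves the now-empty directory tree behind in scratch. To purge the per-job working directory completely at the end of the job, as recommended above, something like the following could be appended after the copy-back (a sketch; test with unimportant data first):

#purge the (now empty) working copy from scratch, guarding against an unset SCRATCH
if [ -n "${SCRATCH}" ] && [ -d "${SCRATCH}/${Zbas}" ]; then
    find "${SCRATCH}/${Zbas}" -depth -type d -empty -delete
fi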

A possible enhancement is to wrap the copy-back rsync in a bash conditional such as

if [ $? -eq 0 ]; then
    # copy-back rsync goes here
fi

so that the copy back only happens if the main computational program (here mpirun…) exited with a return code of 0 (success). Note that $? must be tested, or saved in a variable, immediately after the mpirun command, before any other command overwrites it. This behavior is also application-dependent, since some applications handle failures internally and return 0 even for failures.
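
Concretely, the compute and copy-back steps of the example above could be restructured along these lines, saving the exit status in a variable right after mpirun (a sketch based on the example script):

#compute step
sort -u $PBS_NODEFILE >hostfile
mpirun -ppn 4 -hostfile hostfile -genv OMP_NUM_THREADS 4 -genv MKL_NUM_THREADS 4 /share/apps/espresso/qe-6.1-intel-mkl-impi/bin/pw.x -npools 1 <ausurf.in >ausurf.log
Zrc=$?    # exit status of mpirun, saved before any other command overwrites $?

#copy files back only if the computation succeeded and we are not still in the original directory
cd ..
Zpwd=`pwd`
if [ "$Zrc" -eq 0 ] && [ "${Zpwd}" != "${Zdir}" ]; then
    rsync -av --remove-source-files "${Zbas}" "${Zdir}/"
fi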

This script is saved in trestles:/share/apps/examples/qe.sh.