Arkansas High Performance Computing Center [hpcwiki]

Storage/Filesystems/Quotas/Backup

The Pinnacle, Trestles, and Razor clusters share the same Lustre storage. The Karpinski cluster has separate storage because it lacks the InfiniBand network over which the Lustre storage works.

storage

The main bulk storage area for user rfeynman is /storage/rfeynman/, which is a symlink to /scrfs/storage/rfeynman/. In most contexts either name may be used. The home directory is /home/rfeynman/, which is a symlink to /scrfs/storage/rfeynman/home, so home is a subsection of your /storage area. Home is where cd with no arguments takes you, and it contains important configuration files and directories such as the file .bashrc and the directory .ssh.
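Since both names resolve to the same place, a quick check can confirm that two paths refer to the same directory. The following is a minimal sketch, assuming GNU readlink with its -f option is available on the login nodes; the function name same_target is hypothetical and not part of the cluster tooling.

```shell
# Hypothetical helper: succeed if two paths resolve to the same real location.
# Assumes GNU readlink, whose -f option follows every symlink in the path.
same_target() {
    [ "$(readlink -f "$1")" = "$(readlink -f "$2")" ]
}

# Usage on a login node (not run here):
#   same_target /storage/rfeynman /scrfs/storage/rfeynman && echo "same directory"
```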

You can organize your files arbitrarily inside your storage area, with one exception. Cluster operation across many different nodes requires working ssh keys, which requires /home/rfeynman/.ssh to be present and correctly configured, which in turn requires /scrfs/storage/rfeynman/home to remain in its default setup. So don't move or rename your home directory.

Your storage directory has a quota of 10 TB. At this time quotas are not enforced, so you can still write files when over quota, but you will be reminded. You can check your own quota like so:

tres-l1:rfeynman $ lfs quota /scrfs
Disk quotas for usr rfeynman (uid 9594):
     Filesystem  kbytes   quota   limit   grace   files   quota   limit   grace
         /scrfs 10870257212       0       0       -   14450       0       0       -
Disk quotas for grp oppenhiemer (gid 572):
     Filesystem  kbytes   quota   limit   grace   files   quota   limit   grace
         /scrfs       0       0       0       -       0       0       0       -
tres-l1:rfeynman $

This shows usage of 10.87 TB (over the 10 TB quota) and 14,450 files (no file-count quota is set). Group quotas are shown but are not currently active.
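To turn the kbytes column into a figure that can be compared with the 10 TB quota, a small filter can help. This is only a sketch: the function name scrfs_usage_tb is hypothetical, it assumes the column layout shown in the transcript above, and it takes 1 TB as 10^9 kbytes, which is the conversion that yields the 10.87 TB figure.

```shell
# Hypothetical helper: read `lfs quota /scrfs` output on stdin and print the
# user's usage in TB, assuming 1 TB = 10^9 kbytes.
scrfs_usage_tb() {
    # The first line whose first field is /scrfs belongs to the user block
    # (the group block comes later); field 2 is the kbytes column.
    awk '$1 == "/scrfs" { printf "%.2f\n", $2 / 1e9; exit }'
}

# Usage on a login node (not run here):
#   lfs quota /scrfs | scrfs_usage_tb
```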

Your storage area is not backed up. The most common cause of data loss is users accidentally deleting their own files, often through a typo in rm -r, which is a dangerous command. If you want to rearrange your storage, we suggest that instead of deleting with rm you make a trash folder, move files there, check that the result is right, and only then delete the trash folder.

mkdir -p /scrfs/storage/rfeynman/trash
cd /scrfs/storage/rfeynman        # or anywhere under /scrfs/storage/rfeynman
mv some-directory /scrfs/storage/rfeynman/trash/
cd /scrfs/storage/rfeynman
ls -R trash                       # check the trash contents again
rm -rf trash

home backup

We do attempt to make disk-to-disk backups of the home area subset of your storage directory. Because the capacity is limited, we make this backup only if the size of the home area is less than 150 GB. You can check the size of your home area like so:

tres-l1:rfeynman:$ cd
tres-l1:rfeynman:$ du -sh ../home
627M	../home
tres-l1:rfeynman:$

So this home area is 627 MB, which is less than 150 GB, and should be backed up. The du command has to query every file, so it can take a while if there are many (such as millions of) files. The purpose of the home area is to store important files such as source code and configuration data; large data collections should go elsewhere in your storage area. If your home area fails the 150 GB size test, nothing at all will be backed up.
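The size test can also be scripted. The sketch below is illustrative only: the function name home_backup_ok is hypothetical, and it assumes the 150 GB threshold means 150 x 1024 x 1024 kilobytes, which may differ from how the backup system actually measures it. The -H option makes du follow the home symlink when it is given on the command line.

```shell
# Hypothetical helper: report whether a directory is under the 150 GB
# threshold described above. The directory defaults to $HOME.
# The body runs in a subshell so its variables do not leak into the caller.
home_backup_ok() (
    dir=${1:-$HOME}
    limit_kb=$((150 * 1024 * 1024))   # assumption: 150 GB counted in 1024-byte units
    # -s: one total, -k: kilobytes, -H: follow a symlink given as the argument
    used_kb=$(du -skH "$dir" | awk '{ print $1 }')
    if [ "$used_kb" -lt "$limit_kb" ]; then
        echo "under limit"
    else
        echo "over limit"
    fi
)

# Usage on a login node (not run here):
#   home_backup_ok
```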

Lab Storage

Some labs have purchased auxiliary storage, and if you are a member of the corresponding lab group you can have a directory on it. These storage areas have names such as /storageb/. If your lab has such storage, it is suitable for holding over-quota files or for backing up your /scrfs/storage/ area.

scratch and local_scratch

There is a small, dedicated high-speed temporary storage area called /scratch/. It is intended for large inputs to, and especially outputs from, computational jobs. There is also local disk storage on each compute node called /local_scratch/. The job queueing system creates for each job temporary job directories /scratch/$SLURM_JOB_ID/ and /local_scratch/$SLURM_JOB_ID/ on the first compute node of the job. On Torque systems $PBS_JOBID is used instead.

If your job creates more than 500 MB of output, please route output to the job scratch or local_scratch directory. There are no quotas on /scratch/ or /local_scratch/, but /scratch/ has a total size of 19 TB and /local_scratch/ varies by node and may be as small as 90 GB.

The purpose of this rerouting is performance. The main storage /scrfs/ is composed of fairly large and slow 8 TB SATA drives that do not handle hundreds of concurrent data streams well, particularly those with small data blocks. /scrfs/ can sustain a fairly large throughput of efficiently large-blocked data, but that is rare in application programs. The NVMe drives of /scratch/ and the mostly SSD drives of /local_scratch/ are better suited to the typically inefficient, small data blocks put out by programs.

At the end of your job, copy the files you want to keep from /scratch/ or /local_scratch/ back to main storage /scrfs/. There are no user directories such as /scratch/rfeynman/, since we found that such directories soon filled the small /scratch/ partition. Each job directory is normally retained until a week after the job ends unless space becomes critical. See torque_slurm_scripts for some hints on moving data into and out of /scratch/ areas during jobs.
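The copy-back step at the end of a job might be sketched as follows. This is an illustration under stated assumptions, not the site's recommended script: the function name save_from_scratch, the program name my_program, and the results directory are all hypothetical, and the helper copies plain files only.

```shell
# Hypothetical helper for the end of a job script: copy selected result files
# from the per-job scratch directory back to main storage.
#   $1 = job scratch directory, $2 = destination directory,
#   remaining arguments = file names (relative to $1) to keep.
# The body runs in a subshell so `exit` only ends the helper, not the job.
save_from_scratch() (
    src=$1
    dest=$2
    shift 2
    mkdir -p "$dest" || exit 1
    for f in "$@"; do
        cp -p "$src/$f" "$dest/" || exit 1
    done
)

# In a Slurm job script this might look like (names are illustrative):
#   cd /scratch/$SLURM_JOB_ID
#   ./my_program > output.dat
#   save_from_scratch /scratch/$SLURM_JOB_ID /scrfs/storage/$USER/results output.dat
```

Writing into the job scratch directory during the run and copying once at the end keeps the many small writes off /scrfs/ and leaves it with a single large transfer.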