Sub-node Compute Jobs

As compute nodes grow in size (our latest equipment in purchasing will have 96 cores per node, and that is not the highest core count available), we face the problem of efficiently scheduling jobs that need anywhere from one core up to less than a full node. Slurm does not handle this well when a single partition receives everything from one-core jobs to multi-node jobs; invariably it leaves one core running on an otherwise empty node.

Our solution is to construct one set of partitions for full-node jobs (comp72, comp06, comp01) and another partition of identical hardware for very small jobs (cloud72). Because jobs in the cloud partition almost always start immediately, it does not need separate partitions for shorter time limits.
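
For reference, a minimal sketch of a batch header for a small job sent to the cloud partition. The qos name cloud and the script names smalljob.sh and myjob.sh are assumptions for illustration only; check the correct partition/qos pairing with HPC support.

$ cat smalljob.sh
#!/bin/bash
#SBATCH --partition cloud72
#SBATCH --qos cloud              # assumed qos name; confirm with HPC support
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=4
#SBATCH --time=72:00:00
./myjob.sh                       # placeholder for the actual small-job command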

This schedules fairly efficiently but leaves a gap for jobs of 1/16 to 1/2 node scale, or 2 to 16 cores on the 32-core nodes in these examples. A single fractional-node job can run in the cloud partition without affecting it much, but a handful of them can fill the (small) cloud partition. In particular, when there are thousands of small jobs to run, it is important to run them efficiently. Here are a couple of methods for running multiple fractional-node jobs efficiently on full compute nodes.

These methods require some intermediate-level scripting. We encourage you to contact HPC support for scripting help, particularly for very large parameter sweeps, where the arrangement of files in the file system and other factors can have a big influence on how easily the work completes. It's best to test and plan ahead.

at now

This method is good for small jobs that consistently use, for instance, 4 cores throughout their lifetime. On the 32-core node in this example, 8 of them can run concurrently with different parameters. Each run is wrapped in a script to keep the Slurm submit script simple.

  1. The sub-jobs should be relatively uniform in workload, so that one doesn't take 10 times as long as the others; that would leave one core occupying a full compute node for a long period, which is exactly what we are trying to avoid here.
  2. The wait at the end of the submit script is necessary to keep Slurm from ending the job prematurely. The backgrounded sub-jobs return the console immediately, so without wait Slurm would think the overall job is done, log out, and either kill the background jobs or leave orphaned processes running; either is bad.
  3. In this example, the 8 jobs run the same script with different parameters (1-8), simulating 8 different sub-jobs working on different data in the same directory. If your program always uses the same input and output file names, the sub-jobs will need to run in different directories, which should be handled by the script, here called bench.sh (see the sketch after the example below). The 8 at now statements could also be 8 completely different scripts, though in that case it would be harder to ensure that they take about the same time to run.
$ cat bench.sh
#!/bin/sh
# set up the module environment, then run one R sub-job
. /etc/profile.d/lmod.sh
module load intel/19.0.5 mkl/21.3.0 R/4.3.0
# $1 selects the input file bench<N>.R and names the output file
Rscript bench$1.R >bench$1.out

$ cat multiRscript.sh
#!/bin/bash
#SBATCH --partition comp72
#SBATCH --qos comp
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=8
#SBATCH --cpus-per-task=4
#SBATCH --time=72:00:00
# start 8 sub-jobs of 4 cores each in the background
./bench.sh 1 | at now &
./bench.sh 2 | at now &
./bench.sh 3 | at now &
./bench.sh 4 | at now &
./bench.sh 5 | at now &
./bench.sh 6 | at now &
./bench.sh 7 | at now &
./bench.sh 8 | at now &
# wait for the backgrounded sub-jobs so Slurm does not end the job early
wait
$ 
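
If the program always reads and writes the same file names, each sub-job must run in its own directory, as mentioned in note 3 above. Here is a minimal sketch of a per-directory variant, assuming directories run1 through run8 each hold a copy of the input files; the script name benchdir.sh and the directory layout are illustrative only.

$ cat benchdir.sh
#!/bin/sh
# run one R sub-job inside its own directory run$1 so identical file names don't collide
. /etc/profile.d/lmod.sh
module load intel/19.0.5 mkl/21.3.0 R/4.3.0
cd run$1 || exit 1
Rscript bench.R >bench.out

The submit script would then call ./benchdir.sh 1 through ./benchdir.sh 8 in place of ./bench.sh.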

GNU parallel

GNU parallel (the parallel command) is also useful and can handle more complex arrangements of jobs.

Here we have a MATLAB program that uses enough memory that 12 instances (an arbitrarily chosen number) fill a compute node. MATLAB is run in command-line batch mode with -r: the first argument is the name of the MATLAB code file and the others are its input arguments. Twelve doesn't divide evenly into 32, so we simply allocate a full node of 32 cores. GNU parallel runs the commands until they are all done, so the list could contain exactly 12 commands to run immediately, or, as in the example, more commands than the 12 that run concurrently. The list can't be arbitrarily long: the batch job will still time out at its time limit whether GNU parallel is finished or not. The commands should also be chosen for uniform load, so not, for example, 13 jobs run 12 at a time.

GNU parallel retains the console until it is done, so don't include wait at the end of the Slurm script.

$ cat test.sh
#!/bin/bash
# $1 is the name of the MATLAB code file; the remaining arguments are its inputs
module load matlab/r2023b
# run in the foreground (no trailing &) so GNU parallel can tell when the job finishes
matlab -nojvm -nodisplay -nosplash -r "$1 $2 $3 $4 $5 $6 $7 $8"

$ cat commands.txt
./test.sh Spirality3 8.0 14.5 1 6.633 1 6.633 11
./test.sh Spirality3 8.0 14.5 1 6.633 1 6.633 13 
./test.sh Spirality3 8.0 14.5 1 6.633 1 6.633 15
./test.sh Spirality3 8.0 14.5 1 6.633 1 6.633 17
./test.sh Spirality3 8.0 14.5 1 6.633 1 6.633 19
./test.sh Spirality3 8.0 14.5 1 6.633 1 6.633 21 
./test.sh Spirality3 8.0 14.5 1 6.633 1 6.633 23 
./test.sh Spirality3 8.0 14.5 1 6.633 1 6.633 25 
./test.sh Spirality3 8.0 14.5 1 6.633 1 6.633 27
./test.sh Spirality3 8.0 14.5 1 6.633 1 6.633 29 
./test.sh Spirality3 8.0 14.5 1 6.633 1 6.633 31
./test.sh Spirality3 8.0 14.5 1 6.633 1 6.633 33
./test.sh Spirality3 8.0 14.5 1 6.633 1 6.633 35
./test.sh Spirality3 8.0 14.5 1 6.633 1 6.633 37 
./test.sh Spirality3 8.0 14.5 1 6.633 1 6.633 39 
./test.sh Spirality3 8.0 14.5 1 6.633 1 6.633 41

$ cat multimatlab.sh
#!/bin/bash
#SBATCH --partition comp72
#SBATCH --qos comp
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=32
#SBATCH --cpus-per-task=1
#SBATCH --time=72:00:00
cat commands.txt | parallel --will-cite --jobs 12   # run the listed commands, 12 at a time
$
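
For larger parameter sweeps it is usually easier to generate commands.txt with a short script than to type it by hand. A minimal sketch that reproduces the list above (the file name makecommands.sh is only an example):

$ cat makecommands.sh
#!/bin/bash
# write one ./test.sh command per value of the final parameter (11, 13, ..., 41)
for n in $(seq 11 2 41); do
    echo "./test.sh Spirality3 8.0 14.5 1 6.633 1 6.633 $n"
done > commands.txt
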
Variable loads

at now and parallel --jobs ## are useful for fixed loads, or for loads defined by memory as in the second example. Where the sub-jobs vary in load over their lifetime, a variation on GNU parallel can help run the compute node near maximum throughput. This situation often occurs in bioinformatics workflows, where several programs of varying parallelism are chained together and the input files are often small enough that memory is not a concern. To get the best throughput, we want to maximize the amount of work running concurrently without pushing the compute node into overload, the point at which it starts spending its time switching between processes instead of doing useful work.

This case is similar to the second example, except that --jobs 12 (run exactly 12 jobs concurrently until the command list is finished) is replaced with --load 32 (attempt to maintain a system load of 32 until the command list is finished). System load is an abstraction derived from how many processes are asking the system for more CPU time; you can see the current load averages on a compute node with the uptime command. For a CPU-driven load, the maximum-throughput value is usually equal to the number of physical cores in the system, so 32 for the comp72 nodes in these examples. Some testing is usually needed to find the best load value for a particular problem set; please contact HPC support for help.

cat commands.txt | parallel --will-cite --load 32
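
Putting it together, here is a sketch of the full submit script for the load-based variant (the file name multimatlab_load.sh is only an example); it is identical to multimatlab.sh apart from the parallel line.

$ cat multimatlab_load.sh
#!/bin/bash
#SBATCH --partition comp72
#SBATCH --qos comp
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=32
#SBATCH --cpus-per-task=1
#SBATCH --time=72:00:00
# start new commands only while the system load stays below 32
cat commands.txt | parallel --will-cite --load 32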