queueing_system [2016/03/07 20:30]
pwolinsk created
queueing_system [2017/04/19 15:31] (current)
===== Queueing System =====
All jobs on AHPCC clusters that require a significant amount of CPU or memory should be submitted through the queueing system. In general, two types of jobs may be submitted to the queue:
  * A //**batch job**// - a specific command is executed on the node(s) assigned to the job without the need for user interaction. The vast majority of jobs run on the HPC clusters are batch jobs.
  * An //**interactive job**// - a login shell is started on the first node assigned to the job. The user then enters the commands to execute at the command prompt.
A //**compute node**// is an individual computer that can be used to execute jobs. Compute nodes are grouped into //**queues**//. All nodes assigned to a particular queue are identical. The queues differ from each other in the following factors:
  * type of CPU and number of cores on each node
  * number of nodes assigned
  * the maximum number of nodes that a single job may use
  * amount of memory
  * walltime - the maximum amount of execution time for a single job
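The factors above can be thought of as a per-queue record of limits. The following is only a minimal sketch of that idea, not the actual AHPCC scheduler configuration: the ''QueueLimits'' class, its field names, and the example values are all hypothetical and chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QueueLimits:
    """Hypothetical record of the factors that distinguish one queue from another."""
    name: str
    cores_per_node: int      # number of cores on each (identical) node
    nodes_assigned: int      # number of nodes assigned to the queue
    max_nodes_per_job: int   # most nodes a single job may use
    memory_gb_per_node: int  # amount of memory on each node
    walltime_hours: float    # maximum execution time for a single job

    def accepts(self, nodes: int, hours: float) -> bool:
        """Return True if a request stays within the per-job limits."""
        return (0 < nodes <= self.max_nodes_per_job
                and 0 < hours <= self.walltime_hours)


# Example values are made up for illustration, not real AHPCC settings.
example = QueueLimits("example_queue", cores_per_node=32, nodes_assigned=64,
                      max_nodes_per_job=32, memory_gb_per_node=64,
                      walltime_hours=6.0)

print(example.accepts(nodes=8, hours=2.0))   # request within both limits
print(example.accepts(nodes=8, hours=12.0))  # request exceeds the walltime limit
```

A check like this only covers the static per-job limits; whether a job starts right away also depends on how many nodes are currently free.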
=== Node to Queue Assignment ===
All compute nodes are divided into groups called partitions. A node can belong to only one partition. A queue is made up of a collection of partitions, and a given partition can be assigned to multiple queues. As a result, most nodes are not exclusively assigned to a single queue but are shared between multiple queues. This configuration improves queue flexibility, but it complicates the user's view of the queueing system: it is difficult to predict how many free nodes a given queue has. To help determine the number of available nodes per queue, a script //**max_job_size**// is available:
  tres-l1:pwolinsk:$ max_job_size
  Maximum jobs size in number of nodes for immediate start per queue:
          q30m32c:  26 nodes     (max in partition: 26  queue cap:  64)
          q06h32c:  26 nodes     (max in partition: 26  queue cap:  64)
          q72h32c:   8 nodes     (max in partition:  8  queue cap:  32)
        qcDouglas:   0 nodes     (max in partition:  0  queue cap:   1)
            qcABI:   0 nodes     (max in partition:  0  queue cap:   1)
           qcondo:   0 nodes     (max in partition:  0  queue cap:   1)
        qtraining:   2 nodes     (max in partition:  2  queue cap:  64)
  tres-l1:pwolinsk:$
The output of the script above shows that a job requesting up to 26 nodes in the queue //**q06h32c**// should start immediately.
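If the same check needs to be done from a script, the output can be parsed line by line. The sketch below assumes the output keeps exactly the layout shown above. The function names ''parse_max_job_size'' and ''queues_that_fit'' are hypothetical helpers, not part of any AHPCC tooling, and the sample text is a subset of the output above.

```python
import re

# One line of max_job_size output, e.g.
# "q30m32c:  26 nodes     (max in partition: 26  queue cap:  64)"
LINE_RE = re.compile(
    r"^\s*(?P<queue>\S+):\s+(?P<free>\d+)\s+nodes\s+"
    r"\(max in partition:\s*(?P<partition_max>\d+)\s+queue cap:\s*(?P<cap>\d+)\)"
)


def parse_max_job_size(text):
    """Map each queue name to its immediate-start size, partition maximum, and queue cap."""
    result = {}
    for line in text.splitlines():
        match = LINE_RE.match(line)
        if match:  # header and prompt lines do not match and are skipped
            result[match["queue"]] = {
                "free": int(match["free"]),
                "partition_max": int(match["partition_max"]),
                "cap": int(match["cap"]),
            }
    return result


def queues_that_fit(parsed, nodes):
    """Return the queues in which a job of `nodes` nodes could start immediately."""
    return [queue for queue, info in parsed.items() if nodes <= info["free"]]


sample = """\
Maximum jobs size in number of nodes for immediate start per queue:
        q30m32c:  26 nodes     (max in partition: 26  queue cap:  64)
        q06h32c:  26 nodes     (max in partition: 26  queue cap:  64)
        q72h32c:   8 nodes     (max in partition:  8  queue cap:  32)
      qtraining:   2 nodes     (max in partition:  2  queue cap:  64)
"""

parsed = parse_max_job_size(sample)
print(queues_that_fit(parsed, 10))  # ['q30m32c', 'q06h32c']
```

The result reflects only the moment the script was run; the number of free nodes changes as other jobs start and finish.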
  * [[queues|Queues]] - summary of public queues
  * [[batch|Batch Jobs]]
  * [[interactive|Interactive Jobs]]
  * [[condo queues|Condo Queues]]
  * [[walltime extensions|Job Walltime Extensions]]
queueing_system.1457382645.txt.gz · Last modified: 2016/03/07 20:30 by pwolinsk