Queueing System

All jobs on AHPCC clusters that require a significant amount of CPU or memory should be submitted through the queueing system. In general, two types of jobs may be submitted to a queue:

  • A batch job - a specific command is executed on the node(s) assigned to the job without the need for user interaction. The vast majority of jobs run on the HPC clusters are batch jobs.
  • An interactive job - a login shell is started on the first node assigned to the job. The user then enters the commands to execute at the command prompt.

A compute node is an individual computer which can be used to execute jobs. Compute nodes are grouped into queues. All nodes assigned to a particular queue are identical. The queues differ from each other by the following factors:

  • type of CPU and number of cores on each node
  • number of nodes assigned
  • the maximum number of nodes allowed to be used by a single job
  • amount of memory
  • walltime - the maximum amount of execution time for a single job

    Node to Queue Assignment

    All compute nodes are divided into groups called partitions. A node can belong to only one partition. A queue is made up of a collection of partitions, and a given partition can be assigned to multiple queues. As a result, most nodes are not assigned exclusively to a single queue but are shared among multiple queues. This configuration improves queue flexibility, but it complicates the user's view of the queueing system: it is difficult to predict how many free nodes there are for a given queue. To help determine the number of available nodes per queue, the script max_job_size is available:

tres-l1:pwolinsk:$ max_job_size 
Maximum jobs size in number of nodes for immediate start per queue:

        q30m32c:  26 nodes     (max in partition: 26  queue cap:  64)
        q06h32c:  26 nodes     (max in partition: 26  queue cap:  64)
        q72h32c:   8 nodes     (max in partition:  8  queue cap:  32)
      qcDouglas:   0 nodes     (max in partition:  0  queue cap:   1)
          qcABI:   0 nodes     (max in partition:  0  queue cap:   1)
         qcondo:   0 nodes     (max in partition:  0  queue cap:   1)
      qtraining:   2 nodes     (max in partition:  2  queue cap:  64)
tres-l1:pwolinsk:$ 

The output of the script above shows that a job requesting up to 26 nodes in the queue q06h32c should start immediately.
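
When choosing a queue from a script, the listing above can be read programmatically. The following is a minimal sketch, not part of the AHPCC tooling: it assumes the output keeps exactly the layout shown in the transcript, and the names parse_max_job_size and queues_that_fit are hypothetical helpers introduced only for illustration. The sample text is copied from the transcript; on a cluster it would be replaced with the captured output of max_job_size.

```python
import re

# Sample output copied from the max_job_size transcript above.
SAMPLE_OUTPUT = """\
Maximum jobs size in number of nodes for immediate start per queue:

        q30m32c:  26 nodes     (max in partition: 26  queue cap:  64)
        q06h32c:  26 nodes     (max in partition: 26  queue cap:  64)
        q72h32c:   8 nodes     (max in partition:  8  queue cap:  32)
      qcDouglas:   0 nodes     (max in partition:  0  queue cap:   1)
          qcABI:   0 nodes     (max in partition:  0  queue cap:   1)
         qcondo:   0 nodes     (max in partition:  0  queue cap:   1)
      qtraining:   2 nodes     (max in partition:  2  queue cap:  64)
"""

# Assumed line layout: "<queue>: <N> nodes (max in partition: <P> queue cap: <C>)".
LINE_RE = re.compile(
    r"^\s*(?P<queue>\S+):\s+(?P<nodes>\d+)\s+nodes\s+"
    r"\(max in partition:\s*(?P<partition>\d+)\s+queue cap:\s*(?P<cap>\d+)\)"
)


def parse_max_job_size(text):
    """Return a dict mapping queue name to nodes available for immediate start."""
    sizes = {}
    for line in text.splitlines():
        match = LINE_RE.match(line)
        if match:  # header, blank, and prompt lines do not match and are skipped
            sizes[match.group("queue")] = int(match.group("nodes"))
    return sizes


def queues_that_fit(text, nodes_needed):
    """List queues where a job of nodes_needed nodes should start immediately."""
    return [q for q, free in parse_max_job_size(text).items() if free >= nodes_needed]


if __name__ == "__main__":
    # Queues that could start an 8-node job right away, in listing order.
    print(queues_that_fit(SAMPLE_OUTPUT, 8))
```

With the sample data, an 8-node request fits in q30m32c, q06h32c, and q72h32c. The values reflect only the moment the script was run, so a job submitted later may still have to wait.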

Queues - summary of public queues

Batch Jobs

Interactive Jobs

Condo Queues

Job Walltime Extensions

queueing_system.txt · Last modified: 2020/09/21 22:01 by root