[GE users] Newbie question - queues, queue instances, and slots

jagladden gladden at chem.washington.edu
Mon Jun 1 19:29:25 BST 2009



I am new to SGE, so I am trying it out on a small test cluster as a first step.  Having done some experiments, I find myself a little confused about how SGE handles queue instances and slots.

My test cluster has two compute nodes, with a total of 10 cores, as shown by 'qhost':

[root@testpe bin64]# qhost
HOSTNAME                ARCH         NCPU  LOAD  MEMTOT  MEMUSE  SWAPTO  SWAPUS
-------------------------------------------------------------------------------
global                  -               -     -       -       -       -       -
compute-0-0             lx26-amd64      2  0.00    2.0G  102.8M    2.0G     0.0
compute-0-1             lx26-amd64      8  0.00   15.7G  119.9M  996.2M     0.0

I have set up two cluster queues.  The first of these is the standard default queue 'all.q' as shown by 'qconf -sq':

[root@testpe ~]# qconf -sq all.q
qname                 all.q
hostlist              @allhosts
seq_no                0
load_thresholds       np_load_avg=1.75
suspend_thresholds    NONE
nsuspend              1
suspend_interval      00:05:00
priority              0
min_cpu_interval      00:05:00
processors            UNDEFINED
qtype                 BATCH INTERACTIVE
ckpt_list             NONE
pe_list               make mpich mpi orte
rerun                 FALSE
slots                 1,[compute-0-0.local=2],[compute-0-1.local=8]
...

The second is a "high priority" queue, which is identical except for having a higher default job priority:

[root@testpe ~]# qconf -sq high
qname                 high
hostlist              @allhosts
seq_no                0
load_thresholds       np_load_avg=1.75
suspend_thresholds    NONE
nsuspend              1
suspend_interval      00:05:00
priority              10
min_cpu_interval      00:05:00
processors            UNDEFINED
qtype                 BATCH INTERACTIVE
ckpt_list             NONE
pe_list               make
rerun                 FALSE
slots                 1,[compute-0-0.local=2],[compute-0-1.local=8]
...
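(For reference, a second queue like this can be set up by cloning the first one's configuration; the file name below is my own arbitrary choice:)

```shell
# Dump the existing queue configuration to an editable file
qconf -sq all.q > high.conf

# In high.conf, change "qname  all.q" to "qname  high" and
# "priority  0" to "priority  10", then register the edited file
# as a new cluster queue
qconf -Aq high.conf
```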


My point of confusion arises when I submit jobs to both of these queues.  There are only 10 CPUs available, so I would expect the queuing system to allow a maximum of 10 jobs to run at any one time.  What happens in practice is that SGE allows 10 jobs from each of the two queues to run at the same time, for a total of 20 jobs, effectively allocating two jobs to each CPU.  In the following example I have submitted 24 jobs, 12 to each queue.  Note that 'qstat' shows 20 of them running simultaneously, with four waiting:

[gladden@testpe batchtest]$ qstat
job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID
-----------------------------------------------------------------------------------------------------------------
    110 0.55500 test_simpl gladden      r     06/01/2009 10:08:37 all.q@compute-0-0.local            1
    114 0.55500 test_simpl gladden      r     06/01/2009 10:08:43 all.q@compute-0-0.local            1
    109 0.55500 test_simpl gladden      r     06/01/2009 10:08:37 all.q@compute-0-1.local            1
    111 0.55500 test_simpl gladden      r     06/01/2009 10:08:40 all.q@compute-0-1.local            1
    112 0.55500 test_simpl gladden      r     06/01/2009 10:08:40 all.q@compute-0-1.local            1
    113 0.55500 test_simpl gladden      r     06/01/2009 10:08:40 all.q@compute-0-1.local            1
    115 0.55500 test_simpl gladden      r     06/01/2009 10:08:43 all.q@compute-0-1.local            1
    116 0.55500 test_simpl gladden      r     06/01/2009 10:08:43 all.q@compute-0-1.local            1
    117 0.55500 test_simpl gladden      r     06/01/2009 10:08:46 all.q@compute-0-1.local            1
    118 0.55500 test_simpl gladden      r     06/01/2009 10:08:46 all.q@compute-0-1.local            1
    121 0.55500 test_simpl gladden      r     06/01/2009 10:09:08 high@compute-0-0.local             1
    126 0.55500 test_simpl gladden      r     06/01/2009 10:09:11 high@compute-0-0.local             1
    122 0.55500 test_simpl gladden      r     06/01/2009 10:09:08 high@compute-0-1.local             1
    123 0.55500 test_simpl gladden      r     06/01/2009 10:09:08 high@compute-0-1.local             1
    124 0.55500 test_simpl gladden      r     06/01/2009 10:09:08 high@compute-0-1.local             1
    125 0.55500 test_simpl gladden      r     06/01/2009 10:09:11 high@compute-0-1.local             1
    127 0.55500 test_simpl gladden      r     06/01/2009 10:09:11 high@compute-0-1.local             1
    128 0.55500 test_simpl gladden      r     06/01/2009 10:09:11 high@compute-0-1.local             1
    129 0.55500 test_simpl gladden      r     06/01/2009 10:09:14 high@compute-0-1.local             1
    130 0.55500 test_simpl gladden      r     06/01/2009 10:09:14 high@compute-0-1.local             1
    119 0.55500 test_simpl gladden      qw    06/01/2009 10:08:44                                    1
    120 0.55500 test_simpl gladden      qw    06/01/2009 10:08:45                                    1
    131 0.55500 test_simpl gladden      qw    06/01/2009 10:09:12                                    1
    132 0.55500 test_simpl gladden      qw    06/01/2009 10:09:13                                    1
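For completeness, the jobs were submitted with plain 'qsub', along these lines ('test_simple.sh' stands in for my trivial test script, whose name qstat truncates above):

```shell
# Submit 12 single-slot jobs to the default queue...
for i in $(seq 1 12); do qsub -q all.q test_simple.sh; done

# ...and 12 more to the high-priority queue
for i in $(seq 1 12); do qsub -q high test_simple.sh; done
```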

What I had expected was that SGE would first dispatch 10 jobs from the "high priority" queue and then, as those jobs completed and slots became available, dispatch and run additional jobs from the default queue, while still allowing only 10 jobs to run at one time.  Instead, SGE seems to regard the 10 queue instances associated with the "high" queue as having slots that are independent of the 10 associated with "all.q".

Have I failed to configure something properly?  Is there not a way to feed jobs from multiple queues to the same set of nodes while limiting the number of active jobs to one per CPU?
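One thing I did run across in the documentation, but have not yet tried, is setting the built-in 'slots' complex as a host-level limit, so that all queue instances on a host draw from a single per-host pool rather than each bringing its own slots:

```shell
# Untested sketch: cap the total slots each execution host will
# grant across ALL of its queue instances at its core count
qconf -mattr exechost complex_values slots=2 compute-0-0
qconf -mattr exechost complex_values slots=8 compute-0-1
```

Is this the intended mechanism for this situation, or is there a better approach?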

James Gladden
