[GE users] Newbie question - queues, queue instances, and slots

jagladden gladden at chem.washington.edu
Fri Jun 5 00:59:10 BST 2009



Craffi,

Thank you for your response.  I have spent the last couple of days (or at least the available portions thereof) pondering the SGE admin documentation and am just now getting round to drafting a reply.

For what it's worth, I learned today the answer to my original quandary regarding multiple cluster queues and over-subscription.  I found the following paragraph in a blog post (http://blogs.sun.com/templedf/entry/intro_to_grid_engine_queues):

"One other oddity to point out is that the slot count for a queue is not really a queue attribute. It's actually a queue-level resource (aka complex). To allow multiple queues on the same host to share that host's CPUs without oversubscribing, you can set the slots resource at the host level. Doing so sets a host-wide slot limit, and all queues on that host must then share the given number of slots, regardless of how many slots each queue (or queue instance) may try to offer."

I tried this (I added 'slots' as a resource on one of my dual quad-core nodes and set the value to 8) and it does indeed work.  With this done, I can feed jobs to two cluster queues associated with the same host and no over-subscription results.  My original proposal to use queues as a method of prioritizing jobs did not make a lot of sense, but there appear to be other useful applications for multiple cluster queues, such as setting job resource limits.
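
For reference, one way to set this from the command line is roughly the following (a sketch; the host name is one of my nodes, and editing the exec host with 'qconf -me' works just as well):

# Cap the total slots on the host so all queue instances share 8 slots
# (use -mattr instead of -aattr if complex_values is already populated):
qconf -aattr exechost complex_values slots=8 compute-0-1

# Verify that complex_values now contains slots=8:
qconf -se compute-0-1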

As to the subject of "seq_no" versus "load" as a queue instance sorting method, I am wondering whether, in a homogeneous cluster dominated by parallel jobs and with only one slot available per CPU core, "load"-based queue sorting ever does anything very useful.  In principle there will be only one process per core anyway, so in terms of CPU use there is not much interesting "load" to sort.  There may be cases where (due to memory or I/O bandwidth issues) scalar jobs might run more efficiently if scattered across nodes where the other cores were idle, but these are probably not typical.  If I understand the "seq_no" option correctly, it will (if used in conjunction with sequentially numbered queue instances) cause the scheduler to always search from one end of the sequence until it finds an open slot or slots.  This should tend to manage the CPU core resource in a manner that avoids fragmentation.  Am I missing something here?
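
For concreteness, the kind of configuration I have in mind looks roughly like this (a sketch; the host names are placeholders):

# Scheduler configuration (qconf -msconf): sort queue instances by
# sequence number instead of load:
queue_sort_method        seqno

# Queue configuration (qconf -mq all.q): give each queue instance an
# explicit sequence number so the scheduler always fills from one end:
seq_no                   0,[node01.local=1],[node02.local=2],[node03.local=3]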

James Gladden



craffi wrote:

Hi James,

Thanks for the details - you are correct in that your cluster app mix
is not that different from what many people use. Dealing with a
mixture of parallel and non-parallel jobs on the same system is
quite common.

Your detailed reply below did not mention having to distinguish
between high and low priority job types, and since the two cluster
queues were the root cause of your problem with
overcommitment, you may be able to get the behavior you want in the
much simpler configuration of a single "all.q"-type cluster
queue. Having a single queue in which to test both job types may also
make tuning easier. Some comments on the issues raised below:

- You don't mention what sort of overall resource allocation policy
you want so I'll go ahead and link to the trivial fairshare-by-user
example. This is what I recommend for initial clusters as it is
simple, easy to set up and easy for users to understand. After the
cluster is in production for a while the users and admins will have a
better idea of what refinements they want to make: http://gridengine.info/2006/01/17/easy-setup-of-equal-user-fairshare-policy
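
From memory, the recipe at that link boils down to roughly the following (double-check against the post before relying on it):

# Global cluster configuration (qconf -mconf):
enforce_user              auto
auto_user_fshare          100

# Scheduler configuration (qconf -msconf): give functional tickets
# some weight so the per-user shares actually matter:
weight_tickets_functional 10000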

- There are a few ways to designate high priority jobs in the single
all.q configuration. The urgency policy can be applied to a custom
requestable resource or even a department/project object and then
anyone who submits a job requesting that resource would inherit
additional entitlements. This method is subject to "abuse" though so
you may have to trust your users to only ask for the resource when the
job is actually important.
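
A rough sketch of the mechanics (the complex name "highprio" and the urgency value are invented for illustration):

# Add a requestable boolean complex with a non-zero urgency (qconf -mc);
# the columns are: name  shortcut  type  relop  requestable  consumable  default  urgency
highprio    hp    BOOL    ==    YES    NO    FALSE    1000

# A user then opts in per job, which boosts that job's priority:
qsub -l highprio=true ./important-job.sh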

- Resource quotas may still help you out quite a bit. For instance you
can put some quota rules in that throttle serial jobs to no more than
80% of the cluster so that you always have some slots free for a
parallel task. The nice thing about quotas is how easy they are to
change on the fly to reflect real-world needs. You can take a similar
approach with quotas on your PE objects to keep the cluster from being
consumed entirely by parallel jobs.
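
A sketch of the serial-job throttle, for example - the slot count is invented (352 would be roughly 80% of a 440-core cluster), and "pes !*" is intended to match jobs that request no parallel environment, i.e. serial jobs:

# qconf -arqs serial_throttle
{
   name         serial_throttle
   description  "keep roughly 20% of the slots free for parallel work"
   enabled      TRUE
   limit        pes !* to slots=352
}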

- I have mixed feelings on the advice concerning changing queue sort
to seqno. I've used seqno sorting before but not for the need you
describe. It may be worthwhile to see if load-based sorting gets you
the behavior you want.

- For parallel job starvation look into reservation based scheduling
and "qsub -R y" - that is the current "modern" way to allow parallel
and serial jobs to function within the same cluster queue and on
shared hardware without artificially putting up walls between
resources for specific job types.
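
In practice that means something like the following (the max_reservation value is just an example; it must be non-zero for reservations to be honored):

# Scheduler configuration (qconf -msconf):
max_reservation          32

# Submit the parallel job with a reservation request:
qsub -R y -pe mpi 8 ./my-parallel-job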

- Dead on with your $pe_slots discovery. It is exactly what you should
be using if you need to stay within a chassis for a parallel job.

- If you don't depend on any particular MPI, I strongly recommend
installing OpenMPI, as recent versions of it automatically integrate
"tightly" with Grid Engine, and tight parallel integration is always
preferred over loose PE integrations.

- For your 16 processor jobs that you would like to see span 2 chassis
only if at all possible - I think there is an SGE best practice for
this but I may not have the mechanics correct in this email: I think
you set up multiple PE objects for each 8-core machine and then take
advantage of the SGE ability to accept wildcards in PE request
strings. If you set up PEs like "MPIHOST-1, MPIHOST-2, MPIHOST-3" etc.
and then submitted a job like "qsub -pe MPIHOST* 16 ./my-parallel-job"
then I *think* you would get your 16 slots spread out across 2
machines rather than scattered all over the cluster. Not sure though -
someone else on the list may be able to tell you the proper way to do
this.
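
Purely as an untested sketch of those mechanics (the PE and hostgroup names are invented), it might look something like this:

# One PE per pair of machines, e.g. "qconf -ap MPIHOST-1":
pe_name            MPIHOST-1
slots              16
allocation_rule    $fill_up
control_slaves     TRUE          # TRUE for tight integration (e.g. OpenMPI)
job_is_first_task  FALSE
start_proc_args    /bin/true
stop_proc_args     /bin/true
user_lists         NONE
xuser_lists        NONE

# Tie each PE to its own hosts via a host-specific pe_list in the queue
# (qconf -mq all.q), e.g.:
#   pe_list  make,[@mpihosts1=MPIHOST-1],[@mpihosts2=MPIHOST-2]
# Then the wildcard request lets SGE pick exactly one matching PE:
qsub -pe "MPIHOST-*" 16 ./my-parallel-job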


I guess my overall advice is to see if you can get what you need
within a single all.q (or similar) by submitting parallel jobs with
reservation requests enabled and setting up a fairshare-by-user policy
so that all cluster users are treated equally at first. Resource quotas
and one-off uses of the override or urgency sub-policies may fill in
any remaining gaps.


Regards,
Chris




On Jun 1, 2009, at 6:56 PM, jagladden wrote:



Craffi,

Thank you for your prompt reply.  I stumbled into the question posed
in my post after reading the document titled "BEGINNER'S GUIDE TO
SUN GRID ENGINE 6.2 - Installation and Configuration White Paper -
September 2008" which I found on the SGE project website (https://www.sun.com/offers/details/Sun_Grid_Engine_62_install_and_config.html).
The document includes an appendix titled "Appendix A Configuring
an SMP/Batch Cluster".

This appendix contains suggestions for setting up a cluster that
will run shared memory jobs, non-shared memory parallel jobs, and
serial jobs.  The appendix suggests setting up at least two queues,
one for SMP jobs, and one or more for everything else.  It further
suggested changing the "queue_sort_method" from "load" to "seqno"
and then explicitly providing sequence numbers for each queue
instance - with the queue instances for the SMP queue being ordered
in a monotonically increasing fashion and the queue instances for
the other queues being numbered in the opposite manner.  If I
understand this suggestion correctly, this will cause the scheduler
to "pack" the non-SMP jobs onto nodes starting from one end of the
node pool, and allocate nodes to SMP jobs starting from the other
end.  This apparently avoids the situation wherein the scheduler
effectively balkanizes the node pool by scattering non-SMP jobs
among the least loaded nodes.

After some experimentation, I became suspicious that this suggestion
would in fact create a system which, when heavily subscribed, would
allow the scheduler to assign at least two jobs to each CPU - one
from the SMP queue and one from each of the other queues.  This led
me to perform the two-queue experiment that was the subject of my
original post.

So, with that preamble, let me back up and explain what I am trying
to do.  We are setting up a new cluster for academic research
computing that will consist of 55 dual-quad core nodes.  The work
load is primarily a mixture of serial jobs (locally developed Monte
Carlo codes, etc.) and parallel jobs (mostly Gaussian and VASP).
The parallel jobs are capable of using both shared memory and the
network interconnect for inter-process communication.

In general, the parallel codes don't scale all that well beyond 8
processes, so the most commonly desired scenario will be to allocate
an entire node to a parallel job.  I have done some testing of the
SGE parallel environment, and it appears that if I set up a PE with
"allocation_rule=$pe_slots" then jobs requesting this PE will always
have all processes allocated on the same node.  So it appears that I
can handle this case just by setting up a PE for this specific case.
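
A PE of that sort looks roughly like the following (a sketch; the name is arbitrary):

# qconf -ap smp    (single-node shared-memory PE)
pe_name            smp
slots              999
allocation_rule    $pe_slots    # all requested slots on one host
control_slaves     FALSE        # no remote slave tasks for a single-node job
job_is_first_task  TRUE
start_proc_args    /bin/true
stop_proc_args     /bin/true
user_lists         NONE
xuser_lists        NONE

# All eight processes then land on a single node:
qsub -pe smp 8 ./parallel-job.sh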

There are however, a couple of slightly more complicated parallel
cases.  First, benchmarking indicates that it is not always
productive, when running Gaussian or VASP, to actually utilize all
eight cores on a node - memory bandwidth starvation results.  So in
some cases it might well be useful if a job could request exclusive
use of two or more nodes, with the explicit intent of actually only
starting four processes on each node.  Conversely, there may be a
few cases where utilizing 16 CPUs, allocated exclusively across two
nodes, may prove useful.  As near as I can determine, SGE does not
offer the ability to request specific "job geometries" (n processes
on m nodes), so I am not clear as to how one handles cases like these.
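
For illustration, a PE with a fixed integer allocation rule would at least pin the process count per node, though it would not by itself reserve the remaining cores (a sketch; the name is invented):

# Hypothetical "four processes per node" PE (qconf -ap mpi4):
pe_name            mpi4
slots              999
allocation_rule    4            # exactly four slots taken on each host used
control_slaves     TRUE
job_is_first_task  FALSE
start_proc_args    /bin/true
stop_proc_args     /bin/true
user_lists         NONE
xuser_lists        NONE

# An 8-slot request then spans two nodes with four processes on each,
# but the remaining four cores on each node stay available to the scheduler:
qsub -pe mpi4 8 ./vasp-job.sh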

Serial jobs are likely to be the minority case, at least if you
count them by actual CPU seconds used.  As I alluded to earlier, in
this case I believe it is desirable to have the scheduler "pack"
serial jobs onto nodes rather than spread them out across idle
nodes.  The latter scenario could seriously impact the parallel job
throughput by making it hard for the scheduler to find empty nodes
for parallel jobs that require dedicated use of a node.  We could of
course divide the nodes into separate pools for serial and parallel
job use, but this does not seem to be a scenario likely to produce an
efficient use of resources.

I suspect there is nothing particularly unusual about our job mix or
computing environment, so I assume that other admins have found
satisfactory solutions for these situations.  I would appreciate any
suggestions you might be able to offer.

James Gladden

craffi wrote:


A few things:

  - Your priority setting has no effect on SGE or the order in which
jobs are dispatched. The parameter you are setting effectively sets the
unix nice level of your tasks - this is an OS thing that has nothing
to do with SGE policies, resource allocation or scheduling. Most
people don't use this parameter unless they are intentionally
oversubscribing (more running jobs than CPUs on each machine).

  - There are a few different ways to get at what you want; not sure
if it's worth going through the details if you are just experimenting
at this point. If you can clearly explain what you want the system to
do, we can probably suggest ways to implement it.

If you insist on keeping 2 cluster queues you may want to search the
SGE docs for information on "subordinate queues" - that would let you
set up a system by which the low priority queue stops accepting work
when the high priority queues are occupied.

You may also want to read up on SGE Resource Quotas which are a
fantastic tool - reading the docs on that may give you some ideas for
resource quotas that would let you simplify the queue structure. For
example it would be trivial to create a global resource quota that
does not let the system have more than 10 active jobs at any one time
-- this is one way to deal with the "20 jobs / 10 processor" issue
you
have noted.
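
Something along these lines (a sketch; an empty filter makes the rule global):

# qconf -arqs
{
   name         max_slots_global
   description  "no more than 10 occupied slots cluster-wide"
   enabled      TRUE
   limit        to slots=10
}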

Welcome to SGE!


-Chris



On Jun 1, 2009, at 2:29 PM, jagladden wrote:




I am new to SGE, so I am trying it out on a small test cluster as a
first step.  Having done some experiments, I find myself a little
confused about how SGE handles queue instances and slots.

My test cluster has two compute nodes, with a total of 10 cores, as
shown by 'qhost':

[root at testpe bin64]# qhost
HOSTNAME                ARCH         NCPU  LOAD  MEMTOT  MEMUSE  SWAPTO  SWAPUS
-------------------------------------------------------------------------------
global                  -               -     -       -       -       -       -
compute-0-0             lx26-amd64      2  0.00    2.0G  102.8M    2.0G     0.0
compute-0-1             lx26-amd64      8  0.00   15.7G  119.9M  996.2M     0.0

I have set up two cluster queues.  The first of these is the
standard default queue 'all.q' as shown by 'qconf -sq':

[root at testpe ~]# qconf -sq all.q
qname                 all.q
hostlist              @allhosts
seq_no                0
load_thresholds       np_load_avg=1.75
suspend_thresholds    NONE
nsuspend              1
suspend_interval      00:05:00
priority              0
min_cpu_interval      00:05:00
processors            UNDEFINED
qtype                 BATCH INTERACTIVE
ckpt_list             NONE
pe_list               make mpich mpi orte
rerun                 FALSE
slots                 1,[compute-0-0.local=2],[compute-0-1.local=8]
...

The second is a "high priority" queue, which is identical except for
having a higher default job priority:

[root at testpe ~]# qconf -sq high
qname                 high
hostlist              @allhosts
seq_no                0
load_thresholds       np_load_avg=1.75
suspend_thresholds    NONE
nsuspend              1
suspend_interval      00:05:00
priority              10
min_cpu_interval      00:05:00
processors            UNDEFINED
qtype                 BATCH INTERACTIVE
ckpt_list             NONE
pe_list               make
rerun                 FALSE
slots                 1,[compute-0-0.local=2],[compute-0-1.local=8]
...


My point of confusion arises when I submit jobs to both these
queues.  There are only 10 CPU's available, and I would expect the
queuing system to only allow a maximum of 10 jobs to run at any one
time.  What happens in practice is that SGE allows 10 jobs from each
of the two queues to run at the same time, for a total of 20 jobs,
thus effectively allocating two jobs to each CPU.  In the following
example I have submitted 24 jobs, 12 to each queue.  Note that
'qstat' shows 20 of them to be running simultaneously, with four
waiting:

[gladden at testpe batchtest]$ qstat
job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID
-----------------------------------------------------------------------------------------------------------------
    110 0.55500 test_simpl gladden      r     06/01/2009 10:08:37 all.q@compute-0-0.local            1
    114 0.55500 test_simpl gladden      r     06/01/2009 10:08:43 all.q@compute-0-0.local            1
    109 0.55500 test_simpl gladden      r     06/01/2009 10:08:37 all.q@compute-0-1.local            1
    111 0.55500 test_simpl gladden      r     06/01/2009 10:08:40 all.q@compute-0-1.local            1
    112 0.55500 test_simpl gladden      r     06/01/2009 10:08:40 all.q@compute-0-1.local            1
    113 0.55500 test_simpl gladden      r     06/01/2009 10:08:40 all.q@compute-0-1.local            1
    115 0.55500 test_simpl gladden      r     06/01/2009 10:08:43 all.q@compute-0-1.local            1
    116 0.55500 test_simpl gladden      r     06/01/2009 10:08:43 all.q@compute-0-1.local            1
    117 0.55500 test_simpl gladden      r     06/01/2009 10:08:46 all.q@compute-0-1.local            1
    118 0.55500 test_simpl gladden      r     06/01/2009 10:08:46 all.q@compute-0-1.local            1
    121 0.55500 test_simpl gladden      r     06/01/2009 10:09:08 high@compute-0-0.local             1
    126 0.55500 test_simpl gladden      r     06/01/2009 10:09:11 high@compute-0-0.local             1
    122 0.55500 test_simpl gladden      r     06/01/2009 10:09:08 high@compute-0-1.local             1
    123 0.55500 test_simpl gladden      r     06/01/2009 10:09:08 high@compute-0-1.local             1
    124 0.55500 test_simpl gladden      r     06/01/2009 10:09:08 high@compute-0-1.local             1
    125 0.55500 test_simpl gladden      r     06/01/2009 10:09:11 high@compute-0-1.local             1
    127 0.55500 test_simpl gladden      r     06/01/2009 10:09:11 high@compute-0-1.local             1
    128 0.55500 test_simpl gladden      r     06/01/2009 10:09:11 high@compute-0-1.local             1
    129 0.55500 test_simpl gladden      r     06/01/2009 10:09:14 high@compute-0-1.local             1
    130 0.55500 test_simpl gladden      r     06/01/2009 10:09:14 high@compute-0-1.local             1
    119 0.55500 test_simpl gladden      qw    06/01/2009 10:08:44                                    1
    120 0.55500 test_simpl gladden      qw    06/01/2009 10:08:45                                    1
    131 0.55500 test_simpl gladden      qw    06/01/2009 10:09:12                                    1
    132 0.55500 test_simpl gladden      qw    06/01/2009 10:09:13                                    1

What I had expected was that SGE would first dispatch 10 jobs from
the "high priority" queue and then, as those jobs completed and
slots became available, dispatch and run additional jobs from the
default queue - but allowing only 10 jobs to run at any one time.
Instead, SGE seems to regard the 10 queue instances associated with
the "high" queue as having slots that are independent
of the 10 that are associated with "all.q".

Have I failed to configure something properly?  Is there not a way
to feed jobs from multiple queues to the same set of nodes while
limiting the number of active jobs to one per CPU?

James Gladden








