[GE users] Newbie question - queues, queue instances, and slots

craffi dag at sonsorol.org
Tue Jun 2 13:45:05 BST 2009



Hi James,

Thanks for the details - you are correct that your cluster's application
mix is not that different from what many people run. Dealing with a
mixture of parallel and non-parallel jobs on the same system is
quite common.

Your detailed reply below did not mention having to distinguish
between high and low priority job types, and since the two cluster
queues were the root cause of your overcommitment problem, you may be
able to get the behavior you want with the much simpler configuration
of a single "all.q" style cluster queue. Having a single queue in which
to test both job types may also make tuning easier. Some random
comments on the issues raised below:

- You don't mention what sort of overall resource allocation policy
you want, so I'll go ahead and link to the trivial fairshare-by-user
example. This is what I recommend for new clusters as it is
simple, easy to set up and easy for users to understand. After the
cluster has been in production for a while the users and admins will
have a better idea of what refinements they want to make: http://gridengine.info/2006/01/17/easy-setup-of-equal-user-fairshare-policy
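
As a rough sketch (the values are illustrative, not gospel), the recipe
linked above boils down to two edits:

  # qconf -mconf   (global config: auto-create user objects with shares)
  enforce_user                 auto
  auto_user_fshare             100

  # qconf -msconf  (scheduler config: give the functional policy weight)
  weight_tickets_functional    10000

After that every submitting user gets an equal functional share and the
scheduler balances pending jobs between users automatically.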

- There are a few ways to designate high priority jobs in the single
all.q configuration. The urgency policy can be applied to a custom
requestable resource or even a department/project object, and then
anyone who submits a job requesting that resource inherits
additional entitlements. This method is subject to "abuse" though, so
you may have to trust your users to only ask for the resource when the
job is actually important.
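
A minimal sketch of that approach (the complex name "important" and the
urgency value are made up for illustration):

  # qconf -mc   (add one line to the complex definitions)
  #name       shortcut  type  relop  requestable  consumable  default  urgency
  important   imp       BOOL  ==     YES          NO           0       1000

  # users then opt in per job:
  qsub -l important=1 ./urgent-job.sh

Any job requesting the resource picks up the extra urgency and tends to
be dispatched ahead of jobs that did not ask for it.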

- Resource quotas may still help you out quite a bit. For instance you
can put in quota rules that throttle serial jobs to no more than
80% of the cluster so that you always have some slots free for a
parallel task. The nice thing about quotas is how easy they are to
change on the fly to reflect real-world needs. You can take a similar
approach with quotas on your PE objects to keep the cluster from being
consumed entirely by parallel jobs.
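
For example, something along these lines (the 352-slot figure is just 80%
of 55 nodes x 8 cores, and using "pes !*" to match jobs that request no PE
is my assumption about the sge_resource_quota(5) filter syntax):

  # qconf -arqs serial_cap
  {
     name         serial_cap
     description  "keep ~20% of the slots free for parallel work"
     enabled      TRUE
     limit        pes !* to slots=352
  }

A mirror-image rule with "pes *" would cap parallel jobs the same way, and
"qconf -mrqs" lets you adjust the numbers on the fly.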

- I have mixed feelings about the advice concerning changing the queue
sort method to seqno. I've used seqno sorting before, but not for the
need you describe. It may be worthwhile to see if load-based sorting
gets you the behavior you want.
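
For reference, that is a one-line scheduler setting:

  # qconf -msconf
  queue_sort_method    load    # the default; "seqno" switches to sequence-number packing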

- For parallel job starvation, look into reservation-based scheduling
and "qsub -R y" - that is the current "modern" way to allow parallel
and serial jobs to function within the same cluster queue and on
shared hardware without artificially putting up walls between
resources for specific job types.
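
Roughly (the reservation count and duration below are placeholders):

  # qconf -msconf   (reservations stay off until max_reservation > 0)
  max_reservation      32
  default_duration     8:00:00

  # submit the parallel job with a reservation request:
  qsub -R y -pe orte 16 ./my-parallel-job

With a reservation the scheduler holds slots for the big job as they free
up, instead of letting a steady stream of serial jobs starve it forever.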

- Dead on with your $pe_slots discovery. It is exactly what you should
be using if you need to stay within a chassis for a parallel job.
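
Trimmed to the interesting fields, such a PE looks something like this
(the name "smp" is arbitrary):

  # qconf -sp smp
  pe_name            smp
  slots              999
  allocation_rule    $pe_slots

  # all 8 slots will land on a single execution host:
  qsub -pe smp 8 ./my-threaded-job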

- If you don't depend on any particular MPI, I strongly recommend
installing Open MPI, as recent versions of it automatically integrate
"tightly" with Grid Engine, and tight parallel integration is always
preferred over loose PE integrations.
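
If you go that route, the moving parts are roughly as follows (the PE name
"orte" is taken from your existing pe_list; the configure flag may or may
not be needed depending on the Open MPI release):

  # build Open MPI with Grid Engine support
  ./configure --with-sge && make && make install

  # the PE used for it should hand slave-process control to SGE:
  control_slaves     TRUE
  job_is_first_task  FALSE

  # mpirun then reads the host/slot list straight from SGE - no machinefile:
  qsub -pe orte 16 mpirun ./my-mpi-app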

- For your 16-processor jobs that should span exactly two chassis
if at all possible - I think there is an SGE best practice for
this but I may not have the mechanics correct in this email: I think
you set up multiple PE objects for your 8-core machines and then take
advantage of SGE's ability to accept wildcards in PE request
strings. If you set up PEs like "MPIHOST-1, MPIHOST-2, MPIHOST-3" etc.
and then submitted a job like "qsub -pe MPIHOST* 16 ./my-parallel-job"
then I *think* you would get your 16 slots spread across two
machines rather than scattered all over the cluster. Not sure though -
someone else on the list may be able to tell you the proper way to do
this.
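
Very much a sketch (the PE name and everything below are assumptions on my
part, so corrections from the list are welcome): the piece that actually
forces "exactly 8 slots per host" is a fixed allocation rule, and 16 slots
under that rule can only land on two hosts.

  # qconf -ap mpi8   (or one PE per chassis group named MPIHOST-1, -2, ...)
  pe_name            mpi8
  slots              999
  allocation_rule    8

  # either request the PE directly, or use the wildcard form you describe:
  qsub -pe mpi8 16 ./my-parallel-job
  qsub -pe "MPIHOST*" 16 ./my-parallel-job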


I guess my overall advice is to see if you can get what you need
within a single all.q (or similar) by submitting parallel jobs with
reservation requests enabled and setting up a fairshare-by-user policy
so that all cluster users are treated equally at first. Resource quotas
and one-off uses of the override or urgency sub-policies can fill in
any remaining gaps.


Regards,
Chris




On Jun 1, 2009, at 6:56 PM, jagladden wrote:

> Craffi,
>
> Thank you for your prompt reply.  I stumbled into the question posed
> in my post after reading the document titled "BEGINNER'S GUIDE TO
> SUN™ GRID ENGINE 6.2 - Installation and Configuration White Paper -
> September 2008" which I found on the SGE project website (https://www.sun.com/offers/details/Sun_Grid_Engine_62_install_and_config.html
> ).  The document includes an appendix titled "Appendix A Configuring
> an SMP/Batch Cluster".
>
> This appendix contains suggestions for setting up a cluster that
> will run shared memory jobs, non-shared memory parallel jobs, and
> serial jobs.  The appendix suggests setting up at least two queues,
> one for SMP jobs, and one or more for everything else.  It further
> suggested changing the "queue_sort_method" from "load" to "seqno"
> and then explicitly providing sequence numbers for each queue
> instance - with the queue instances for the SMP queue being ordered
> in a monotonically increasing fashion and the queue instances for
> the other queues being numbered in the opposite manner.  If I
> understand this suggestion correctly, this will cause the scheduler
> to "pack" the non-SMP jobs onto nodes starting from one end of the
> node pool, and allocate nodes to SMP jobs starting from the other
> end.  This apparently avoids the situation wherein the scheduler
> effectively balkanizes the node pool by scattering non-SMP jobs
> among the least loaded nodes.
>
> After some experimentation, I became suspicious that this suggestion
> would in fact create a system which, when heavily subscribed, would
> allow the scheduler to assign at least two jobs to each CPU - one
> from the SMP queue and one from each of the other queues.  This led
> me to perform the two-queue experiment that was the subject of my
> original post.
>
> So, with that preamble, let me back up and explain what I am trying
> to do.  We are setting up a new cluster for academic research
> computing that will consist of 55 dual quad-core nodes.  The work
> load is primarily a mixture of serial jobs (locally developed Monte
> Carlo codes, etc.) and parallel jobs (mostly Gaussian and VASP).
> The parallel jobs are capable of using both shared memory and the
> network interconnect for inter-process communication.
>
> In general, the parallel codes don't scale all that well beyond 8
> processes, so the most commonly desired scenario will be to allocate
> an entire node to a parallel job.  I have done some testing of the
> SGE parallel environment, and it appears that if I set up a PE with
> "allocation_rule=$pe_slots" then jobs requesting this PE will always
> have all processes allocated on the same node.  So it appears that I
> can handle this case just by setting up a PE for this specific case.
>
> There are however, a couple of slightly more complicated parallel
> cases.  First, benchmarking indicates that it is not always
> productive, when running Gaussian or VASP, to actually utilize all
> eight cores on a node - memory bandwidth starvation results.  So in
> some cases it might well be useful if a job could request exclusive
> use of two or more nodes, with the explicit intent of actually only
> starting four processes on each node.  Conversely, there may be a
> few cases where utilizing 16 CPUs, allocated exclusively across two
> nodes, may prove useful.  As near as I can determine, SGE does not
> offer the ability to request specific "job geometries" (n processes
> on m nodes), so I am not clear as to how one handles cases like these.
>
> Serial jobs are likely to be the minority case, at least if you
> count them by actual CPU seconds used.  As I alluded to earlier, in
> this case I believe it is desirable to have the scheduler "pack"
> serial jobs onto nodes rather than spread them out across idle
> nodes.  The latter scenario could seriously impact the parallel job
> throughput by making it hard for the scheduler to find empty nodes
> for parallel jobs that require dedicated use of a node.  We could of
> course divide the nodes into separate pools for serial and parallel
> job use, but this does not seem to be a scenario likely to produce an
> efficient use of resources.
>
> I suspect there is nothing particularly unusual about our job mix or
> computing environment, so I assume that other admins have found
> satisfactory solutions for these situations.  I would appreciate any
> suggestions you might be able to offer.
>
> James Gladden
>
> craffi wrote:
>>
>> A few things:
>>
>>   - Your priority setting has no effect on SGE or the order in which
>> jobs are dispatched. The parameter you are setting effectively sets
>> the Unix nice level of your tasks - this is an OS thing that has
>> nothing to do with SGE policies, resource allocation or scheduling.
>> Most people don't use this parameter unless they are intentionally
>> oversubscribing (more running jobs than CPUs on each machine)
>>
>>   - There are a few different ways to get at what you want, not sure
>> if it's worth going through the details if you are just experimenting
>> at this point. If you can clearly explain what you want the system to
>> do we can probably suggest ways to implement it
>>
>> If you insist on keeping 2 cluster queues you may want to search the
>> SGE docs for information on "subordinate queues" - that would let you
>> set up a system by which the low priority queue stops accepting work
>> when the high priority queues are occupied.
>>
>> You may also want to read up on SGE Resource Quotas which are a
>> fantastic tool - reading the docs on that may give you some ideas for
>> resource quotas that would let you simplify the queue structure. For
>> example it would be trivial to create a global resource quota that
>> does not let the system have more than 10 active jobs at any one time
>> -- this is one way to deal with the "20 jobs / 10 processor" issue
>> you
>> have noted.
>>
>> Welcome to SGE!
>>
>>
>> -Chris
>>
>>
>>
>> On Jun 1, 2009, at 2:29 PM, jagladden wrote:
>>
>>
>>> I am new to SGE, so I am trying it out on a small test cluster as a
>>> first step.  Having done some experiments, I find myself a little
>>> confused about how SGE handles queue instances and slots.
>>>
>>> My test cluster has two compute nodes, with a total of 10 cores, as
>>> shown by 'qhost':
>>>
>>> [root@testpe bin64]# qhost
>>> HOSTNAME                ARCH         NCPU  LOAD  MEMTOT  MEMUSE  SWAPTO  SWAPUS
>>> -------------------------------------------------------------------------------
>>> global                  -               -     -       -       -       -       -
>>> compute-0-0             lx26-amd64      2  0.00    2.0G  102.8M    2.0G     0.0
>>> compute-0-1             lx26-amd64      8  0.00   15.7G  119.9M  996.2M     0.0
>>>
>>> I have set up two cluster queues.  The first of these is the
>>> standard default queue 'all.q' as shown by 'qconf -sq':
>>>
>>> [root@testpe ~]# qconf -sq all.q
>>> qname                 all.q
>>> hostlist              @allhosts
>>> seq_no                0
>>> load_thresholds       np_load_avg=1.75
>>> suspend_thresholds    NONE
>>> nsuspend              1
>>> suspend_interval      00:05:00
>>> priority              0
>>> min_cpu_interval      00:05:00
>>> processors            UNDEFINED
>>> qtype                 BATCH INTERACTIVE
>>> ckpt_list             NONE
>>> pe_list               make mpich mpi orte
>>> rerun                 FALSE
>>> slots                 1,[compute-0-0.local=2],[compute-0-1.local=8]
>>> ...
>>>
>>> The second is a "high priority" queue, which is identical except for
>>> having a higher default job priority:
>>>
>>> [root@testpe ~]# qconf -sq high
>>> qname                 high
>>> hostlist              @allhosts
>>> seq_no                0
>>> load_thresholds       np_load_avg=1.75
>>> suspend_thresholds    NONE
>>> nsuspend              1
>>> suspend_interval      00:05:00
>>> priority              10
>>> min_cpu_interval      00:05:00
>>> processors            UNDEFINED
>>> qtype                 BATCH INTERACTIVE
>>> ckpt_list             NONE
>>> pe_list               make
>>> rerun                 FALSE
>>> slots                 1,[compute-0-0.local=2],[compute-0-1.local=8]
>>> ...
>>>
>>>
>>> My point of confusion arises when I submit jobs to both these
>>> queues.  There are only 10 CPU's available, and I would expect the
>>> queuing system to only allow a maximum of 10 jobs to run at any one
>>> time.  What happens in practice is that SGE allows 10 jobs from each
>>> of the two queues to run at the same time, for a total of 20 jobs,
>>> thus effectively allocating two jobs to each CPU.  In the following
>>> example I have submitted 24 jobs, 12 to each queue.  Note that
>>> 'qstat' shows 20 of them to be running simultaneously, with four
>>> waiting:
>>>
>>> [gladden@testpe batchtest]$ qstat
>>> job-ID  prior   name       user         state submit/start at     queue                   slots ja-task-ID
>>> -----------------------------------------------------------------------------------------------------------------
>>>     110 0.55500 test_simpl gladden      r     06/01/2009 10:08:37 all.q@compute-0-0.local     1
>>>     114 0.55500 test_simpl gladden      r     06/01/2009 10:08:43 all.q@compute-0-0.local     1
>>>     109 0.55500 test_simpl gladden      r     06/01/2009 10:08:37 all.q@compute-0-1.local     1
>>>     111 0.55500 test_simpl gladden      r     06/01/2009 10:08:40 all.q@compute-0-1.local     1
>>>     112 0.55500 test_simpl gladden      r     06/01/2009 10:08:40 all.q@compute-0-1.local     1
>>>     113 0.55500 test_simpl gladden      r     06/01/2009 10:08:40 all.q@compute-0-1.local     1
>>>     115 0.55500 test_simpl gladden      r     06/01/2009 10:08:43 all.q@compute-0-1.local     1
>>>     116 0.55500 test_simpl gladden      r     06/01/2009 10:08:43 all.q@compute-0-1.local     1
>>>     117 0.55500 test_simpl gladden      r     06/01/2009 10:08:46 all.q@compute-0-1.local     1
>>>     118 0.55500 test_simpl gladden      r     06/01/2009 10:08:46 all.q@compute-0-1.local     1
>>>     121 0.55500 test_simpl gladden      r     06/01/2009 10:09:08 high@compute-0-0.local      1
>>>     126 0.55500 test_simpl gladden      r     06/01/2009 10:09:11 high@compute-0-0.local      1
>>>     122 0.55500 test_simpl gladden      r     06/01/2009 10:09:08 high@compute-0-1.local      1
>>>     123 0.55500 test_simpl gladden      r     06/01/2009 10:09:08 high@compute-0-1.local      1
>>>     124 0.55500 test_simpl gladden      r     06/01/2009 10:09:08 high@compute-0-1.local      1
>>>     125 0.55500 test_simpl gladden      r     06/01/2009 10:09:11 high@compute-0-1.local      1
>>>     127 0.55500 test_simpl gladden      r     06/01/2009 10:09:11 high@compute-0-1.local      1
>>>     128 0.55500 test_simpl gladden      r     06/01/2009 10:09:11 high@compute-0-1.local      1
>>>     129 0.55500 test_simpl gladden      r     06/01/2009 10:09:14 high@compute-0-1.local      1
>>>     130 0.55500 test_simpl gladden      r     06/01/2009 10:09:14 high@compute-0-1.local      1
>>>     119 0.55500 test_simpl gladden      qw    06/01/2009 10:08:44                             1
>>>     120 0.55500 test_simpl gladden      qw    06/01/2009 10:08:45                             1
>>>     131 0.55500 test_simpl gladden      qw    06/01/2009 10:09:12                             1
>>>     132 0.55500 test_simpl gladden      qw    06/01/2009 10:09:13                             1
>>>
>>> What I had expected was that SGE would first dispatch 10 jobs from
>>> the "high priority" queue and then, as those jobs completed and
>>> slots become available, dispatch and run additional jobs from the
>>> default queue - but allowing only 10 jobs to run at one time.
>>> Instead, SGE seems to regard the 10 queue instances associated with
>>> the "high" queue as being associated with slots that are independent
>>> from the 10 that are associated with "all.q".
>>>
>>> Have I failed to configure something properly?  Is there not a way
>>> to feed jobs from multiple queues to the same set of nodes while
>>> limiting the number of active jobs to one per CPU?
>>>
>>> James Gladden
>>>
