[GE users] subordination and memory oversubscription

reuti reuti at staff.uni-marburg.de
Mon Mar 9 21:57:01 GMT 2009

On 09.03.2009 at 22:34, kdoman wrote:

> For older GE clusters without RQS, can I set this h_vmem inside the
> queue definition (qconf -mq ... h_vmem 8G) and accomplish the desired
> result?


In fact, you always have to set it via qsub or in the queue
configuration. The RQS will only check the requested values against its
limits, but won't set or enforce any limit on the job itself. Likewise,
attaching it to the execution host doesn't enforce any limit.
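
For illustration (the queue name and the 2G value are placeholders
only), the limit can be requested per job at submission time, or set
for all jobs of a queue in its configuration:

    # per job, at submission time:
    > qsub -l h_vmem=2G job.sh

    # or for every job in the queue ("qconf -mq short.q", h_vmem entry):
    h_vmem                2G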

What you of course can't achieve without an RQS is limiting the total
consumption across all queues on one and the same machine. If there is
only one queue per machine, you can set the total amount in "qconf -me
<nodename>" under complex_values to limit the total consumption, while
still requesting/setting it in qsub and/or the queue configuration.
-- Reuti

> Thanks.
> On Wed, Mar 4, 2009 at 11:07 AM, reuti <reuti at staff.uni-marburg.de>  
> wrote:
>> Hi,
>> On 04.03.2009 at 16:23, rdickson wrote:
>>> We're running a shared cluster for university research, which means
>>> we have a wide variety of job types submitted to SGE 6.1u2. There is
>>> also hardware heterogeneity: we have a number of 16-core nodes
>>> coupled with a Myrinet interconnect, and a number of 4-way nodes
>>> coupled only with Gigabit Ethernet. The intent is that serial work
>>> should go on the cheaply-connected 4-ways, and parallel work should
>>> go on the expensively-connected 16-ways.
>>>
>>> On this cluster we've handled this simply by excluding serial work
>>> from the 16-way nodes with:
>>>
>>>> qconf -sq short.q | grep qtype
>>>     qtype                 BATCH INTERACTIVE,[@x4600s=INTERACTIVE]
>>>
>>> The hostgroup @x4600s encompasses the 16-way nodes.
>>> However, the types of jobs vary wildly from week to week. On one
>>> recent occasion we had a waiting list full of serial work, and no
>>> parallel work to occupy the 16-way nodes. Naturally the users wanted
>>> access to the 16-ways, which we granted temporarily. But we're not
>>> happy with this: if other users had shown up with some parallel work
>>> an hour after we opened up these nodes to serial work, they would
>>> have been within their rights to ask why it wasn't available, as
>>> spelled out in the organizational policies.
>>> The canonical answer to this is subordination, so that if parallel
>>> work shows up, the serial work gets suspended. In fact we're
>>> operating a subordinate queue elsewhere in the organization. But we
>>> have found from experience that subordination does not play nicely
>>> with h_vmem memory constraints.
>>> If a job uses enough memory to start swapping, then performance
>>> takes a tremendous hit. So the canonical answer to *this* is to
>>> apply h_vmem resource limits to each host, equal to the physical
>>> memory:
>>>
>>>> qconf -se cl001 | grep complex_values
>>>     complex_values        h_vmem=64G,mx_endpoints=16,slots=16
>>>
>>> ...and set h_vmem as consumable in the system complex. Which is what
>>> we do on this cluster.
>>> Now, back to the idle parallel resource. If we add those hosts to
>>> the subordinate queue, then much of the time things will go okay.
>>> But eventually it will happen that the sum of the memory occupied by
>>> serial jobs and requested by parallel jobs on a single host will
>>> exceed the physical memory. When that happens, with h_vmem
>>> consumable, the parallel job will not be scheduled.
>>>
>>> Unless we boost the h_vmem in the exechost config to, say, double
>>> the physical memory. The idea here is that it's OK for a suspended
>>> job to swap to disk. But if we do this, we expose ourselves to the
>>> risk of parallel work, entirely in the superordinate queue,
>>> oversubscribing the memory. My purely seat-of-the-pants estimate is
>>> that in a few months we'd probably see such an event (unless we went
>>> on a vigorous user-education program).
>>> One of my colleagues asked if we could apply an h_vmem limit per
>>> queue. There is h_vmem in the queue_conf (qconf -mq short.q), but
>>> what that does is constrain the total h_vmem for any single job in
>>> that queue, right? Since we want to support large MPI jobs that span
>>> nodes and may exceed a single host's memory, that's not a solution.
>>> That seems like the right idea, though: a per-queue, per-host limit.
>>>
>>> So my question is (finally): can anyone suggest a way we can dodge
>>> these two risks?
>>>     1) With exechost h_vmem set to physical RAM, superordinate jobs
>>> can be prevented from scheduling by the subordinate queue "sitting
>>> on" the h_vmem.
>>>
>>>     2) With exechost h_vmem set to greater than physical RAM, a
>>> single queue can oversubscribe memory on a machine.
>> Instead of defining h_vmem as a complex_value in the exechost
>> configuration (which works fine and was the proper setting in the
>> past), you can nowadays put this limit into an RQS as well. There you
>> can define two rules, one for each queue:
>>
>> limit queues onequeue hosts {*} to h_vmem=8G
>> limit queues anotherqueue hosts {*} to h_vmem=8G
>>
>> -- Reuti
>>> Thanks,
>>> --
>>> Ross Dickson         Computational Research Consultant
>>> ACEnet               http://www.ace-net.ca
>>> +1 902 494 6710      Skype: ross.m.dickson
>>> ------------------------------------------------------
>>> http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=120520
>>> To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].



More information about the gridengine-users mailing list