[GE users] subordination and memory oversubscription

kdoman kdoman07 at gmail.com
Mon Mar 9 21:34:58 GMT 2009


For older GE clusters without RQS, can I set this h_vmem inside the
queue definition (qconf -sq ... h_vmem 8G) and accomplish the desired
effect?
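
(For reference, a sketch of the queue-level setting being asked about
here; 8G/16G are illustrative values, and the hostgroup override uses
the same bracket syntax as the qtype example quoted below:)

    > qconf -sq short.q | grep h_vmem
        h_vmem                8G,[@x4600s=16G]

Note that, as I understand it, this queue-level h_vmem is a hard limit
enforced per slot on each individual job, not a consumable that is
debited across jobs on the host; consumable accounting still requires
h_vmem to appear in complex_values somewhere.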


On Wed, Mar 4, 2009 at 11:07 AM, reuti <reuti at staff.uni-marburg.de> wrote:
> Hi,
> Am 04.03.2009 um 16:23 schrieb rdickson:
>> We're running a shared cluster for university research, which means
>> we have a wide variety of job types submitted to SGE 6.1u2.  There
>> is also hardware heterogeneity:  We have a number of 16-core nodes
>> coupled with Myrinet interconnect, and a number of 4-way nodes
>> coupled only with Gigabit Ethernet.  The intent is that serial work
>> should go on the cheaply-connected 4-ways, and parallel work should
>> go on the expensively-connected 16-ways.
>> On this cluster we've handled this simply by excluding serial work
>> from the 16-way nodes with
>>> qconf -sq short.q | grep qtype
>>     qtype                 BATCH INTERACTIVE,[@x4600s=INTERACTIVE]
>> The hostgroup @x4600s encompasses the 16-way nodes.
>> However, the types of jobs vary wildly from week to week.  On one
>> recent occasion we had a waiting list full of serial work, and no
>> parallel work to occupy the 16-way nodes.  Naturally the users
>> wanted access to the 16-ways, which we granted temporarily.  But
>> we're not happy with this:  If other users had shown up with some
>> parallel work an hour after we opened up these nodes to serial work,
>> they would have been within their rights to ask why that capacity
>> wasn't available, as spelled out in the organizational policies.
>> The canonical answer to this is subordination, so that if parallel
>> work shows up the serial work gets suspended.  In fact we're
>> operating a subordinate queue elsewhere in the organization.  But we
>> have found from experience that subordination does not play nicely
>> with h_vmem memory constraints.
>> If a job uses enough memory to start swapping, then performance
>> takes a tremendous hit.  So the canonical answer to *this* is to
>> apply h_vmem resource limits to each host, equal to the physical
>> memory:
>>> qconf -se cl001 | grep complex_values
>>     complex_values        h_vmem=64G,mx_endpoints=16,slots=16
>> ...and set h_vmem as consumable in the system complex.  Which is
>> what we do on this cluster.
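
(The consumable setup described above would look roughly like this
sketch; the columns are those of the 6.x complex configuration, and
the default/urgency values here are assumptions, not taken from the
thread:)

    > qconf -sc | grep h_vmem
        h_vmem   h_vmem   MEMORY   <=   YES   YES   0   0

    > qsub -l h_vmem=2G serial_job.sh

With consumable set to YES, each job's h_vmem request is debited
against the host's complex_values figure (64G in the example below),
and a job that would push the total past that figure is not scheduled
on that host.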
>> Now, back to the idle parallel resource.  If we add those hosts to
>> the subordinate queue, then much of the time things will go okay.
>> But eventually it will happen that the sum of the memory occupied by
>> serial jobs and requested by parallel jobs on a single host will
>> exceed the physical memory.  When that happens, with h_vmem
>> consumable the parallel job will not be scheduled.
>> Unless we boost the h_vmem in the exechost config to, say, double
>> the physical memory.  The idea here is that it's ok for a suspended
>> job to swap to disk.  But if we do this, we expose ourselves to the
>> risk of parallel work, entirely in the superordinate queue,
>> oversubscribing the memory.  My purely seat-of-the-pants estimate is
>> that in a few months we'd probably see such an event (unless we went
>> on a vigorous user-education program).
>> One of my colleagues asked if we could apply an h_vmem limit per
>> queue.  There is an h_vmem field in the queue_conf (qconf -mq
>> short.q), but what that does is constrain the total h_vmem for any
>> job in that queue, right?  Since we want to support large MPI jobs
>> that span nodes and may exceed a single host's memory, that's not a
>> solution.  That seems like the right idea, though:  A per-queue,
>> per-host limit.
>> So my question is (finally):  Can anyone suggest a way we can dodge
>> these two risks?
>>     1) With exechost h_vmem set to physical RAM, superordinate jobs
>> can be prevented from scheduling by the subordinate queue "sitting
>> on" the h_vmem.
>>     2) With exechost h_vmem set to greater than physical RAM, a
>> single queue can oversubscribe memory on a machine.
> Instead of defining h_vmem as a complex_value in the exechost
> configuration (which works fine and was the proper setting in the
> past), you can nowadays also put this limit in an RQS (resource
> quota set).  In it you can define two rules, one for each queue:
> limit queues onequeue hosts {*} to h_vmem=8G
> limit queues anotherqueue hosts {*} to h_vmem=8G
> -- Reuti
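
(The two "limit" lines above would live inside a complete rule set,
added with qconf -arqs; a sketch, with the rule set name and
description invented for illustration:)

    > qconf -arqs
        {
           name         per_queue_mem
           description  "per-queue, per-host h_vmem cap"
           enabled      TRUE
           limit        queues onequeue hosts {*} to h_vmem=8G
           limit        queues anotherqueue hosts {*} to h_vmem=8G
        }

The braces around * make the host list expand per host, so each rule
caps h_vmem for that queue on each individual host rather than as one
cluster-wide total -- which is the per-queue, per-host limit asked
about above.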
>> Thanks,
>> --
>> Ross Dickson         Computational Research Consultant
>> ACEnet               http://www.ace-net.ca
>> +1 902 494 6710      Skype: ross.m.dickson
>> ------------------------------------------------------
>> http://gridengine.sunsource.net/ds/viewMessage.do?
>> dsForumId=38&dsMessageId=120520
>> To unsubscribe from this discussion, e-mail: [users-
>> unsubscribe at gridengine.sunsource.net].


