[GE users] subordination and memory oversubscription

reuti reuti at staff.uni-marburg.de
Wed Mar 4 16:07:43 GMT 2009


On 04.03.2009 at 16:23, rdickson wrote:

> We're running a shared cluster for university research, which means we
> have a wide variety of job types submitted to SGE 6.1u2.  There is also
> hardware heterogeneity:  We have a number of 16-core nodes coupled with
> Myrinet interconnect, and a number of 4-way nodes coupled only with
> Gigabit Ethernet.  The intent is that serial work should go on the
> cheaply-connected 4-ways, and parallel work should go on the
> expensively-connected 16-ways.
>
> On this cluster we've handled this simply by excluding serial work from
> the 16-way nodes with
>
>> qconf -sq short.q | grep qtype
>     qtype                 BATCH INTERACTIVE,[@x4600s=INTERACTIVE]
>
> The hostgroup @x4600s encompasses the 16-way nodes.
> However, the types of jobs vary wildly from week to week.  On one recent
> occasion we had a waiting list full of serial work, and no parallel work
> to occupy the 16-way nodes.  Naturally the users wanted access to the
> 16-ways, which we granted temporarily.  But we're not happy with this:
> If other users had shown up with some parallel work an hour after we
> opened up these nodes to serial work, they would have been within their
> rights to ask why it wasn't available, as spelled out in the
> organizational policies.
>
> The canonical answer to this is subordination, so that if parallel work
> shows up the serial work gets suspended.  In fact we're operating a
> subordinate queue elsewhere in the organization.  But we have found from
> experience that subordination does not play nicely with h_vmem memory
> constraints.
>
> If a job uses enough memory to start swapping, then performance takes a
> tremendous hit.  So the canonical answer to *this* is to apply h_vmem
> resource limits to each host, equal to the physical memory:
>
>> qconf -se cl001 | grep complex_values
>     complex_values        h_vmem=64G,mx_endpoints=16,slots=16
>
> ...and set h_vmem as consumable in the system complex.  Which is what we
> do on this cluster.
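> For reference, the corresponding row in the complex configuration
> (qconf -mc) looks roughly like this, where the 2G default shown is only
> an illustrative value, not our actual setting:
>
>     #name     shortcut  type    relop  requestable  consumable  default  urgency
>     h_vmem    h_vmem    MEMORY  <=     YES          YES         2G       0
>
> Jobs then request memory explicitly at submission, e.g.
> qsub -l h_vmem=4G job.sh, and the scheduler debits each request
> against the host's complex_values.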
> Now, back to the idle parallel resource.  If we add those hosts to the
> subordinate queue, then much of the time things will go okay.  But
> eventually it will happen that the sum of the memory occupied by serial
> jobs and requested by parallel jobs on a single host will exceed the
> physical memory.  When that happens, with h_vmem consumable the parallel
> job will not be scheduled.
>
> Unless we boost the h_vmem in the exechost config to, say, double the
> physical memory.  The idea here is that it's OK for a suspended job to
> swap to disk.  But if we do this, we expose ourselves to the risk of
> parallel work, entirely in the superordinate queue, oversubscribing the
> memory.  My purely seat-of-the-pants estimate is that in a few months
> we'd probably see such an event (unless we went on a vigorous
> user-education program).
>
> One of my colleagues asked if we could apply an h_vmem limit per queue.
> There is an h_vmem attribute in the queue_conf (qconf -mq short.q), but
> what that does is constrain the total h_vmem for any single job in that
> queue, right?  Since we want to support large MPI jobs that span nodes
> and may exceed a single host's memory, that's not a solution.  That
> seems like the right idea, though:  A per-queue, per-host limit.
>
> So my question is (finally):  Can anyone suggest a way we can dodge
> these two risks?
>
>     1) With exechost h_vmem set to physical RAM, superordinate jobs can
> be prevented from scheduling by the subordinate queue "sitting on" the
> h_vmem.
>     2) With exechost h_vmem set to greater than physical RAM, a single
> queue can oversubscribe memory on a machine.

Instead of defining h_vmem as a complex_value in the exechost
configuration (which works fine and was the proper setting in the
past), you can nowadays also put this limit in a resource quota set
(RQS).  There you can define two rules, one for each queue:

limit queues onequeue hosts {*} to h_vmem=8G
limit queues anotherqueue hosts {*} to h_vmem=8G
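
A complete rule set along these lines might look like the following
(added with qconf -arqs); the name "mem_per_queue" and the 8G values
are only placeholders for illustration:

    {
       name         mem_per_queue
       description  "per-queue, per-host h_vmem limit"
       enabled      TRUE
       limit        queues short.q hosts {*} to h_vmem=8G
       limit        queues subordinate.q hosts {*} to h_vmem=8G
    }

Because hosts is given as {*}, the limit is accounted separately on
each host, which yields exactly the per-queue, per-host cap asked
about above.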

-- Reuti

> Thanks,
> -- 
> Ross Dickson         Computational Research Consultant
> ACEnet               http://www.ace-net.ca
> +1 902 494 6710      Skype: ross.m.dickson
> ------------------------------------------------------
> http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=120520
> To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].


