[GE users] subordination and memory oversubscription

rdickson ross.dickson at dal.ca
Wed Mar 4 15:23:04 GMT 2009

Hi folks.

We're running a shared cluster for university research, which means we 
have a wide variety of job types submitted to SGE 6.1u2.  There is also 
hardware heterogeneity:  We have a number of 16-core nodes coupled with 
Myrinet interconnect, and a number of 4-way nodes coupled only with 
Gigabit Ethernet.  The intent is that serial work should go on the 
cheaply-connected 4-ways, and parallel work should go on the 
expensively-connected 16-ways. 

On this cluster we've handled this simply by excluding serial work from 
the 16-way nodes with

    > qconf -sq short.q | grep qtype
    qtype                 BATCH INTERACTIVE,[@x4600s=INTERACTIVE]

The hostgroup @x4600s encompasses the 16-way nodes.
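
For reference, the hostgroup is an ordinary qconf hostgroup; its
definition looks something like this (host names illustrative):

    > qconf -shgrp @x4600s
    group_name @x4600s
    hostlist cl001 cl002 cl003 cl004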

However, the types of jobs vary wildly from week to week.  On one recent 
occasion we had a waiting list full of serial work, and no parallel work 
to occupy the 16-way nodes.  Naturally the users wanted access to the 
16-ways, which we granted temporarily.  But we're not happy with this:  
If other users had shown up with some parallel work an hour after we 
opened up these nodes to serial work, they would have been within their 
rights to ask why those nodes weren't available, as spelled out in the 
organizational policies. 

The canonical answer to this is subordination, so that if parallel work 
shows up, the serial work gets suspended.  In fact we're operating a 
subordinate queue elsewhere in the organization.  But we have found from 
experience that subordination does not play nicely with h_vmem memory 
limits.
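
The mechanism I mean is the subordinate_list attribute in the queue
configuration.  A sketch, with placeholder queue names:

    > qconf -sq parallel.q | grep subordinate_list
    subordinate_list      serial.q=1

If memory serves, the =1 threshold suspends serial.q on a host as soon
as a single parallel.q slot there is occupied; with no threshold,
serial.q is only suspended once parallel.q's slots on that host are
completely full.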

If a job uses enough memory to start swapping, then performance takes a 
tremendous hit.  So the canonical answer to *this* is to apply h_vmem 
resource limits to each host, equal to the physical memory:

    > qconf -se cl001 | grep complex_values
    complex_values        h_vmem=64G,mx_endpoints=16,slots=16

...and set h_vmem as consumable in the system complex.  Which is what we 
do on this cluster.
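
For completeness, the h_vmem entry in the complex (qconf -mc) then
looks roughly like this, with the consumable column set to YES:

    #name    shortcut  type    relop requestable consumable default urgency
    h_vmem   h_vmem    MEMORY  <=    YES         YES        0       0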

Now, back to the idle parallel resource.  If we add those hosts to the 
subordinate queue, then much of the time things will go okay.  But 
eventually it will happen that the sum of the memory occupied by serial 
jobs and requested by parallel jobs on a single host will exceed the 
physical memory.  When that happens, with h_vmem consumable the parallel 
job will not be scheduled.
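
To put invented numbers on it: take a 16-way host with h_vmem=64G on
which twelve serial jobs have booked 4G each, i.e. 48G.  A 4-slot
parallel job requesting 5G per slot needs a further 20G, and 48G + 20G
= 68G > 64G, so the scheduler refuses to place it there, even though
the serial jobs would be suspended the moment it started.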

Unless we boost the h_vmem in the exechost config to, say, double the 
physical memory.  The idea here is that it's ok for a suspended job to 
swap to disk.  But if we do this, we expose ourselves to the risk of 
parallel work, entirely in the superordinate queue, oversubscribing the 
memory.  My purely seat-of-the-pants estimate is that in a few months 
we'd probably see such an event (unless we went on a vigorous 
user-education program).
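
Concretely, that overcommit would amount to something like this (128G
being simply double the 64G of physical RAM, set via qconf -me cl001):

    > qconf -se cl001 | grep complex_values
    complex_values        h_vmem=128G,mx_endpoints=16,slots=16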

One of my colleagues asked if we could apply an h_vmem limit per queue.  
There is an h_vmem attribute in the queue_conf (qconf -mq short.q), but 
what that does is constrain the total h_vmem for any single job in that 
queue, right?  
Since we want to support large MPI jobs that span nodes and may exceed a 
single host's memory, that's not a solution.  That seems like the right 
idea, though:  A per-queue, per-host limit.
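
For reference, that attribute shows up among the queue's resource
limits:

    > qconf -sq short.q | grep h_vmem
    h_vmem                INFINITY

i.e. a cap on what one job may use, not on the combined footprint of
all the queue's jobs on a host.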

So my question is (finally):  Can anyone suggest a way we can dodge 
those two risks?

    1) With exechost h_vmem set to physical RAM, superordinate jobs can 
be prevented from scheduling by the subordinate queue "sitting on" the 
memory.

    2) With exechost h_vmem set to greater than physical RAM, a single 
queue can oversubscribe memory on a machine.


Ross Dickson         Computational Research Consultant
ACEnet               http://www.ace-net.ca
+1 902 494 6710      Skype: ross.m.dickson

