[GE users] subordination and memory oversubscription
ross.dickson at dal.ca
Wed Mar 4 15:23:04 GMT 2009
We're running a shared cluster for university research, which means we
have a wide variety of job types submitted to SGE 6.1u2. There is also
hardware heterogeneity: we have a number of 16-core nodes coupled by a
Myrinet interconnect, and a number of 4-way nodes connected only by
Gigabit Ethernet. The intent is that serial work should go on the
cheaply-connected 4-ways, and parallel work on the Myrinet-connected
16-ways.
On this cluster we've handled this simply by excluding serial work from
the 16-way nodes with:
> qconf -sq short.q | grep qtype
qtype                 BATCH INTERACTIVE,[@x4600s=INTERACTIVE]
The hostgroup @x4600s encompasses the 16-way nodes.
However, the types of jobs vary wildly from week to week. On one recent
occasion we had a waiting list full of serial work, and no parallel work
to occupy the 16-way nodes. Naturally the users wanted access to the
16-ways, which we granted temporarily. But we're not happy with this:
If other users had shown up with some parallel work an hour after we
opened up these nodes to serial work, they would have been within their
rights to ask why it wasn't available, as spelled out in the cluster's
usage policy.
The canonical answer to this is subordination, so that if parallel work
shows up the serial work gets suspended. In fact we're operating a
subordinate queue elsewhere in the organization. But we have found from
experience that subordination does not play nicely with h_vmem memory
limits.
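For reference, that existing setup is the standard subordinate_list
arrangement; with hypothetical queue names it looks something like:

> qconf -sq parallel.q | grep subordinate_list
subordinate_list      serial.q=1

i.e. as soon as one parallel.q slot on a host is occupied, the serial.q
instance on that host is suspended.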
If a job uses enough memory to start swapping, then performance takes a
tremendous hit. So the canonical answer to *this* is to apply h_vmem
resource limits to each host, equal to the physical memory:
> qconf -se cl001 | grep complex_values
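On a node with 32 GB of physical RAM (a figure made up for
illustration), the line of interest reads:

complex_values        h_vmem=32G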
...and set h_vmem as consumable in the system complex. Which is what we
do on this cluster.
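Making h_vmem consumable is the usual one-row change in the complex
configuration (qconf -mc); assuming the stock column layout, the row
ends up with the consumable column set to YES:

> qconf -sc | grep -E '^#name|^h_vmem'
#name       shortcut   type      relop requestable consumable default  urgency
h_vmem      h_vmem     MEMORY    <=    YES         YES        0        0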
Now, back to the idle parallel resource. If we add those hosts to the
subordinate queue, then much of the time things will go okay. But
eventually it will happen that the sum of the memory occupied by serial
jobs and requested by parallel jobs on a single host will exceed the
physical memory. When that happens, with h_vmem consumable the parallel
job will not be scheduled.
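To make that concrete with invented numbers: on a 32G host whose
complex_values says h_vmem=32G, if the serial jobs already running
there have requested a total of 24G, a parallel task asking for 16G
would push the consumable total to 24G + 16G = 40G > 32G, so the
scheduler will not place it, even though suspending the serial jobs
would make the overcommit harmless in practice.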
Unless we boost the h_vmem in the exechost config to, say, double the
physical memory. The idea here is that it's ok for a suspended job to
swap to disk. But if we do this, we expose ourselves to the risk of
parallel work, entirely in the superordinate queue, oversubscribing the
memory. My purely seat-of-the-pants estimate is that in a few months
we'd probably see such an event (unless we went on a vigorous campaign
of user education).
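The boost itself would be a one-liner per host; sticking with the
made-up 32G physical figure, doubling it looks like:

> qconf -mattr exechost complex_values h_vmem=64G cl001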
One of my colleagues asked if we could apply an h_vmem limit per queue.
There is an h_vmem field in the queue_conf (qconf -mq short.q), but
what that does is constrain the total h_vmem of any single job in that
queue, right?
Since we want to support large MPI jobs that span nodes and may exceed a
single host's memory, that's not a solution. That seems like the right
idea, though: A per-queue, per-host limit.
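For reference, the queue-level field in question is still at its
default on our system:

> qconf -sq short.q | grep h_vmem
h_vmem                INFINITY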
So my question is (finally): Can anyone suggest a way we can dodge
those two risks?
1) With exechost h_vmem set to physical RAM, superordinate jobs can
be prevented from scheduling by the subordinate queue "sitting on" the
memory.
2) With exechost h_vmem set to greater than physical RAM, a single
queue can oversubscribe memory on a machine.
Ross Dickson Computational Research Consultant
+1 902 494 6710 Skype: ross.m.dickson