[GE users] Does oversubscribing nodes actually hurt anything?

Marconnet, James E Mr /Computer Sciences Corporation james.marconnet at smdc.army.mil
Wed Feb 16 22:37:39 GMT 2005


We have some Penguin nodes and some RLX nodes. The Penguin nodes are
approximately 2 GHz dual-processor, hyperthreaded. The RLX nodes are 800 MHz
single-processor.

We're using SGE6.0u1, setting slots to 4 for the Penguin nodes and to 1 for
the RLX nodes, since most jobs are CPU-bound.

Usually we end up with one job running on each RLX node and 4 on
each Penguin node. But sometimes, when people submit jobs through different
queues, there are 2 or even 3 jobs running on an RLX node, and as
many as 8 jobs running on a Penguin node. All jobs try to get as much CPU as
they can, so the load factors go well above 1.0. Nodes operating in alarm
state "a" are common.
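
To make those numbers concrete, here is a minimal Python sketch of how jobs can
pile up when several queues each offer their own slots on the same host. The
queue names (short.q, long.q), the per-queue slot counts, and the logical CPU
counts (4 for a hyperthreaded dual-processor Penguin node, 1 for an RLX node)
are assumptions for illustration, not our actual configuration.

```python
# Hypothetical slot settings: each queue has its own slot count per node type.
QUEUE_SLOTS = {
    "short.q": {"penguin": 4, "rlx": 1},
    "long.q": {"penguin": 4, "rlx": 2},
}

# Assumed logical CPU counts (2 processors x 2 hyperthreads, and 1 processor).
LOGICAL_CPUS = {"penguin": 4, "rlx": 1}


def worst_case_jobs(node_type):
    """Jobs that can land on one node if every queue fills its own slots."""
    return sum(slots[node_type] for slots in QUEUE_SLOTS.values())


def load_per_cpu(node_type, running_jobs):
    """Rough load per logical CPU, assuming every job is CPU-bound."""
    return running_jobs / LOGICAL_CPUS[node_type]


for node in ("penguin", "rlx"):
    jobs = worst_case_jobs(node)
    print(node, jobs, load_per_cpu(node, jobs))
```

With these assumed values a Penguin node can reach 8 jobs and an RLX node 3,
which is in line with the counts described above.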

As long as there is enough memory, disk space, network capacity, etc.,
does this oversubscribing actually hurt anything, such as overheating the
nodes or causing file corruption? It seems to me that there is only 100% of
each CPU available, no matter what you do, and that if one node can run X
jobs at a time, then every other node from the same manufacturer should be
able to do it just about as well.
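
The "only 100% of each CPU" reasoning can be sketched as simple arithmetic.
This assumes an idealized scheduler that divides CPU time evenly among
CPU-bound jobs and ignores context-switch overhead, cache contention, and the
fact that hyperthreaded logical CPUs are not full processors. The names
cpu_share and slowdown are made up for this example.

```python
def cpu_share(logical_cpus, jobs):
    """Fraction of one logical CPU each CPU-bound job gets (idealized)."""
    if jobs <= 0:
        raise ValueError("jobs must be positive")
    return min(1.0, logical_cpus / jobs)


def slowdown(logical_cpus, jobs):
    """How many times longer each job runs compared with an idle node."""
    return 1.0 / cpu_share(logical_cpus, jobs)


# Penguin node (assumed 4 logical CPUs) with 4 and then 8 jobs.
print(cpu_share(4, 4), slowdown(4, 4))
print(cpu_share(4, 8), slowdown(4, 8))
# RLX node (1 CPU) with 3 jobs.
print(cpu_share(1, 3), slowdown(1, 3))
```

Under this model total throughput stays the same and each job simply takes
longer, so a hang or an error would have to come from something the model
leaves out.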

That said, I've seen some of our nodes that seem to be able to take
oversubscribing, and others that suddenly hang or do something else
unexpected, like giving repeated bizarre Java errors, when more jobs are
running on them at once than we planned. It all depends.

I've tried setting the load limit to 0.75. But then the Penguin nodes
sometimes get only 3 jobs running at a time, seemingly wasting throughput.
And even then they sometimes get 4-5 jobs.
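
One possible reading of the 3-job case: if the limit is compared against load
divided by the number of logical CPUs, then three CPU-bound jobs on a 4-CPU
node already put the node at 0.75. The sketch below assumes an instantaneous
load equal to running jobs per logical CPU, and that a node stops accepting
jobs once that value reaches the limit. Real load averages lag behind the
actual job count, which may be why 4-5 jobs still get through sometimes.
jobs_admitted is a hypothetical helper, not part of SGE.

```python
def jobs_admitted(logical_cpus, load_limit, slots):
    """Count jobs started before the per-CPU load reaches load_limit."""
    running = 0
    # Assumption: a new job is accepted only while load is below the limit.
    while running < slots and running / logical_cpus < load_limit:
        running += 1
    return running


print(jobs_admitted(4, 0.75, 4))  # Penguin node, assumed 4 logical CPUs
print(jobs_admitted(1, 0.75, 1))  # RLX node, 1 CPU
```

In this model the Penguin node stops at 3 jobs and the RLX node at 1, so a
limit just above 0.75 would be needed before a fourth job is admitted.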

Thanks!
Jim


