[GE users] Does oversubscribing nodes actually hurt anything?

Reuti reuti at staff.uni-marburg.de
Wed Feb 16 23:20:13 GMT 2005

You can't damage anything by oversubscribing. When the load is above 1, the
CPU time is simply shared among all the tasks. If we neglect the limits of
memory and I/O bandwidth, running 5 jobs one after the other on a single CPU
should take the same total time as running them all at once. Because of the
limits mentioned, running them all at once may be slower, depending on the type
of application. E.g. to determine the maximum power consumption of the nodes
for our UPS, I measured one node idle and under full load. It can't get
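
To make the time-sharing argument concrete, here is a rough sketch of an idealised single CPU that splits its time evenly among all runnable jobs. It deliberately ignores memory and I/O bandwidth, so it shows only the "same total time" case; the function names and the job lengths are made up for illustration.

```python
def sequential_finish_times(work):
    """Run jobs one after the other; return each job's finish time."""
    finish, clock = [], 0.0
    for w in work:
        clock += w
        finish.append(clock)
    return finish


def shared_finish_times(work):
    """Ideal CPU sharing: n runnable jobs each get 1/n of the CPU."""
    order = sorted((w, i) for i, w in enumerate(work))
    finish = [0.0] * len(work)
    clock = 0.0
    done = 0.0            # CPU-seconds every still-running job has received
    active = len(work)    # number of jobs still running
    for w, i in order:
        # The shortest remaining job needs (w - done) more CPU-seconds,
        # which costs (w - done) * active wall-clock seconds when shared.
        clock += (w - done) * active
        done = w
        finish[i] = clock
        active -= 1
    return finish


jobs = [10.0] * 5  # five CPU-bound jobs of 10 CPU-seconds each (made-up numbers)
print(max(sequential_finish_times(jobs)))  # 50.0
print(max(shared_finish_times(jobs)))      # 50.0
```

The last job finishes at the same moment either way. What differs is that, run sequentially, the first job is already done after 10 seconds, while with sharing every job finishes only at the end.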

We saw the effects you describe on some nodes with cheap power supplies. They
were intended for desktop use, where word processing software never drives the
CPU at 100%. Maybe these power supplies can deliver their maximum output only
for a short time, or on the other hand maybe you need an extra fan in these
machines.

Another side note: do you have a UPS? Our main reason for buying an online UPS
was to filter out spikes in the mains supply from the electricity company.
Sometimes, out of 44 nodes on one cluster, 6 shut down completely, 8 ended up
in the land of lost bytes, 7 rebooted and kept working, and the remaining ones
survived (I always forgot to play the lottery with the numbers we got).

For us, HT on dual Xeons seems to allow only 3 jobs at once on a machine;
more will slow down the other jobs. But this depends on the software used.
Maybe you can run 4.

But you can also set the slots to 4 in the exec host definition. This is then
the upper limit for all slots in total on a machine, regardless of which
queues they come from. If you do this, you can remove the load threshold on
the queues.
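
As a toy model of why a host-level limit helps (this is not SGE code; the queue names and helper functions are hypothetical): with two queues offering 4 slots each on the same host, per-queue limits alone let 8 jobs start, while a host-wide cap of 4 stops at 4. In SGE itself I believe the cap corresponds to a `complex_values slots=4` entry in the exec host configuration, edited with `qconf -me <hostname>`, but please check that against the man pages of your version.

```python
def can_start(queue, running, queue_slots, host_slots=None):
    """Return True if one more job from `queue` may start on this host.

    running:     dict queue name -> jobs currently running on the host
    queue_slots: dict queue name -> slots that queue offers on the host
    host_slots:  optional cap across all queues (None means no cap)
    """
    if running.get(queue, 0) >= queue_slots[queue]:
        return False  # this queue's own slots are used up
    if host_slots is not None and sum(running.values()) >= host_slots:
        return False  # the host-wide cap is reached
    return True


def fill(queue_slots, host_slots=None):
    """Keep starting jobs from every queue until nothing more fits."""
    running = {q: 0 for q in queue_slots}
    started = True
    while started:
        started = False
        for q in queue_slots:
            if can_start(q, running, queue_slots, host_slots):
                running[q] += 1
                started = True
    return running


queues = {"queue_a": 4, "queue_b": 4}  # hypothetical queues on one host
print(sum(fill(queues).values()))                # 8
print(sum(fill(queues, host_slots=4).values()))  # 4
```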

FYI: the alarm state in SGE simply means that the load on a machine is above
the defined load threshold.
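
In other words, the check amounts to a single comparison. A minimal sketch, assuming the load is divided by the number of CPUs before it is compared (as I understand SGE's `np_load_avg` value to work); the function name and the numbers are only illustrative, and whether the boundary case counts as alarm is not modelled here:

```python
def in_alarm(load_avg, num_cpus, threshold):
    """True if the per-CPU load is above the threshold."""
    return load_avg / num_cpus > threshold


print(in_alarm(8.0, 4, 1.75))  # True: per-CPU load 2.0 is above 1.75
print(in_alarm(4.0, 4, 1.75))  # False: per-CPU load 1.0 is below 1.75
```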

Cheers - Reuti

Quoting "Marconnet, James E Mr /Computer Sciences Corporation" 
<james.marconnet at smdc.army.mil>:

> We have some Penguin nodes and some RLX nodes. The Penguin nodes are
> approximately 2GHz dual-processor, hyperthreaded. The RLX nodes are 800 MHz
> single-processor. 
> We're using SGE6.0u1, setting slots to 4 for the Penguin nodes and to 1 for
> the RLX nodes, since most jobs are CPU-bound.
> Usually we end up with one job running on each RLX node and usually 4 on
> each Penguin node. But sometimes when people submit jobs from different
> queues, there are 2 or even 3 jobs running on an RLX node, and sometimes as
> many as 8 jobs running on a Penguin node. All jobs try to get as much CPU
> as they can. And the Load Factors go way above 1.0. Nodes operating in alarm
> state "a" are common.
> As long as there is enough memory, disk space, network capacity, etc. etc.
> does this oversubscribing actually hurt anything, such as overheating the
> nodes, or causing file corruption? It seems to me that there is only 100%
> of each CPU available, no matter what you do. And that if one node can run X
> jobs at a time, then each of them from the same manufacturer should be able
> to do it just about as well as the others.
> The question asked, I've seen some of our nodes that seem to be able to
> take oversubscribing, and others that seem to actually suddenly hang or to do
> something else unexpected like give repeated bizarre Java errors when more
> jobs are running at once there than we planned. It all depends.
> I've tried setting the load limit to 0.75. But then the Penguin nodes
> sometimes only get 3 jobs running at a time, seemingly wasting throughput.
> And still sometimes they get 4-5 jobs.
> Thanks!
> Jim
