[GE users] Overallocated cluster

udowaechter udo.waechter at uni-osnabrueck.de
Sat May 23 00:26:28 BST 2009


Hi.
I have the impression that ganglia's and sge's load reports cannot be
compared.
I have not really looked into the code of ganglia's load report. We
also use ganglia to monitor our machines.
My experience is that ganglia reports a higher value for the load
than sge does.
Usually, when a machine with 8 cores runs at (sge's) np_load_short =
1.0 or so and "uptime" reports a load of something above 8,
ganglia reports a load above 100%, most of the time significantly
above 100%.
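
To make the comparison concrete: sge's np_* load values are the plain
load average divided by the number of processors. Below is a minimal
Python sketch of that normalisation. The function names are made up for
illustration, and nothing here claims to reproduce how ganglia derives
its percentage.

```python
def np_load(load: float, num_proc: int) -> float:
    """Load average divided by the processor count (per-core load)."""
    if num_proc < 1:
        raise ValueError("num_proc must be at least 1")
    return load / num_proc


def as_percent(normalized: float) -> float:
    """Express a per-core load as a percentage of full occupancy."""
    return normalized * 100.0


if __name__ == "__main__":
    # Illustrative values: an 8-core host where "uptime" shows a load of 8.0.
    per_core = np_load(8.0, 8)
    print(f"np_load = {per_core:.2f} ({as_percent(per_core):.0f}% of the cores)")
```

With these numbers a raw load of 8 on 8 cores is a per-core load of 1.0,
so any tool reporting much more than 100% for the same host must be
measuring or scaling the load differently.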

Thus, we do not rely on ganglia's load report.
In our case, we have jobs that need about 80% CPU and 20% IO
during their execution, so we configured our nodes to have
number_of_cores+number_of_cores/2 slots in SGE. With this
configuration the CPU load is at 100% on all cores most of the time.
With the same number of slots as CPU cores, the CPU load on the
machines was usually only about 60-80%.
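
As a small worked example of the slot rule above, here is a Python
sketch. The function name is hypothetical, and rounding down for an odd
number of cores is an assumption, since the rule as stated does not say
how to round.

```python
def oversubscribed_slots(cores: int) -> int:
    """Slots per host following number_of_cores + number_of_cores/2.

    Assumption: the half is rounded down when the core count is odd.
    """
    if cores < 1:
        raise ValueError("cores must be at least 1")
    return cores + cores // 2


if __name__ == "__main__":
    for cores in (2, 4, 8):
        print(f"{cores} cores -> {oversubscribed_slots(cores)} slots")
```

For an 8-core node this gives 12 slots, which is the value one would
then put into the "slots" attribute of the queue for that host.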

Anyway, IMHO, as long as the IO-wait and/or context-switch values are
not too high, oversubscribing hosts is not really something one
needs to avoid.

This of course depends on the jobs that run in your grid. If they
produce a lot of IO, then it might not be so wise to increase the
number of slots above the number of cores.

An alternative to all this would be to have a suspend threshold on
your queues. Then, if the load of a machine rises above a certain value,
some jobs are suspended, and they are resumed as soon as the load drops
again. We have that for a queue where not-so-important jobs run.
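
The suspend behaviour described above can be pictured with a toy model
in Python. This is only a sketch and not sge's actual code: the class
and function names are made up, and the threshold of 1.5 is an arbitrary
example value. In a real queue the corresponding settings are the
suspend_thresholds, nsuspend and suspend_interval attributes of the
queue configuration.

```python
from dataclasses import dataclass


@dataclass
class QueueState:
    running: int
    suspended: int


def suspend_step(state: QueueState, np_load: float, threshold: float,
                 nsuspend: int = 1) -> QueueState:
    """One check interval: suspend up to nsuspend jobs while the load is
    above the threshold, otherwise resume up to nsuspend suspended jobs."""
    if np_load > threshold:
        n = min(nsuspend, state.running)
        return QueueState(state.running - n, state.suspended + n)
    n = min(nsuspend, state.suspended)
    return QueueState(state.running + n, state.suspended - n)


if __name__ == "__main__":
    state = QueueState(running=12, suspended=0)
    # Made-up per-core load samples, one per check interval.
    for load in (1.2, 1.6, 1.7, 1.3, 1.0):
        state = suspend_step(state, load, threshold=1.5)
        print(f"load {load:.1f}: {state.running} running, "
              f"{state.suspended} suspended")
```

Suspending only a few jobs per interval, rather than all at once, gives
the load value time to react before more jobs are stopped.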

Happy computing,
udo.


On 12.05.2009, at 20:15, rmc7777 wrote:

> Hi,
>
> We have a 32-node (64-CPU) Apple G5 cluster running SGE.  We use  
> Ganglia to monitor the load on the cluster.  Ganglia shows that the  
> cluster is chronically overallocated, that is, running with an  
> average load much greater than 100%.  I would like to manage the  
> load with SGE such that jobs remain in a pending state until the  
> average load drops below 100%.  When the average load drops below  
> 100% jobs could be submitted to the run queue until the load goes  
> over 100% again.  Can you do this with SGE?  How would you configure  
> the queues or queue resources to accomplish this? thx.
>
> ------------------------------------------------------
> http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=194687
>
> To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].
>

-- 
:: udo waechter - root at zoide.net :: N 52°16'30.5" E 8°3'10.1"
:: genuine input for your ears: http://auriculabovinari.de
::                          your eyes: http://ezag.zoide.net
::                          your brain: http://zoide.net

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=198374

To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].



