[GE users] nodes overloaded: processes placed on already full nodes

reuti reuti at staff.uni-marburg.de
Wed Dec 15 15:28:11 GMT 2010


On 15.12.2010 at 16:13, templedf wrote:

> This is a known issue.  When scheduling parallel jobs with 6.2 to 6.2u5, 
> the scheduler ignores host load.

Yep.

>  This often results in jobs piling up 
> on a few nodes while other nodes are idle.

As far as I understand the problem, the nodes are oversubscribed because more than 8 processes get scheduled onto them.
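A quick way to see whether that is really what happens (a sketch, adapt to your setup) is to compare the slots GE has granted per host with the load it reports for that host:

    # per-host load next to the per-queue slot usage
    $ qhost -q

    # per-queue-instance view: slots used/total and load_avg side by side
    $ qstat -f -u '*'

If a host shows all of its slots in use but a load far above its core count, something is really running twice there; if the granted slot count itself exceeds 8, the scheduler handed out too many.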


>  The issue is fixed in 6.2u6 
> (currently only available in product form).
> 
> Daniel
> 
> On 12/15/10 06:55 AM, steve_s wrote:
>> On Dec 15 15:16 +0100, reuti wrote:
>>>> However, lately we observed the following. We have a bunch of 8-core
>>>> nodes connected by Infiniband and running MPI jobs across nodes. We found
>>>> that processes often get placed on full nodes which have 8 MPI processes
>>>> already running. This leaves us with many oversubscribed (load 16
>>>> instead of 8) nodes. This happens although there are many empty nodes
>>>> left in the queue. It is almost as if the slots already taken on one
>>>> node are ignored by SGE.
>>> 
>>> how many slots are defined in the queue definition, and how many queues do you have defined?
>> 
>>     $ qconf -sql
>>     adde.q
>>     all.q
>>     test.q
>>     vtc.q
>> 
>> Only the first and last queue are used and only the first is used for
>> parallel jobs. Nodes belong to only one queue at a time such that jobs
>> in different queues cannot run on the same node.

Did you change the host assignment to certain queues while jobs were still running? Maybe you need to limit the total number of slots per machine to 8, either in an RQS or by setting it in each host's complex_values.
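As a sketch of both ways (the RQS name and the host name are just placeholders, and the 8 assumes your 8-core machines):

    # via a resource quota set, added with qconf -arqs:
    {
       name         max_slots_per_host
       description  "never grant more than 8 slots on any single host"
       enabled      TRUE
       limit        hosts {*} to slots=8
    }

    # or per execution host, e.g. for node01:
    $ qconf -me node01
    # and set in the editor:  complex_values   slots=8

Either way the limit is counted across all queues on a host, which is what you want when a host is listed in more than one queue.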

Another reason for apparent oversubscription: processes in state "D" (uninterruptible sleep) count towards the load, so despite the high load everything may actually be in order.
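On a Linux node you can list those processes directly (a sketch, assuming GNU ps):

    # "D" = uninterruptible sleep, usually waiting on I/O; such processes
    # raise the load average without using any CPU
    $ ps -eo state,pid,user,wchan:30,cmd | awk '$1 == "D"'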

-- Reuti


>> 8 slots (see attachment for full output).
>> 
>>     $ qconf -sq adde.q | grep slot
>>     slots                 8
>> 
>> Thank you.
>> 
>> best,
>> Steve
>> 
> 

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=305837

To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].


