[GE users] nodes overloaded: processes placed on already full nodes

templedf daniel.templeton at oracle.com
Wed Dec 15 15:13:12 GMT 2010


This is a known issue.  When scheduling parallel jobs, the scheduler in 
6.2 through 6.2u5 ignores host load.  This often results in jobs piling 
up on a few nodes while other nodes sit idle.  The issue is fixed in 
6.2u6 (currently only available in product form).

Daniel

On 12/15/10 06:55 AM, steve_s wrote:
> On Dec 15 15:16 +0100, reuti wrote:
>>> However, lately we have observed the following. We have a bunch of 8-core
>>> nodes connected by InfiniBand, running MPI jobs across nodes. We found
>>> that processes often get placed on full nodes which already have 8 MPI
>>> processes running. This leaves us with many oversubscribed nodes (load 16
>>> instead of 8), even though there are many empty nodes left in the
>>> queue. It is almost as if the slots already taken on a node are
>>> ignored by SGE.
>>
>> How many slots are defined in the queue definition, and how many queues have you defined?
>
>      $ qconf -sql
>      adde.q
>      all.q
>      test.q
>      vtc.q
>
> Only the first and the last queue are used, and only the first is used
> for parallel jobs. Each node belongs to only one queue at a time, so
> jobs in different queues cannot run on the same node.
>
>
> 8 slots (see attachment for full output).
>
>      $ qconf -sq adde.q | grep slot
>      slots                 8
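>
> To see how the allocated slots compare with the actual load on each
> host, qhost -q can be used; it lists every queue instance under its
> host, with the slot usage next to the host's load average (illustrative
> output with a hypothetical host name, columns abbreviated):
>
>      $ qhost -q
>      HOSTNAME   ARCH        NCPU  LOAD  ...
>      node042    lx24-amd64     8 16.03  ...
>         adde.q  BIP     0/8/8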
>
> Thank you.
>
> best,
> Steve
