[GE users] More slots scheduled than available on execution host

s_kreidl sabine.kreidl at uibk.ac.at
Wed Aug 12 12:27:05 BST 2009


Hi Reuti,

reuti wrote:
>>
>> The circumstances are similar to the last time:
>> 1. We are in a brutal load situation (99.8% load).
>>     
>
> Load on the cluster, or on the machine where the qmaster runs? Was it  
> also running out of memory, with some OOM (out-of-memory) killer  
> action from the kernel (should appear in /var/log/messages)?
>
>   
No, not at all! What I meant was the load of the total cluster: pretty 
much every core on the cluster was occupied (some of them inadvertently 
twice, due to the described scheduling error) and running at 100%. 
Neither the master nor any of the execution hosts failed. Everything is 
running smoothly right now, and nothing strange happens apart from the 
sporadic over-subscription of nodes (and some advance reservation 
problems, which are probably off-topic here).
And to answer your second question: the job was submitted with a 
standard qsub, and there was nothing strange in the "qstat -j ..." info.

I think I have found a trace of the possible cause. In the messages 
files of the nodes in question I found these warnings:
    08/10/2009 13:35:29|  main|n002|W|local configuration n002 not defined - using global configuration
    08/10/2009 14:40:51|  main|n002|W|local configuration n002 not defined - using global configuration
and indeed, there is no slot limit in the "global execution host" 
settings. I also have some "commlib errors" in the messages files of the 
nodes, but judging from their timestamps they are not directly 
related to the two warnings above.
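To check how many nodes are affected, the messages files can be scanned for this warning. The following is only a minimal Python sketch: the line layout (date time|thread|host|level|text) is inferred from the two warnings quoted above, and the function name find_missing_local_config is hypothetical, not part of Grid Engine.

```python
import re

# Layout assumed from the two quoted warnings:
#   MM/DD/YYYY HH:MM:SS|  thread|host|level|text
LINE_RE = re.compile(
    r"^\s*(?P<date>\d{2}/\d{2}/\d{4}) (?P<time>\d{2}:\d{2}:\d{2})\|"
    r"\s*(?P<thread>[^|]+)\|(?P<host>[^|]+)\|(?P<level>[A-Z])\|(?P<text>.*)$"
)
MISSING_LOCAL = re.compile(r"local configuration (\S+) not defined")


def find_missing_local_config(lines):
    """Return (date, time, host) for each 'local configuration ... not defined' warning."""
    hits = []
    for line in lines:
        m = LINE_RE.match(line)
        if m is None or m.group("level") != "W":
            continue  # not a message line, or not a warning
        w = MISSING_LOCAL.search(m.group("text"))
        if w:
            hits.append((m.group("date"), m.group("time"), w.group(1)))
    return hits


# The two warnings from n002, as quoted in this mail.
sample = [
    "    08/10/2009 13:35:29|  main|n002|W|local configuration n002 not defined - using global configuration",
    "    08/10/2009 14:40:51|  main|n002|W|local configuration n002 not defined - using global configuration",
]

for date, time, host in find_missing_local_config(sample):
    print(host, date, time)
```

Feeding it the lines of each node's messages file would give a per-host list of timestamps to compare against the times of the over-subscribed jobs.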

As the cluster is homogeneous, I thought I could simply add slots=8 to 
the global config as a quick workaround, but this has a devastating 
effect on the scheduler: no jobs are scheduled anymore. An extract from 
"qhost -F slots" gives:
    n125                    lx24-amd64      8  8.02   31.4G    5.3G    8.0G  127.1M
        Host Resource(s):      gc:slots=-958.000000
    n126                    lx24-amd64      8  0.00   31.4G  338.9M    8.0G     0.0
        Host Resource(s):      gc:slots=-958.000000
Restarting sge_qmaster does not repair this. Only removing the global 
slot limit helps, which again leaves me with the danger of 
over-subscription.
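The identical value on both hosts may be a hint. A rough Python sketch for reading such output follows; it assumes the layout of the "qhost -F slots" extract above, and it assumes (this is a guess, not something confirmed here) that the "gc:" value is one cluster-wide counter, so that the slots in use would be the configured limit minus the reported remainder. The names negative_slot_hosts and implied_slots_in_use are hypothetical.

```python
import re

# Resource lines as they appear in the extract: "Host Resource(s):  gc:slots=-958.000000"
RESOURCE_RE = re.compile(r"Host Resource\(s\):\s+(\w+):slots=(-?\d+(?:\.\d+)?)")


def negative_slot_hosts(qhost_output):
    """Map hostname -> reported slots value for every host reporting slots < 0.

    Assumes output of "qhost -F slots" only, i.e. each host line is followed
    by at most one resource line.
    """
    result = {}
    current_host = None
    for line in qhost_output.splitlines():
        stripped = line.strip()
        if not stripped or stripped.startswith(("HOSTNAME", "---")):
            continue  # skip blank lines, the header and the separator
        m = RESOURCE_RE.search(stripped)
        if m is None:
            current_host = stripped.split()[0]  # a host line: first field is the name
        elif current_host is not None and float(m.group(2)) < 0:
            result[current_host] = float(m.group(2))
    return result


def implied_slots_in_use(configured, remaining):
    """Slots in use if 'remaining' is what is left of one shared 'configured' limit (assumption)."""
    return configured - remaining


SAMPLE = """\
    n125                    lx24-amd64      8  8.02   31.4G    5.3G    8.0G  127.1M
        Host Resource(s):      gc:slots=-958.000000
    n126                    lx24-amd64      8  0.00   31.4G  338.9M    8.0G     0.0
        Host Resource(s):      gc:slots=-958.000000
"""

hosts = negative_slot_hosts(SAMPLE)
print(hosts)
print(implied_slots_in_use(8, hosts["n125"]))
```

Under that assumption, a limit of 8 with -958 remaining would correspond to 966 slots in use across the whole cluster, which would also explain why no further jobs get scheduled while the limit is set.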

Please, do you have any further suggestions? There is something going 
terribly wrong here and I can't leave the cluster running like this much 
longer.
Many thanks,
Sabine

>> 2. Not all slots of the hosts in question were occupied by  
>> sequential jobs, i.e. there were open slots on the hosts, just not  
>> enough.
>>     
>
> Did the parallel job correctly use the assigned nodes with its qrsh  
> command?
>
> -- Reuti
>
>   
>   
>> Does anyone have an idea where and how I could start searching for  
>> the problem? I'm out of ideas here.
>>
>> Regards,
>> Sabine
>>
>> ------------------------------------------------------
>> http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=211821
>>
>> To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].
>>     
>
> ------------------------------------------------------
> http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=211902

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=211978
