[GE users] More slots scheduled than available on execution host

reuti reuti at staff.uni-marburg.de
Wed Aug 12 12:38:41 BST 2009


Hi,

On 12.08.2009, at 13:27, s_kreidl wrote:

> Hi Reuti,
>
> reuti wrote:
>>>
>>> The circumstances are similar to the last time:
>>> 1. We are in a brutal load situation (99.8% load).
>>>
>>
>> Load on the cluster, or on the machine where the qmaster runs? Was it
>> also running out of memory and getting some OOM (out-of-memory) killer
>> action from the kernel (this should appear in /var/log/messages)?
>>
>>
> No, not at all! What I meant was the load of the whole cluster, i.e.
> pretty much every core on the cluster was occupied (some of them
> inadvertently twice, due to the described scheduling error) and running
> at 100%. Neither the master nor any of the execution hosts failed.
> Everything is running smoothly right now, and nothing strange happens,
> with the exception of the sporadic over-subscription of nodes (and some
> advanced reservation problems, which are probably out of place here).
> And to answer your second question: the job was submitted with a
> standard qsub, and there was nothing strange in the "qstat -j ..." info.
>
> I guess I found a trace of the possible cause of what happened. In the
> messages files of the nodes in question I found these warnings:
>     08/10/2009 13:35:29|  main|n002|W|local configuration n002 not defined - using global configuration
>     08/10/2009 14:40:51|  main|n002|W|local configuration n002 not defined - using global configuration
> and indeed, there is no slot limitation in the "global execution host"
> settings. I also have some "comlib errors" in the messages files of the
> nodes, but judging from the times of occurrence they are not directly
> related to the above two warning lines.
>
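
As a side note: whether a node still has its own local configuration
and a per-host slot limit set can be checked directly. A quick sketch,
using n002 from your log (qconf should complain if no local
configuration is defined for the host):

$ qconf -sconf n002   # the host's local configuration
$ qconf -se n002      # the execution host entry, incl. complex_values
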
> As the cluster is homogeneous, I thought adding slots=8 to the global
> config would be a safe quick workaround, but it has a devastating
> impact on the scheduler: no jobs are scheduled anymore. An extract
> from "qhost -F slots" gives:
>     n125                    lx24-amd64      8  8.02   31.4G    5.3G    8.0G  127.1M
>         Host Resource(s):      gc:slots=-958.000000
>     n126                    lx24-amd64      8  0.00   31.4G  338.9M    8.0G     0.0
>         Host Resource(s):      gc:slots=-958.000000
> A restart of the sge_qmaster does not repair this. Only eliminating
> the global slot limit helps, which again leaves me with the danger of
> over-subscription.

this will limit the cluster to 8 slots in total, not to 8 per host: a
complex attached to the global host acts as a cluster-wide consumable,
so with apparently 966 slots occupied cluster-wide the counter reads
8 - 966 = -958, and nothing can be scheduled anymore. There is already
an RFE to get something like the cluster queue for hosts, i.e.
"cluster-hosts". For now you have to add this complex to each node,
but it can be done in a loop like:

$ for i in `seq -w 1 4`; do qconf -aattr exechost complex_values slots=8 node$i; done
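
Afterwards the limit should show up per host; for example (node01 just
follows the naming from the loop above):

$ qconf -se node01 | grep complex_values
complex_values        slots=8

and "qhost -F slots" should then report it as hc:slots (a host
consumable) instead of the gc:slots (global consumable) you saw above.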

http://gridengine.sunsource.net/issues/show_bug.cgi?id=1952
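
A note on this approach: a slots entry in a cluster queue's
configuration limits only that queue instance, so with more than one
queue per host it is the host-level consumable above that prevents
over-subscription across queues. With just a single queue per host,
setting the queue's slots attribute would do as well; a sketch,
assuming a queue named all.q:

$ qconf -mattr queue slots 8 all.q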

-- Reuti
