[GE users] More slots scheduled than available on execution host

s_kreidl sabine.kreidl at uibk.ac.at
Wed Aug 12 13:05:33 BST 2009


Thanks for the clarification of the global slot limit and for the
command line for the host configuration, but the complex slots=8 has
been part of every host configuration from the very beginning, e.g.:

# qconf -se n001
hostname              n001
load_scaling          NONE
complex_values        slots=8
load_values           arch=lx24-amd64,num_proc=8,mem_total=32186.718750M, \
                      swap_total=8191.992188M,virtual_total=40378.710938M, \
                      load_avg=8.050000,load_short=8.100000, \
                      load_medium=8.050000,load_long=8.030000, \
                      mem_free=30992.058594M,swap_free=8191.992188M, \
                      virtual_free=39184.050781M,mem_used=1194.660156M, \
                      swap_used=0.000000M,virtual_used=1194.660156M, \
                      cpu=100.000000,np_load_avg=1.006250, \
                      np_load_short=1.012500,np_load_medium=1.006250, \
                      np_load_long=1.003750
processors            8
user_lists            NONE
xuser_lists           NONE
projects              NONE
xprojects             NONE
usage_scaling         NONE
report_variables      NONE
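
For completeness, a quick (untested) sketch to confirm that every
execution host really carries the limit, assuming "qconf -sel" lists
all execution hosts:

# for h in `qconf -sel`; do echo "$h: `qconf -se $h | grep complex_values`"; done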

reuti wrote:
> Hi,
>
> On 12.08.2009, at 13:27, s_kreidl wrote:
>
>> Hi Reuti,
>>
>> reuti wrote:
>>>> The circumstances are similar to the last time:
>>>> 1. We are in a brutal load situation (99.8% load).
>>>>
>>> load on the cluster, or on the machine where the qmaster runs? Was it
>>> also running out of memory and getting oom (out-of-memory)-killer
>>> action from the kernel (which should appear in /var/log/messages)?
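>>> A quick check on a node, assuming a standard syslog setup, would be
>>> e.g.:
>>>     # grep -i oom /var/log/messages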
>>>
>> No, not at all! What I meant was the load of the cluster as a whole:
>> pretty much every core in the cluster was occupied (some of them
>> inadvertently twice, due to the described scheduling error) and running
>> at 100%. Neither the master nor any of the execution hosts failed.
>> Everything is running smoothly right now, and nothing strange happens,
>> with the exception of the sporadic over-subscription of nodes (and some
>> advance reservation problems, which are probably off-topic here).
>> And to answer your second question: the job was submitted with a
>> standard qsub, and there was nothing strange in the "qstat -j ..."
>> output.
>>
>> I think I found a trace of the possible cause of what happened. In the
>> messages files of the nodes in question I found these warnings:
>>     08/10/2009 13:35:29|  main|n002|W|local configuration n002 not
>> defined - using global configuration
>>     08/10/2009 14:40:51|  main|n002|W|local configuration n002 not
>> defined - using global configuration
>> and indeed, there is no slot limitation in the "global execution host"
>> settings. I also see some "commlib errors" in the messages files of the
>> nodes, but judging from the times of occurrence they are not directly
>> related to the two warning lines above.
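>>
>> (A sketch, untested here: which hosts have a local configuration at all
>> can be listed with
>>     # qconf -sconfl
>> and a missing one should be addable, per the qconf man page, with
>>     # qconf -aconf n002
>> which opens the new local configuration for n002 in an editor.)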
>>
>> As the cluster is homogeneous, I thought: no problem, as a quick
>> workaround let's add slots=8 to the global configuration. But this has
>> a devastating impact on the scheduler: no jobs are scheduled any more.
>> An extract from "qhost -F slots" gives:
>>     n125        lx24-amd64      8  8.02   31.4G    5.3G    8.0G  127.1M
>>         Host Resource(s):      gc:slots=-958.000000
>>     n126        lx24-amd64      8  0.00   31.4G  338.9M    8.0G     0.0
>>         Host Resource(s):      gc:slots=-958.000000
>> A restart of the sge_qmaster does not repair this. Only removing the
>> global slot limit helps, which again leaves me with the danger of
>> over-subscription.
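>>
>> (For reference, the removal can presumably also be done without an
>> editor, with something like
>>     # qconf -dattr exechost complex_values slots global
>> instead of editing the global host via "qconf -me global".)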
>
> this will limit the cluster to 8 slots in total, not per host: a global
> consumable is debited by every job running anywhere in the cluster,
> which is why qhost shows gc:slots=-958. There is already an RFE to have
> something like "cluster-queue" for hosts, i.e. "cluster-hosts". For now
> you have to add this complex to each node, but it can be done in a loop
> like:
>
> $ for i in `seq -w 1 4`; do qconf -aattr exechost complex_values slots=8 node$i; done
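>
> For your n001-style host names the same loop would be something like
> this (assuming the nodes run from n001 up to n126; adjust the upper
> bound to your node count):
>
> $ for i in `seq -w 1 126`; do qconf -aattr exechost complex_values slots=8 n$i; done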
>
> http://gridengine.sunsource.net/issues/show_bug.cgi?id=1952
>
> -- Reuti
>
