[GE users] More slots scheduled than available on execution host

s_kreidl sabine.kreidl at uibk.ac.at
Wed Aug 5 11:52:17 BST 2009


Hi Reuti,

"qconf -mc " delivers

    slots               s          INT         <=    YES        
    YES        1        1000

which should be the default, right?
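
For reference, lining that entry up under the usual "qconf -mc" column
headers (as I read them from the complex(5) man page) gives:

    #name               shortcut   type        relop requestable consumable default  urgency
    slots               s          INT         <=    YES         YES        1        1000

so slots is a requestable, consumable INT with a default consumption of
1 and an urgency of 1000, which as far as I know is the shipped default.
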
The qmaster is still running the same session as when the problem 
occurred, and apart from this one node and that one parallel job, 
nothing like this has ever happened before (during the problematic 
time frame the cluster was 96% loaded, with 3 large parallel jobs in 
the waiting list).

Maybe I should just forget about the whole thing; it just leaves an 
uneasy feeling.

Best,
Sabine

reuti wrote:
> Hi,
>
> On 05.08.2009 at 10:34, s_kreidl wrote:
>
>   
>> Kasper, thanks for the quick reply.
>>
>> Could anyone familiar with the internals of SGE with respect to this
>> issue confirm that setting a complex limit at the execution host level
>> does indeed not limit the overall consumption of this complex on the
>> host, but limits it queue-wise instead?
>>     
>
> no, it's the total per node across all queues. The queue-wise setting
> lives only in the queue configuration itself.
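>
> To see both levels at a glance you could compare, e.g. (using the
> queue names from your scheduler output):
>
>     qconf -se n032 | grep complex_values   # host-level consumable: slots=8 in total
>     qconf -sq all.q | grep slots           # slots as set in the cluster queue itself
>
> The host-level consumable caps the sum across all queue instances on
> that node.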
>
>
>   
>> This would actually contradict what's written about complex_values in
>> the host_conf manpage:
>>
>>     The quotas are related to the resource consumption of all jobs on a
>>     host in the case of consumable resources
>>
>>     [...] an available resource amount is determined by subtracting the
>>     current resource consumption of all running jobs on the host from
>>     the quota in the complex_values list. [ I indeed got a
>>     "hc:slots=-6.000000" from "qhost -F slots" for the host in question. ]
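>>
>> That arithmetic matches the numbers here: a quota of 8 minus the
>> 6 + 8 = 14 slots actually booked on the host gives 8 - 14 = -6, which
>> is exactly the "hc:slots=-6.000000" above.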
>>     
>
> You can stop/start the qmaster now that it has run wild and see whether
> it normalizes for further job submissions. Is the definition of the
> complex also untouched (qconf -mc)?
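>
> A rough sequence for that (the sgemaster path depends on where your
> cell lives, so adjust as needed):
>
>     qconf -km                                    # shut the qmaster down cleanly
>     $SGE_ROOT/$SGE_CELL/common/sgemaster start   # start it again
>     qconf -sc | grep slots                       # complex definition still the default?
>     qhost -F slots -h n032                       # host consumable back to a sane value?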
>
> -- Reuti
>
>   
>>     Jobs can only be dispatched to a host if no resource requests exceed
>>     any corresponding resource availability obtained by this scheme.
>>
>> And it would also contradict years of experience with SGE at our site.
>> But please let me know if we are going wrong here.
>>
>> Thanks again,
>> Sabine
>>
>>
>> kasper_fischer wrote:
>>     
>>> Hi Sabine,
>>>
>>> I think the problem is that the value slots=8 in your execution host
>>> configuration applies to each queue on the host. Therefore you can use
>>> 8 slots in the parallel queue and 8 in the sequential queue, using a
>>> maximum of 16 slots. If you want to limit the slots to a total of 8
>>> for all queues, you should define a Resource Quota Set with
>>> qconf -arqs or something similar (see the man pages).
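>>>
>>> A minimal sketch of such a quota set (the rule name here is only an
>>> example):
>>>
>>>     {
>>>        name         slots_per_host
>>>        description  "cap the total slots used on any single host"
>>>        enabled      TRUE
>>>        limit        hosts {*} to slots=8
>>>     }
>>>
>>> added with "qconf -arqs slots_per_host" and adjustable later with
>>> "qconf -mrqs".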
>>>
>>> I hope this helps.
>>>
>>> Best regards,
>>>
>>> Kasper
>>>
>>> s_kreidl wrote:
>>>
>>>       
>>>> Dear users list,
>>>>
>>>> recently one of our execution hosts was oversubscribed by SGE itself.
>>>> More specifically, 7 slave tasks and the master task (of a 42-slot
>>>> job, $fill_up PE) were scheduled on a node that was already loaded
>>>> with 6 sequential jobs.
>>>>
>>>> We are using SGE 6.2u2_1 on CentOS 5.
>>>>
>>>> The execution host in question, n032, is limited to 8 slots:
>>>>
>>>> # qconf -se n032
>>>> hostname              n032
>>>> load_scaling          NONE
>>>> complex_values        slots=8
>>>> ...
>>>>
>>>> There are two queues configured on that host, one for sequential and
>>>> one for parallel jobs, with no subordination and no extra slot
>>>> limitations, as I assumed the slot limit at the execution host level
>>>> would be enough (right?).
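>>>>
>>>> For the per-queue picture on that node I would normally look at
>>>> something like:
>>>>
>>>>     qhost -q -h n032        # queues on n032 with their used/total slots
>>>>     qstat -f | grep @n032   # queue instance lines for that host only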
>>>>
>>>> Unfortunately the parallel job isn't running anymore, so the only  
>>>> proof for my observation comes from the monitoring output of the  
>>>> scheduler (just a small excerpt of one scheduler run):
>>>> ::::::::
>>>> 88898:1:RUNNING:1249054905:864060:H:n032.:slots:1.000000
>>>> 88898:1:RUNNING:1249054905:864060:Q:all.q@n032.:slots:1.000000
>>>> 88899:1:RUNNING:1249054905:864060:H:n032.:slots:1.000000
>>>> 88899:1:RUNNING:1249054905:864060:Q:all.q@n032.:slots:1.000000
>>>> 88900:1:RUNNING:1249054905:864060:H:n032.:slots:1.000000
>>>> 88900:1:RUNNING:1249054905:864060:Q:all.q@n032.:slots:1.000000
>>>> 88901:1:RUNNING:1249054905:864060:H:n032.:slots:1.000000
>>>> 88901:1:RUNNING:1249054905:864060:Q:all.q@n032.:slots:1.000000
>>>> 88902:1:RUNNING:1249054905:864060:H:n032.:slots:1.000000
>>>> 88902:1:RUNNING:1249054905:864060:Q:all.q@n032.:slots:1.000000
>>>> 88903:1:RUNNING:1249054905:864060:H:n032.:slots:1.000000
>>>> 88903:1:RUNNING:1249054905:864060:Q:all.q@n032.:slots:1.000000
>>>> 93515:1:RUNNING:1249308495:864060:H:n032.:slots:8.000000
>>>> 93515:1:RUNNING:1249308495:864060:Q:par.q@n032.:slots:8.000000
>>>> ::::::::
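>>>>
>>>> Reading the last line field by field (this is just my understanding
>>>> of the scheduler's "schedule" monitoring format):
>>>>
>>>>     # job:task:state:start_time:duration:level(H=host, Q=queue instance):object:resource:amount
>>>>     93515:1:RUNNING:1249308495:864060:H:n032.:slots:8.000000
>>>>
>>>> i.e. job 93515 alone was booked with 8 slots on n032, on top of the
>>>> six sequential jobs listed before it.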
>>>>
>>>> My colleagues assured me that no one made any configuration
>>>> changes in the relevant time frame.
>>>>
>>>> This has never happened before.
>>>>
>>>> I'd be really grateful for any hint on where I might be going wrong
>>>> in the configuration, or where I should start digging for the
>>>> problem.
>>>>
>>>> Best regards,
>>>> Sabine
>>>>

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=211019

To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].


