[GE users] More slots scheduled than available on execution host

reuti reuti at staff.uni-marburg.de
Wed Aug 5 10:28:53 BST 2009


Hi,

On 05.08.2009 at 10:34, s_kreidl wrote:

> Kasper, thanks for the quick reply.
>
> Could anyone familiar with the internals of SGE with respect to this
> issue confirm that setting a complex limit at the execution host level
> does indeed not limit the overall consumption of this complex on the
> host, but applies queue-wise instead?

No, it is the total per node across all queues. The queue-wise setting
is done separately in the queue configuration itself.
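
For illustration, the two levels look roughly like this (queue name
"all.q" used only as an example). The host-level total is what you
already have:

# qconf -se n032
...
complex_values        slots=8

A queue-wise limit would instead sit in the queue configuration:

# qconf -sq all.q
...
slots                 8

The first is counted against all queue instances on n032 together, the
second only against all.q@n032.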


> This would actually contradict what's written about complex_values in
> the host_conf manpage:
>
>     The quotas are related to the resource consumption of all jobs on
>     a host in the case of consumable resources
>
>     [...] an available resource amount is determined by subtracting
>     the current resource consumption of all running jobs on the host
>     from the quota in the complex_values list. [I indeed got a
>     "hc:slots=-6.000000" from "qhost -F slots" for the host in
>     question.]
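
That would also fit the numbers you observed: 6 sequential jobs plus
the 8-slot parallel job consume 14 slots against a quota of 8, hence
hc:slots = 8 - 14 = -6.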

You can stop/start the qmaster once it has run wild like this and see
whether scheduling normalizes for further job submissions. Is the
definition of the slots complex also untouched (qconf -mc)?
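
For reference, in a default installation the slots entry shown by
qconf -sc looks roughly like this (worth comparing against yours):

#name               shortcut   type        relop requestable consumable default  urgency
slots               s          INT         <=    YES         YES        1        1000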

-- Reuti

>
>     Jobs can only be dispatched to a host if no resource requests
>     exceed any corresponding resource availability obtained by this
>     scheme.
>
> And it would also contradict years of experience with SGE at our site.
> But please let me know if we are going wrong here.
>
> Thanks again,
> Sabine
>
>
> kasper_fischer wrote:
>> Hi Sabine,
>>
>> I think the problem is that the value slots=8 in your execution host
>> configuration applies to each queue on the host. Therefore you can
>> use 8 slots in the parallel queue and 8 in the sequential queue,
>> giving a maximum of 16 slots. If you want to limit the slots to a
>> total of 8 for all queues, you should define a Resource Quota Set
>> with qconf -arqs or something similar (see the man pages).
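>>
>> As a rough sketch (untested here; the rule set name is made up), such
>> a quota set could look like this:
>>
>> # qconf -arqs slots_per_host
>> {
>>    name         slots_per_host
>>    description  "limit the total slots per execution host"
>>    enabled      TRUE
>>    limit        hosts {*} to slots=8
>> }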
>>
>> I hope this helps.
>>
>> Best regards,
>>
>> Kasper
>>
>> s_kreidl wrote:
>>
>>> Dear users list,
>>>
>>> recently one of our execution hosts was deliberately oversubscribed
>>> by SGE. More specifically, 7 slave tasks and the master task (of a
>>> 42-slot job, $fill_up PE) were scheduled on a node that was already
>>> loaded with 6 sequential jobs.
>>>
>>> We are using SGE 6.2u2_1 on CentOS 5.
>>>
>>> The execution host in question, n032, is limited to 8 slots:
>>>
>>> # qconf -se n032
>>> hostname              n032
>>> load_scaling          NONE
>>> complex_values        slots=8
>>> ...
>>>
>>> There are two queues configured on that host, one for sequential and
>>> one for parallel jobs, with no subordination and no extra slot
>>> limitations, as I assumed the slot limit at the execution host level
>>> would be enough (right?).
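>>>
>>> For reference, the per-queue settings can be double-checked with
>>> something like the following (queue names as in the scheduler output
>>> below):
>>>
>>> # qconf -sq all.q | egrep "^(qname|slots|subordinate_list)"
>>> # qconf -sq par.q | egrep "^(qname|slots|subordinate_list)"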
>>>
>>> Unfortunately the parallel job isn't running anymore, so the only  
>>> proof for my observation comes from the monitoring output of the  
>>> scheduler (just a small excerpt of one scheduler run):
>>> ::::::::
>>> 88898:1:RUNNING:1249054905:864060:H:n032.:slots:1.000000
>>> 88898:1:RUNNING:1249054905:864060:Q:all.q@n032.:slots:1.000000
>>> 88899:1:RUNNING:1249054905:864060:H:n032.:slots:1.000000
>>> 88899:1:RUNNING:1249054905:864060:Q:all.q@n032.:slots:1.000000
>>> 88900:1:RUNNING:1249054905:864060:H:n032.:slots:1.000000
>>> 88900:1:RUNNING:1249054905:864060:Q:all.q@n032.:slots:1.000000
>>> 88901:1:RUNNING:1249054905:864060:H:n032.:slots:1.000000
>>> 88901:1:RUNNING:1249054905:864060:Q:all.q@n032.:slots:1.000000
>>> 88902:1:RUNNING:1249054905:864060:H:n032.:slots:1.000000
>>> 88902:1:RUNNING:1249054905:864060:Q:all.q@n032.:slots:1.000000
>>> 88903:1:RUNNING:1249054905:864060:H:n032.:slots:1.000000
>>> 88903:1:RUNNING:1249054905:864060:Q:all.q@n032.:slots:1.000000
>>> 93515:1:RUNNING:1249308495:864060:H:n032.:slots:8.000000
>>> 93515:1:RUNNING:1249308495:864060:Q:par.q@n032.:slots:8.000000
>>> ::::::::
>>>
>>> My colleagues assured me that no one made any configuration changes
>>> in the relevant time frame.
>>>
>>> This has never happened before.
>>>
>>> I'd be really grateful for any hint on where I might be going wrong
>>> in the configuration, or where I should start digging for the
>>> problem.
>>>
>>> Best regards,
>>> Sabine
>>>