[GE users] More slots scheduled than available on execution host

reuti reuti at staff.uni-marburg.de
Wed Aug 5 12:39:11 BST 2009


On 05.08.2009 at 12:52, s_kreidl wrote:

> Hi Reuti,
>
> "qconf -mc " delivers
>
>     slots               s          INT         <=    YES
>     YES        1        1000
>
> which should be the default, right?

Yes.

> The master is still running the same session as when the problem
> occurred, and with the exception of this one node and that one
> parallel job, nothing like that ever happened before (the cluster was
> loaded to 96%, with 3 large parallel jobs in the waiting list, during
> the problematic time frame).

As said: you could stop/start the qmaster to reset the internal counters.
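
For reference, a restart could look roughly like this (just a sketch: the
cell name "default" and the script path assume a standard $SGE_ROOT
installation):

    # shut down the qmaster (the scheduler runs as a qmaster thread in 6.2)
    qconf -km

    # start it again via the cell's startup script
    $SGE_ROOT/default/common/sgemaster start

Running jobs are not affected by this; the qmaster rebuilds its internal
bookkeeping from the spooled state on startup.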

-- Reuti

> Maybe I should just forget about the whole thing; it just leaves an
> uneasy feeling.
>
> Best,
> Sabine
>
> reuti wrote:
>> Hi,
>>
>> On 05.08.2009 at 10:34, s_kreidl wrote:
>>
>>
>>> Kasper, thanks for the quick reply.
>>>
>>> Could anyone familiar with the internals of SGE with respect to this
>>> issue confirm that setting a complex limit at the execution host
>>> level does indeed not limit the overall consumption of this complex
>>> on the host, but limits it queue-wise instead?
>>>
>>
>> No, it's the total per node across all queues. The queue-wise setting
>> exists only in the queue configuration itself.
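>>
>> To illustrate the difference (only a sketch; qconf -mattr is one way to
>> set these values):
>>
>>     # host-level cap: counted against all queues on n032 together
>>     qconf -mattr exechost complex_values slots=8 n032
>>
>>     # queue-level limit: counted only within this particular queue
>>     qconf -mattr queue slots 8 all.q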
>>
>>
>>
>>> This would actually contradict what's written about complex_values
>>> in the host_conf manpage:
>>>
>>>     The quotas are related to the resource consumption of all jobs
>>>     on a host in the case of consumable resources
>>>
>>>     [...] an available resource amount is determined by subtracting
>>>     the current resource consumption of all running jobs on the host
>>>     from the quota in the complex_values list. [ I indeed got a
>>>     "hc:slots=-6.000000" from "qhost -F slots" for the host in
>>>     question. ]
>>>
>>
>> You can stop/start the qmaster when it has run wild and see whether it
>> normalizes for further job submissions. Is the definition of the
>> complex also untouched (qconf -mc)?
>>
>> -- Reuti
>>
>>
>>>     Jobs can only be dispatched to a host if no resource requests
>>>     exceed any corresponding resource availability obtained by this
>>>     scheme.
>>>
>>> And it would also contradict years of experience with SGE at our
>>> site. But please let me know if we are going wrong here.
>>>
>>> Thanks again,
>>> Sabine
>>>
>>>
>>> kasper_fischer wrote:
>>>
>>>> Hi Sabine,
>>>>
>>>> I think the problem is that the value slots=8 in your execution
>>>> host configuration applies to each queue on the host. Therefore you
>>>> can use 8 slots in the parallel queue and 8 in the sequential
>>>> queue, for a maximum of 16 slots. If you want to limit the slots to
>>>> a total of 8 for all queues, you should define a Resource Quota Set
>>>> with qconf -arqs or something similar (see the man pages).
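>>>>
>>>> As a rough sketch (the rule name is made up; see the
>>>> sge_resource_quota(5) man page for the exact syntax), such a quota
>>>> set could look like:
>>>>
>>>>     {
>>>>        name         max_8_slots_per_host
>>>>        description  "limit the total number of slots per host to 8"
>>>>        enabled      TRUE
>>>>        limit        hosts {*} to slots=8
>>>>     }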
>>>>
>>>> I hope this helps.
>>>>
>>>> Best regards,
>>>>
>>>> Kasper
>>>>
>>>> s_kreidl wrote:
>>>>
>>>>
>>>>> Dear users list,
>>>>>
>>>>> recently one of our execution hosts was deliberately oversubscribed
>>>>> by SGE. More specifically, 7 slave tasks and the master (of a
>>>>> 42-slot job, $fill_up PE) were scheduled on a node that was already
>>>>> loaded with 6 sequential jobs.
>>>>>
>>>>> We are using SGE 6.2u2_1 on CentOS 5.
>>>>>
>>>>> The execution host in question, n032, is limited to 8 slots:
>>>>>
>>>>> # qconf -se n032
>>>>> hostname              n032
>>>>> load_scaling          NONE
>>>>> complex_values        slots=8
>>>>> ...
>>>>>
>>>>> There are two queues configured on that host, one for sequential
>>>>> and one for parallel jobs, with no subordination and no extra slot
>>>>> limitations, as I assumed the slot limit at the execution host
>>>>> level would be enough (right?).
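>>>>>
>>>>> (A sketch of how the per-queue slot settings and the host's current
>>>>> slot accounting can be checked, in case it helps:)
>>>>>
>>>>>     qconf -sq all.q,par.q | grep slots
>>>>>     qhost -F slots -h n032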
>>>>>
>>>>> Unfortunately the parallel job isn't running anymore, so the only
>>>>> proof of my observation comes from the monitoring output of the
>>>>> scheduler (just a small excerpt of one scheduler run):
>>>>> ::::::::
>>>>> 88898:1:RUNNING:1249054905:864060:H:n032.:slots:1.000000
>>>>> 88898:1:RUNNING:1249054905:864060:Q:all.q@n032.:slots:1.000000
>>>>> 88899:1:RUNNING:1249054905:864060:H:n032.:slots:1.000000
>>>>> 88899:1:RUNNING:1249054905:864060:Q:all.q@n032.:slots:1.000000
>>>>> 88900:1:RUNNING:1249054905:864060:H:n032.:slots:1.000000
>>>>> 88900:1:RUNNING:1249054905:864060:Q:all.q@n032.:slots:1.000000
>>>>> 88901:1:RUNNING:1249054905:864060:H:n032.:slots:1.000000
>>>>> 88901:1:RUNNING:1249054905:864060:Q:all.q@n032.:slots:1.000000
>>>>> 88902:1:RUNNING:1249054905:864060:H:n032.:slots:1.000000
>>>>> 88902:1:RUNNING:1249054905:864060:Q:all.q@n032.:slots:1.000000
>>>>> 88903:1:RUNNING:1249054905:864060:H:n032.:slots:1.000000
>>>>> 88903:1:RUNNING:1249054905:864060:Q:all.q@n032.:slots:1.000000
>>>>> 93515:1:RUNNING:1249308495:864060:H:n032.:slots:8.000000
>>>>> 93515:1:RUNNING:1249308495:864060:Q:par.q@n032.:slots:8.000000
>>>>> ::::::::
>>>>>
>>>>> My colleagues assured me that no one made any configuration
>>>>> changes in the relevant time frame.
>>>>>
>>>>> This has never happened before.
>>>>>
>>>>> I'd be really grateful for any hint on where I might be going
>>>>> wrong in the configuration, or where I should start digging for
>>>>> the problem.
>>>>>
>>>>> Best regards,
>>>>> Sabine
>>>>>