[GE users] More slots scheduled than available on execution host

s_kreidl sabine.kreidl at uibk.ac.at
Wed Aug 5 09:34:29 BST 2009


Kasper, thanks for the quick reply.

Could anyone familiar with the internals of SGE confirm that setting a
complex limit at the execution host level does indeed not limit the
overall consumption of this complex on the host, but only the
consumption per queue?
This would actually contradict what's written about complex_values in 
the host_conf manpage:

    The quotas are related to the resource consumption of all jobs on a
    host in the case of consumable resources

    [...] an available resource amount is determined by subtracting the
    current resource consumption of all running jobs on the host from
    the quota in the complex_values list. [ I indeed got a
    "hc:slots=-6.000000" from "qhost -F slots" for the host in question. ]

    Jobs can only be dispatched to a host if no resource requests exceed
    any corresponding resource availability obtained by this scheme.

And it would also contradict years of experience with SGE at our site.
But please let me know if we are going wrong here.
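
For reference, the arithmetic described in the manpage can be checked against the host-level records of the scheduler monitoring excerpt quoted further down. This is only a minimal Python sketch, not SGE code. It assumes that the colon-separated fields of each record are job ID, task ID, state, start time, duration, level (H = host, Q = queue), object name, resource name and amount. The names MONITOR_LINES, HOST_QUOTA, host_consumption and availability are made up for illustration.

```python
# Host-level ("H") records copied from the scheduler monitoring excerpt.
MONITOR_LINES = """\
88898:1:RUNNING:1249054905:864060:H:n032.:slots:1.000000
88899:1:RUNNING:1249054905:864060:H:n032.:slots:1.000000
88900:1:RUNNING:1249054905:864060:H:n032.:slots:1.000000
88901:1:RUNNING:1249054905:864060:H:n032.:slots:1.000000
88902:1:RUNNING:1249054905:864060:H:n032.:slots:1.000000
88903:1:RUNNING:1249054905:864060:H:n032.:slots:1.000000
93515:1:RUNNING:1249308495:864060:H:n032.:slots:8.000000
""".splitlines()

# Quota taken from "complex_values slots=8" in the output of "qconf -se n032".
HOST_QUOTA = {"n032.": 8.0}


def host_consumption(lines, resource="slots"):
    """Sum the booked amount of `resource` per host over all H records.

    Assumed field layout (an assumption, see the text above):
    job:task:state:start:duration:level:object:resource:amount
    """
    used = {}
    for line in lines:
        fields = line.strip().split(":")
        if len(fields) != 9:
            continue  # skip separators and anything that is not a record
        level, obj, res, amount = fields[5], fields[6], fields[7], fields[8]
        if level == "H" and res == resource:
            used[obj] = used.get(obj, 0.0) + float(amount)
    return used


def availability(quota, used):
    """Available amount = host quota minus consumption of all running jobs."""
    return {host: quota[host] - used.get(host, 0.0) for host in quota}


used = host_consumption(MONITOR_LINES)
avail = availability(HOST_QUOTA, used)
print(f"hc:slots={avail['n032.']:.6f}")
```

Run as is, this sums 6 x 1 + 8 = 14 slots booked at host level against the quota of 8 and prints hc:slots=-6.000000, the same value that "qhost -F slots" reported for the host.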

Thanks again,

kasper_fischer wrote:
> Hi Sabine,
> I think the problem is that the value slots=8 in your execution host
> configuration applies to each queue on the host. Therefore you can use 8
> slots in the parallel queue and 8 in the sequential queue, for a
> maximum of 16 slots. If you want to limit the slots to a total of 8 across
> all queues, you should define a Resource Quota Set with qconf -arqs or
> something similar (see the man pages).
> I hope this helps.
> Best regards,
> Kasper
> s_kreidl wrote:
>> Dear users list,
>> recently one of our execution hosts was deliberately oversubscribed by SGE. More specifically, 7 slave tasks and the master (of a 42 slot job, $fillup pe) were scheduled on a node that was already loaded with 6 sequential jobs.
>> We are using SGE 6.2u2_1 on CentOS 5.
>> The execution host in question, n032, is limited to 8 slots:
>> # qconf -se n032
>> hostname              n032
>> load_scaling          NONE
>> complex_values        slots=8
>> ...
>> There are two queues configured on that host, one for sequential and one for parallel jobs, with no subordination and no extra slot limitations, as I assumed the slot limit at the execution host level would be enough (right?).
>> Unfortunately the parallel job isn't running anymore, so the only evidence for my observation comes from the monitoring output of the scheduler (just a small excerpt of one scheduler run):
>> ::::::::
>> 88898:1:RUNNING:1249054905:864060:H:n032.:slots:1.000000
>> 88898:1:RUNNING:1249054905:864060:Q:all.q@n032.:slots:1.000000
>> 88899:1:RUNNING:1249054905:864060:H:n032.:slots:1.000000
>> 88899:1:RUNNING:1249054905:864060:Q:all.q@n032.:slots:1.000000
>> 88900:1:RUNNING:1249054905:864060:H:n032.:slots:1.000000
>> 88900:1:RUNNING:1249054905:864060:Q:all.q@n032.:slots:1.000000
>> 88901:1:RUNNING:1249054905:864060:H:n032.:slots:1.000000
>> 88901:1:RUNNING:1249054905:864060:Q:all.q@n032.:slots:1.000000
>> 88902:1:RUNNING:1249054905:864060:H:n032.:slots:1.000000
>> 88902:1:RUNNING:1249054905:864060:Q:all.q@n032.:slots:1.000000
>> 88903:1:RUNNING:1249054905:864060:H:n032.:slots:1.000000
>> 88903:1:RUNNING:1249054905:864060:Q:all.q@n032.:slots:1.000000
>> 93515:1:RUNNING:1249308495:864060:H:n032.:slots:8.000000
>> 93515:1:RUNNING:1249308495:864060:Q:par.q@n032.:slots:8.000000
>> ::::::::
>> My colleagues assured me that no one made any configuration changes in the relevant time frame.
>> This has never happened before.
>> I'd be really grateful for any hint on where I might be going wrong in the configuration, or where I should start digging for the problem.
>> Best regards,
>> Sabine
>> ------------------------------------------------------
>> http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=210907
>> To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].
> ------------------------------------------------------
> http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=210914

