[GE users] More slots scheduled than available on execution host

kasper_fischer kasper.fischer at ruhr-uni-bochum.de
Tue Aug 4 16:59:59 BST 2009


Hi Sabine,

I think the problem is that the value slots=8 in your execution host
configuration applies to each queue on the host. You can therefore use 8
slots in the parallel queue and 8 in the sequential queue, for a
maximum of 16 slots. If you want to limit the slots to a total of 8 across
all queues, you should define a Resource Quota Set with qconf -arqs or
something similar (see the man pages).
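
As a rough, untested sketch, a rule set along the following lines could cap
n032 at 8 slots in total, regardless of queue. The rule name and description
are made up for illustration, and the syntax should be checked against the
sge_resource_quota man page for your version:

```
{
   name         max_slots_n032
   description  "Hypothetical example: at most 8 slots in total on n032"
   enabled      TRUE
   limit        hosts n032 to slots=8
}
```

If all hosts should get the same per-host limit, "hosts {*}" in place of
"hosts n032" should apply the limit to each host individually. qconf -srqs
shows the configured rule sets and qquota the current usage against them.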

I hope this helps.

Best regards,

Kasper

s_kreidl wrote:
> Dear users list,
>
> Recently one of our execution hosts was deliberately oversubscribed by SGE. More specifically, 7 slave tasks and the master task (of a 42-slot job, $fillup PE) were scheduled on a node that was already loaded with 6 sequential jobs.
>
> We are using SGE 6.2u2_1 on CentOS 5.
>
> The execution host in question n032 is limited to 8 slots:
>
> # qconf -se n032
> hostname              n032
> load_scaling          NONE
> complex_values        slots=8
> ...
>
> There are two queues configured on that host, one for sequential and one for parallel jobs, with no subordination and no extra slot limitations, as I assumed the slot limit at the execution host level would be enough (right?).
>
> Unfortunately the parallel job isn't running anymore, so the only proof for my observation comes from the monitoring output of the scheduler (just a small excerpt of one scheduler run):
> ::::::::
> 88898:1:RUNNING:1249054905:864060:H:n032.:slots:1.000000
> 88898:1:RUNNING:1249054905:864060:Q:all.q@n032.:slots:1.000000
> 88899:1:RUNNING:1249054905:864060:H:n032.:slots:1.000000
> 88899:1:RUNNING:1249054905:864060:Q:all.q@n032.:slots:1.000000
> 88900:1:RUNNING:1249054905:864060:H:n032.:slots:1.000000
> 88900:1:RUNNING:1249054905:864060:Q:all.q@n032.:slots:1.000000
> 88901:1:RUNNING:1249054905:864060:H:n032.:slots:1.000000
> 88901:1:RUNNING:1249054905:864060:Q:all.q@n032.:slots:1.000000
> 88902:1:RUNNING:1249054905:864060:H:n032.:slots:1.000000
> 88902:1:RUNNING:1249054905:864060:Q:all.q@n032.:slots:1.000000
> 88903:1:RUNNING:1249054905:864060:H:n032.:slots:1.000000
> 88903:1:RUNNING:1249054905:864060:Q:all.q@n032.:slots:1.000000
> 93515:1:RUNNING:1249308495:864060:H:n032.:slots:8.000000
> 93515:1:RUNNING:1249308495:864060:Q:par.q@n032.:slots:8.000000
> ::::::::
>
> My colleagues assured me that no one made any configuration changes in the relevant time frame.
>
> This has never happened before.
>
> I'd be really grateful for any hint on where I might be going wrong in the configuration, or where I should start digging for the problem.
>
> Best regards,
> Sabine
>
> ------------------------------------------------------
> http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=210907
>
> To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].
>
>

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=210914
