[GE users] More slots scheduled than available on execution host

s_kreidl sabine.kreidl at uibk.ac.at
Tue Aug 4 16:10:31 BST 2009


Dear users list,

Recently one of our execution hosts was oversubscribed by the SGE scheduler itself. More specifically, 7 slave tasks and the master task of a 42-slot job ($fill_up PE) were scheduled on a node that was already loaded with 6 sequential jobs.

We are using SGE 6.2u2_1 on CentOS 5.

The execution host in question, n032, is limited to 8 slots:

# qconf -se n032
hostname              n032
load_scaling          NONE
complex_values        slots=8
...

There are two queues configured on that host, one for sequential and one for parallel jobs, with no subordination and no extra slot limits, since I assumed that the slot limit at the execution host level would be enough (right?).

Unfortunately the parallel job isn't running anymore, so the only evidence for my observation comes from the scheduler's monitoring output (just a small excerpt of one scheduler run):
::::::::
88898:1:RUNNING:1249054905:864060:H:n032.:slots:1.000000
88898:1:RUNNING:1249054905:864060:Q:all.q@n032.:slots:1.000000
88899:1:RUNNING:1249054905:864060:H:n032.:slots:1.000000
88899:1:RUNNING:1249054905:864060:Q:all.q@n032.:slots:1.000000
88900:1:RUNNING:1249054905:864060:H:n032.:slots:1.000000
88900:1:RUNNING:1249054905:864060:Q:all.q@n032.:slots:1.000000
88901:1:RUNNING:1249054905:864060:H:n032.:slots:1.000000
88901:1:RUNNING:1249054905:864060:Q:all.q@n032.:slots:1.000000
88902:1:RUNNING:1249054905:864060:H:n032.:slots:1.000000
88902:1:RUNNING:1249054905:864060:Q:all.q@n032.:slots:1.000000
88903:1:RUNNING:1249054905:864060:H:n032.:slots:1.000000
88903:1:RUNNING:1249054905:864060:Q:all.q@n032.:slots:1.000000
93515:1:RUNNING:1249308495:864060:H:n032.:slots:8.000000
93515:1:RUNNING:1249308495:864060:Q:par.q@n032.:slots:8.000000
::::::::
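
To make the arithmetic explicit, the host-level bookings in an excerpt like this can be added up with a short script. This is only a sketch: it assumes the colon-separated field order documented for the schedule file in sge_sched_conf(5) (job, task, state, start time, duration, level, object, resource, utilization). The function names and the hand-entered limit table are made up for illustration and are not part of SGE.

```python
from collections import defaultdict

# One scheduler run, copied from the monitoring output in this post.
EXCERPT = """\
::::::::
88898:1:RUNNING:1249054905:864060:H:n032.:slots:1.000000
88898:1:RUNNING:1249054905:864060:Q:all.q@n032.:slots:1.000000
88899:1:RUNNING:1249054905:864060:H:n032.:slots:1.000000
88899:1:RUNNING:1249054905:864060:Q:all.q@n032.:slots:1.000000
88900:1:RUNNING:1249054905:864060:H:n032.:slots:1.000000
88900:1:RUNNING:1249054905:864060:Q:all.q@n032.:slots:1.000000
88901:1:RUNNING:1249054905:864060:H:n032.:slots:1.000000
88901:1:RUNNING:1249054905:864060:Q:all.q@n032.:slots:1.000000
88902:1:RUNNING:1249054905:864060:H:n032.:slots:1.000000
88902:1:RUNNING:1249054905:864060:Q:all.q@n032.:slots:1.000000
88903:1:RUNNING:1249054905:864060:H:n032.:slots:1.000000
88903:1:RUNNING:1249054905:864060:Q:all.q@n032.:slots:1.000000
93515:1:RUNNING:1249308495:864060:H:n032.:slots:8.000000
93515:1:RUNNING:1249308495:864060:Q:par.q@n032.:slots:8.000000
::::::::
"""

# Assumption: limits are typed in by hand from `qconf -se <host>` (slots=8 for n032).
HOST_LIMITS = {"n032.": 8.0}


def host_slot_usage(lines):
    """Sum the 'slots' utilization of RUNNING jobs booked at host level ('H')."""
    usage = defaultdict(float)
    for line in lines:
        fields = line.strip().split(":")
        if len(fields) != 9:
            continue  # blank or malformed line
        state, level, name, resource = fields[2], fields[5], fields[6], fields[7]
        # The '::::::::' separator also splits into 9 (empty) fields; it is
        # rejected here because its level field is empty, not 'H'.
        if state == "RUNNING" and level == "H" and resource == "slots":
            usage[name] += float(fields[8])
    return dict(usage)


def oversubscribed(usage, limits):
    """Return {host: slots booked beyond the limit} for hosts over their limit."""
    return {
        host: used - limits[host]
        for host, used in usage.items()
        if host in limits and used > limits[host]
    }


if __name__ == "__main__":
    usage = host_slot_usage(EXCERPT.splitlines())
    for host, excess in oversubscribed(usage, HOST_LIMITS).items():
        print(f"{host} has {usage[host]:g} slots booked, {excess:g} over its limit")
```

For the excerpt above this adds up to 14 slots booked on n032 (6 sequential jobs plus 8 for the parallel job) against the configured limit of 8.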

My colleagues assured me that no one made any configuration changes in the relevant time frame.

This has never happened before.

I'd be really grateful for any hint on where I might be going wrong in the configuration, or where I should start digging for the problem.

Best regards,
Sabine

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=210907
