[GE users] problems with queueing and scheduling after upgrading to 6.2u5
serge.nosov2 at gmail.com
Tue Mar 16 01:46:27 GMT 2010
After upgrading from 6.1u5 to 6.2u5 we are experiencing a whole slew of problems with SGE. The most notable and annoying is the segfaulting of sge_shepherd; I wrote about it in a parallel thread and still need to post some traces there.
This time, however, I would like to discuss the queueing and scheduling problems.
To overcome the lack of per-slot preemption in 6.1u5, we configured the following queue chains, each using 4 slots per node:
hight_1.q -> medium_1.q -> low_1.q
hight_2.q -> medium_2.q -> low_2.q
hight_3.q -> medium_3.q -> low_3.q
hight_4.q -> medium_4.q -> low_4.q
So, for example, medium_1.q would preempt low_1.q, and hight_1.q would preempt both medium_1.q and low_1.q.
High queues had a hard wall-clock limit of 1 hour, medium queues had 3 hours, and low queues were unlimited.
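For context, a rough sketch of how one such chain is set up (queue names and slot counts follow our scheme above; this uses SGE's standard subordinate-queue suspension, shown as excerpts of `qconf -sq` output):

```
# Excerpt of the medium queue in chain 1 (hypothetical values):
qname            medium_1.q
slots            4
subordinate_list low_1.q=1       # suspend low_1.q once a slot here is used
h_rt             3:00:00         # 3-hour hard wall-clock limit

# Excerpt of the corresponding high queue:
qname            hight_1.q
slots            4
subordinate_list medium_1.q=1 low_1.q=1
h_rt             1:00:00         # 1-hour hard wall-clock limit
```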
To specify the type of queue to use, a user had to request one of the complexes "low", "medium", or "high", which could be satisfied only by the corresponding queues.
Also, these complexes carried 1000, 2000, and 3000 urgency tickets respectively, to push higher-priority jobs up in the scheduler.
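For reference, complexes like these would be defined via `qconf -sc`; a sketch of the relevant lines (values are illustrative, with FORCED requestability so a request can only be satisfied by queues that set the complex to TRUE in their complex_values):

```
#name    shortcut  type  relop  requestable  consumable  default  urgency
low      low       BOOL  ==     FORCED       NO          0        1000
medium   med       BOOL  ==     FORCED       NO          0        2000
high     high      BOOL  ==     FORCED       NO          0        3000
```

Each queue then advertises its own complex, e.g. `complex_values medium=TRUE` in the medium queues, so that `qsub -l medium` can only land there.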
Everything worked fine with 6.1u5. After the upgrade, however, we see the following behaviour:
- jobs get assigned to a queue other than the one matching the requested complex, e.g. to low_3.q despite the "medium" complex being requested
- those misassigned jobs are not killed when they exceed the hard wall-clock limit
- instead of preempting the lowest-priority job on one node, a higher-priority job gets preempted on another node, e.g. hight_3.q suspends medium_3.q on node "A" rather than low_2.q on node "B". As a result, the lowest-priority jobs continue to run, whereas medium-priority jobs get suspended.
I was wondering if there were any changes in the way SGE should be configured that I overlooked.