[GE users] Preemption vs dedicating nodes by group?

Justus Loerke loerke at molgen.mpg.de
Fri May 6 12:12:36 BST 2005

Hi Jim,

the suspension of the subordinated queue is triggered by the
subordination mechanism itself, I believe. It's a property of
subordinated queues that jobs will run on them only if the "top" queue
(i.e. the queue they are subordinated to) is idle, and jobs running on
them will be suspended once a job starts on the primary queue. I did no
other configuration than setting up the secondary queues as subordinated
queues; all suspension/migration is handled by GE itself. As to how
GE handles this internally, I can't really say anything. My guess would
be a simple check whether the primary queue is empty, with a suspension
sent to the subordinated secondary queue once a job is scheduled to
the primary one. I tried doing this manually with a script myself, but
it turned out to be work already done by someone else :)
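
To make the behaviour described above concrete, here is a minimal toy
model in Python. It only mirrors the guess stated in this mail (the
secondary queue is suspended whenever the primary queue is not empty)
and is not how GE actually implements subordination. The Queue class,
the function names and the queue names are all made up for
illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Queue:
    """Toy stand-in for a GE queue (hypothetical, not GE's data model)."""
    name: str
    running: list = field(default_factory=list)
    suspended: bool = False


def start_job(queue, job):
    """Start a job on the queue; refuse if the queue is suspended."""
    if queue.suspended:
        return False
    queue.running.append(job)
    return True


def finish_job(queue, job):
    """Remove a finished job from the queue."""
    queue.running.remove(job)


def apply_subordination(primary, secondary):
    """Suspend the secondary queue while the primary queue has jobs.

    Assumption: a single running job on the primary queue is enough
    to trigger the suspension, as guessed in the mail.
    """
    secondary.suspended = len(primary.running) > 0
    return secondary.suspended


primary = Queue("primary.q")      # hypothetical queue names
secondary = Queue("secondary.q")

start_job(secondary, "low-prio-job")
apply_subordination(primary, secondary)
print(secondary.suspended)  # False: primary queue is idle

start_job(primary, "high-prio-job")
apply_subordination(primary, secondary)
print(secondary.suspended)  # True: a job started on the primary queue

finish_job(primary, "high-prio-job")
apply_subordination(primary, secondary)
print(secondary.suspended)  # False: primary queue is idle again
```

Note that in this sketch the low-priority job stays on the secondary
queue while it is suspended; it is neither killed nor migrated, which
is where checkpointing would come in.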

I did look at load sensors first (exactly the one you mentioned, since
using idle workstations is my next project once I've got some
checkpointing mechanism running, plus another one I wrote myself that
monitors the jobs submitted to the 2 main groups of queues and checks
for empty queues), but found that the subordination method works
without a sensor, so there's no real need for one. As I said, the
suspension/migration is triggered by GE once the primary queue is
no longer idle, which is a lot better than having to do all this
manually or with some script.

Right now I educate my users in person, since I only have a
handful of them and it isn't really that much work. This also gives me
the opportunity to work out existing problems with potential file
corruption, etc.; most of them are very helpful and also very interested
in checkpointing and migration schemes (especially if these can be
used to include workstations in the grid), since we've had frequent
problems with dying jobs. I'm in the happy position that I don't have to 
keep a watchful eye on my users; most of them understand quite easily 
that it's better to migrate their jobs automatically than to have the 
jobs that are running on secondary queues suspended (potentially) 
forever because they conveniently "forgot" the checkpointing flag. They
may be immune from the kill/restart, but once the primary queue fills 
up, they will find their jobs suspended for a long time, if they are 
unlucky. But you are right, having jobs killed and restarted from the
beginning is not an option and is hard to communicate to their owners; since I
have no (real) checkpointing system running at this time (still trying 
to recompile with the Condor libs), the secondary (i.e. subordinated) 
queues are all disabled, to avoid data corruption and user irritation. 
But at least in this respect I expect very few problems once
checkpointing works.

Hope that helps :)



Dipl. Phys. Justus Loerke
- UltraStrukturNetzwerk -
Max Planck Institute for Molecular Genetics
Ihnestr. 63-73
D-14195 Berlin

Tel.:   +49-30-8413-1644
Fax:    +49-30-8413-1385
E-mail: loerke at molgen.mpg.de  
