[GE users] Preemption vs dedicating nodes by group?
jmarconnet at knology.net
Sun May 8 13:45:45 BST 2005
We just had emails cross in the mail, assuming my email got thru this time.
I had to start using my home email since the list was simply ignoring my
emails from work, even though it continued sending me emails from the group.
Rogue Electrons! This weekend my latest email to the group bounced with a
failure notice for the first time in my experience. So I edited it some and
sent it out again just a few seconds before your email arrived.
Thanks so much for this latest info. As near as I can tell from reading your
email, it's not that I'm so dense about triggering running job preemption,
but that SGE just does it when queues are properly subordinated. I had not
read or caught that anywhere I've been looking. Hope it works for you in
production. I'll just go ahead and try this approach in my two test queues
next week and see for myself what happens.
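
To make sure I have the behavior straight, here is a toy Python model of what
the quoted message below describes: jobs start on the subordinated queue only
while the primary queue is idle, and they get suspended as soon as a job
starts on the primary queue. The class and method names are made up for
illustration and are not part of SGE. Resuming the suspended jobs once the
primary queue drains is my assumption, and none of this reflects how SGE
implements the mechanism internally.

```python
class SubordinatedQueues:
    """Toy model of one primary queue with one subordinated queue.

    Hypothetical names for illustration only; this is not SGE code or API.
    """

    def __init__(self):
        self.primary_running = []
        self.secondary_running = []
        self.secondary_suspended = []

    def start_primary(self, job):
        """Start a job on the primary queue and suspend the subordinated one."""
        self.primary_running.append(job)
        # Everything running on the subordinated queue is suspended.
        self.secondary_suspended.extend(self.secondary_running)
        self.secondary_running.clear()

    def start_secondary(self, job):
        """Start a job on the subordinated queue only if the primary is idle."""
        if self.primary_running:
            return False  # primary queue is busy, so the job is not started
        self.secondary_running.append(job)
        return True

    def finish_primary(self, job):
        """Finish a primary job; resume suspended jobs once primary is idle."""
        self.primary_running.remove(job)
        if not self.primary_running:
            # Assumption: suspended jobs resume when the primary queue drains.
            self.secondary_running.extend(self.secondary_suspended)
            self.secondary_suspended.clear()


if __name__ == "__main__":
    queues = SubordinatedQueues()
    queues.start_secondary("low-1")
    queues.start_primary("high-1")
    print("suspended:", queues.secondary_suspended)
    queues.finish_primary("high-1")
    print("running again:", queues.secondary_running)
```

In SGE itself none of this is scripted by hand. The relationship is declared
in the queue configuration (the subordinate_list attribute described in
queue_conf(5)) and the suspension is left to the system.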
Thanks, too, for sharing about how you deal with your users. That is very
helpful info. Glad to hear you have a reasonable number of great users to
deal with. My experience so far with emailing users about the latest queues,
gotchas to avoid, etc. is that they often simply don't read it, and often even
deny that you shared the "nugget" of info it turns out they needed. Yes, I
know I'm sometimes verbose, and that the most important stuff must go at the
top, perhaps in the subject line.
----- Original Message -----
From: "Justus Loerke" <loerke at molgen.mpg.de>
To: <users at gridengine.sunsource.net>
Sent: Friday, May 06, 2005 6:12 AM
Subject: Re: [GE users] Preemption vs dedicating nodes by group?
> Hi Jim,
> the suspension of the subordinated queue is triggered by the
> subordination mechanism itself, I believe. It's a property of
> subordinated queues that jobs will run on them only if the "top" queue
> (i.e. the queue they are subordinated to) is idle and jobs running on
> them will be suspended once a job starts on the primary queue. I did no
> other configuration than setting up secondary queues as subordinated
> queues, all suspension/migration is handled by the GE itself. As to how
> the GE handles this, I can't really say anything. My guess would be just
> simple checking whether the prim. queue is empty and sending a
> suspension to the subordinated sec. queue once a job is scheduled to
> the prim. one; I tried doing this manually with a script myself, but it
> turned out to be work already done by someone else :)
> I did look at load sensors first (exactly the one you mentioned, since
> using idle workstations is my next project, once I've got some
> checkpointing mechanism running, plus another one written by myself that
> monitors the jobs submitted to the 2 main groups of queues and checks for
> empty queues) but found that the subordination method works without a
> sensor, so there's no real need for any; as I said, the
> suspension/migration is triggered by the GE once the primary queue is
> not idle anymore, so that's a lot better than having to do all this
> manually or with some script.
> Right now my users are educated in personal talks, since I only have a
> handful of them and it isn't really that much work. This also gives me
> the opportunity to work out existing problems with potential file
> corruption, etc; most of them are very helpful and also very interested
> in some checkpointing and migration schemes (especially if this can be
> used to include workstations in the grid), since we've had frequent
> problems with dying jobs. I'm in the happy position that I don't have to
> keep a watchful eye on my users; most of them understand quite easily
> that it's better to migrate their jobs automatically than to have the
> jobs that are running on secondary queues suspended (potentially)
> forever because they conveniently "forgot" the checkpointing flag. They
> may be immune from the kill/restart, but once the primary queue fills
> up, they will find their jobs suspended for a long time, if they are
> unlucky.... But you are right, having jobs killed and restarted from the
> beginning is no option and hard to communicate to their owners; since I
> have no (real) checkpointing system running at this time (still trying
> to recompile with the Condor libs), the secondary (i.e. subordinated)
> queues are all disabled, to avoid data corruption and user irritation.
> But at least in this respect I expect very few problems once
> checkpointing works.
> Hope that helps :)
> Dipl. Phys. Justus Loerke
> - UltraStrukturNetzwerk -
> Max Planck Institute for Molecular Genetics
> Ihnestr. 63-73
> D-14195 Berlin
> Tel.: +49-30-8413-1644
> Fax: +49-30-8413-1385
> E-mail: loerke at molgen.mpg.de
> To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
> For additional commands, e-mail: users-help at gridengine.sunsource.net
> No virus found in this incoming message.
> Checked by AVG Anti-Virus.
> Version: 7.0.308 / Virus Database: 266.11.6 - Release Date: 5/6/2005