[GE users] Default queue state for rebooted machines?
magnus.soderberg at switchcore.com
Wed May 4 10:59:42 BST 2005
Robert Griffiths wrote:
> Morning all,
> I was just wondering if there is a way to configure SGE (version 5.3p6) in such a way
> that when a known machine/execution host reboots itself (for whatever reason) and
> becomes operational again, it should not automatically be placed in the set of active
> execution hosts? Something as simple as starting all queues in disabled mode would be
> Our jobs rely on data being uploaded into shared memory before they can run and,
> because the machine rebooted and was devoid of our data in its shared memory, SGE sent
> huge amounts of jobs to that machine because it was processing them so quickly. We
> would have been better off if the machine had remained dead!
I think you're barking up the wrong tree here. As I see it, you have two problems:
1. Your jobs don't check for their input, or load it, before starting.
Fix the job scripts so they either check for correct input before running (with a
sensible timeout) or load the input themselves if they can determine what is needed. If no
input is found, exit the job with a non-zero status.
2. Incorrectly set-up machines/jobs become a black hole.
Make sure jobs fail with a suitable exit value so the machine/queue is actually put in an
error state. That is tedious, but as you point out, having a black hole eating all jobs is
much worse. I don't remember right now how to finish a job so that the queue/machine ends
up in an error state, but I do remember having read about it somewhere. It has also
happened to me (inadvertently), so I know it can be done.
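
A minimal sketch of the input check from point 1, to be placed at the top of a job
script. It assumes the loader that fills shared memory also creates a marker file when
it has finished; the marker path and the function name wait_for_input are hypothetical
and not something Grid Engine provides. The exit value 100 is also an assumption: Grid
Engine documentation that I know of describes 100 as putting the job in an error state
and 99 as requeueing it, but please verify that against the man pages for 5.3p6.

```shell
#!/bin/sh
# wait_for_input MARKER TIMEOUT [INTERVAL]
# Poll until MARKER exists. Return 0 once it is found, or 1 if it is still
# missing after roughly TIMEOUT seconds. INTERVAL defaults to 5 seconds.
wait_for_input() {
    marker=$1
    timeout=$2
    interval=${3:-5}
    waited=0
    while [ ! -e "$marker" ]; do
        if [ "$waited" -ge "$timeout" ]; then
            return 1
        fi
        sleep "$interval"
        waited=$((waited + interval))
    done
    return 0
}

# Intended use at the top of a job script (hypothetical marker path):
#
#   if ! wait_for_input /var/run/mydata.ready 60; then
#       echo "input data not loaded on $(hostname)" >&2
#       exit 100   # assumed: non-zero status so the failure is visible to SGE
#   fi
```

The same check could probably be run from a queue prolog instead of from every job
script. As far as I can tell from the queue_conf documentation, a prolog exit status
other than 0, 99 or 100 is what puts the queue itself into an error state, which would
address point 2, but treat that as something to confirm for your version.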