[GE users] Default queue state for rebooted machines?

Robert Griffiths Robert.Griffiths at mitsubishi-sec-intl.com
Wed May 4 11:18:21 BST 2005

Hi Magnus,

Thanks for your input.

I like your idea of having the job scripts check for the existence of the
shared data. However, we already have this kind of checking within the code
of the *executable* to look for the appropriate segment and it throws an
exception when it cannot find the segment it's looking for. As it happens,
by the time the jobs are executing, it's already too late to stop that
machine from becoming a black hole unless, as you state, you can force the
queue into an Error state after a thrown exception.

I'll have a dig through the admin manuals again to see if I can find
something about forcing a queue error state - that would be an acceptable



-----Original Message-----
From: Magnus Söderberg [mailto:magnus.soderberg at switchcore.com] 
Sent: 04 May 2005 11:00
To: users at gridengine.sunsource.net
Subject: Re: [GE users] Default queue state for rebooted machines?

Robert Griffiths wrote:
> Morning all,
> I was just wondering if there is a way to configure SGE (version 5.3p6) in
> such a way that when a known machine/execution host reboots itself (for
> whatever reason) and becomes operational again, it should not automatically
> be placed in the set of active execution hosts? Something as simple as
> starting all queues in disabled mode would be ideal.
> Our jobs rely on data being uploaded into shared memory before they can run
> and, because the machine rebooted and was devoid of our data in its shared
> memory, SGE sent huge amounts of jobs to that machine because it was
> processing them so quickly. We would have been better off if the machine had
> remained dead!
I think you're barking up the wrong tree here. As I see it you have 2 problems:

1. Your jobs don't check for input or load input before starting.

Fix the job scripts so they either check for correct input before running
(with some proper timeout) or load the input if they can determine that
themselves. If no input is found, exit the job with some value other than 0.

2. Incorrectly set up machines/jobs become a black hole.

Make sure jobs fail with some proper value so the machine/queue is actually
put in an error state. Rather boring, but as you point out, having a black
hole eating all jobs is much worse. I don't remember right now how to finish
a job so the queue/machine ends up in an error state, but I do remember
having read it somewhere. It has also happened to me (inadvertently) so I
know it can be done.
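
The check in point 1 could look roughly like the sketch below. It is only an
illustration, not a tested SGE recipe: the marker file, the JOB_INPUT_MARKER
variable, the function names and the exit value 1 are all made up for the
example. Which exit values Grid Engine treats specially, and whether any of
them puts the queue (rather than the job) into an error state, should be
checked in the sge_conf and queue_conf man pages for your version.

```python
#!/usr/bin/env python3
"""Hypothetical job wrapper: wait for the shared input, then run the real job."""
import os
import subprocess
import sys
import time

# Assumption: placeholder value for "input missing". Pick the real value after
# checking which exit codes your Grid Engine version treats specially.
EXIT_NO_INPUT = 1


def wait_for_input(is_ready, timeout_s=60.0, poll_s=1.0):
    """Poll is_ready() until it returns True or timeout_s has elapsed."""
    deadline = time.monotonic() + timeout_s
    while True:
        if is_ready():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_s)


def run_job(command, is_ready, timeout_s=60.0, poll_s=1.0):
    """Run command only if the input shows up in time; return the exit status."""
    if not wait_for_input(is_ready, timeout_s, poll_s):
        print("shared input not found, refusing to run", file=sys.stderr)
        return EXIT_NO_INPUT
    # Pass the real job's own exit status through unchanged.
    return subprocess.call(command)


if __name__ == "__main__" and len(sys.argv) > 1:
    # Hypothetical marker file standing in for "the shared data is loaded".
    marker = os.environ.get("JOB_INPUT_MARKER", "/tmp/shared_data.ready")
    sys.exit(run_job(sys.argv[1:], lambda: os.path.exists(marker)))
```

The job would then be submitted as this wrapper followed by the real command
line, so a machine that has lost its shared data returns a non-zero status
after the timeout instead of "finishing" every job instantly. For data held
in shared memory, the is_ready callable would have to be replaced by whatever
test the executable already uses to find its segment.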


Magnus Söderberg
