[GE users] Default queue state for rebooted machines?

Reuti reuti at staff.uni-marburg.de
Wed May 4 11:32:14 BST 2005


Hi, is the queue dedicated to one program? Maybe the test can be put in
a queue prolog which exits with -1 when the data is missing and so sends
the queue into error state. The job will be requeued in this case. - Reuti
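
A minimal sketch of such a prolog check, in Python. Everything specific in
it is an assumption, not something from this thread: the segment key
0x00001234 is a placeholder, the data is assumed to live in a System V
shared memory segment visible in the output of `ipcs -m`, and 255 is used
as the exit status because that is what an exit of -1 becomes on POSIX.
Which exit codes put a queue into error state should be checked against
the queue_conf man page for your SGE version.

```python
import subprocess
import sys

SEGMENT_KEY = "0x00001234"   # hypothetical key of the shared data segment
QUEUE_ERROR_EXIT = 255       # exit(-1) is reported as status 255 on POSIX


def segment_present(ipcs_output, key):
    """Return True if `key` appears as a whole field on any line."""
    wanted = key.lower()
    return any(wanted in line.lower().split()
               for line in ipcs_output.splitlines())


def prolog_exit_code(ipcs_output, key=SEGMENT_KEY):
    """0 lets the job start; non-zero signals a problem with the host."""
    return 0 if segment_present(ipcs_output, key) else QUEUE_ERROR_EXIT


def main():
    try:
        result = subprocess.run(["ipcs", "-m"], capture_output=True,
                                text=True, check=True)
    except (OSError, subprocess.CalledProcessError):
        # If the check itself cannot run, treat the host as not ready.
        return QUEUE_ERROR_EXIT
    return prolog_exit_code(result.stdout)

# When installed as the queue prolog, finish the script with:
#     sys.exit(main())
```

Matching on whole fields rather than a fixed column keeps the check
independent of the exact `ipcs` column layout, which differs between
platforms.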

Robert Griffiths wrote:
> Hi Magnus,
> 
> Thanks for your input.
> 
> I like your idea of having the job scripts check for the existence of the
> shared data. However, we already have this kind of checking within the code
> of the *executable* to look for the appropriate segment and it throws an
> exception when it cannot find the segment it's looking for. As it happens,
> by the time the jobs are executing, it's already too late to stop that
> machine from becoming a black hole unless, as you state, you can force the
> queue into an Error state after a thrown exception.
> 
> I'll have a dig through the admin manuals again to see if I can find
> something about forcing a queue error state - that would be an acceptable
> solution!
> 
> Cheers,
> 
> Rob
> 
> -----Original Message-----
> From: Magnus Söderberg [mailto:magnus.soderberg at switchcore.com] 
> Sent: 04 May 2005 11:00
> To: users at gridengine.sunsource.net
> Subject: Re: [GE users] Default queue state for rebooted machines?
> 
> 
> Robert Griffiths wrote:
> 
>>Morning all,
>>
>>I was just wondering if there is a way to configure SGE (version 5.3p6) in
>>such a way that when a known machine/execution host reboots itself (for
>>whatever reason) and becomes operational again, it should not automatically
>>be placed in the set of active execution hosts? Something as simple as
>>starting all queues in disabled mode would be ideal.
>>
>>Our jobs rely on data being uploaded into shared memory before they can run
>>and, because the machine rebooted and was devoid of our data in its shared
>>memory, SGE sent huge amounts of jobs to that machine because it was
>>processing them so quickly. We would have been better off if the machine had
>>remained dead!
>>
> 
> .....
> I think you're barking up the wrong tree here. As I see it you have 2
> problems:
> 1. Your jobs don't check for input/don't load input before starting.
> 
> Fix the job scripts so they either check for correct input before running
> (with some proper timeout) or load the input if they can determine that
> themselves. If no input is found, exit the job with some value other than 0.
> 
> 2. Incorrectly set up machines/jobs become a black hole.
> Make sure jobs fail with some proper value so the machine/queue is actually
> put in an error state. Rather boring, but as you point out, having a black
> hole eating all jobs is much worse. I don't remember right now how to finish
> a job so the queue/machine ends up in an error state, but I do remember
> having read it somewhere. It has also happened to me (inadvertently) so I
> know it can be done.
> 
> 
> regards
> 
> Magnus Söderberg
> 
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
> For additional commands, e-mail: users-help at gridengine.sunsource.net
> 
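
For reference, Magnus's first point (check for input with a proper timeout
and exit with a non-zero value if it never shows up) could be sketched as
below. The names wait_for and job_exit_code, the polling interval, and the
exit value 1 are illustrative assumptions only; the `check` callable stands
for whatever test tells you the shared data is loaded.

```python
import time


def wait_for(check, timeout_s, poll_s=1.0):
    """Poll check() until it returns true or timeout_s seconds elapse."""
    deadline = time.monotonic() + timeout_s
    while True:
        if check():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_s)


def job_exit_code(check, timeout_s, poll_s=1.0):
    """0 if the input appeared in time, otherwise a non-zero value."""
    return 0 if wait_for(check, timeout_s, poll_s) else 1
```

A job wrapper would call job_exit_code() before starting the real
executable and exit with its result if that is non-zero. As discussed
above, a failing job on its own does not necessarily stop the queue from
accepting more work, so this complements the prolog check instead of
replacing it.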

