[GE users] Job Restart on App Failure

templedf daniel.templeton at oracle.com
Thu Mar 25 14:42:03 GMT 2010


That still involves changing the wrapper script, but it's definitely the
easiest approach.  If you're really adamant about not changing the
wrapper script, you could do the same thing in a custom starter or
epilog.  From the epilog, you can read the job's exit status from the
exit_status line of the
<execd_spool_dir>/active_jobs/<jobid>.<taskid>/usage file, which the
epilog can reach as $SGE_JOB_SPOOL_DIR/usage.  A qdel will show up as
exit status 128+SIGKILL, i.e. 137 (SIGKILL is signal 9), so that value
should not trigger a restart.
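To see where 137 comes from: Bourne-style shells such as sh, bash and dash report a process killed by signal N as exit status 128 + N, and SIGKILL is signal 9.  The small sketch below just demonstrates that shell convention by having a child shell kill itself with SIGKILL; it assumes, as described above, that the shepherd records a qdel'd job's exit_status the same way.

```shell
#!/bin/sh
# Start a child shell that sends SIGKILL (signal 9) to itself, much
# as qdel ends a running job, and record the status the parent sees.
status=0
sh -c 'kill -9 $$' || status=$?
echo "exit status: $status"
# The shell reports "killed by signal N" as 128 + N, so 137 - 128 = 9.
echo "signal number: $((status - 128))"
```

With bash or dash as /bin/sh this prints "exit status: 137" followed by "signal number: 9".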

The epilog would look something like:

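#!/bin/sh

# The shepherd writes "exit_status=<n>" into the usage file in the
# job's spool directory ($SGE_JOB_SPOOL_DIR).
status=`grep exit_status "$SGE_JOB_SPOOL_DIR/usage" | cut -d= -f2`

# 0 means the app finished normally; 137 (128+SIGKILL) means the job
# was killed by qdel.  Anything else is treated as an app failure:
# exit 99 so that SGE reschedules the job.
case "$status" in
   0|137) exit 0 ;;
   *)     exit 99 ;;
esac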
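For comparison, the wrapper-script approach (the easiest option mentioned above) could be sketched roughly as follows.  This is only an illustration: run_app is a hypothetical helper, not part of SGE, and the sketch relies on exit code 99 meaning "reschedule the job", as described on the sge_shepherd man page linked in the quoted reply.

```shell
#!/bin/sh
# Sketch of a job wrapper.  run_app is a hypothetical helper: it runs
# the command passed as its arguments and maps any non-zero exit to 99,
# the exit code the shepherd treats as "reschedule this job".
# qdel normally kills the whole job, wrapper included, so the wrapper
# does not get the chance to return 99 in that case.
run_app() {
   "$@"
   app_status=$?
   if [ "$app_status" -ne 0 ]; then
      echo "application exited with status $app_status; requesting reschedule" >&2
      return 99
   fi
   return 0
}
```

In the job script the last lines would then be something like "run_app /path/to/app arg1; exit $?", where /path/to/app is a placeholder for the real application.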


Daniel

On 03/25/10 07:13, yooniverse wrote:
> I'll suggest it and see if that's possible.  Thanks.
>
> -----Original Message-----
> From: rayson [mailto:rayrayson at gmail.com]
> Sent: Wednesday, March 24, 2010 1:17 PM
> To: users at gridengine.sunsource.net
> Subject: Re: [GE users] Job Restart on App Failure
>
> You can exit the job with code 99:
>
> http://gridengine.sunsource.net/nonav/source/browse/~checkout~/gridengine/doc/htmlman/htmlman8/sge_shepherd.html?pathrev=V62u5_TAG
>
> Rayson
>
>
>
> On 3/24/10, yooniverse <yoon.s.chung at chase.com> wrote:
>>
>> Hi,
>>
>> I have a request from a user who would like to know if there is a way to
>> automatically have the job restarted if the app it executes exits for any
>> reason other than a client-side termination of the job (qdel, breaking
>> a qsub -sync, etc.) or an exit code of 0.  The app is not very reliable,
>> but it must continue to run without intervention.
>>
>> I know that SGE can restart a job if the execd terminates abnormally (e.g.,
>> server crash), but I was wondering if there is an interesting way to make SGE
>> behave this way without having to customize his submission script to have
>> some kind of conditional logic to resubmit.
>>
>> Any thoughts?
>>
>> Thanks,
>>
>> Yoon
>>
>
> ------------------------------------------------------
> http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=251148
>
> To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=251332

To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].


