[GE users] Random job failures

bdo brendon.oliver at gmail.com
Tue Dec 8 01:58:30 GMT 2009


On Monday 07 December 2009 22:56:52 craffi wrote:


> I have an epilog.sh script that does this for the pattern "LICENSE NOT
> FOUND" in job output files. I used it for dealing with flexlm license
> related errors. I can mail it to you if you want an example epilog
> script to work off of.

I just realised I'd missed your offer. If you could please send me a copy of
your script, that'd be great. I've not written an epilog script before, so
having a working one to start from would be a big help.

I've set KEEP_ACTIVE on the exec hosts and have had one job fail, but it
appears that this setting doesn't preserve the job script, only the contents
of the temporary job directory.

That is, on the exec host where the job failed, I have the directory:

/usr/local/sge/default/spool/{exec-host}/active_jobs/6463147.23

containing the files:

addgrpid  config  environment  error  exit_status  job_pid  pe_hostfile  pid  
trace  usage

The file 'error' is empty, and exit_status just contains the single line '0'.  

The pertinent info from the trace file is:
12/08/2009 11:19:06 [4000:15618]: execvp(...command & args elided...)
12/08/2009 11:25:10 [103:15617]: wait3 returned 15618 (status: 36352; WIFSIGNALED: 0,  WIFEXITED: 1, WEXITSTATUS: 142)
12/08/2009 11:25:10 [103:15617]: job exited with exit status 142
12/08/2009 11:25:10 [103:15617]: reaped "job" with pid 15618
12/08/2009 11:25:10 [103:15617]: job exited not due to signal
12/08/2009 11:25:10 [103:15617]: job exited with status 142
12/08/2009 11:25:10 [103:15617]: now sending signal KILL to pid -15618
12/08/2009 11:25:10 [103:15617]: writing usage file to "usage"
12/08/2009 11:25:10 [103:15617]: no tasker to notify
12/08/2009 11:25:10 [103:15617]: no epilog script to start

Exit status 142 doesn't mean anything to me, and the Perl program itself
doesn't use that as one of its own exit values (of course, it could be
propagated from something the Perl program calls, but I've no idea what
that could be at the moment).
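
One possible reading of the 142, sketched in Python. It rests on two
assumptions that the trace does not confirm: that the job script is run by a
POSIX-style shell, which reports 128 + N when a command it started is killed
by signal N, and that signals are numbered as on Linux. The helper name
signal_from_exit_code is made up for this illustration. Under those
assumptions 142 - 128 = 14, which is SIGALRM, and that would be consistent
with the "Alarm clock" message in the job's stderr log.

```python
import os
import signal

# Raw wait status copied from the trace file ("status: 36352").
raw_status = 36352

# Decode it the same way the shepherd's trace does.
exited_normally = os.WIFEXITED(raw_status)
exit_code = os.WEXITSTATUS(raw_status)


def signal_from_exit_code(code):
    """Return the signal name if code looks like 128 + N, else None.

    Assumption: the exit code came from a POSIX-style shell reporting
    that one of its child commands was killed by signal N.
    """
    if code <= 128:
        return None
    try:
        return signal.Signals(code - 128).name
    except ValueError:
        # No signal with that number on this platform.
        return None


print(exited_normally, exit_code, signal_from_exit_code(exit_code))
```

If this reading is right, the shell running the job script exited normally
(hence WIFEXITED: 1) and was only passing on the fact that $PROG had been
killed by an alarm signal.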

The stderr log from the job gives me:
/usr/local/sge/default/spool/{exec-host}/job_scripts/6463147: line 172: 15634 Alarm clock             $PROG $ARGS


However, the {exec-host}/job_scripts/6463147 file no longer exists once the
job has terminated. I guess I'll need to set up the epilog to grab a copy of
that file before it is removed.
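
A minimal sketch of what such an epilog could look like, written in Python
rather than shell. Several things here are assumptions, not confirmed
anywhere in this thread: that the job script is still present in job_scripts
when the epilog runs, and that JOB_ID is set in the epilog's environment. The
function save_job_script and the environment variables EPILOG_SPOOL_DIR and
EPILOG_SAVE_DIR are hypothetical names invented for this example; the spool
directory has to be supplied because the exec host part of the path is site
specific.

```python
import os
import shutil


def save_job_script(job_id, spool_dir, dest_dir):
    """Copy <spool_dir>/job_scripts/<job_id> into dest_dir.

    Returns the path of the copy, or None if the job script is not there.
    """
    src = os.path.join(spool_dir, "job_scripts", str(job_id))
    if not os.path.isfile(src):
        return None
    os.makedirs(dest_dir, exist_ok=True)
    dest = os.path.join(dest_dir, "job_script." + str(job_id))
    shutil.copy2(src, dest)  # copy2 also keeps the file's timestamps
    return dest


if __name__ == "__main__":
    # Assumption: the epilog sees the job's JOB_ID variable.
    job_id = os.environ.get("JOB_ID")
    # Hypothetical settings, to be provided by whoever installs the epilog.
    spool_dir = os.environ.get("EPILOG_SPOOL_DIR")
    dest_dir = os.environ.get("EPILOG_SAVE_DIR")
    if job_id and spool_dir and dest_dir:
        save_job_script(job_id, spool_dir, dest_dir)
```

The copy step is kept in its own function so it can be tried by hand against
a spool directory before being wired into the queue configuration.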


thanks & regards,

- Brendon.

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=232136
