[GE users] Random job failures

craffi dag at sonsorol.org
Mon Dec 7 11:56:52 GMT 2009

Before the suggestions below I'd offer that the usual reason behind 
"jobs work on head node but fail oddly when run under SGE" can sometimes 
be traced to shell, path or environment variables that differ between 
how you run interactively versus under SGE. It's not enough just to run 
and rerun the application - you need to pretty carefully examine the 
total job environment to see "what is different"
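One concrete way to do that comparison (the job name and output file 
names here are just placeholders):

```
# On the head node, capture your interactive environment:
env | sort > interactive.env

# Submit a trivial job that captures what the same command sees under SGE:
echo 'env | sort' | qsub -N envdump -cwd -j y -o sge.env

# Once the job has run, diff the two:
diff interactive.env sge.env
```

PATH, LD_LIBRARY_PATH and shell startup files (your interactive .bashrc 
versus the non-interactive shell SGE starts jobs with) are the usual 
suspects.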

With that said ...

Two suggestions:

(1) Look into the keep_active=true parameter; more info in the sge_conf(5) 
man page:

           This value should only be set for  debugging  purposes.
           If  set  to  true, the execution daemon will not remove
           the spool directory maintained by sge_shepherd(8) for a
           job.

If set, this value prevents the deletion of the temporary SGE job 
directory - allowing you visibility into some neat 
state/status/trace/error files that normally don't stick around very long.

This is a fantastic and under-appreciated debugging tool on Grid Engine. 
The only thing to remember is to disable it when you are done as you 
could easily fill up your spooling volume on a big or active cluster if 
these directories never get cleaned up and deleted.
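To flip it on (and off again), the place is the execd_params list in the 
global configuration - something along these lines (the upper-case 
KEEP_ACTIVE spelling is how the sge_conf(5) man page writes it):

```
# Check the current setting:
qconf -sconf | grep execd_params

# Edit the global configuration and add the parameter:
qconf -mconf
#   execd_params   KEEP_ACTIVE=TRUE

# When you are done debugging, set it back to FALSE or remove it.
```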

(2) Consider writing an epilog script that runs globally and 
automatically detects these errors as they happen. It can point you 
towards the correct jobdir after you set keep_active=true, or it can 
resubmit the job, send you an email or do whatever else you need.
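For reference, an epilog is attached per queue - something like the 
following, where all.q and the script path are just placeholders for 
your own:

```
qconf -mq all.q
#   epilog   /usr/local/sge/scripts/epilog.sh
```

The epilog runs on the execution host after every job in that queue 
finishes, with the SGE job variables ($JOB_ID, $SGE_JOB_SPOOL_DIR etc.) 
available to it.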

The epilog would do something like:

- exit quickly if the job exit status indicates no errors
-- code snippet:

  JOB_EXIT_STATUS="`sed -ne 's/^exit_status=//p' $SGE_JOB_SPOOL_DIR/usage | tail -1`"

If you detect a job that exited with an error, you can then grep your 
job-specific output files for a pattern matching the error you are looking for:

  ERRORDETECT="`grep -c "Alarm clock   " $SGE_O_WORKDIR/*.log `"

... or similar.

I have an epilog.sh script that does this for the pattern "LICENSE NOT 
FOUND" in job output files. I used it for dealing with FlexLM 
license-related errors. I can mail it to you if you want an example 
epilog script to work off of.
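Putting the two snippets above together, a minimal epilog might look 
like this sketch. The "Alarm clock" pattern is the error from this 
thread; the *.log glob is an assumption about your output file naming, 
and it assumes keep_active is set so the spool directory survives:

```shell
#!/bin/sh
# Minimal epilog sketch assembled from the snippets above.
# Assumptions: keep_active=true (so $SGE_JOB_SPOOL_DIR still exists),
# and job output lands in *.log files under the submit directory.

check_job() {
    spool_dir="$1"   # normally $SGE_JOB_SPOOL_DIR
    work_dir="$2"    # normally $SGE_O_WORKDIR

    # Last exit_status line in the shepherd's usage file:
    status="`sed -ne 's/^exit_status=//p' "$spool_dir/usage" | tail -1`"

    # Exit quickly on clean exits.
    [ "$status" = "0" ] && return 0

    # Look for the error signature in the job's output files:
    if grep -l "Alarm clock" "$work_dir"/*.log >/dev/null 2>&1; then
        echo "job exited $status; spool kept in $spool_dir"
        # ... or resubmit the job, send mail, etc.
    fi
}

# In a real epilog this would be invoked as:
#   check_job "$SGE_JOB_SPOOL_DIR" "$SGE_O_WORKDIR"
```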


bdo wrote:
> Hi all,
> Hoping someone can point me where to look for the cause of this: one
> particular type of job I run fails randomly with the following error in the
> job's stderr log:
> /usr/local/sge/default/spool/{grid-host}/job_scripts/6044343: line 172: 27572
> Alarm clock             $PROG $ARGS
> of course, by the time I find out about the failure, the job script file has
> long since been deleted.  It's not always the same {grid-host} that gets the
> error - I've seen it happen on several, but it is always at the same line of
> the job script file.
> The program being run by the grid-job is a perl program we've written in-
> house, but no matter what I try, I can't induce it to fail on-demand with that
> error.  I've checked the perl program too, but haven't seen anything that
> would cause this error.
> If it's of any bearing, there are 6 exec hosts in the cluster queue which is
> configured with 8 slots.  The exec hosts are all Gentoo linux boxes, and the
> qmaster is a FreeBSD box.  All machines have SGE 6.0u7 installed (we're not in
> a position to upgrade just at the moment).
> Any thoughts on where I should look?  Suggestions much appreciated.
> Thanks & regards,
> - Brendon.
> ------------------------------------------------------
> http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=231938
> To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].

