[GE users] Random job failures

bdo brendon.oliver at gmail.com
Mon Dec 7 21:50:28 GMT 2009

On Tuesday 08 December 2009 00:01:05 reuti wrote:

> > Hoping someone can point me where to look for the cause of this: one
> > particular type of job I run fails randomly with the following error
> > in the job's stderr log:
> >
> > /usr/local/sge/default/spool/{grid-host}/job_scripts/6044343: line 172: 27572 Alarm clock             $PROG $ARGS
>
> I don't think that the script was deleted from the jobs point of
> view: when you have a file open (like it's done for the job script
> when it starts), it can be deleted by another process and won't be
> visible from the command line any longer. But the job script is still
> open and can be accessed from the job. In `lsof` such files have the
> suffix "(deleted)" at the end.

But doesn't this script file get cleaned up automatically after the job
completes (whether successful or not)?  I think Chris' suggestion to look at
the KEEP_ACTIVE setting may help me here.  I'll be having a look at that.

> What program is "Alarm clock"? When it's an error message maybe it
> just tries to tell you the correct syntax for the command, maybe
> $PROG or $ARGS are empty?

I've no idea what the reference to 'Alarm clock' is, that's part of what I'm 
trying to track down.  The only mention of the word 'alarm' in the perl 
program being run is where alarm() is used to set up a timeout around some
potentially long-running code sections.  At first I thought this might have 
been the culprit, but after adding some explicit logging in the perl program 
to indicate where & when the alarm() is used, I'm pretty confident that that's 
not the issue.
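
One check that may help narrow down where the text itself comes from: on
Linux, "Alarm clock" is the system's description of SIGALRM, and a shell
reports a child killed by a signal with exit status 128 + the signal number.
The sketch below is only an illustration under those assumptions; the
self-signalling child is a hypothetical stand-in for "$PROG $ARGS", not the
real job, and it says nothing about which alarm() call (if any) is involved.

```shell
#!/bin/sh
# Hypothetical stand-in for "$PROG $ARGS": a child shell that sends itself
# SIGALRM, which is what happens when an alarm() timer expires while no
# SIGALRM handler is installed.
status=0
sh -c 'kill -s ALRM "$$"' || status=$?

# Death by signal is reported as 128 + signal number; SIGALRM is 14 on Linux.
echo "exit status: $status"
if [ "$status" -eq 142 ]; then
    echo "child was terminated by SIGALRM"
fi
```

If the failing jobs show the same exit status in the accounting data, that
would suggest an alarm firing outside the section where the handler is
installed, rather than a problem with the job script file.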

Thanks & regards,

- Brendon

