[GE users] Random job failures

bdo brendon.oliver at gmail.com
Mon Dec 7 02:35:01 GMT 2009


Hi all,

I'm hoping someone can point me to where to look for the cause of this: one 
particular type of job I run fails at random with the following error in the 
job's stderr log:

/usr/local/sge/default/spool/{grid-host}/job_scripts/6044343: line 172: 27572 Alarm clock             $PROG $ARGS

Of course, by the time I find out about the failure, the job script file has 
long since been deleted.  It's not always the same {grid-host} that gets the 
error (I've seen it happen on several), but it is always at the same line of 
the job script file.
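
One way to still have the script on hand after a failure is to have the job 
keep its own copy as it starts.  A minimal sketch, assuming bash, that $0 is 
the spooled script path (as the error message suggests), and that SGE sets 
JOB_ID in the job's environment; the helper name keep_job_script and the 
target directory are hypothetical:

```shell
#!/bin/bash
# Hypothetical helper: copy the running job script to a directory that
# outlives the job, so the failing line can be inspected afterwards.
keep_job_script() {
    src=$1        # path of the running script, normally "$0"
    keep_dir=$2   # hypothetical directory for the saved copies
    job_id=$3     # normally "$JOB_ID" (assumed to be set by SGE)
    mkdir -p "$keep_dir" && cp "$src" "$keep_dir/job_script.$job_id"
}

# Assumed usage near the top of the job script:
# keep_job_script "$0" "$HOME/sge_job_copies" "$JOB_ID"
```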

The program being run by the grid job is a Perl program we've written 
in-house, but no matter what I try, I can't induce it to fail on demand with 
that error.  I've checked the Perl program too, but haven't seen anything that 
would cause this error.
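
For reference, "Alarm clock" is the text the shell prints for a child that was 
terminated by SIGALRM, so the same kind of death can be provoked by sending 
that signal by hand.  A minimal sketch, assuming bash on Linux (where SIGALRM 
is signal 14), with sleep standing in for $PROG $ARGS:

```shell
#!/bin/bash
# Stand-in for "$PROG $ARGS": any long-running child will do.
sleep 30 &
child=$!

# Deliver SIGALRM by hand, as an expired and unhandled alarm() timer would.
kill -ALRM "$child"

# wait returns 128 + the signal number when the child died from a signal.
wait "$child"
status=$?

# kill -l maps an exit status above 128 back to a signal name.
signame=$(kill -l "$status")
echo "exit status: $status, signal: $signame"
```

Recording the exit status of $PROG in the job script in the same way would at 
least confirm, per failure, whether the process died from SIGALRM rather than 
exiting on its own.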

In case it has any bearing, there are 6 exec hosts in the cluster queue, which 
is configured with 8 slots.  The exec hosts are all Gentoo Linux boxes, and the 
qmaster is a FreeBSD box.  All machines have SGE 6.0u7 installed (we're not in 
a position to upgrade just at the moment).

Any thoughts on where I should look?  Suggestions much appreciated.


Thanks & regards,

- Brendon.

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=231938
