[GE users] Random job failures

reuti reuti at staff.uni-marburg.de
Mon Dec 7 13:01:05 GMT 2009


Hi,

On 07.12.2009 at 03:35, bdo wrote:

> Hoping someone can point me where to look for the cause of this: one
> particular type of job I run fails randomly with the following error
> in the job's stderr log:
>
> /usr/local/sge/default/spool/{grid-host}/job_scripts/6044343: line 172: 27572 Alarm clock             $PROG $ARGS

I don't think the script was deleted from the job's point of view.
When a process has a file open (as is the case for the job script once
the job starts), another process can delete it, and it will no longer
be visible from the command line. But the job script is still open and
can be read by the job. In `lsof` output such files are marked with
the suffix "(deleted)".
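
To illustrate the point about open files, here is a minimal sketch in
Python, assuming a POSIX system such as the Linux exec hosts. The
function name and the sample text are made up for the illustration and
have nothing to do with SGE itself: the file's name is removed while a
handle is still open, and the contents remain readable through that
handle.

```python
import os
import tempfile


def read_after_unlink(text: str) -> str:
    """Write text to a temp file, delete its name, then read it back
    through the handle that is still open (POSIX behaviour)."""
    fd, path = tempfile.mkstemp()
    with os.fdopen(fd, "w+") as handle:
        handle.write(text)
        handle.flush()
        os.unlink(path)                   # the directory entry is gone
        assert not os.path.exists(path)   # no longer visible by name
        handle.seek(0)
        return handle.read()              # data is still reachable


# Hypothetical stand-in for a job script's contents.
print(read_after_unlink("echo hello from the job script\n"))
```

The data is only freed once the last open handle is closed, which is
why a running job can keep reading a script that has already vanished
from the spool directory listing.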

What program prints "Alarm clock"? If it is an error message, maybe it
is just trying to tell you the correct syntax for the command. Could
$PROG or $ARGS be empty?
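
Another possibility worth ruling out: on Linux with glibc, "Alarm
clock" is the description attached to the SIGALRM signal, and bash
reports a command killed by a signal in this "line N: PID ..." form.
Whether that is what happens to $PROG here is only an assumption. The
sketch below, in Python, uses a made-up child process (not the real
program) that arms a one-second alarm without installing a handler, and
shows how the parent sees the result.

```python
import signal
import subprocess
import sys

# Hypothetical child: arms a 1-second alarm, installs no SIGALRM
# handler, then sleeps. The default action for SIGALRM is to terminate.
CHILD = "import signal, time; signal.alarm(1); time.sleep(30)"


def run_until_alarm() -> int:
    """Run the child and return its exit status as subprocess reports it
    (a negative value means 'killed by that signal number')."""
    result = subprocess.run([sys.executable, "-c", CHILD])
    return result.returncode


code = run_until_alarm()
print(code == -signal.SIGALRM)           # True if SIGALRM killed the child
print(signal.strsignal(signal.SIGALRM))  # the platform's text for SIGALRM
```

If this is the cause, searching the program and the modules it loads
for calls to alarm() would be a reasonable next step.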

-- Reuti


> Of course, by the time I find out about the failure, the job script
> file has long since been deleted.  It's not always the same
> {grid-host} that gets the error - I've seen it happen on several, but
> it is always at the same line of the job script file.
>
> The program being run by the grid-job is a perl program we've written
> in-house, but no matter what I try, I can't induce it to fail
> on-demand with that error.  I've checked the perl program too, but
> haven't seen anything that would cause this error.
>
> If it's of any bearing, there are 6 exec hosts in the cluster queue
> which is configured with 8 slots.  The exec hosts are all Gentoo linux
> boxes, and the qmaster is a FreeBSD box.  All machines have SGE 6.0u7
> installed (we're not in a position to upgrade just at the moment).
>
> Any thoughts on where I should look?  Suggestions much appreciated.
>
>
> Thanks & regards,
>
> - Brendon.
>
> ------------------------------------------------------
> http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=231938
>
> To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=232013
