[GE users] queue going into an error state.

Reuti reuti at staff.uni-marburg.de
Wed Jun 28 21:52:12 BST 2006


Hi,

On 28.06.2006 at 22:34, Iwona Sakrejda wrote:

> Hi,
>
> Some of my queues intermittently go into the Error state. The problem
> comes and goes: sometimes it pops up and clears by itself, sometimes
> it persists for a few days.
> Here are the error messages from the execution host associated with
> this problem:
>
> Grateful for suggestions,
>
> Iwona
>
>
> 06/28/2006 13:08:16|execd|pc2623|E|shepherd of job 807621.1 died through signal = 7
> 06/28/2006 13:08:16|execd|pc2623|W|reaping job "807621" ptf complains: Job does not exist
> 06/28/2006 13:08:16|execd|pc2623|E|abnormal termination of shepherd for job 807621.1: no "exit_status" file
> 06/28/2006 13:08:16|execd|pc2623|E|cant open file active_jobs/807621.1/error: No such file or directory
> 06/28/2006 13:08:16|execd|pc2623|E|can't open pid file "active_jobs/807621.1/pid" for job 807621.1
> 06/28/2006 13:08:17|execd|pc2623|E|shepherd of job 754236.1 died through signal = 7
> 06/28/2006 13:08:17|execd|pc2623|W|reaping job "754236" ptf complains: Job does not exist
> 06/28/2006 13:08:17|execd|pc2623|E|abnormal termination of shepherd for job 754236.1: no "exit_status" file
> 06/28/2006 13:08:17|execd|pc2623|E|cant open file active_jobs/754236.1/error: No such file or directory
> 06/28/2006 13:08:17|execd|pc2623|E|can't open pid file "active_jobs/754236.1/pid" for job 754236.1
>
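
To see how often this happens and on which hosts, the quoted entries can be filtered mechanically. What follows is only a rough sketch in Python: it assumes every entry is a single line with five pipe-separated fields (timestamp, daemon, host, level, text), which is inferred from the excerpt above and not from any format specification. The names parse_line and shepherd_deaths are made up for this illustration.

```python
import re
from collections import Counter

# Sample entries copied from the quoted excerpt, one entry per line.
SAMPLE = """\
06/28/2006 13:08:16|execd|pc2623|E|shepherd of job 807621.1 died through signal = 7
06/28/2006 13:08:16|execd|pc2623|W|reaping job "807621" ptf complains: Job does not exist
06/28/2006 13:08:17|execd|pc2623|E|shepherd of job 754236.1 died through signal = 7
"""

# Matches the "shepherd ... died through signal" text seen in the excerpt.
DIED = re.compile(r"shepherd of job (\S+) died through signal = (\d+)")


def parse_line(line):
    """Split one entry into (timestamp, daemon, host, level, text), or None."""
    # Assumption: exactly five fields; the text itself may contain more pipes.
    parts = line.rstrip("\n").split("|", 4)
    if len(parts) != 5:
        return None
    return tuple(parts)


def shepherd_deaths(lines):
    """Return (host, job_id, signal_number) for every shepherd-death entry."""
    deaths = []
    for line in lines:
        record = parse_line(line)
        if record is None:
            continue  # skip anything that does not look like an entry
        _, _, host, _, text = record
        match = DIED.search(text)
        if match:
            deaths.append((host, match.group(1), int(match.group(2))))
    return deaths


if __name__ == "__main__":
    found = shepherd_deaths(SAMPLE.splitlines())
    for host, job, sig in found:
        print(host, job, sig)
    # Number of shepherd deaths per host
    print(Counter(host for host, _, _ in found))
```

Feeding it the messages files of several execution hosts would show whether the problem is confined to a few nodes.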

Is the spool directory local on the nodes, or is it shared from a file
server under $SGE_ROOT/default/spool? And which platform is this?
Signal 7 is SIGBUS on Linux, which seems odd.
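
Signal numbers differ between platforms, so it is worth checking the number on the execution host itself. A minimal sketch using Python's standard signal module; the helper name signal_name is made up for this example, and the result describes only the machine the code runs on:

```python
import signal


def signal_name(number):
    """Return the symbolic name of a signal number on the local machine."""
    try:
        return signal.Signals(number).name
    except ValueError:
        # The number is not a known signal on this platform.
        return "unknown signal {}".format(number)


if __name__ == "__main__":
    # Run this on the execution host that wrote the log entries.
    print(7, signal_name(7))
```

If this prints a name other than SIGBUS, the platform numbers its signals differently and the log entries need to be read accordingly.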

-- Reuti
