[GE users] queue going into an error state.

Iwona Sakrejda isakrejda at lbl.gov
Wed Jun 28 22:39:43 BST 2006


Hi,

My replies follow each of your questions below.

Reuti wrote:

> On 28.06.2006 at 22:34, Iwona Sakrejda wrote:

>> Some of my queues intermittently go into Error status. The problem
>> comes and goes.
>> Sometimes it just pops up and goes away, sometimes it persists for a few
>> days.
>> Here are the error messages from the execution host associated with this
>> problem:
>>
>> Grateful for suggestions,
>>
>> Iwona
>>
>>
>> 06/28/2006 13:08:16|execd|pc2623|E|shepherd of job 807621.1 died through signal = 7
>> 06/28/2006 13:08:16|execd|pc2623|W|reaping job "807621" ptf complains: Job does not exist
>> 06/28/2006 13:08:16|execd|pc2623|E|abnormal termination of shepherd for job 807621.1: no "exit_status" file
>> 06/28/2006 13:08:16|execd|pc2623|E|cant open file active_jobs/807621.1/error: No such file or directory
>> 06/28/2006 13:08:16|execd|pc2623|E|can't open pid file "active_jobs/807621.1/pid" for job 807621.1
>> 06/28/2006 13:08:17|execd|pc2623|E|shepherd of job 754236.1 died through signal = 7
>> 06/28/2006 13:08:17|execd|pc2623|W|reaping job "754236" ptf complains: Job does not exist
>> 06/28/2006 13:08:17|execd|pc2623|E|abnormal termination of shepherd for job 754236.1: no "exit_status" file
>> 06/28/2006 13:08:17|execd|pc2623|E|cant open file active_jobs/754236.1/error: No such file or directory
>> 06/28/2006 13:08:17|execd|pc2623|E|can't open pid file "active_jobs/754236.1/pid" for job 754236.1
>>
> 
> is the spool directory local on the nodes or somewhere shared in  
> $SGE_ROOT/default/spool from a file server? 
The spool directory is local on the compute nodes.

> And on which platform:
> signal 7 is a SIGBUS on Linux, which seems to be odd?
[root at pc2623 root]# cat /etc/redhat-release
Scientific Linux Release 303 (pdsf)

(This is almost identical to RHEL 3.0.)

[root at pc2623 root]# uname -a
Linux pc2623 2.4.21-27.0.2.ELp1smp #1 SMP Thu Feb 10 17:17:17 PST 2005 i686 i686 i386 GNU/Linux
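
Since signal numbers differ between platforms, the mapping from 7 to a name can be checked on the node itself. A minimal sketch, assuming a Python 3 interpreter is available on the host (the helper name `signal_name` is only illustrative):

```python
import signal


def signal_name(number):
    """Return the symbolic name of a signal number on this platform, or None."""
    try:
        return signal.Signals(number).name
    except ValueError:
        # The number is not a known signal on this platform.
        return None


if __name__ == "__main__":
    # 7 is the number reported in the execd messages quoted above.
    print(signal_name(7))
```

The result depends on the operating system and architecture the script runs on, so it should be run on the execution host that logged the error.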

Could it be that the shepherd dies because it cannot open the file with the job description?
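
One way to look at the order of events for each job is to group the execd messages by job ID. A minimal sketch in Python, assuming the pipe-separated layout seen in the excerpt above (date and time, daemon, host, level, message); the function name `group_by_job` and the `SAMPLE` text are illustrative only, and the sample lines are copied from the excerpt with the mail wrapping removed:

```python
import re
from collections import defaultdict

# A few lines from the excerpt above, rejoined after mail wrapping.
SAMPLE = """\
06/28/2006 13:08:16|execd|pc2623|E|shepherd of job 807621.1 died through signal = 7
06/28/2006 13:08:16|execd|pc2623|W|reaping job "807621" ptf complains: Job does not exist
06/28/2006 13:08:16|execd|pc2623|E|abnormal termination of shepherd for job 807621.1: no "exit_status" file
06/28/2006 13:08:17|execd|pc2623|E|shepherd of job 754236.1 died through signal = 7
"""

# Matches 'job 807621.1' as well as 'job "807621"' and captures the job number.
JOB_RE = re.compile(r'job "?(\d+)')


def group_by_job(text):
    """Map job number -> list of (timestamp, level, message) in log order."""
    jobs = defaultdict(list)
    for line in text.splitlines():
        parts = line.split("|", 4)
        if len(parts) != 5:
            continue  # not in the assumed five-field layout
        timestamp, _daemon, _host, level, message = parts
        match = JOB_RE.search(message)
        if match:
            jobs[match.group(1)].append((timestamp, level, message))
    return dict(jobs)


if __name__ == "__main__":
    for job, events in group_by_job(SAMPLE).items():
        print(job)
        for timestamp, level, message in events:
            print("  ", timestamp, level, message)
```

Messages that do not contain the word "job" followed by a number are skipped, so this shows only part of what execd logged for each job.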


Thank you,

iwona



> 
> -- Reuti
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
> For additional commands, e-mail: users-help at gridengine.sunsource.net
> 
