[GE users] a 't' status blocking a node

Reuti reuti at staff.uni-marburg.de
Thu Mar 15 10:32:50 GMT 2007


Hi,

On 14.03.2007 at 18:13, Alexis Salzman wrote:

> Hi all,
> with SGE 6.0u6 on AMD Opteron under Red Hat (Linux 2.4.21-27.ELsmp),
> a strange behaviour sometimes occurs.
> When the job starts, it changes to the 't' state and nothing happens.
> After a while the user tries to delete his job with qdel, without
> success: the state goes to 'dt' and nothing happens. As admin I tried
> to delete it with qmon, with no more success.
> Connecting to the execution node via ssh became impossible.
> Directly on the execution node:
>  netstat on the node shows a CLOSE_WAIT for sgeexecd.
>  /etc/init.d/sgeexecd stop hangs
>  kill -9 <pid sgeexecd> fails
> reboot hangs while trying to shut down sgeexecd => hard reboot needed

Did you also check /var/log/messages on this machine? I have seen such
behavior when a machine was running out of memory and the kernel
therefore killed some of the running processes at random (OOM killer).
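
A quick way to look for this is to search the system log for OOM killer
messages. The following is only a minimal sketch: the helper name
find_oom_events and the search patterns are illustrative assumptions,
since the exact wording of these messages differs between kernel
versions, and the log is assumed to be at /var/log/messages.

```shell
#!/bin/sh
# Hypothetical helper: print lines of a syslog file that look like
# OOM killer activity. The patterns are assumptions and may need
# adjusting for the kernel in use.
find_oom_events() {
    # Default to the usual syslog location if no file is given.
    logfile="${1:-/var/log/messages}"
    # -i: ignore case, -E: extended regex with alternation.
    grep -i -E 'out of memory|oom-killer|oom_kill' "$logfile"
}
```

For example, "find_oom_events /var/log/messages | tail -n 20" would show
the most recent matches. Entries shortly before the job got stuck
(around 15:30 here) would be the ones to look at.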

-- Reuti


> In the spool messages file of the execution node I find nothing
> before the reboot (16:30) about the blocking job (17285), which
> started and blocked around 15:30:
> .
> .
> .
> 03/12/2007 11:28:01|execd|homer01|E|shepherd of job 17156.1 exited with exit status = 25
> 03/14/2007 16:30:56|execd|homer01|I|starting up 6.0u6
> 03/14/2007 16:31:12|execd|homer01|E|acknowledge for unknown job 17285.1/master
> 03/14/2007 16:31:12|execd|homer01|E|incorrect config file for job 17285.1
> 03/14/2007 16:31:12|execd|homer01|E|ERROR: unlinking "jobs/00/0001/7285.1": No such file or directory
> 03/14/2007 16:31:12|execd|homer01|E|can not remove file job spool file: jobs/00/0001/7285.1
> 03/14/2007 16:31:12|execd|homer01|E|can't remove directory "active_jobs/17285.1": opendir(active_jobs/17285.1) failed: No such file or directory
>
> (a short time later the job is deleted from SGE)
> .
> .
> .
>
> I didn't touch the qmaster as I was afraid of propagating this
> blocking behaviour.... I'm not a guru ...
> Any ideas?
> Is there some kind of parameter or command to avoid the reboot?
> Is there some network latency parameter?
> Thanks in advance for any help.
> A.S.
