[GE users] a 't' status blocking a node
reuti at staff.uni-marburg.de
Thu Mar 15 10:32:50 GMT 2007
On 14.03.2007 at 18:13, Alexis Salzman wrote:
> Hi all,
> with SGE 6.0u6 on AMD Opteron under Red Hat (Linux 2.4.21-27.ELsmp),
> a strange behaviour occurs sometimes.
> When the job starts, it changes to the 't' state. Nothing happens.
> After a while the user tries to delete his job with qdel, without
> success: the state goes to 'dt' and nothing happens. As admin, I tried
> to delete it with qmon, with no more success.
> ssh to the running node became impossible.
> Directly on the running node:
> netstat on the running node shows a CLOSE_WAIT for sgeexecd.
> /etc/init.d/sgeexecd stop blocks
> kill -9 <pid sgeexecd> fails
> reboot blocks while trying to shut down sgeexecd => hard reboot needed
Did you also check /var/log/messages on this machine? I saw such
behavior when a machine was running out of memory and the kernel
therefore killed some of the running processes at random (OOM killer).
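
A quick way to check is to scan the syslog for OOM killer entries. The following is only a minimal sketch in Python. The match patterns are an assumption, since the exact wording of the kernel message differs between kernel versions, and `find_oom_lines` and `scan_file` are hypothetical helper names.

```python
import re

# Assumed patterns: the exact kernel wording differs between versions,
# so match loosely and case-insensitively.
OOM_PATTERN = re.compile(r"out of memory|oom[-_ ]?kill", re.IGNORECASE)


def find_oom_lines(lines):
    """Return the lines that look like OOM killer messages."""
    return [line.rstrip("\n") for line in lines if OOM_PATTERN.search(line)]


def scan_file(path):
    """Scan one log file; undecodable bytes are replaced, not fatal."""
    with open(path, errors="replace") as log:
        return find_oom_lines(log)
```

On the affected node this could be called as `scan_file("/var/log/messages")`, which usually requires root. Any hit shortly before the job went into the 't' state would support the out-of-memory explanation.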
> In the spool/message of the running node I find nothing before the
> reboot (16:30) about the blocked job (17285), which started and
> blocked around 15:30:
> 03/12/2007 11:28:01|execd|homer01|E|shepherd of job 17156.1 exited with exit status = 25
> 03/14/2007 16:30:56|execd|homer01|I|starting up 6.0u6
> 03/14/2007 16:31:12|execd|homer01|E|acknowledge for unknown job
> 03/14/2007 16:31:12|execd|homer01|E|incorrect config file for job
> 03/14/2007 16:31:12|execd|homer01|E|ERROR: unlinking "jobs/00/0001/7285.1": No such file or directory
> 03/14/2007 16:31:12|execd|homer01|E|can not remove file job spool file: jobs/00/0001/7285.1
> 03/14/2007 16:31:12|execd|homer01|E|can't remove directory "active_jobs/17285.1": opendir(active_jobs/17285.1) failed: No such file or directo
> (a short time later the job is deleted from SGE)
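
To narrow down such a messages file, the entries can be split on the '|' separator and filtered by job. This is a minimal sketch in Python, assuming each entry sits on one line with the five fields visible in the excerpt above (timestamp, daemon, host, level, text). The field names and the helpers `parse_entry` and `entries_for_job` are hypothetical, not part of SGE.

```python
from typing import NamedTuple, Optional


class Entry(NamedTuple):
    timestamp: str
    daemon: str
    host: str
    level: str  # e.g. "E" or "I", as seen in the excerpt
    text: str


def parse_entry(line: str) -> Optional[Entry]:
    """Split one log line into its five fields, or return None."""
    # Split at most four times so that '|' inside the text is preserved.
    parts = line.rstrip("\n").split("|", 4)
    if len(parts) != 5:
        return None
    return Entry(*parts)


def entries_for_job(lines, job_id: str):
    """Return parsed entries whose text mentions job_id (substring match)."""
    entries = (parse_entry(line) for line in lines)
    return [e for e in entries if e is not None and job_id in e.text]
```

The match is a plain substring test, so a spool path such as jobs/00/0001/7285.1 is not found when searching for "17285"; searching for "7285.1" catches both forms, at the cost of possible false hits.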
> I didn't touch the qmaster, as I was afraid of propagating this
> blocking behaviour... I'm not a guru...
> Any ideas?
> Is there some kind of parameter or command to avoid the reboot?
> Is there some network latency parameter?
> Thanks in advance for any help.