[GE users] a 't' status blocking a node
alexis.salzman at medysys.com
Wed Mar 14 17:13:30 GMT 2007
With SGE 6.0u6 on AMD Opteron under Red Hat (Linux 2.4.21-27.ELsmp), a
strange behaviour occurs from time to time.
When the job starts, it changes to the 't' state and nothing more
happens. After a while the user tries to delete his job with qdel,
without success: the state goes to 'dt' and again nothing happens. As
admin, I tried to delete it with qmon, with no more success.
Logging in to the running node via ssh became impossible.
Directly on the running node:
* netstat on the running node shows a CLOSE_WAIT for sgeexecd.
* /etc/init.d/sgeexecd stop blocks
* kill -9 <pid sgeexecd> fails
* reboot blocks while trying to shut down sgeexecd => hard reboot needed
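One way to see why kill -9 has no effect would be to look at the process state in /proc. Below is a minimal sketch in Python, assuming the daemon's pid is known and that the stat line has the usual Linux "pid (comm) state ..." layout. The function names are hypothetical, and a 'D' (uninterruptible sleep) state is only one possible explanation, not something confirmed on this node:

```python
def parse_proc_stat(stat_line):
    """Return (pid, command, state) from the text of /proc/<pid>/stat."""
    # The command name is wrapped in parentheses and may itself contain
    # spaces or parentheses, so split on the LAST closing parenthesis.
    left, right = stat_line.rsplit(")", 1)
    pid_text, comm = left.split("(", 1)
    state = right.split()[0]
    return int(pid_text), comm, state


def read_proc_state(pid):
    """Read the state letter of a live process (Linux only)."""
    with open("/proc/%d/stat" % pid) as handle:
        return parse_proc_stat(handle.read())[2]


def is_uninterruptible(state):
    # 'D' means uninterruptible sleep: pending signals, SIGKILL included,
    # are not acted upon until the process leaves that state.
    return state == "D"
```

Calling read_proc_state(pid) with the pid of the blocked daemon and passing the result to is_uninterruptible() would tell whether the process is stuck inside the kernel (for example waiting on I/O), in which case no signal can remove it.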
In the spool messages file of the running node, I find nothing before
the reboot (16:30) about the blocking job (17285), which started and
blocked around 15:30:
03/12/2007 11:28:01|execd|homer01|E|shepherd of job 17156.1 exited with
exit status = 25
03/14/2007 16:30:56|execd|homer01|I|starting up 6.0u6
03/14/2007 16:31:12|execd|homer01|E|acknowledge for unknown job
03/14/2007 16:31:12|execd|homer01|E|incorrect config file for job 17285.1
03/14/2007 16:31:12|execd|homer01|E|ERROR: unlinking
"jobs/00/0001/7285.1": No such file or directory
03/14/2007 16:31:12|execd|homer01|E|can not remove file job spool file:
03/14/2007 16:31:12|execd|homer01|E|can't remove directory
"active_jobs/17285.1": opendir(active_jobs/17285.1) failed: No such file
(a short time later the job is deleted from SGE)
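To search a messages file like the one above for a given job, a small parser can help. This is a minimal sketch in Python, assuming each entry has the pipe-separated layout seen in the excerpt (timestamp|daemon|host|level|message) and that any line not matching it is a wrapped continuation of the previous entry. The function names are hypothetical:

```python
from datetime import datetime


def parse_messages(lines):
    """Turn raw messages-file lines into a list of entry dicts."""
    entries = []
    for raw in lines:
        line = raw.rstrip("\n")
        parts = line.split("|", 4)
        stamp = None
        if len(parts) == 5:
            try:
                # Timestamps in the excerpt look like 03/14/2007 16:31:12
                stamp = datetime.strptime(parts[0], "%m/%d/%Y %H:%M:%S")
            except ValueError:
                stamp = None
        if stamp is not None:
            entries.append({
                "time": stamp,
                "daemon": parts[1],
                "host": parts[2],
                "level": parts[3],
                "message": parts[4],
            })
        elif entries and line.strip():
            # Wrapped line: append it to the previous entry's message.
            entries[-1]["message"] += " " + line.strip()
    return entries


def entries_mentioning(entries, text):
    """Keep only the entries whose message contains the given text."""
    return [entry for entry in entries if text in entry["message"]]
```

Note that the job spool path in the excerpt ("jobs/00/0001/7285.1") splits the job number across directories, so a plain search for "17285" would not match that line.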
I didn't touch the qmaster, as I was afraid of propagating this
blocking behaviour... I am not a guru...
Any ideas?
Is there some kind of parameter or command to avoid the reboot?
Is there some network latency parameter?
Thanks in advance for any help.