[GE users] a 't' status blocking a node

Alexis Salzman alexis.salzman at medysys.com
Wed Mar 14 17:13:30 GMT 2007



Hi all,
With SGE 6.0u6 on AMD Opteron under Red Hat (Linux 2.4.21-27.ELsmp), a
strange behaviour occurs from time to time.
When the job starts, it changes to the 't' state and nothing happens.
After a while the user tries to delete his job with qdel, without
success: the state goes to 'dt' and still nothing happens. As admin, I
tried to delete it with qmon, with no more success.
Logging in to the running node over ssh became impossible.
Directly on the running node:

    *  A netstat on the running node indicates a CLOSE_WAIT for sgeexecd.
    *  /etc/init.d/sgeexecd stop blocks
    *  kill -9 <pid sgeexecd> fails
    *  reboot blocks while trying to shut down sgeexecd => hard reboot needed
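
The netstat check above can be repeated with a small script. This is
only a minimal sketch: the function names are made up for illustration,
it assumes the Linux net-tools netstat with the -t, -a, -n and -p
options (state in the 6th column, "PID/Program name" last), and the
process name to look for is supplied by the caller.

```python
import subprocess


def close_wait_lines(netstat_output, process_name):
    """Return the netstat lines in CLOSE_WAIT state that mention process_name.

    Assumes `netstat -tanp` style output: the connection state is the
    6th column and the remaining columns hold "PID/Program name".
    """
    matches = []
    for line in netstat_output.splitlines():
        fields = line.split()
        if len(fields) < 7:
            continue  # header lines or rows without a PID/program column
        state = fields[5]
        program = " ".join(fields[6:])
        if state == "CLOSE_WAIT" and process_name in program:
            matches.append(line)
    return matches


def check_local_node(process_name):
    """Run netstat on the local node and filter its output (hypothetical helper)."""
    result = subprocess.run(
        ["netstat", "-tanp"], capture_output=True, text=True, check=False
    )
    return close_wait_lines(result.stdout, process_name)
```

Calling check_local_node() as root with the name of the execd process
lists its sockets stuck in CLOSE_WAIT; without root, netstat may not
show the program column for other users' processes.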


In the spool/message file of the running node I find nothing before the
reboot (16:30) about the blocking job (17285), which started and blocked
around 15:30:
.
.
.
03/12/2007 11:28:01|execd|homer01|E|shepherd of job 17156.1 exited with exit status = 25
03/14/2007 16:30:56|execd|homer01|I|starting up 6.0u6
03/14/2007 16:31:12|execd|homer01|E|acknowledge for unknown job 17285.1/master
03/14/2007 16:31:12|execd|homer01|E|incorrect config file for job 17285.1
03/14/2007 16:31:12|execd|homer01|E|ERROR: unlinking "jobs/00/0001/7285.1": No such file or directory
03/14/2007 16:31:12|execd|homer01|E|can not remove file job spool file: jobs/00/0001/7285.1
03/14/2007 16:31:12|execd|homer01|E|can't remove directory "active_jobs/17285.1": opendir(active_jobs/17285.1) failed: No such file or directory

(a short time later the job is deleted from SGE)
.
.
.
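
To search a messages file like the one above for a given job, something
along these lines could be used. It is a minimal sketch, not an SGE
tool: the function names are invented, and the
"date time|daemon|host|level|message" layout is assumed from the
excerpt, with one entry per line in the real file.

```python
from datetime import datetime


def parse_line(line):
    """Split one messages line into its five '|'-separated fields.

    Returns None for lines that do not match the assumed layout.
    """
    parts = line.rstrip("\n").split("|", 4)
    if len(parts) != 5:
        return None
    stamp, daemon, host, level, message = parts
    try:
        # Timestamps in the excerpt look like 03/14/2007 16:31:12
        when = datetime.strptime(stamp, "%m/%d/%Y %H:%M:%S")
    except ValueError:
        return None
    return {
        "time": when,
        "daemon": daemon,
        "host": host,
        "level": level,
        "message": message,
    }


def entries_for_job(lines, job_id):
    """Return the parsed entries whose message text contains job_id."""
    needle = str(job_id)
    entries = (parse_line(line) for line in lines)
    return [e for e in entries if e is not None and needle in e["message"]]
```

Note that a plain text match on 17285 misses the two entries that only
give the spool path jobs/00/0001/7285.1, so those have to be searched
for separately.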

I didn't touch the qmaster, as I was afraid of propagating this
blocking behaviour. I am not a guru.
Any ideas?
Is there some kind of parameter or command that would avoid the reboot?
Is there some network latency parameter?
Thanks in advance for any help.
A.S.


