[GE users] execd behaviour in case of qmaster crash
ahaupt at ifh.de
Thu Jun 11 07:27:32 BST 2009
We have a test setup of SGE 6.2u2 with a configured shadow master here.
Yesterday I somehow managed to crash the qmaster host (I submitted a
parallel job as an array job and requested a reservation). Shortly
afterwards the qmaster stopped responding, although the host could
still be pinged.
After a few minutes the shadow master reacted, took over the qmaster
process and correctly updated $SGE_ROOT/$SGE_CELL/common/act_qmaster.
However, the execd processes on several nodes (all of them!) did not
drop their connection to the crashed qmaster, even hours later. As a
result I can now submit and query jobs, but they no longer get executed.
The exec and master hosts are RHEL5 systems. Netstat shows an attempt
to send packets to the crashed master (lolek-vm1), but this seems to
hang forever.
[hpbl1] ~ # netstat -tp | egrep '(Proto|sge)'
Proto Recv-Q Send-Q Local Address       Foreign Address              State        PID/Program name
tcp        0  10889 hpbl1.ifh.de:47891  lolek-vm1.ifh.d:sge_qmaster  ESTABLISHED  5466/sge_execd
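To check other exec hosts for the same symptom, the netstat output above can be filtered for sge_execd connections that still have unsent data queued (a non-zero Send-Q). This is only a rough sketch: the function name stuck_execd_connections is made up for illustration, and it assumes the `netstat -tp` column layout shown above (Send-Q in column 3, foreign address in column 5, program name in the last column).

```shell
# Hypothetical helper, not part of Grid Engine.
# Reads `netstat -tp` output on stdin and prints the peer address and
# Send-Q value of every TCP connection owned by sge_execd that still
# has unsent bytes queued.
stuck_execd_connections() {
    awk '$1 ~ /^tcp/ && $3 > 0 && $NF ~ /sge_execd$/ { print $5, $3 }'
}
```

It could be used as `netstat -tp | stuck_execd_connections`. A Send-Q value that stays above zero over repeated runs suggests the peer is no longer acknowledging data.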
Is there a timeout somewhere that I forgot to configure?
Thanks & cheers,
| Andreas Haupt | E-Mail: andreas.haupt at desy.de
| DESY Zeuthen | WWW: http://www-zeuthen.desy.de/~ahaupt
| Platanenallee 6 | Phone: +49/33762/7-7359
| D-15738 Zeuthen | Fax: +49/33762/7-7216