[GE users] execd behaviour in case of qmaster crash

ah_sunsource ahaupt at ifh.de
Thu Jun 11 07:27:32 BST 2009


We have a test setup of SGE 6.2u2 here with a shadow master configured.
Yesterday I somehow managed to crash the qmaster host (I submitted a
parallel job as an array job and requested a reservation). Shortly after
that the qmaster stopped responding, although the host could still be
pinged.

The shadow master reacted after a few minutes, took over the qmaster
role and updated $SGE_ROOT/$SGE_CELL/common/act_qmaster correctly.
But the execd processes on several nodes (all of them!) did not drop
their connection to the crashed qmaster, even after hours. As a result
I can submit and query jobs, but they are no longer executed.

The exec and master hosts are RHEL5 systems. netstat shows that execd is
still trying to send data to the crashed master (lolek-vm1), but this
seems to hang forever.

[hpbl1] ~ # netstat -tp | egrep '(Proto|sge)'
Proto Recv-Q Send-Q Local Address               Foreign Address             State       PID/Program name   
tcp        0  10889 hpbl1.ifh.de:47891          lolek-vm1.ifh.d:sge_qmaster ESTABLISHED 5466/sge_execd      
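
In case it helps with checking the other nodes, here is a minimal
sketch of how the netstat output above could be scanned for ESTABLISHED
connections with a non-empty Send-Q. It assumes Python 3 and the column
layout shown above; the names stuck_connections, min_send_q and SAMPLE
are made up for illustration and are not part of SGE.

```python
def stuck_connections(netstat_output, min_send_q=1):
    """Return (send_q, local, foreign, program) for ESTABLISHED TCP rows
    whose Send-Q is at least min_send_q bytes."""
    rows = []
    for line in netstat_output.splitlines():
        fields = line.split()
        # Expected columns of `netstat -tp`:
        # Proto Recv-Q Send-Q Local Foreign State PID/Program
        if len(fields) < 7 or not fields[0].startswith("tcp"):
            continue  # skips the header line and anything malformed
        proto, recv_q, send_q, local, foreign, state, program = fields[:7]
        if state != "ESTABLISHED" or not send_q.isdigit():
            continue
        if int(send_q) >= min_send_q:
            rows.append((int(send_q), local, foreign, program))
    return rows


# Sample input copied from the netstat output above.
SAMPLE = """\
Proto Recv-Q Send-Q Local Address               Foreign Address             State       PID/Program name
tcp        0  10889 hpbl1.ifh.de:47891          lolek-vm1.ifh.d:sge_qmaster ESTABLISHED 5466/sge_execd
"""

if __name__ == "__main__":
    for send_q, local, foreign, program in stuck_connections(SAMPLE):
        print(f"{program}: {send_q} bytes unacknowledged, {local} -> {foreign}")
```

On a live node the text would come from running netstat -tp rather than
from SAMPLE. A Send-Q that stays non-zero across repeated runs is what
points to a peer that no longer acknowledges data.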

Is there a timeout somewhere that I failed to configure?

Thanks & cheers,

| Andreas Haupt             | E-Mail: andreas.haupt at desy.de
|  DESY Zeuthen             | WWW:    http://www-zeuthen.desy.de/~ahaupt
|  Platanenallee 6          | Phone:  +49/33762/7-7359
|  D-15738 Zeuthen          | Fax:    +49/33762/7-7216


