[GE users] execd behaviour in case of qmaster crash

rayson rayrayson at gmail.com
Thu Jun 11 07:33:25 BST 2009


You can try to manually migrate the master to another host, but the
shadow master should automatically handle everything for you.

http://gridengine.sunsource.net/howto/sge_migrate.html

Is your $SGE_ROOT shared?
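
A rough way to check, on an exec host, whether the master named in
act_qmaster is the host sge_execd is still connected to. This is only a
sketch: the function names are made up for illustration, it assumes
act_qmaster holds a single line with the master's host name, and it
compares short host names only, because netstat truncates long ones.

```shell
#!/bin/sh
# Sketch only: function names are hypothetical, not part of Grid Engine.

# Print the short host name stored in an act_qmaster file.
# Assumption: the file holds one line with the master's host name.
current_master() {
    head -n 1 "$1" | cut -d. -f1
}

# Read `netstat -tp` output on stdin and print the short host name of
# every peer that an sge_execd process holds a TCP connection to.
execd_peers() {
    awk '/sge_execd/ { split($5, a, /[.:]/); print a[1] }' | sort -u
}

# $1: path to act_qmaster; netstat output on stdin.
# Returns 0 if all execd peers match act_qmaster, 1 if any peer differs,
# 2 if no sge_execd connection was found in the input.
execd_follows_master() {
    master=$(current_master "$1")
    peers=$(execd_peers)
    if [ -z "$peers" ]; then
        echo "no sge_execd connection found"
        return 2
    fi
    stale=$(printf '%s\n' "$peers" | grep -v -x "$master")
    if [ -n "$stale" ]; then
        echo "execd still connected to: $stale (act_qmaster says $master)"
        return 1
    fi
    echo "execd peer matches act_qmaster ($master)"
    return 0
}
```

With the functions loaded, it could be run on an exec host like this:

    netstat -tp 2>/dev/null | execd_follows_master "$SGE_ROOT/$SGE_CELL/common/act_qmaster"

If act_qmaster on the exec host still names the old master, $SGE_ROOT is
probably not shared with that host.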

Rayson



On 6/11/09, ah_sunsource <ahaupt at ifh.de> wrote:
> Hi,
>
> we have a test setup of SGE 6.2u2 with a configured shadow master here.
> Yesterday I managed to crash the qmaster host somehow (I submitted a
> parallel job as an array job and requested reservation). Shortly after
> that, qmaster stopped responding, although the host could still be pinged.
>
> The shadow master reacted after some minutes and took over the qmaster
> process and modified $SGE_ROOT/$SGE_CELL/common/act_qmaster correctly.
> But all of the execd processes on several nodes kept their connection
> to the crashed qmaster (even now, hours later). As a result I can
> submit and query jobs, but they are no longer executed.
>
> The exec and master hosts are RHEL5 systems. netstat shows an attempt
> to send data to the crashed master (lolek-vm1), but the connection
> seems to hang forever.
>
> [hpbl1] ~ # netstat -tp | egrep '(Proto|sge)'
> Proto Recv-Q Send-Q Local Address               Foreign Address             State       PID/Program name
> tcp        0  10889 hpbl1.ifh.de:47891          lolek-vm1.ifh.d:sge_qmaster ESTABLISHED 5466/sge_execd
>
> Is there a timeout somewhere that I forgot to configure?
>
> Thanks & cheers,
> Andreas
>
> --
> | Andreas Haupt             | E-Mail: andreas.haupt at desy.de
> |  DESY Zeuthen             | WWW:    http://www-zeuthen.desy.de/~ahaupt
> |  Platanenallee 6          | Phone:  +49/33762/7-7359
> |  D-15738 Zeuthen          | Fax:    +49/33762/7-7216
>
> ------------------------------------------------------
> http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=201501
>
> To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].
>

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=201503
