[GE issues] [Issue 2900] New - qmaster fail-over results in very slow execd reconnect

templedf dan.templeton at sun.com
Tue Feb 3 15:20:16 GMT 2009


http://gridengine.sunsource.net/issues/show_bug.cgi?id=2900
                 Issue #|2900
                 Summary|qmaster fail-over results in very slow execd reconnect
               Component|gridengine
                 Version|6.2u1
                Platform|All
                     URL|
              OS/Version|All
                  Status|NEW
       Status whiteboard|
                Keywords|
              Resolution|
              Issue type|DEFECT
                Priority|P2
            Subcomponent|execution
             Assigned to|pollinger
             Reported by|templedf






------- Additional comments from templedf at sunsource.net Tue Feb  3 07:20:15 -0800 2009 -------
In a 25-node cluster with one shadow host, when the qmaster fails over to the
shadow, the execution daemons take over 15 minutes to reconnect to the new
master.  The fail-over is triggered by taking down the primary interface on the
master (ifconfig eth0 down).  The hosts are running RHEL 5.  This is 100%
reproducible.  When the master is migrated back (sgemaster -migrate), the execds
reconnect immediately.  If the original master's primary interface is brought
back up before the 15 minutes have passed, the execds immediately begin
reconnecting to the new master.
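
To put a number on the reconnect delay when reproducing this, a small polling
helper along the following lines might be useful.  This is only a sketch and
not part of Grid Engine: the function name time_until and its arguments are
made up for illustration, and the check command that decides whether the
execds have reconnected is left to the reader.

```shell
#!/bin/sh
# Hypothetical helper (not shipped with Grid Engine).
# Polls a command until it succeeds and prints the elapsed whole seconds.
# Usage: time_until LIMIT_SECONDS INTERVAL_SECONDS COMMAND [ARGS...]
# Returns 0 on success, 1 if LIMIT_SECONDS passes without success.
time_until() {
    limit=$1
    interval=$2
    shift 2
    start=$(date +%s)
    while :; do
        # Run the check quietly; only its exit status matters.
        if "$@" >/dev/null 2>&1; then
            echo $(( $(date +%s) - start ))
            return 0
        fi
        now=$(date +%s)
        if [ $(( now - start )) -ge "$limit" ]; then
            return 1
        fi
        sleep "$interval"
    done
}
```

For example, right after running "ifconfig eth0 down" on the master, one could
run "time_until 1800 10 my_check" on the shadow host, where my_check is a
placeholder for whatever site-specific test confirms that the execds are
reporting to the new master again.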

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=36&dsMessageId=101697
