Opened 11 years ago

Last modified 9 years ago

#914 new defect

IZ624: GE service stays in RUNNING state if connection qmaster is lost

Reported by: rhierlmeier Owned by:
Priority: normal Milestone:
Component: hedeby Version: current
Severity: Keywords: Sun gridengine_adapter
Cc:

Description

[Imported from gridengine issuezilla http://gridengine.sunsource.net/issues/show_bug.cgi?id=624]

        Issue #:      624                      Platform:     Sun         Reporter: rhierlmeier (rhierlmeier)
       Component:     hedeby                      OS:        All
     Subcomponent:    gridengine_adapter       Version:      current        CC:    None defined
        Status:       NEW                      Priority:     P3
      Resolution:                             Issue type:    DEFECT
                                           Target milestone: 1.0u5next
      Assigned to:    rhierlmeier (rhierlmeier)
      QA Contact:     rhierlmeier
          URL:
       * Summary:     GE service stays in RUNNING state if connection qmaster is lost
   Status whiteboard:
      Attachments:


     Issue 624 blocks:


   Opened: Tue Feb 24 02:03:00 -0700 2009 
------------------------


   Description

   If the connection to qmaster is lost during an update of a GE service, the GE
   service does not go into UNKNOWN state.

   Evaluation

   The bug can cause confusion, because once the problem occurs no more JGDI
   events are received. Changes to the resources on the qmaster side are no
   longer reflected on the hedeby side.
   Customers will notice that some resources are missing or outdated.

   Suggested Fix/Work Around

   A 'sdmadm sds' followed by an 'sdmadm sus' on the affected GE service solves
   the problem.

   Analysis

   The problem always occurs if the connection to qmaster is lost while the
   service is in RELOADING state. The connectionLost method of class
   GEServiceAdapterImpl gets an InvalidStateTransistionException when it starts
   the reconnect service transition. In this case only a log message at level
   FINE is written, and no further action is triggered.

   Similar problems can also occur if the connection to qmaster is lost during the
   startup of the service.

   To solve the problem, the connectionLost method should take the service state
   into account. If the service is in RELOADING or STARTING state, it should wait
   until the service goes into RUNNING state and then trigger the reconnect.
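
   The proposed behaviour can be sketched as follows. This is only a minimal
   model under stated assumptions, not the hedeby implementation: the class
   DeferredReconnectSketch, its State enum, and the methods phaseFinished and
   getHistory are hypothetical names, and it assumes that starting the reconnect
   moves the service into UNKNOWN state.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for GEServiceAdapterImpl; not the real hedeby class. */
class DeferredReconnectSketch {

    /** Simplified subset of the service states named in this issue. */
    enum State { STARTING, RELOADING, RUNNING, UNKNOWN }

    private final List<State> history = new ArrayList<>();
    private State state;
    private boolean reconnectPending;

    DeferredReconnectSketch(State initial) {
        enter(initial);
    }

    private void enter(State next) {
        state = next;
        history.add(next);
    }

    /** Called when the connection to qmaster is lost. */
    synchronized void connectionLost() {
        if (state == State.STARTING || state == State.RELOADING) {
            // Starting the reconnect now would be an invalid transition,
            // so remember the request instead of dropping it.
            reconnectPending = true;
        } else if (state == State.RUNNING) {
            startReconnect();
        }
    }

    /** Called when the startup or reload phase has finished. */
    synchronized void phaseFinished() {
        enter(State.RUNNING);
        if (reconnectPending) {
            reconnectPending = false;
            startReconnect();
        }
    }

    /** Assumption: the reconnect transition first moves the service to UNKNOWN. */
    private void startReconnect() {
        enter(State.UNKNOWN);
    }

    /** Returns a copy of every state entered so far, in order. */
    synchronized List<State> getHistory() {
        return new ArrayList<>(history);
    }
}
```

   Recording the pending reconnect in a flag, rather than blocking inside
   connectionLost, keeps the event-delivering thread free while the reload or
   startup completes.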

   How to test

   Reproducing this bug is nearly impossible because it only occurs under very
   rare timing conditions. We need a junit test for this specific scenario. This
   junit test must block the reloading and starting phases of a GE service and
   inject the CONNECT_LOST event into the JGDI event mechanism.
   The test must check that the service goes from RELOADING to RUNNING and
   finally into UNKNOWN state.
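
   One way to make the timing deterministic is to hold the reload phase open
   with a latch while the event is injected. The sketch below is written in
   plain Java so that it is self-contained. BlockedReloadScenario and
   FakeService are hypothetical names; a real test would drive the actual GE
   service adapter and the JGDI event mechanism from a junit test method
   instead of this fake.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

/** Hypothetical scenario driver; models the test idea, not hedeby itself. */
class BlockedReloadScenario {

    enum State { RELOADING, RUNNING, UNKNOWN }

    /** Hypothetical fake of the GE service adapter with a deferred reconnect. */
    static class FakeService {
        private final List<State> history = new ArrayList<>();
        private State state;
        private boolean reconnectPending;

        private void enter(State next) {
            state = next;
            history.add(next);
        }

        /** Reload that stays in RELOADING until the test releases mayFinish. */
        void reload(CountDownLatch entered, CountDownLatch mayFinish) {
            synchronized (this) { enter(State.RELOADING); }
            entered.countDown();
            try {
                mayFinish.await();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return;
            }
            synchronized (this) {
                enter(State.RUNNING);
                if (reconnectPending) {
                    reconnectPending = false;
                    enter(State.UNKNOWN); // assumed effect of the reconnect
                }
            }
        }

        /** Stands in for delivery of the CONNECT_LOST event. */
        synchronized void connectionLost() {
            if (state == State.RELOADING) {
                reconnectPending = true;
            } else if (state == State.RUNNING) {
                enter(State.UNKNOWN);
            }
        }

        synchronized List<State> history() {
            return new ArrayList<>(history);
        }
    }

    /** Blocks the reload, injects the lost connection, returns the states seen. */
    static List<State> run() {
        FakeService service = new FakeService();
        CountDownLatch entered = new CountDownLatch(1);
        CountDownLatch mayFinish = new CountDownLatch(1);
        Thread reloader = new Thread(() -> service.reload(entered, mayFinish));
        reloader.start();
        try {
            if (!entered.await(5, TimeUnit.SECONDS)) {
                throw new IllegalStateException("reload did not start");
            }
            service.connectionLost(); // injected while the reload is blocked
            mayFinish.countDown();    // now let the reload finish
            reloader.join(5000);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return service.history();
    }
}
```

   A junit test built on this idea would assert that the sequence returned by
   run() is RELOADING, RUNNING, UNKNOWN. The same latch pattern applies to the
   STARTING phase.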



   ATC: 0.5 PD
   ETC: 4 PD
               ------- Additional comments from rhierlmeier Wed Nov 25 07:21:11 -0700 2009 -------
   Milestone changed

Change History (0)
