Opened 12 years ago
Last modified 10 years ago
#914 new defect
IZ624: GE service stays in RUNNING state if connection to qmaster is lost
Reported by: | rhierlmeier | Owned by: | |
---|---|---|---|
Priority: | normal | Milestone: | |
Component: | hedeby | Version: | current |
Severity: | | Keywords: | Sun gridengine_adapter |
Cc: | | | |
Description
[Imported from gridengine issuezilla http://gridengine.sunsource.net/issues/show_bug.cgi?id=624]
Issue #: 624
Platform: Sun
Reporter: rhierlmeier (rhierlmeier)
Component: hedeby
OS: All
Subcomponent: gridengine_adapter
Version: current
CC: None defined
Status: NEW
Priority: P3
Resolution:
Issue type: DEFECT
Target milestone: 1.0u5next
Assigned to: rhierlmeier (rhierlmeier)
QA Contact: rhierlmeier
URL:
* Summary: GE service stays in RUNNING state if connection to qmaster is lost
Status whiteboard:
Attachments:
Issue 624 blocks:
Opened: Tue Feb 24 02:03:00 -0700 2009
------------------------

Description
If the connection to qmaster is lost during an update of a GE service, the GE service does not go into UNKNOWN state.

Evaluation
The bug can cause confusion: once the problem occurs, no more JGDI events are received, so changes to resources on the qmaster side are no longer reflected on the hedeby side. The customer will notice that some resources are missing or outdated.

Suggested Fix/Work Around
Running 'sdmadm sds' followed by 'sdmadm sus' on the affected GE service solves the problem.

Analysis
The problem always occurs if the connection to qmaster is lost while the service is in RELOADING state. The connectionLost method of class GEServiceAdapterImpl gets an InvalidStateTransistionException when it starts the reconnect service transition. In this case only a log message at level FINE is written, and no further action is triggered. Similar problems can also occur if the connection to qmaster is lost during the startup of the service. To solve the problem, the connectionLost method should take the service state into account: if the service is in RELOADING or STARTING state, it should wait until the service goes into RUNNING state and then trigger the reconnect.

How to test
Reproducing this bug is nearly impossible because it only occurs under very rare timing conditions. We need a junit test for this specific scenario. This junit test must block the reloading and starting phases of a GE service and inject the CONNECT_LOST event into the JGDI event mechanism. The test must check that the service goes from RELOADING to RUNNING and finally into UNKNOWN state.

ATC: 0.5 PD
ETC: 4 PD

------- Additional comments from rhierlmeier Wed Nov 25 07:21:11 -0700 2009 -------
Milestone changed
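
As an illustration of the fix proposed in the Analysis, the following is a minimal sketch of a connectionLost handler that defers the reconnect until the service has reached RUNNING state. It is not the real GEServiceAdapterImpl: the class DeferredReconnectAdapter, the ServiceState enum, and the methods transitionFinished, startReload, state and history are hypothetical names introduced only for this example, and the reconnect transition is reduced to a plain switch to UNKNOWN.

```java
import java.util.ArrayList;
import java.util.List;

/** Simplified service states; the names follow the ticket text. */
enum ServiceState { STARTING, RUNNING, RELOADING, UNKNOWN }

/**
 * Hypothetical stand-in for the adapter described in the ticket.
 * All members are assumptions made for illustration only.
 */
class DeferredReconnectAdapter {
    private ServiceState state = ServiceState.STARTING;
    private boolean reconnectPending = false;
    private final List<ServiceState> history = new ArrayList<>();

    DeferredReconnectAdapter() {
        history.add(state);
    }

    /** Called when the connection to qmaster is lost. */
    synchronized void connectionLost() {
        if (state == ServiceState.RUNNING) {
            startReconnect();
        } else if (state == ServiceState.STARTING || state == ServiceState.RELOADING) {
            // Remember the event instead of only logging the failed transition.
            reconnectPending = true;
        }
        // In UNKNOWN state there is nothing left to do.
    }

    /** Called when the startup or reload phase has finished. */
    synchronized void transitionFinished() {
        if (state != ServiceState.STARTING && state != ServiceState.RELOADING) {
            throw new IllegalStateException("no transition in progress: " + state);
        }
        setState(ServiceState.RUNNING);
        if (reconnectPending) {
            // A connection loss arrived during the transition; handle it now.
            reconnectPending = false;
            startReconnect();
        }
    }

    /** Starts a reload; only allowed while the service is RUNNING. */
    synchronized void startReload() {
        if (state != ServiceState.RUNNING) {
            throw new IllegalStateException("cannot reload in state " + state);
        }
        setState(ServiceState.RELOADING);
    }

    /** Simplified reconnect transition: the service becomes UNKNOWN. */
    private void startReconnect() {
        setState(ServiceState.UNKNOWN);
    }

    private void setState(ServiceState newState) {
        state = newState;
        history.add(newState);
    }

    synchronized ServiceState state() {
        return state;
    }

    /** Returns a copy of all states the service has been in, in order. */
    synchronized List<ServiceState> history() {
        return new ArrayList<>(history);
    }
}
```

With this sketch, a connection loss injected while the service is RELOADING leaves the state unchanged until transitionFinished is called. The recorded history is then STARTING, RUNNING, RELOADING, RUNNING, UNKNOWN, which is the sequence the junit test described under "How to test" is meant to check.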