Opened 50 years ago

Last modified 9 years ago

#878 new task

IZ524: wrong setup of jmx server in qmaster results in dropping of jgdi connection

Reported by: zwierzak Owned by:
Priority: normal Milestone:
Component: hedeby Version: 1.0
Severity: Keywords: Sun gridengine_adapter
Cc:

Description

[Imported from gridengine issuezilla http://gridengine.sunsource.net/issues/show_bug.cgi?id=524]

        Issue #:      524                      Platform:     Sun         Reporter: zwierzak (zwierzak)
       Component:     hedeby                      OS:        All
     Subcomponent:    gridengine_adapter       Version:      1.0            CC:    None defined
        Status:       NEW                      Priority:     P3
      Resolution:                             Issue type:    TASK
                                           Target milestone: 1.0u5next
      Assigned to:    rhierlmeier (rhierlmeier)
      QA Contact:     rhierlmeier
          URL:
       * Summary:     wrong setup of jmx server in qmaster results in dropping of jgdi connection
   Status whiteboard:
      Attachments:


     Issue 524 blocks:
   Votes for issue 524:


   Opened: Thu Jul 24 04:50:00 -0700 2008 
------------------------


   Wrong setup of the jmx server in qmaster results in dropping of the jgdi connection.

   > If you see such log ......
   >
   >> 07/23/2008 19:35:03|15|I|The resource provider has been stopped
   >> 07/23/2008 19:35:07|21|I|The spare pool has been stopped.
   >> 07/23/2008 19:35:07|18|I|Shutdown finished
   >> 07/23/2008 19:35:24|10|I|startup jvm (pid=20414)
   >> 07/23/2008 19:35:26|11|I|Secure mbean server started
   (service:jmx:rmi:///jndi/rmi://foo.bar:48309/system)
   >> 07/23/2008 19:35:26|12|I|The spare pool has been started.
   >> 07/23/2008 19:35:26|13|I|The reporter has been started.
   >> 07/23/2008 19:35:26|15|I|Service service: Starting Grid Engine service
   >> 07/23/2008 19:35:27|15|W|Service service: Connection to qmaster has been lost
   >> 07/23/2008 19:35:27|15|I|Service service: qmaster not running, try reconnect
   every 60 seconds
   >> 07/23/2008 19:35:28|16|I|The resource provider has been started
   >
   >
   > the state of "service" component is started, "service" service is unknown
   >
   > the reason can be that during installation of GE (JMX step) you did not specify
   the password for the server.
   If you set a password that is too short, the problem is the same.

   The GE adapter should report an error indicating that the jmx server in
   qmaster may not be set up correctly.
               ------- Additional comments from zwierzak Tue Aug 5 03:52:20 -0700 2008 -------
   Description:

   The GE adapter logs are misleading when the jmx thread in qmaster is wrongly
   configured and therefore does not run.

   > If you see such log ......
   >
   >> 07/23/2008 19:35:03|15|I|The resource provider has been stopped
   >> 07/23/2008 19:35:07|21|I|The spare pool has been stopped.
   >> 07/23/2008 19:35:07|18|I|Shutdown finished
   >> 07/23/2008 19:35:24|10|I|startup jvm (pid=20414)
   >> 07/23/2008 19:35:26|11|I|Secure mbean server started
   (service:jmx:rmi:///jndi/rmi://foo.bar:48309/system)
   >> 07/23/2008 19:35:26|12|I|The spare pool has been started.
   >> 07/23/2008 19:35:26|13|I|The reporter has been started.
   >> 07/23/2008 19:35:26|15|I|Service service: Starting Grid Engine service
   >> 07/23/2008 19:35:27|15|W|Service service: Connection to qmaster has been lost
   >> 07/23/2008 19:35:27|15|I|Service service: qmaster not running, try reconnect
   every 60 seconds
   >> 07/23/2008 19:35:28|16|I|The resource provider has been started
   >
   >
   > the state of "service" component is started, "service" service is unknown

   From the log, a user or administrator would assume that qmaster was running but
   then crashed or dropped the connection. This is misleading, because the jmx
   thread in qmaster was never up.

   This happens when the jmx thread in qmaster is wrongly configured (no password
   for the jmx server, or a password that is too short).

   Evaluation:

   Hedeby itself has no defect; this is only about improving the log entry.

   Suggested Fix / Work Around:

   Improve the log entry. Until then, the user needs to go to the qmaster machine
   and check the qmaster logs to see whether the jmx thread is running.

   Analysis:

   With this task, the logging in GEAdapterImpl.java should be improved to avoid
   confusion. Should we contact the GE team and request that starting qmaster with
   a wrongly configured jmx thread (one that does not start at all) prints an error
   message to the user (file an RFE)? Currently only an error is written to the
   qmaster logs, and qmaster itself keeps running without the jmx thread. The user
   is unaware that something went wrong.
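
   One possible shape for the improved log entry, as a rough sketch only: pick the
   message depending on whether a connection was ever established, so that a
   failed first connect is not reported as a "lost" connection. The class, the
   method, the wasConnected flag and the message texts are all hypothetical and
   are not taken from GEAdapterImpl.java.

```java
final class ConnectFailureMessage {

    /**
     * Hypothetical helper: builds the log text for a failed qmaster connection.
     * wasConnected says whether a connection had been established before.
     */
    static String describe(String service, boolean wasConnected, Throwable cause) {
        String reason = (cause == null || cause.getMessage() == null)
                ? "unknown"
                : cause.getMessage();
        if (wasConnected) {
            // A connection existed before, so "lost" is accurate here.
            return "Service " + service
                    + ": Connection to qmaster has been lost. Cause: " + reason;
        }
        // No connection was ever established: point at qmaster and its jmx thread.
        return "Service " + service
                + ": Could not connect to qmaster (check that qmaster and its jmx"
                + " thread are running). Cause: " + reason;
    }

    public static void main(String[] args) {
        // The cause text is made up for the demonstration.
        Exception cause = new Exception("connection refused");
        System.out.println(describe("service", false, cause));
        System.out.println(describe("service", true, cause));
    }
}
```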

   How to test:

   Try to connect to qmaster when qmaster is not running, or when the jmx thread in
   qmaster is not running. Check that the resulting log entries are appropriate.
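
   To confirm by hand that no jmx endpoint is reachable before checking the
   adapter logs, a small probe along these lines may help. It is a sketch that
   uses only the standard javax.management.remote API; the JmxProbe class is
   hypothetical, and the service URL of the qmaster jmx server is not given in
   this issue, so it has to be supplied by the caller. A secured server will
   also need credentials or SSL settings that this sketch does not pass, in
   which case a SecurityException at least shows that something is listening.

```java
import java.io.IOException;
import java.net.MalformedURLException;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

final class JmxProbe {

    /** Returns null if a JMX connection could be opened, else a failure description. */
    static String probe(String serviceUrl) {
        try {
            JMXServiceURL url = new JMXServiceURL(serviceUrl);
            // connect() fails with an IOException if nothing answers at the endpoint.
            try (JMXConnector connector = JMXConnectorFactory.connect(url)) {
                connector.getMBeanServerConnection();
                return null;
            }
        } catch (MalformedURLException e) {
            return "bad url: " + e.getMessage();
        } catch (IOException | SecurityException e) {
            // Keep the exception type so "unreachable" and "rejected" can be told apart.
            return e.getClass().getSimpleName() + ": " + e.getMessage();
        }
    }

    public static void main(String[] args) {
        if (args.length != 1) {
            System.err.println("usage: java JmxProbe <jmx service url>");
            return;
        }
        String failure = probe(args[0]);
        System.out.println(failure == null ? "jmx endpoint reachable" : "connect failed: " + failure);
    }
}
```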

   ETC:

   2 PD

   ATC:

   0.5 PD
               ------- Additional comments from torsten Tue Oct 28 03:30:18 -0700 2008 -------
   I stumbled over the same problem (insufficient information in log file) when
   there are connection problems from GE adapter to qmaster. In my case this was
   because of wrong certificates in the GE adapter configuration.

   Please find below the description and analysis I did. The proposed fix has an
   ETC of 0.5PD.

   Unhelpful error message when connection from GE adapter to qmaster is lost

   Description:
   When the connection from GE adapter to qmaster is lost, the following line
   appears in the respective VM-log:

   10/28/2008 08:43:51|12|W|Service geadapter: Connection to qmaster has been lost

   Analysis:
   A GrmException that is thrown from GEConnection.connect() is caught and
   logged at line 784 in GEServiceImpl.java (message "gsi.lost").

   The GrmException contains valuable information about the cause of the connection
   loss and is even handed into the logging message as a parameter. BUT this
   parameter is never used in the gsi.lost message.

   => Add this parameter in the messages.properties file!

   gsi.lost = Service {0}: Connection to qmaster has been lost. Cause: {1}
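
   The effect of the missing placeholder can be reproduced in isolation. The
   sketch below assumes the message is rendered with java.text.MessageFormat
   style placeholders, as the {0}/{1} syntax suggests. The old pattern is
   inferred from the log line above, and the demo class and the cause text are
   made up for illustration.

```java
import java.text.MessageFormat;

final class GsiLostMessageDemo {

    // Pattern inferred from the current log output: {1} (the cause) is never referenced.
    static final String OLD_PATTERN = "Service {0}: Connection to qmaster has been lost";

    // Pattern proposed in this comment.
    static final String NEW_PATTERN = "Service {0}: Connection to qmaster has been lost. Cause: {1}";

    /** Formats the pattern with the service name and the exception message as parameters. */
    static String render(String pattern, String serviceName, Throwable cause) {
        return MessageFormat.format(pattern, serviceName, cause.getMessage());
    }

    public static void main(String[] args) {
        Exception cause = new Exception("hypothetical cause text");
        // The second parameter is passed in both calls, but only the new pattern prints it.
        System.out.println(render(OLD_PATTERN, "geadapter", cause));
        System.out.println(render(NEW_PATTERN, "geadapter", cause));
    }
}
```

   MessageFormat silently ignores arguments that the pattern does not reference,
   which is why the cause never shows up with the old pattern.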
               ------- Additional comments from rhierlmeier Wed Nov 25 07:21:10 -0700 2009 -------
   Milestone changed
