Opened 48 years ago

Last modified 7 years ago

#910 new task

IZ610: JVM memory/file descriptor monitor

Reported by: afisch Owned by:
Priority: high Milestone:
Component: hedeby Version: 1.0u2
Severity: Keywords: Sun util
Cc:

Description

[Imported from gridengine issuezilla http://gridengine.sunsource.net/issues/show_bug.cgi?id=610]

        Issue #:      610          Platform:     Sun         Reporter: afisch (afisch)
       Component:     hedeby          OS:        All
     Subcomponent:    util         Version:      1.0u2          CC:    None defined
        Status:       NEW          Priority:     P2
      Resolution:                 Issue type:    TASK
                               Target milestone: 1.0u5next
      Assigned to:    rhierlmeier (rhierlmeier)
      QA Contact:     rhierlmeier
          URL:
       * Summary:     JVM memory/file descriptor monitor
   Status whiteboard:
      Attachments:


     Issue 610 blocks:


   Opened: Mon Dec 8 07:31:00 -0700 2008 
------------------------


   Description:
   After initial setup, Hedeby should be able to work unsupervised for any time
   period necessary. Thus reliability and availability are crucial properties for
   any running instance. Resource leaks such as memory or file descriptor leaks
   can severely affect usability, as this kind of problem usually requires
   periodic restarts. Such leaks are hard to spot if it takes some time to reach
   the resource limits. In addition, they may only occur under very specific
   conditions.

   If the configuration service JVM runs out of memory or file descriptors, it
   cannot be reached via the sdmadm command anymore. Any attempt to reach it will
   result in strange errors like: Connection lost, Connection reset, Unexpected
   EOF, etc. Unfortunately, the JVM cannot automatically recover from this. A file
   descriptor leak that led to these symptoms was described in issue 601. The leak
   can be observed if a system is installed in user mode. It becomes apparent
   after 8 hours. Although issue 601 is fixed, the possibility remains that
   similar issues have not been discovered yet or will be introduced by future
   changes.


   Evaluation:
   This issue is rated as a P2 task. It does not fix any problem. However, the
   monitoring feature is a valuable tool that will allow us to detect a class of
   severe bugs which are currently hard to spot.


   Fix/Work around:
   Fix: The memory usage/limit of all Hedeby JVMs can be monitored with the sdmadm
   command show_jvm. This command should be extended to show the file descriptor
   usage/limit in the same way. This extension would help to monitor the system's
   resource consumption by manual polling. However, the command becomes useless
   once a JVM has reached its resource limit, as the JVM then cannot be reached
   any more. To be able to analyze a JVM that has already run out of resources, a
   second monitoring mechanism is needed. It should periodically check whether
   the JVM is running out of resources and report it when a certain threshold is
   reached.
   Work Around: To check the current memory consumption, use the sdmadm command
   show_jvm.
   To check the file descriptor usage, use the lsof command or the
   /proc/<pid>/fd/ directory, which lists all file descriptors of a process.


   Analysis:
   The current resource consumption can be monitored with JMX MXBeans. There is the
   com.sun.management.UnixOperatingSystemMXBean that can provide the information
   for file descriptor usage. Additionally there is the MemoryMXBean, that is
   currently used by Hedeby to show the memory information in the show_jvm command.
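
   As a rough illustration of the two MXBeans named above, the following sketch
   reads heap and file descriptor usage as a fraction of the respective limit.
   The class name ResourceSnapshot and its methods are hypothetical and not part
   of Hedeby. The sketch references the vendor-specific bean directly, so it
   assumes a JDK that ships com.sun.management at compile time.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;
import java.lang.management.OperatingSystemMXBean;

// Hypothetical helper for illustration; not part of Hedeby.
final class ResourceSnapshot {

    /** Heap usage as a fraction of the maximum, or -1 if no maximum is defined. */
    static double heapUsageRatio() {
        MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
        long max = heap.getMax(); // -1 when the JVM reports no maximum
        return max > 0 ? (double) heap.getUsed() / max : -1.0;
    }

    /** Open file descriptors as a fraction of the limit, or -1 if unavailable. */
    static double fileDescriptorUsageRatio() {
        OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
        if (os instanceof com.sun.management.UnixOperatingSystemMXBean) {
            com.sun.management.UnixOperatingSystemMXBean unix =
                    (com.sun.management.UnixOperatingSystemMXBean) os;
            long max = unix.getMaxFileDescriptorCount();
            return max > 0 ? (double) unix.getOpenFileDescriptorCount() / max : -1.0;
        }
        return -1.0; // vendor-specific bean is not present on this platform
    }

    public static void main(String[] args) {
        System.out.printf("heap: %.2f, file descriptors: %.2f%n",
                heapUsageRatio(), fileDescriptorUsageRatio());
    }
}
```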

   A convenient way to use the MXBeans would be to register a
   NotificationListener with them. This is possible for the MemoryMXBean:

      MemoryMXBean mbean = ManagementFactory.getMemoryMXBean();
      NotificationEmitter emitter = (NotificationEmitter) mbean;
      NotificationListener listener = new NotificationListener() {

          public void handleNotification(Notification notif, Object handback) {
              // handle notification
          }
      };
      // register the listener; no filter and no handback object
      emitter.addNotificationListener(listener, null, null);
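
   A listener on the MemoryMXBean is only notified once a usage threshold has
   been set on a memory pool. The following is a minimal sketch of that setup,
   assuming a threshold of 75% of each heap pool's maximum. The class name
   MemoryThresholdSetup and the method arm are hypothetical.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryNotificationInfo;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryType;
import java.lang.management.MemoryUsage;
import javax.management.NotificationEmitter;
import javax.management.NotificationListener;

// Hypothetical helper for illustration; not part of Hedeby.
final class MemoryThresholdSetup {

    /**
     * Registers the listener on the MemoryMXBean and sets a usage threshold
     * on every heap pool that supports one. Returns the number of pools armed.
     */
    static int arm(double fraction, NotificationListener listener) {
        NotificationEmitter emitter =
                (NotificationEmitter) ManagementFactory.getMemoryMXBean();
        emitter.addNotificationListener(listener, null, null);
        int armed = 0;
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            MemoryUsage usage = pool.getUsage();
            if (usage == null || usage.getMax() <= 0) {
                continue; // pool is invalid or has no defined maximum
            }
            if (pool.getType() == MemoryType.HEAP && pool.isUsageThresholdSupported()) {
                pool.setUsageThreshold((long) (usage.getMax() * fraction));
                armed++;
            }
        }
        return armed;
    }

    public static void main(String[] args) {
        int armed = arm(0.75, (notif, handback) -> {
            if (MemoryNotificationInfo.MEMORY_THRESHOLD_EXCEEDED.equals(notif.getType())) {
                System.err.println("Memory threshold exceeded: " + notif.getMessage());
            }
        });
        System.out.println("Pools armed: " + armed);
    }
}
```

   How many pools get armed depends on the garbage collector in use, since not
   every pool supports a usage threshold or reports a maximum.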

   For file descriptor monitoring, however, it is not possible to register a
   NotificationListener, as the com.sun.management.UnixOperatingSystemMXBean does
   not implement the required NotificationEmitter interface. Furthermore, the
   com.sun.management.UnixOperatingSystemMXBean is vendor-specific code and might
   not be available on all platforms. This problem can be addressed with a
   reflection-based wrapper like the one implemented for the FilePreferencesTest
   class (UnixOperatingSystemMXBeanWrapper).
   Resource monitoring mainly serves QA purposes, so a customer is not missing a
   feature if a resource monitor is not available. The file descriptor monitoring
   mechanism would therefore be switched off if the required class is not
   available.
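
   The reflection-based approach could look roughly like the sketch below. It
   is not the existing UnixOperatingSystemMXBeanWrapper, whose code is not shown
   in this issue; the class name FileDescriptorProbe and its methods are
   hypothetical. An empty result stands for "monitor switched off".

```java
import java.lang.management.ManagementFactory;
import java.lang.management.OperatingSystemMXBean;
import java.lang.reflect.Method;
import java.util.Optional;

// Hypothetical wrapper for illustration; not the Hedeby implementation.
final class FileDescriptorProbe {
    private static final String BEAN_CLASS =
            "com.sun.management.UnixOperatingSystemMXBean";

    private final OperatingSystemMXBean bean;
    private final Method openCount;
    private final Method maxCount;

    private FileDescriptorProbe(OperatingSystemMXBean bean, Method openCount, Method maxCount) {
        this.bean = bean;
        this.openCount = openCount;
        this.maxCount = maxCount;
    }

    /** Returns a probe, or empty if the vendor-specific bean is not available. */
    static Optional<FileDescriptorProbe> create() {
        try {
            Class<?> type = Class.forName(BEAN_CLASS);
            OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
            if (!type.isInstance(os)) {
                return Optional.empty();
            }
            return Optional.of(new FileDescriptorProbe(os,
                    type.getMethod("getOpenFileDescriptorCount"),
                    type.getMethod("getMaxFileDescriptorCount")));
        } catch (ReflectiveOperationException | LinkageError e) {
            return Optional.empty(); // class or method missing on this JVM
        }
    }

    long open() { return invoke(openCount); }

    long max() { return invoke(maxCount); }

    private long invoke(Method method) {
        try {
            return (Long) method.invoke(bean);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

   Looking the methods up on the public interface rather than on the bean's
   implementation class keeps the calls independent of JVM-internal classes.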

   As there is only limited support for automatic notification, we have to use a
   thread that periodically checks the resource status. A good candidate for
   such a thread would be the life cycle thread in JVMImpl: this thread is sent
   to sleep after startup, so it is idle and just waits for JVM shutdown. It
   could be woken periodically to check the current resource consumption, log a
   warning if the JVM is getting low on any resource, and be put back to sleep.
   If we implement it like this, we have to keep in mind that the thread's
   interrupted status can get lost during logging (see issue 538 and issue 607).
   A reasonable checking interval would be in the range of minutes. It does not
   need to be shorter, as resource leaks usually do not grow very fast. The
   warning threshold can be calculated from the corresponding max memory/file
   descriptor values, which are also provided by the MXBeans (e.g. mem usage >
   75% of max mem).
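
   The periodic wake-up could be sketched as follows. This is a standalone
   illustration, not the JVMImpl life cycle thread; the class name
   PeriodicResourceCheck is hypothetical. Because the interrupted status may get
   lost, the sketch relies on an explicit volatile stop flag and uses the
   interrupt only to cut the sleep short.

```java
// Hypothetical sketch for illustration; not the JVMImpl life cycle thread.
final class PeriodicResourceCheck {
    private final Thread thread;
    private volatile boolean stopped;

    PeriodicResourceCheck(long intervalMillis, Runnable check) {
        thread = new Thread(() -> {
            while (!stopped) {
                try {
                    Thread.sleep(intervalMillis);
                } catch (InterruptedException e) {
                    // Re-evaluate the stop flag instead of trusting the
                    // interrupted status, which may have been cleared elsewhere.
                    continue;
                }
                check.run();
            }
        }, "resource-check");
        thread.setDaemon(true);
    }

    void start() {
        thread.start();
    }

    /** Requests shutdown and waits until the checking thread has finished. */
    void stop() {
        stopped = true;
        thread.interrupt(); // wake the thread if it is sleeping
        try {
            thread.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```

   A real deployment would pass an interval in the range of minutes, for example
   new PeriodicResourceCheck(5 * 60 * 1000L, check), where check is whatever
   code compares the current usage against the thresholds.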

   A nice way to scale the logging sensitivity is to log warnings for different
   thresholds with different log levels:

   if (usage > 75%) {
      log.log(Level.WARNING, ...
   } else if (usage > 60%) {
      log.log(Level.INFO, ...
   ...
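
   A compilable version of this tiering might look like the sketch below. The
   75% and 60% thresholds come from the pseudocode above; the Level.FINE
   fallback, the message text and the class name UsageLogLevels are assumptions,
   since the lower tiers are elided in the original.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

// Hypothetical helper for illustration; not part of Hedeby.
final class UsageLogLevels {

    /** Maps a usage ratio (0.0 to 1.0) to the log level it should be reported at. */
    static Level levelFor(double usageRatio) {
        if (usageRatio > 0.75) {
            return Level.WARNING;
        } else if (usageRatio > 0.60) {
            return Level.INFO;
        }
        return Level.FINE; // assumed fallback; the source elides the lower tiers
    }

    /** Logs the usage at the level matching its severity, if that level is enabled. */
    static void report(Logger log, String resource, double usageRatio) {
        Level level = levelFor(usageRatio);
        if (log.isLoggable(level)) {
            log.log(level, "{0} usage at {1}% of maximum",
                    new Object[] {resource, Math.round(usageRatio * 100)});
        }
    }
}
```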

   To be able to define a specific log level via the modify_log_level command, we
   should use a dedicated class with a separate logger. This mechanism should
   also allow the feature to be switched off. The feature code should only be
   executed if the logger is enabled. A periodic check of the log level is
   required to allow the feature to be switched on and off dynamically.
   The warning should be logged by the reporter component, as this will make it
   easy to spot problems with any of the JVMs installed for a Hedeby instance.
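
   Using the logger as the on/off switch could be sketched like this. The logger
   name and the class name ResourceMonitorSwitch are hypothetical; how
   modify_log_level maps onto java.util.logging levels is not described in this
   issue, so the sketch only assumes that the command ends up changing the level
   of this logger.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

// Hypothetical sketch for illustration; logger name is an assumption.
final class ResourceMonitorSwitch {
    // Static reference keeps the logger, and any level set on it, from being
    // garbage collected.
    static final Logger LOG = Logger.getLogger("example.hedeby.ResourceMonitor");

    /**
     * The monitor counts as enabled while the dedicated logger accepts
     * warnings. Setting the logger to Level.OFF switches the feature off.
     */
    static boolean isEnabled() {
        return LOG.isLoggable(Level.WARNING);
    }
}
```

   Calling isEnabled() at the start of every check cycle, rather than once at
   startup, is what would make a later level change take effect dynamically.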


   How to test:
   There should be a JUnit test to check the behavior if a JVM runs out of
   resources (see FilePreferencesTest for the MXBean usage and
   LoggerHandlerMockup for testing the logging).
   There should be a TS test that chooses an appropriate log level for the resource
   usage checking class and checks the reporter component log file for any
   low-memory warning message. This test should be run twice, as the first and as
   the last TS test. With this double-checking approach we can monitor two time
   windows: the one during a testing session and the one between testing
   sessions. This test would not verify the functionality itself but is a means
   to find resource leaks.


   ETC:
   5 PD{
      1.5 PD Feature
      1.5 PD JUnit test
      2.0 PD TS tests
   }
               ------- Additional comments from rhierlmeier Wed Nov 25 07:21:11 -0700 2009 -------
   Milestone changed
