#941 new defect

IZ715: JVM with cloud service hits file descriptor limit

Reported by: rhierlmeier
Owned by:
Priority: high
Milestone:
Component: hedeby
Version: 1.0u5
Severity:
Keywords: Sun bootstrap
Cc:

Description

[Imported from gridengine issuezilla http://gridengine.sunsource.net/issues/show_bug.cgi?id=715]

    Issue #:           715
    Component:         hedeby
    Subcomponent:      bootstrap
    Platform:          Sun
    OS:                All
    Version:           1.0u5
    Reporter:          rhierlmeier (rhierlmeier)
    CC:                None defined
    Status:            NEW
    Priority:          P2
    Resolution:
    Issue type:        DEFECT
    Target milestone:  1.0u5next
    Assigned to:       adoerr (adoerr)
    QA Contact:        adoerr
    URL:
    Summary:           JVM with cloud service hits file descriptor limit
    Status whiteboard:
    Attachments:




   Opened: Wed Dec 23 01:11:00 -0700 2009 
------------------------


   JVM with cloud service hits file descriptor limit

   Description:

   With the improvements to the cloud adapter in SDM 1.0u5 it can happen that
   the JVM that hosts a cloud service hits the file descriptor limit. An error
   message similar to the following can be found in the log files:

   12/22/2009 10:17:18|993|tractServiceAdapter$UninstallAction.doExecute|W|Service
   sge2: Could not uninstall resource res#56:
   Cannot store host resource srv135[HOST_ERROR,0]:
   /var/spool/sdm/spool/sge2/res#56.srf.0 (Too many open files)

   Evaluation:

   The problem leaves the system completely unusable; it must be restarted
   entirely. Once changes are made to the system again, the problem soon
   reappears.

   Analysis:

   The error message (Too many open files) is only a symptom. The real cause is
   that the JVM that runs the executor runs out of memory. The executor JVM
   still accepts incoming requests, but it no longer processes them. For each
   script that is to be executed via the executor a new connection to the
   executor JVM is opened, so the file descriptor limit is slowly but surely
   reached.


   How to fix:

   1. The executor must reject incoming scripts when a certain memory usage is
   reached, to prevent an OutOfMemoryError. We already had something similar in
   the code: the executor threw a RejectedExecutionException when maxPoolSize
   was reached. Unfortunately we disabled it with the fix for issue 686
   (Executor does not consider the maxPoolSize parameter). However, the old
   behavior did not take the memory usage of the executor into account (see the
   sketch below).
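
   A minimal sketch of such a guard, assuming an illustrative MemoryGuard class
   and a configurable heap-usage fraction (neither is part of the actual SDM
   executor code):

       import java.util.concurrent.RejectedExecutionException;

       // Illustrative guard: reject new work once heap usage crosses a
       // configurable fraction of the maximum heap.
       public final class MemoryGuard {

           private final double maxUsedFraction;

           public MemoryGuard(double maxUsedFraction) {
               this.maxUsedFraction = maxUsedFraction;
           }

           /** Throws RejectedExecutionException if heap usage is above the limit. */
           public void checkCapacity() {
               Runtime rt = Runtime.getRuntime();
               long used = rt.totalMemory() - rt.freeMemory();
               long max = rt.maxMemory();
               if ((double) used / max > maxUsedFraction) {
                   throw new RejectedExecutionException(
                       "Executor rejects new scripts: heap usage " + used
                       + " of " + max + " bytes exceeds the configured limit");
               }
           }
       }

   The executor would call checkCapacity() before putting each submitted script
   into its queue, so clients get an explicit rejection instead of the JVM later
   failing with an OutOfMemoryError.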

   2. Reduce the memory footprint of scripts that are executed via the executor.
   In the current implementation all necessary data (scripts, input data) are
   transferred to the executor as a command object. This command object is put
   into the executor's in-memory queue. Only when the command is taken from the
   queue are the scripts stored to disk and exec called. It would be possible to
   store the scripts to disk before the command object is put into the queue
   (see the sketch below).
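
   A minimal sketch of the idea, assuming an illustrative ScriptSpooler helper
   and spool directory (not the actual SDM executor classes):

       import java.io.File;
       import java.io.FileOutputStream;
       import java.io.IOException;

       // Illustrative only: write the script payload to the spool directory
       // before the command is enqueued, and enqueue just the file reference,
       // so the in-memory queue no longer holds the script contents.
       public final class ScriptSpooler {

           private final File spoolDir;

           public ScriptSpooler(File spoolDir) {
               this.spoolDir = spoolDir;
           }

           /** Persists the script and returns the file to enqueue instead of the bytes. */
           public File spool(String name, byte[] scriptContent) throws IOException {
               spoolDir.mkdirs();
               File scriptFile = new File(spoolDir, name);
               FileOutputStream out = new FileOutputStream(scriptFile);
               try {
                   out.write(scriptContent);
               } finally {
                   out.close();
               }
               return scriptFile;
           }
       }

   With this change the command object in the queue would carry only a file
   reference, so queued but not yet processed scripts no longer occupy heap
   memory.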


   4. Shut down the JVM if a certain memory threshold is reached. The
   administrator should get a clear error message with a hint that the heap size
   should be increased. The MemoryMXBean of the Java platform MBean server can
   deliver a notification when memory consumption reaches a threshold. This
   notification can be used to perform a clean JVM shutdown before an
   OutOfMemoryError occurs. The tricky part will be finding a proper value for
   the threshold (see the sketch below).
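
   A minimal sketch of the MemoryMXBean notification approach; the threshold
   fraction and the error message are assumptions, not SDM defaults:

       import java.lang.management.ManagementFactory;
       import java.lang.management.MemoryMXBean;
       import java.lang.management.MemoryNotificationInfo;
       import java.lang.management.MemoryPoolMXBean;
       import java.lang.management.MemoryType;
       import javax.management.Notification;
       import javax.management.NotificationEmitter;
       import javax.management.NotificationListener;

       // Illustrative sketch: arm a heap usage threshold and shut the JVM down
       // cleanly when the platform MemoryMXBean reports that it was crossed.
       public final class HeapThresholdWatcher {

           public static void install(final double thresholdFraction) {
               // Arm the usage threshold on every heap pool that supports it.
               for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
                   if (pool.getType() == MemoryType.HEAP && pool.isUsageThresholdSupported()) {
                       long max = pool.getUsage().getMax();
                       if (max > 0) {
                           pool.setUsageThreshold((long) (max * thresholdFraction));
                       }
                   }
               }
               // The MemoryMXBean emits a notification when a threshold is crossed.
               MemoryMXBean memoryBean = ManagementFactory.getMemoryMXBean();
               NotificationEmitter emitter = (NotificationEmitter) memoryBean;
               emitter.addNotificationListener(new NotificationListener() {
                   public void handleNotification(Notification n, Object handback) {
                       if (MemoryNotificationInfo.MEMORY_THRESHOLD_EXCEEDED.equals(n.getType())) {
                           System.err.println("Heap usage crossed the configured threshold;"
                               + " increase the executor JVM heap (-Xmx)."
                               + " Shutting down before an OutOfMemoryError occurs.");
                           System.exit(1); // placeholder for an orderly SDM shutdown
                       }
                   }
               }, null, null);
           }
       }

   Calling HeapThresholdWatcher.install(0.8) during JVM startup would, for
   example, trigger the shutdown once 80% of the maximum heap is in use.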


   Workaround:

   A simple workaround is to increase the max heap size of the JVM that hosts
   the executor. Edit the global configuration of the SDM system and increase
   the max heap size of the executor JVM by setting JVM args:

   % sdmadm mgc
   ...
       <common:jvm port="0" user="root" name="executor_vm">
           <common:component name="executor" .../>
           <common:jvmArg>-Xmx128M</common:jvmArg>

   <common:jvmArg>-Dcom.sun.grid.grm.management.connectionTimeout=60</common:jvmArg>
       </common:jvm>
   ...


   For systems set up with the simple installation, the JVM args of the cs_vm
   must be changed (the executor runs in the cs_vm):

   % sdmadm mgc
   ...
        <common:jvm port="0" user="root" name="cs_vm">
           <common:component name="executor" .../>
           <common:jvmArg>-Xmx1024M</common:jvmArg>

   <common:jvmArg>-Dcom.sun.grid.grm.management.connectionTimeout=60</common:jvmArg>
       </common:jvm>
   ...

   Restart the JVMs.

   However, this workaround has the drawback that all JVMs will consume more
   memory, including the JVMs that run on cloud hosts.


   How to test:

   Set up an SDM system with a simhost cloud service. Add a large number of
   resources to the service.

   Move the resources to a different service. Check that all resources are
   moved. Check the log files to verify that no errors occurred.


   ETC: 5PD
