IZ715: JVM with cloud service hits file descriptor limit
|Reported by:||rhierlmeier||Owned by:|
[Imported from gridengine issuezilla http://gridengine.sunsource.net/issues/show_bug.cgi?id=715]
Issue #: 715
Platform: Sun
Reporter: rhierlmeier (rhierlmeier)
Component: hedeby
OS: All
Subcomponent: bootstrap
Version: 1.0u5
CC: None defined
Status: NEW
Priority: P2
Resolution:
Issue type: DEFECT
Target milestone: 1.0u5next
Assigned to: adoerr (adoerr)
QA Contact: adoerr
URL:
Summary: JVM with cloud service hits file descriptor limit
Status whiteboard:
Attachments:
Issue 715 blocks:
Votes for issue 715:
Opened: Wed Dec 23 01:11:00 -0700 2009
------------------------

JVM with cloud service hits file descriptor limit

Description:

With the improvements of the cloud adapter in SDM 1.0u5, the JVM that hosts a cloud service can hit the file descriptor limit. An error message similar to the following can be found in the log files:

    12/22/2009 10:17:18|993|tractServiceAdapter$UninstallAction.doExecute|W|Service sge2: Could not uninstall resource res#56: Cannot store host resource srv135[HOST_ERROR,0]: /var/spool/sdm/spool/sge2/res#56.srf.0 (Too many open files)

Evaluation:

The problem leads to a completely unusable system, which must be restarted completely. Without changing the system, the problem will soon reappear.

Analysis:

The error message (Too many open files) is only a symptom. The real cause is that the JVM that runs the executor runs out of memory. The executor JVM still accepts incoming requests, but it does not process them. For each script that should be executed via the executor, a new connection to the executor JVM is opened, so the file descriptor limit is slowly but surely reached.

How to fix:

1. The executor must reject incoming scripts when a certain memory usage is reached, to prevent an OutOfMemoryError. We already had something similar in the code: the executor threw a RejectedExecutionException when maxPoolSize was reached. Unfortunately we disabled it with the fix of issue 686 (Executor does not consider the maxPoolSize parameter). However, the old behavior did not consider the memory usage of the executor.

2. Reduce the memory footprint of scripts that are executed via the executor. In the current implementation all necessary data (scripts, input data) is transferred to the executor as a command object. This command object is put into the (in-memory) queue of the executor. When the command is taken from the queue, the scripts are stored to disk and then exec is called. It would be possible to store the scripts to disk before the command object is put into the queue.

4. Shut down the JVM if a certain memory threshold is reached. The administrator should get a clear error message with a hint that the heap size should be increased. The MemoryMXBean of the Java platform MBean server can deliver a notification when memory consumption reaches a threshold. This notification can be used to perform a clean JVM shutdown before an OutOfMemoryError occurs. The art will be in finding a proper value for the threshold.

Workaround:

A simple workaround is to increase the max heap size of the JVM that hosts the executor. Edit the global configuration of the SDM system and increase the max heap size of the executor JVM by setting JVM args:

    % sdmadm mgc
    ...
    <common:jvm port="0" user="root" name="executor_vm">
        <common:component name="executor" .../>
        <common:jvmArg>-Xmx128M</common:jvmArg>
        <common:jvmArg>-Dcom.sun.grid.grm.management.connectionTimeout=60</common:jvmArg>
    </common:jvm>
    ...

For simple installed systems the JVM args of the cs_vm must be changed (the executor is running in the cs_vm):

    % sdmadm mgc
    ...
    <common:jvm port="0" user="root" name="executor_vm">
        <common:component name="executor" .../>
        <common:jvmArg>-Xmx1024M</common:jvmArg>
        <common:jvmArg>-Dcom.sun.grid.grm.management.connectionTimeout=60</common:jvmArg>
    </common:jvm>
    ...

Restart the JVMs. However, this workaround has the drawback that all JVMs will consume more memory, including the JVMs that run on cloud hosts.

How to test:

Set up an SDM system with a simhost cloud service. Add a large number of resources to the service. Move the resources to a different service. Check that all resources are moved. Check the log files to confirm that no error occurred.

ETC: 5PD
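
Fix options 1 and 4 from the description could look roughly like the Java sketch below. It is an illustration only, not the SDM executor code: the class ExecutorMemoryGuard, its method names, and the example ratio of 0.8 are hypothetical, and a real fix would need a threshold tuned for the executor. Only the java.lang.management and javax.management calls are standard JDK API.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryNotificationInfo;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryType;
import java.lang.management.MemoryUsage;
import java.util.concurrent.RejectedExecutionException;
import java.util.function.DoubleSupplier;
import javax.management.NotificationEmitter;

// Hypothetical helper, not part of SDM: illustrates fix options 1 and 4.
final class ExecutorMemoryGuard {

    private final double maxHeapRatio;      // assumed example: 0.8 rejects above 80% heap usage
    private final DoubleSupplier heapRatio; // injectable so the check can be tested

    ExecutorMemoryGuard(double maxHeapRatio, DoubleSupplier heapRatio) {
        this.maxHeapRatio = maxHeapRatio;
        this.heapRatio = heapRatio;
    }

    /** Fraction of the heap in use, measured against the max heap size when it is defined. */
    static double currentHeapRatio() {
        MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
        long limit = heap.getMax() > 0 ? heap.getMax() : heap.getCommitted();
        return (double) heap.getUsed() / limit;
    }

    /** Option 1: refuse a new command instead of queueing it when memory is short. */
    void checkAdmission(String commandName) {
        double ratio = heapRatio.getAsDouble();
        if (ratio >= maxHeapRatio) {
            throw new RejectedExecutionException(
                "Rejected " + commandName + ": heap usage ratio " + ratio
                + " reached limit " + maxHeapRatio);
        }
    }

    /**
     * Option 4: set a usage threshold on the heap pools that support one and run
     * the callback when the JVM reports that a threshold was exceeded.
     * Returns the number of pools on which a threshold was set.
     */
    static int installThresholdListener(double ratio, Runnable onExceeded) {
        int armed = 0;
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            MemoryUsage usage = pool.getUsage();
            if (pool.getType() == MemoryType.HEAP && pool.isUsageThresholdSupported()
                    && usage != null && usage.getMax() > 0) {
                pool.setUsageThreshold((long) (usage.getMax() * ratio));
                armed++;
            }
        }
        // The platform MemoryMXBean emits the threshold notifications.
        NotificationEmitter emitter =
            (NotificationEmitter) ManagementFactory.getMemoryMXBean();
        emitter.addNotificationListener((notification, handback) -> {
            if (MemoryNotificationInfo.MEMORY_THRESHOLD_EXCEEDED
                    .equals(notification.getType())) {
                onExceeded.run();
            }
        }, null, null);
        return armed;
    }
}
```

In this sketch the executor would create the guard with `ExecutorMemoryGuard::currentHeapRatio` as the supplier and call `checkAdmission` before a command object is put into the queue. The callback passed to `installThresholdListener` could log the hint to increase the heap size and then start the shutdown.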
Change History (0)