Opened 12 years ago
Last modified 10 years ago
#905 new defect
IZ602: "sdmadm sdj -j <VM_NAME> -h localhost" can result in JVM process left running
Reported by: | easymf | Owned by: | |
---|---|---|---|
Priority: | normal | Milestone: | |
Component: | hedeby | Version: | 1.0u2 |
Severity: | | Keywords: | Sun bootstrap |
Cc: |
Description
[Imported from gridengine issuezilla http://gridengine.sunsource.net/issues/show_bug.cgi?id=602]
Issue #: 602
Platform: Sun
Reporter: easymf (easymf)
Component: hedeby
OS: All
Subcomponent: bootstrap
Version: 1.0u2
CC: None defined
Status: NEW
Priority: P3
Resolution:
Issue type: DEFECT
Target milestone: 1.0u5next
Assigned to: adoerr (adoerr)
QA Contact: adoerr
URL:
Summary: "sdmadm sdj -j <VM_NAME> -h localhost" can result in JVM process left running
Status whiteboard:
Attachments:
Issue 602 blocks:
Votes for issue 602:
Opened: Mon Nov 10 12:17:00 -0700 2008
------------------------

Description

Sometimes shutting down a JVM leaves the JVM process running even though the message says it was stopped, and the process then has to be killed manually. A user can verify that they hit the bug with the following steps:

1. Check the PID of the JVM process (the first number in the output):

   --> more <hedeby_local_spool>/run/<VM_NAME>\@<LOCALHOST_NAME>
   1234 56789

2. Shut down the JVM using sdmadm:

   --> sdmadm sdj -j <VM_NAME> -h localhost
   jvm   host result  message
   --------------------------
   rp_vm ge1  STOPPED Process killed

3. Check whether the java process for the JVM is still running (identified by the PID above):

   --> ps -ef | grep 1234
   <userid> 1234 1 0 19:06:57 pts/12 0:03 java -Dcom.sun.grid

Evaluation

Not very visible, but definitely an annoying issue, because the only workaround is to kill the process manually and then clean up by hand, and the cleanup is not described anywhere.

Suggested Fix/Work Around

To bypass the problem, the user needs to send SIGKILL to the JVM process:

   --> kill -9 1234

The leftover files then need to be cleaned up. What to remove depends on which JVM was killed.

1. If the JVM is running the configuration service ("cs_vm"):
   a) remove file "<hedeby_local_spool>/run/cs_vm\@<LOCALHOST_NAME>"
   b) remove file "<hedeby_local_spool>/spool/cs/active_jvm/cs_vm\@<LOCALHOST_NAME>.xml"
2. Any other JVM:
   a) remove file "<hedeby_local_spool>/run/<VM_NAME>\@<LOCALHOST_NAME>"
   b) remove file "<hedeby_local_spool>/spool/cs/active_jvm/<VM_NAME>\@<LOCALHOST_NAME>.xml"
   c) remove all files that belong to components running inside the JVM (VM_NAME) and on the host (LOCALHOST_NAME) from directory "<hedeby_local_spool>/spool/cs/active_component" on the host that hosts the JVM running the configuration service (cs_vm) !!!!

Fixing the problem is rather complex. I propose the following:

1. Introduce a feature to allow SIGKILL in com.sun.grid.grm.util.Platform.
2. Perform at least a partial cleanup (step 2.c of the workaround is also possible, but needs CS to be running) and refer to the documentation in the message (it would be nice to put a code in the message that the user could then look up in the HOW-TOs or FAQs).
3. Change the implementation of com.sun.grid.grm.ui.component.StopJVMCommand so that it tries to kill a non-responding JVM with SIGTERM first and uses SIGKILL only if that is not enough.

Analysis

The problematic part is com.sun.grid.grm.ui.component.StopJVMCommand$JVMStopper.stop(), lines 240-245:

    if(Platform.getPlatform().killProcess(pid)) {
        return I18NManager.formatMessage("StopJVMCommand.JVMStopper.processKilled", BUNDLE_NAME);
    }
    throw new GrmException("ui.stopjvmcommand.kill_failed", BUNDLE_NAME, getJvmName());

In fact, "Platform.getPlatform().killProcess(pid)" returns "true" on Unix platforms whenever the kill command itself succeeds, because of the implementation of com.sun.grid.grm.util.UnixPlatform.killProcess(int pid):

    public boolean killProcess(int pid) {
        try {
            log.log(Level.FINE, "unixplatform.kill", pid);
            int result = exec("kill " + pid, null, null, null, null);
            return result == 0;
        } catch(InterruptedException ex) {
            return false;
        } catch(IOException ex) {
            return false;
        }
    }

As shown, the method performs "kill $PID" on Unix architectures, which means that SIGTERM is delivered to the process identified by $PID. The problem is that return code 0 only means that SIGTERM was successfully delivered to the process, not that the process has terminated.

How to test

First, provide unit tests for the improved implementation (for the methods sending SIGTERM, SIGKILL etc.) in the form of JUnit tests. Then provide a TS test for the "sdmadm sdj" command, ideally involving the shutdown of a JVM process that ignores SIGTERM.

ATC: 1 PD (because of <BEEP/> problems with issuezilla)
ETC: 5 PD

------- Additional comments from rhierlmeier Mon Nov 10 23:54:16 -0700 2008 -------
The proposed solution for the problem is highly dangerous. I am totally against fixing it in this way. We should find out why the JVMs do not stop; it can only be that a thread inside the JVM did not die. If any of the developers runs into this problem again, please use jstack to get the stack traces of all threads of the "unresponsive" JVM. After finishing this discussion I would further raise the priority of this issue, because the described scenario can be an indicator of a deadlock.

------- Additional comments from adoerr Tue Nov 11 03:13:43 -0700 2008 -------
I agree with Richard. We need to find out what the root cause of this problem is. Using SIGKILL in addition to SIGTERM to kill a JVM is an option. However, it should only be used as a last resort. Killing a JVM with SIGKILL can easily introduce all kinds of inconsistencies.

------- Additional comments from afisch Thu Dec 4 10:20:41 -0700 2008 -------
This issue is most likely related to task 607.

------- Additional comments from rhierlmeier Wed Nov 25 07:21:11 -0700 2009 -------
Milestone changed
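
For illustration only: the Analysis notes that exit status 0 from "kill $PID" means the signal was delivered, not that the JVM has exited. The sketch below shows one way a stop command could report "stopped" only after the process is really gone. It is not Hedeby code. The class TerminationCheck and its methods parsePid and terminateAndVerify are hypothetical names, and the use of java.lang.ProcessHandle assumes Java 9 or later, which was not available when this issue was filed. The sketch deliberately does not send SIGKILL, in line with the concerns raised in the comments; it only avoids reporting success while the process is still alive.

```java
import java.util.Optional;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Hypothetical helper, not part of Hedeby: stop a process and verify that it exited. */
public final class TerminationCheck {

    private TerminationCheck() {
    }

    /**
     * Extracts the PID from the content of a run file. Assumption: as in the
     * example in this ticket, the PID is the first whitespace-separated number.
     */
    public static long parsePid(String runFileContent) {
        return Long.parseLong(runFileContent.trim().split("\\s+")[0]);
    }

    /**
     * Requests normal termination of the process and waits for it to exit.
     *
     * @return true if the process is gone within the timeout,
     *         false if it is still running (for example because it ignores the request)
     */
    public static boolean terminateAndVerify(long pid, long timeoutMillis)
            throws InterruptedException {
        Optional<ProcessHandle> maybeHandle = ProcessHandle.of(pid);
        if (!maybeHandle.isPresent()) {
            return true; // no such process, nothing left to stop
        }
        ProcessHandle handle = maybeHandle.get();

        // destroy() requests termination; on Unix-like JDKs this is normally
        // delivered as SIGTERM (see ProcessHandle.supportsNormalTermination()).
        // Its return value only says that the request was made, which is the
        // same limitation as exit status 0 of "kill $PID".
        if (!handle.destroy()) {
            return !handle.isAlive();
        }

        try {
            // Wait until the process has actually exited, or give up after the timeout.
            handle.onExit().get(timeoutMillis, TimeUnit.MILLISECONDS);
            return true;
        } catch (TimeoutException ex) {
            return false; // signal delivered, but the process is still running
        } catch (ExecutionException ex) {
            return !handle.isAlive();
        }
    }
}
```

With a helper like this, a caller could print "Process killed" only when terminateAndVerify returns true and otherwise report that the JVM did not stop, so that a jstack dump of the unresponsive JVM can be taken before anyone reaches for "kill -9".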