Opened 12 years ago

Last modified 10 years ago

#905 new defect

IZ602: "sdmadm sdj -j <VM_NAME> -h localhost" can result in JVM process left running

Reported by: easymf Owned by:
Priority: normal Milestone:
Component: hedeby Version: 1.0u2
Severity: Keywords: Sun bootstrap


[Imported from gridengine issuezilla]

        Issue #:      602             Platform:     Sun         Reporter: easymf (easymf)
       Component:     hedeby             OS:        All
     Subcomponent:    bootstrap       Version:      1.0u2          CC:    None defined
        Status:       NEW             Priority:     P3
      Resolution:                    Issue type:    DEFECT
                                  Target milestone: 1.0u5next
      Assigned to:    adoerr (adoerr)
      QA Contact:     adoerr
       * Summary:     "sdmadm sdj -j <VM_NAME> -h localhost" can result in JVM process left running
   Status whiteboard:

     Issue 602 blocks:

   Opened: Mon Nov 10 12:17:00 -0700 2008 


   Sometimes shutting down a JVM leaves the JVM process running even though the
   message says it was stopped, and the process must then be killed manually. A
   user can verify that they hit this bug by performing the following steps:

   1. check the PID of the JVM process (the first number in the output)
   --> more <hedeby_local_spool>/run/<VM_NAME>\@<LOCALHOST_NAME>

   2. shut down the JVM using sdmadm
   --> sdmadm sdj -j <VM_NAME> -h localhost
   jvm   host result  message
   rp_vm ge1  STOPPED Process killed

   3. check whether the java process of the JVM is still running (identified by the PID from step 1)
   --> ps -ef | grep 1234
   <userid> 1234     1   0 19:06:57 pts/12      0:03 java -Dcom.sun.grid
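
   Steps 1 and 3 can also be scripted. The following is a minimal Java sketch
   (Java 11 or later), assuming the run file starts with the PID followed by
   whitespace. The class name PidFileCheck and its methods are hypothetical and
   not part of hedeby.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class PidFileCheck {

    /** Parses the first whitespace-separated token of the run file content as the PID (assumed format). */
    static long parsePid(String runFileContent) {
        String firstToken = runFileContent.trim().split("\\s+")[0];
        return Long.parseLong(firstToken);
    }

    /** Returns true if a process with the given PID currently exists and is alive. */
    static boolean isRunning(long pid) {
        return ProcessHandle.of(pid).map(ProcessHandle::isAlive).orElse(false);
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: PidFileCheck <path to run file>");
            return;
        }
        // args[0]: <hedeby_local_spool>/run/<VM_NAME>@<LOCALHOST_NAME>
        long pid = parsePid(Files.readString(Path.of(args[0])));
        System.out.println("PID " + pid + (isRunning(pid) ? " is still running" : " is not running"));
    }
}
```

   Running it once before and once after "sdmadm sdj" shows whether the JVM
   process survived the shutdown.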


   This issue is not very visible but definitely annoying, as the only workaround is
   to kill the process manually and then clean up by hand, which is not documented.

   Suggested Fix/Work Around

   To bypass the problem, a user needs to send SIGKILL to the JVM process:

   --> kill -9 1234

   Then the leftover files need to be cleaned up. Which files depends on the JVM that was killed.

   1. if the JVM runs the configuration service ("cs_vm"):
           a) remove file "<hedeby_local_spool>/run/cs_vm\@<LOCALHOST_NAME>"
           b) remove file
   2. any other JVM
           a) remove file "<hedeby_local_spool>/run/<VM_NAME>\@<LOCALHOST_NAME>"
           b) remove file
           c) remove all files that belong to components running inside the JVM (VM_NAME)
   and on the host (LOCALHOST_NAME) from the directory
   "<hedeby_local_spool>/spool/cs/active_component" on the host that runs the
   configuration service JVM (cs_vm)!

   Fixing the problem is rather complex. I propose the following:

   1. introduce a feature that allows SIGKILL in com.sun.grid.grm.util.Platform
   2. perform at least a partial cleanup (step 2.c of the workaround is also possible,
   but requires a running CS) and refer to the documentation in the message (it would
   be nice to include a code in the message that the user could look up in the HOW-TOs
   or FAQs)
   3. change the implementation of com.sun.grid.grm.ui.component.StopJVMCommand so
   that it first tries to kill a nonresponding JVM with SIGTERM and uses SIGKILL
   only if that is not enough
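
   A minimal sketch of the escalation proposed in item 3. It is written against
   the standard java.lang.ProcessHandle API (Java 9 or later, the sketch uses
   Java 11) rather than hedeby's Platform class. The class EscalatingStop, the
   Result enum and the grace period are illustrative assumptions. On Unix-like
   systems OpenJDK is understood to implement destroy() with SIGTERM and
   destroyForcibly() with SIGKILL, but the Javadoc leaves this implementation
   dependent.

```java
import java.util.Optional;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class EscalatingStop {

    /** Outcome of a stop attempt (hypothetical names). */
    enum Result { NOT_RUNNING, TERMINATED, KILLED, STILL_RUNNING }

    /** Requests normal termination first and escalates only if the process outlives the grace period. */
    static Result stop(long pid, long graceMillis) throws InterruptedException {
        Optional<ProcessHandle> maybe = ProcessHandle.of(pid);
        if (maybe.isEmpty() || !maybe.get().isAlive()) {
            return Result.NOT_RUNNING;
        }
        ProcessHandle handle = maybe.get();

        handle.destroy(); // normal termination request
        if (exited(handle, graceMillis)) {
            return Result.TERMINATED;
        }

        handle.destroyForcibly(); // last resort
        return exited(handle, graceMillis) ? Result.KILLED : Result.STILL_RUNNING;
    }

    /** Waits up to the given time for the process to exit and reports whether it did. */
    private static boolean exited(ProcessHandle handle, long millis) throws InterruptedException {
        try {
            handle.onExit().get(millis, TimeUnit.MILLISECONDS);
            return true;
        } catch (TimeoutException e) {
            return false;
        } catch (ExecutionException e) {
            return !handle.isAlive();
        }
    }
}
```

   A caller could map KILLED to a message that points the user to the cleanup
   steps, since a forcibly killed JVM gets no chance to remove its own files.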


   The problematic part is
   com.sun.grid.grm.ui.component.StopJVMCommand$JVMStopper.stop() between the lines:

   if(Platform.getPlatform().killProcess(pid)) {
   I18NManager.formatMessage("StopJVMCommand.JVMStopper.processKilled", BUNDLE_NAME);
   throw new GrmException("ui.stopjvmcommand.kill_failed", BUNDLE_NAME, getJvmName());

   In fact, "Platform.getPlatform().killProcess(pid)" always returns "true" on Unix
   platforms, because of the implementation of
   com.sun.grid.grm.util.UnixPlatform.killProcess(int pid):

   public boolean killProcess(int pid) {
           try {
               log.log(Level.FINE, "unixplatform.kill", pid);
               int result = exec("kill " + pid, null, null, null, null);
               return result == 0;
           } catch(InterruptedException ex) {
               return false;
           } catch(IOException ex) {
               return false;
           }
   }

   As shown, the method performs "kill $PID" on Unix architectures, which means
   that SIGTERM is sent to the process identified by $PID. The problem is that
   return code 0 only means that SIGTERM was successfully sent to the process, not
   that the process terminated.
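
   The behaviour can be reproduced outside hedeby. The following minimal sketch
   assumes a POSIX "sh" and "sleep" on the PATH. The class SigtermExitCodeDemo
   and its methods are hypothetical. It starts a process that ignores SIGTERM,
   sends it "kill <pid>" and then checks whether the process is gone.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.concurrent.TimeUnit;

public class SigtermExitCodeDemo {

    /** Starts a process that ignores SIGTERM and then sleeps for 20 seconds. */
    static Process startTermIgnoringProcess() throws IOException {
        Process p = new ProcessBuilder("sh", "-c", "trap '' TERM; echo ready; exec sleep 20")
                .redirectErrorStream(true)
                .start();
        // Block until the shell reports that the trap is installed.
        BufferedReader out = new BufferedReader(new InputStreamReader(p.getInputStream()));
        out.readLine();
        return p;
    }

    /** Runs "kill <pid>" (SIGTERM) through the shell and returns the exit code of kill. */
    static int sendSigterm(long pid) throws IOException, InterruptedException {
        return new ProcessBuilder("sh", "-c", "kill " + pid).inheritIO().start().waitFor();
    }

    public static void main(String[] args) throws Exception {
        Process target = startTermIgnoringProcess();
        try {
            int exitCode = sendSigterm(target.pid());
            boolean exited = target.waitFor(2, TimeUnit.SECONDS);
            System.out.println("kill exit code: " + exitCode);
            System.out.println("target exited:  " + exited);
        } finally {
            target.destroyForcibly(); // clean up the demo process
        }
    }
}
```

   On a typical Unix system this is expected to report exit code 0 while the
   target is still running, which is the situation killProcess() reports as success.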

   How to test
   First, provide unit tests (JUnit) for the improved implementation (for the methods
   sending SIGTERM, SIGKILL, etc.). Then provide a TS test for the "sdmadm sdj"
   command, ideally one that involves shutting down a JVM process that ignores
   SIGTERM.

   ATC: 1 PD (because of <BEEP/> problems with issuezilla)
   ETC: 5 PD
               ------- Additional comments from rhierlmeier Mon Nov 10 23:54:16 -0700 2008 -------
   The proposed solution for the problem is highly dangerous. I am totally against
   fixing it in this way.

   We should find out why the JVMs do not stop; it can only be that a thread
   inside the JVM did not die.

   If any of the developers runs into this problem again, please use jstack to get
   the stack traces of all threads of the "unresponsive" JVM.

   After finishing this discussion I would further raise the priority of this issue
   because the described scenario can be an indicator for a deadlock.

               ------- Additional comments from adoerr Tue Nov 11 03:13:43 -0700 2008 -------
   I agree with Richard. We need to find out what is the root cause for this
   problem. Using SIGKILL in addition to SIGTERM to kill a JVM is an option.
   However, it should only be used as a last resort. Killing a JVM with SIGKILL can
   easily introduce all kinds of inconsistencies.

               ------- Additional comments from afisch Thu Dec 4 10:20:41 -0700 2008 -------
   This issue most likely is related to task 607
               ------- Additional comments from rhierlmeier Wed Nov 25 07:21:11 -0700 2009 -------
   Milestone changed
