Opened 47 years ago

Last modified 6 years ago

#909 new task

IZ607: Thread may not terminate if its interrupt flag is cleared

Reported by: afisch Owned by:
Priority: high Milestone:
Component: hedeby Version: 1.0
Severity: Keywords: Sun misc
Cc:

Description

[Imported from gridengine issuezilla http://gridengine.sunsource.net/issues/show_bug.cgi?id=607]

        Issue #:      607          Platform:     Sun         Reporter: afisch (afisch)
       Component:     hedeby          OS:        All
     Subcomponent:    misc         Version:      1.0            CC:    None defined
        Status:       NEW          Priority:     P2
      Resolution:                 Issue type:    TASK
                               Target milestone: 1.0u5next
      Assigned to:    adoerr (adoerr)
      QA Contact:     adoerr
          URL:
       * Summary:     Thread may not terminate if its interrupt flag is cleared
   Status whiteboard:
      Attachments:


     Issue 607 blocks:
   Votes for issue 607:     Vote for this issue


   Opened: Thu Dec 4 09:30:00 -0700 2008 
------------------------


   Description:
   The recommended way to stop a running thread is to simply interrupt it. This
   means to call the treads interrupt() method. If we work with executors the
   future.cancel mechanism exactly depends on the thread interruption mechanism.

   JavaDoc explains the consequences of the interruption call as follows:

   If this thread is blocked in an invocation of the wait(), wait(long), or
   wait(long, int) methods of the Object class, or of the join(), join(long),
   join(long, int), sleep(long), or sleep(long, int), methods of this class, then
   its interrupt status will be cleared and it will receive an InterruptedException.

   If this thread is blocked in an I/O operation upon an interruptible channel then
   the channel will be closed, the thread's interrupt status will be set, and the
   thread will receive a ClosedByInterruptException.

   If this thread is blocked in a Selector then the thread's interrupt status will
   be set and it will return immediately from the selection operation, possibly
   with a non-zero value, just as if the selector's wakeup method were invoked.

   If none of the previous conditions hold then this thread's interrupt status will
   be set.

   This implies that the termination request for the thread can be ether detected
   by checking its interrupted flag XOR the thread will get an
   InterruptedException. If this InterruptedException is catched and not handled in
   a proper way the interruption information is lost and the tread might not know
   anymore that it should terminate. The problem that the interrupted information
   gets lost might occur within our own code where we can fix it or in external
   code where a workaround might be complicated. This can lead to serious problems
   if we for example need to shut down a component that internally manages a set of
   worker threads.



   Evaluation:
   The issue is rated as a p2 task. We should try to fix interruption unaware code
   soon, as it is potentially dangerous. Additionally discussing/fixing this issue
   will rise the awareness of how to avoid problems with thread interruption.



   Suggested Fix / Work Around:
   To fix this problem we will have to adress three aspekts of the problem.

   1) To minimize the risk that an interupted thread remains waiting in alien code
   or that it returns from alien code with a cleared interrupted flag, we have to
   implement any thread termination in a specific way. The thread that calls the
   interrupt for the other threads should check if the threads really terminate and
   if not, it should retry to interrupt any remaining thread. This should be
   adressed with a reimplemented Executor service. Refreshing the interrupted
   information would allread significantly decrease the chance that a thread
   remains stuck somewhere in the code.

   2) Check interuptable code whether it is interuption aware. This means that we
   have to ensure that nowhere in our code the interruption information will get
   lost if it is transformed into a Interrupted exception.

   3) The logging routines we use are alie ncode that not interruption aware.
   Logging clears the interrupted flag and does not throw an InterruptedException.
   Additionally the logged information gets lost (see issue#538). But as logging is
   used frequently in our code we should try to make the logging interrupt aware,
   too. This can be done with a solution proposed for issue 538:
      a) If an already interrupted thread entering the logging routines. The
   interrupted flag is temporarily removed from the thread. Thus the logging can be
   done in a safe way. When leaving the logging routines the interrupted flag of
   the thread is restored.
      b) The first change will not help if the interruption arises while the thread
   is already within the logging routines. To avoid this situation a
   synchronization mechanism is needed. Logging and thread interruption should be
   mutual exclusive operations.

   If 1) is done a lot of problems will be solved as the interrupted information
   can not get lost any more. However a thread could still be stuck in code that
   clears the interrupted information and loops. We can further minimize this risk
   by fixing 2). Fixing 3) will not further decrease the risk but only ensures that
   logging is possible while shutting down and it might speed up the shut down as
   the interrupted information will not get lost. However after fixing this issue
   the problem remains that a thread may get stuck in loops within alien code.
   After the fix, the risks of a failing thread termination is only significantly
   reduced!

   Analysis:

   1) "Thread interruption with retry" feature:
   We have to look for places that might involve thread interruption() calls.
   Currently there are 18 classes:
   Torsten provided the tool ack, that makes it especially comfortable to search
   for critical locations in the code.

   ack --java '.interrupt\(|.cancel\(|.shutdown\(' --ignore-dir=test -l
   security/src/com/sun/grid/grm/security/ca/impl/GrmCAServiceImpl.java
   common/src/com/sun/grid/grm/ui/component/StopJVMCommand.java
   common/src/com/sun/grid/grm/executor/impl/ExecutorImpl.java
   common/src/com/sun/grid/grm/util/prefs/FilePreferences.java
   common/src/com/sun/grid/grm/util/NonblockingCommand.java
   common/src/com/sun/grid/grm/util/EventListenerSupport.java
   common/src/com/sun/grid/grm/sparepool/SparePoolServiceImpl.java
   common/src/com/sun/grid/grm/bootstrap/JVM.java
   common/src/com/sun/grid/grm/bootstrap/JVMImpl.java
   common/src/com/sun/grid/grm/service/impl/ServiceCachingProxy.java
   common/src/com/sun/grid/grm/service/slo/impl/DefaultSLOManager.java
   common/src/com/sun/grid/grm/service/slo/impl/RunnableSLOManager.java
   common/src/com/sun/grid/grm/service/slo/SLOManager.java
   common/src/com/sun/grid/grm/resource/impl/RequestQueueImpl.java
   common/src/com/sun/grid/grm/resource/impl/ResourceManagerImpl.java
   common/src/com/sun/grid/grm/resource/impl/OsProvisionerLiason.java
   ge-adapter/src/com/sun/grid/grm/service/impl/ge/GEServiceImpl.java
   ge-adapter/src/com/sun/grid/grm/service/impl/ge/HostManager.java


   Examples are:
   %ack --java '.interrupt\(|.cancel\(|.shutdown\(' -A 3 --ignore-dir=test
   ...
   common/src/com/sun/grid/grm/service/impl/ServiceCachingProxy.java
   1522:                    scheduledRefresh.cancel(false);
   1523-                }
   1524-                scheduledRefresh = actionExecutor.schedule(new
   RefreshAction(), timeout, TimeUnit.MILLISECONDS);
   1525-            }
   --
   3322:                        entry.getValue().cancel(false);
   3323-                        iterator.remove();
   3324-                    }
   3325-                }

   common/src/com/sun/grid/grm/service/slo/impl/DefaultSLOManager.java
   180:    public boolean interrupt() {
   181-        log.entering(DefaultSLOManager.class.getName(), "interrupt");
   182-        boolean ret = false;
   183-        SLOContext tmpCtx = null;
   --
   214:        interrupt();
   215-        waitForSLORunFinished();
   216-    }
   217-
   --
   249:                    interrupt();
   250-                    lock.lock();
   251-                    try {
   252-                        waitForSLORunFinished();

   ...

   We have to keep in mind that a lot of these problematic locations are gone with
   the redesigned Component framework! For the remaining places we have to consider
   to implement a loop that retries the thread interruption after a defined waiting
   timeout. This should be done with an enhanced executor service. All new threads
    should be started by this executor service and a reference on any started
   thread should be kept. The shutdown of these threads should be exclusively done
   by a enhanced shutdown method. Threads that remain running should be interrupted
   again until all threads are terminated.

   2) Fix code that is not interruption aware:
   Example code constructs that we have in our code:
   A)
     try {
         //code that may throw an InterruptedException
     } catch (Exception e) {
         // code that does not lead to exit of the thread
     }


   B)
     try {
         //code that may throw an InterruptedException
     } catch (InterruptedException e) {
         // code that does not lead to exit of the thread
     }

   C)
     while (!Thread.currentThread().isInterrupted()) {
       try {
         //code that may throw an InterruptedException
       } catch (InterruptedException e) {
         /* If we're interrupted, just reevaluate the conditional. */
         continue;
       }
     }

   All examples share the problem that they do not consider that the interrupted
   information is lost once the InterruptedException is catched. Thus we have to
   look for all
   location where we catch Exception, InterruptedException,
   ClosedByInterruptException, InterruptedIOException or InterruptedNamingException.


   The ack search returns 104 locations:

   %ack --java 'catch\s*\(\s*(Exception|Interrupted|ClosedByInterrupt)'
   --ignore-dir=test -l

   build/tmp/src/man_jsp.java
   security/src/com/sun/grid/grm/security/ui/CreateDaemonCommandDelegate.java
   security/src/com/sun/grid/grm/security/ui/CreateJAASConfigCommand.java
   security/src/com/sun/grid/grm/security/ui/CreateSecurityDirectoriesCommand.java
   security/src/com/sun/grid/grm/security/ui/RenewCACommandDelegate.java
   security/src/com/sun/grid/grm/security/ui/GetUserKeyStoreCommandDelegate.java
   security/src/com/sun/grid/grm/security/ui/RenewDaemonCommandDelegate.java
   security/src/com/sun/grid/grm/security/ui/StoreKeyStoreCommand.java
   security/src/com/sun/grid/grm/security/ui/GetDaemonKeyStoreCommandDelegate.java
   security/src/com/sun/grid/grm/security/ui/CreateUserCommandDelegate.java
   security/src/com/sun/grid/grm/security/ui/RenewUserCommandDelegate.java
   security/src/com/sun/grid/grm/security/SSLHelper.java
   security/src/com/sun/grid/grm/security/AskingX509TrustManager.java
   security/src/com/sun/grid/grm/security/ca/KeyStoreWrapper.java
   security/src/com/sun/grid/grm/security/DefaultClientSecurityContext.java
   security/src/com/sun/grid/grm/security/login/DelegatingLoginModule.java
   security/src/com/sun/grid/grm/security/DefaultServerSecurityContext.java
   security/src/com/sun/grid/grm/security/cli/RenewCertificateCliCommand.java
   security/src/com/sun/grid/grm/security/cli/ShowCertificateCliCommand.java
   util/ant/src/com/sun/grid/grm/util/template/TemplateFactory.java
   common/gensrc/com/sun/grid/grm/util/filter/FilterParser.java
   common/src/com/sun/grid/grm/cli/AbstractCli.java
   common/src/com/sun/grid/grm/cli/cmd/administration/InstallMasterCliCommand.java
   common/src/com/sun/grid/grm/cli/cmd/administration/InstallManagedHostCliCommand.java
   common/src/com/sun/grid/grm/cli/cmd/monitoring/ShowComponentStatusCliCommand.java
   common/src/com/sun/grid/grm/cli/cmd/monitoring/ShowServicesCliCommand.java
   common/src/com/sun/grid/grm/cli/cmd/resources/AbstractResourceCliCommand.java
   common/src/com/sun/grid/grm/management/ConnectionProxy.java
   common/src/com/sun/grid/grm/management/ComponentWrapper.java
   common/src/com/sun/grid/grm/management/SecureMBeanServer.java
   common/src/com/sun/grid/grm/management/ComponentProxy.java
   common/src/com/sun/grid/grm/ui/component/StopJVMCommand.java
   common/src/com/sun/grid/grm/ui/component/StartJVMCommand.java
   common/src/com/sun/grid/grm/ui/component/BindCommand.java
   common/src/com/sun/grid/grm/ui/component/AbstractStopComponentCommand.java
   common/src/com/sun/grid/grm/ui/component/service/AddActiveComponentCommand.java
   common/src/com/sun/grid/grm/ui/component/StartComponentCommand.java
   common/src/com/sun/grid/grm/ui/component/RebindCommand.java
   common/src/com/sun/grid/grm/ui/component/AbstractComponentCommand.java
   common/src/com/sun/grid/grm/ui/install/UninstallHostFromPreferencesCommand.java
   common/src/com/sun/grid/grm/ui/install/CheckSTRegistrationCommand.java
   common/src/com/sun/grid/grm/ui/install/CheckSTSupportCommand.java
   common/src/com/sun/grid/grm/ui/install/RemoveSTEntryCommand.java
   common/src/com/sun/grid/grm/ui/install/UninstallSMFCommand.java
   common/src/com/sun/grid/grm/ui/install/CheckMasterHostInstallationCommand.java
   common/src/com/sun/grid/grm/ui/install/RemoveDirCommand.java
   common/src/com/sun/grid/grm/ui/install/AbstractCreateDirectoriesCommand.java
   common/src/com/sun/grid/grm/ui/install/CheckSMFSupportCommand.java
   common/src/com/sun/grid/grm/ui/install/AddSTEntryCommand.java
   common/src/com/sun/grid/grm/ui/install/InstallRcScriptCommand.java
   common/src/com/sun/grid/grm/ui/install/InstallSMFCommand.java
   common/src/com/sun/grid/grm/ui/install/CreateMasterHostDirsCommand.java
   common/src/com/sun/grid/grm/ui/install/UninstallRcScriptCommand.java
   common/src/com/sun/grid/grm/ui/install/CreateManagedHostDirsCommand.java
   common/src/com/sun/grid/grm/ui/resource/AbstractSingleResourceActionCommand.java
   common/src/com/sun/grid/grm/ui/resource/ShowResourceStateCommand.java
   common/src/com/sun/grid/grm/ui/resource/AbstractMultipleResourceActionCommand.java
   common/src/com/sun/grid/grm/ui/resource/ShowBlackListCommand.java
   common/src/com/sun/grid/grm/ui/impl/ConfigurationServiceImpl.java
   common/src/com/sun/grid/grm/executor/impl/ShellCommand.java
   common/src/com/sun/grid/grm/executor/impl/ExecutorImpl.java
   common/src/com/sun/grid/grm/security/Password.java
   common/src/com/sun/grid/grm/util/MainWrapper.java
   common/src/com/sun/grid/grm/util/Platform.java
   common/src/com/sun/grid/grm/util/UnixPlatform.java
   common/src/com/sun/grid/grm/util/Hostname.java
   common/src/com/sun/grid/grm/util/prefs/FilePreferences.java
   common/src/com/sun/grid/grm/util/filter/FilterHelper.java
   common/src/com/sun/grid/grm/util/EventListenerSupport.java
   common/src/com/sun/grid/grm/reporting/impl/ReporterServiceManager.java
   common/src/com/sun/grid/grm/os/impl/OsProvisionerImpl.java
   common/src/com/sun/grid/grm/os/impl/OsDispatcherImpl.java
   common/src/com/sun/grid/grm/GrmComponentDelegate.java
   common/src/com/sun/grid/grm/sparepool/SparePoolServiceImpl.java
   common/src/com/sun/grid/grm/bootstrap/SystemUtil.java
   common/src/com/sun/grid/grm/bootstrap/Modules.java
   common/src/com/sun/grid/grm/bootstrap/PreferencesUtil.java
   common/src/com/sun/grid/grm/bootstrap/JVMImpl.java
   common/src/com/sun/grid/grm/bootstrap/ParentStartupService.java
   common/src/com/sun/grid/grm/service/impl/SimpleServiceStore.java
   common/src/com/sun/grid/grm/service/impl/ServiceCachingProxy.java
   common/src/com/sun/grid/grm/service/impl/AbstractServiceManager.java
   common/src/com/sun/grid/grm/service/impl/ServiceCachingProxyManager.java
   common/src/com/sun/grid/grm/service/slo/impl/DefaultSLOManager.java
   common/src/com/sun/grid/grm/service/slo/impl/RunnableSLOManager.java
   common/src/com/sun/grid/grm/service/slo/SLOFactory.java
   common/src/com/sun/grid/grm/resource/ResourceManagerGuard.java
   common/src/com/sun/grid/grm/resource/impl/RequestQueueImpl.java
   common/src/com/sun/grid/grm/resource/impl/ResourceManagerImpl.java
   common/src/com/sun/grid/grm/resource/impl/SimpleResourceStore.java
   common/src/com/sun/grid/grm/resource/impl/FileResourceStore.java
   common/src/com/sun/grid/grm/resource/impl/OsProvisionerLiason.java
   common/src/com/sun/grid/grm/resource/impl/RequestImpl.java
   common/src/com/sun/grid/grm/config/XMLUtil.java
   ge-adapter/src/com/sun/grid/grm/service/impl/ge/ExecdInstaller.java
   ge-adapter/src/com/sun/grid/grm/service/impl/ge/GEServiceImpl.java
   ge-adapter/src/com/sun/grid/grm/service/impl/ge/HostFileStore.java
   ge-adapter/src/com/sun/grid/grm/service/impl/ge/GEConnection.java
   ge-adapter/src/com/sun/grid/grm/service/impl/ge/GEServiceConfigHelper.java
   ge-adapter/src/com/sun/grid/grm/service/impl/ge/InstallationSequence.java
   ge-adapter/src/com/sun/grid/grm/service/impl/ge/ExecdUninstaller.java
   ge-adapter/src/com/sun/grid/grm/service/impl/ge/HostManager.java
   ge-adapter/src/com/sun/grid/grm/service/impl/ge/HostImpl.java
   ge-adapter/src/com/sun/grid/grm/service/impl/ge/HostSpooling.java


   There are 234 positions in the code that have to be considered more closely:
   (-A 3 means show the next 3 lines after a hit, too.)
   %ack --java 'catch\s*\(\s*(Exception|Interrupted|ClosedByInterrupt)' -A 3
   --ignore-dir=test

   ...
   common/src/com/sun/grid/grm/cli/cmd/resources/AbstractResourceCliCommand.java
   132:                } catch (InterruptedException ex) {
   133-                // Ignore
   134-                }
   135-
   --
   543:            } catch (InterruptedException ex) {
   544-                return
   getBundle().getString("AbstractResourceCliCommand.interrupted");
   545-            }
   546-        }
   --
   566:            } catch (InterruptedException ex) {
   567-            // Ignore
   568-            }
   569-        }

   common/src/com/sun/grid/grm/management/ConnectionProxy.java
   214:        } catch(Exception ex) {
   215-            Throwable t = getRealError(ex);
   216-            if (t instanceof IOException || t instanceof GrmRemoteException) {
   217-                // Reconnect
   --
   224:                } catch(Exception ex1) {
   225-                    throw getRealError(ex1);
   226-                }
   227-            } else {
   --
   382:                    } catch(Exception ex) {
   383-                        if(log.isLoggable(Level.FINE)) {
   384-                            log.log(Level.FINE, "ConnectionProxy.dnff", ex);
   385-                        }
   --
   510:                            } catch(Exception ex) {
   511-                                LogRecord lr = new LogRecord(Level.WARNING,
   "ConnectionProxy.ex.reReg");
   512-                                lr.setParameters(new Object [] {name,
   ex.getLocalizedMessage()});
   513-                                lr.setThrown(ex);
   --
   609:                    } catch(Exception ex) {
   610-                        // Ignore
   611-                    }
   612-                    try {


   All these positions in code have to be considered carefully.



   3) Interruption aware logging:
   Examples for external code is the logging framework we use (eg. log.log(...)).
   If thread is interrupted before or while logging something with the logging
   framework, the interrupted information is lost.The problem is rooted in the used
   StreamHandler that will report any interruption error but does not reestablish
   the interrupted information:

   java.util.logging.StreamHandler

       public synchronized void publish(LogRecord record) {
           if (!isLoggable(record)) {
               return;
           }
           String msg;
           try {
               msg = getFormatter().format(record);
           } catch (Exception ex) {
               // We don't want to throw an exception here, but we
               // report the exception to any registered ErrorManager.
               reportError(null, ex, ErrorManager.FORMAT_FAILURE);
               return;
           }

           try {
               if (!doneHeader) {
                   writer.write(getFormatter().getHead(this));
                   doneHeader = true;
               }
               writer.write(msg);
           } catch (Exception ex) {
               // We don't want to throw an exception here, but we
               // report the exception to any registered ErrorManager.
               reportError(null, ex, ErrorManager.WRITE_FAILURE);
           }
       }

   The logging problem can be solved as described for issue538.



   Test:
   ?



   ETC:
   ?

   We should discuss how this issue should be tested and what would be a reasonable
   ETC.
               ------- Additional comments from zwierzak Thu Dec 4 09:40:35 -0700 2008 -------
   Fixed summary so there is no confusion with executor as hedeby component in
   first place :)
               ------- Additional comments from afisch Thu Dec 4 09:50:16 -0700 2008 -------
   Thanks Rys, I changed it once more as it was still somehow odd.
               ------- Additional comments from rhierlmeier Wed Nov 25 07:21:11 -0700 2009 -------
   Milestone changed

Change History (0)

Note: See TracTickets for help on using tickets.