Opened 50 years ago

Last modified 9 years ago

#886 new defect

IZ557: Uninstall problems for managed hosts

Reported by: afisch Owned by:
Priority: normal Milestone:
Component: hedeby Version: 1.0
Severity: Keywords: cli
Cc:

Description

[Imported from gridengine issuezilla http://gridengine.sunsource.net/issues/show_bug.cgi?id=557]

        Issue #:      557          Platform:     All         Reporter: afisch (afisch)
       Component:     hedeby          OS:        All
     Subcomponent:    cli          Version:      1.0            CC:    None defined
        Status:       NEW          Priority:     P3
      Resolution:                 Issue type:    DEFECT
                               Target milestone: 1.0u5next
      Assigned to:    adoerr (adoerr)
      QA Contact:     adoerr
          URL:
       * Summary:     Uninstall problems for managed hosts
   Status whiteboard:
      Attachments:


     Issue 557 blocks:


   Opened: Fri Aug 8 02:07:00 -0700 2008 
------------------------


   Description:

   A set of problems occurs if a managed host instance is uninstalled under
   different circumstances. The problems listed below occurred in a user-mode SDM
   installation, but they will also happen with a system installation!

   1.) Uninstalling a managed host twice leads to strange behavior:

   % sdmadm -s sdmaf -p user uh
   Uninstallation of managed host: XXX finished successfully.

   % sdmadm -s sdmaf -p user uh
   The following certifcate is not trusted by your certificate authority
       Issued To: EMAILADDRESS=a, UID=CA, CN=SGE Certificate Authority, OU=a, O=a,
   L=a, ST=a, C=AA
   Serial Number: af47ef71979eb404
       Issued By: EMAILADDRESS=fischer@sun.com, UID=CA, CN=SGE Certificate
   Authority, OU=a, O=a, L=a, ST=a, C=AA
        Validity: From Tue Aug 05 18:21:22 MEST 2008
                    To Mon Aug 05 18:21:22 MEST 2013
    Fingerprints: SHA1withRSA

   06:17:8f:f2:6d:32:4b:ea:f4:85:00:c2:dd:59:2f:11:0a:c3:14:cb:2f:e5:97:0d:fc:a9:e5:11:79:83:cf:bb:5a:6c

   37:dc:20:3c:f0:9f:d9:c2:87:d7:b2:f0:2a:fe:03:41:46:58:dc:3b:6b:4c:cd:06:59:d7:b9:7b:33:42:7c:ef:e0:db

   4f:a0:55:9a:0d:d8:67:9f:6d:64:68:9e:47:e6:14:e5:b5:6f:9e:89:97:a5:a0:a5:81:b5:1a:8e:af:f1:81:cb:e7:91

   1e:c4:93:74:f9:69:66:48:20:31:6f:e2:a0:36:aa:a4:f7:d0:12:f5:68:83:27:e1:a3:b3
   Trust this certificate?  [(Y)es/(N)o] (Y) > y
   Error: permission denied

   Additionally, if the "trust this certificate" question is answered with "no"
   the first time, the system asks the question again and again.

   2.) The same behavior occurs if the master host is reinstalled and started
   before the managed host is deinstalled.

   3.) The -force option does not help to uninstall the managed host in case 2
   either. Again the "trust this certificate" question is asked. Only if the
   master host is shut down does the -force option lead to a successful uninstall.




   Evaluation:

   The issue is considered a P3 defect because the "trust this certificate" dialog
   is not working properly. However, the problem can be avoided by correct usage
   of the command, and a workaround is available if the system is stuck in
   problems 1.) - 3.).

   Suggested Fix / Work Around:

   1.) The uninstall command should report that no system was found if it is
   run a second time. The rest of the strange behavior is covered by 2.).

   2.)
   Here it would be better to change the order of the checks and error messages:
   The command should first check whether the user is accepted by the CS. If the
   user is not accepted, the following error message should be displayed (instead
   of "Error: permission denied"):

           The automatic user authentication failed. Keystore <location> is <missing/invalid>.
           Please use the global -ppw option to connect to the server without a valid keystore.

   and the command should exit. If the user is accepted, the system should ask the
   "trust this certificate" question. It would be helpful if the message were
   changed to:

           The pem file <pem.file> for automatic server certification is <invalid / missing>.
           The server can be connected anyway if the server certificate below is trusted:

             Issued To: EMAILADDRESS=fischer@sun.com, UID=CA, CN=SGE Certificate
   Authority, OU=a, O=a, L=a, ST=a, C=AA
             Serial Number: af47ef71979eb404
               Issued By: EMAILADDRESS=fischer@sun.com, UID=CA, CN=SGE Certificate
   Authority, OU=a, O=a, L=a, ST=a, C=AA
               Validity: From Tue Aug 05 18:21:22 MEST 2008
               To Mon Aug 05 18:21:22 MEST 2013
               Fingerprints: SHA1withRSA


   06:17:8f:f2:6d:32:4b:ea:f4:85:00:c2:dd:59:2f:11:0a:c3:14:cb:2f:e5:97:0d:fc:a9:e5:11:79:83:cf:bb:5a:6c

   37:dc:20:3c:f0:9f:d9:c2:87:d7:b2:f0:2a:fe:03:41:46:58:dc:3b:6b:4c:cd:06:59:d7:b9:7b:33:42:7c:ef:e0:db

   4f:a0:55:9a:0d:d8:67:9f:6d:64:68:9e:47:e6:14:e5:b5:6f:9e:89:97:a5:a0:a5:81:b5:1a:8e:af:f1:81:cb:e7:91

   1e:c4:93:74:f9:69:66:48:20:31:6f:e2:a0:36:aa:a4:f7:d0:12:f5:68:83:27:e1:a3:b3

           Trust this certificate?  [(Y)es/(N)o] (Y) >



   The command should exit if answered with "no" or continue the uninstall if the
   question is answered with "yes".
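
   The proposed order of checks can be sketched as a small decision function.
   This is only an illustration under assumptions: ConnectOutcome, ConnectFlow
   and the parameter names are hypothetical and do not exist in Hedeby. The
   sketch shows that the trust question is never asked for a rejected user and
   is asked at most once otherwise.

```java
import java.util.function.BooleanSupplier;

// Hypothetical model of the proposed check order; these names are not Hedeby API.
enum ConnectOutcome { AUTH_FAILED, CERT_REJECTED, PROCEED }

class ConnectFlow {
    /**
     * userAccepted: the CS accepts the user (valid keystore or -ppw login).
     * pemFileValid: the server certificate can be verified automatically.
     * askTrust:     asks the "Trust this certificate?" question; true means yes.
     */
    static ConnectOutcome decide(boolean userAccepted, boolean pemFileValid, BooleanSupplier askTrust) {
        if (!userAccepted) {
            // Report the keystore problem and exit without asking about trust.
            return ConnectOutcome.AUTH_FAILED;
        }
        if (pemFileValid) {
            // No question needed; the certificate is verified automatically.
            return ConnectOutcome.PROCEED;
        }
        // Ask exactly once; "no" ends the command, "yes" continues the uninstall.
        return askTrust.getAsBoolean() ? ConnectOutcome.PROCEED : ConnectOutcome.CERT_REJECTED;
    }
}
```

   With this shape the caller maps AUTH_FAILED and CERT_REJECTED to an exit and
   only continues the uninstall on PROCEED.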


   3.) If the -force option is used, the system should either not try to contact
   the CS service at all and do the uninstall anyway, or it should do the
   uninstall even if the connection could not be established.

   No workaround is needed for 1.) as the second uninstall does not harm the
   system. For cases 2.) and 3.) the uninstall should be done with the -ppw option.
   If the uninstall is still not possible, the master host should be shut down
   and then the uninstall_host command should be rerun with the -force option.




   Analysis:
   1.) The problem is rooted in UninstallHostCommand:execute:86

          if (getPreferencesType().equals(PreferencesType.SYSTEM) && !Platform.getPlatform().isSuperUser()) {
              throw new GrmException("uninstallhost.error.systempreferences.notsuperuser", BUNDLE);
          }

   Another if-condition for the user mode should be appended to check whether the
   system exists. If it does not, a proper GrmException should be thrown.
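
   A minimal sketch of such an additional precondition check follows. It is
   self-contained and therefore uses hypothetical stand-ins (PrefsType,
   UninstallPreconditionException, UninstallPreconditions, and a plain set of
   known system names) instead of the real Hedeby classes. As an assumption, the
   existence check is applied to both preferences types, since the description
   states that the problem also occurs with a system installation.

```java
import java.util.Set;

// Hypothetical stand-ins; these are NOT the real Hedeby types.
enum PrefsType { USER, SYSTEM }

class UninstallPreconditionException extends Exception {
    UninstallPreconditionException(String message) {
        super(message);
    }
}

class UninstallPreconditions {
    /**
     * Rejects the uninstall early when its preconditions are not met.
     * knownSystems models the systems that are still registered for the
     * selected preferences type.
     */
    static void check(PrefsType type, boolean superUser, String systemName, Set<String> knownSystems)
            throws UninstallPreconditionException {
        // Existing check from the ticket: system preferences require the super user.
        if (type == PrefsType.SYSTEM && !superUser) {
            throw new UninstallPreconditionException("system preferences require the super user");
        }
        // Proposed additional check: the system must still be present.
        if (!knownSystems.contains(systemName)) {
            throw new UninstallPreconditionException("no system '" + systemName + "' found");
        }
    }
}
```

   With a check like this, the second uninstall from case 1.) would end with a
   clear "no system found" error instead of reaching the connection code.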


   2.) The strange behavior results from two different problems: The repeated
   questions are asked because the system tries to establish the CS connection
   multiple times but does not remember untrusted certificates. Additionally, the
   corresponding code is not thread-safe, so the question for trust is asked in
   parallel.

   Richard provided an alternative implementation of the method that causes both
   problems:

          com.sun.grid.grm.security.AskingX509TrustManager


          // Certificates the user has already rejected; also used as the lock object.
          private final List<X509Certificate> untrustedCerts = new LinkedList<X509Certificate>();
          // Thread that is currently asking the user, or null if nobody is asking.
          private Thread askingThread = null;

          private boolean trustCeritificate(final X509Certificate[] chain) {

              X509Certificate cert = chain[chain.length - 1];
              boolean ret = true;

              synchronized (untrustedCerts) {
                  // Re-entrant call from the asking thread itself: do not ask again.
                  if (Thread.currentThread().equals(askingThread)) {
                      return false;
                  }
                  // Only one thread may ask the user at a time.
                  while (askingThread != null) {
                      try {
                          untrustedCerts.wait();
                      } catch (InterruptedException ex) {
                          Thread.currentThread().interrupt();
                          return false;
                      }
                  }
                  // Checked after waiting so that a rejection given in the meantime is seen.
                  if (untrustedCerts.contains(cert)) {
                      return false;
                  }
                  askingThread = Thread.currentThread();
              }

              try {
                  StringWriter sw = new StringWriter();
                  Printer pw = new Printer(new PrintWriter(sw));

                  pw.println(SecurityConstants.getString("AskingX509TrustManager.certNotTrusted"));
                  printCertificate(cert, pw);
                  pw.print(SecurityConstants.getString("AskingX509TrustManager.trustCert"));
                  pw.print(' ');
                  pw.flush();

                  ConfirmationCallback[] callbacks = new ConfirmationCallback[] {
                      new ConfirmationCallback(sw.getBuffer().toString(),
                                               ConfirmationCallback.INFORMATION,
                                               ConfirmationCallback.YES_NO_OPTION,
                                               ConfirmationCallback.YES)
                  };

                  try {
                      callbackHandler.handle(callbacks);
                      if (callbacks[0].getSelectedIndex() != ConfirmationCallback.YES) {
                          ret = false;
                      }
                  } catch (IOException ex) {
                      ret = false;
                  } catch (UnsupportedCallbackException ex) {
                      ret = false;
                  }

                  // Remember the answer so the same certificate is not asked about again.
                  synchronized (untrustedCerts) {
                      if (ret) {
                          addCertToCache(cert);
                      } else {
                          untrustedCerts.add(cert);
                      }
                  }
              } finally {
                  synchronized (untrustedCerts) {
                      askingThread = null;
                      // Wake all waiters; a waiter may return without becoming the asking thread.
                      untrustedCerts.notifyAll();
                  }
              }
              return ret;
          }


   Additionally, the error messages have to be changed as suggested in the
   section "Suggested Fix / Work Around": The "permission denied" related error
   message can be covered by a CheckAccessCommand that might already be covered by
   issue 541. Please be aware that "permission denied" is also displayed at other
   locations in the code when the user is not an SDM admin user! The message that
   explains the reason for the trust procedure should be added in
   Security.properties:

           AskingX509TrustManager.certNotTrusted=The following certifcate is not trusted by your certificate authority
           AskingX509TrustManager.trustCert=Trust this certificate?

   The corresponding class is com.sun.grid.grm.security.AskingX509TrustManager.


   3.) The problem lies in UninstallHostCommand:execute:105. If the CS can be
   contacted, the uninstall command tries to shut down the components of the
   managed host before uninstalling, even if the -force option is used. Two
   reasonable options are possible here:

       a.) The command should try to contact the CS and shut down the components.
           But if the CS handshake fails it should continue uninstalling.

           Wrap UninstallHostCommand:execute:107-115 in a try-catch block.
           If the -force option is used, just report any caught exception and
           continue; otherwise rethrow the caught exception!

       b.) The command should not try to contact the CS and just do the
           uninstall anyway.

           Change UninstallHostCommand:execute:105:  if (csRunning) {
           to:   if (csRunning && !isForced()) {

   If the -force option is not used, the system should continue to try to shut
   down the components before uninstallation.
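
   Option a.) can be sketched as follows. The sketch is self-contained and uses
   hypothetical names (ForcedUninstall, ComponentShutdown, removeLocalFiles); it
   does not reproduce the real UninstallHostCommand code and only illustrates
   the intended control flow: a failed shutdown is reported and ignored with
   -force, and rethrown without it.

```java
import java.util.ArrayList;
import java.util.List;

class ForcedUninstall {
    /** Hypothetical stand-in for contacting the CS and stopping the host's components. */
    interface ComponentShutdown {
        void run() throws Exception;
    }

    /**
     * Runs the shutdown step, then the local removal step.
     * With force, a failed shutdown is recorded as a warning and the removal
     * still happens; without force, the failure is rethrown and nothing is removed.
     */
    static List<String> uninstall(boolean force, ComponentShutdown shutdown, Runnable removeLocalFiles)
            throws Exception {
        List<String> warnings = new ArrayList<>();
        try {
            shutdown.run();
        } catch (Exception ex) {
            if (!force) {
                throw ex;
            }
            warnings.add("component shutdown skipped: " + ex.getMessage());
        }
        removeLocalFiles.run();
        return warnings;
    }
}
```

   Option b.) would instead skip the shutdown step entirely when force is set,
   which avoids the CS handshake and therefore the trust question.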




   How to test:

   1.) This problem can be tested with a JUnit test that executes the
   UninstallHostCommand with a dummy environment. A TS test should check the
   behavior on the command line level.

   2.) This part can only be tested as a black box because it affects details of
   the SSL implementation that are out of scope. However, a TS test should cover
   the following cases:

       uninstall is possible if:
           a.) man.host offline && can contact the cs && authentication files ok.
           b.) man.host offline && can contact the cs && no authentication files
               && -ppw option && trust => yes.

       uninstall should fail if ANY of the following conditions applies:
           a.) man.host online.
           b.) cannot contact the cs.
           c.) no authentication files && no -ppw option.
           d.) no authentication files && -ppw option && trust => no.

   These cases should be tested for both user and system installation!


   3.) Here a TS test should repeat all cases listed in 2.) with the additional
   -force option. Now all uninstalls should finish successfully regardless of the
   other options.
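
   The expected outcomes from 2.) and 3.) can be written down as one predicate
   that a test could compare against the real command result. This is a sketch
   only: UninstallExpectation and its parameter names are hypothetical, and it
   assumes that valid authentication files alone are sufficient once the CS is
   reachable, which is how case a.) above reads.

```java
class UninstallExpectation {
    /** Returns true if the uninstall is expected to succeed for the given situation. */
    static boolean shouldSucceed(boolean hostOnline, boolean csReachable, boolean authFilesOk,
                                 boolean ppw, boolean trustYes, boolean force) {
        if (force) {
            return true;        // 3.) with -force every uninstall should succeed
        }
        if (hostOnline) {
            return false;       // fail a.) managed host is online
        }
        if (!csReachable) {
            return false;       // fail b.) the CS cannot be contacted
        }
        if (authFilesOk) {
            return true;        // success a.) authentication files are fine
        }
        return ppw && trustYes; // success b.); fail c.) and d.) otherwise
    }
}
```

   A test could iterate over all 64 combinations of the six flags, once for a
   user installation and once for a system installation.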




   ETC:
   6 PD (2 PD ATC)
               ------- Additional comments from adoerr Wed Aug 20 07:43:45 -0700 2008 -------
   New target milestone.
               ------- Additional comments from torsten Thu Oct 2 02:21:03 -0700 2008 -------
   reassign
               ------- Additional comments from torsten Thu Oct 2 02:21:25 -0700 2008 -------
   started
               ------- Additional comments from adoerr Wed Oct 22 06:49:43 -0700 2008 -------
   New target milestone is 1.0u3
               ------- Additional comments from torsten Tue Feb 17 07:39:39 -0700 2009 -------
   changing status back to new by reassigning
               ------- Additional comments from torsten Fri Nov 27 00:40:10 -0700 2009 -------
   changed milestone to 1.0u5next
