Opened 50 years ago

Last modified 9 years ago

#906 new defect

IZ603: Corrupted user mode install because of path expansion for automounted directories

Reported by: afisch Owned by:
Priority: normal Milestone:
Component: hedeby Version: 1.0u1
Severity: Keywords: Sun cli
Cc:

Description

[Imported from gridengine issuezilla http://gridengine.sunsource.net/issues/show_bug.cgi?id=603]

        Issue #:      603          Platform:     Sun         Reporter: afisch (afisch)
       Component:     hedeby          OS:        All
     Subcomponent:    cli          Version:      1.0u1          CC:    None defined
        Status:       NEW          Priority:     P3
      Resolution:                 Issue type:    DEFECT
                               Target milestone: 1.0u5next
      Assigned to:    adoerr (adoerr)
      QA Contact:     adoerr
          URL:
       * Summary:     Corrupted user mode install because of path expansion for automounted directories
   Status whiteboard:
      Attachments:


     Issue 603 blocks:


   Opened: Tue Nov 11 05:53:00 -0700 2008 
------------------------


   Description:
   The problem can only occur if the file system of the machine in use expands
   the paths of automounted directories. On such machines the following can be
   observed:
   If the home directory of user foo is automounted as /home/foo, the expanded
   path may look like /private/var/automount/home/foo. Although the home directory
   is accessible as /home/foo, a pwd command will report
   /private/var/automount/home/foo. If this behavior is not consistently present on
   all machines used, it can lead to the following error scenario:

   The SDM master host is installed in user mode on a machine with the expanded
   path problem (hostA):

    hostA% cd /home/foo/sdm_root/bin
    hostA% pwd
    /private/var/automount/home/foo/sdm_root/bin
    hostA% sdmadm -suserModeSdm -p user install_master_host -ca_admin_mail "a"
   -ca_state "a" -ca_country "aa" -ca_location "a" -ca_org_unit "a" -ca_org "a" -au
   some_user -cs_port 31226 -l /tmp/sdmaf2 -sge_root /sge_dir

   Install and startup should work without problems. Now a managed host is
   installed on a different machine (hostB) that does not suffer from the problem:

    hostB% cd /home/foo/sdm_root/bin
    hostB% pwd
    /home/foo/sdm_root/bin
    hostB% /sdmadm -suserModeSdm -p user -keystore keystore.file -cacert cacert.pem
   install_managed_host -au some_user -l /tmp/sdmaf2 -cs_url hostA:31226

    class com.sun.grid.grm.cli.SdmAdm not found in classpath
    Using file:/home/foo/sdm_root/lib/sdm-common.jar
    A configuration for system "sdmaf2" has been added.
    Error: error setting up ssl: No security module found in URLClassLoader{
       file:/home/foo/sdm_root/lib/sdm-common.jar
    }
    AppClassLoader{
       file:/home/foo/sdm_root/lib/sdm-starter.jar
    }
    ExtClassLoader{
       file:/ ... /jre/lib/ext/dnsns.jar
       file:/ ... /jre/lib/ext/sunpkcs11.jar
       file:/ ... /jre/lib/ext/sunjce_provider.jar
       file:/ ... /jre/lib/ext/localedata.jar
    }
    ]]]

   This path expansion problem was observed on a Mac OS X 10.4 machine.
   Interestingly, the expansion could be observed on the command line when a tcsh
   shell was used, but not with a bash shell.
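
   The difference between the two path forms can be reproduced without an
   automounter. The following is only a sketch: a symbolic link stands in for the
   automounted /home directory, and the class and method names are hypothetical,
   not part of SDM. It shows that the path as typed and the resolved ("expanded")
   path name the same directory but are different strings.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Illustration only: a symbolic link plays the role of the automounted /home.
class PathExpansionDemo {

    // The path as the caller typed it: made absolute, but links are not resolved.
    static Path unexpanded(Path p) {
        return p.toAbsolutePath().normalize();
    }

    // The path with all symbolic links resolved (the "expanded" form).
    static Path expanded(Path p) {
        try {
            return p.toRealPath();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Builds <tmp>/automount/home/foo and a link <tmp>/home -> <tmp>/automount/home.
    // Returns { unexpanded(via link), expanded(via link), expanded(target) }.
    static Path[] demo() {
        try {
            Path root = Files.createTempDirectory("expansion-demo");
            Path target = Files.createDirectories(root.resolve("automount/home/foo"));
            Path link = Files.createSymbolicLink(root.resolve("home"),
                                                 root.resolve("automount/home"));
            Path viaLink = link.resolve("foo");
            return new Path[] { unexpanded(viaLink), expanded(viaLink), expanded(target) };
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Path[] r = demo();
        System.out.println("unexpanded: " + r[0]);
        System.out.println("expanded:   " + r[1]);
    }
}
```

   A value stored in the expanded form is only usable on hosts where that
   expanded location exists, which is the situation described in this issue.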

   Evaluation:
   The issue is rated as a P3 defect, as it is a rare case and a workaround exists.
   However, the user cannot conclude from the error message where the problem is
   rooted.


   Suggested Fix / Work Around:
   Suggested Fix: The SDM system should not have problems with the path expansion.
   If this is not solvable, the error message should at least state clearly what
   the problem is.
   Work Around:
   If the user is aware of the problem when installing the master host, the -dist
   option can be used to provide the dist dir for the install_master_host command
   explicitly. If the master host was already installed without the -dist option,
   the bootstrap information can be fixed manually: the file
   /home/[user]/.sdm/bootstrap/[sdm_system]/prefs.properties has to be edited, and
   the path of the property dist has to be replaced with the unexpanded variant:

    dist=[dist_path]

   The current value of the dist property can be checked with the -all switch of
   the sbc command.
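
   The manual edit can also be scripted. The sketch below assumes that
   prefs.properties is a plain Java properties file (suggested by the
   dist=[dist_path] line above, but not confirmed here); the class name and the
   extra key used in the demo are hypothetical. Note that Properties.store()
   rewrites the whole file, so comments and key order are not preserved.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Properties;

// Hypothetical helper, not part of SDM.
class DistPropertyFixer {

    // Loads the file, assumed to be in Java properties format.
    static Properties load(Path prefsFile) {
        Properties prefs = new Properties();
        try (InputStream in = Files.newInputStream(prefsFile)) {
            prefs.load(in);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return prefs;
    }

    // Replaces the "dist" entry and returns the previous value (null if absent).
    static String replaceDist(Path prefsFile, String unexpandedDist) {
        Properties prefs = load(prefsFile);
        String old = (String) prefs.setProperty("dist", unexpandedDist);
        try (OutputStream out = Files.newOutputStream(prefsFile)) {
            prefs.store(out, "dist corrected manually");
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return old;
    }

    // Demo on a temporary file; "other" is a made-up key to show it survives.
    // Returns { old dist, new dist, value of "other" }.
    static String[] demo() {
        try {
            Path file = Files.createTempFile("prefs", ".properties");
            String content = "dist=/private/var/automount/home/foo/sdm_root\nother=kept\n";
            Files.write(file, content.getBytes(StandardCharsets.ISO_8859_1));
            String old = replaceDist(file, "/home/foo/sdm_root");
            Properties after = load(file);
            return new String[] { old, after.getProperty("dist"), after.getProperty("other") };
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```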


   Analysis:
   The problem does not happen with a system install, as a system install
   usually has local bootstrap directories. Thus an expanded dist path is only
   visible on the machine where the expanded dist path is valid.

   The problem would disappear after a managed host is installed, because then a
   host specific bootstrap dir is created
   (/home/[user]/.sdm/bootstrap/[sdm_system]/[host]/prefs.properties) that
   overrides the corrupted one in the bootstrap root
   (/home/[user]/.sdm/bootstrap/[sdm_system]/prefs.properties). However, as the
   install cannot be performed, the host specific bootstrap cannot be created.

   Location where SDM automatically determines the dist dir that is saved in the
   bootstrap config:
   If a new SDM instance is added, the auto discovery of the dist path is requested
   by AddSystemCommand objects. The auto discovery of the dist dir happens in
   PathUtil.getDistLibURL(). The path is extracted from a URL object for a
   ResourceBundle file. After the dir is determined, it is saved with
   PreferencesUtil.setDistDir(env.getDistDir()). The problem can be handled in
   three ways: we find a Java functionality that allows us to get the unexpanded
   path variant (no idea so far), we make the -dist option mandatory (API change),
   or we leave it unfixed and just report a reasonable error. The error message
   should clearly state where the problem is rooted and ask the user to manually
   adjust the dist dir in the prefs.properties file. As the problem is not apparent
   during the master host install (the expanded dist path is valid there), this has
   to happen later, on any managed host machine where the expanded path cannot be
   resolved.
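
   As a rough illustration of the discovery step and of the "reasonable error"
   option, the sketch below derives a dist dir from a resource URL and builds a
   message when that directory does not exist on the current host. It is not the
   PathUtil code: the class and method names are hypothetical, and the layout
   <dist>/lib/<jar> is an assumption taken from the jar paths shown in the
   description.

```java
import java.net.URI;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper; not the PathUtil class from SDM.
class DistDirDiscovery {

    // Derives <dist> from a URL pointing into <dist>/lib/<some>.jar
    // (either "jar:file:/...jar!/entry" or a plain "file:/..." URL).
    static Path distDirFromResourceUrl(String resourceUrl) {
        String fileUrl = resourceUrl;
        if (resourceUrl.startsWith("jar:")) {
            int bang = resourceUrl.indexOf("!/");
            if (bang < 0) {
                throw new IllegalArgumentException("Not a jar URL: " + resourceUrl);
            }
            fileUrl = resourceUrl.substring("jar:".length(), bang);
        }
        Path jar = Paths.get(URI.create(fileUrl));
        Path lib = jar.getParent();
        if (lib == null || lib.getParent() == null) {
            throw new IllegalArgumentException("No dist dir above: " + jar);
        }
        return lib.getParent();
    }

    // Returns null if the dist dir is usable on this host, otherwise a message
    // that names the system, the bad path and the file to edit.
    static String describeProblem(String system, Path distDir, Path prefsFile) {
        if (Files.isDirectory(distDir)) {
            return null;
        }
        return "The dist dir " + distDir + " configured for system " + system
                + " does not exist on this host. Adjust the dist property in "
                + prefsFile + " to a path that is valid on all hosts.";
    }
}
```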


   The error message:
   It comes from Modules.getSecurityModule()#98 and states that the security
   module cannot be found. This is correct. However, no module is loaded at all, so
   the command should not get this far. The problem is rooted in a "pre-command
   stage" in MainWrapper, where the classloader switch takes place. This trick
   allows every SDM instance to have its own binaries (located in the dist dir).
   After the switch, the command is executed with a modified classloader that uses
   the classes from the dist dir. The MainWrapper exploits the fact that new
   threads can have their own classloaders and can then load classes from
   locations other than the ones defined for the classloader of the main thread.
   The only condition is that the main thread has not loaded these classes so far,
   as they are cached. The MainWrapper works as follows:
     1) The MainWrapper is started with only sdm-starter.jar to avoid class loading
   conflicts. The jar mainly consists of the MainWrapper code.
     2) MainWrapper.run() is executed by the main() thread, which uses the default
   initial classloader. In run() the Java version is checked, then a separate
   thread (SystemFinderThread) is started with a second, independent classloader
   that only sees the sdm-common.jar of the local dist dir (the one the command was
   started from on the command line). A single jar is sufficient because the
   thread does not need anything else. This thread determines the system that is
   addressed by the command. If no explicit system was addressed, it determines
   the local dist dir that was used to start the command
   (see SystemFinder.initFromPrefs()). The main thread waits for the
   SystemFinderThread to end.
     3) Then another thread (SystemRunThread) is started with a third classloader
   that uses the classpath determined by the SystemFinder. This thread executes the
   command that has been provided by the command line arguments. If the third
   classloader fails, the second classloader (the one that only sees
   sdm-common.jar) is used instead.
   The case that the third classloader could not be initialized (point 3) is the
   reason for the error message in the description section. The second
   classloader only knows sdm-common.jar, so the system fails to execute the
   command correctly as soon as it needs any jar other than sdm-common.jar.
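
   The thread-with-its-own-classloader technique described above can be sketched
   as follows. This is not the MainWrapper code: the class and method names are
   hypothetical, and the loader is created with an empty URL list only to show
   the isolation. A class that is outside the loader's classpath cannot be found,
   which corresponds to the "not found in classpath" line in the description.

```java
import java.net.URL;
import java.net.URLClassLoader;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical sketch of running work in a thread with its own class loader.
class ClassLoaderSwitchDemo {

    // A loader that sees only the given URLs plus the JDK's own classes,
    // but not the application classpath of the main thread.
    static URLClassLoader isolatedLoader(URL... urls) {
        return new URLClassLoader(urls, ClassLoader.getSystemClassLoader().getParent());
    }

    // Runs the task in a new thread whose context class loader is 'loader'
    // and returns the loader the task actually saw.
    static ClassLoader runInThread(ClassLoader loader, Runnable task) {
        AtomicReference<ClassLoader> seen = new AtomicReference<>();
        Thread t = new Thread(() -> {
            seen.set(Thread.currentThread().getContextClassLoader());
            task.run();
        }, "SystemRunThread");
        t.setContextClassLoader(loader);
        t.start();
        try {
            t.join(); // the starting thread waits, as the main thread does
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return seen.get();
    }

    // True if the loader can resolve the class name.
    static boolean canLoad(ClassLoader loader, String className) {
        try {
            loader.loadClass(className);
            return true;
        } catch (ClassNotFoundException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        ClassLoader isolated = isolatedLoader();
        runInThread(isolated, () -> System.out.println(
                "application class visible: "
                + canLoad(isolated, ClassLoaderSwitchDemo.class.getName())));
    }
}
```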

   To fix the error message, a set of changes has to be made:

   The problem that the dist dir is invalid should be addressed. This can already
   be detected in the SystemFinder by checking whether the directory exists. If
   not, it should print/log a warning: "The dist dir [corrupted dir] is invalid,
   the local dist dir [local dist dir] is used instead." It should then switch to
   the local one, as it does if no system name was provided
   (SystemFinder.initFromPrefs()).

   If the third classloader can not be initialized correctly (for example if the
   dist dir is empty), the system should *NOT* switch to the second class loader.
   It should exit instead and print an error message that clearly states that the
   system failed to use the classpath provided by the SystemFinder and should print
   the used system name and the location where it found this invalid path (dist
   dir). Currently it only prints "class com.sun.grid.grm.cli.SdmAdm not found in
   classpath" and continues.
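
   The two changes proposed above could look roughly like the following sketch.
   The class, its fields and its methods are hypothetical and are not taken from
   SystemFinder or MainWrapper; only the warning text follows the wording
   proposed in this issue.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the proposed dist dir handling.
class DistDirSelection {
    final Path distDir;          // the dist dir that will actually be used
    final List<String> warnings; // messages to print/log

    private DistDirSelection(Path distDir, List<String> warnings) {
        this.distDir = distDir;
        this.warnings = warnings;
    }

    // First change: use the configured dist dir if it exists, otherwise warn
    // and fall back to the local one. A null 'configured' means that no system
    // was addressed, so the local dist dir is used without a warning.
    static DistDirSelection select(Path configured, Path local) {
        List<String> warnings = new ArrayList<>();
        if (configured == null) {
            return new DistDirSelection(local, warnings);
        }
        if (Files.isDirectory(configured)) {
            return new DistDirSelection(configured, warnings);
        }
        warnings.add("The dist dir " + configured + " is invalid, the local dist dir "
                + local + " is used instead.");
        return new DistDirSelection(local, warnings);
    }

    // Second change: instead of silently continuing with the second class
    // loader, stop with a message that names the system and the bad path.
    static IllegalStateException classpathFailure(String system, Path distDir,
                                                  String mainClass) {
        return new IllegalStateException("Class " + mainClass
                + " not found in the classpath of system " + system
                + " (dist dir " + distDir + "). The dist dir is not a valid SDM"
                + " distribution; check the dist property of this system.");
    }
}
```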

   This fix alone leads to a new problem: the command show_bootstrap_config
   (sbc) will then also fail if it is executed for a corrupted install on the
   managed host. This is unwanted, as the command is helpful for diagnosing the
   problem correctly. To prevent this, the SystemFinder class has to be changed:
   SystemFinder.initFromPrefs() should check which command will be executed and
   use the local classpath if it is the sbc command. The method already parses
   the command line arguments for the global options to determine the system
   name. Similarly, it should check whether the sbc command is called, using the
   default routine to determine the command.

   The classes MainWrapper and SystemFinder should be commented in a way that
   makes the bootstrap process easier to understand. Parts of the explanations
   in this issue could be reused.

   Additionally we should add a hint to the hedeby installation manual to use the
   -dist option if there are problems with the path expansion.

   How to test:
   There should be three TS tests:

   1) Normal mode:
   Install a masterhost in user mode on one host and start it.
   ==> Should work without fix.
   Install managed host for the same system on a different host and start it.
   ==> Should work without fix.
   Execute sbc on managed host with the installed system as system name.
   ==> Should work without fix.

   2) Non existent dist path mode:
   Install a masterhost in user mode on one host and start it.
   ==> Should work without fix.
   Change the dist dir in the file
   /home/[user]/.sdm/bootstrap/[sdm_system]/prefs.properties to an invalid dir.
   Install managed host for the same system on a different host and start it.
   ==> Without fix: the error of the issue. With fix: a warning that the default
   dist dir is used instead of the invalid one.
   Execute sbc on managed host with the installed system as system name.
   ==> Should work in the same way with and without fix.

   3) Corrupted dist path mode:
   Install a masterhost in user mode on one host and start it.
   ==> Should work even without fix.
   Change the dist dir  in the file
   /home/[user]/.sdm/bootstrap/[sdm_system]/prefs.properties to an existent but
   invalid dir (eg. /tmp/).
   Install managed host for the same system on a different host and start it.
   ==> Without fix: an error similar to the one of the issue. With fix: an error
   that states that the dir is corrupted.
   Execute sbc on managed host with the installed system as system name.
   ==> Should work in the same way with and without fix.


   ETC
   6 PD{
   3PD to fix, comment the code and update the installation guide.
   3PD to write the TS test (it's complicated as the install command for an
   additional system has to be integrated.)
   }
               ------- Additional comments from rhierlmeier Wed Nov 25 07:21:09 -0700 2009 -------
   Milestone changed

Change History (0)
