[GE users] VPN startup problem when using the SDM cloud adapter

jorisroovers joris.roovers at gmail.com
Wed Apr 14 22:39:44 BST 2010



Ok,

I decided to patch the installation for now using the following lines:

ssh $SSH_PARAMS root@$CLOUD_HOST "chmod 744 /etc/services/"
ssh $SSH_PARAMS root@$CLOUD_HOST "echo \"sge_qmaster    6444/tcp\" >> /etc/services" 2> /dev/null
ssh $SSH_PARAMS root@$CLOUD_HOST "echo \"sge_execd      6445/tcp\" >> /etc/services" 2> /dev/null
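A variation on the lines above, as a minimal sketch only: the echo commands append a new entry on every run, so repeated runs leave duplicate lines in /etc/services. The helper below appends an entry only when the service name is not yet defined. The function name add_service and the SERVICES_FILE override are hypothetical and exist only so the sketch can be tried on a scratch file; the ports are the ones used in this thread.

```shell
# Append a service entry only if the service name is not already defined.
# SERVICES_FILE defaults to /etc/services; point it at a scratch file to
# try the sketch without touching the real one.
add_service() {
    name=$1
    port=$2
    file=${SERVICES_FILE:-/etc/services}
    # grep fails when the name is absent (or the file does not exist yet)
    if ! grep -q "^${name}[[:space:]]" "$file" 2>/dev/null; then
        printf '%s\t%s/tcp\n' "$name" "$port" >> "$file"
    fi
}
```

On the cloud host this would be called as `add_service sge_qmaster 6444` and `add_service sge_execd 6445`; the function has to run in the shell on that host, for example as part of a script that is executed there.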

This seems to work. However, I now get errors when the cloud host is uninstalling the SGE execution daemon.
Apparently something more fundamental is wrong. I might try to reinstall the whole system sometime this week, using the environment variables instead of the /etc/services approach.

If that doesn't help, I might reopen this topic once more...

Thanks again for all your help. It has been greatly appreciated.

Joris





On Wed, Apr 14, 2010 at 09:32, rhierlmeier <richard.hierlmeier at sun.com> wrote:
For efficiency reasons, the system has converted the large body of this message into an attachment.

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=253348

To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].


---------- Forwarded message ----------
From: Richard Hierlmeier <Richard.Hierlmeier at Sun.COM>
To: users <users at gridengine.sunsource.net>
Date: Wed, 14 Apr 2010 09:32:02 +0200
Subject: Re: [GE users] VPN startup problem when using the SDM cloud adapter


Hi Joris,

On 04/13/10 19:38, jorisroovers wrote:
...
More concretely, would reinstalling SGE using the variables instead of the services approach potentially give any results; or should the 2 methods be able to co-exist?


Normally it should work with both methods. However, on OpenSolaris the /etc/services file does not contain the definitions for Grid Engine. The AMI is based on OpenSolaris 2009.06.

The port definitions from the auto installation configuration file are not used for the execd installation. It uses only the settings file.

As a workaround you can

 o install qmaster using the SGE_QMASTER_PORT and SGE_EXECD_PORT variables
 o or patch the util/templates/copy_sge_root_to_cloud.sh script (in
   the SDM distribution) so that it defines the Grid Engine ports in
   /etc/services on the cloud host.
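To see which of the two methods is actually configured on a host, a check along the following lines may help. This is a sketch under assumptions: the function name port_source is hypothetical, and the order (environment variable first, then the services file) only mirrors the wording of the qconf error message quoted in this thread, not the Grid Engine source.

```shell
# Report where a Grid Engine port would come from: the environment variable
# if it is set, otherwise an entry in the services file, otherwise nothing.
# Usage: port_source ENV_VALUE SERVICE_NAME [SERVICES_FILE]
port_source() {
    env_value=$1
    service=$2
    file=${3:-/etc/services}
    if [ -n "$env_value" ]; then
        echo "env:$env_value"
        return 0
    fi
    # Take the port from the first "name port/proto" line for this service
    port=$(awk -v s="$service" \
        '$1 == s { split($2, a, "/"); print a[1]; exit }' "$file" 2>/dev/null)
    if [ -n "$port" ]; then
        echo "services:$port"
        return 0
    fi
    echo "undefined"
    return 1
}
```

For example, `port_source "$SGE_QMASTER_PORT" sge_qmaster` and `port_source "$SGE_EXECD_PORT" sge_execd` can be run on the cloud host before starting the execd installation.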


Richard





Thanks,

Joris


On Tue, Apr 13, 2010 at 16:40, rhierlmeier <richard.hierlmeier at sun.com> wrote:

   ------------------------------------------------------
   http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=253249


   ---------- Forwarded message ----------
   From: Richard Hierlmeier <Richard.Hierlmeier at Sun.COM>
   To: users <users at gridengine.sunsource.net>
   Date: Tue, 13 Apr 2010 16:39:55 +0200
   Subject: Re: [GE users] VPN startup problem when using the SDM cloud
   adapter

   Hi Joris,

   On 04/13/10 15:45, jorisroovers wrote:

       Ok,

       I got some valuable information out of this (thanks for your
       quick reply).
       The second root_xx directory (= output of the xx script) contains
       the following error:

       Cannot contact qmaster. The command failed:

         ./bin/sol-x86/qconf -sh

       The error message was:

         error: could not get environment variable SGE_QMASTER_PORT or
         service "sge_qmaster"

       Setting the SGE_QMASTER_PORT variable does not change
       anything about this; the error stays there. However, if I add
       sge_qmaster to /etc/services and do the same for sge_execd, the
       installation works.

       This means that somehow the automatic installation procedure
       doesn't do this.
       I checked the install_execd_cloud.conf file and it contains the
       correct entries for the ports:

       SGE_QMASTER_PORT="6444"
       SGE_EXECD_PORT="6445"

       How can this happen? Is this the result of some faulty
       configuration, or something else?


   That's really strange. I don't know how the execd install script
   evaluates the SGE_QMASTER_PORT. Normally I would say that it is
   taken from the auto configuration file. However it is also possible
   that it is taken from $SGE_ROOT/$SGE_CELL/common/settings.sh.

   The cloud-adapter synchronizes the files in
   $SGE_ROOT/$SGE_CELL/common at the cloud host with the files from
   qmaster. Do you have the correct SGE_QMASTER_PORT in
   $SGE_ROOT/$SGE_CELL/common/settings.sh on the cloud host?

   Did you uncomment the "set -x" line in inst_sge? In the debug output
   you can see what value SGE_QMASTER_PORT has.
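One way to check which value a settings file would hand to the installer, as a rough sketch: source the file in a subshell and print the variable. The function name show_qmaster_port is hypothetical, and the sketch assumes the settings file is a plain sh script that sets or unsets SGE_QMASTER_PORT.

```shell
# Print the SGE_QMASTER_PORT that a settings file would set, or "unset".
# The file is sourced in a subshell so the current environment is untouched.
show_qmaster_port() {
    settings=$1
    [ -r "$settings" ] || { echo "unreadable"; return 1; }
    (
        # Start from a clean slate so an inherited value cannot leak in
        unset SGE_QMASTER_PORT
        . "$settings" >/dev/null 2>&1
        echo "${SGE_QMASTER_PORT:-unset}"
    )
}
```

On the cloud host this could be called as `show_qmaster_port "$SGE_ROOT/$SGE_CELL/common/settings.sh"` and compared with the output of the same call on the qmaster host.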

   Richard




       Thanks,

       Joris

       On Tue, Apr 13, 2010 at 13:28, rhierlmeier
       <richard.hierlmeier at sun.com> wrote:

          ------------------------------------------------------
          http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=253233

          ---------- Forwarded message ----------
          From: Richard Hierlmeier <Richard.Hierlmeier at Sun.COM>
          To: users <users at gridengine.sunsource.net>
          Date: Tue, 13 Apr 2010 13:28:31 +0200
          Subject: Re: [GE users] VPN startup problem when using the
          SDM cloud adapter

          Hi Joris,

          welcome back.

          You can debug the complete execd installation process of SDM
          if you set the keepFiles attribute of the SDM executor on the
          cloud host:

          1. Modify the configuration of the executor:

          % sdmadm mc -c executor
          <?xml version="1.0" encoding="UTF-8" standalone="yes"?>
          <executor:executor ...
            keepFiles="true"/>

          2. Make the configuration active on the cloud host:

          % sdmadm uc -c executor -h <cloud-host>

          3. Reset the resource (it should be in error state)

          % sdmadm rsr -r <cloud-host>

          The reset_resource command will reinstall the execd on the
          cloud host.

          4. Wait until the resource goes into error state again.

          5. Log in to the cloud host and look into the directory
            <local_spool_dir>/tmp/executor.

          You can find the local spool directory on the cloud host with

          # sdmadm -s <system_name> sbc -all
          ts31040system SYSTEM arges 31040
             spool=/var/sdm/<system_name>   <-- the local spool directory
              dist=/opt/sdm

          In <local_spool_dir>/tmp/executor you will find a directory
          for each executed command. The directory names have the format
          <user_name>_<sequence_nr>.
          The directory with the highest sequence number will contain the
          protocol of the last (un)install command (including stderr and
          stdout output). The username is always root.
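Picking the directory with the highest sequence number needs a numeric comparison, because a plain alphabetical listing puts root_10 before root_2. A minimal sketch, assuming the directories are named root_<n> as described above (the function name latest_executor_dir is hypothetical):

```shell
# Print the executor work directory with the highest sequence number.
# Directory names are assumed to look like root_<n>.
latest_executor_dir() {
    base=$1
    ls -d "$base"/root_* 2>/dev/null \
        | sed 's/.*root_\([0-9][0-9]*\)$/\1/' \
        | sort -n \
        | tail -n 1 \
        | sed "s|^|$base/root_|"
}
```

It would be called with the executor directory as argument, for example `latest_executor_dir <local_spool_dir>/tmp/executor`, and prints nothing if no such directory exists.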

          If the stderr and stdout output still contains no useful
          information, please enable debugging of the inst_sge script.
          Uncomment the line with

          # set -x

          in $SGE_ROOT/inst_sge on the cloud host. Repeat the
          installation:

          # cd <local_spool_dir>/tmp/executor/root_10
          # ./install_execd.sh


          Richard


          On 04/13/10 12:15, jorisroovers wrote:

              Hello everyone,

              I have been out of the country for some time, which is why
              this reply is coming so late.
              However, in the meantime I have reinstalled SDM (and SGE) to
              make sure that I had a clean install to work with.
              After installing the GE-adapter, the cloud nodes no longer
              shut down right after the SDM node installation.
              I thought this would mean the end of my problems (I couldn't
              really think of anything that could go wrong after that),
              but it seems that I was wrong.

              Although the cloud node is successfully added to the
              geadapter service, it fails to install the SGE execution
              daemon:
               "Script install_execd_cloud.sh failed with status 1"

              So, I started debugging again. The different error logs
              didn't provide useful information, so I decided to have a
              closer look at the installation scripts again.
              From the
              /opt/sdm/util/templates/ge-adapter/install_execd_cloud.sh
              script I learned that

              ./inst_sge -x -noremote -auto $CONF_FILE \
                     2> $BASEDIR/install_execd.stderr > $BASEDIR/install_execd.stdout &

              is called to install the execution daemon on the cloud host
              and that $CONF_FILE can be found under
                     /var/spool/sdm/sdmjoris/tmp/executor/root_1/install_execd_cloud.conf
              on the cloud node.
              By quickly copying this configuration file on the cloud node
              (before the uninstallation procedure triggered by the ERROR
              during install deletes this file), I was able to inspect its
              content.

              File contents of install_execd_cloud.conf on the cloud node
              (comments stripped):

              SGE_ROOT="/opt/sge"
              SGE_QMASTER_PORT="6444"
              SGE_EXECD_PORT="6445"
              SGE_ENABLE_SMF="false"
              SGE_CLUSTER_NAME="sgejoris"
              CELL_NAME="default"
              PAR_EXECD_INST_COUNT="1"
              ADMIN_HOST_LIST=""
              SUBMIT_HOST_LIST="ip-10-245-209-208"
              EXEC_HOST_LIST="ip-10-245-209-208"
              EXECD_SPOOL_DIR_LOCAL=""
              HOSTNAME_RESOLVING="false"
              DEFAULT_DOMAIN=""
              ADD_TO_RC="false"
              EXEC_HOST_LIST_RM="ip-10-245-209-208"
              REMOVE_RC="false"


              I personally believe that everything is all right here...
              (I have already tried setting the HOSTNAME_RESOLVING option
              to true using sdmadm, but that didn't help => I thought the
              problem could be DNS related again...).

              So, because I don't believe the problem lies here, I tried
              something different. I added the cloud host to the spare_pool
              (instead of adding it directly to the geadapter), and then
              performed the SGE execution daemon installation manually on
              the cloud node.

              When I run

              ./inst_sge -x -noremote -auto $CONF_FILE \
                     2> $BASEDIR/install_execd.stderr > $BASEDIR/install_execd.stdout &

              with the correct config file, nothing happens. No successful
              installation, no error messages, no command output.

              However, when I perform the installation completely manually
              (that is, without the -auto option and by adding sge_qmaster
              and sge_execd as services to the cloud node), I am able to
              add the cloud node to the grid engine and run jobs on it...

              I thus think that some kind of configuration option must be
              wrong, but I don't really know where to go from here.

              Can anyone give some better directions? Is there any way to
              get better debugging output? Could this be DNS related again,
              or is this probably another problem?

              Thanks again,

              Joris

              PS: Should I post a new message to the mailing list for this,
              since this problem doesn't have anything to do with the
              VPN/DNS problems I was originally having?


              On Mon, Mar 29, 2010 at 07:47, rhierlmeier
              <richard.hierlmeier at sun.com> wrote:

                 Hi Joris,

                 On 03/26/10 13:22, jorisroovers wrote:
                  > Hi Torsten,
                  >
                  > I changed the output of the hostname command to
                  > neo-wn01.cmi.ua.ac.be and restarted the SDM master JVM.
                  > This solved the problem! The cloud host is now
                  > successfully started.
                  >
                  > However, the cloud host is shut down immediately after
                  > the startup procedure is completed. It also isn't added
                  > to the spare pool. I believe this is because there
                  > currently is no load nor SLO defined on the system.


                 By default the spare_pool has a PermanentRequestSLO with
                 urgency 1 and the cloud service a PermanentRequestSLO with
                 urgency 2 (considering only cloud resources). This means
                 that if no other SLO is defined in the system, the resource
                 will immediately be moved back to the cloud service after
                 startup.

                 Do you already have a Grid Engine service in the system?

                 The Grid Engine service has by default a FixedUsageSLO with
                 urgency 50 (it gives every resource at the service a fixed
                 usage). If you move a cloud resource to the Grid Engine
                 service, it will stay there.


                 Richard

                  > I think I'll reinstall the SDM system, Grid Engine and
                  > cloud adapter to make sure that I have a clean install
                  > to continue with. This will probably solve this problem.
                  >
                  > Thanks again for all your help. Keep up the good work :-)
                  >
                  > Joris
                  >
                  > On Wed, Mar 24, 2010 at 16:47, torsten
                  > <torsten.blix at sun.com> wrote:
                  >
                  >     Hi Joris,
                  >
                  >     thanks for your answers. It looks to me like your
                  >     problem comes from the fact that the hostname of your
                  >     SDM master host (what the hostname binary returns) is
                  >     the short version (neo-wn01), while resolving this
                  >     host on the SDM master host gives the fully qualified
                  >     hostname (neo-wn01.cmi.ua.ac.be). This should be
                  >     consistent.
                  >
                  >     Therefore I'd expect your system to work if you change
                  >     the hostname to the fully qualified hostname, or
                  >     influence the hostname resolving on the SDM master
                  >     host so that the host is always resolved to the short
                  >     hostname (the FQDN must stay resolvable as well). In
                  >     both cases, a restart of the SDM system (with sdmadm
                  >     shutdown_jvm and startup_jvm) is necessary afterwards.
                  >
                  >     A reinstallation of SDM should not be necessary.
                  >
                  >     Cheers,
                  >     Torsten
                  >
                  >     On 03/24/10 13:12, jorisroovers wrote:
                  >      > Hi Torsten,
                  >      >
                  >      > To answer your questions:
                  >      >
                  >      > 1) Did you install the SDM system (master host)
                  >      > before the entries to the DNS server were made
                  >      > (while you still had the manual entries in
                  >      > /etc/hosts)?
                  >      >
                  >      > Yes I did. Can this be the cause of my problems?
                  >      >
                  >      > 2) Has the SDM system on the master host been
                  >      > running without restart since then? (so no
                  >      > "sdmadm shutdown_jvm" command)
                  >      >
                  >      > Yes it has. However, after receiving your previous
                  >      > email, I rebooted the SDM master node to make sure
                  >      > that the master uses the latest configuration.
                  >      > (java.rmi.server.hostname is still set to
                  >      > neo-wn01.cmi.ua.ac.be)
                  >      >
                  >      > 3) What does the following command (executed on
                  >      > your local SDM master host) output? Full or short
                  >      > hostname for neo-wn01?
                  >      > % grep csInfo /etc/sdm/bootstrap/sdmjoris/prefs.properties
                  >      >
                  >      > root at neo-wn01:~# grep csInfo /etc/sdm/bootstrap/sdmjoris/prefs.properties
                  >      > csInfo=neo-wn01.cmi.ua.ac.be\:6442
                  >      >
                  >      > 4) What does the hostname binary return on your
                  >      > SDM master host? Full or short hostname for
                  >      > neo-wn01?
                  >      >
                  >      > root at neo-wn01:~# hostname
                  >      > neo-wn01
                  >      >
                  >      > I've run the hostname command before and already
                  >      > thought that this might be related to the problem,
                  >      > but since I didn't find any reference to the
                  >      > command in any of the related gef_ec2_* scripts, I
                  >      > thought this wasn't important. Do you think that
                  >      > the hostname command giving the unqualified name
                  >      > may be related to the problems I'm having?
                  >      >
                  >      > Hopefully, this information can help. If not, I'll
                  >      > do a reinstall tonight or tomorrow morning.
                  >      >
                  >      > Thanks again for all your help.
                  >      >
                  >      > Cheers,
                  >      > Joris
                  >      >
                  >      >
                  >      > On Tue, Mar 23, 2010 at 14:33, torsten
                  >      > <torsten.blix at sun.com> wrote:
                  >      >
                  >      >     Hi Joris,
                  >      >
                  >      >     On 03/22/10 17:12, jorisroovers wrote:
                  >      >      > Hi,
                  >      >      >
                  >      >      > I checked this, but it seems that the
                  >      >      > rmi-registry is set up correctly
                  >      >      > (ps -eF)
                  >      >      >
                  >      >      > /usr/lib/jvm/java-6-sun-1.6.0.15/jre/bin/java
                  >      >      > -Djava.security.manager=java.rmi.RMISecurityManager
                  >      >      > [lot of other arguments]
                  >      >      > -Djava.rmi.server.hostname=neo-wn01.cmi.ua.ac.be
                  >      >
                  >      >     ok, so this isn't the culprit ...
                  >      >
                  >      >      > Your reply got me thinking though. The
                  >      >      > cluster I'm using is newly installed, and it
                  >      >      > was only added to the DNS server last week.
                  >      >      > Before the nodes of the cluster were added
                  >      >      > to the DNS server, I needed to add entries
                  >      >      > to /etc/hosts manually if I wanted the
                  >      >      > hostnames to be resolved. Therefore, I added
                  >      >      > some entries for other nodes to the
                  >      >      > /etc/hosts file of neo-wn01 (including
                  >      >      > neo-wn01 itself).
                  >      >      > I have now removed those, to be sure that no
                  >      >      > new problems arise from the /etc/hosts file.
                  >      >
                  >      >     Good point! This host name resolving
                  >      >     reconfiguration might be the cause of your
                  >      >     problem. I have a few questions:
                  >      >
                  >      >     1) Did you install the SDM system (master host)
                  >      >     before the entries to the DNS server were made
                  >      >     (while you still had the manual entries in
                  >      >     /etc/hosts)?
                  >      >
                  >      >     2) Has the SDM system on the master host been
                  >      >     running without restart since then? (so no
                  >      >     "sdmadm shutdown_jvm" command)
                  >      >
                  >      >     3) What does the following command (executed on
                  >      >     your local SDM master host) output? Full or
                  >      >     short hostname for neo-wn01?
                  >      >     % grep csInfo /etc/sdm/bootstrap/sdmjoris/prefs.properties
                  >      >
                  >      >     4) What does the hostname binary return on your
                  >      >     SDM master host? Full or short hostname for
                  >      >     neo-wn01?
                  >      >
                  >      >     It might help to configure your master host to
                  >      >     always resolve itself to the short hostname
                  >      >     (neo-wn01) and reinstall SDM (or install a 2nd
                  >      >     SDM system with a different system name).
                  >      >
                  >      >
                  >      >      > Currently, the only entry in
       /etc/hosts is
                  >      >      >
                  >      >      > 127.0.0.1 localhost
                  >      >      >
                  >      >      > I've retried the cloud installation
       process, but
                 the same
                  >     error
                  >      >     occured.
                  >      >      > However, I also found an interesting
       error,
              that I
                  >     overlooked before.
                  >      >      > When doing the sdminstallation
       manually on the
                 cloud host
                  >     (not having
                  >      >      > edited the /etc/hosts file) I get the
              following error
                  >      >      >
                  >      >      > root at domU-12-31-39-03-CC-61:/opt/sdm/bin# ./sdmadm -p system -ppw -s
                  >      >      > sdmjoris install_managed_host -au root -l /root/spool -cs_url
                  >      >      > neo-wn01.cmi.ua.ac.be:6442
                  >      >      > A configuration for system "sdmjoris" has been added.
                  >      >      > username [root] >
                  >      >      > password >
                  >      >      > WARNING: Host neo-wn01 is not resolvable
                  >      >      > username [root] >
                  >      >      > password >
                  >      >      > During installation of system sdmjoris, an error occurred. The system
                  >      >      > will be removed from preferences.
                  >      >      > Error: Cannot connect to JVM cs_vm at neo-wn01_cmi_ua_ac_be: Exception
                  >      >      > creating connection to: neo-wn01.cmi.ua.ac.be; nested exception is:
                  >      >      >         java.io.IOException: found no SSL context for system neo-wn01:6442
                  >      >      >
                  >      >      >
                  >      >      > Which would suggest that there is a certificate problem.
                  >      >
                  >      >     The error message suggests that, but this is not the case. This is very
                  >      >     probably related to hostname resolution on the SDM master host.
                  >      >
                  >      >     Cheers,
                  >      >     Torsten
                  >      >
                  >      >      > Any other suggestions?
                  >      >      > Thanks again,
                  >      >      >
                  >      >      > Joris
                  >      >      >
                  >      >      >
                  >      >      >
                  >      >      >
                  >      >      > On Mon, Mar 22, 2010 at 14:43, torsten <torsten.blix at sun.com> wrote:
                  >      >      >
                  >      >      >     Hi Joris,
                  >      >      >
                  >      >      >     On 03/22/10 13:45, jorisroovers wrote:
                  >      >      >      > Hi Torsten,
                  >      >      >      >
                  >      >      >      > Thanks for your help !
                  >      >      >      > Sorry for the late reply. I've been busy last week.
                  >      >      >      > However, I have been able to solve the problem. It was indeed the
                  >      >      >      > ssh-tunnel that was not set up correctly.
                  >      >      >      > The actual problem was that the /etc/hosts file on the sdm master host
                  >      >      >      > didn't contain a localhost entry anymore (apparently, I accidentally
                  >      >      >      > deleted that entry when editing the file). This caused the ssh tunnel
                  >      >      >      > setup to fail. This is solved now.
                  >      >      >
                  >      >      >     Good to hear!
                  >      >      >
                  >      >      >      > However, I'm now having another issue.
                  >      >      >      > The installation now fails when installing the SDM managed host.
                  >      >      >      > ec2        res#32 domU-12-31-39-0B-1D-31 ERROR host       2 Step
                  >      >      >      > 'Installing and starting up SDM' failed (see ...)
                  >      >      >      >
                  >      >      >      > I believe this has something to do with the /etc/hosts file on the cloud
                  >      >      >      > host.
                  >      >      >      > When I run the install_managed_host on the cloud host
                  >      >      >      >
                  >      >      >      > sdmadm -p system -ppw -s sdmtest install_managed_host -au root -l
                  >      >      >      > /root/spool -cs_url neo-wn01.cmi.ua.ac.be:6442
                  >      >      >      >
                  >      >      >      > (I use the password installation method for simplicity; I've already
                  >      >      >      > verified that the certificates needed for password-less installation
                  >      >      >      > are present on the cloud host)
                  >      >      >      > I get the following output:
                  >      >      >      >
                  >      >      >      > A configuration for system "sdmtest" has been added.
                  >      >      >      > username [root] >
                  >      >      >      > password >
                  >      >      >      > WARNING: Host neo-wn01 is not resolvable
                  >      >      >
                  >      >      >     This looks like a problem with host names resolving differently on your
                  >      >      >     SDM master host and on the cloud host.
                  >      >      >
                  >      >      >     A little background:
                  >      >      >     The cs_url you specified on the command line above is used to contact an
                  >      >      >     RMI registry on the SDM master host. This registry hands back a URL to
                  >      >      >     which the real RMI connection should be made. From the warning you got
                  >      >      >     it looks like this 2nd URL handed back by the RMI registry contains
                  >      >      >     the short hostname for your SDM master host.
                  >      >      >
                  >      >      >     To confirm this suspicion, it would be good if you could check, on the
                  >      >      >     SDM master host, the parameters that were used for starting up your SDM
                  >      >      >     JVMs. Look (e.g. by using ps or pargs on Solaris) for a command line
                  >      >      >     switch -Djava.rmi.server.hostname=<master_host_name> in the (rather
                  >      >      >     longish) command line that was used to start the SDM JVM process.
                  >      >      >
                  >      >      >     If my suspicion is correct, then this should show the short name of your
                  >      >      >     master host (neo-wn01) instead of the FQDN neo-wn01.cmi.ua.ac.be
                  >      >      >
                  >      >      >     Could you verify this, please?
                  >      >      >
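
[Editor's sketch] The check Torsten describes can be approximated by
filtering a JVM command line for the -Djava.rmi.server.hostname switch. The
helper name rmi_hostname is hypothetical; the sketch assumes the command line
arrives as text on stdin and that the value contains no spaces.

```shell
# Hypothetical helper: print the value of -Djava.rmi.server.hostname found
# in JVM command lines read from stdin (one command line per input line).
rmi_hostname() {
    sed -n 's/.*-Djava\.rmi\.server\.hostname=\([^ ]*\).*/\1/p'
}

# Usage on the SDM master host (output depends on the running JVMs):
#   ps -ef | rmi_hostname
# ps may truncate long command lines on some systems; on Solaris,
# "pargs <pid>" shows the full argument list instead.
```

If this prints the short name rather than the FQDN, that would match the
suspicion described above.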
                  >      >      >     Cheers,
                  >      >      >     Torsten
                  >      >      >
                  >      >      >      > The installation procedure then asks again for the username and
                  >      >      >      > password 2 times before exiting.
                  >      >      >      > The /etc/hosts file on the cloud node currently contains
                  >      >      >      >
                  >      >      >      > # SDM master host
                  >      >      >      > 10.8.0.1               neo-wn01.cmi.ua.ac.be
                  >      >      >      >
                  >      >      >      > When I replaced this entry with the following (adding the unqualified
                  >      >      >      > name), the installer no longer gives the warning and the installation
                  >      >      >      > seems to go well.
                  >      >      >      >
                  >      >
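
[Editor's sketch] The replacement /etc/hosts entry itself is cut off in the
archive. Assuming it simply listed the short name as an extra alias on the
same line (for example "10.8.0.1 neo-wn01.cmi.ua.ac.be neo-wn01", which is an
assumption and not quoted from the thread), a small check that a hosts-format
file knows a given name could look like this. The helper name hosts_has_name
is hypothetical.

```shell
# Hypothetical helper: succeed if NAME ($2) appears as a hostname or alias
# (any field after the address) on a non-comment line of hosts-format FILE ($1).
hosts_has_name() {
    awk -v name="$2" '
        /^[[:space:]]*#/ { next }            # skip full-line comments
        {
            for (i = 2; i <= NF; i++) {
                if ($i ~ /^#/) break         # stop at a trailing comment
                if ($i == name) found = 1
            }
        }
        END { exit found ? 0 : 1 }
    ' "$1"
}

# Usage on the cloud host:
#   hosts_has_name /etc/hosts neo-wn01 && echo "short name is listed"
```

This only inspects the file; it does not prove that the resolver actually
consults /etc/hosts before DNS.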