[GE users] VPN startup problem when using the SDM cloud adapter

jorisroovers joris.roovers at gmail.com
Tue Apr 13 18:38:25 BST 2010


Hi,

I had a look at the settings.sh script and it unsets both parameters instead of setting them:

unset SGE_QMASTER_PORT
unset SGE_EXECD_PORT

removing this or changing it to

SGE_QMASTER_PORT=6444; export SGE_QMASTER_PORT
SGE_EXECD_PORT=6445; export SGE_EXECD_PORT

didn't really give new results (which was what I expected, since the settings.sh script is sourced AFTER the installation if I remember correctly).

I also uncommented the -x option resulting in very verbose output. However, I didn't really notice anything helpful yet. The SGE_QMASTER_PORT and SGE_EXECD_PORT variables both seem to be set.


I also tried out various other scenarios, including setting the SGE_QMASTER_PORT and SGE_EXECD_PORT variables on the qmaster, adding them to the .profile of the cloud host, ...
Nothing seems to help (except adding the /etc/services entries).

Is it possible that this has something to do with the fact that I used the /etc/services approach to install the qmaster host (that is, the qmaster host doesn't use the SGE_QMASTER_PORT and SGE_EXECD_PORT variables, but uses services instead)?
More concretely, would reinstalling SGE using the variables instead of the services approach potentially give any results; or should the 2 methods be able to co-exist?
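
To make the question concrete: the error text quoted further down (could not get environment variable SGE_QMASTER_PORT or service "sge_qmaster") suggests a lookup that tries the environment variable first and the service entry second. The sketch below mimics that order. resolve_sge_port is a hypothetical helper, not part of SGE, and reading the services file directly is an assumption (the real clients may well go through the system resolver instead).

```shell
# resolve_sge_port VAR_NAME SERVICE_NAME [SERVICES_FILE]
# Hypothetical helper: prints the port taken from the environment variable
# VAR_NAME if it is set, otherwise from a "name port/proto" line in the
# services file. Returns 1 if neither source yields a port.
resolve_sge_port() {
    var_name=$1
    service=$2
    services_file=${3:-/etc/services}

    # 1. Environment variable, if set and non-empty.
    eval "value=\${$var_name:-}"
    if [ -n "$value" ]; then
        printf '%s\n' "$value"
        return 0
    fi

    # 2. Fall back to the first matching entry in the services file.
    port=$(awk -v svc="$service" \
        '$1 == svc { split($2, a, "/"); print a[1]; exit }' \
        "$services_file" 2>/dev/null)
    if [ -n "$port" ]; then
        printf '%s\n' "$port"
        return 0
    fi

    echo "error: could not get environment variable $var_name or service \"$service\"" >&2
    return 1
}
```

Running "resolve_sge_port SGE_QMASTER_PORT sge_qmaster" once in the shell that starts inst_sge and once in a fresh login shell on the cloud host would show whether the variable actually reaches the installer's environment.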

Thanks,

Joris


On Tue, Apr 13, 2010 at 16:40, rhierlmeier <richard.hierlmeier at sun.com> wrote:
For efficiency reasons, the system has converted the large body of this message into an attachment.

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=253249

To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].


---------- Forwarded message ----------
From: Richard Hierlmeier <Richard.Hierlmeier at Sun.COM>
To: users <users at gridengine.sunsource.net>
Date: Tue, 13 Apr 2010 16:39:55 +0200
Subject: Re: [GE users] VPN startup problem when using the SDM cloud adapter

Hi Joris,

On 04/13/10 15:45, jorisroovers wrote:
Ok,

I got some valuable information out of this (thanks for your quick reply).
The second root_xx directory (=output of the xx script) contains the following error

Cannot contact qmaster. The command failed:

  ./bin/sol-x86/qconf -sh

The error message was:

  error: could not get environment variable SGE_QMASTER_PORT or service "sge_qmaster"

Setting the SGE_QMASTER_PORT variable does not change anything about this, the error stays there. However, if I add sge_qmaster to /etc/services and do the same for sge_execd the installation works.

This means that somehow, the automatic installation procedure doesn't do this.
I checked the install_execd_cloud.conf file and it contains the correct entries for the ports:

SGE_QMASTER_PORT="6444"
SGE_EXECD_PORT="6445"

How can this happen? Is this the result of some faulty configuration, or something else?

That's really strange. I don't know how the execd install script evaluates the SGE_QMASTER_PORT. Normally I would say that it is taken from the auto configuration file. However it is also possible that it is taken from $SGE_ROOT/$SGE_CELL/common/settings.sh.

The cloud-adapter synchronizes the files in $SGE_ROOT/$SGE_CELL/common at the cloud host with the files from qmaster. Do you have the correct SGE_QMASTER_PORT in $SGE_ROOT/$SGE_CELL/common/settings.sh on the cloud host?

Did you uncomment the "set -x" line in inst_sge? In the debug output you can see what value SGE_QMASTER_PORT has.
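
With "set -x" enabled the trace can get long, so it may help to filter it for the two port variables. A minimal sketch, assuming the trace ends up in the install_execd.stderr file that install_execd_cloud.sh redirects stderr to; show_port_trace is a hypothetical helper name:

```shell
# show_port_trace TRACE_FILE
# Print every traced line that mentions one of the two port variables,
# prefixed with its line number in the trace file.
show_port_trace() {
    grep -n -E 'SGE_(QMASTER|EXECD)_PORT' "$1"
}
```

For example: show_port_trace "$BASEDIR/install_execd.stderr". The last assignment shown for each variable is the value the script ends up using, and a later unset or empty assignment would point at the place where the value gets lost.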

Richard




Thanks,

Joris

On Tue, Apr 13, 2010 at 13:28, rhierlmeier <richard.hierlmeier at sun.com> wrote:

   ------------------------------------------------------
   http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=253233

   ---------- Forwarded message ----------
   From: Richard Hierlmeier <Richard.Hierlmeier at Sun.COM>
   To: users <users at gridengine.sunsource.net>
   Date: Tue, 13 Apr 2010 13:28:31 +0200
   Subject: Re: [GE users] VPN startup problem when using the SDM cloud
   adapter

   Hi Joris,

   welcome back.

   you can debug the complete execd installation process of SDM if you
   set the keepFiles attribute of the SDM executor on the cloud host:

   1. Modify the configuration of the executor:

   % sdmadm mc -c executor
   <?xml version="1.0" encoding="UTF-8" standalone="yes"?>
   <executor:executor ...
     keepFiles="true"/>

   2. Make the configuration active on the cloud host:

   % sdmadm uc -c executor -h <cloud-host>

   3. Reset the resource (it should be in error state)

   % sdmadm rsr -r <cloud-host>

   The reset_resource command will reinstall the execd on the cloud
   host.

   4. Wait until the resource goes again into error state.

   5. Log in to the cloud host and look into the directory
     <local_spool_dir>/tmp/executor.

   You can find the local spool directory on the cloud host with

   # sdmadm -s <system_name> sbc -all
   ts31040system SYSTEM arges 31040
      spool=/var/sdm/<system_name>   <-- the local spool directory
       dist=/opt/sdm

   In <local_spool_dir>/tmp/executor you will find one directory for
   each executed command. The directory names have the format
   <user_name>_<sequence_nr>.
   The directory with the highest sequence number will contain the
   protocol of the last (un)install command (including stderr and
   stdout output). The username is always root.
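
Picking the directory with the highest sequence number by eye is easy to get wrong once there are ten or more of them (root_10 sorts before root_2 alphabetically). A small sketch that compares the numeric suffix instead; latest_executor_dir is a hypothetical helper, and the root_ prefix follows from the username always being root:

```shell
# latest_executor_dir EXECUTOR_TMP_DIR
# Print the root_<sequence_nr> directory with the highest sequence number.
# Returns non-zero if no such directory exists.
latest_executor_dir() {
    exec_dir=$1
    best=""
    best_nr=-1
    for d in "$exec_dir"/root_*; do
        [ -d "$d" ] || continue
        nr=${d##*_}
        # Skip anything whose suffix is not a plain number.
        case $nr in
            ''|*[!0-9]*) continue ;;
        esac
        if [ "$nr" -gt "$best_nr" ]; then
            best_nr=$nr
            best=$d
        fi
    done
    [ -n "$best" ] && printf '%s\n' "$best"
}
```

Usage would be along the lines of: cd "$(latest_executor_dir <local_spool_dir>/tmp/executor)", with <local_spool_dir> replaced by the spool directory reported by sdmadm.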

   If the stderr and stdout output still contains no useful
   information, please enable the debugging of the inst_sge script.
   Uncomment the line with

   # set -x

   in $SGE_ROOT/inst_sge on the cloud host.  Repeat the installation:

   # cd <local_spool_dir>/tmp/executor/root_10
   # ./install_execd.sh


   Richard


   On 04/13/10 12:15, jorisroovers wrote:

       Hello everyone,

       I have been out of the country for some time, which is the
       reason this reply is coming so late.
       However, in the meantime I have reinstalled SDM (and SGE) to
       make sure that I had a clean install to work with.
       After installing the GE-adapter, the cloud nodes no longer shut
       down right after the SDM node installation.
       I thought this would mean the end of my problems (I couldn't
       really think of anything that could go wrong after that), but it
       seems that I was wrong.

       Although the cloud node is successfully added to the geadapter
       service, it fails to install the SGE execution daemon.
        "Script install_execd_cloud.sh failed with status 1"

       So, I started debugging again. The different error logs didn't
       provide useful information, so I decided to have a closer look
       at the installation scripts again.
        From the
       /opt/sdm/util/templates/ge-adapter/install_execd_cloud.sh script
       I learned that

       ./inst_sge -x -noremote -auto $CONF_FILE \
              2> $BASEDIR/install_execd.stderr > $BASEDIR/install_execd.stdout &

       is called to install the execution daemon on the cloud host and
       that $CONF_FILE can be found under
       /var/spool/sdm/sdmjoris/tmp/executor/root_1/install_execd_cloud.conf
       on the cloud node.
       By quickly copying this configuration-file on the cloud node
       (before the uninstallation procedure triggered by the ERROR
       during install deletes this file), I was able to inspect its
       content.

       File contents of install_execd_cloud.conf on the cloud node
       (comments stripped):

       SGE_ROOT="/opt/sge"
       SGE_QMASTER_PORT="6444"
       SGE_EXECD_PORT="6445"
       SGE_ENABLE_SMF="false"
       SGE_CLUSTER_NAME="sgejoris"
       CELL_NAME="default"
       PAR_EXECD_INST_COUNT="1"
       ADMIN_HOST_LIST=""
       SUBMIT_HOST_LIST="ip-10-245-209-208"
       EXEC_HOST_LIST="ip-10-245-209-208"
       EXECD_SPOOL_DIR_LOCAL=""
       HOSTNAME_RESOLVING="false"
       DEFAULT_DOMAIN=""
       ADD_TO_RC="false"
       EXEC_HOST_LIST_RM="ip-10-245-209-208"
       REMOVE_RC="false"


       I personally believe that everything is alright here... (I have
       already tried setting the HOSTNAME_RESOLVING option to true
       using sdmadm, but that didn't help => I thought the problem
       could be DNS related again...).

       So, because I don't believe the problem lies here, I tried
       something different. I added the cloud host to the spare_pool
       (instead of adding it directly to the geadapter), and then
       performed the sge execution daemon installation manually on the
       cloud node.

       When I run

       ./inst_sge -x -noremote -auto $CONF_FILE \
              2> $BASEDIR/install_execd.stderr > $BASEDIR/install_execd.stdout &

       with the correct config file, nothing happens. No successful
       installation, no error messages, no command output.

       However, when I perform the installation completely manually (that
       is, without the -auto option and by adding sge_qmaster and
       sge_execd as services on the cloud node), I am able to add the
       cloud node to the grid engine and run jobs on it...

       I thus think that some kind of configuration option must be
       wrong, but I don't really know where to go from here.

       Can anyone give some better directions? Is there any way to get
       better debugging output? Could this be DNS related again, or is
       this probably another problem?

       Thanks again,

       Joris

       PS: Should I post a new message to the mailing list for this
       since this problem doesn't have anything to do with the VPN/DNS
       problems I was originally having?


       On Mon, Mar 29, 2010 at 07:47, rhierlmeier <richard.hierlmeier at sun.com> wrote:

          Hi Joris,

          On 03/26/10 13:22, jorisroovers wrote:
           > Hi Torsten,
           >
           > I changed the output of the hostname command to
           > neo-wn01.cmi.ua.ac.be and restarted the sdm master jvm. This
           > solved the problem! The cloud host is now successfully started.
           >
           > However, the cloud host is shut down immediately after the startup
           > procedure is completed. It also isn't added to the spare pool. I
           > believe this is because there currently is no load nor SLO defined
           > on the system.


          Per default the spare_pool has a PermanentRequestSLO with urgency 1
          and the cloud service a PermanentRequestSLO with urgency 2
          (considering only cloud resources). This means that if no other SLO
          is defined in the system, the resource will immediately be moved
          back to the cloud service after startup.

          Do you already have a Grid Engine service in the system?

          The Grid Engine service has per default a FixedUsageSLO with urgency
          50 (it gives every resource at the service a fixed usage). If you
          move a cloud resource to the Grid Engine service it will stay there.


          Richard

           > I think I'll reinstall the sdm system, grid engine and cloud
           > adapter to make sure that I have a clean install to continue with.
           > This will probably solve this problem.

           >
           > Thanks again for all your help. Keep up the good work :-)
           >
           > Joris
           >
           > On Wed, Mar 24, 2010 at 16:47, torsten <torsten.blix at sun.com> wrote:
           >
           >     Hi Joris,
           >
           >     thanks for your answers. It looks to me like your problem comes
           >     from the fact that the hostname of your SDM master host (what the
           >     hostname binary returns) is the short version (neo-wn01) while when
           >     resolving this host on the SDM master host you get the fully
           >     qualified hostname (neo-wn01.cmi.ua.ac.be). This should
           >     be consistent.
           >
           >     Therefore I'd expect your system to work if you change the
           >     hostname to the fully qualified hostname, or influence the hostname
           >     resolving on the SDM master host, so that the host is always
           >     resolved to the short hostname (the FQDN must stay resolvable as
           >     well). In both cases, a restart of the SDM system (with sdmadm
           >     shutdown_jvm and startup_jvm) is necessary afterwards.
           >
           >     A reinstallation of SDM should not be necessary.
           >
           >     Cheers,
           >     Torsten
           >
           >     On 03/24/10 13:12, jorisroovers wrote:
           >      > Hi Torsten,
           >      >
           >      > To answer your questions:
           >      >
           >      > 1) Did you install the SDM system (master host) before the
           >      > entries to the DNS server were made (while you still had the
           >      > manual entries in /etc/hosts)?
           >      >
           >      > Yes I did. Can this be the cause of my problems?
           >      >
           >      > 2) Has the SDM system on the master host been running without
           >      > restart since then? (so no "sdmadm shutdown_jvm" command)
           >      >
           >      > Yes it has. However, after receiving your previous email, I
           >      > rebooted the sdm master node to make sure that the master uses
           >      > the latest configuration.
           >      > (java.rmi.server.hostname still is set to neo-wn01.cmi.ua.ac.be)
           >      >
           >      > 3) What does the following command (executed on your local SDM
           >      > master host) output? Full or short hostname for neo-wn01?
           >      > % grep csInfo /etc/sdm/bootstrap/sdmjoris/prefs.properties
           >      >
           >      > root@neo-wn01:~# grep csInfo /etc/sdm/bootstrap/sdmjoris/prefs.properties
           >      > csInfo=neo-wn01.cmi.ua.ac.be\:6442
           >      >
           >      > 4) What does the hostname binary return on your SDM master
           >      > host? Full or short hostname for neo-wn01?
           >      >
           >      > root@neo-wn01:~# hostname
           >      > neo-wn01
           >      >
           >      > I've run the hostname command before and already thought that
           >      > this might be related to the problem, but since I didn't find
           >      > any reference to the command in any of the related gef_ec2_*
           >      > scripts, I thought this wasn't important. Do you think that the
           >      > hostname command giving the unqualified name may be related to
           >      > the problems I'm having?
           >      >
           >      > Hopefully, this information can help. If not, I'll do a
           >      > reinstall tonight or tomorrow morning.
           >      >
           >      > Thanks again for all your help.
           >      >
           >      > Cheers,
           >      > Joris
           >      >
           >      >
           >      > On Tue, Mar 23, 2010 at 14:33, torsten <torsten.blix at sun.com> wrote:
           >      >
           >      >     Hi Joris,
           >      >
           >      >     On 03/22/10 17:12, jorisroovers wrote:
           >      >      > Hi,
           >      >      >
           >      >      > I checked this, but it seems that the rmi-registry is
           >      >      > set up correctly (ps -eF)
           >      >      >
           >      >      > /usr/lib/jvm/java-6-sun-1.6.0.15/jre/bin/java
           >      >      > -Djava.security.manager=java.rmi.RMISecurityManager
           >      >      > [lot of other arguments]
           >      >      > -Djava.rmi.server.hostname=neo-wn01.cmi.ua.ac.be
           >      >
           >      >     ok, so this isn't the culprit ...
           >      >
           >      >      > Your reply got me thinking though. The cluster I'm using
           >      >      > is newly installed, and it has only been added to the
           >      >      > DNS-server last week. Before the nodes of the cluster
           >      >      > were added to the DNS-server, I needed to add entries to
           >      >      > /etc/hosts manually if I wanted the hostnames to be
           >      >      > resolved. Therefore, I added some entries of other nodes
           >      >      > to the /etc/hosts file of neo-wn01 (including neo-wn01
           >      >      > itself).
           >      >      > I have now removed those, to be sure that no new problems
           >      >      > arise from the /etc/hosts file.
           >      >
           >      >     Good point! This host name resolving reconfiguration might
           >      >     be the cause of your problem. I have a few questions:
           >      >
           >      >     1) Did you install the SDM system (master host) before the
           >      >     entries to the DNS server were made (while you still had
           >      >     the manual entries in /etc/hosts)?
           >      >
           >      >     2) Has the SDM system on the master host been running
           >      >     without restart since then? (so no "sdmadm shutdown_jvm"
           >      >     command)
           >      >
           >      >     3) What does the following command (executed on your local
           >      >     SDM master host) output? Full or short hostname for neo-wn01?
           >      >     % grep csInfo /etc/sdm/bootstrap/sdmjoris/prefs.properties
           >      >
           >      >     4) What does the hostname binary return on your SDM master
           >      >     host? Full or short hostname for neo-wn01?
           >      >
           >      >     It might help to configure your master host to resolve
           >      >     itself always to the short hostname (neo-wn01) and
           >      >     reinstall SDM (or install a 2nd SDM system with a different
           >      >     system name).
           >      >
           >      >
           >      >      > Currently, the only entry in /etc/hosts is
           >      >      >
           >      >      > 127.0.0.1 localhost
           >      >      >
           >      >      > I've retried the cloud installation process, but the same
           >      >      > error occurred. However, I also found an interesting
           >      >      > error that I overlooked before. When doing the SDM
           >      >      > installation manually on the cloud host (not having
           >      >      > edited the /etc/hosts file) I get the following error
           >      >      >
           >      >      > root@domU-12-31-39-03-CC-61:/opt/sdm/bin# ./sdmadm -p system -ppw -s
           >      >      > sdmjoris install_managed_host -au root -l /root/spool -cs_url
           >      >      > neo-wn01.cmi.ua.ac.be:6442
           >      >      > A configuration for system "sdmjoris" has been added.
           >      >      > username [root] >
           >      >      > password >
           >      >      > WARNING: Host neo-wn01 is not resolvable
           >      >      > username [root] >
           >      >      > password >
           >      >      > During installation of system sdmjoris, an error occurred.
           >      >      > The system will be removed from preferences.
           >      >      > Error: Cannot connect to JVM cs_vm@neo-wn01_cmi_ua_ac_be:
           >      >      > Exception creating connection to: neo-wn01.cmi.ua.ac.be;
           >      >      > nested exception is:
           >      >      >         java.io.IOException: found no SSL context for system
           >      >      >         neo-wn01:6442
           >      >      >
           >      >      >
           >      >      > Which would suggest that there is a certificate problem.
           >      >
           >      >     The error message suggests that, but this is not the case.
           >      >     This is very probably related to hostname resolving on the
           >      >     SDM master host.
           >      >
           >      >     Cheers,
           >      >     Torsten
           >      >
           >      >      > Any other suggestions?
           >      >      > Thanks again,
           >      >      >
           >      >      > Joris
           >      >      >
           >      >      >
           >      >      >
           >      >      >
           >      >      > On Mon, Mar 22, 2010 at 14:43, torsten <torsten.blix at sun.com> wrote:
           >      >      >
           >      >      >     Hi Joris,
           >      >      >
           >      >      >     On 03/22/10 13:45, jorisroovers wrote:
           >      >      >      > Hi Torsten,
           >      >      >      >
           >      >      >      > Thanks for your help !
           >      >      >      > Sorry for the late reply. I've been busy last week.
           >      >      >      > However, I have been able to solve the problem. It
           >      >      >      > was indeed the ssh-tunnel that was not set up
           >      >      >      > correctly. The actual problem was that the
           >      >      >      > /etc/hosts file on the sdm master host didn't
           >      >      >      > contain a localhost entry anymore (apparently,
           >      >      >      > I accidentally deleted that entry when editing the
           >      >      >      > file). This caused the ssh tunnel setup to fail.
           >      >      >      > This is solved now.
           >      >      >
           >      >      >     Good to hear!
           >      >      >
           >      >      >      > However, I'm now having another issue.
           >      >      >      > The installation now fails when installing the SDM
           >      >      >      > managed host.
           >      >      >      > ec2        res#32 domU-12-31-39-0B-1D-31 ERROR  host       2
           >      >      >      > Step 'Installing and starting up SDM' failed (see ...)
           >      >      >      >
           >      >      >      > I believe this has something to do with the
           >      >      >      > /etc/hosts file on the cloud host.
           >      >      >      > When I run the install_managed_host on the cloud host
           >      >      >      >
           >      >      >      > sdmadm -p system -ppw -s sdmtest install_managed_host -au root -l
           >      >      >      > /root/spool -cs_url neo-wn01.cmi.ua.ac.be:6442
           >      >      >      >
           >      >      >      > (I use the password installation method for
           >      >      >      > simplicity, I've already verified that the right
           >      >      >      > certificates that are needed for password-less
           >      >      >      > installation are present on the cloud host)
           >      >      >      > I get the following output:
           >      >      >      >
           >      >      >      > A configuration for system "sdmtest" has been added.
           >      >      >      > username [root] >
           >      >      >      > password >
           >      >      >      > WARNING: Host neo-wn01 is not resolvable
           >      >      >
           >      >      >     This looks like a problem with host
       names resolving
           >      >     differently on your
           >      >      >     SDM master host and on the cloud host.
           >      >      >
           >      >      >     A little background:
           >      >      >     The cs_url you specified on the command line above is used to
           >      >      >     contact an RMI registry on the SDM master host. This registry
           >      >      >     hands back a URL to which the real RMI connection should be
           >      >      >     made. From the warning you got, it looks like this 2nd URL
           >      >      >     handed back by the RMI registry contains the short hostname
           >      >      >     for your SDM master host.
           >      >      >
           >      >      >     To confirm this suspicion, it would be good if you could check,
           >      >      >     on the SDM master host, the parameters that were used for
           >      >      >     starting up your SDM JVMs. Look (e.g. by using ps or pargs on
           >      >      >     Solaris) for a command line switch
           >      >      >     -Djava.rmi.server.hostname=<master_host_name> in the (rather
           >      >      >     longish) command line that was used to start the SDM JVM process.
           >      >      >
           >      >      >     If my suspicion is correct, then this should show the short
           >      >      >     name of your master host (neo-wn01) instead of the FQDN
           >      >      >     neo-wn01.cmi.ua.ac.be
           >      >      >
           >      >      >     Could you verify this, please?
           >      >      >
           >      >      >     Cheers,
           >      >      >     Torsten
           >      >      >
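The check described above can be sketched in shell. This is only an illustration: the helper name rmi_hostname and the sample command line are made up for this note, and the real SDM JVM command line is much longer.

```shell
#!/bin/sh
# rmi_hostname (hypothetical helper): read process command lines on stdin
# and print the value of every -Djava.rmi.server.hostname=... switch found,
# one value per line.
rmi_hostname() {
    # Put each argument on its own line, then keep only the RMI switch
    # and strip the switch name so that only the host name remains.
    tr ' ' '\n' | sed -n 's/^-Djava\.rmi\.server\.hostname=//p'
}

# Made-up example command line, for illustration only.
sample='java -Xmx64m -Djava.rmi.server.hostname=neo-wn01 -cp sdm.jar Main'
printf '%s\n' "$sample" | rmi_hostname
```

On a live system the input would typically come from something like `ps -e -o args | rmi_hostname`. Some ps implementations truncate long argument lists, which is presumably why pargs is suggested for Solaris.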
           >      >      >      > The installation procedure then asks for the username and
           >      >      >      > password two more times before exiting.
           >      >      >      > The /etc/hosts file on the cloud node currently contains
           >      >      >      >
           >      >      >      > # SDM master host
           >      >      >      > 10.8.0.1        neo-wn01.cmi.ua.ac.be
           >      >      >      >
           >      >      >      > When I replaced this entry with the following (adding the
           >      >      >      > unqualified name), the installer no longer gives the warning
           >      >      >      > and the installation seems to go well.
           >      >      >      >
           >      >      >      > # SDM master host
           >      >      >      > 10.8.0.1        neo-wn01.cmi.ua.ac.be neo-wn01
           >      >      >      >
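A quick way to see whether a hosts file carries both the FQDN and the short alias is sketched below. The function name has_host_entry is an assumption made for this note; it is not part of SDM or of the AMI scripts.

```shell
#!/bin/sh
# has_host_entry FILE NAME (hypothetical helper): succeeds if NAME appears
# as a host name or alias on any non-comment line of a hosts-format FILE.
has_host_entry() {
    awk -v name="$2" '
        { sub(/#.*/, "") }     # drop comments; awk re-splits the fields
        { for (i = 2; i <= NF; i++) if ($i == name) found = 1 }
        END { exit !found }    # exit status 0 only if the name was seen
    ' "$1"
}

# Example: report whether the short master host name is listed locally.
has_host_entry /etc/hosts neo-wn01 || echo "neo-wn01 not listed in /etc/hosts"
```

Field 1 (the IP address) is skipped on purpose, so only names and aliases are compared.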
           >      >      >      > Now, my question is: Is it normal that the installer needs
           >      >      >      > this second alias in the /etc/hosts file? Can I modify
           >      >      >      > anything in my sdm installation so that this is no longer
           >      >      >      > necessary?
           >      >      >      > I know that the /etc/hosts file is edited by the
           >      >      >      > startup-vpn.sh script that is remotely triggered by the
           >      >      >      > gef_ec2_startup_vpn_connection.sh script.
           >      >      >      >
           >      >      >      > execute_ssh_script $RES_dnsName "startup-vpn.sh $sdm_master_host
           >      >      >      > $SDM_MASTER_VPN_IP $remote_config_file $SDM_MASTER_VPN_IP" 1
           >      >      >      >
           >      >      >      > Trying to edit $sdm_master_host in the line above has been
           >      >      >      > unsuccessful so far. Apparently, if this variable contains
           >      >      >      > spaces (like when setting
           >      >      >      > $sdm_master_host="neo-wn01.cmi.ua.ac.be neo-wn01") only the
           >      >      >      > first part is added to the /etc/hosts file.
           >      >      >      >
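One plausible explanation for the behaviour described above is ordinary shell word splitting: a value containing a space turns into two positional parameters once it is expanded unquoted, or re-parsed by the remote shell. The sketch below only demonstrates that mechanism; show_args is a made-up stand-in, not the real startup-vpn.sh.

```shell
#!/bin/sh
# Hypothetical stand-in for the remote script: it reports how many
# positional parameters it received and what the first one is.
show_args() {
    echo "argc=$# first=$1"
}

sdm_master_host="neo-wn01.cmi.ua.ac.be neo-wn01"

# Unquoted expansion: the value is split on whitespace, so the alias
# becomes a separate argument and shifts everything after it.
unquoted=$(show_args $sdm_master_host 10.8.0.1)

# Quoted expansion: the value stays a single argument.
quoted=$(show_args "$sdm_master_host" 10.8.0.1)

echo "$unquoted"   # argc=3 first=neo-wn01.cmi.ua.ac.be
echo "$quoted"     # argc=2 first=neo-wn01.cmi.ua.ac.be neo-wn01
```

If this is what happens here, the command string handed to execute_ssh_script would need an extra level of quoting around the host value so that the remote shell still sees it as one argument. Whether that is enough depends on how startup-vpn.sh uses its first parameter, which is not shown in this thread.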
           >      >      >      > Of course, solving this problem by editing the
           >      >      >      > gef_ec2_startup_vpn_connection.sh script would only be half a
           >      >      >      > solution...
           >      >      >      >
           >      >      >      > Any ideas or help would be very useful.
           >      >      >      > Thanks a lot,
           >      >      >      >
           >      >      >      > Joris
           >      >      >      >
           >      >      >      > On Mon, Mar 15, 2010 at 14:52, torsten <torsten.blix at sun.com> wrote:
           >      >      >      >
           >      >      >      >     Hi Joris,
           >      >      >      >
           >      >      >      >     any progress in the meantime?
           >      >      >      >
           >      >      >      >     Maybe my comments below can help.
           >      >      >      >
           >      >      >      >     On 03/10/10 13:42, jorisroovers wrote:
           >      >      >      >      > Hello everyone,
           >      >      >      >      >
           >      >      >      >      > I'm trying to set up an SDM installation with managed nodes
           >      >      >      >      > on Amazon EC2 using the SDM Cloud Adapter.
           >      >      >      >      > The installation of the adapter itself was successful, but
           >      >      >      >      > now I'm having some problems when starting cloud hosts.
           >      >      >      >      >
           >      >      >      >      > To start a cloud host, I use the commands as described in
           >      >      >      >      > the wiki:
           >      >      >      >      >
           >      >      >      >      > sdmadm add_resource -s ec2 (filled in unbound_name and amiId
           >      >      >      >      > in the editor. I'm using the sample AMI)
           >      >      >      >      > sdmadm move_resource -r cloud1 -s spare_pool
           >      >      >      >      >
           >      >      >      >      > When watching the 'sdmadm show_resource' output I can see
           >      >      >      >      > that the instance is successfully started (I confirmed this
           >      >      >      >      > by using the online Amazon EC2 Management Console).
           >      >      >      >      > However, during the UNASSIGNING phase, a problem occurs
           >      >      >      >      > while the VPN is started on the cloud host.
           >      >      >      >      >
           >      >      >      >      > output of 'sdmadm show_resource':
           >      >      >      >      > ec2        res#16 cloud1                          ERROR      host U     2     Step
           >      >      >      >      > 'Starting up VPN connection' failed (see
           >      >      >      >      > 'Starting_up_virtual_resource-2010-03-10_11:21:43-res#16.log')
           >      >      >      >      >
           >      >      >      >      > I already did some research on the cause of this problem
           >      >      >      >      > (by increasing log output, removing the undo-steps so that
           >      >      >      >      > the cloud node is not shut down when the problem occurs,
           >      >      >      >      > and examining the executed scripts).
           >      >      >      >      > I found out that the problem lies with the execution of the
           >      >      >      >      > /opt/sdm/util/cloud/ec2/ami_scripts/startup-vpn.sh script
           >      >      >      >      > on the cloud node.
           >      >      >      >      > More specifically, the
           >      >      >      >      > 'wait_for_ping $VPN_SERVER_VPN_IP "VPN server"' part fails.
           >      >      >      >      > I suspect this is caused by the
           >      >      >      >      > './openvpn --config "$VPN_CONFIG_FILE" --daemon' that is
           >      >      >      >      > executed before the wait_for_ping command.
           >      >      >      >      >
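The actual wait_for_ping function from startup-vpn.sh is not shown in this thread. As a rough sketch of what such a retry step usually looks like, the hypothetical helper below retries an arbitrary probe command a fixed number of times. The name wait_for, its arguments, and the ping flags in the usage comment are all assumptions.

```shell
#!/bin/sh
# wait_for TRIES DELAY COMMAND [ARGS...] (hypothetical helper, not the
# wait_for_ping from startup-vpn.sh): run COMMAND until it succeeds,
# at most TRIES times, sleeping DELAY seconds after each failed attempt.
wait_for() {
    tries=$1
    delay=$2
    shift 2
    i=0
    while [ "$i" -lt "$tries" ]; do
        if "$@" >/dev/null 2>&1; then
            return 0            # probe succeeded
        fi
        i=$((i + 1))
        sleep "$delay"
    done
    return 1                    # probe never succeeded
}

# Possible use on the cloud host (ping -c is the Linux/BSD flag):
#   wait_for 30 2 ping -c 1 "$VPN_SERVER_VPN_IP" || echo "VPN server unreachable"
```

If openvpn never brings the tunnel up, a loop of this kind simply runs out of attempts, which would match the failure being reported at the ping step rather than at the openvpn step.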
           >      >      >      >      > I tried to run the 'openvpn' command manually on the cloud
           >      >      >      >      > host and got the following output:
           >      >      >      >      >
           >      >      >      >      > Wed Mar 10 10:55:16 2010 TCP: connect to 127.0.0.1:1194
           >      >      >




