[GE users] VPN startup problem when using the SDM cloud adapter

torsten torsten.blix at sun.com
Tue Mar 23 13:33:59 GMT 2010


Hi Joris,

On 03/22/10 17:12, jorisroovers wrote:
> Hi,
>
> I checked this, but it seems that the rmi-registry is set up correctly
> (ps -eF)
>
> /usr/lib/jvm/java-6-sun-1.6.0.15/jre/bin/java
> -Djava.security.manager=java.rmi.RMISecurityManager [lot of other
> arguments] -Djava.rmi.server.hostname=neo-wn01.cmi.ua.ac.be

ok, so this isn't the culprit ...
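
For reference, the check for the -Djava.rmi.server.hostname switch can be
scripted. This is only a sketch: the function name rmi_hostname is made up
for illustration and is not part of SDM; it just parses a command line
string such as the one ps prints.

```sh
# Hypothetical helper, not part of SDM: print the value of the
# -Djava.rmi.server.hostname switch found in a JVM command line.
rmi_hostname() {
    # Split the command line into one argument per line, then keep only
    # the value of the switch we are interested in.
    printf '%s\n' "$1" | tr ' ' '\n' |
        sed -n 's/^-Djava\.rmi\.server\.hostname=//p'
}

# Possible use on the SDM master host (assumes a ps that supports -eo args):
#   rmi_hostname "$(ps -eo args | grep '[j]ava.rmi.server.hostname')"
```

If the printed value has no domain part, the RMI registry is probably
handing out the short hostname.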

> Your reply got me thinking though. The cluster I'm using is newly
> installed, and it has only been added to the DNS-server last week.
> Before the nodes of the cluster were added to the DNS-server, I needed
> to add entries to /etc/hosts manually if I wanted the hostnames to be
> resolved. Therefore, I added some entries of other nodes to the
> /etc/hosts file of neo-wn01 (including neo-wn01 itself).
> I have now removed those, to be sure that no new problems arise from the
> /etc/hosts file.

Good point! This change in host name resolution might be the cause
of your problem. I have a few questions:

1) Did you install the SDM system (master host) before the entries to
the DNS server were made (while you still had the manual entries in
/etc/hosts)?

2) Has the SDM system on the master host been running without restart
since then? (so no "sdmadm shutdown_jvm" command)

3) What does the following command (executed on your local SDM master
host) output? Full or short hostname for neo-wn01?
% grep csInfo /etc/sdm/bootstrap/sdmjoris/prefs.properties

4) What does the hostname binary return on your SDM master host? Full or
short hostname for neo-wn01?

It might help to configure your master host so that it always resolves
itself to the short hostname (neo-wn01) and then reinstall SDM (or install
a 2nd SDM system with a different system name).
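
To answer questions 3) and 4) quickly, the names can be classified with a
small helper. A sketch only: name_kind and same_host are hypothetical
helper names (not part of SDM), and treating every name that contains a dot
as a full hostname is an assumption.

```sh
# Hypothetical helpers, not part of SDM.

# Print "full" if the name contains a dot, otherwise "short".
name_kind() {
    case "$1" in
        *.*) echo full ;;
        *)   echo short ;;
    esac
}

# Succeed if two names share the same short (first) label,
# e.g. neo-wn01 and neo-wn01.cmi.ua.ac.be.
same_host() {
    [ "${1%%.*}" = "${2%%.*}" ]
}
```

For example, name_kind "$(hostname)" on the SDM master host shows which
form the hostname binary returns, and the same call can be applied to the
host name found in the csInfo line.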


> Currently, the only entry in /etc/hosts is
>
> 127.0.0.1 localhost
>
> I've retried the cloud installation process, but the same error occurred.
> However, I also found an interesting error that I overlooked before.
> When doing the SDM installation manually on the cloud host (not having
> edited the /etc/hosts file) I get the following error
>
> root@domU-12-31-39-03-CC-61:/opt/sdm/bin# ./sdmadm -p system -ppw -s
> sdmjoris install_managed_host -au root -l /root/spool -cs_url
> neo-wn01.cmi.ua.ac.be:6442
> A configuration for system "sdmjoris" has been added.
> username [root] >
> password >
> WARNING: Host neo-wn01 is not resolvable
> username [root] >
> password >
> During installation of system sdmjoris, an error occurred. The system
> will be removed from preferences.
> Error: Cannot connect to JVM cs_vm at neo-wn01_cmi_ua_ac_be: Exception
> creating connection to: neo-wn01.cmi.ua.ac.be; nested exception is:
>         java.io.IOException: found no SSL context for system neo-wn01:6442
>
>
> Which would suggest that there is a certificate problem.

The error message suggests that, but this is not the case. This is very
probably related to hostname resolving on the SDM master host.

Cheers,
Torsten

> Any other suggestions?
> Thanks again,
>
> Joris
>
>
>
>
> On Mon, Mar 22, 2010 at 14:43, torsten <torsten.blix at sun.com> wrote:
>
>     Hi Joris,
>
>     On 03/22/10 13:45, jorisroovers wrote:
>      > Hi Torsten,
>      >
>      > Thanks for your help !
>      > Sorry for the late reply. I've been busy last week.
>      > However, I have been able to solve the problem. It was indeed the
>      > ssh-tunnel that was not set up correctly.
>      > The actual problem was that the /etc/hosts file on the sdm master
>      > host didn't contain a localhost entry anymore (apparently,
>      > I accidentally deleted that entry when editing the file). This caused
>      > the ssh tunnel setup to fail. This is solved now.
>
>     Good to hear!
>
>      > However, I'm now having another issue.
>      > The installation now fails when installing the SDM managed host.
>      > ec2        res#32 domU-12-31-39-0B-1D-31 ERROR    host       2
>      > Step 'Installing and starting up SDM' failed (see ...)
>      >
>      > I believe this has something to do with the /etc/hosts file on the
>      > cloud host.
>      > When I run the install_managed_host on the cloud host
>      >
>      > sdmadm -p system -ppw -s sdmtest  install_managed_host -au root -l
>      > /root/spool -cs_url neo-wn01.cmi.ua.ac.be:6442
>      >
>      > (I use the password installation method for simplicity, I've already
>      > verified that the right certificates that are needed for
>      > password-less installation are present on the cloud host)
>      > I get the following output:
>      >
>      > A configuration for system "sdmtest" has been added.
>      > username [root] >
>      > password >
>      > WARNING: Host neo-wn01 is not resolvable
>
>     This looks like a problem with host names resolving differently on your
>     SDM master host and on the cloud host.
>
>     A little background:
>     The cs_url you specified on the command line above is used to contact an
>     RMI registry on the SDM master host. This registry hands back a URL to
>     which the real RMI connection should be made. From the warning you got
>     it looks like this 2nd URL handed back by the RMI registry contains
>     the short hostname for your SDM master host.
>
>     To confirm this suspicion, it would be good if you could check on the
>     SDM master host, the parameters that were used for starting up your SDM
>     JVMs. Look (e.g. by using ps or pargs on Solaris) for a command line
>     switch -Djava.rmi.server.hostname=<master_host_name> in the (rather
>     longish) command line that was used to start the SDM JVM process.
>
>     If my suspicion is correct, then this should show the short name of your
>     master host (neo-wn01) instead of the FQDN neo-wn01.cmi.ua.ac.be
>
>     Could you verify this, please?
>
>     Cheers,
>     Torsten
>
>      > The installation procedure then again asks for the username and
>      > password 2 times, before exiting.
>      > The /etc/hosts file on the cloud node currently contains
>      >
>      > # SDM master host
>      > 10.8.0.1        neo-wn01.cmi.ua.ac.be
>      >
>      > When I replaced this entry with the following (adding the unqualified
>      > name), the installer no longer gives the warning and the installation
>      > seems to go well.
>      >
>      > # SDM master host
>      > 10.8.0.1        neo-wn01.cmi.ua.ac.be neo-wn01
>      >
>      > Now, my question is: Is it normal that the installer needs this second
>      > alias in the /etc/hosts file? Can I modify anything in my sdm
>      > installation so that this is no longer necessary?
>      > I know that the /etc/hosts file is edited by the startup-vpn.sh script
>      > that is remotely triggered by the gef_ec2_startup_vpn_connection.sh
>      > script.
>      >
>      > execute_ssh_script $RES_dnsName "startup-vpn.sh $sdm_master_host
>      > $SDM_MASTER_VPN_IP $remote_config_file $SDM_MASTER_VPN_IP" 1
>      >
>      > Trying to edit $sdm_master_host in the line above has been
>      > unsuccessful so far. Apparently, if this variable contains spaces
>      > (like when setting
>      > $sdm_master_host="neo-wn01.cmi.ua.ac.be neo-wn01") only the first
>      > part is added to the /etc/hosts file.
>      >
>      > Of course, solving this problem by editing
>      > the gef_ec2_startup_vpn_connection.sh script would only be half a
>      > solution...
>      >
>      > Any ideas or help would be very useful.
>      > Thanks a lot,
>      >
>      > Joris
>      >
>      >
>      >
>      >
>      >
>      >
>      >
>      >
>      >
>      >
>      >
>      >
>      > On Mon, Mar 15, 2010 at 14:52, torsten <torsten.blix at sun.com> wrote:
>      >
>      >     Hi Joris,
>      >
>      >     any progress in the meantime?
>      >
>      >     Maybe my comments below can help.
>      >
>      >     On 03/10/10 13:42, jorisroovers wrote:
>      >      > Hello everyone,
>      >      >
>      >      > I'm trying to set up an SDM installation with managed nodes on
>      >      > Amazon EC2 using the SDM Cloud Adapter.
>      >      > The installation of the adapter itself was successful, but now
>      >      > I'm having some problems when starting cloud hosts.
>      >      >
>      >      > To start a cloud host, I use the commands as described in the
>      >      > wiki:
>      >      >
>      >      > sdmadm add_resource -s ec2 (filled in unbound_name and amiId in
>      >      > the editor. I'm using the sample AMI)
>      >      > sdmadm move_resource -r cloud1 -s spare_pool
>      >      >
>      >      > when watching the 'sdmadm show_resource' output I can see that
>      >      > the instance is successfully started (I confirmed this by using
>      >      > the online Amazon EC2 Management Console).
>      >      > However, during the UNASSIGNING phase, a problem occurs while
>      >      > the VPN is started on the cloud host.
>      >      >
>      >      > output of 'sdmadm show_resource':
>      >      > ec2        res#16 cloud1                ERROR    host U     2
>      >      > Step 'Starting up VPN connection' failed (see
>      >      > 'Starting_up_virtual_resource-2010-03-10_11:21:43-res#16.log')
>      >      >
>      >      > I already did some research on the cause of this problem (by
>      >      > increasing log output, removing the undo-steps so that the cloud
>      >      > node is not shut down when the problem occurs and examining the
>      >      > executed scripts).
>      >      > I found out that the problem lies with the execution of the
>      >      > /opt/sdm/util/cloud/ec2/ami_scripts/startup-vpn.sh script on the
>      >      > cloud node.
>      >      > More specifically, the 'wait_for_ping $VPN_SERVER_VPN_IP "VPN
>      >      > server"' part fails.
>      >      > I suspect this is caused by the './openvpn --config
>      >      > "$VPN_CONFIG_FILE" --daemon' that is executed before the
>      >      > wait_for_ping command.
>      >      >
>      >      > I tried to run the 'openvpn' command manually on the cloud host
>      >      > and got the following output:
>      >      >
>      >      > Wed Mar 10 10:55:16 2010 TCP: connect to 127.0.0.1:1194 failed,
>      >      > will try again in 5 seconds: Connection refused (errno=146)
>      >      >
>      >      > This probably is the root of the problem.
>      >     [snip]
>      >
>      >     Good debugging so far, valuable information!
>      >
>      >     The installation step that fails for you ('Starting up VPN
>      >     connection') does two things (see
>      >     <sdm_dist_dir>/util/cloud/ec2/gef_ec2_startup_vpn_connection.sh):
>      >     1) create an ssh tunnel to the started up cloud host from local
>      >     port 1194 to remote port 1194
>      >     2) execute a script
>      >     <sdm_dist_dir>/util/cloud/ec2/ami_scripts/startup-vpn.sh that then
>      >     starts up the openvpn client on the cloud host (the part that fails
>      >     for you). This openvpn client is configured to connect to port 1194
>      >     on the local host (which is the cloud host), a connection which
>      >     should be forwarded by the ssh tunnel set up in step 1 to the VPN
>      >     master running on your (local) SDM master machine.
>      >
>      >     If I shoot down the ssh tunnel in my test system (after a complete
>      >     and successful cloud host startup) and try to restart the openvpn
>      >     client on the cloud host, I get exactly your error message:
>      >     Connection refused (errno=146).
>      >
>      >     So I'm suspecting that step 1 of the script, the ssh tunnel startup,
>      >     somehow fails for you.
>      >
>      >     Could you check on your local SDM master whether this ssh tunnel
>      >     process exists after the startup process fails (and NO undo is
>      >     done)? Something like "ps -ef | grep ssh" should show it. ssh should
>      >     be called with arguments like "-R 1194:localhost:1194 -N
>      >     <public_cloudhost_name>"
>      >
>      >     If this ssh process is running, you should be able to telnet from
>      >     the cloud host to port localhost:1194 ("telnet localhost 1194") and
>      >     get an answer from the VPN master process running on the SDM master
>      >     host.
>      >
>      >     A further thing to check would be the syslog on the SDM master
>      >     host. The VPN master is logging into the syslog any kind of problems
>      >     it encounters.
>      >
>      >     I hope this helps!
>      >
>      >     Cheers,
>      >     Torsten
>      >
>      >     ------------------------------------------------------
>      >     http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=248721
>      >
>      >     To unsubscribe from this discussion, e-mail:
>      >     [users-unsubscribe at gridengine.sunsource.net].
>      >
>      >
>
>     ------------------------------------------------------
>     http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=250497
>
>
>

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=250807
