[GE users] VPN startup problem when using the SDM cloud adapter

jorisroovers joris.roovers at gmail.com
Mon Mar 22 16:12:18 GMT 2010



Hi,

I checked this, and the RMI registry seems to be set up correctly (from ps -eF):

/usr/lib/jvm/java-6-sun-1.6.0.15/jre/bin/java -Djava.security.manager=java.rmi.RMISecurityManager [lot of other arguments] -Djava.rmi.server.hostname=neo-wn01.cmi.ua.ac.be

Your reply got me thinking though. The cluster I'm using is newly installed, and it has only been added to the DNS-server last week. Before the nodes of the cluster were added to the DNS-server, I needed to add entries to /etc/hosts manually if I wanted the hostnames to be resolved. Therefore, I added some entries of other nodes to the /etc/hosts file of neo-wn01 (including neo-wn01 itself).
I have now removed those, to be sure that no new problems arise from the /etc/hosts file.
Currently, the only entry in /etc/hosts is

127.0.0.1 localhost

I've retried the cloud installation process, but the same error occurred.
However, I also found an interesting error that I had overlooked before.
When doing the SDM installation manually on the cloud host (without having edited the /etc/hosts file), I get the following error:

root@domU-12-31-39-03-CC-61:/opt/sdm/bin# ./sdmadm -p system -ppw -s sdmjoris install_managed_host -au root -l /root/spool -cs_url neo-wn01.cmi.ua.ac.be:6442
A configuration for system "sdmjoris" has been added.
username [root] >
password >
WARNING: Host neo-wn01 is not resolvable
username [root] >
password >
During installation of system sdmjoris, an error occurred. The system will be removed from preferences.
Error: Cannot connect to JVM cs_vm at neo-wn01_cmi_ua_ac_be: Exception creating connection to: neo-wn01.cmi.ua.ac.be; nested exception is:
        java.io.IOException: found no SSL context for system neo-wn01:6442


This would suggest that there is a certificate problem.

Any other suggestions?
Thanks again,

Joris




On Mon, Mar 22, 2010 at 14:43, torsten <torsten.blix at sun.com> wrote:
Hi Joris,

On 03/22/10 13:45, jorisroovers wrote:
> Hi Torsten,
>
> Thanks for your help !
> Sorry for the late reply. I've been busy last week.
> However, I have been able to solve the problem. It was indeed the
> ssh tunnel that was not set up correctly.
> The actual problem was that the /etc/hosts file on the sdm master host
> didn't contain a localhost entry anymore (apparently,
> I accidentally deleted that entry when editing the file). This caused
> the ssh tunnel setup to fail. This is solved now.

Good to hear!

> However, I'm now having another issue.
> The installation now fails when installing the SDM managed host.
> ec2        res#32 domU-12-31-39-0B-1D-31 ERROR    host       2     Step
> 'Installing and starting up SDM' failed (see ...)
>
> I believe this has something to do with the /etc/hosts file on the cloud
> host.
> When I run the install_managed_host on the cloud host
>
> sdmadm -p system -ppw -s sdmtest install_managed_host -au root -l /root/spool -cs_url neo-wn01.cmi.ua.ac.be:6442
>
> (I use the password installation method for simplicity, I've already
> verified that the right certificates that are needed for password-less
> installation are present on the cloud host)
> I get the following output:
>
> A configuration for system "sdmtest" has been added.
> username [root] >
> password >
> WARNING: Host neo-wn01 is not resolvable

This looks like a problem with host names resolving differently on your
SDM master host and on the cloud host.
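A minimal sketch of how one might check which names a hosts file actually maps, assuming a POSIX shell with awk and mktemp available. The function name hosts_has_name and the temporary file are hypothetical and only for illustration; the sample entry is the one quoted further down in this thread. The sketch only inspects a hosts-format file and does not query DNS.

```shell
#!/bin/sh
# hosts_has_name FILE NAME  (hypothetical helper, not part of SDM)
# Succeeds if FILE, in /etc/hosts format, lists NAME as a canonical
# name or alias. The address column (field 1) is ignored.
hosts_has_name() {
    awk -v name="$2" '
        { sub(/#.*/, "") }    # drop comments before looking at fields
        { for (i = 2; i <= NF; i++) if ($i == name) found = 1 }
        END { exit (found ? 0 : 1) }
    ' "$1"
}

# Demo with the cloud host entry quoted in this thread (FQDN only).
tmp=$(mktemp)
printf '%s\n' '# SDM master host' \
              '10.8.0.1        neo-wn01.cmi.ua.ac.be' > "$tmp"

for n in neo-wn01.cmi.ua.ac.be neo-wn01; do
    if hosts_has_name "$tmp" "$n"; then
        echo "$n: listed"
    else
        echo "$n: missing"
    fi
done
rm -f "$tmp"
```

With only the FQDN listed, the short name is reported as missing, which would be consistent with the "not resolvable" warning for neo-wn01 if the cloud host has no other source for that name.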

A little background:
The cs_url you specified on the command line above is used to contact an
RMI registry on the SDM master host. This registry hands back a URL to
which the real RMI connection should be made. From the warning you got,
it looks like this second URL handed back by the RMI registry contains
the short hostname for your SDM master host.

To confirm this suspicion, it would be good if you could check on the
SDM master host, the parameters that were used for starting up your SDM
JVMs. Look (e.g. by using ps or pargs on Solaris) for a command line
switch -Djava.rmi.server.hostname=<master_host_name> in the (rather
longish) command line that was used to start the SDM JVM process.

If my suspicion is correct, then this should show the short name of your
master host (neo-wn01) instead of the FQDN neo-wn01.cmi.ua.ac.be.
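The check described above could be sketched roughly as follows, assuming a POSIX shell. The helper names rmi_hostname and name_kind are hypothetical, and the sample command line is an abbreviated stand-in for the real JVM command line, which would normally come from ps output on the SDM master host.

```shell
#!/bin/sh
# rmi_hostname CMDLINE  (hypothetical helper)
# Prints the value of the first -Djava.rmi.server.hostname=... switch
# found in CMDLINE, or nothing if the switch is absent.
rmi_hostname() {
    printf '%s\n' "$1" | tr ' ' '\n' |
        sed -n 's/^-Djava\.rmi\.server\.hostname=//p' | head -n 1
}

# name_kind HOST  (hypothetical helper)
# Prints "fqdn" if HOST contains a dot, otherwise "short".
name_kind() {
    case "$1" in
        *.*) echo fqdn ;;
        *)   echo short ;;
    esac
}

# Abbreviated example command line; the real one has many more arguments.
cmdline='java -Djava.security.manager=java.rmi.RMISecurityManager -Djava.rmi.server.hostname=neo-wn01.cmi.ua.ac.be'
host=$(rmi_hostname "$cmdline")
echo "$host is $(name_kind "$host")"
```

If the reported kind is "short", the RMI registry would presumably hand out the unqualified name, and the cloud host would need to resolve that name as well.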

Could you verify this, please?

Cheers,
Torsten

> The installation procedure then asks for the username and password
> again, two times, before exiting.
> The /etc/hosts file on the cloud node currently contains
>
> # SDM master host
> 10.8.0.1        neo-wn01.cmi.ua.ac.be
>
> When I replaced this entry with the following (adding the unqualified
> name), the installer no longer gives the warning and the installation
> seems to go well.
>
> # SDM master host
> 10.8.0.1        neo-wn01.cmi.ua.ac.be neo-wn01
>
> Now, my question is: is it normal that the installer needs this second
> alias in the /etc/hosts file? Can I modify anything in my SDM
> installation so that this is no longer necessary?
> I know that the /etc/hosts file is edited by the startup-vpn.sh script
> that is remotely triggered by the gef_ec2_startup_vpn_connection.sh script.
>
> execute_ssh_script $RES_dnsName "startup-vpn.sh $sdm_master_host
> $SDM_MASTER_VPN_IP $remote_config_file $SDM_MASTER_VPN_IP" 1
>
> Trying to edit $sdm_master_host in the line above has been unsuccessful
> so far. Apparently, if this variable contains spaces (like when setting
> $sdm_master_host="neo-wn01.cmi.ua.ac.be neo-wn01") only the first part
> is added to the /etc/hosts file.
>
> Of course, solving this problem by editing
> the gef_ec2_startup_vpn_connection.sh script would only be half a
> solution...
>
> Any ideas or help would be very useful.
> Thanks a lot,
>
> Joris
>
> On Mon, Mar 15, 2010 at 14:52, torsten <torsten.blix at sun.com> wrote:
>
>     Hi Joris,
>
>     any progress in the meantime?
>
>     Maybe my comments below can help.
>
>     On 03/10/10 13:42, jorisroovers wrote:
>      > Hello everyone,
>      >
>      > I'm trying to set up an SDM installation with managed nodes on Amazon EC2
>      > using the SDM Cloud Adapter.
>      > The installation of the adapter itself was successful, but now I'm
>      > having some problems when starting cloud hosts.
>      >
>      > To start a cloud host, I use the commands as described in the wiki:
>      >
>      > sdmadm add_resource -s ec2 (filled in unbound_name and amiId in the
>      > editor. I'm using the sample AMI)
>      > sdmadm move_resource -r cloud1 -s spare_pool
>      >
>      > When watching the 'sdmadm show_resource' output I can see that the
>      > instance is successfully started (I confirmed this by using the online
>      > Amazon EC2 Management Console).
>      > However, during the UNASSIGNING phase, a problem occurs while the VPN is
>      > started on the cloud host.
>      >
>      > output of 'sdmadm show_resource':
>      > ec2        res#16 cloud1                ERROR    host U     2     Step
>      > 'Starting up VPN connection' failed (see
>      > 'Starting_up_virtual_resource-2010-03-10_11:21:43-res#16.log')
>      >
>      > I already did some research on the cause of this problem (by increasing
>      > log output, removing the undo steps so that the cloud node is not
>      > shut down when the problem occurs, and examining the executed scripts).
>      > I found out that the problem lies with the execution of the
>      > /opt/sdm/util/cloud/ec2/ami_scripts/startup-vpn.sh script on the cloud node.
>      > More specifically, the 'wait_for_ping $VPN_SERVER_VPN_IP "VPN server"'
>      > part fails.
>      > I suspect this is caused by the './openvpn --config "$VPN_CONFIG_FILE"
>      > --daemon' that is executed before the wait_for_ping command.
>      >
>      > I tried to run the 'openvpn' command manually on the cloud host and got
>      > the following output:
>      >
>      > Wed Mar 10 10:55:16 2010 TCP: connect to 127.0.0.1:1194 failed, will try again in 5 seconds: Connection refused (errno=146)
>      >
>      > This probably is the root of the problem.
>     [snip]
>
>     Good debugging so far, valuable information!
>
>     The installation step that fails for you ('Starting up VPN connection')
>     does two things (see
>     <sdm_dist_dir>/util/cloud/ec2/gef_ec2_startup_vpn_connection.sh):
>     1) create an ssh tunnel to the started up cloud host from local port
>     1194 to remote port 1194
>     2) execute a script
>     <sdm_dist_dir>/util/cloud/ec2/ami_scripts/startup-vpn.sh that then
>     starts up the openvpn client on the cloud host (the part that fails for
>     you). This openvpn client is configured to connect to port 1194 on the
>     local host (which is the cloud host), a connection which should be
>     forwarded by the ssh tunnel set up in step 1 to the VPN master running
>     on your (local) SDM master machine.
>
>     If I shoot down the ssh tunnel in my test system (after a complete and
>     successful cloud host startup) and try to restart the openvpn client on
>     the cloud host, I get exactly your error message: Connection refused
>     (errno=146).
>
>     So I'm suspecting that step 1 of the script, the ssh tunnel startup,
>     somehow fails for you.
>
>     Could you check on your local SDM master whether this ssh tunnel process
>     exists after the startup process fails (and NO undo is done)? Something
>     like "ps -ef | grep ssh" should show it. ssh should be called with
>     arguments like "-R 1194:localhost:1194 -N <public_cloudhost_name>"
>
>     If this ssh process is running, you should be able to telnet from the
>     cloud host to port localhost:1194 ("telnet localhost 1194") and get an
>     answer from the VPN master process running on the SDM master host.
>
>     A further thing to check would be the syslog on the SDM master host. The
>     VPN master is logging into the syslog any kind of problems it
>     encounters.
>
>     I hope this helps!
>
>     Cheers,
>     Torsten
>
>     ------------------------------------------------------
>     http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=248721
>
>     To unsubscribe from this discussion, e-mail:
>     [users-unsubscribe at gridengine.sunsource.net].
>
>

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=250497

To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].



