[GE users] error on the mpich2 + mpd + sge 6.2 integration - mpdtrace: cannot connect to local mpd

reuti reuti at staff.uni-marburg.de
Thu Jul 29 14:43:48 BST 2010



Hi,

On 29.07.2010 at 15:16, fabiomartinelli wrote:

> I fixed this error in the way I'm going to report; I don't know if the
> solution was obvious, but I wasted so many hours on this integration that
> it's worth reporting:
>
> I enforced these settings in the SGE master conf:
> ...
> rsh_command                  builtin
> rsh_daemon                   builtin
> ...
>
> and modified this line
> [root at scigrid mpich2_mpd]# grep -Hn short startmpich2.sh
> startmpich2.sh:176:NODE=`hostname --short`
> [root at scigrid mpich2_mpd]#

yep.
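(For reference, these two entries live in the global cluster configuration; a quick way to check and change them, assuming a standard 6.2 installation:

    # show the current remote-startup settings
    qconf -sconf | egrep 'rsh_command|rsh_daemon'

    # open the global configuration in $EDITOR and set
    #   rsh_command   builtin
    #   rsh_daemon    builtin
    qconf -mconf
)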


> AND
>
> I removed the rsh binary from the computational cluster !

This is strange and shouldn't be necessary, but it might depend on the way the call to "rsh" is coded in your mpich2. If it's a hardcoded full path, then the -catch_rsh can't work, of course. But then it shouldn't work with your current setup either. By default $TMPDIR is first in the $PATH which is given to start_proc_args and to the job, and this way the rsh-wrapper is found first.

If you'd like to investigate this, you could submit a job with a simple "echo $PATH" statement.
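A minimal sketch of such a test job (assuming your PE is called mpich2_mpd; adjust to your actual PE name):

    #!/bin/sh
    #$ -pe mpich2_mpd 2
    echo $PATH
    which rsh    # should point to the wrapper link in $TMPDIR

Submitted with a plain `qsub`, the .o file should then show $TMPDIR at the front of the $PATH.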

-- Reuti


> without removing the rsh binary I was still getting the error: "mpdtrace:
> cannot connect to local mpd"
> and I couldn't see the call:  /opt/sge/bin/lx24-amd64/qrsh -inherit -V
> compute-00-00 /usr//bin/mpd
>
> [fmartine at scigrid mpich2]$ cat  mpi_hello.po823383
> -catch_rsh /var/sge/spool//compute-00-00/active_jobs/823383.1/pe_hostfile
> /usr/
> compute-00-00:1
> compute-00-06:1
> startmpich2.sh: check for local mpd daemon (1 of 10)
> /opt/sge/bin/lx24-amd64/qrsh -inherit -V compute-00-00 /usr//bin/mpd
> startmpich2.sh: check for local mpd daemon (2 of 10)
> startmpich2.sh: check for mpd daemons (1 of 10)
> /opt/sge/bin/lx24-amd64/qrsh -inherit -V compute-00-06 /usr//bin/mpd -h
> compute-00-00 -p 55940 -n
> startmpich2.sh: check for mpd daemons (2 of 10)
> startmpich2.sh: got all 2 of 2 nodes
> -catch_rsh /usr/
>
> so is this the right configuration or was I just lucky ?
>
> many thanks for your answers
> best regards
> Fabio
>
> On 14/07/2010 12:22, reuti <reuti at staff.uni-marburg.de> wrote to
> users at gridengine.sunsource.net:
>
> Hi,
>
> On 13.07.2010 at 16:56, fabiomartinelli wrote:
>
>> I'm stuck with this issue on my CentOS 5.3 cluster : mpich2 + mpd + sge 6.2
>> integration - mpdtrace: cannot connect to local mpd
>
> when do you get this error? In the startup of the daemons or during execution
> of the jobscript?
>
>
>> please, does somebody have experience with this integration?
>>
>> I already applied Reuti's advice to use `hostname --short` in
>> "start_mpich2.sh" (before Reuti advised me, so the logs in this page are
>> still valid)
>>
>> I tried to play a bit with qrsh/qlogin to use builtin instead of SSH, but
>> they seem neutral to this error (it could be because it's a preliminary
>> error)
>>
>> also I found in the rsh script a flag "just_wrap" that I don't understand
>> how to manage ("" or 1?)
>> [root at wn55 ge6.2u5]# head /opt/gridengine/mpich2_mpd/rsh
>> #!/bin/bash -x
>> #
>> #
>> # (c) 2002 Sun Microsystems, Inc. Use is subject to license terms.
>>
>> # could be rsh or remsh
>> me=`basename $0`
>> # just_wrap=1
>
> I think this option is seldom used. I included the scripts, which you can
> also find in $SGE_ROOT/mpi, as an example. The purpose of the flag just_wrap
> is to test whether the rsh-wrapper is accessed at all (in case you have some
> problems with the setup), but it still uses the default `rsh` w/o any SGE
> interaction. Hence under normal operation it should stay unset like it is,
> to allow the rsh-wrapper to call `qrsh -inherit ...`.
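> Roughly, the logic in the wrapper boils down to this (paraphrased, not the
> literal script; the variable names here are made up):
>
>     if [ "$just_wrap" = "" ]; then
>         # normal tight integration: hand the call over to SGE
>         exec $SGE_ROOT/bin/$ARC/qrsh -inherit $rhost $cmd
>     else
>         # debugging only: prove the wrapper is reached, then use plain rsh
>         exec /usr/bin/rsh $rhost $cmd
>     fi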
>
>
>> my aim is to enable MPICH2/MPD/SGE on SSH because I'm offering the cluster
>> as a Grid site, but if that's so complicated I could relax the SSH
>> requirement
>
> Even when you start the processes by SSH, you need open ports to allow MPICH2
> to communicate anyway.
>
> -- Reuti
>
>
>> so kindly print your working configurations,
>> many thanks
>> Fabio Martinelli
>>
>>
>>
>>
>>
>>
>> *****************
>> Hi,
>>
>> can you please ask on the list in the future?
>>
>> Short answer:
>>
>> This looks like the ROCKS feature to return the FQDN when you call
>> `hostname`. The call to `hostname` often needs to be adjusted to
>> `hostname --short` in "start_mpich2.sh", where it checks for the hostname
>> in the loop near the end of the script.
>>
>>
>> On 08.07.2010 at 17:26, Fabio Martinelli wrote:
>>
>>
>>> Hello Reuti
>>>
>>> this is Fabio Martinelli from ESAC Madrid, where I manage a 30-server
>>     CentOS 5.3 Linux cluster run under a Sun Grid Engine 6.2u2
>>     installation
>>
>>
>>
>>> between the servers I configured SSH host-based authentication, so
>>     no rsh daemons.
>>>
>>> I was following your tutorial for mpich2 + mpd
>>>
>>
> http://gridengine.sunsource.net/howto/mpich2-integration/mpich2-integration.html
>
>>
>>>
>>> the first error I ran into was the SSH support; after a while I
>>     realized I should recompile start_mpich2.c explicitly requesting ssh,
>>     something like:
>>> [fmartine at scigrid src]$ diff start_mpich2.c start_mpich2.c.backup
>>> 100c100
>>> <       rsh_argv[0]="ssh";
>>> ---
>>
>>>>     rsh_argv[0]="rsh";
>>
>>>
>>> kindly may you confirm this method for SSH ?
>>
>>
>> Long answer:
>>
>> Yes/no. The name here only needs to match the one which is used in MPICH2.
>> You could even name the called program "fubar" and tell MPICH2 to use
>> "fubar" for remote access.
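>> Outside of SGE, the analogous knob on the MPICH2 side would be mpd's boot
>> command, e.g. something like (if I remember the option name correctly):
>>
>>     mpdboot -n 4 --rsh=fubar -f mpd.hosts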
>>
>> In a tight integration, these calls to "fubar" will then be caught by SGE's
>> rsh-wrapper, which must of course match the name from above (i.e. you would
>> adjust the defined script for "start_proc_args" in the PE to create a link
>> called "fubar" in $TMPDIR instead of the usual "rsh"). The "fubar" link
>> will then call SGE's rsh-wrapper and start a `qrsh -inherit ...` in SGE.
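>> The relevant line in a start_proc_args script for this would be a sketch
>> like the following (the real startmpich2.sh does more; "fubar" is only the
>> example name from above, and the wrapper path depends on where you keep it):
>>
>>     # $TMPDIR is already first in the job's $PATH, so this link wins
>>     ln -s $SGE_ROOT/mpich2_mpd/rsh $TMPDIR/fubar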
>>
>> What is really used is then defined in the setup of SGE: a) the default
>> builtin method in newer versions of SGE (despite the fact that you called
>> "fubar" or "rsh" above), b) rsh in former times, or c) ssh with the
>> http://gridengine.sunsource.net/howto/qrsh_qlogin_ssh.html setup
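>> For completeness, the classic ssh variant c) boils down to cluster
>> configuration entries along these lines (paths may differ on your system):
>>
>>     rsh_command   /usr/bin/ssh
>>     rsh_daemon    /usr/sbin/sshd -i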
>>
>> In a private cluster (this I assume when you use ROCKS) there is no
>> security issue between the slave nodes in general. Hence ssh can often be
>> replaced with the default builtin or rsh startup.
>>
>>
>> == Sidenote 1:
>>
>> Even when you use "rsh" between the nodes, you don't need to have any
>> daemon running all the time at all (and you don't even need to install any
>> rsh stuff). SGE will start its own daemon from the SGE distribution on a
>> random port and allow logins only from the expected source machine - one
>> daemon for each `qrsh -inherit` call.
>>
>>
>> == Sidenote 2:
>>
>> If you want to have interactive support via "rsh"/"qlogin" (and the rsh
>> method), you would need to install the rsh-server and telnetd tools, but
>> don't set up any startup for them - they just need to be there (i.e.
>> "disable = yes" in /etc/xinetd.d/rsh or telnet).
>>
>> Combining sidenotes 1 and 2, you can have a tight integration of jobs with
>> ssh and rsh to any node disabled for the user (and allow ssh only for admin
>> staff).
>>
>>
>> == Sidenote 3:
>>
>> When you don't like the builtin startup of SGE, then you can supply a
>> different sshd_config to the SGE startup of SSH. I.e. normal logins for
>> users can still be blocked (this uses the /etc/ssh/sshd_config), but the
>> adjusted sshd_config would allow it. Each `qrsh -inherit` call will start
>> a daemon of its own for each job.
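>> A sketch of such an entry in the cluster configuration (the path to the
>> adjusted sshd_config is only an example):
>>
>>     rsh_daemon    /usr/sbin/sshd -i -f /opt/sge/sge_sshd_config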
>>
>>
>>
>>> after that, now I'm stuck with this sequence of errors:
>>> [fmartine at scigrid mpich2]$ cat mpi_hello.pe793709
>>> ...
>>> + /opt/sge/mpich2_mpd/bin/lx24-amd64/start_mpich2 -n compute-00-10
>>     /usr//bin/mpd
>>> + actual_retry=1
>>> + '[' 1 -le 10 ']'
>>> + echo 'startmpich2.sh: check for local mpd daemon (1 of 10)'
>>> ++ mpdtrace -l
>>> + MASTER='mpdtrace: cannot connect to local mpd
>>     (/tmp/mpd2.console_fmartine_sge_793709.undefined); possible causes:
>>> 1. no mpd is running on this host
>>> 2. an mpd is running but was started without a "console" (-n
>>     option)
>>> ...
>>>
>>>
>>> [fmartine at scigrid mpich2]$ cat mpi_hello.po793709
>>> -catch_rsh
>>     /var/sge/spool//compute-00-10/active_jobs/793709.1/pe_hostfile /usr/
>>> compute-00-10:2
>>> /opt/sge/mpich2_mpd/bin/lx24-amd64/start_mpich2 -n compute-00-10
>>     /usr//bin/mpd
>>> startmpich2.sh: check for local mpd daemon (1 of 10)
>>> startmpich2.sh: check for local mpd daemon (2 of 10)
>>> startmpich2.sh: check for local mpd daemon (3 of 10)
>>> startmpich2.sh: check for local mpd daemon (4 of 10)
>>> startmpich2.sh: check for local mpd daemon (5 of 10)
>>> startmpich2.sh: check for local mpd daemon (6 of 10)
>>> startmpich2.sh: check for local mpd daemon (7 of 10)
>>> startmpich2.sh: check for local mpd daemon (8 of 10)
>>> startmpich2.sh: check for local mpd daemon (9 of 10)
>>> startmpich2.sh: check for local mpd daemon (10 of 10)
>>> startmpich2.sh: local mpd could not be started, aborting
>>> -catch_rsh /usr/
>>> mpdallexit: cannot connect to local mpd
>>     (/tmp/mpd2.console_fmartine_sge_793709.undefined); possible causes:
>>> 1. no mpd is running on this host
>>> 2. an mpd is running but was started without a "console" (-n
>>     option)
>>
>>
>> This looks like the ROCKS feature to return the FQDN when you call
>> `hostname`. The call to `hostname` often needs to be adjusted to
>> `hostname --short` in "start_mpich2.sh".
>>
>> (Yes, it's not in the Howto, which I may rework at some time: if you
>> installed SGE to use the FQDN, then it's doing the right thing, and other
>> distributions need `hostname --long`).
>>
>>
>>
>>> I can confirm I can start mpich2 + mpd computations without SGE,
>>     please what may I check ?
>>
>>
>> Great, so there is no firewall on the slave nodes, which would block the
>> MPI communication.
>>
>> Having a fully security-enhanced MPI was just discussed on the Open MPI
>> list. It would need to create two tunnels per process and would of course
>> have a speed impact. No one implemented it because of this; but then all
>> MPI communication would go through ssh and you would need only one open
>> port per machine.
>>
>> -- Reuti
>>
>>
>>
>>> I spent hours on Google and I have read this error happened on many
>>     sites, but I couldn't find a clear solution,
>>> really many thanks for your advice
>>> kind regards
>>> Fabio
>>>
>>> --
>>> Fabio Martinelli
>>>
>>> European Space Agency (ESA)
>>> Computer Support Group (CSG)
>>> E-mail:
>>> fabio.martinelli at esa.int
>>>
>>>
>>> European Space Astronomy Centre (ESAC)
>>> 28691 Villanueva de la Cañada
>>> P.O. Box 78, Madrid, SPAIN
>>>
>>
>

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=270927

To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].


