[GE users] error on the mpich2 + mpd + sge 6.2 integration - mpdtrace: cannot connect to local mpd

fabiomartinelli fabio.martinelli at esa.int
Thu Jul 29 14:16:40 BST 2010



Hello SGE users

I fixed this error in the way I'm going to report. I don't know whether the
solution was obvious, but I wasted so many hours on this integration that
it's worth reporting:

I enforced these settings in the SGE master configuration:
...
rsh_command                  builtin
rsh_daemon                   builtin
...
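For reference, these two parameters live in the global SGE configuration; a
minimal sketch of checking and applying them with the standard qconf tooling
(assuming SGE 6.2's builtin interactive job support):

```shell
# Show the current rsh_command / rsh_daemon values in the global config
qconf -sconf | grep -E 'rsh_(command|daemon)'

# Open the global configuration in $EDITOR and set:
#   rsh_command    builtin
#   rsh_daemon     builtin
qconf -mconf
```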

and modified this line in startmpich2.sh:
[root at scigrid mpich2_mpd]# grep -Hn short startmpich2.sh
startmpich2.sh:176:NODE=`hostname --short`
[root at scigrid mpich2_mpd]#
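The `--short` flag matters because ROCKS returns the FQDN from a plain
`hostname`, while SGE's host list uses short names. A tiny illustration with
a made-up FQDN (shell parameter expansion mimics what `hostname --short`
does):

```shell
# Hypothetical FQDN, as a plain `hostname` would return it under ROCKS
FQDN=compute-00-00.local

# `hostname --short` keeps only the part before the first dot;
# the same trim can be done with parameter expansion:
SHORT=${FQDN%%.*}

echo "$SHORT"   # compute-00-00
```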

AND

I removed the rsh binary from the cluster's compute nodes!

Without removing the rsh binary I was still getting the error "mpdtrace:
cannot connect to local mpd"
and I never saw the call: /opt/sge/bin/lx24-amd64/qrsh -inherit -V
compute-00-00 /usr//bin/mpd

[fmartine at scigrid mpich2]$ cat  mpi_hello.po823383
-catch_rsh /var/sge/spool//compute-00-00/active_jobs/823383.1/pe_hostfile
/usr/
compute-00-00:1
compute-00-06:1
startmpich2.sh: check for local mpd daemon (1 of 10)
/opt/sge/bin/lx24-amd64/qrsh -inherit -V compute-00-00 /usr//bin/mpd
startmpich2.sh: check for local mpd daemon (2 of 10)
startmpich2.sh: check for mpd daemons (1 of 10)
/opt/sge/bin/lx24-amd64/qrsh -inherit -V compute-00-06 /usr//bin/mpd -h
compute-00-00 -p 55940 -n
startmpich2.sh: check for mpd daemons (2 of 10)
startmpich2.sh: got all 2 of 2 nodes
-catch_rsh /usr/

So, is this the right configuration, or was I just lucky?

many thanks for your answers
best regards
Fabio








On 14/07/2010 12:22, reuti <reuti at staff.uni-marburg.de> wrote
to users at gridengine.sunsource.net
Subject: Re: [GE users] error on the mpich2 + mpd + sge 6.2 integration -
mpdtrace: cannot connect to local mpd
Hi,

Am 13.07.2010 um 16:56 schrieb fabiomartinelli:

> I'm stuck with this issue on my CentOS 5.3 cluster : mpich2 + mpd + sge 6.2
> integration - mpdtrace: cannot connect to local mpd

When do you get this error? During the startup of the daemons, or during
execution of the job script?


> please, does somebody have experience with this integration?
>
> I already applied Reuti's advice to use `hostname --short` in
> "start_mpich2.sh" (Reuti advised me before this, so the logs in this page
> are still valid)
>
> I tried to play a bit with qrsh/qlogin to use builtin instead of SSH, but
> they seem neutral to this error (it could be because it's a preliminary
> error)
>
> also I found in the rsh script a flag "just_wrap" that I don't understand
> how to manage ("" or 1?)
> [root at wn55 ge6.2u5]# head /opt/gridengine/mpich2_mpd/rsh
> #!/bin/bash -x
> #
> #
> # (c) 2002 Sun Microsystems, Inc. Use is subject to license terms.
>
> # could be rsh or remsh
> me=`basename $0`
> # just_wrap=1

I think this option is seldom used. I included the scripts, which you can
also find in $SGE_ROOT/mpi, as an example. The purpose of the just_wrap
flag is to test whether the rsh-wrapper is reached at all (in case you
have problems with the setup), while still using the plain `rsh` without
any SGE interaction. Hence under normal operation it should stay unset, as
it is, to allow the rsh-wrapper to call `qrsh -inherit ...`.
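To make the branching concrete, here is a heavily simplified sketch of that
logic (not the shipped wrapper; `echo` stands in for actually executing
rsh or `qrsh -inherit`):

```shell
#!/bin/bash
# Simplified sketch of the branching inside SGE's rsh-wrapper.

# just_wrap=1   # set only to test that the wrapper is reached at all

wrap_rsh() {
    host="$1"; shift            # first argument: target host
    if [ -n "$just_wrap" ]; then
        # test mode: bypass SGE and call the plain rsh directly
        echo /usr/bin/rsh "$host" "$@"
    else
        # normal tight integration: route the call through SGE
        echo qrsh -inherit "$host" "$@"
    fi
}

wrap_rsh compute-00-00 /usr/bin/mpd   # prints: qrsh -inherit compute-00-00 /usr/bin/mpd
```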


> my aim is to enable MPICH2/MPD/SGE over SSH because I'm offering the
> cluster as a Grid site, but if that's too complicated I could relax the
> SSH requirement

Even when you start the processes by SSH, you need open ports to allow MPICH2
to communicate anyway.

-- Reuti


> so kindly print your working configurations,
> many thanks
> Fabio Martinelli
>
>
>
>
>
>
> *****************
> Hi,
>
> can you please ask on the list in the future?
>
> Short answer:
>
> This looks like the ROCKS feature to return the FQDN when you call
> `hostname`. The call `hostname` often needs to be adjusted to
> `hostname --short` in "start_mpich2.sh", where it checks for the
> hostname in the loop near the end of the script.
>
>
> Am 08.07.2010 um 17:26 schrieb Fabio Martinelli:
>
>
>> Hello Reuti
>>
>> this is Fabio Martinelli from ESAC Madrid; there I manage a 30-server
>> CentOS 5.3 Linux cluster driven by a Sun Grid Engine 6.2u2 installation
>
>
>
>> between the servers I configured SSH host-based authentication, so no
>> rsh daemons.
>>
>> I was following your tutorial for mpich2 + mpd:
>> http://gridengine.sunsource.net/howto/mpich2-integration/mpich2-integration.html
>
>>
>> the first error I hit was with SSH support; after a while I realized I
>> should recompile start_mpich2.c to explicitly request ssh, something
>> like:
>> [fmartine at scigrid src]$ diff start_mpich2.c start_mpich2.c.backup
>> 100c100
>> <       rsh_argv[0]="ssh";
>> ---
>> >       rsh_argv[0]="rsh";
>> kindly may you confirm this method for SSH ?
>
>
> Long answer:
>
> Yes/no. The name here only needs to match the one that is used in
> MPICH2. You could even name the called program "fubar" and tell MPICH2
> to use "fubar" for remote access.
>
> In a tight integration, these calls to "fubar" will then be caught by
> SGE's rsh-wrapper, which must of course match the name from above (i.e.
> you would adjust the script defined for "start_proc_args" in the PE to
> create a link called "fubar" in $TMPDIR instead of the usual "rsh").
> The "fubar" link will then call SGE's rsh-wrapper and start a
> `qrsh -inherit ...` in SGE.
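The link trick described above can be sketched like this; the wrapper path
/opt/sge/mpi/rsh and the use of mktemp are illustrative only, not the
actual paths of any given installation:

```shell
# Sketch: what the PE's start_proc_args script would do to catch "fubar".
TMPDIR=$(mktemp -d)                       # stands in for the job's $TMPDIR
ln -s /opt/sge/mpi/rsh "$TMPDIR/fubar"    # link named "fubar" instead of "rsh"

# MPICH2 is then told to use "fubar" for remote startup; with $TMPDIR
# first in PATH, "fubar" resolves to SGE's rsh-wrapper.
ls -l "$TMPDIR/fubar"
```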
>
> What is really used is then decided in the setup of SGE: a) the default
> builtin method in newer versions of SGE (despite the fact that you
> called "fubar" or "rsh" above), b) rsh in former times, or c) ssh with
> the http://gridengine.sunsource.net/howto/qrsh_qlogin_ssh.html setup
>
> In a private cluster (which I assume since you use ROCKS) there is in
> general no security issue between the slave nodes. Hence ssh can often
> be replaced with the default builtin or rsh startup.
>
>
> == Sidenote 1:
>
> Even when you use "rsh" between the nodes, you don't need any daemon
> running all the time (nor even install any rsh packages). SGE will start
> its own daemon from the SGE distribution on a random port and allow
> logins only from the expected source machine - one daemon for each
> `qrsh -inherit` call.
>
>
> == Sidenote 2:
>
> If you want interactive support via "qrsh"/"qlogin" (and the rsh
> method), you would need to install the rsh-server and telnetd tools, but
> don't set up any startup for them - they just need to be there (i.e.
> "disable = yes" in /etc/xinetd.d/rsh or telnet).
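For reference, a sketch of what such a deliberately disabled xinetd entry
typically looks like (the exact file layout and server path may differ per
distribution):

```
# /etc/xinetd.d/rsh -- the binary is installed, but the service stays
# disabled; only SGE's own per-job daemons are ever started.
service shell
{
        socket_type     = stream
        wait            = no
        user            = root
        server          = /usr/sbin/in.rshd
        disable         = yes
}
```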
>
> Including sidenote 1 and 2, you can have a tight integration of jobs with
> disabled ssh and rsh to any node for the user (and allow ssh only for admin
> staff).
>
>
> == Sidenote 3:
>
> When you don't like the builtin startup of SGE, you can supply a
> different sshd_config to the SGE startup of SSH. I.e. normal logins for
> users can still be blocked (these use /etc/ssh/sshd_config), but the
> adjusted sshd_config would allow it. Each `qrsh -inherit` call will
> start a daemon of its own for each job.
>
>
>
>> after that, now I'm stuck with this sequence of errors:
>> [fmartine at scigrid mpich2]$ cat mpi_hello.pe793709
>> ...
>> + /opt/sge/mpich2_mpd/bin/lx24-amd64/start_mpich2 -n compute-00-10 /usr//bin/mpd
>> + actual_retry=1
>> + '[' 1 -le 10 ']'
>> + echo 'startmpich2.sh: check for local mpd daemon (1 of 10)'
>> ++ mpdtrace -l
>> + MASTER='mpdtrace: cannot connect to local mpd
>> (/tmp/mpd2.console_fmartine_sge_793709.undefined); possible causes:
>>  1. no mpd is running on this host
>>  2. an mpd is running but was started without a "console" (-n option)
>> ...
>>
>>
>> [fmartine at scigrid mpich2]$ cat mpi_hello.po793709
>> -catch_rsh /var/sge/spool//compute-00-10/active_jobs/793709.1/pe_hostfile /usr/
>> compute-00-10:2
>> /opt/sge/mpich2_mpd/bin/lx24-amd64/start_mpich2 -n compute-00-10 /usr//bin/mpd
>> startmpich2.sh: check for local mpd daemon (1 of 10)
>> startmpich2.sh: check for local mpd daemon (2 of 10)
>> startmpich2.sh: check for local mpd daemon (3 of 10)
>> startmpich2.sh: check for local mpd daemon (4 of 10)
>> startmpich2.sh: check for local mpd daemon (5 of 10)
>> startmpich2.sh: check for local mpd daemon (6 of 10)
>> startmpich2.sh: check for local mpd daemon (7 of 10)
>> startmpich2.sh: check for local mpd daemon (8 of 10)
>> startmpich2.sh: check for local mpd daemon (9 of 10)
>> startmpich2.sh: check for local mpd daemon (10 of 10)
>> startmpich2.sh: local mpd could not be started, aborting
>> -catch_rsh /usr/
>> mpdallexit: cannot connect to local mpd
>> (/tmp/mpd2.console_fmartine_sge_793709.undefined); possible causes:
>>  1. no mpd is running on this host
>>  2. an mpd is running but was started without a "console" (-n option)
>
>
> This looks like the ROCKS feature to return the FQDN when you call
> `hostname`. The call `hostname` often needs to be adjusted to
> `hostname --short` in "start_mpich2.sh".
>
> (Yes, it's not in the Howto, which I may rework at some time: if you
> installed SGE to use the FQDN, then it's doing the right thing, and
> other distributions need `hostname --long`.)
>
>
>
>> I can confirm I can start mpich2 + mpd computations without SGE;
>> please, what may I check?
>
>
> Great, so there is no firewall on the slave nodes which would block the
> MPI communication.
>
> A fully security-enhanced MPI was just discussed on the Open MPI list.
> It would need to create two tunnels per process and would of course have
> a speed impact. No one has implemented it because of this; but then all
> MPI communication would go through ssh and you would need only one open
> port per machine.
>
> -- Reuti
>
>
>
>> I spent hours on Google and I have read that this error happened at
>> many sites, but I couldn't find a clear solution.
>> Really many thanks for your advice,
>> kind regards
>> Fabio
>>
>> --
>> Fabio Martinelli
>>
>> European Space Agency (ESA)
>> Computer Support Group (CSG)
>> E-mail:
>> fabio.martinelli at esa.int
>>
>>
>> European Space Astronomy Centre (ESAC)
>> 28691 Villanueva de la Cañada
>> P.O. Box 78, Madrid, SPAIN
>>
>
> ------------------------------------------------------
> http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=267786
>
> To unsubscribe from this discussion, e-mail:
> [users-unsubscribe at gridengine.sunsource.net].



