[GE users] error on the mpich2 + mpd + sge 6.2 integration - mpdtrace: cannot connect to local mpd

reuti reuti at staff.uni-marburg.de
Thu Jul 29 17:27:47 BST 2010



Hi Fabio,

On 29.07.2010 at 16:01, fabiomartinelli wrote:

> this is the output of an echo $PATH job in the PE mpich2_mpd
> [fmartine at scigrid ~]$ cat test.sh.o823386
> /tmp/823386.1.mpi.q.test:/usr/local/bin:/bin:/usr/bin

Well, this is as it should be, and the rsh wrapper should be found first. I don't know exactly how mpich2 searches for a method to start the daemons.

You could reinstall rsh in its former location on one node (or put an executable dummy program/script there), and check with a serial job w/o a PE what:

which rsh

is giving. Then do the same with a PE which doesn't need any start_proc_args, e.g. a PE like "smp" with $pe_slots, which is often available on clusters.
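For a quick check, a throwaway job script along these lines should do (a minimal sketch; the script name and the PE name "smp" are only placeholders for whatever exists on your cluster):

#!/bin/sh
# checkpath.sh - hypothetical test job:
#   qsub checkpath.sh              (serial, no PE)
#   qsub -pe smp 1 checkpath.sh    (PE without start_proc_args)
echo "PATH=$PATH"
which rsh || echo "no rsh found in PATH"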

-- Reuti


>
> I'm using these RPMs:
> mpich2-devel-1.2.1-3.el5
> mpich2-1.2.1-3.el5
>
> if that matters, there are many other MPI distributions installed on the
> same servers
>
> many thanks
> fabio
>
>
>
>
>
>
>
> On 29/07/2010 15.43, reuti <reuti at staff.uni-marburg.de> wrote to users at gridengine.sunsource.net:
>
> Hi,
>
> On 29.07.2010 at 15:16, fabiomartinelli wrote:
>
>> I fixed this error in the way I'm about to report. I don't know if the
>> solution was obvious, but I wasted so many hours on this integration that
>> it's worth reporting:
>>
>> I enforced in the SGE master conf these settings:
>> ...
>> rsh_command                  builtin
>> rsh_daemon                   builtin
>> ...
>>
>> and modified this line
>> [root at scigrid mpich2_mpd]# grep -Hn short startmpich2.sh
>> startmpich2.sh:176:NODE=`hostname --short`
>> [root at scigrid mpich2_mpd]#
>
> yep.
>
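For reference, the rsh_command/rsh_daemon settings quoted above can be inspected and changed in SGE's global cluster configuration, e.g. (a sketch; qconf -mconf opens the configuration in your editor):

qconf -sconf | egrep 'rsh_(command|daemon)'
# rsh_command                  builtin
# rsh_daemon                   builtin
qconf -mconf      # edit the global configuration to set both to "builtin"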
>
>> AND
>>
>> I removed the rsh binary from the computational cluster !
>
> This is strange and shouldn't be necessary, but it might depend on the way
> the call to "rsh" is coded in your mpich2. When it's a hardcoded full path,
> then the -catch_rsh can't work of course. But then it shouldn't work with
> your current setup either. By default the $TMPDIR is first in the $PATH which
> is given to start_proc_args and so on, and this way the rsh-wrapper is found
> first.
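For illustration, the usual mechanism behind -catch_rsh looks roughly like this (a simplified sketch, not the literal startmpich2.sh code):

# inside start_proc_args (e.g. startmpich2.sh), when called with -catch_rsh:
rsh_wrapper=$SGE_ROOT/mpi/rsh        # SGE's rsh wrapper script
ln -s $rsh_wrapper $TMPDIR/rsh       # $TMPDIR comes first in the job's $PATH,
                                     # so a plain "rsh" call hits the wrapper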
>
> If you'd like to investigate this, you could submit a job with a simple
> "echo $PATH" statement.
>
> -- Reuti
>
>
>> without removing the rsh binary I was still getting the error: "  mpdtrace:
>> cannot connect to local mpd "
>> and I couldn't see the call:  /opt/sge/bin/lx24-amd64/qrsh -inherit -V
>> compute-00-00 /usr//bin/mpd
>>
>> [fmartine at scigrid mpich2]$ cat  mpi_hello.po823383
>> -catch_rsh /var/sge/spool//compute-00-00/active_jobs/823383.1/pe_hostfile
>> /usr/
>> compute-00-00:1
>> compute-00-06:1
>> startmpich2.sh: check for local mpd daemon (1 of 10)
>> /opt/sge/bin/lx24-amd64/qrsh -inherit -V compute-00-00 /usr//bin/mpd
>> startmpich2.sh: check for local mpd daemon (2 of 10)
>> startmpich2.sh: check for mpd daemons (1 of 10)
>> /opt/sge/bin/lx24-amd64/qrsh -inherit -V compute-00-06 /usr//bin/mpd -h
>> compute-00-00 -p 55940 -n
>> startmpich2.sh: check for mpd daemons (2 of 10)
>> startmpich2.sh: got all 2 of 2 nodes
>> -catch_rsh /usr/
>>
>> so, is this the right configuration, or was I just lucky?
>>
>> many thanks for your answers
>> best regards
>> Fabio
>>
>>
>>
>>
>>
>>
>>
>>
>>
>> On 14/07/2010 12.22, reuti <reuti at staff.uni-marburg.de> wrote to users at gridengine.sunsource.net:
>>
>> Hi,
>>
>> On 13.07.2010 at 16:56, fabiomartinelli wrote:
>>
>>> I'm stuck with this issue on my CentOS 5.3 cluster: mpich2 + mpd + sge 6.2
>>> integration - mpdtrace: cannot connect to local mpd
>>
>> when do you get this error? In the startup of the daemons or during
>> execution of the jobscript?
>>
>>
>>> does somebody have experience with this integration?
>>>
>>> I already applied Reuti's advice to use `hostname --short` in
>>> "start_mpich2.sh" (the logs on this page are from before Reuti advised me,
>>> so they are still valid)
>>>
>>> I tried to play a bit with qrsh/qlogin to use builtin instead of SSH, but
>>> that seems to have no effect on this error (it could be because it's a
>>> preliminary error).
>>>
>>> Also, I found in the rsh script a flag "just_wrap" that I don't understand
>>> how to set ("" or 1?):
>>> [root at wn55 ge6.2u5]# head /opt/gridengine/mpich2_mpd/rsh
>>> #!/bin/bash -x
>>> #
>>> #
>>> # (c) 2002 Sun Microsystems, Inc. Use is subject to license terms.
>>>
>>> # could be rsh or remsh
>>> me=`basename $0`
>>> # just_wrap=1
>>
>> I think this option is seldom used. I included the scripts, which you can
>> also find in $SGE_ROOT/mpi, as an example. The purpose of the just_wrap flag
>> is to test whether the rsh wrapper is accessed at all (in case you have some
>> problems with the setup), but it still uses the default `rsh` w/o any SGE
>> interaction. Hence under normal operation it should stay unset as it is, to
>> allow the rsh wrapper to call `qrsh -inherit ...`.
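In other words, the wrapper's decision boils down to something like this (a simplified sketch of the logic, not the full script, which also parses the hostname and command out of its arguments):

me=`basename $0`                 # could be rsh or remsh
# just_wrap=1
if [ -z "$just_wrap" ]; then
    # normal tight integration: hand the remote start over to SGE
    exec qrsh -inherit "$@"
else
    # debugging aid: proves the wrapper is reached, but bypasses SGE
    exec /usr/bin/rsh "$@"
fi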
>>
>>
>>> my aim is to enable MPICH2/MPD/SGE over SSH because I'm offering the
>>> cluster as a Grid site, but if that's too complicated I could relax the
>>> SSH requirement
>>
>> Even when you start the processes by SSH, you need open ports to allow
>> MPICH2 to communicate anyway.
>>
>> -- Reuti
>>
>>
>>> so kindly share your working configurations,
>>> many thanks
>>> Fabio Martinelli
>>>
>>>
>>>
>>>
>>>
>>>
>>> *****************
>>> Hi,
>>>
>>> can you please ask on the list in the future?
>>>
>>> Short answer:
>>>
>>> This looks like the ROCKS feature to return the FQDN when you call
>>> `hostname`. The call to `hostname` often needs to be adjusted to
>>> `hostname --short` in "start_mpich2.sh", where it checks for the hostname
>>> in the loop near the end of the script.
>>>
>>>
>>> On 08.07.2010 at 17:26, Fabio Martinelli wrote:
>>>
>>>
>>>> Hello Reuti
>>>>
>>>> this is Fabio Martinelli from ESAC Madrid, where I manage a 30-server
>>>> CentOS 5.3 Linux cluster scheduled by a Sun Grid Engine 6.2u2 installation
>>>
>>>
>>>
>>>> between the servers I configured SSH host-based authentication, so there
>>>> are no rsh daemons.
>>>>
>>>> I was following your tutorial for mpich2 + mpd:
>>>> http://gridengine.sunsource.net/howto/mpich2-integration/mpich2-integration.html
>>>>
>>>> the first error I ran into was with the SSH support; after a while I
>>>> realized I should recompile start_mpich2.c explicitly requesting ssh,
>>>> something like:
>>>> [fmartine at scigrid src]$ diff start_mpich2.c start_mpich2.c.backup
>>>> 100c100
>>>> <       rsh_argv[0]="ssh";
>>>> ---
>>>> >       rsh_argv[0]="rsh";
>>>>
>>>> could you kindly confirm this method for SSH?
>>>
>>>
>>> Long answer:
>>>
>>> Yes/no. The name here only needs to match the one which is used in MPICH2.
>>> You could even name the called program "fubar" and tell MPICH2 to use
>>> "fubar" for remote access.
>>>
>>> In a tight integration, these calls to "fubar" will then be caught by
>>> SGE's rsh wrapper, which must of course match the name from above (i.e.
>>> you would adjust the script defined for "start_proc_args" in the PE to
>>> create a link called "fubar" in $TMPDIR instead of the usual "rsh"). The
>>> "fubar" link will then call SGE's rsh wrapper and start a
>>> `qrsh -inherit ...` in SGE.
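As a sketch of what that would mean in practice (the name "fubar" and the exact mpdboot call are illustrative only; the real startmpich2.sh differs in its details):

# in the PE's start_proc_args script: catch the custom name instead of "rsh"
ln -s $SGE_ROOT/mpi/rsh $TMPDIR/fubar

# and tell MPICH2's mpdboot to start the daemons with the same name
mpdboot -n $NHOSTS -f $TMPDIR/machines --rsh=fubar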
>>>
>>> What is really used is then defined in the setup of SGE: a) the default
>>> builtin method in newer versions of SGE (despite the fact that you called
>>> "fubar" or "rsh" above), b) rsh in former times, or c) ssh with the
>>> http://gridengine.sunsource.net/howto/qrsh_qlogin_ssh.html setup.
>>>
>>> In a private cluster (which I assume when you use ROCKS) there is in
>>> general no security issue between the slave nodes. Hence ssh can often be
>>> replaced with the default builtin or rsh startup.
>>>
>>>
>>> == Sidenote 1:
>>>
>>> Even when you use "rsh" between the nodes, you don't need to have any
>>> daemon running all the time (and you don't even need to install any rsh
>>> stuff). SGE will start its own daemon from the SGE distribution on a
>>> random port and allow logins only from the expected source machine (one
>>> daemon for each `qrsh -inherit` call).
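For example, a classic rsh-based tight setup points the cluster configuration at SGE's own binaries, roughly like this (a sketch; the exact paths depend on your $SGE_ROOT and architecture):

rsh_command     /opt/sge/utilbin/lx24-amd64/rsh
rsh_daemon      /opt/sge/utilbin/lx24-amd64/rshd -l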
>>>
>>>
>>> == Sidenote 2:
>>>
>>> If you want to have interactive support via "rsh"/"qlogin" (and the rsh
>>> method), you would need to install the rsh-server and telnetd tools, but
>>> don't set up any startup for them - they just need to be there (i.e.
>>> "disable = yes" in /etc/xinetd.d/rsh or telnet).
>>>
>>> Combining sidenotes 1 and 2, you can have a tight integration of jobs
>>> while ssh and rsh to any node are disabled for the users (and allow ssh
>>> only for admin staff).
>>>
>>>
>>> == Sidenote 3:
>>>
>>> If you don't like the builtin startup of SGE, you can supply a different
>>> sshd_config to the sshd started by SGE. I.e. normal logins for users can
>>> still be blocked (these use /etc/ssh/sshd_config), but the adjusted
>>> sshd_config would allow it. Each `qrsh -inherit` call will start a daemon
>>> of its own for each job.
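A hedged sketch of such an entry in the cluster configuration (the config file path is made up; sshd's -i runs it in inetd mode, -f points it at the alternative configuration):

rsh_command     /usr/bin/ssh
rsh_daemon      /usr/sbin/sshd -i -f /opt/sge/etc/sshd_config.sge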
>>>
>>>
>>>
>>>> after that, now I'm stuck with this sequence of errors:
>>>> [fmartine at scigrid mpich2]$ cat mpi_hello.pe793709
>>>> ...
>>>> + /opt/sge/mpich2_mpd/bin/lx24-amd64/start_mpich2 -n compute-00-10
>>>    /usr//bin/mpd
>>>> + actual_retry=1
>>>> + '[' 1 -le 10 ']'
>>>> + echo 'startmpich2.sh: check for local mpd daemon (1 of 10)'
>>>> ++ mpdtrace -l
>>>> + MASTER='mpdtrace: cannot connect to local mpd
>>>    (/tmp/mpd2.console_fmartine_sge_793709.undefined); possible causes:
>>>> 1. no mpd is running on this host
>>>> 2. an mpd is running but was started without a "console" (-n
>>>    option)
>>>> ...
>>>>
>>>>
>>>> [fmartine at scigrid mpich2]$ cat mpi_hello.po793709
>>>> -catch_rsh
>>>    /var/sge/spool//compute-00-10/active_jobs/793709.1/pe_hostfile /usr/
>>>> compute-00-10:2
>>>> /opt/sge/mpich2_mpd/bin/lx24-amd64/start_mpich2 -n compute-00-10
>>>    /usr//bin/mpd
>>>> startmpich2.sh: check for local mpd daemon (1 of 10)
>>>> startmpich2.sh: check for local mpd daemon (2 of 10)
>>>> startmpich2.sh: check for local mpd daemon (3 of 10)
>>>> startmpich2.sh: check for local mpd daemon (4 of 10)
>>>> startmpich2.sh: check for local mpd daemon (5 of 10)
>>>> startmpich2.sh: check for local mpd daemon (6 of 10)
>>>> startmpich2.sh: check for local mpd daemon (7 of 10)
>>>> startmpich2.sh: check for local mpd daemon (8 of 10)
>>>> startmpich2.sh: check for local mpd daemon (9 of 10)
>>>> startmpich2.sh: check for local mpd daemon (10 of 10)
>>>> startmpich2.sh: local mpd could not be started, aborting
>>>> -catch_rsh /usr/
>>>> mpdallexit: cannot connect to local mpd
>>>    (/tmp/mpd2.console_fmartine_sge_793709.undefined); possible causes:
>>>> 1. no mpd is running on this host
>>>> 2. an mpd is running but was started without a "console" (-n
>>>    option)
>>>
>>>
>>> This looks like the ROCKS feature to return the FQDN when you call
>>> `hostname`. The call to `hostname` often needs to be adjusted to
>>> `hostname --short` in "start_mpich2.sh".
>>>
>>> (Yes, it's not in the Howto, which I may rework at some time: if you
>>> installed SGE to use the FQDN, then it's doing the right thing, and other
>>> distributions need `hostname --long`.)
>>>
>>>
>>>
>>>> I can confirm I can start mpich2 + mpd computations without SGE;
>>>> please, what should I check?
>>>
>>>
>>> Great, so there is no firewall on the slave nodes which would block the
>>> MPI communication.
>>>
>>> Having a fully security-enhanced MPI was just discussed on the Open MPI
>>> list. It would need two tunnels per process and would of course have a
>>> speed impact. No one implemented it because of this; but then all MPI
>>> communication would go through ssh and you would need only one open port
>>> per machine.
>>>
>>> -- Reuti
>>>
>>>
>>>
>>>> I spent hours on Google and read that this error has happened at many
>>>> sites, but I couldn't find a clear solution;
>>>> really many thanks for your advice
>>>> kind regards
>>>> Fabio
>>>>
>>>> --
>>>> Fabio Martinelli
>>>>
>>>> European Space Agency (ESA)
>>>> Computer Support Group (CSG)
>>>> E-mail:
>>>> fabio.martinelli at esa.int
>>>>
>>>>
>>>> European Space Astronomy Centre (ESAC)
>>>> 28691 Villanueva de la Cañada
>>>> P.O. Box 78, Madrid, SPAIN
>>>>
>>>