[GE users] error on the mpich2 + mpd + sge 6.2 integration - mpdtrace: cannot connect to local mpd

reuti reuti at staff.uni-marburg.de
Wed Jul 14 11:22:14 BST 2010



Hi,

On 13.07.2010, at 16:56, fabiomartinelli wrote:

> I'm stuck with this issue on my CentOS 5.3 cluster : mpich2 + mpd + sge 6.2
> integration - mpdtrace: cannot connect to local mpd

When do you get this error: during startup of the daemons, or during execution of the job script?


> does somebody have experience with this integration?
> 
> I already applied Reuti's advice to use `hostname --short` in
> "start_mpich2.sh" (this was before Reuti advised me, so the logs in this
> page are still valid)
> 
> I tried to play a bit with qrsh/qlogin to use builtin instead of SSH, but
> they seem neutral to this error (it could be because it's a preliminary
> error)
> 
> also I found in the rsh script a flag "just_wrap" that I don't understand
> how to manage ("" or 1?)
> [root at wn55 ge6.2u5]# head /opt/gridengine/mpich2_mpd/rsh
> #!/bin/bash -x
> #
> #
> # (c) 2002 Sun Microsystems, Inc. Use is subject to license terms.
> 
> # could be rsh or remsh
> me=`basename $0`
> # just_wrap=1

I think this option is seldom used. I included the scripts, which you can also find in $SGE_ROOT/mpi, as an example. The purpose of the just_wrap flag is to test whether the rsh wrapper is accessed at all (in case you have problems with the setup), while still using the default `rsh` without any SGE interaction. Hence under normal operation it should stay unset, as it is, to allow the rsh wrapper to call `qrsh -inherit ...`.
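The dispatch the flag controls can be sketched like this (a minimal illustration, not the actual $SGE_ROOT/mpi/rsh script; the function name and paths are simplified):

```shell
# Minimal sketch of the rsh-wrapper dispatch controlled by just_wrap.
# Not the real $SGE_ROOT/mpi/rsh - just the decision it makes.
wrap_rsh() {
    local just_wrap="$1" host="$2"
    shift 2
    if [ -n "$just_wrap" ]; then
        # just_wrap=1: verify the wrapper is reached, but keep plain rsh
        echo "/usr/bin/rsh $host $*"
    else
        # normal tight integration: route the remote start through SGE
        echo "qrsh -inherit $host $*"
    fi
}

wrap_rsh ""  node01 hostname   # -> qrsh -inherit node01 hostname
wrap_rsh "1" node01 hostname   # -> /usr/bin/rsh node01 hostname
```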


> my aim is to enable MPICH2/MPD/SGE on SSH because I'm offering the cluster
> like a Grid site but if that's so complicated I could relax the SSH
> requirement

Even when you start the processes via SSH, you still need open ports to allow MPICH2 itself to communicate.

-- Reuti


> so kindly print your working configurations,
> many thanks
> Fabio Martinelli
> 
> 
> 
> 
> 
> 
> *****************
> Hi,
> 
> can you please ask on the list in the future?
> 
> Short answer:
> 
> This looks like the ROCKS feature of returning the FQDN when you call
> `hostname`. The call to `hostname` often needs to be adjusted to
> `hostname --short` in "start_mpich2.sh", where it checks for the hostname
> in the loop near the end of the script.
> 
> 
> On 08.07.2010, at 17:26, Fabio Martinelli wrote:
> 
> 
>> Hello Reuti
>> 
>> this is Fabio Martinelli from ESAC Madrid; there I manage a 30-server
>> CentOS 5.3 Linux cluster controlled by a Sun Grid Engine 6.2u2
>> installation
> 
> 
> 
>> between the servers I configured SSH host-based authentication, so
>> no rsh daemons.
>> 
>> I was following your tutorial for mpich2 + mpd
>> http://gridengine.sunsource.net/howto/mpich2-integration/mpich2-integration.html
> 
>> 
>> the first error I ran into was the SSH support; after a while I
>> realized I should recompile start_mpich2.c explicitly requesting ssh,
>> something like:
>> [fmartine at scigrid src]$ diff start_mpich2.c start_mpich2.c.backup
>> 100c100
>> <       rsh_argv[0]="ssh";
>> ---
>> >       rsh_argv[0]="rsh";
>> 
>> could you kindly confirm this method for SSH?
> 
> 
> Long answer:
> 
> Yes/no. The name here only needs to match the one used by MPICH2. You
> could even name the called program "fubar" and tell MPICH2 to use "fubar"
> for remote access.
> 
> In a tight integration, these calls to "fubar" will then be caught by SGE's
> rsh-wrapper, which must of course match the name from above (i.e. you would
> adjust the defined script for "start_proc_args" in the PE to create a link
> called "fubar" in $TMPDIR instead of the usual "rsh"). The "fubar" link
> will then call SGE's rsh-wrapper and start a `qrsh -inherit ...` in SGE.
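The link-creation step described above can be sketched as follows (paths are illustrative; the real start_proc_args script in $SGE_ROOT/mpi does more):

```shell
# Expose SGE's rsh wrapper under the name MPICH2 was configured to call
# ("fubar" in the example above). SGE sets $TMPDIR per job; a fallback is
# used here so the sketch runs standalone.
TMPDIR=${TMPDIR:-$(mktemp -d)}
SGE_ROOT=${SGE_ROOT:-/opt/sge}          # adjust to your installation
ln -sf "$SGE_ROOT/mpi/rsh" "$TMPDIR/fubar"
export PATH="$TMPDIR:$PATH"             # `fubar host cmd` now hits the wrapper
```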
> 
> What is really used is then determined by the setup of SGE: a) the
> default builtin method in newer versions of SGE (despite the fact that
> you called "fubar" or "rsh" above), b) rsh in former times, or c) ssh
> with the http://gridengine.sunsource.net/howto/qrsh_qlogin_ssh.html setup
> 
> In a private cluster (which I assume, since you use ROCKS) there is in
> general no security issue between the slave nodes. Hence ssh can often be
> replaced with the default builtin or rsh startup.
> 
> 
> == Sidenote 1:
> 
> Even when you use "rsh" between the nodes, you don't need any daemon
> running all the time (nor do you even need to install any rsh packages).
> SGE will start its own daemon from the SGE distribution on a random port
> and allow logins only from the expected source machine - one daemon for
> each `qrsh -inherit` call.
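The per-job daemon described in this sidenote is configured via the rsh-related entries of the cluster configuration; an illustrative excerpt (architecture and paths depend on your installation, so treat the values as an assumption):

```
# excerpt of `qconf -sconf` for the rsh startup method (example values)
rsh_daemon     /opt/sge/utilbin/lx24-amd64/rshd -l
rsh_command    /opt/sge/utilbin/lx24-amd64/rsh
```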
> 
> 
> == Sidenote 2:
> 
> If you want interactive support via "qrsh"/"qlogin" (with the rsh
> method), you would need to install the rsh-server and telnetd tools, but
> don't set up any startup for them - they just need to be there (i.e.
> "disable = yes" in /etc/xinetd.d/rsh or telnet).
> 
> Combining sidenotes 1 and 2, you can have a tight integration of jobs
> with ssh and rsh to any node disabled for users (while allowing ssh only
> for admin staff).
> 
> 
> == Sidenote 3:
> 
> If you don't like the builtin startup of SGE, you can supply a different
> sshd_config to SGE's startup of SSH. I.e. normal logins for users can
> still be blocked (these use /etc/ssh/sshd_config), while the adjusted
> sshd_config would allow them. Each `qrsh -inherit` call will start a
> daemon of its own for each job.
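The adjusted-sshd_config setup from this sidenote corresponds to entries like these in the cluster configuration (cf. the linked Howto; the sshd_config path here is a hypothetical example):

```
# qrsh via ssh with a dedicated sshd_config (illustrative values)
rsh_command    /usr/bin/ssh
rsh_daemon     /usr/sbin/sshd -i -f /opt/sge/etc/sshd_config_qrsh
```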
> 
> 
> 
>> after that, now I'm stuck with this sequence of errors:
>> [fmartine at scigrid mpich2]$ cat mpi_hello.pe793709
>> ...
>> + /opt/sge/mpich2_mpd/bin/lx24-amd64/start_mpich2 -n compute-00-10 /usr//bin/mpd
>> + actual_retry=1
>> + '[' 1 -le 10 ']'
>> + echo 'startmpich2.sh: check for local mpd daemon (1 of 10)'
>> ++ mpdtrace -l
>> + MASTER='mpdtrace: cannot connect to local mpd (/tmp/mpd2.console_fmartine_sge_793709.undefined); possible causes:
>>  1. no mpd is running on this host
>>  2. an mpd is running but was started without a "console" (-n option)
>> ...
>> 
>> 
>> [fmartine at scigrid mpich2]$ cat mpi_hello.po793709
>> -catch_rsh /var/sge/spool//compute-00-10/active_jobs/793709.1/pe_hostfile /usr/
>> compute-00-10:2
>> /opt/sge/mpich2_mpd/bin/lx24-amd64/start_mpich2 -n compute-00-10 /usr//bin/mpd
>> startmpich2.sh: check for local mpd daemon (1 of 10)
>> startmpich2.sh: check for local mpd daemon (2 of 10)
>> startmpich2.sh: check for local mpd daemon (3 of 10)
>> startmpich2.sh: check for local mpd daemon (4 of 10)
>> startmpich2.sh: check for local mpd daemon (5 of 10)
>> startmpich2.sh: check for local mpd daemon (6 of 10)
>> startmpich2.sh: check for local mpd daemon (7 of 10)
>> startmpich2.sh: check for local mpd daemon (8 of 10)
>> startmpich2.sh: check for local mpd daemon (9 of 10)
>> startmpich2.sh: check for local mpd daemon (10 of 10)
>> startmpich2.sh: local mpd could not be started, aborting
>> -catch_rsh /usr/
>> mpdallexit: cannot connect to local mpd (/tmp/mpd2.console_fmartine_sge_793709.undefined); possible causes:
>>  1. no mpd is running on this host
>>  2. an mpd is running but was started without a "console" (-n option)
> 
> 
> This looks like the ROCKS feature of returning the FQDN when you call
> `hostname`. The call to `hostname` often needs to be adjusted to
> `hostname --short` in "start_mpich2.sh".
> 
> (Yes, it's not in the Howto, which I may rework at some time: if you
> installed SGE to use the FQDN, then it's doing the right thing, and other
> distributions need `hostname --long`.)
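The effect of the adjustment can be sketched as follows (a minimal illustration; the actual loop in start_mpich2.sh differs):

```shell
# The startup loop compares the local hostname against the short names in
# the pe_hostfile; on ROCKS, plain `hostname` returns the FQDN, so the
# comparison never matches and mpd appears to be missing.
node_in_hostfile="compute-00-10"     # short name, as listed in $pe_hostfile
fqdn="compute-00-10.local"           # what plain `hostname` returns on ROCKS
short="${fqdn%%.*}"                  # what `hostname --short` would return
[ "$short" = "$node_in_hostfile" ] && echo "hostname check passes"
```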
> 
> 
> 
>> I can confirm I can start mpich2 + mpd computations without SGE;
>> please, what may I check?
> 
> 
> Great, so there is no firewall on the slave nodes which would block the
> MPI communication.
> 
> A fully security-enhanced MPI was just discussed on the Open MPI list. It
> would need to create two tunnels per process and would of course have a
> speed impact. No one implemented it because of this; but then all MPI
> communication would go through ssh and you would need only one open port
> per machine.
> 
> -- Reuti
> 
> 
> 
>> I spent hours on Google and I have read this error happened at many
>> sites, but I could not find a clear solution,
>> really many thanks for your advices
>> kind regards
>> Fabio
>> 
>> --
>> Fabio Martinelli
>> 
>> European Space Agency (ESA)
>> Computer Support Group (CSG)
>> E-mail:
>> fabio.martinelli at esa.int
>> 
>> 
>> European Space Astronomy Centre (ESAC)
>> 28691 Villanueva de la Cañada
>> P.O. Box 78, Madrid, SPAIN
>> 
> 
> ------------------------------------------------------
> http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=267786
> 
> To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].
>

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=267949

To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].


