[GE users] error on the mpich2 + mpd + sge 6.2 integration - mpdtrace: cannot connect to local mpd

fabiomartinelli fabio.martinelli at esa.int
Tue Jul 13 15:56:27 BST 2010



Hello SGE Users

I'm stuck with this issue on my CentOS 5.3 cluster: mpich2 + mpd + SGE 6.2
integration - "mpdtrace: cannot connect to local mpd".

please, does somebody have experience with this integration?

I already applied Reuti's advice to use `hostname --short` in
"start_mpich2.sh" (the logs below were captured before Reuti advised me, but
the error is unchanged, so they are still valid).

I tried to play a bit with qrsh/qlogin, using the builtin method instead of
SSH, but that seems to make no difference to this error (perhaps because the
failure happens at an earlier stage).

I also found a "just_wrap" flag in the rsh wrapper script that I don't
understand how to set ("" or 1?):
[root@wn55 ge6.2u5]# head /opt/gridengine/mpich2_mpd/rsh
#!/bin/bash -x
#
#
# (c) 2002 Sun Microsystems, Inc. Use is subject to license terms.

# could be rsh or remsh
me=`basename $0`
# just_wrap=1


my aim is to enable MPICH2/MPD/SGE over SSH because I'm offering the cluster
as a Grid site, but if that's too complicated I could relax the SSH
requirement.

so kindly post your working configurations,
many thanks
Fabio Martinelli

*****************
Hi,

can you please ask on the list in the future?

Short answer:

This looks like the ROCKS feature of returning the FQDN when you call
`hostname`. The `hostname` call in "start_mpich2.sh" often needs to be
adjusted to `hostname --short`, in the loop near the end of the script where
it checks for the hostname.
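
For illustration, the kind of change meant here (a sketch of the relevant
spot, not the verbatim script; the variable name NODE is illustrative):

    # was:  NODE=`hostname`
    NODE=`hostname --short`
    # the retry loop near the end compares $NODE with the host reported
    # by `mpdtrace -l`; with the FQDN that ROCKS returns the names never
    # match, so the "check for local mpd daemon" loop runs out.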


On 08.07.2010, at 17:26, Fabio Martinelli wrote:


      > Hello Reuti
      >
      > this is Fabio Martinelli from ESAC Madrid; there I manage a
      > 30-server CentOS 5.3 Linux cluster run by a Sun Grid Engine
      > 6.2u2 installation.
      >
      > between the servers I configured SSH hostbased authentication,
      > so no rsh daemons.
      >
      > I was following your tutorial for mpich2 + mpd
      > http://gridengine.sunsource.net/howto/mpich2-integration/mpich2-integration.html
      >
      > the first error I hit was the SSH support; after a while I
      > realized I should recompile start_mpich2.c explicitly requesting
      > ssh, something like:
      > [fmartine@scigrid src]$ diff start_mpich2.c start_mpich2.c.backup
      > 100c100
      > <       rsh_argv[0]="ssh";
      > ---
      > >       rsh_argv[0]="rsh";
      >
      > kindly, can you confirm this method for SSH?


Long answer:

Yes/no. The name here only needs to match the one used in MPICH2. You could
even name the called program "fubar" and tell MPICH2 to use "fubar" for
remote access.
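
For illustration, with the hypothetical name "fubar" the recompiled line in
start_mpich2.c would read:

    rsh_argv[0]="fubar";   /* the name the starter will call */

and at job run time "fubar" must then resolve to SGE's rsh-wrapper via
$PATH (see the link below).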

In a tight integration, these calls to "fubar" will then be caught by SGE's
rsh-wrapper, which must of course match the name from above (i.e. you would
adjust the script defined for "start_proc_args" in the PE to create a link
called "fubar" in $TMPDIR instead of the usual "rsh"). The "fubar" link will
then call SGE's rsh-wrapper and start a `qrsh -inherit ...` in SGE.
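
A minimal sketch of that adjustment in the start_proc_args script, assuming
the stock layout under $SGE_ROOT/mpich2_mpd (the name "fubar" as above):

    # was:  ln -s $SGE_ROOT/mpich2_mpd/rsh $TMPDIR/rsh
    ln -s $SGE_ROOT/mpich2_mpd/rsh $TMPDIR/fubar
    # $TMPDIR is first in the job's $PATH, so MPICH2's call to "fubar"
    # lands in the wrapper, which execs `qrsh -inherit ...`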

What is actually used is then determined by SGE's setup: a) the default
builtin method in newer versions of SGE (regardless of whether you called it
"fubar" or "rsh" above), b) rsh in former times, or c) ssh with the setup
from http://gridengine.sunsource.net/howto/qrsh_qlogin_ssh.html
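
E.g., the relevant entries in the cluster configuration (edit with
`qconf -mconf`; the values are examples, not a prescription):

    rsh_command    builtin
    rsh_daemon     builtin

or, for the ssh variant from the howto above:

    rsh_command    /usr/bin/ssh
    rsh_daemon     /usr/sbin/sshd -i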

In a private cluster (which I assume you have, as you use ROCKS) there is in
general no security issue between the slave nodes. Hence ssh can often be
replaced with the default builtin or rsh startup.


== Sidenote 1:

Even when you use "rsh" between the nodes, you don't need any daemon running
all the time (and you don't even need to install any rsh packages). SGE will
start its own daemon from the SGE distribution on a random port and allow
logins only from the expected source machine - one daemon per `qrsh
-inherit` call.
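
For the classic rsh method the global configuration typically points at the
bundled daemon, along these lines (path illustrative for a lx24-amd64
install under /opt/gridengine):

    rsh_daemon     /opt/gridengine/utilbin/lx24-amd64/rshd -l

the matching rsh client ships with SGE as well, so nothing from the OS rsh
packages is involved.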


== Sidenote 2:

If you want interactive support via "qrsh"/"qlogin" (and the rsh method),
you would need to install the rsh-server and telnetd tools, but don't set up
any startup for them - the binaries just need to be there (i.e. "disable =
yes" in /etc/xinetd.d/rsh resp. telnet).

Combining sidenotes 1 and 2, you can have a tight integration of jobs with
ssh and rsh to any node disabled for normal users (and allow ssh only for
admin staff).


== Sidenote 3:

If you don't like the builtin startup of SGE, you can supply a different
sshd_config to the sshd started by SGE. I.e. normal logins for users can
still be blocked (these use /etc/ssh/sshd_config), while the adjusted
sshd_config would allow the job startup. Each `qrsh -inherit` call will
start a daemon of its own for each job.
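
E.g. (the file name and location are just an example):

    rsh_daemon     /usr/sbin/sshd -i -f /opt/gridengine/etc/sshd_config_qrsh

where sshd_config_qrsh can be more permissive than the system-wide
/etc/ssh/sshd_config, which keeps blocking normal interactive logins.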



      > after that, now I'm stuck with this sequence of errors:
      >
      > [fmartine@scigrid mpich2]$ cat mpi_hello.pe793709
      > ...
      > + /opt/sge/mpich2_mpd/bin/lx24-amd64/start_mpich2 -n compute-00-10 /usr//bin/mpd
      > + actual_retry=1
      > + '[' 1 -le 10 ']'
      > + echo 'startmpich2.sh: check for local mpd daemon (1 of 10)'
      > ++ mpdtrace -l
      > + MASTER='mpdtrace: cannot connect to local mpd (/tmp/mpd2.console_fmartine_sge_793709.undefined); possible causes:
      >   1. no mpd is running on this host
      >   2. an mpd is running but was started without a "console" (-n option)
      > ...
      >
      > [fmartine@scigrid mpich2]$ cat mpi_hello.po793709
      > -catch_rsh /var/sge/spool//compute-00-10/active_jobs/793709.1/pe_hostfile /usr/
      > compute-00-10:2
      > /opt/sge/mpich2_mpd/bin/lx24-amd64/start_mpich2 -n compute-00-10 /usr//bin/mpd
      > startmpich2.sh: check for local mpd daemon (1 of 10)
      > startmpich2.sh: check for local mpd daemon (2 of 10)
      > startmpich2.sh: check for local mpd daemon (3 of 10)
      > startmpich2.sh: check for local mpd daemon (4 of 10)
      > startmpich2.sh: check for local mpd daemon (5 of 10)
      > startmpich2.sh: check for local mpd daemon (6 of 10)
      > startmpich2.sh: check for local mpd daemon (7 of 10)
      > startmpich2.sh: check for local mpd daemon (8 of 10)
      > startmpich2.sh: check for local mpd daemon (9 of 10)
      > startmpich2.sh: check for local mpd daemon (10 of 10)
      > startmpich2.sh: local mpd could not be started, aborting
      > -catch_rsh /usr/
      > mpdallexit: cannot connect to local mpd (/tmp/mpd2.console_fmartine_sge_793709.undefined); possible causes:
      >   1. no mpd is running on this host
      >   2. an mpd is running but was started without a "console" (-n option)


This looks like the ROCKS feature of returning the FQDN when you call
`hostname`. The `hostname` call in "start_mpich2.sh" often needs to be
adjusted to `hostname --short`.

(Yes, it's not in the Howto, which I may rework at some point: if you
installed SGE to use the FQDN, then `hostname` is already doing the right
thing, and on other distributions you would need `hostname --long` instead.)
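
Illustrative output on a ROCKS-style node (the exact names are examples):

    $ hostname
    compute-00-10.local
    $ hostname --short
    compute-00-10

the pe_hostfile in your job output holds the short form "compute-00-10", so
only `hostname --short` matches.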



      > I can confirm I can start mpich2 + mpd computations without SGE;
      > please, what should I check?


Great, so there is no firewall on the slave nodes which would block the MPI
communication.

Having a fully security-enhanced MPI was discussed on the Open MPI list just
recently. It would need two tunnels per process and would of course have a
speed impact; nobody has implemented it for this reason. But then all MPI
communication would go through ssh and you would need only one open port per
machine.

-- Reuti



      > I spent hours on Google and I have read that this error happened at
      > many sites, but I couldn't find a clear solution.
      > really many thanks for your advice
      > kind regards
      > Fabio
      >
      > --
      > Fabio Martinelli
      >
      > European Space Agency (ESA)
      > Computer Support Group (CSG)
      > E-mail: fabio.martinelli at esa.int
      >
      > European Space Astronomy Centre (ESAC)
      > 28691 Villanueva de la Cañada
      > P.O. Box 78, Madrid, SPAIN
      >
