[GE users] SGE:mpich2 tight integration failing to start mpds

hjmangalam harry.mangalam at uci.edu
Fri Sep 25 18:08:04 BST 2009

While I thought my tight integration woes were over, there still seems to be a 
problem.

Current setup: running SGE 6.2 on CentOS 5.3 nodes. All nodes and Qs mentioned 
here are on public IP #s. (The previously mentioned problem of trying to run 
mpich2 jobs over a mix of private and public IP #s has been solved by 
segregating them.)

When I qsub my job script and follow the output of startmpich2.sh, it 
strangely seems to work: mpds are started on all required nodes (2 slots 
each), but the actual mpiexec call fails, claiming that it cannot connect to 
the local mpd, which is the original mpd master.

Any ideas?

[[ my comments inserted into output below ]]
===== output file from slightly modified startmpich2.sh ======

startmpich2.sh: check for local mpd daemon (2 of 10)
bduc-amd64-13 differs from bduc-amd64-12
bduc-amd64-14 differs from bduc-amd64-12
bduc-amd64-10 differs from bduc-amd64-12

[[finds 4 nodes that fulfill the mpich2 requirements ]]

startmpich2.sh: check for mpd daemons (1 of 10)
mpdtrace -l produces:
/sge62/bin/lx24-amd64/qrsh -inherit -V bduc-amd64-13 /sge62/mpich2/bin/mpd -h bduc-amd64-12 -p 47966 -n
/sge62/bin/lx24-amd64/qrsh -inherit -V bduc-amd64-10 /sge62/mpich2/bin/mpd -h bduc-amd64-12 -p 47966 -n
/sge62/bin/lx24-amd64/qrsh -inherit -V bduc-amd64-14 /sge62/mpich2/bin/mpd -h bduc-amd64-12 -p 47966 -n
bduc-amd64-12.ics.uci.edu_47966 (
startmpich2.sh: check for mpd daemons (2 of 10)
mpdtrace -l produces:
bduc-amd64-12.ics.uci.edu_47966 ( [[ 1st mpd / master]]
bduc-amd64-10.ics.uci.edu_59432 (
bduc-amd64-14.ics.uci.edu_40199 (
bduc-amd64-13.ics.uci.edu_43082 (
startmpich2.sh: got all 4 of 4 nodes

[[ now it has found all the mpds required to fulfill the job ]]

calling mpiexec now
mpiexec_bduc-amd64-12.ics.uci.edu: cannot connect to local mpd
 (/tmp/mpd2.console_hmangala); possible causes:
  1. no mpd is running on this host
  2. an mpd is running but was started without a "console" (-n option)
In case 1, you can start an mpd on this host with:
    mpd &
and you will be able to run jobs just on this host.
For more details on starting mpds on a set of hosts, see
the MPICH2 Installation Guide.

[[ but mpiexec cannot connect to the original mpd that reported itself 
above ]]

-catch_rsh /sge62/mpich2

========== end of output ==========
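
One way to narrow this down might be to check, from inside the job script and 
just before the mpiexec call, whether the console socket that mpiexec is 
looking for actually exists on the master node. The sketch below is only a 
diagnostic aid, not part of startmpich2.sh. The function names 
mpd_console_path and check_mpd_console are made up for illustration. The base 
name /tmp/mpd2.console_<user> is taken from the error message above; the idea 
that mpd appends "_<extension>" when MPD_CON_EXT is set is an assumption that 
should be verified against the installed MPICH2 version.

```shell
#!/bin/sh
# Hypothetical helper: build the console socket path mpiexec would look for.
#   $1 = user name
#   $2 = optional extension (ASSUMPTION: taken from MPD_CON_EXT and
#        appended as "_<extension>"; verify for your MPICH2 version)
mpd_console_path() {
    user=$1
    ext=${2:-}
    if [ -n "$ext" ]; then
        printf '/tmp/mpd2.console_%s_%s\n' "$user" "$ext"
    else
        printf '/tmp/mpd2.console_%s\n' "$user"
    fi
}

# Hypothetical helper: report whether the expected console socket exists
# on this host. Returns 0 if it is present, 1 otherwise.
check_mpd_console() {
    path=$(mpd_console_path "$1" "${2:-}")
    if [ -S "$path" ]; then
        echo "found console socket: $path"
        return 0
    fi
    echo "no console socket at: $path"
    # Show any console sockets that do exist, to spot a name mismatch.
    ls -l /tmp/mpd2.console_* 2>/dev/null
    return 1
}
```

In the job script this could be called as 
check_mpd_console "$USER" "${MPD_CON_EXT:-}" right before mpiexec. If a socket 
shows up under a different name than the one mpiexec reports, the mpd and 
mpiexec may be running with different environments, which would fit the 
symptom of mpdtrace seeing the ring while mpiexec does not.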

