[GE users] SGE:mpich2 tight integration failing to start mpds

hjmangalam harry.mangalam at uci.edu
Fri Sep 25 18:08:04 BST 2009


While I thought my tight integration woes were over, there still seems to be a 
problem.

Current setup: SGE 6.2 on CentOS 5.3 nodes. All nodes and queues mentioned 
here are on public IP addresses. (The previously mentioned problem of running 
mpich2 jobs over a mix of private and public IP addresses has been solved by 
segregating them.)

When I qsub the following script:
 <http://moo.nac.uci.edu:~hjm/neuron_mpi_8.sh>
and follow the output of startmpich2.sh, the startup strangely seems to work: 
mpds are started on all required nodes (2 slots each). The actual mpiexec call 
then fails, claiming that it can't connect to the local mpd (the original mpd 
master).

Any ideas?

[[ my comments inserted into output below ]]
===== output file from slightly modified startmpich2.sh ======

startmpich2.sh: check for local mpd daemon (2 of 10)
bduc-amd64-13 differs from bduc-amd64-12
bduc-amd64-14 differs from bduc-amd64-12
bduc-amd64-10 differs from bduc-amd64-12

[[ finds 4 nodes that fulfill the mpich2 requirements ]]

startmpich2.sh: check for mpd daemons (1 of 10)
mpdtrace -l produces:
====================
/sge62/bin/lx24-amd64/qrsh -inherit -V bduc-amd64-13 /sge62/mpich2/bin/mpd -h bduc-amd64-12 -p 47966 -n
/sge62/bin/lx24-amd64/qrsh -inherit -V bduc-amd64-10 /sge62/mpich2/bin/mpd -h bduc-amd64-12 -p 47966 -n
/sge62/bin/lx24-amd64/qrsh -inherit -V bduc-amd64-14 /sge62/mpich2/bin/mpd -h bduc-amd64-12 -p 47966 -n
bduc-amd64-12.ics.uci.edu_47966 (128.195.11.16)
====================
startmpich2.sh: check for mpd daemons (2 of 10)
mpdtrace -l produces:
====================
bduc-amd64-12.ics.uci.edu_47966 (128.195.11.16) [[ 1st mpd / master]]
bduc-amd64-10.ics.uci.edu_59432 (128.195.11.14)
bduc-amd64-14.ics.uci.edu_40199 (128.195.11.18)
bduc-amd64-13.ics.uci.edu_43082 (128.195.11.17)
====================
startmpich2.sh: got all 4 of 4 nodes

[[ now it has found all the mpds required to fulfill the job ]]
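
[[ For anyone wanting to repeat the "got all 4 of 4 nodes" check by hand, here 
is a rough sketch. It is not the actual logic in startmpich2.sh; it only 
assumes that each running mpd shows up in the mpdtrace -l output as a line of 
the form host_port (ip), as in the listing above. The function name count_mpds 
is made up for illustration. ]]

```shell
#!/bin/sh
# count_mpds: read "mpdtrace -l" style output on stdin and print how many
# lines look like "host_port (ip)". Other lines (for example the echoed
# qrsh commands) are ignored. Hypothetical helper, not part of SGE or MPICH2.
count_mpds() {
    # grep -c prints 0 and exits non-zero when nothing matches,
    # so "|| true" keeps the function's exit status at 0.
    grep -c -E '^[^ ]+_[0-9]+ \([0-9.]+\)' || true
}
```

[[ Usage inside a job would be something like: mpdtrace -l | count_mpds, and 
then comparing the number against the node count SGE granted. ]]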

calling mpiexec now
mpiexec_bduc-amd64-12.ics.uci.edu: cannot connect to local mpd
 (/tmp/mpd2.console_hmangala); possible causes:
  1. no mpd is running on this host
  2. an mpd is running but was started without a "console" (-n option)
In case 1, you can start an mpd on this host with:
    mpd &
and you will be able to run jobs just on this host.
For more details on starting mpds on a set of hosts, see
the MPICH2 Installation Guide.

[[ but mpiexec cannot connect to the original mpd that reported itself 
above ]]

-catch_rsh /sge62/mpich2

========== end of output ==========
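
One thing that may be worth checking is whether the console socket that 
mpiexec looks for actually exists on the master node. The sketch below is only 
a diagnostic, not a fix. It assumes the socket is named 
/tmp/mpd2.console_<user>, with _<ext> appended when the MPD_CON_EXT environment 
variable was set for the mpd. That naming is my understanding of mpd's 
behaviour and should be verified against the installed MPICH2. The function 
names mpd_console_path and check_mpd_console are made up for illustration.

```shell
#!/bin/sh
# mpd_console_path USER EXT
# Print the console socket path mpiexec is assumed to look for.
# Assumption: /tmp/mpd2.console_<user>, plus "_<ext>" if EXT is non-empty.
mpd_console_path() {
    user=$1
    ext=$2
    if [ -n "$ext" ]; then
        printf '/tmp/mpd2.console_%s_%s\n' "$user" "$ext"
    else
        printf '/tmp/mpd2.console_%s\n' "$user"
    fi
}

# check_mpd_console USER EXT
# Report whether the expected socket exists; if it does not, list whatever
# mpd console sockets are present in /tmp so the names can be compared.
check_mpd_console() {
    path=$(mpd_console_path "$1" "$2")
    if [ -S "$path" ]; then
        echo "found: $path"
    else
        echo "missing: $path"
        ls -l /tmp/mpd2.console_* 2>/dev/null || true
    fi
    return 0
}
```

Calling check_mpd_console "$USER" "$MPD_CON_EXT" in the job script just before 
mpiexec would show whether the mpd on the master created its console under a 
different name than the one mpiexec reports in the error above.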




-- 
Harry Mangalam - Research Computing, NACS, Rm 225 MSTB, UC Irvine
[ZOT 2225] / 92697  949 824-0084(o), 949 285-4487(c)
MSTB=Bldg 415 (G-5 on <http://today.uci.edu/pdf/UCI_09_map_campus.pdf>)
---
It is better to be roughly right than precisely wrong.
Keynes

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=219068
