[GE users] SGE:mpich2 tight integration failing to start mpds

reuti reuti at staff.uni-marburg.de
Sat Sep 26 09:28:58 BST 2009


Quoting hjmangalam <harry.mangalam at uci.edu>:

> While I thought my tight integration woes were over, there still seems to be a
> problem:
>
> Current setup: running SGE 6.2 on CentOS 5.3 nodes (all nodes and Qs mentioned
> here are on public IP #s. The previously mentioned problem of trying to run
> mpich2 jobs over a mix of private and public IP #s has been solved by
> segregating them).
>
> When I qsub the following script:
>  <http://moo.nac.uci.edu:~hjm/neuron_mpi_8.sh>
> and follow the output of the startmpich2.sh, it strangely seems to work - mpds
> are started on all required nodes (2 slots each), but the actual mpiexec call
> fails, claiming that it can't connect to the original mpd master.
>
> Any ideas?

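Do you export MPD_CON_EXT in the job script? The mpiexec error below refers to
/tmp/mpd2.console_hmangala, i.e. a console name without any extension, which
would be consistent with the variable not being set there. Please have a look
in the archive for the example script.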
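
A minimal sketch of the relevant part of such a job script, assuming the mpds were started with a console extension of the form sge_<job id>.<task id>, as in the commonly circulated tight-integration example. The fallback values, the printed path and the commented-out program name are illustrative assumptions, not taken from the script in question:

```shell
#!/bin/sh
# Sketch only. JOB_ID and SGE_TASK_ID are set by SGE inside a running job;
# the fallbacks below are hypothetical and exist only so that the snippet
# can also be run outside of a job for a quick check.
JOB_ID="${JOB_ID:-12345}"
SGE_TASK_ID="${SGE_TASK_ID:-undefined}"

# Assumption: this must be the same value the start script used when it
# launched the mpds, otherwise mpiexec looks for a different console.
export MPD_CON_EXT="sge_${JOB_ID}.${SGE_TASK_ID}"

# The console socket mpiexec is then expected to look for
# (user name plus "_" plus the extension).
console="/tmp/mpd2.console_$(id -un)_${MPD_CON_EXT}"
echo "mpiexec should look for: $console"

# The real call would follow here, e.g. (hypothetical program name):
# mpiexec -machinefile "$TMPDIR/machines" -n "$NSLOTS" ./your_mpi_program
```

If the path printed here differs from the one in the mpiexec error message, the extension in the job script and the one used by the start script most likely do not match.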

-- Reuti

>
> [[ my comments inserted into output below ]]
> ===== output file from slightly modified startmpich2.sh ======
>
> startmpich2.sh: check for local mpd daemon (2 of 10)
> bduc-amd64-13 differs from bduc-amd64-12
> bduc-amd64-14 differs from bduc-amd64-12
> bduc-amd64-10 differs from bduc-amd64-12
>
> [[finds 4 nodes that fulfill the mpich2 requirements ]]
>
> startmpich2.sh: check for mpd daemons (1 of 10)
> mpdtrace -l produces:
> ====================
> /sge62/bin/lx24-amd64/qrsh -inherit -V bduc-amd64-13 /sge62/mpich2/bin/mpd -h bduc-amd64-12 -p 47966 -n
> /sge62/bin/lx24-amd64/qrsh -inherit -V bduc-amd64-10 /sge62/mpich2/bin/mpd -h bduc-amd64-12 -p 47966 -n
> /sge62/bin/lx24-amd64/qrsh -inherit -V bduc-amd64-14 /sge62/mpich2/bin/mpd -h bduc-amd64-12 -p 47966 -n
> bduc-amd64-12.ics.uci.edu_47966 (128.195.11.16)
> ====================
> startmpich2.sh: check for mpd daemons (2 of 10)
> mpdtrace -l produces:
> ====================
> bduc-amd64-12.ics.uci.edu_47966 (128.195.11.16) [[ 1st mpd / master]]
> bduc-amd64-10.ics.uci.edu_59432 (128.195.11.14)
> bduc-amd64-14.ics.uci.edu_40199 (128.195.11.18)
> bduc-amd64-13.ics.uci.edu_43082 (128.195.11.17)
> ====================
> startmpich2.sh: got all 4 of 4 nodes
>
> [[ now it has found all the mpds required to fulfill the job ]]
>
> calling mpiexec now
> mpiexec_bduc-amd64-12.ics.uci.edu: cannot connect to local mpd (/tmp/mpd2.console_hmangala); possible causes:
>   1. no mpd is running on this host
>   2. an mpd is running but was started without a "console" (-n option)
> In case 1, you can start an mpd on this host with:
>     mpd &
> and you will be able to run jobs just on this host.
> For more details on starting mpds on a set of hosts, see
> the MPICH2 Installation Guide.
>
> [[ but mpiexec cannot connect to the original mpd that reported itself
> above ]]
>
> -catch_rsh /sge62/mpich2
>
> ========== end of output ==========
>
>
>
>
> --
> Harry Mangalam - Research Computing, NACS, Rm 225 MSTB, UC Irvine
> [ZOT 2225] / 92697  949 824-0084(o), 949 285-4487(c)
> MSTB=Bldg 415 (G-5 on <http://today.uci.edu/pdf/UCI_09_map_campus.pdf>
> ---
> It is better to be roughly right than precisely wrong.
> Keynes
>
> ------------------------------------------------------
> http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=219068
>
> To unsubscribe from this discussion, e-mail:   
> [users-unsubscribe at gridengine.sunsource.net].
>

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=219161