[GE users] SGE:mpich2 tight integration failing to start mpds

hjmangalam harry.mangalam at uci.edu
Tue Sep 22 21:24:05 BST 2009


I've been trying to get the mpich2 environment running with SGE6.2 as per:
<http://gridengine.sunsource.net/howto/mpich2-integration/mpich2-integration.html>

I've installed the latest mpich2 (1.1.1p1) from ANL's source as a module, 
available to all nodes via an NFS mount point in the $SGE_ROOT dir, and 
prepended the appropriate PATHs so that the environment can find the executables.

The MPI application I'm trying to get running can run in MPI mode outside of 
SGE, using an mpd.hosts file and manual starting of remote mpd's:

ssh bduc-amd64-14 'mpd --host=node2 --port=58609  -n &'
^C
ssh bduc-amd64-15 'mpd --host=node2 --port=58609  -n &'
^C
 ... <etc>
(However, the remote command hangs at each invocation, requiring a ^C to 
kill the ssh command, even though the mpd has started.)
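
The hang is most likely ssh waiting on the remote mpd's still-open standard
streams rather than a problem with mpd itself. A minimal sketch, assuming a
POSIX shell on the remote side, that detaches those streams so ssh can return
at once. The helper name mpd_ssh_cmd is made up for illustration, and it only
prints the command so it can be inspected before being run:

```shell
#!/bin/sh
# Hypothetical helper: print an ssh command that starts a remote mpd
# with stdin/stdout/stderr detached, so ssh does not wait on them.
# Arguments: remote node, host of the first mpd, port of the first mpd.
mpd_ssh_cmd() {
    node=$1; ring_host=$2; ring_port=$3
    printf "ssh -n %s 'mpd --host=%s --port=%s -n </dev/null >/dev/null 2>&1 &'\n" \
        "$node" "$ring_host" "$ring_port"
}

# Same host and port values as in the manual commands above.
mpd_ssh_cmd bduc-amd64-14 node2 58609
```

Piping the printed line to sh (or pasting it) runs it; the host and port
are simply the values from the manual invocation.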

mpdtrace shows all the expected nodes up:

====================
$ mpdtrace
bduc-amd64-2
bduc-amd64-21
bduc-amd64-20
bduc-amd64-19
bduc-amd64-18
bduc-amd64-17
bduc-amd64-16
bduc-amd64-14
bduc-amd64-15
====================

and the command runs to completion as expected with this command:

====================
mpiexec -np 8 \
nrniv -mpi -nobanner -nogui \
/home/hmangala/newmodel/model-2.1.hoc
====================

However, when I try to run it from within SGE, with the following qsub file:

====================
#!/bin/sh
#
#$ -q longbat64
#$ -pe mpich2 8
#$ -cwd
#$ -j y
#$ -S /bin/bash
#$ -M harry.mangalam at uci.edu
#$ -m bea
#$ -N cells500
#$ -o cells500.out
#
module load neuron
module load mpich2
export NRNHOME=/apps/neuron/7.0
cd /home/hmangala/newmodel
/apps/mpich2/1.1.1p1/bin/mpiexec -np 8 \
nrniv -mpi -nobanner -nogui /home/hmangala/newmodel/model-2.1.hoc
====================

the job starts running normally

====================
$ qsub neuron_mpi_8.sh
Your job 11863 ("cells500") has been submitted

13:03:57 hmangala at bduc-amd64-2:~/newmodel
671 $ qstat
job-ID  prior   name       user         state submit/start at     queue                                 slots ja-task-ID
-----------------------------------------------------------------------------------------------------------------
  11844 1.50713 QRLOGIN    hmangala     r     09/22/2009 10:34:49 int64 at bduc-amd64-2.ics.uci.edu     1
  11863 0.62984 cells500   hmangala     r     09/22/2009 13:03:58 longbat64 at bduc-amd64-12.ics.uc     8
====================

but the output shows:

====================
-catch_rsh /sge62/bduc_nacs/spool/bduc-amd64-12/active_jobs/11863.1/pe_hostfile /sge62/mpich2
bduc-amd64-12:1
bduc-amd64-13:1
bduc-amd64-14:1
bduc-amd64-10:1
bduc-amd64-11:1
bduc-amd64-8:1
bduc-amd64-7:1
bduc-amd64-36:1
usage: start_mpich2 [-n <hostname>] mpich2-mpd-path [mpd-parameters ..]

where: 'hostname' gives the name of the target host

[[repeated 7 more times and then]]

startmpich2.sh: check for mpd daemons (1 of 10)
startmpich2.sh: got all 8 of 8 nodes
mpiexec_bduc-amd64-12.ics.uci.edu: cannot connect to local mpd (/tmp/mpd2.console_hmangala); possible causes:
  1. no mpd is running on this host
  2. an mpd is running but was started without a "console" (-n option)
In case 1, you can start an mpd on this host with:
    mpd &
and you will be able to run jobs just on this host.
For more details on starting mpds on a set of hosts, see
the MPICH2 Installation Guide.
-catch_rsh /sge62/mpich2
mpdallexit: cannot connect to local mpd (/tmp/mpd2.console_hmangala_sge_11863.undefined); possible causes:
  1. no mpd is running on this host
  2. an mpd is running but was started without a "console" (-n option)
In case 1, you can start an mpd on this host with:
    mpd &
and you will be able to run jobs just on this host.
For more details on starting mpds on a set of hosts, see
the MPICH2 Installation Guide.
====================
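
One detail in the output above may be worth checking: the two error messages
name different console sockets. mpiexec, called from the job script, looks for
/tmp/mpd2.console_hmangala, while mpdallexit, called from the stop script,
looks for /tmp/mpd2.console_hmangala_sge_11863.undefined. That would be
consistent with the PE scripts setting the MPD_CON_EXT variable and the job
script not setting it, though this is an inference from the output and not
something confirmed here. A small sketch of how the socket name appears to be
built; the function name is hypothetical and the naming rule is inferred only
from the two paths above:

```shell
#!/bin/sh
# Hypothetical helper: reconstruct the mpd console socket path.
# The rule (/tmp/mpd2.console_<user>[_<extension>]) is inferred from the
# two error messages in the job output, not taken from the mpd source.
mpd_console_path() {
    user=$1; ext=$2
    if [ -n "$ext" ]; then
        echo "/tmp/mpd2.console_${user}_${ext}"
    else
        echo "/tmp/mpd2.console_${user}"
    fi
}

# What mpiexec tried from the job script (no extension set):
mpd_console_path hmangala ""
# What mpdallexit tried from the stop script (job 11863, no array task):
mpd_console_path hmangala "sge_11863.undefined"
```

If that reading is right, exporting the same extension in the job script
before calling mpiexec (for example MPD_CON_EXT="sge_$JOB_ID.$SGE_TASK_ID")
would make both sides look for the same socket.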

I would think that this means that the 'start_mpich2' command is not being 
called correctly, but the mpich2 environment is defined as per Reuti's 
example:

====================
$ qconf -sp mpich2
pe_name            mpich2
slots              32
user_lists         NONE
xuser_lists        NONE
start_proc_args    /sge62/mpich2_mpd/startmpich2.sh -catch_rsh $pe_hostfile \
                   /sge62/mpich2
stop_proc_args     /sge62/mpich2_mpd/stopmpich2.sh -catch_rsh /sge62/mpich2
allocation_rule    $round_robin
control_slaves     TRUE
job_is_first_task  FALSE
urgency_slots      min
accounting_summary FALSE
====================

and the 'startmpich2.sh' file is in place and chmod'ed rx for all users:

====================
$ ls -l /sge62/mpich2_mpd/startmpich2.sh
-rwxr-xr-x 1 root root 5922 Mar 10  2009 /sge62/mpich2_mpd/startmpich2.sh*
====================
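
For reference, the $pe_hostfile named in start_proc_args is the per-job list of
granted hosts, and the host:1 lines in the job output look like a machine file
derived from it. A minimal sketch of that conversion, assuming the usual
four-column pe_hostfile layout (host, slots, queue, processor range) and
assuming the start script shortens host names to their first component. The
function name and the sample data are made up for illustration:

```shell
#!/bin/sh
# Sketch: turn pe_hostfile lines ("host slots queue processors") into
# "shorthost:slots" lines like those printed in the job output.
pe_hostfile_to_machines() {
    awk '{ split($1, h, "."); print h[1] ":" $2 }' "$1"
}

# Example with made-up sample data in a temporary file.
tmp=$(mktemp)
printf '%s\n' \
    'bduc-amd64-12.ics.uci.edu 1 longbat64@bduc-amd64-12.ics.uci.edu UNDEFINED' \
    'bduc-amd64-13.ics.uci.edu 1 longbat64@bduc-amd64-13.ics.uci.edu UNDEFINED' > "$tmp"
pe_hostfile_to_machines "$tmp"
rm -f "$tmp"
```

Running the same function against the real pe_hostfile from inside a job
would show whether the host list itself is what the start script expects.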

Answers or debugging suggestions would be gratefully accepted.


-- 
Harry Mangalam - Research Computing, NACS, Rm 225 MSTB, UC Irvine
[ZOT 2225] / 92697  949 824-0084(o), 949 285-4487(c)
MSTB=Bldg 415 (G-5 on <http://today.uci.edu/pdf/UCI_09_map_campus.pdf>)
---
It is better to be roughly right than precisely wrong.
Keynes
