[GE users] SGE:mpich2 tight integration failing to start mpds

hjmangalam harry.mangalam at uci.edu
Wed Sep 23 22:07:42 BST 2009


Hi Reuti,

Thanks very much for coming out of vacation to help!

Your advice helped - my cluster is not ROCKS, but the nodes are CentOS, and so 
share some ROCKSian flavor.  The 'hostname' was returning the FQDN, and 
a '--short' solved that.
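
For anyone who hits this later: near the end of startmpich2.sh there is a loop 
that compares each pe_hostfile entry against `hostname`, and with the FQDN 
coming back that comparison never succeeds.  Roughly (a paraphrased sketch from 
memory, not the verbatim script):

====================
# rough sketch only, paraphrased - not the verbatim startmpich2.sh
pe_hostfile=$1                     # assumption: how the script receives the hostfile may differ
NODE=`hostname --short`            # the fix; it was a plain `hostname` before
while read host nslots rest; do
    if [ "$host" = "$NODE" ]; then
        :                          # the branch that sets up the local mpd start;
    fi                             # with the FQDN it was never reached
done < "$pe_hostfile"
====================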

However, in my case it was a little more complicated, though looking around in 
startmpich2.sh did reveal the problem.

My cluster is made up of 40 nodes on a public net and 40 nodes on a private 
net (space/political/financial constraints) and the 40 public nodes can't 
route to the private net.  For reasons I can't even remember, my Qs were set 
up to include both, and for non-mpich2 jobs they were working OK.  However, 
when trying to get the mpds running across the public/private border, there 
were routing failures, and therefore timeouts, that eventually caused the jobs 
to fail.
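
Symptom-wise it's nothing subtle: from a public-net node the private addresses 
are simply unreachable, so the mpds on the far side never join the ring.  
Something like the check below (with a made-up private address) just times out:

====================
# illustrative only - 10.1.1.21 stands in for a private-net node's address
ping -c1 -W2 10.1.1.21 || echo "no route to the private net"
====================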

The easy fix is to segregate the Qs to prevent mixing of pub/priv nodes.  The 
RIGHT way to do it would be to set up NAT so they can see each other, but at 
least now I can get the jobs to run (they're not yet big enough to be 
constrained by the number of nodes on one net).
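
In case it helps anyone else, the segregation itself is just two hostgroups and 
pointing each Q at only one of them - roughly like this (the hostgroup names 
here are made up):

====================
# sketch only - @pub64 / @priv64 are hypothetical hostgroup names
qconf -ahgrp @pub64      # hostlist: the 40 public-net nodes
qconf -ahgrp @priv64     # hostlist: the 40 private-net nodes
qconf -mq longbat64      # set "hostlist @pub64" so one PE job never spans both nets
====================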

Many Thanks!
Harry


On Wednesday 23 September 2009 01:57:21 you wrote:
> Hi,
>
> I'm on vacation right now, up to October 5th, so only briefly: is it
> ROCKS? It's a ROCKS feature to return the FQDN from a plain
> `hostname`. The problem is not the script itself, but the
> start_mpich2 binary that it calls. You will have to adjust
> startmpich2.sh (the script) so that the first comparison in the loop
> near the end of the script succeeds and then sets the proper
> variables. Just before the loop, extend the `hostname` command:
>
> NODE=`hostname --short`
>
> -- Reuti
>
> PS: It's not in the script, as --short isn't available e.g. on
> Solaris. Then others would complain.
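>
> (A portable alternative, in case --short isn't there, would be something
> like NODE=`hostname | cut -d. -f1`.)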
>
> Zitat von hjmangalam <harry.mangalam at uci.edu>:
> > I've been trying to get the mpich2 environment running with SGE6.2 as
> > per:
> > <http://gridengine.sunsource.net/howto/mpich2-integration/mpich2-integration.html>
> >
> > I've installed the latest mpich2 (1.1.1p1) from ANL's source as a module
> > (available to all nodes via an NFS mount point in the $SGE_ROOT dir) and
> > prepended the appropriate PATHs so that the environment can find the
> > executables.
> >
> > The MPI application I'm trying to get running can run in MPI mode outside
> > of SGE, using an mpd.hosts file and manually starting the remote mpds:
> >
> > ssh bduc-amd64-14 'mpd --host=node2 --port=58609  -n &'
> > ^C
> > ssh bduc-amd64-15 'mpd --host=node2 --port=58609  -n &'
> > ^C
> >  ... <etc>
> > (However, the remote command does hang at each invocation, requiring a ^C
> > to kill the ssh command, even though the mpd has started.)
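> >
> > (For completeness, mpdboot can do the same thing in one step and avoids the
> > hanging ssh commands - roughly `mpdboot -n 9 -f mpd.hosts`, with 9 being the
> > total number of mpds including the local one - but for this test I started
> > them by hand.)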
> >
> > mpdtrace shows all the expected nodes up:
> >
> > ====================
> > $ mpdtrace
> > bduc-amd64-2
> > bduc-amd64-21
> > bduc-amd64-20
> > bduc-amd64-19
> > bduc-amd64-18
> > bduc-amd64-17
> > bduc-amd64-16
> > bduc-amd64-14
> > bduc-amd64-15
> > ====================
> >
> > and the command runs to completion as expected with this command:
> >
> > ====================
> > mpiexec -np 8 \
> > nrniv -mpi -nobanner -nogui \
> > /home/hmangala/newmodel/model-2.1.hoc
> > ====================
> >
> > However, when I try to run it from within SGE, with the following qsub
> > file:
> >
> > ====================
> > #!/bin/sh
> > #
> > #$ -q longbat64
> > #$ -pe mpich2 8
> > #$ -cwd
> > #$ -j y
> > #$ -S /bin/bash
> > #$ -M harry.mangalam at uci.edu
> > #$ -m bea
> > #$ -N cells500
> > #$ -o cells500.out
> > #
> > module load neuron
> > module load mpich2
> > export NRNHOME=/apps/neuron/7.0
> > cd /home/hmangala/newmodel
> > /apps/mpich2/1.1.1p1/bin/mpiexec -np 8 \
> > nrniv -mpi -nobanner -nogui /home/hmangala/newmodel/model-2.1.hoc
> > ====================
> >
> > the job starts running normally
> >
> > ====================
> > $ qsub neuron_mpi_8.sh
> > Your job 11863 ("cells500") has been submitted
> >
> > 13:03:57 hmangala at bduc-amd64-2:~/newmodel
> > 671 $ qstat
> > job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID
> > ---------------------------------------------------------------------------------------------------------------
> >   11844 1.50713 QRLOGIN    hmangala     r     09/22/2009 10:34:49 int64@bduc-amd64-2.ics.uci.edu     1
> >   11863 0.62984 cells500   hmangala     r     09/22/2009 13:03:58 longbat64@bduc-amd64-12.ics.uc     8
> > ====================
> >
> > but the output shows:
> >
> > ====================
> > -catch_rsh /sge62/bduc_nacs/spool/bduc-amd64-12/active_jobs/11863.1/pe_hostfile /sge62/mpich2
> > bduc-amd64-12:1
> > bduc-amd64-13:1
> > bduc-amd64-14:1
> > bduc-amd64-10:1
> > bduc-amd64-11:1
> > bduc-amd64-8:1
> > bduc-amd64-7:1
> > bduc-amd64-36:1
> > usage: start_mpich2 [-n <hostname>] mpich2-mpd-path [mpd-parameters ..]
> >
> > where: 'hostname' gives the name of the target host
> >
> > [[repeated 7 more times and then]]
> >
> > startmpich2.sh: check for mpd daemons (1 of 10)
> > startmpich2.sh: got all 8 of 8 nodes
> > mpiexec_bduc-amd64-12.ics.uci.edu: cannot connect to local mpd
> > (/tmp/mpd2.console_hmangala); possible causes:
> >   1. no mpd is running on this host
> >   2. an mpd is running but was started without a "console" (-n option)
> > In case 1, you can start an mpd on this host with:
> >     mpd &
> > and you will be able to run jobs just on this host.
> > For more details on starting mpds on a set of hosts, see
> > the MPICH2 Installation Guide.
> > -catch_rsh /sge62/mpich2
> > mpdallexit: cannot connect to local mpd
> > (/tmp/mpd2.console_hmangala_sge_11863.undefined); possible causes:
> >   1. no mpd is running on this host
> >   2. an mpd is running but was started without a "console" (-n option)
> > In case 1, you can start an mpd on this host with:
> >     mpd &
> > and you will be able to run jobs just on this host.
> > For more details on starting mpds on a set of hosts, see
> > the MPICH2 Installation Guide.
> > ====================
> >
> > I would think that this means that the 'start_mpich2' command is not
> > being called correctly, but the mpich2 environment is defined as per
> > Reuti's example:
> >
> > ====================
> > $ qconf -sp mpich2
> > pe_name            mpich2
> > slots              32
> > user_lists         NONE
> > xuser_lists        NONE
> > start_proc_args    /sge62/mpich2_mpd/startmpich2.sh -catch_rsh $pe_hostfile \
> >                    /sge62/mpich2
> > stop_proc_args     /sge62/mpich2_mpd/stopmpich2.sh -catch_rsh /sge62/mpich2
> > allocation_rule    $round_robin
> > control_slaves     TRUE
> > job_is_first_task  FALSE
> > urgency_slots      min
> > accounting_summary FALSE
> > ====================
> >
> > and the 'startmpich2.sh' file is in place and executable (mode rwxr-xr-x):
> >
> > ====================
> > $ ls -l /sge62/mpich2_mpd/startmpich2.sh
> > -rwxr-xr-x 1 root root 5922 Mar 10  2009 /sge62/mpich2_mpd/startmpich2.sh*
> > ====================
> >
> > Answers or debugging suggestions would be gratefully accepted.
> >
> >
> > --
> > Harry Mangalam - Research Computing, NACS, Rm 225 MSTB, UC Irvine
> > [ZOT 2225] / 92697  949 824-0084(o), 949 285-4487(c)
> > MSTB=Bldg 415 (G-5 on <http://today.uci.edu/pdf/UCI_09_map_campus.pdf>)
> > ---
> > It is better to be roughly right than precisely wrong.
> > Keynes
> >



-- 
Harry Mangalam - Research Computing, NACS, Rm 225 MSTB, UC Irvine
[ZOT 2225] / 92697  949 824-0084(o), 949 285-4487(c)
MSTB=Bldg 415 (G-5 on <http://today.uci.edu/pdf/UCI_09_map_campus.pdf>)
---
It is better to be roughly right than precisely wrong.
Keynes
