[GE users] MPICH2 and SGE 6.2u2 tight integration fails

reuti reuti at staff.uni-marburg.de
Tue Nov 17 15:37:33 GMT 2009


Hi,

Is `hostname` returning the FQDN on your machine? If so, please change  
the line:

NODE=`hostname`

to

NODE=`hostname --short`

in "startmpich2.sh". Any change?

-- Reuti


On 17.11.2009, at 15:52, gustgr wrote:

> Hello all,
>
> I am running SGE 6.2u2_1 (quite old, I know) and trying to tightly
> integrate MPICH2 1.2 (latest release) with it, but so far
> unsuccessfully.
>
> My first attempt was to follow the steps for the mpd installation as
> described in [1]. Everything installed correctly, including the
> auxiliary scripts startmpich2.sh and stopmpich2.sh.
>
> The parallel environment has been created and added to the all.q
> queue, which is the one I intend to use, and is as follows:
>
> # qconf -sp mpich2_mpd
> pe_name            mpich2_mpd
> slots              928
> user_lists         NONE
> xuser_lists        NONE
> start_proc_args    /usr/sge/mpich2_mpd/startmpich2.sh -catch_rsh $pe_hostfile \
>                    /opt/data/lib/mpich2/mpich2-1.2/mpd
> stop_proc_args     /usr/sge/mpich2_mpd/stopmpich2.sh -catch_rsh \
>                    /opt/data/lib/mpich2/mpich2-1.2/mpd
> allocation_rule    $round_robin
> control_slaves     TRUE
> job_is_first_task  FALSE
> urgency_slots      min
> accounting_summary FALSE
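[As an aside: a short sketch of how to double-check that the PE above is actually attached to the queue, to be run on a host with the SGE admin tools; names taken from this post. Not runnable without an SGE installation.]

```shell
# Confirm all.q lists the PE (pe_list should include mpich2_mpd).
qconf -sq all.q | grep pe_list
# Re-check the PE definition itself.
qconf -sp mpich2_mpd
```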
>
> MPICH2 has been compiled from source and installed on
> /opt/data/lib/mpich2/mpich2-1.2/mpd. No error messages or warnings
> appeared upon installation, so I assume everything is OK.
>
> I am trying to run the mpihello.c, also available on [1]. It has been
> successfully compiled with
> /opt/data/lib/mpich2/mpich2-1.2/mpd/bin/mpicc.
>
> The jobscript I am trying to use follows:
>
>> cat job.sh
> #!/bin/bash
> #$ -N MPICH2Test
> #$ -cwd
> #$ -pe mpich2_mpd 4
> #$ -S /bin/bash
>
> . /opt/modules/init/bash
> module load mpich2/mpich2-1.2
>
> export APP=/home/grondina/mpi-test/mpich2/mpihello
> export MPICH2_ROOT=/opt/data/lib/mpich2/mpich2-1.2/mpd
> export PATH=$PATH:$MPICH2_ROOT/bin
> export MPD_CON_EXT="sge_$JOB_ID.$SGE_TASK_ID"
> echo "Got $NSLOTS slots."
>
> mpiexec -machinefile $TMPDIR/machines -n $NSLOTS $APP
>
> module unload mpich2/mpich2-1.2
> exit 0
>
> Let me add that the module line refers to module-loading software
> which sets up the correct environment variables (PATH, MANPATH,
> LD_LIBRARY_PATH, etc.).
>
> The problem is that when I submit this job with qsub job.sh,
> startmpich2.sh fails to start the mpd daemon on the nodes, as can be
> seen from the .po file:
>
>> cat MPICH2Test.po3143
> -catch_rsh /usr/sge/default/spool/node47/active_jobs/3143.1/pe_hostfile
> /opt/data/lib/mpich2/mpich2-1.2/mpd
> node47:1
> node57:1
> node6:1
> node63:1
> startmpich2.sh: check for local mpd daemon (1 of 10)
> /usr/sge/bin/lx24-amd64/qrsh -inherit -V node47
> /opt/data/lib/mpich2/mpich2-1.2/mpd/bin/mpd
> startmpich2.sh: check for local mpd daemon (2 of 10)
> startmpich2.sh: check for local mpd daemon (3 of 10)
> startmpich2.sh: check for local mpd daemon (4 of 10)
> startmpich2.sh: check for local mpd daemon (5 of 10)
> startmpich2.sh: check for local mpd daemon (6 of 10)
> startmpich2.sh: check for local mpd daemon (7 of 10)
> startmpich2.sh: check for local mpd daemon (8 of 10)
> startmpich2.sh: check for local mpd daemon (9 of 10)
> startmpich2.sh: check for local mpd daemon (10 of 10)
> startmpich2.sh: local mpd could not be started, aborting
> -catch_rsh /opt/data/lib/mpich2/mpich2-1.2/mpd
> mpdallexit: cannot connect to local mpd
> (/tmp/mpd2.console_grondina_sge_3143.undefined); possible causes:
>   1. no mpd is running on this host
>   2. an mpd is running but was started without a "console" (-n option)
> In case 1, you can start an mpd on this host with:
>     mpd &
> and you will be able to run jobs just on this host.
> For more details on starting mpds on a set of hosts, see
> the MPICH2 Installation Guide.
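[One way to separate an MPICH2 problem from a tight-integration problem is to start an mpd by hand on one of the failing nodes, outside SGE. A sketch, with paths taken from this post; it requires an interactive login on the node and MPICH2 installed there.]

```shell
# Use the same MPICH2 installation as the job.
export PATH=/opt/data/lib/mpich2/mpich2-1.2/mpd/bin:$PATH
# Note: mpd refuses to start without a ~/.mpd.conf (mode 600)
# containing a secretword= line, per the MPICH2 Installer's Guide.
mpd --daemon     # start a single mpd on this host
mpdtrace -l      # list the ring; should show this host
mpdallexit       # shut it down again
```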
>
> The .pe file also reports an error:
>
>> cat MPICH2Test.pe3143
> critical error: can't resolve group
>
> According to [2] this is a 6.2u2-related bug that has been fixed in
> later versions, but I don't think it is the culprit for the mpd
> failure on the nodes.
>
> Has anyone ever experienced something like this? I would appreciate
> some hints in the right direction (e.g., how to find out why the mpd
> daemon is failing). Before anyone suggests using an older version of
> MPICH2: I tried that already and got the same problem. I'm kinda
> stuck right now.
>
>
> Thanks,
> Gustavo
>
>
> [1] http://gridengine.sunsource.net/howto/mpich2-integration/mpich2-integration.html
>
> ------------------------------------------------------
> http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=227455
>
> To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].
>

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=227465

To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].



More information about the gridengine-users mailing list