[GE users] MPICH2 and SGE 6.2u2 tight integration fails

gustgr rondina at gmail.com
Tue Nov 17 14:52:47 GMT 2009


Hello all,

I am running SGE 6.2u2_1 (quite old, I know) and trying to set up
tight integration with MPICH2 1.2 (the latest release), so far
without success.

My first attempt was to follow the steps for the mpd installation,
which are very clearly described in [1]. Everything is installed
correctly, including the auxiliary scripts such as startmpich2.sh and
stopmpich2.sh.

The parallel environment has been created and added to the all.q
queue, which is the one I intend to use, and is as follows:

# qconf -sp mpich2_mpd
pe_name            mpich2_mpd
slots              928
user_lists         NONE
xuser_lists        NONE
start_proc_args    /usr/sge/mpich2_mpd/startmpich2.sh -catch_rsh $pe_hostfile \
                   /opt/data/lib/mpich2/mpich2-1.2/mpd
stop_proc_args     /usr/sge/mpich2_mpd/stopmpich2.sh -catch_rsh \
                   /opt/data/lib/mpich2/mpich2-1.2/mpd
allocation_rule    $round_robin
control_slaves     TRUE
job_is_first_task  FALSE
urgency_slots      min
accounting_summary FALSE

MPICH2 has been compiled from source and installed in
/opt/data/lib/mpich2/mpich2-1.2/mpd. No error messages or warnings
appeared during installation, so I assume everything is OK.

I am trying to run mpihello.c, which is also available in [1]. It has
been successfully compiled with
/opt/data/lib/mpich2/mpich2-1.2/mpd/bin/mpicc.

The jobscript I am trying to use follows:

> cat job.sh
#!/bin/bash
#$ -N MPICH2Test
#$ -cwd
#$ -pe mpich2_mpd 4
#$ -S /bin/bash

. /opt/modules/init/bash
module load mpich2/mpich2-1.2

export APP=/home/grondina/mpi-test/mpich2/mpihello
export MPICH2_ROOT=/opt/data/lib/mpich2/mpich2-1.2/mpd
export PATH=$PATH:$MPICH2_ROOT/bin
export MPD_CON_EXT="sge_$JOB_ID.$SGE_TASK_ID"
echo "Got $NSLOTS slots."

mpiexec -machinefile $TMPDIR/machines -n $NSLOTS $APP

module unload mpich2/mpich2-1.2
exit 0

Let me add that the module lines invoke a module-loading tool, which
sets up the correct environment variables (PATH, MANPATH,
LD_LIBRARY_PATH, etc.).
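The job script reads $TMPDIR/machines, which the start procedure is
expected to derive from the pe_hostfile. For reference, here is a minimal
sketch of that conversion. It is not the code from startmpich2.sh; the
function name hostfile_to_machines is hypothetical, and the sketch assumes
each pe_hostfile line has the form "host slots queue processor_range".

```shell
#!/bin/bash
# Hypothetical helper (not part of startmpich2.sh): convert an SGE
# pe_hostfile into "host:slots" lines.
# Assumption: each input line is "host slots queue processor_range".
hostfile_to_machines() {
    local hostfile=$1
    local host slots rest
    while read -r host slots rest; do
        # Skip blank lines
        [ -n "$host" ] || continue
        echo "$host:$slots"
    done < "$hostfile"
}
```

Inside a running job this could be called as
hostfile_to_machines "$PE_HOSTFILE" and the output compared with the
contents of $TMPDIR/machines.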

The problem is that when I submit this job with qsub job.sh,
startmpich2.sh fails to start the mpd daemon on the nodes, as the
.po file shows:

> cat MPICH2Test.po3143
-catch_rsh /usr/sge/default/spool/node47/active_jobs/3143.1/pe_hostfile
/opt/data/lib/mpich2/mpich2-1.2/mpd
node47:1
node57:1
node6:1
node63:1
startmpich2.sh: check for local mpd daemon (1 of 10)
/usr/sge/bin/lx24-amd64/qrsh -inherit -V node47
/opt/data/lib/mpich2/mpich2-1.2/mpd/bin/mpd
startmpich2.sh: check for local mpd daemon (2 of 10)
startmpich2.sh: check for local mpd daemon (3 of 10)
startmpich2.sh: check for local mpd daemon (4 of 10)
startmpich2.sh: check for local mpd daemon (5 of 10)
startmpich2.sh: check for local mpd daemon (6 of 10)
startmpich2.sh: check for local mpd daemon (7 of 10)
startmpich2.sh: check for local mpd daemon (8 of 10)
startmpich2.sh: check for local mpd daemon (9 of 10)
startmpich2.sh: check for local mpd daemon (10 of 10)
startmpich2.sh: local mpd could not be started, aborting
-catch_rsh /opt/data/lib/mpich2/mpich2-1.2/mpd
mpdallexit: cannot connect to local mpd
(/tmp/mpd2.console_grondina_sge_3143.undefined); possible causes:
  1. no mpd is running on this host
  2. an mpd is running but was started without a "console" (-n option)
In case 1, you can start an mpd on this host with:
    mpd &
and you will be able to run jobs just on this host.
For more details on starting mpds on a set of hosts, see
the MPICH2 Installation Guide.
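The console path in the mpdallexit message follows from MPD_CON_EXT as set
in the job script; SGE_TASK_ID is "undefined" for a non-array job, hence
the suffix. A minimal sketch for checking by hand whether that console file
exists on a node is below. The function names are hypothetical, and the
path pattern is only inferred from the message above.

```shell
#!/bin/bash
# Hypothetical helpers; the path pattern is inferred from the mpdallexit
# message (/tmp/mpd2.console_<user>_sge_<job_id>.<task_id>).
mpd_console_path() {
    local user=$1 job_id=$2 task_id=${3:-undefined}
    echo "/tmp/mpd2.console_${user}_sge_${job_id}.${task_id}"
}

# Succeeds only if the console file for the given user/job/task exists.
mpd_console_exists() {
    [ -e "$(mpd_console_path "$@")" ]
}
```

For example, mpd_console_exists grondina 3143 undefined should succeed on
the master node only while an mpd started for that job is running.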

The .pe file also reports an error:

> cat MPICH2Test.pe3143
critical error: can't resolve group

According to [2] this is a bug in 6.2u2 that has been fixed in later
versions, but I do not think it is the cause of the mpd failure on the
nodes.

Has anyone experienced something like this? I would appreciate any
hints in the right direction (e.g., how to find out why the mpd daemon
is failing). Before anyone suggests using an older version of MPICH2:
I tried that already and got the same problem. I'm kind of stuck right
now.


Thanks,
Gustavo


[1] http://gridengine.sunsource.net/howto/mpich2-integration/mpich2-integration.html