[GE users] MPICH2 tight integration

skylar2 skylar2 at u.washington.edu
Thu Aug 13 00:44:46 BST 2009


Has anyone gotten the MPICH2 tight integration working? I'm following
the howto on the wiki:

http://gridengine.sunsource.net/howto/mpich2-integration/mpich2-integration.html

I installed the mpich2_mpd package in $SGE_ROOT and set up my parallel
environment like so:

pe_name           mpich2_mpd
slots             999
user_lists        NONE
xuser_lists       NONE
start_proc_args   /net/gs/vol3/software/sge/mpich2_mpd/startmpich2.sh \
                  -catch_rsh $pe_hostfile \
                  /net/gs/vol3/software/modules-sw/mpich2/1.0.8/Linux/RHEL5/x86_64
stop_proc_args    /net/gs/vol3/software/sge/mpich2_mpd/stopmpich2.sh \
                  -catch_rsh \
                  /net/gs/vol3/software/modules-sw/mpich2/1.0.8/Linux/RHEL5/x86_64
allocation_rule   $round_robin
control_slaves    TRUE
job_is_first_task FALSE
urgency_slots     min

When I try to run a job, it fails during mpd startup:

-catch_rsh /net/gs/vol3/software/sge/sage/spool/sage011/active_jobs/828151.1/pe_hostfile /net/gs/vol3/software/modules-sw/mpich2/1.0.8/Linux/RHEL5/x86_64
sage011:8
sage019:8
sage014:8
sage004:8
sage018:8
sage006:8
sage005:8
sage015:8
sage003:8
sage012:8
sage010:8
sage017:8
sage009:8
sage013:8
sage001:7
sage007:8
sage022:8
sage016:8
sage021:8
sage008:7
sage020:7
sage002:7
sage023:4
startmpich2.sh: check for mpd daemons (1 of 10)
startmpich2.sh: check for mpd daemons (2 of 10)
startmpich2.sh: check for mpd daemons (3 of 10)
startmpich2.sh: check for mpd daemons (4 of 10)
startmpich2.sh: check for mpd daemons (5 of 10)
startmpich2.sh: check for mpd daemons (6 of 10)
startmpich2.sh: check for mpd daemons (7 of 10)
startmpich2.sh: check for mpd daemons (8 of 10)
startmpich2.sh: check for mpd daemons (9 of 10)
startmpich2.sh: check for mpd daemons (10 of 10)
startmpich2.sh: got only 8 of 23 nodes, aborting
-catch_rsh /net/gs/vol3/software/modules-sw/mpich2/1.0.8/Linux/RHEL5/x86_64
mpdallexit: cannot connect to local mpd (/tmp/mpd2.console_skylar2_sge_828151.undefined); possible causes:
  1. no mpd is running on this host
  2. an mpd is running but was started without a "console" (-n option)
In case 1, you can start an mpd on this host with:
    mpd &
and you will be able to run jobs just on this host.
For more details on starting mpds on a set of hosts, see
the MPICH2 Installation Guide.
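
For anyone comparing the host list against the "got only 8 of 23 nodes"
line, here is a minimal Python sketch for tallying the hosts and slots in
a saved copy of the output. It is not part of the job output:
tally_hosts is a hypothetical helper, and the "host:slots" line shape is
assumed from the listing above rather than from any documented format.

```python
import re


def tally_hosts(log_text):
    """Return (host_count, slot_total) for lines shaped like 'sage011:8'.

    Assumption: each granted host appears on its own line as host:slots,
    as in the listing printed before the mpd checks. Other lines are ignored.
    """
    pattern = re.compile(r"^(\S+):(\d+)$")
    slots = {}
    for line in log_text.splitlines():
        match = pattern.match(line.strip())
        if match:
            slots[match.group(1)] = int(match.group(2))
    return len(slots), sum(slots.values())


# A shortened sample; paste the full job output here instead.
sample = """sage011:8
sage019:8
sage001:7
startmpich2.sh: check for mpd daemons (1 of 10)
"""

print(tally_hosts(sample))
```

If the host count from the full output matches the 23 in the abort
message, the hostfile itself was read correctly and the problem is in
starting the mpds on the remote nodes.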


Has anyone gotten this working?

-- 
-- Skylar Thompson (skylar2 at u.washington.edu)
-- Genome Sciences Department, System Administrator
-- Foege Building S048, (206)-685-7354
-- University of Washington School of Medicine
