[GE users] MPICH2 tight integration

reuti reuti at staff.uni-marburg.de
Thu Aug 13 11:27:51 BST 2009


Hi,

On 13.08.2009 at 01:44, skylar2 wrote:

> Has anyone gotten the MPICH2 tight integration working? I'm following
> the stuff in the wiki:
>
> http://gridengine.sunsource.net/howto/mpich2-integration/mpich2-integration.html
>
> I installed the mpich2_mpd package in $SGE_ROOT and setup my parallel
> environment like so:
>
> pe_name           mpich2_mpd
> slots             999
> user_lists        NONE
> xuser_lists       NONE
> start_proc_args   /net/gs/vol3/software/sge/mpich2_mpd/startmpich2.sh \
>                   -catch_rsh $pe_hostfile \
>                   /net/gs/vol3/software/modules-sw/mpich2/1.0.8/Linux/RHEL5/x86_64
> stop_proc_args    /net/gs/vol3/software/sge/mpich2_mpd/stopmpich2.sh \
>                   -catch_rsh \
>                   /net/gs/vol3/software/modules-sw/mpich2/1.0.8/Linux/RHEL5/x86_64
> allocation_rule   $round_robin
> control_slaves    TRUE
> job_is_first_task FALSE
> urgency_slots     min
>
> When I try to run a job it fails during the mpd boot up:
>
> -catch_rsh
> /net/gs/vol3/software/sge/sage/spool/sage011/active_jobs/828151.1/pe_hostfile
> /net/gs/vol3/software/modules-sw/mpich2/1.0.8/Linux/RHEL5/x86_64

Was your first test already with such a high number of nodes? Is it
working with only 4/8/16... slots on 1/2/4 nodes?
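
A small-scale test could look like this (the script name, the slot
count and the binary are only placeholders; the MPD_CON_EXT naming
follows the convention from the howto, so compare it with your
startmpich2.sh):

   $ cat test_mpd.sh
   #!/bin/sh
   #$ -cwd
   # use the same console suffix as the PE startup, so that mpiexec
   # talks to this job's own mpd ring
   export MPD_CON_EXT="sge_$JOB_ID.$SGE_TASK_ID"
   mpiexec -n $NSLOTS ./mpihello

   # request only 8 slots first, i.e. one or two of your 8-core nodes
   $ qsub -pe mpich2_mpd 8 test_mpd.sh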

a) Are all nodes the same, or could it be that some of them have a firewall enabled?
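
The mpds talk to each other on unprivileged TCP ports, so a node that
only allows ssh in will silently fail to join the ring. MPICH2 ships
an mpdcheck utility that can be used to test this; roughly (host names
are only examples):

   # on one of the nodes that did not make it into the ring
   sage002$ mpdcheck -s        # prints a hostname and a port

   # on a node that did make it, try to reach that listener
   sage001$ mpdcheck -c <hostname> <port>

If the client side hangs or errors out, something between the two
nodes is blocking the connection.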

b) It may be that the startup sequence simply takes too long and you
have to increase the "sleep" time in startmpich2.sh. But then the
startup would take really long.
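
As a sketch only (not the literal script) - the "check for mpd daemons
(n of 10)" messages in your output come from a retry loop of this
kind, and it is the sleep/retry count in there that would need to be
raised:

   nhosts=$(wc -l < "$PE_HOSTFILE")
   tries=10
   n=0
   while [ $n -lt $tries ]; do
       n=$((n+1))
       echo "startmpich2.sh: check for mpd daemons ($n of $tries)"
       up=$(mpdtrace | wc -l)
       [ "$up" -ge "$nhosts" ] && exit 0
       sleep 1     # <- give the remote mpds more time to join the ring
   done
   echo "startmpich2.sh: got only $up of $nhosts nodes, aborting"
   exit 1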

c) What about using a daemonless smpd startup?
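
With the daemonless startup there is no ring to boot at all, so this
whole class of timing problems disappears. As a rough sketch only (the
exact start/stop arguments are in the daemonless section of the same
howto, the paths are placeholders mirroring your mpd setup, and it
needs an MPICH2 built with the smpd process manager, i.e. configured
with --with-pm=smpd):

   pe_name           mpich2_smpd_rsh
   slots             999
   user_lists        NONE
   xuser_lists       NONE
   start_proc_args   /net/gs/vol3/software/sge/mpich2_smpd_rsh/startmpich2.sh -catch_rsh $pe_hostfile
   stop_proc_args    /net/gs/vol3/software/sge/mpich2_smpd_rsh/stopmpich2.sh
   allocation_rule   $round_robin
   control_slaves    TRUE
   job_is_first_task FALSE
   urgency_slots     min

   # in the job script the processes are then started via the generated
   # rsh wrapper, without any daemons:
   mpiexec -rsh -nopm -n $NSLOTS -machinefile $TMPDIR/machines ./mpihello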

-- Reuti


> sage011:8
> sage019:8
> sage014:8
> sage004:8
> sage018:8
> sage006:8
> sage005:8
> sage015:8
> sage003:8
> sage012:8
> sage010:8
> sage017:8
> sage009:8
> sage013:8
> sage001:7
> sage007:8
> sage022:8
> sage016:8
> sage021:8
> sage008:7
> sage020:7
> sage002:7
> sage023:4
> startmpich2.sh: check for mpd daemons (1 of 10)
> startmpich2.sh: check for mpd daemons (2 of 10)
> startmpich2.sh: check for mpd daemons (3 of 10)
> startmpich2.sh: check for mpd daemons (4 of 10)
> startmpich2.sh: check for mpd daemons (5 of 10)
> startmpich2.sh: check for mpd daemons (6 of 10)
> startmpich2.sh: check for mpd daemons (7 of 10)
> startmpich2.sh: check for mpd daemons (8 of 10)
> startmpich2.sh: check for mpd daemons (9 of 10)
> startmpich2.sh: check for mpd daemons (10 of 10)
> startmpich2.sh: got only 8 of 23 nodes, aborting
> -catch_rsh /net/gs/vol3/software/modules-sw/mpich2/1.0.8/Linux/RHEL5/x86_64
> mpdallexit: cannot connect to local mpd
> (/tmp/mpd2.console_skylar2_sge_828151.undefined); possible causes:
>   1. no mpd is running on this host
>   2. an mpd is running but was started without a "console" (-n option)
> In case 1, you can start an mpd on this host with:
>     mpd &
> and you will be able to run jobs just on this host.
> For more details on starting mpds on a set of hosts, see
> the MPICH2 Installation Guide.
>
>
> Has anyone gotten this working?
>
> -- 
> -- Skylar Thompson (skylar2 at u.washington.edu)
> -- Genome Sciences Department, System Administrator
> -- Foege Building S048, (206)-685-7354
> -- University of Washington School of Medicine
>

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=212121

To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].


