[GE users] Integration of the MPICH2 and SGE

reuti reuti at staff.uni-marburg.de
Thu May 27 17:40:21 BST 2010


Hi,

On 27.05.2010 at 15:07, gqc606 wrote:

> Hi reuti:
>  This was my mistake: I only modified the script (startmpich2.sh) on the front node and forgot to edit it on the compute nodes. Now I have modified the script on all nodes.
> 
> But when I submit the script, it still produces the following error:
> -catch_rsh /opt/gridengine/default/spool/compute-0-0/active_jobs/254.1/pe_hostfile /opt/mpich2/gnu
> compute-0-0:3
> compute-0-1:3
> startmpich2.sh: check for local mpd daemon (1 of 10)
> /opt/gridengine/bin/lx26-x86/qrsh -inherit -V compute-0-0 /opt/mpich2/gnu/bin/mpd

So the loop does the right thing in the first iteration and tries to start the local daemon on node compute-0-0. The question is: why is it failing? Can you start an mpd by hand on this machine? Which version of MPICH2 is installed?
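
A minimal sketch of such a manual check follows. It assumes the MPICH2 binaries live in /opt/mpich2/gnu/bin, as in the output above. The helper name check_mpd_conf is made up for illustration, and a missing or too-permissive ~/.mpd.conf is only one possible reason for mpd refusing to start, not a diagnosis of this case:

```shell
# Sketch only: check the per-user mpd configuration file before starting mpd by hand.
# check_mpd_conf is a hypothetical helper, not part of MPICH2 or SGE.
check_mpd_conf() {
    conf="${1:-$HOME/.mpd.conf}"
    if [ ! -f "$conf" ]; then
        echo "missing: $conf"
        return 1
    fi
    # mpd rejects a configuration file that group or others can access.
    mode=$(stat -c '%a' "$conf")    # GNU coreutils syntax
    case "$mode" in
        *00) ;;
        *)  echo "mode $mode on $conf allows access by group or others"
            return 1 ;;
    esac
    # The file has to define the secret word shared by all mpds of the ring.
    if ! grep -Eq '^(MPD_SECRETWORD|secretword)=' "$conf"; then
        echo "no secretword entry in $conf"
        return 1
    fi
    echo "ok: $conf"
}

# Manual test on compute-0-0, run there outside of SGE:
#   check_mpd_conf
#   /opt/mpich2/gnu/bin/mpich2version   # which MPICH2 release is installed
#   /opt/mpich2/gnu/bin/mpd &           # start a local daemon
#   /opt/mpich2/gnu/bin/mpdtrace -l     # should list this host
#   /opt/mpich2/gnu/bin/mpdallexit      # shut the daemon down again
```

If the daemon starts fine by hand but not from startmpich2.sh, the problem is more likely in the `qrsh -inherit` call than in MPICH2 itself.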


> startmpich2.sh: check for local mpd daemon (2 of 10)
> startmpich2.sh: check for local mpd daemon (3 of 10)
> startmpich2.sh: check for local mpd daemon (4 of 10)
> startmpich2.sh: check for local mpd daemon (5 of 10)
> startmpich2.sh: check for local mpd daemon (6 of 10)
> startmpich2.sh: check for local mpd daemon (7 of 10)
> startmpich2.sh: check for local mpd daemon (8 of 10)
> startmpich2.sh: check for local mpd daemon (9 of 10)
> startmpich2.sh: check for local mpd daemon (10 of 10)
> startmpich2.sh: local mpd could not be started, aborting
> -catch_rsh /opt/mpich2/gnu
> mpdallexit: cannot connect to local mpd (/tmp/mpd2.console_test_sge_254.undefined); possible causes:
>  1. no mpd is running on this host
>  2. an mpd is running but was started without a "console" (-n option)
> In case 1, you can start an mpd on this host with:
>    mpd &
> and you will be able to run jobs just on this host.
> For more details on starting mpds on a set of hosts, see
> the MPICH2 Installation Guide.
> error: error: ending connection before all data received
> error: 
> error reading job context from "qlogin_starter"
> 
> And I searched for "qlogin_starter" but didn't find it at all. I don't know whether this is a problem with my MPICH2 or with SGE. Can you give me some advice? Thanks

To achieve a Tight Integration, the necessary setup will make a local `qrsh` call. Is "job_is_first_task FALSE" set in the PE? What type of communication did you set up in SGE's configuration: -builtin-, classic rsh or ssh?
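
Both settings can be read with `qconf`. The following is only a sketch: sge_setting is a hypothetical helper, and the PE name mpich2_mpd is an assumption, so substitute the PE the job actually requests:

```shell
# Sketch only: pick a single setting out of `qconf` output read from stdin.
# sge_setting is a hypothetical helper, not an SGE command.
sge_setting() {
    # Print everything after the key on the line whose first field is the key.
    awk -v key="$1" '$1 == key { $1 = ""; sub(/^ +/, ""); print }'
}

# Parallel environment (the PE name mpich2_mpd is an assumption):
#   qconf -sp mpich2_mpd | sge_setting job_is_first_task
#   qconf -sp mpich2_mpd | sge_setting control_slaves
#
# Remote startup method in the global cluster configuration:
#   qconf -sconf | sge_setting rsh_command
#   qconf -sconf | sge_setting rsh_daemon
```

control_slaves has to be TRUE for `qrsh -inherit` to be allowed inside the job. The rsh_command and rsh_daemon entries show whether the builtin method or an external rsh or ssh binary is configured.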

-- Reuti


> ------------------------------------------------------
> http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=259007
> 
> To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=259059
