[GE users] Integration of MPICH2 and SGE

reuti reuti at staff.uni-marburg.de
Fri May 28 10:55:51 BST 2010


Hi,

On 28.05.2010, at 05:20, gqc606 wrote:

> Hi reuti:
>  My operating system is Rocks 5.3, the version of MPICH2 is mpich2-1.0.8p1-0, and I can manually start up MPICH2 successfully.

This is somewhat old; the current version is MPICH2 1.2.1p1.
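
To double-check what is actually picked up on the nodes: MPICH2 normally ships an `mpich2version` command next to `mpd` (I'm assuming here that Rocks installed it into the same prefix you use below), so this should print the configured version:

$ /opt/mpich2/gnu/bin/mpich2version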

> 
>  I found the SGE "messages" logs on the frontend and on both compute nodes:
>  On the frontend(/opt/gridengine/default/spool/qmaster/messages):
> 05/27/2010 16:26:29|worker|cluster|E|tightly integrated parallel task 256.1 task 1.compute-0-1 failed - killing job
> 05/27/2010 16:27:33|worker|cluster|W|job 256.1 failed on host compute-0-1.local general in pestart because: 05/27/2010 16:26:16 [400:24438]: exit_status of pe_start = 1
> 05/27/2010 16:27:33|worker|cluster|E|queue all.q marked QERROR as result of job 256's failure at host compute-0-1.local
> 05/27/2010 16:27:42|worker|cluster|E|denied: job "256" does not exist
> 
>  On compute-0-1 node(/opt/gridengine/default/spool/compute-0-1/messages):
> 05/27/2010 16:26:17|  main|compute-0-1|E|shepherd of job 256.1 exited with exit status = 10
> 05/27/2010 16:26:17|  main|compute-0-1|W|reaping job "256" ptf complains: Job does not exist
> 05/27/2010 16:26:29|  main|compute-0-1|W|reaping job "256" ptf complains: Job does not exist
> 05/27/2010 16:26:29|  main|compute-0-1|E|can't open file active_jobs/256.1/1.compute-0-1/error: No such file or directory
> 
> I can find the directory /opt/gridengine/default/spool/compute-0-0/active_jobs; it seems that Rocks didn't create a file named "/256.1/1.compute-0-1" when I submitted the script.
> 
> 
> There is another problem: when I submitted the script, it seems that all the processes were assigned to one of the compute nodes. As above for job "256", the error messages were only found in compute-0-1's "messages" log; I can't find similar messages for job "256" in compute-0-0's "messages" log. I don't know whether it is a problem with my SGE or whether my configuration files need to be reset. Can you give me some advice? Thanks!

When the local mpd doesn't start up, there shouldn't be anything on any slave node at all. The slave daemons are only started in later iterations of the loop in start_mpich2.sh, after the port chosen by the master mpd is known.
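
Just to illustrate the logic, a rough sketch of what such a start procedure does (this is not the actual Rocks script; MPICH2_ROOT, MASTER_HOST and SLAVE_HOSTS stand for values taken from the pe_hostfile and the PE arguments):

# 1) start the mpd on the master node of the parallel job (the qrsh line in your output)
qrsh -inherit -V $MASTER_HOST $MPICH2_ROOT/bin/mpd &

# 2) poll until the local mpd answers - the "check for local mpd daemon (n of 10)" lines
for i in 1 2 3 4 5 6 7 8 9 10; do
    $MPICH2_ROOT/bin/mpdtrace -l >/dev/null 2>&1 && break
    sleep 1
done

# 3) read the port the master mpd chose; "mpdtrace -l" prints "host_port (ip)"
MASTER_PORT=`$MPICH2_ROOT/bin/mpdtrace -l | head -1 | cut -d"_" -f2 | cut -d" " -f1`

# 4) only now the slave mpds can be started, as they need this port to ring in at the master
for HOST in $SLAVE_HOSTS; do
    qrsh -inherit -V $HOST $MPICH2_ROOT/bin/mpd -h $MASTER_HOST -p $MASTER_PORT &
done

In your case step 2) never succeeds, so the loop gives up after 10 tries and the job is aborted before any slave is contacted.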

For the run quoted below, there should be something on compute-0-0 instead. I assume that the above output was from a different run, where SGE chose another machine as the master node for this parallel job.

Anyway, the problem is the startup of the local "mpd" on the master node of the parallel job. Can you please submit a parallel job, check which node was chosen as its master node, go to this node and execute:

$ ps -e f

(that's f without a leading dash); maybe we can see something there.
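
What you'd roughly hope to see there under a working Tight Integration is the shepherd of the job with the pe_start procedure (startmpich2.sh) and its `qrsh -inherit ... mpd` call beneath it, plus a second shepherd on the same machine for the started task with the mpd itself beneath it (the exact chain in between depends on the -builtin-/rsh/ssh setting; note also that mpd is a Python script, so it shows up as a python process). Roughly:

sge_execd
 \_ sge_shepherd-256
 |   \_ startmpich2.sh -catch_rsh ... /opt/mpich2/gnu
 |       \_ qrsh -inherit -V compute-0-1 /opt/mpich2/gnu/bin/mpd
 \_ sge_shepherd-256
     \_ ...
         \_ python /opt/mpich2/gnu/bin/mpd

If the second branch never appears at all, the `qrsh -inherit` call isn't reaching the local execd in the first place.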

And again:

>> To achieve a Tight Integration, the necessary setup will make a local `qrsh` call. Is "job_is_first_task FALSE" set in the PE? What type of communication did you set up in SGE's configuration: -builtin-, classic rsh or ssh?
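
You could post the output of these for a check (assuming your PE is called "mpich2"; I don't know which name Rocks uses):

$ qconf -sp mpich2
$ qconf -sconf | egrep "rsh|rlogin|qlogin"

For comparison, a tight MPICH2 integration PE usually looks something like this sketch (the start/stop script paths and the allocation_rule are placeholders; whatever Rocks installed may differ):

pe_name            mpich2
slots              999
user_lists         NONE
xuser_lists        NONE
start_proc_args    <path>/startmpich2.sh -catch_rsh $pe_hostfile /opt/mpich2/gnu
stop_proc_args     <path>/stopmpich2.sh -catch_rsh /opt/mpich2/gnu
allocation_rule    $round_robin
control_slaves     TRUE
job_is_first_task  FALSE
urgency_slots      min

The crucial entries for the Tight Integration are "control_slaves TRUE" and "job_is_first_task FALSE".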


-- Reuti


> 
> 
> 
> 
> 
> 
>> Hi,
>> 
>> On 27.05.2010, at 15:07, gqc606 wrote:
>> 
>>> Hi reuti:
>>> This was my mistake: I only modified the script (startmpich2.sh) on the frontend node and forgot to edit it on the compute nodes. Now I have modified the script on all nodes.
>>> 
>>> But when I submitted the script, it still produced the following error:
>>> -catch_rsh /opt/gridengine/default/spool/compute-0-0/active_jobs/254.1/pe_hostfile /opt/mpich2/gnu
>>> compute-0-0:3
>>> compute-0-1:3
>>> startmpich2.sh: check for local mpd daemon (1 of 10)
>>> /opt/gridengine/bin/lx26-x86/qrsh -inherit -V compute-0-0 /opt/mpich2/gnu/bin/mpd
>> 
>> So the loop is doing the right thing in the first iteration and tries to start the local daemon on node compute-0-0. The question is why it's failing. Can you start an mpd by hand on this machine? Which version of MPICH2 is installed?
>> 
>> 
>>> startmpich2.sh: check for local mpd daemon (2 of 10)
>>> startmpich2.sh: check for local mpd daemon (3 of 10)
>>> startmpich2.sh: check for local mpd daemon (4 of 10)
>>> startmpich2.sh: check for local mpd daemon (5 of 10)
>>> startmpich2.sh: check for local mpd daemon (6 of 10)
>>> startmpich2.sh: check for local mpd daemon (7 of 10)
>>> startmpich2.sh: check for local mpd daemon (8 of 10)
>>> startmpich2.sh: check for local mpd daemon (9 of 10)
>>> startmpich2.sh: check for local mpd daemon (10 of 10)
>>> startmpich2.sh: local mpd could not be started, aborting
>>> -catch_rsh /opt/mpich2/gnu
>>> mpdallexit: cannot connect to local mpd (/tmp/mpd2.console_test_sge_254.undefined); possible causes:
>>> 1. no mpd is running on this host
>>> 2. an mpd is running but was started without a "console" (-n option)
>>> In case 1, you can start an mpd on this host with:
>>>   mpd &
>>> and you will be able to run jobs just on this host.
>>> For more details on starting mpds on a set of hosts, see
>>> the MPICH2 Installation Guide.
>>> error: error: ending connection before all data received
>>> error: 
>>> error reading job context from "qlogin_starter"
>>> 
>>> And I searched for "qlogin_starter", but didn't find it at all. I don't know whether the problem is in my MPICH2 or in SGE. Can you give me some advice? Thanks!
>> 
>> To achieve a Tight Integration, the necessary setup will make a local `qrsh` call. Is "job_is_first_task FALSE" set in the PE? What type of communication did you set up in SGE's configuration: -builtin-, classic rsh or ssh?
>> 
>> -- Reuti
>> 
>> 
> 

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=259311

To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].



More information about the gridengine-users mailing list