[GE users] mpich2 tight integration not working

reuti reuti at staff.uni-marburg.de
Mon Mar 9 19:30:50 GMT 2009


Am 09.03.2009 um 20:13 schrieb kennethsdsc:

> On Mon, 9 Mar 2009, reuti wrote:
>
>> Date: Mon, 9 Mar 2009 20:00:56 +0100
>> From: reuti <reuti at staff.uni-marburg.de>
>> Reply-To: users <users at gridengine.sunsource.net>
>> To: users at gridengine.sunsource.net
>> Subject: Re: [GE users] mpich2 tight integration not working
>>
>> Hi,
>>
>> Am 09.03.2009 um 19:39 schrieb kennethsdsc:
>>
>>> I also am playing with tight mpich2_mpd integration with sge 62u2.
>>> I'm not sure if my problem is related to yours.  I found some
>>> mismatches in the start scripts and what SGE is setting.  I was
>>> able to
>>> get the mpihello.c to work, by modifying start and stop scripts.
>>>
>>> It looks like SGE is not setting TASK_ID in the environment,
>>> but is setting SGE_TASK_ID, so I modified startmpich2.sh:
>>>
>>> #export MPD_CON_EXT="sge_$JOB_ID.$TASK_ID"
>>> export MPD_CON_EXT="sge_$JOB_ID.$SGE_TASK_ID"
>>
>> you are right, I will update the scripts - thx. Nevertheless it
>> should only be a concern when you submit MPI array jobs. Right now
>> the wrong variable just expands to an empty string.
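The fix can be sketched in isolation. The JOB_ID/SGE_TASK_ID values below are made up for illustration, and the `${...:-1}` fallback is my own assumption rather than part of the original startmpich2.sh:

```shell
#!/bin/sh
# Hypothetical values standing in for what SGE exports at job start.
JOB_ID=2027
SGE_TASK_ID=1
# Use SGE_TASK_ID (which SGE actually sets), not the non-existent
# TASK_ID; the :-1 default is an assumed safeguard for non-array jobs.
export MPD_CON_EXT="sge_${JOB_ID}.${SGE_TASK_ID:-1}"
echo "$MPD_CON_EXT"
```

With these values the console extension comes out as `sge_2027.1`, matching the socket name mpiexec complains about below.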
>
> The problem I was seeing, with a missing TASK_ID was this error:
>
> mpiexec_ken1: cannot connect to local mpd (/tmp/mpd2.console_tester_sge_2027.1);
>   possible causes:
>    1. no mpd is running on this host
>    2. an mpd is running but was started without a "console" (-n option)
> In case 1, you can start an mpd on this host with:
>      mpd &
> and you will be able to run jobs just on this host.
> For more details on starting mpds on a set of hosts, see
> the MPICH2 Installation Guide.
>
> But SGE was creating:
>
> srwxr-xr-x  1 tester tester 0 Mar  6 14:49 /tmp/mpd2.console_tester
> srwxr-xr-x  1 tester tester 0 Mar  9 11:11 /tmp/mpd2.console_tester_sge_2027.

I assume you also already adjusted the jobscript itself to look for
$SGE_TASK_ID.
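For reference, the name mpiexec tries to open is simply the user's console socket with MPD_CON_EXT appended, as the error message and the /tmp listing show. A small sketch with assumed values (user name and IDs taken from the listing above):

```shell
#!/bin/sh
# Assumed values matching the listing in this thread.
MPD_USER=tester
MPD_CON_EXT="sge_2027.1"
# mpd creates /tmp/mpd2.console_<user>; with MPD_CON_EXT exported the
# suffix is appended, and mpiexec looks for exactly this file name.
CONSOLE="/tmp/mpd2.console_${MPD_USER}_${MPD_CON_EXT}"
echo "$CONSOLE"
```

If the script that starts mpd and the environment mpiexec runs in disagree on MPD_CON_EXT, the two names will not match and you get the "cannot connect to local mpd" error.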


>
> So, I think mpiexec is looking for the filename with the taskid,
> but startmpich2.sh was creating the filename without the taskid,
> and mpihello was having trouble finding the file.
>
> Is the array task method the usual way people submit MPI jobs?
> Or is it usually done a different way?

Well, with -t you would submit 4 jobs, each having 4 processes. For  
just one job, leave out the -t option and its parameters.
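As a sketch of the two cases (the PE name matches the configuration quoted below; the job script name is made up):

```shell
# One MPI job with 4 processes:
qsub -pe mpich2_mpd 4 myjob.sh

# Four array tasks, each an MPI job with 4 processes
# (this is where SGE_TASK_ID in MPD_CON_EXT matters):
qsub -pe mpich2_mpd 4 -t 1-4 myjob.sh
```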


>>
>> The PATH is set in the script itself. Do you set the path to your
>> MPICH2 installation as first argument to the stop_proc_args defined
>> procedure in your PE configuration?
>
> Yes,
>
> $ qconf -sp mpich2_mpd
> pe_name            mpich2_mpd
> slots              4

This will be the limit for the complete cluster, not per job. Most  
likely you will want to set it to a higher value.


> user_lists         NONE
> xuser_lists        NONE
> start_proc_args    /usr/local/apps/sge/mpich2_mpd/startmpich2.sh -catch_rsh \
>                     $pe_hostfile /usr/local/apps/sge/mpich2/install
> stop_proc_args     /usr/local/apps/sge/mpich2_mpd/stopmpich2.sh -catch_rsh \
>                     $pe_hostfile /usr/local/apps/sge/mpich2/install

There should be no hostfile argument here, as shown on the webpage;  
the template is wrong :-/

-- Reuti
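To illustrate why the extra $pe_hostfile argument matters, here is a sketch of the kind of argument handling involved. The parsing shown (consume an optional -catch_rsh, then take the next argument as MPICH2_ROOT) is an assumption for illustration; the exact logic in the real scripts may differ, and the paths are made up:

```shell
#!/bin/sh
# Simulated argument list as produced by the PE configuration above.
set -- -catch_rsh /tmp/pe_hostfile /usr/local/apps/sge/mpich2/install
# Assumed handling: consume the optional -catch_rsh flag...
if [ "$1" = "-catch_rsh" ]; then
    shift
fi
# ...then take the next argument as the MPICH2 installation root.
# With the stray hostfile in between, MPICH2_ROOT ends up pointing
# at the hostfile instead of the MPICH2 installation.
MPICH2_ROOT=$1
echo "$MPICH2_ROOT"
```

Dropping the hostfile argument from the template makes $1 the intended installation path again.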


> allocation_rule    $round_robin
> control_slaves     TRUE
> job_is_first_task  FALSE
> urgency_slots      min
> accounting_summary FALSE
> $
>
> the startmpich2.sh that I have sets:
> # get arguments
> MPICH2_ROOT=$1
>
> Maybe it's using the wrong argument for MPICH2_ROOT?
>
> Kenneth
>
>>
>> -- Reuti
>>
>>
>>> I also had to give stopmpich2.sh the full path to mpdallexit:
>>> #mpdallexit
>>> /usr/local/apps/sge/mpich2/install/bin/mpdallexit
>>>
>>> Kenneth
>>>
>>> On Thu, 4 Dec 2008, Patterson, Ron (NIH/NLM/NCBI) [C] wrote:
>>>
>>>> Date: Thu, 4 Dec 2008 14:25:47 -0500
>>>> From: "Patterson, Ron (NIH/NLM/NCBI) [C]"  
>>>> <patterso at ncbi.nlm.nih.gov>
>>>> Reply-To: users <users at gridengine.sunsource.net>
>>>> To: users at gridengine.sunsource.net
>>>> Subject: RE: [GE users] mpich2 tight integration not working
>>>>
>>>> Reuti,
>>>>
>>>>> you set "job_is_first_task  FALSE" in the PE?
>>>>
>>>> No - I had it set to TRUE. I made the change and my first test was
>>>> successful. Thank you very much for your amazingly speedy reply.
>>>>
>>>> Ron
>>>>
>>>> ------------------------------------------------------
>>>> http://gridengine.sunsource.net/ds/viewMessage.do?
>>>> dsForumId=38&dsMessageId=91204
>>>>
>>>> To unsubscribe from this discussion, e-mail: [users-
>>>> unsubscribe at gridengine.sunsource.net].
>>>>
>>>
>>
>>
>

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=125763

To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].
