[GE users] mpich2 tight integration not working
reuti
reuti at staff.uni-marburg.de
Mon Mar 9 19:30:50 GMT 2009
On 09.03.2009, at 20:13, kennethsdsc wrote:
> On Mon, 9 Mar 2009, reuti wrote:
>
>> Date: Mon, 9 Mar 2009 20:00:56 +0100
>> From: reuti <reuti at staff.uni-marburg.de>
>> Reply-To: users <users at gridengine.sunsource.net>
>> To: users at gridengine.sunsource.net
>> Subject: Re: [GE users] mpich2 tight integration not working
>>
>> Hi,
>>
>> On 09.03.2009, at 19:39, kennethsdsc wrote:
>>
>>> I am also playing with tight mpich2_mpd integration with SGE 6.2u2.
>>> I'm not sure if my problem is related to yours. I found some
>>> mismatches between the start scripts and what SGE is setting. I was
>>> able to get mpihello.c to work by modifying the start and stop
>>> scripts.
>>>
>>> It looks like SGE is not setting TASK_ID in the environment,
>>> but is setting SGE_TASK_ID, so I modified startmpich2.sh:
>>>
>>> #export MPD_CON_EXT="sge_$JOB_ID.$TASK_ID"
>>> export MPD_CON_EXT="sge_$JOB_ID.$SGE_TASK_ID"
>>
>> you are right, I will update the scripts - thx. Nevertheless it
>> should only matter when you submit MPI array jobs; the wrong
>> variable will just expand to an empty string right now.
>
> The problem I was seeing with the missing TASK_ID was this error:
>
> mpiexec_ken1: cannot connect to local mpd (/tmp/mpd2.console_tester_sge_2027.1);
> possible causes:
>   1. no mpd is running on this host
>   2. an mpd is running but was started without a "console" (-n option)
> In case 1, you can start an mpd on this host with:
>     mpd &
> and you will be able to run jobs just on this host.
> For more details on starting mpds on a set of hosts, see
> the MPICH2 Installation Guide.
>
> But SGE was creating:
>
> srwxr-xr-x 1 tester tester 0 Mar 6 14:49 /tmp/mpd2.console_tester
> srwxr-xr-x 1 tester tester 0 Mar 9 11:11 /tmp/mpd2.console_tester_sge_2027.
I assume you already adjusted the jobscript to look for $SGE_TASK_ID
as well.
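By the way, SGE sets SGE_TASK_ID to the literal string "undefined"
when a job is not an array job, so a small guard in startmpich2.sh
(and likewise in the jobscript) would cover both cases - an untested
sketch:

   # use the task id only for array jobs; for plain jobs SGE sets
   # SGE_TASK_ID to the literal string "undefined"
   if [ -n "$SGE_TASK_ID" ] && [ "$SGE_TASK_ID" != "undefined" ]; then
       export MPD_CON_EXT="sge_$JOB_ID.$SGE_TASK_ID"
   else
       export MPD_CON_EXT="sge_$JOB_ID"
   fi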
>
> So, I think mpiexec is looking for the filename with the task id,
> but startmpich2.sh was creating the filename without the task id,
> so mpiexec was having trouble finding the console file.
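To verify this on a node, you could compare the socket mpd actually
created with the extension the job sets, e.g. by putting this into
the jobscript for debugging:

   # debugging aid: list the mpd console socket(s) and show the
   # extension mpiexec will use
   ls -l /tmp/mpd2.console_${USER}*
   echo "MPD_CON_EXT: $MPD_CON_EXT"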
>
> Is the array task method the usual way people submit MPI jobs?
> Or is it usually done a different way?
Well, with an array job you would submit 4 jobs, each having 4
processes. For just one job, leave the -t option and its parameters
out.
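For example (mpi_job.sh is just a placeholder for your jobscript):

   # an array job: 4 tasks, each being a 4-process MPI job
   qsub -t 1-4 -pe mpich2_mpd 4 mpi_job.sh

   # one single 4-process MPI job: no -t
   qsub -pe mpich2_mpd 4 mpi_job.sh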
>>
>> The PATH is set in the script itself. Do you set the path to your
>> MPICH2 installation as the first argument to the procedure defined
>> in stop_proc_args in your PE configuration?
>
> Yes,
>
> $ qconf -sp mpich2_mpd
> pe_name            mpich2_mpd
> slots              4
This slot count is the limit for the complete cluster, not per job.
Most likely you will want to set it to a higher value.
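E.g. via "qconf -mp mpich2_mpd" (999 is just a placeholder value;
use whatever fits your cluster):

   slots              999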
> user_lists         NONE
> xuser_lists        NONE
> start_proc_args    /usr/local/apps/sge/mpich2_mpd/startmpich2.sh -catch_rsh \
>                    $pe_hostfile /usr/local/apps/sge/mpich2/install
> stop_proc_args     /usr/local/apps/sge/mpich2_mpd/stopmpich2.sh -catch_rsh \
>                    $pe_hostfile /usr/local/apps/sge/mpich2/install
There should be no hostfile here, as shown on the webpage - the
template is wrong :-/
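I.e., the entry should read (with your paths):

   stop_proc_args     /usr/local/apps/sge/mpich2_mpd/stopmpich2.sh -catch_rsh \
                      /usr/local/apps/sge/mpich2/install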
-- Reuti
> allocation_rule    $round_robin
> control_slaves     TRUE
> job_is_first_task  FALSE
> urgency_slots      min
> accounting_summary FALSE
> $
>
> The startmpich2.sh that I have sets:
> # get arguments
> MPICH2_ROOT=$1
>
> Maybe it's using the wrong argument for MPICH2_ROOT?
>
> Kenneth
>
>>
>> -- Reuti
>>
>>
>>> I also had to give stopmpich2.sh the full path to mpdallexit:
>>> #mpdallexit
>>> /usr/local/apps/sge/mpich2/install/bin/mpdallexit
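An alternative to hard-coding the path: the stop script also gets the
MPICH2 installation path passed in via stop_proc_args (see the PE
configuration above), so its bin directory could be put on the PATH
near the top of stopmpich2.sh instead - an untested sketch:

   # MPICH2_ROOT holds the installation path handed to the script
   export PATH="$MPICH2_ROOT/bin:$PATH"
   mpdallexit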
>>>
>>> Kenneth
>>>
>>> On Thu, 4 Dec 2008, Patterson, Ron (NIH/NLM/NCBI) [C] wrote:
>>>
>>>> Date: Thu, 4 Dec 2008 14:25:47 -0500
>>>> From: "Patterson, Ron (NIH/NLM/NCBI) [C]"
>>>> <patterso at ncbi.nlm.nih.gov>
>>>> Reply-To: users <users at gridengine.sunsource.net>
>>>> To: users at gridengine.sunsource.net
>>>> Subject: RE: [GE users] mpich2 tight integration not working
>>>>
>>>> Reuti,
>>>>
>>>>> you set "job_is_first_task FALSE" in the PE?
>>>>
>>>> No - I had it set to TRUE. I made the change and my first test was
>>>> successful. Thank you very much for your amazingly speedy reply.
>>>>
>>>> Ron