[GE users] mpich2 tight integration not working

kennethsdsc kenneth at sdsc.edu
Mon Mar 9 19:13:55 GMT 2009


On Mon, 9 Mar 2009, reuti wrote:

> Date: Mon, 9 Mar 2009 20:00:56 +0100
> From: reuti <reuti at staff.uni-marburg.de>
> Reply-To: users <users at gridengine.sunsource.net>
> To: users at gridengine.sunsource.net
> Subject: Re: [GE users] mpich2 tight integration not working
> 
> Hi,
>
> Am 09.03.2009 um 19:39 schrieb kennethsdsc:
>
>> I am also trying out tight mpich2_mpd integration with SGE 6.2u2.
>> I'm not sure if my problem is related to yours.  I found some
>> mismatches between the start scripts and what SGE is setting.  I was
>> able to get mpihello.c to work by modifying the start and stop scripts.
>>
>> It looks like SGE is not setting TASK_ID in the environment,
>> but is setting SGE_TASK_ID, so I modified startmpich2.sh:
>>
>> #export MPD_CON_EXT="sge_$JOB_ID.$TASK_ID"
>> export MPD_CON_EXT="sge_$JOB_ID.$SGE_TASK_ID"
>
> You are right, I will update the scripts - thanks. Nevertheless, it
> should only matter when you submit MPI array tasks. The wrong
> variable will just be expanded as empty right now.

The problem I was seeing with the missing TASK_ID was this error:

mpiexec_ken1: cannot connect to local mpd (/tmp/mpd2.console_tester_sge_2027.1);
  possible causes:
   1. no mpd is running on this host
   2. an mpd is running but was started without a "console" (-n option)
In case 1, you can start an mpd on this host with:
     mpd &
and you will be able to run jobs just on this host.
For more details on starting mpds on a set of hosts, see
the MPICH2 Installation Guide.

But SGE was creating:

srwxr-xr-x  1 tester tester 0 Mar  6 14:49 /tmp/mpd2.console_tester
srwxr-xr-x  1 tester tester 0 Mar  9 11:11 /tmp/mpd2.console_tester_sge_2027.

So, I think mpiexec is looking for the filename with the taskid,
but startmpich2.sh was creating the filename without the taskid,
and mpihello was having trouble finding the file.
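
A minimal sketch of that mismatch, assuming the console socket name is simply
/tmp/mpd2.console_<user>_<MPD_CON_EXT> as the listing above suggests. The
helper mpd_con_ext is hypothetical and not part of startmpich2.sh; the job id,
task id and user name are taken from the output above:

```shell
#!/bin/sh
# Hypothetical helper: build the extension the same way the export line does.
mpd_con_ext() {
    # $1 = job id, $2 = task id (may be empty)
    printf 'sge_%s.%s\n' "$1" "$2"
}

user=tester      # user name as seen in the /tmp listing
JOB_ID=2027      # job id as seen in the error message
unset TASK_ID    # SGE does not set this variable
SGE_TASK_ID=1    # SGE sets this one

# Original line: $TASK_ID expands to nothing, leaving a trailing dot.
old_ext=$(mpd_con_ext "$JOB_ID" "${TASK_ID:-}")
# Modified line: $SGE_TASK_ID yields the name mpiexec was looking for.
new_ext=$(mpd_con_ext "$JOB_ID" "${SGE_TASK_ID:-}")

echo "/tmp/mpd2.console_${user}_${old_ext}"
echo "/tmp/mpd2.console_${user}_${new_ext}"
```

The first name matches the socket that was actually created, the second the
one mpiexec complained about.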

Is the array task method the usual way people submit MPI jobs?
Or is it usually done a different way?

>
> The PATH is set in the script itself. Do you set the path to your
> MPICH2 installation as first argument to the stop_proc_args defined
> procedure in your PE configuration?

Yes,

$ qconf -sp mpich2_mpd
pe_name            mpich2_mpd
slots              4
user_lists         NONE
xuser_lists        NONE
start_proc_args    /usr/local/apps/sge/mpich2_mpd/startmpich2.sh -catch_rsh \
                    $pe_hostfile /usr/local/apps/sge/mpich2/install
stop_proc_args     /usr/local/apps/sge/mpich2_mpd/stopmpich2.sh -catch_rsh \
                    $pe_hostfile /usr/local/apps/sge/mpich2/install
allocation_rule    $round_robin
control_slaves     TRUE
job_is_first_task  FALSE
urgency_slots      min
accounting_summary FALSE
$

The startmpich2.sh that I have sets:
# get arguments
MPICH2_ROOT=$1

Maybe it's using the wrong argument for MPICH2_ROOT?
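
One way to check which argument ends up in MPICH2_ROOT is a small stand-in
like the one below. It is only a sketch: show_root is a hypothetical function,
the assumption that the script shifts off a leading -catch_rsh is mine, and
/tmp/pe_hostfile is a placeholder for whatever $pe_hostfile expands to. The
real scripts may parse their arguments differently.

```shell
#!/bin/sh
# Hypothetical stand-in for the argument handling; the real scripts may differ.
show_root() {
    # Drop a leading -catch_rsh option if present (assumed behaviour).
    if [ "$1" = "-catch_rsh" ]; then
        shift
    fi
    # Mirrors the "MPICH2_ROOT=$1" line quoted above.
    MPICH2_ROOT=$1
    echo "$MPICH2_ROOT"
}

# Arguments as in the PE configuration above (hostfile, then install path).
with_hostfile=$(show_root -catch_rsh /tmp/pe_hostfile /usr/local/apps/sge/mpich2/install)
# Same call without the hostfile argument.
without_hostfile=$(show_root -catch_rsh /usr/local/apps/sge/mpich2/install)

echo "with hostfile:    $with_hostfile"
echo "without hostfile: $without_hostfile"
```

Under these assumptions, the hostfile path lands in MPICH2_ROOT whenever
$pe_hostfile is passed before the install path.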

Kenneth

>
> -- Reuti
>
>
>> I also had to give stopmpich2.sh the full path to mpdallexit:
>> #mpdallexit
>> /usr/local/apps/sge/mpich2/install/bin/mpdallexit
>>
>> Kenneth
>>
>> On Thu, 4 Dec 2008, Patterson, Ron (NIH/NLM/NCBI) [C] wrote:
>>
>>> Date: Thu, 4 Dec 2008 14:25:47 -0500
>>> From: "Patterson, Ron (NIH/NLM/NCBI) [C]" <patterso at ncbi.nlm.nih.gov>
>>> Reply-To: users <users at gridengine.sunsource.net>
>>> To: users at gridengine.sunsource.net
>>> Subject: RE: [GE users] mpich2 tight integration not working
>>>
>>> Reuti,
>>>
>>>> you set "job_is_first_task  FALSE" in the PE?
>>>
>>> No - I had it set to TRUE. I made the change and my first test was
>>> successful. Thank you very much for your amazingly speedy reply.
>>>
>>> Ron
>>>
>>> ------------------------------------------------------
>>> http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=91204
>>>
>>> To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].
>>>
>>
>
>

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=125743

To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].


