[GE users] mpich2 tight integration not working

kennethsdsc kenneth at sdsc.edu
Mon Mar 9 19:56:07 GMT 2009


On Mon, 9 Mar 2009, reuti wrote:

> Date: Mon, 9 Mar 2009 20:34:56 +0100
> From: reuti <reuti at staff.uni-marburg.de>
> Reply-To: users <users at gridengine.sunsource.net>
> To: users at gridengine.sunsource.net
> Subject: Re: [GE users] mpich2 tight integration not working
> 
> On 09.03.2009 at 19:46, kennethsdsc wrote:
>
>> A couple of other issues:
>>
>> - I had to specify task count in my qsub line:
>> qsub -t 1-4:4 -l h_rt=18:00:00 -q all.q -pe mpich2_mpd 4 testjob.sh
>>
>> - I had to use SGE_TASK_ID, instead of TASK_ID in mpich2_mpd.sh:
>> #export MPICH2_ROOT=/usr/local/apps/sge/mpich2/install
>> #export PATH=$MPICH2_ROOT/bin:$PATH
>> #export MPD_CON_EXT="sge_$JOB_ID.$TASK_ID"
>> setenv MPICH2_ROOT /usr/local/apps/sge/mpich2/install
>> setenv PATH $MPICH2_ROOT/bin:$PATH
>> setenv MPD_CON_EXT "sge_$JOB_ID.$SGE_TASK_ID"
>>
>> It looks like SGE is using csh to execute the file, rather than
>> the interpreter named in its #!/bin/ksh line.  I'm not sure whether
>> that's a configuration issue on my part.
>
>
> For the prolog/epilog it should just exec the specified binaries.
> Which platform are you on? Is /bin/bash available?

Sorry, I was unclear.  The job script, mpich2_mpd.sh, gets executed
by csh rather than by the interpreter named in its #!/bin/ksh line.
As a result, the export lines fail, because export is not a csh command.
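
For reference, here is a minimal sketch of the same three settings in
Bourne-shell syntax, which only works when the script really is run by an
sh-style shell. The "#$ -S /bin/sh" line is an embedded qsub option
requesting that interpreter. mpd_con_ext is a hypothetical helper, and the
fallback values "0" and "undefined" are assumptions so the sketch also runs
outside a job.

```shell
#!/bin/sh
#$ -S /bin/sh
# The "#$ -S" line above asks SGE to run this script with /bin/sh;
# outside SGE it is just a comment.

# Build the mpd console extension from the job id and the array task id.
# SGE sets JOB_ID and SGE_TASK_ID; the fallbacks here are assumptions.
mpd_con_ext() {
    printf 'sge_%s.%s\n' "${JOB_ID:-0}" "${SGE_TASK_ID:-undefined}"
}

MPICH2_ROOT=/usr/local/apps/sge/mpich2/install
PATH="$MPICH2_ROOT/bin:$PATH"
MPD_CON_EXT=$(mpd_con_ext)
export MPICH2_ROOT PATH MPD_CON_EXT

echo "MPD_CON_EXT=$MPD_CON_EXT"
```

The same interpreter request could presumably be made on the command line
with "qsub -S /bin/sh" instead of the embedded line.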

>
> The queue settings for the interpreter should only affect the
> execution of the jobscript, not the prolog/epilog. Can you please
> post your queue definition?

Here's the qconf -sq output:

[root@ken1 testprog]# qconf -sq all.q
qname                 all.q
hostlist              @allhosts
seq_no                0
load_thresholds       np_load_avg=1.75
suspend_thresholds    NONE
nsuspend              1
suspend_interval      00:05:00
priority              0
min_cpu_interval      00:05:00
processors            UNDEFINED
qtype                 BATCH INTERACTIVE
ckpt_list             NONE
pe_list               make mpich2_mpd
rerun                 FALSE
slots                 4,[ken1=2],[ken2=2]
tmpdir                /tmp
shell                 /bin/csh
prolog                NONE
epilog                NONE
shell_start_mode      posix_compliant
starter_method        NONE
suspend_method        NONE
resume_method         NONE
terminate_method      NONE
notify                00:00:60
owner_list            NONE
user_lists            NONE
xuser_lists           NONE
subordinate_list      NONE
complex_values        NONE
projects              NONE
xprojects             NONE
calendar              NONE
initial_state         default
s_rt                  INFINITY
h_rt                  INFINITY
s_cpu                 INFINITY
h_cpu                 INFINITY
s_fsize               INFINITY
h_fsize               INFINITY
s_data                INFINITY
h_data                INFINITY
s_stack               INFINITY
h_stack               INFINITY
s_core                INFINITY
h_core                INFINITY
s_rss                 INFINITY
h_rss                 INFINITY
s_vmem                INFINITY
h_vmem                INFINITY
[root@ken1 testprog]#
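
One possible reading of this listing, going by queue_conf(5) as far as I
understand it: with shell_start_mode set to posix_compliant, the first line
of the job script is ignored and the queue's shell (here /bin/csh) runs the
script unless qsub -S names another one. With unix_behavior the script's
first line is honored. The sketch below only parses text in "qconf -sq"
format and reports which case applies. queue_attr and describe_interpreter
are hypothetical helpers, not SGE commands.

```shell
#!/bin/sh
# Print the value column of the line whose first field equals $1.
# Reads "qconf -sq"-style text on stdin.
queue_attr() {
    awk -v key="$1" '$1 == key { print $2; exit }'
}

# $1: queue configuration text in "qconf -sq" format.
describe_interpreter() {
    conf=$1
    shell=$(printf '%s\n' "$conf" | queue_attr shell)
    mode=$(printf '%s\n' "$conf" | queue_attr shell_start_mode)
    case $mode in
        unix_behavior)
            echo "first line of the script (shebang) selects the interpreter" ;;
        *)
            echo "queue shell $shell runs the script (unless -S is given)" ;;
    esac
}

# Two lines copied from the all.q listing.
sample='shell                 /bin/csh
shell_start_mode      posix_compliant'

describe_interpreter "$sample"
```

On a live system the input could come from the queue itself, for example
describe_interpreter "$(qconf -sq all.q)".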

Kenneth

>
> -- Reuti
>
>
>
>> Kenneth
>>
>> On Mon, 9 Mar 2009, kennethsdsc wrote:
>>
>>> Date: Mon, 9 Mar 2009 11:39:16 -0700 (PDT)
>>> From: kennethsdsc <kenneth at sdsc.edu>
>>> Reply-To: users <users at gridengine.sunsource.net>
>>> To: users at gridengine.sunsource.net
>>> Subject: RE: [GE users] mpich2 tight integration not working
>>>
>>> I am also experimenting with tight mpich2_mpd integration on SGE 6.2u2.
>>> I'm not sure whether my problem is related to yours.  I found some
>>> mismatches between the start scripts and what SGE sets in the
>>> environment.  I was able to get mpihello.c to work by modifying the
>>> start and stop scripts.
>>>
>>> It looks like SGE is not setting TASK_ID in the environment,
>>> but is setting SGE_TASK_ID, so I modified startmpich2.sh:
>>>
>>> #export MPD_CON_EXT="sge_$JOB_ID.$TASK_ID"
>>> export MPD_CON_EXT="sge_$JOB_ID.$SGE_TASK_ID"
>>>
>>> I also had to give stopmpich2.sh the full path to mpdallexit:
>>> #mpdallexit
>>> /usr/local/apps/sge/mpich2/install/bin/mpdallexit
>>>
>>> Kenneth
>>>
>>> On Thu, 4 Dec 2008, Patterson, Ron (NIH/NLM/NCBI) [C] wrote:
>>>
>>>> Date: Thu, 4 Dec 2008 14:25:47 -0500
>>>> From: "Patterson, Ron (NIH/NLM/NCBI) [C]"
>>>> <patterso at ncbi.nlm.nih.gov>
>>>> Reply-To: users <users at gridengine.sunsource.net>
>>>> To: users at gridengine.sunsource.net
>>>> Subject: RE: [GE users] mpich2 tight integration not working
>>>>
>>>> Reuti,
>>>>
>>>>> Did you set "job_is_first_task  FALSE" in the PE?
>>>>
>>>> No - I had it set to TRUE. I made the change and my first test was
>>>> successful. Thank you very much for your amazingly speedy reply.
>>>>
>>>> Ron
>>>>
>>>> ------------------------------------------------------
>>>> http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=91204
>>>>
>>>> To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].
>>>>

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=125788

To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].


