[GE users] Parallel jobs remain in r state after finishing

Bart Willems b-willems at northwestern.edu
Tue Dec 9 19:09:41 GMT 2008


Hi Reuti,

It took me a while to get to this, but solution (b) did the trick for me.

Thanks!
Bart

> Hi,
>
> On 25.11.2008 at 22:23, Bart Willems wrote:
>
>> I have set up tight integration between mpich2 and sge 6.2 following
>> Reuti's howto:
>>
>> http://gridengine.sunsource.net/howto/mpich2-integration/mpich2-integration.html
>>
>> Everything worked fine during my testing period when I had a few
>> nodes dedicated exclusively to the parallel job queue. I have now
>> reopened these nodes to other queues as well, and now parallel jobs
>> are no longer deleted from the queue when they finish.
>>
>> My PE is set up as follows:
>>
>> # qconf -sp mpich2_smpd
>> pe_name            mpich2_smpd
>> slots              9999
>> user_lists         parallelusers
>> xuser_lists        NONE
>> start_proc_args    /opt/gridengine/mpich2_smpd/startmpich2.sh -catch_rsh \
>>                     $pe_hostfile /share/apps/mpich2
>> stop_proc_args     /opt/gridengine/mpich2_smpd/stopmpich2.sh -catch_rsh \
>>                     /share/apps/mpich2
>> allocation_rule    $fill_up
>> control_slaves     TRUE
>> job_is_first_task  FALSE
>> urgency_slots      min
>> accounting_summary FALSE
>
> This may be:
>
> a) Since you use SGE 6.2, this issue:
> http://gridengine.sunsource.net/issues/show_bug.cgi?id=2775
> You can fall back to an rsh or ssh startup.
>
> b) A race condition in MPICH2-1.0.8:
> http://lists.mcs.anl.gov/pipermail/mpich-discuss/2008-November/000138.html
> You need to set the variable described there in the start script, the
> job script, and the stop script.
>
>
> I will put an updated Howto online tomorrow, as I also just checked
> whether all parts of the old Howto still apply and found b).
>
> -- Reuti
>
> ------------------------------------------------------
> http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=89855
>
> To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].
>

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=91979

To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].


