[GE users] Parallel jobs remain in r state after finishing

reuti reuti at staff.uni-marburg.de
Tue Nov 25 22:04:11 GMT 2008


On 25.11.2008 at 22:23, Bart Willems wrote:

> I have set up tight integration between mpich2 and sge 6.2 following
> Reuti's howto:
> http://gridengine.sunsource.net/howto/mpich2-integration/mpich2-integration.html
> Everything worked fine during my testing period when I had a few
> nodes dedicated exclusively to the parallel job queue. I have now
> re-opened these nodes to other queues as well and now parallel jobs are
> no longer deleted from the queue when they finish.
> My PE is set up as follows:
> # qconf -sp mpich2_smpd
> pe_name            mpich2_smpd
> slots              9999
> user_lists         parallelusers
> xuser_lists        NONE
> start_proc_args    /opt/gridengine/mpich2_smpd/startmpich2.sh -catch_rsh \
>                     $pe_hostfile /share/apps/mpich2
> stop_proc_args     /opt/gridengine/mpich2_smpd/stopmpich2.sh -catch_rsh \
>                     /share/apps/mpich2
> allocation_rule    $fill_up
> control_slaves     TRUE
> job_is_first_task  FALSE
> urgency_slots      min
> accounting_summary FALSE

This may be either of the following:

a) As you use SGE 6.2:
http://gridengine.sunsource.net/issues/show_bug.cgi?id=2775
As a workaround you can fall back to an rsh or ssh startup.

b) A race condition in MPICH2-1.0.8:
http://lists.mcs.anl.gov/pipermail/mpich-discuss/2008-November/000138.html
You need to set the variable described in that post in the start script,
the job script and the stop script.
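
For a), a minimal sketch of what an ssh fallback might look like. It assumes the issue affects the builtin startup method that is the default in 6.2, and that ssh and sshd live in the paths shown; both are assumptions, so adjust them to your installation. The entries go into the cluster configuration (edit with "qconf -mconf", check with "qconf -sconf"):

```
rsh_command    /usr/bin/ssh
rsh_daemon     /usr/sbin/sshd -i
```

The -catch_rsh wrapper of the PE routes the daemon startup through "qrsh -inherit", which then uses these two entries, so the tight integration itself should stay in place.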

I will put an updated Howto online tomorrow, as I also just checked
whether all parts of the old Howto still apply and found b).

-- Reuti

