[GE users] Parallel jobs remain in r state after finishing
reuti at staff.uni-marburg.de
Tue Nov 25 22:04:11 GMT 2008
On 25.11.2008 at 22:23, Bart Willems wrote:
> I have set up tight integration between mpich2 and sge 6.2 following
> Reuti's howto:
> Everything worked fine during my testing period when I had a few
> nodes dedicated exclusively to the parallel job queue. I have now
> re-opened these nodes to other queues as well and now parallel jobs are
> no longer deleted from the queue when they finish.
> My PE is set up as follows:
> # qconf -sp mpich2_smpd
> pe_name mpich2_smpd
> slots 9999
> user_lists parallelusers
> xuser_lists NONE
> start_proc_args /opt/gridengine/mpich2_smpd/startmpich2.sh -catch_rsh \
> $pe_hostfile /share/apps/mpich2
> stop_proc_args /opt/gridengine/mpich2_smpd/stopmpich2.sh -catch_rsh \
> allocation_rule $fill_up
> control_slaves TRUE
> job_is_first_task FALSE
> urgency_slots min
> accounting_summary FALSE
This may be one of two things:
a) A known issue in SGE 6.2, which you are using:
   http://gridengine.sunsource.net/issues/show_bug.cgi?id=2775
   You can fall back to an rsh or ssh startup.
b) A race condition in MPICH2-1.0.8:
   http://lists.mcs.anl.gov/pipermail/mpich-discuss/2008-November/000138.html
   You need to set the variable described there in the start-script, the
   job script and the stop-script.
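
As a minimal sketch of b), the lines below could be placed near the top of startmpich2.sh, the job script and stopmpich2.sh. The variable name SMPD_OPTION_NO_DYNAMIC_HOSTS is an assumption here, not taken from this mail; please verify it against the linked mpich-discuss post before relying on it. The commented mpiexec line and the program name are hypothetical.

```shell
#!/bin/sh
# Assumption: SMPD_OPTION_NO_DYNAMIC_HOSTS is the variable meant in the
# linked mpich-discuss post; check the post before using this.
SMPD_OPTION_NO_DYNAMIC_HOSTS=1
export SMPD_OPTION_NO_DYNAMIC_HOSTS

# The rest of the script follows unchanged. In a job script this might be
# something like (hypothetical program name):
# mpiexec -n "$NSLOTS" ./my_program
```

The same two lines go into all three scripts, since each of them starts or contacts smpd in its own environment.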
I will put an updated Howto online tomorrow, as I just rechecked
whether all parts of the old Howto still apply and found b).