[GE users] Parallel jobs don't terminate

rayson rayrayson at gmail.com
Tue Jul 28 17:15:52 BST 2009


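Make sure that the slave MPI tasks are children of SGE's shepherd
(sge_shepherd) on each node. With control_slaves TRUE they should be
started through the rsh wrapper that -catch_rsh sets up (which calls
qrsh -inherit). Tasks launched by a plain rsh or ssh are outside SGE's
control, so qdel cannot kill them.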
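
One way to check the parent/child relationship on a compute node is to walk up the parent chain of a running MPI process and see whether a shepherd appears in it. The following is only a rough sketch: the function name under_shepherd is made up for illustration, and it assumes the shepherd's command name contains "sge_shepherd", which may differ between installations.

```shell
#!/bin/sh
# under_shepherd PID
# Hypothetical helper: reads a process table ("pid ppid comm" per line)
# on stdin and walks up the parent chain from PID.
# Exits 0 if an ancestor (or PID itself) looks like an SGE shepherd,
# 1 otherwise. Assumes the command name contains "sge_shepherd".
under_shepherd() {
    awk -v start="$1" '
        { ppid[$1] = $2; comm[$1] = $3 }
        END {
            pid = start
            hops = 0
            # Stop at init (pid 1), at an unknown pid, or after too many hops.
            while ((pid in ppid) && (pid + 0 > 1) && (hops < 1000)) {
                if (comm[pid] ~ /sge_shepherd/) { found = 1; break }
                pid = ppid[pid]
                hops++
            }
            exit(found ? 0 : 1)
        }'
}

# Example use on a node (12345 is a placeholder for an MPI task's pid):
#   ps -e -o pid= -o ppid= -o comm= | under_shepherd 12345 \
#       && echo "under shepherd" || echo "NOT under shepherd"
```

If a task reports "NOT under shepherd" while its job is still running, it was probably started outside the rsh wrapper, and SGE will not be able to clean it up.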

Rayson



On 7/28/09, markhewitt <mh613 at york.ac.uk> wrote:
> I have a problem with users running MPI jobs. Everything starts up
> fine, but when a job is terminated by SGE (either because it reaches
> its maximum wallclock time or because a user issues qdel), SGE removes
> the job from its list while the processes keep running on the nodes.
> The nodes quickly become overloaded with orphan processes.
>
> Any ideas what could be going wrong here?
>
>
> # qconf -sp mpich-infiniband
> pe_name            mpich-infiniband
> slots              32
> user_lists         NONE
> xuser_lists        NONE
> start_proc_args    /opt/n1ge6/mvapich/startmpi.sh -catch_rsh $pe_hostfile
> stop_proc_args     /opt/n1ge6/mvapich/stopmpi.sh
> allocation_rule    $fill_up
> control_slaves     TRUE
> job_is_first_task  FALSE
> urgency_slots      min
> accounting_summary FALSE
>
> Many thanks for your help
>
> Mark Hewitt
>
> ------------------------------------------------------
> http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=209885
>
> To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].
>

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=209918
