[GE users] Parallel jobs don't terminate

markhewitt mh613 at york.ac.uk
Tue Jul 28 12:42:01 BST 2009

I have a problem with users running MPI jobs. Everything starts up 
fine, but when a job is terminated by SGE (either because it reaches 
its maximum wallclock time or because a user issues qdel), SGE removes 
the job from its job list while the job's processes keep running on 
the nodes. As a result, the nodes quickly become overloaded with 
orphaned processes.
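
A rough sketch of how the leftover processes can be spotted on a node. It assumes the orphans have been re-parented to init (PPID 1), which is typical but not guaranteed. The function name find_orphans and the user name in the usage comment are only illustrative.

```shell
# Print the PIDs of processes owned by a given user whose parent is
# init (PPID 1). Assumption: orphaned MPI ranks end up with PPID 1.
# Input is read from stdin as "pid ppid user args..." lines, so the
# filter can be fed from ps or from a saved listing.
find_orphans() {
    target_user="$1"
    awk -v u="$target_user" '$2 == 1 && $3 == u { print $1 }'
}

# Example usage on a compute node (user name is a placeholder):
#   ps -e -o pid= -o ppid= -o user= -o args= | find_orphans someuser
```

Daemons that a user started deliberately will also have PPID 1, so the output is a list of candidates to inspect, not a list of processes to kill blindly.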

Any ideas what could be going wrong here?

# qconf -sp mpich-infiniband
pe_name            mpich-infiniband
slots              32
user_lists         NONE
xuser_lists        NONE
start_proc_args    /opt/n1ge6/mvapich/startmpi.sh -catch_rsh $pe_hostfile
stop_proc_args     /opt/n1ge6/mvapich/stopmpi.sh
allocation_rule    $fill_up
control_slaves     TRUE
job_is_first_task  FALSE
urgency_slots      min
accounting_summary FALSE

Many thanks for your help

Mark Hewitt

