[GE users] jobs never die on nodes with mpich

Michel Cuendet michel.cuendet at epfl.ch
Mon Aug 2 11:30:31 BST 2004


Hi,

I'm running SGE 5.3p4 on an Opteron cluster, together with mpich 1.2.5.10.
Here is the configuration of the parallel environment:

>  qconf -sp mpich
pe_name           mpich
queue_list        all
slots             64
user_lists        cluster_users
xuser_lists       NONE
start_proc_args   /opt/sge/mpi/startmpi.sh -catch_rsh $pe_hostfile
stop_proc_args    /opt/sge/mpi/stopmpi.sh
allocation_rule   $round_robin
control_slaves    TRUE
job_is_first_task FALSE

The jobs run nicely in parallel, but qdel kills only the processes on the
master node. The job disappears from the queue, but ghost processes keep
taking 99% of the CPU on all slave nodes. The same happens if the program
exits with an error.
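
To make the symptom concrete, here is a minimal sketch of how the leftover
processes can be listed on a slave node. The function name list_ghosts and
the 90% threshold are illustrative assumptions, not part of SGE or mpich.
It only filters ps output by owner and CPU usage; it does not check
whether a process still belongs to a running job.

```shell
#!/bin/sh
# list_ghosts USER MIN_PCPU
# Reads lines of the form "user pid pcpu command" on stdin and prints
# "pid pcpu command" for processes owned by USER whose CPU usage is at
# least MIN_PCPU percent. (Hypothetical helper, for illustration only.)
list_ghosts() {
    awk -v user="$1" -v min="$2" \
        '$1 == user && $3 + 0 >= min + 0 { print $2, $3, $4 }'
}

# Typical use on a slave node (each "-o name=" suppresses that header):
#   ps -e -o user= -o pid= -o pcpu= -o comm= | list_ghosts mitch 90
```

Note that the pcpu value reported by ps is an average over the lifetime of
the process, so a long-running ghost shows up reliably, while a process
that only recently started spinning may not.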

I've browsed this mailing list and tried a few things, but the situation
didn't improve.

On a slave node, here is what remains after a job is killed, which shows
that I'm using tight integration (?):

UID        PID  PPID  PGID   SID  C STIME TTY          TIME CMD
sgeadmin  2434     1  2434  1784  0 Jul19 ?        00:20:28 /opt/sge/bin/glinux/sge_execd
sgeadmin  9030  2434  9030  1784  0 15:04 ?        00:00:00  \_ sge_shepherd-4889 -bg
root      9031  9030  9031  9031  0 15:04 ?        00:00:00      \_ /opt/sge/utilbin/glinux/rshd -l
mitch     9032  9031  9032  9031  0 15:04 ?        00:00:00          \_ [qrsh_starter <defunct>]

I also tried adding this to my qsub submission script, but it didn't work:
export MPICH_PROCESS_GROUP=no
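
For clarity, this is roughly where the line sits in the submission script.
It is a sketch only: the file name mpi_job.sh, the slot count of 4 and the
program name my_mpi_program are placeholders, and the machine file path
assumes the stock startmpi.sh, which writes it to $TMPDIR/machines. As far
as I understand, the variable is meant to stop mpich from putting its
processes into a separate process group.

```shell
#!/bin/sh
# Write a placeholder job script to mpi_job.sh.
# It would then be submitted with: qsub mpi_job.sh
cat > mpi_job.sh <<'EOF'
#!/bin/sh
#$ -S /bin/sh
#$ -cwd
#$ -pe mpich 4
# Set before mpirun so that the mpich processes inherit it.
export MPICH_PROCESS_GROUP=no
# NSLOTS is set by SGE; the machine file location is an assumption
# based on the default startmpi.sh.
mpirun -np $NSLOTS -machinefile $TMPDIR/machines ./my_mpi_program
EOF
```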

I really need to get this working, because users keep submitting jobs
when they see room in the queue, and the cluster gets totally clogged
with ghost jobs.

Thanks,

Michel

-- 
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Michel Cuendet, Ph.D. student
Laboratory of Computational Biochemistry and Chemistry
Swiss Federal Institute of Technology in Lausanne (EPFL)
CH-1015 Lausanne						
Switzerland                         	Phone : +41 1 693 0324
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++



