[GE users] Yet another qdel mpich problem (SGE 6.0u1)

Vladimir Florinski vflorins at ucr.edu
Wed Sep 8 00:54:47 BST 2004


The problem with the qdel command (its inability to terminate child
processes) appears to continue to haunt MPI users. I have studied reports
of this problem in the mailing list archive (dealing with older versions
of SGE), but was unable to find a working solution. I recently
installed the latest version of SGE on our Myrinet cluster of SMP
machines (2 queue slots per node) and set up the "mpi" parallel
environment according to the tight integration template. Jobs start
properly, but are not cleaned up correctly after a qdel. On all nodes
except one, qdel removes only the parent processes (shepherds, bash,
qrsh, etc.), but not the computational processes themselves. On the
remaining node, one of the 2 processes is correctly terminated, but
the other is left running. I think this behavior is different from what
was reported previously.

To provide some background, I am using mpich-gm version 1.2.5..12 (from
Myricom). Parallel jobs run properly when started with mpirun.ch_gm.
Output from qconf -sp mpi:

pe_name           mpi
slots             128
user_lists        NONE
xuser_lists       NONE
start_proc_args   /opt/sge/mpi/myrinet/startmpi.sh -catch_rsh $pe_hostfile /opt/mpich-gm/bin/mpirun.ch_gm
stop_proc_args    /opt/sge/mpi/myrinet/stopmpi.sh
allocation_rule   $fill_up
control_slaves    TRUE
job_is_first_task FALSE
urgency_slots     min

Typical process tree on a slave node (2 slots used):

sge       9737     1  9737  9450  0 Sep05 ?        00:00:28 /opt/sge/bin/lx24-x86/sge_execd
sge      11146  9737 11146  9450  0 13:38 ?        00:00:00  \_ sge_shepherd-10 -bg
root     11148 11146 11148 11148  0 13:38 ?        00:00:00  |   \_ /opt/sge/utilbin/lx24-x86/rshd -l
vladimir 11150 11148 11150 11148  0 13:38 ?        00:00:00  |       \_ /opt/sge/utilbin/lx24-x86/qrsh_starter /opt/sge/default/spool/node10e/active_j
vladimir 11152 11150 11152 11148  0 13:38 ?        00:00:00  |           \_ bash -c cd /home/vladimir/Test1 ; env GMPI_MASTER=node12e GMPI_PORT=47081
vladimir 11153 11152 11153 11153 99 13:38 ?        00:01:58  |               \_ /home/vladimir/Test1/./mpi_main -new 100.0
sge      11147  9737 11147  9450  0 13:38 ?        00:00:00  \_ sge_shepherd-10 -bg
root     11149 11147 11149 11149  0 13:38 ?        00:00:00      \_ /opt/sge/utilbin/lx24-x86/rshd -l
vladimir 11151 11149 11151 11149  0 13:38 ?        00:00:00          \_ /opt/sge/utilbin/lx24-x86/qrsh_starter /opt/sge/default/spool/node10e/active_j
vladimir 11154 11151 11154 11149  0 13:38 ?        00:00:00              \_ bash -c cd /home/vladimir/Test1 ; env GMPI_MASTER=node12e GMPI_PORT=47081
vladimir 11155 11154 11155 11155 98 13:38 ?        00:01:57                  \_ /home/vladimir/Test1/./mpi_main -new 100.0

(sorry for the formatting).
When qdel is issued, process 11153 mpi_main is killed, while process
11155 keeps running. I don't understand this because they appear to be
identical.
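
One detail visible in the listing: each mpi_main has its PGID and SID
equal to its own PID (11153 and 11155), so each one leads its own
session rather than staying in the session of the rshd above it.
Whether that is related to the qdel behavior is not established here.
A minimal sketch for spotting such processes in a listing follows. The
function name session_leaders is hypothetical, and the column order
(UID PID PPID PGID SID C STIME TTY TIME CMD) is assumed to match the
listing above, taken without the tree drawing.

```shell
# Hypothetical helper: from `ps -ejf`-style lines on stdin, print the PID
# and command of every process that leads its own session (SID == PID).
# Assumed columns: UID PID PPID PGID SID C STIME TTY TIME CMD...
session_leaders() {
    awk '$2 == $5 {
        printf "%s", $2                               # the PID
        for (i = 10; i <= NF; i++) printf " %s", $i   # the command line
        printf "\n"
    }'
}

# Usage on a node:
#   ps -ejf | session_leaders
```

The header line is skipped automatically, because its second and fifth
fields ("PID" and "SID") differ.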

Here is the startup script, just in case:

#$ -N inst-nn-6
#$ -cwd
#$ -pe mpi 2-10
#$ -v MPIR_HOME
/opt/mpich-gm/bin/mpirun.ch_gm --gm-no-shmem -machinefile $TMPDIR/machines \
    --gm-kill 15 -np $NSLOTS ./mpi_main -new 100.0

I also see a few messages of the following kind in the output file:

Warning: no access to tty (Bad file descriptor).
Thus no job control in this shell.

These are probably unrelated to the main issue (qdel), though.

Does anyone know how to fix the qdel problem?


-- 
Vladimir Florinski
Assistant Research Physicist
Institute of Geophysics and Planetary Physics
University of California
Riverside, CA 92521
phone: 1-909-787-3943
fax: 1-909-787-4509

