[GE users] Yet another qdel mpich problem (SGE 6.0u1)

Charu Chaubal Charu.Chaubal at Sun.COM
Wed Sep 8 01:05:20 BST 2004


Hi,

One suggestion is to write a custom delete method that looks for all processes 
with the special GE additional group ID set and kills them forcefully... unless 
you find that the escaped processes have somehow "shed" their additional group 
ID along the way.
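
As a rough sketch of that idea, assuming a Linux execution host where
/proc/<pid>/status carries a "Groups:" line. The function name pids_with_gid,
its optional second argument, and the GID 20001 are illustrative placeholders,
not anything SGE provides. How the script is hooked in (for instance through a
queue's terminate_method) and how it learns the job's additional group ID are
left out here.

```shell
#!/bin/sh
# Hypothetical helper: print the PID of every process whose supplementary
# group list contains the given GID.
# Assumption: Linux /proc layout with a "Groups:" line in <pid>/status.
# The second argument (default /proc) exists only so the function can be
# tried out against a directory of fake status files.
pids_with_gid() {
    gid=$1
    proc_root=${2:-/proc}
    for status in "$proc_root"/[0-9]*/status; do
        [ -r "$status" ] || continue
        # Compare the GID against each whole field of the Groups: line,
        # so that 2000 does not match 20001.
        if awk -v gid="$gid" '
            $1 == "Groups:" { for (i = 2; i <= NF; i++) if ($i == gid) found = 1 }
            END { exit !found }' "$status" 2>/dev/null
        then
            pid=${status%/status}
            echo "${pid##*/}"
        fi
    done
}

# Possible use inside a delete script (20001 is a placeholder GID):
#   for pid in $(pids_with_gid 20001); do kill -9 "$pid"; done
```

Running the function by itself first, without the kill loop, should show
whether the leftover mpi_main processes still carry the group ID at all.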

Regards,
	Charu


Vladimir Florinski wrote:
> It appears the problem with the qdel command (inability to terminate the
> child processes) continues to haunt MPI users. I have studied reports
> of this problem in the mailing list archive (dealing with older versions
> of SGE), but was unable to find a working solution. I have recently
> installed the latest version of SGE on our Myrinet cluster of SMP
> machines (2 queue slots per node) and set up the "mpi" parallel
> environment according to the tight integration template. Jobs are
> started properly, but don't clean up correctly after a qdel. That only
> removes the parent processes (shepherds, bash, qrsh, etc.), but not the
> computational processes themselves on all nodes except one. On that
> remaining node, one of the 2 processes is correctly terminated, but
> the other is left running. I think this behavior is different from what
> was reported previously.
> 
> To provide some background, I am using mpich-gm version 1.2.5..12 (from
> Myricom). Parallel jobs run properly when started with mpirun.ch_gm.
> Output from qconf -sp mpi:
> 
> pe_name           mpi
> slots             128
> user_lists        NONE
> xuser_lists       NONE
> start_proc_args   /opt/sge/mpi/myrinet/startmpi.sh -catch_rsh $pe_hostfile /opt/mpich-gm/bin/mpirun.ch_gm
> stop_proc_args    /opt/sge/mpi/myrinet/stopmpi.sh
> allocation_rule   $fill_up
> control_slaves    TRUE
> job_is_first_task FALSE
> urgency_slots     min
> 
> Typical process tree on a slave node (2 slots used):
> 
> sge       9737     1  9737  9450  0 Sep05 ?  00:00:28 /opt/sge/bin/lx24-x86/sge_execd
> sge      11146  9737 11146  9450  0 13:38 ?  00:00:00  \_ sge_shepherd-10 -bg
> root     11148 11146 11148 11148  0 13:38 ?  00:00:00  |   \_ /opt/sge/utilbin/lx24-x86/rshd -l
> vladimir 11150 11148 11150 11148  0 13:38 ?  00:00:00  |       \_ /opt/sge/utilbin/lx24-x86/qrsh_starter /opt/sge/default/spool/node10e/active_j
> vladimir 11152 11150 11152 11148  0 13:38 ?  00:00:00  |           \_ bash -c cd /home/vladimir/Test1 ; env GMPI_MASTER=node12e GMPI_PORT=47081
> vladimir 11153 11152 11153 11153 99 13:38 ?  00:01:58  |               \_ /home/vladimir/Test1/./mpi_main -new 100.0
> sge      11147  9737 11147  9450  0 13:38 ?  00:00:00  \_ sge_shepherd-10 -bg
> root     11149 11147 11149 11149  0 13:38 ?  00:00:00      \_ /opt/sge/utilbin/lx24-x86/rshd -l
> vladimir 11151 11149 11151 11149  0 13:38 ?  00:00:00          \_ /opt/sge/utilbin/lx24-x86/qrsh_starter /opt/sge/default/spool/node10e/active_j
> vladimir 11154 11151 11154 11149  0 13:38 ?  00:00:00              \_ bash -c cd /home/vladimir/Test1 ; env GMPI_MASTER=node12e GMPI_PORT=47081
> vladimir 11155 11154 11155 11155 98 13:38 ?  00:01:57                  \_ /home/vladimir/Test1/./mpi_main -new 100.0
> 
> (sorry for the formatting).
> When qdel is issued, process 11153 mpi_main is killed, while process
> 11155 keeps running. I don't understand this because they appear to be
> identical.
> 
> Here is the startup script, just in case:
> 
> #$ -N inst-nn-6
> #$ -cwd
> #$ -pe mpi 2-10
> #$ -v MPIR_HOME
> /opt/mpich-gm/bin/mpirun.ch_gm --gm-no-shmem -machinefile $TMPDIR/machines --gm-kill 15 -np $NSLOTS ./mpi_main -new 100.0
> 
> I see a few errors in the output file of the kind:
> 
> Warning: no access to tty (Bad file descriptor).
> Thus no job control in this shell.
> 
> These are probably unrelated to the main issue (qdel), though.
> 
> Does anyone know how to fix the qdel problem?
> 
> 

-- 
####################################################################
# Charu V. Chaubal              # Phone: (650) 786-7672 (x87672)   #
# Grid Computing Technologist   # Fax:   (650) 786-4591            #
# Sun Microsystems, Inc.        # Email: charu.chaubal at sun.com     #
####################################################################

