[GE users] Yet another qdel mpich problem (SGE 6.0u1)

Ron Chen ron_chen_123 at yahoo.com
Wed Sep 8 01:13:18 BST 2004


Will enabling the code in shepherd to kill all
processes in the process group fix this problem?

 -Ron

--- Charu Chaubal <Charu.Chaubal at Sun.COM> wrote:
> Hi,
> 
> One suggestion is to write a custom delete method which looks for all
> processes with the special GE additional group ID set and kills them
> forcefully... unless you find that the escaped processes somehow "shed"
> their additional group ID in the process.
> 
> Regards,
> 	Charu
> 
> 
> Vladimir Florinski wrote:
> > It appears the problem with the qdel command (inability to terminate
> > the child processes) continues to haunt MPI users. I have studied
> > reports of this problem in the mailing list archive (dealing with older
> > versions of SGE), but was unable to find a working solution. I have
> > recently installed the latest version of SGE on our Myrinet cluster of
> > SMP machines (2 queue slots per node) and set up the "mpi" parallel
> > environment according to the tight integration template. Jobs are
> > started properly, but don't clean up correctly after a qdel. That only
> > removes the parent processes (shepherds, bash, qrsh, etc.), but not the
> > computational processes themselves, on all nodes except one. On that
> > remaining node, one of the 2 processes is correctly terminated, but the
> > other is left running. I think this behavior is different from what was
> > reported previously.
> > 
> > To provide some background, I am using mpich-gm version 1.2.5..12 (from
> > Myricom). Parallel jobs run properly when started with mpirun.ch_gm.
> > Output from qconf -sp mpi:
> > 
> > pe_name           mpi
> > slots             128
> > user_lists        NONE
> > xuser_lists       NONE
> > start_proc_args   /opt/sge/mpi/myrinet/startmpi.sh -catch_rsh $pe_hostfile /opt/mpich-gm/bin/mpirun.ch_gm
> > stop_proc_args    /opt/sge/mpi/myrinet/stopmpi.sh
> > allocation_rule   $fill_up
> > control_slaves    TRUE
> > job_is_first_task FALSE
> > urgency_slots     min
> > 
> > Typical process tree on a slave node (2 slots used):
> > 
> > sge       9737     1  9737  9450  0 Sep05 ?        00:00:28 /opt/sge/bin/lx24-x86/sge_execd
> > sge      11146  9737 11146  9450  0 13:38 ?        00:00:00  \_ sge_shepherd-10 -bg
> > root     11148 11146 11148 11148  0 13:38 ?        00:00:00  |   \_ /opt/sge/utilbin/lx24-x86/rshd -l
> > vladimir 11150 11148 11150 11148  0 13:38 ?        00:00:00  |       \_ /opt/sge/utilbin/lx24-x86/qrsh_starter /opt/sge/default/spool/node10e/active_j
> > vladimir 11152 11150 11152 11148  0 13:38 ?        00:00:00  |           \_ bash -c cd /home/vladimir/Test1 ; env GMPI_MASTER=node12e GMPI_PORT=47081
> > vladimir 11153 11152 11153 11153 99 13:38 ?        00:01:58  |               \_ /home/vladimir/Test1/./mpi_main -new 100.0
> > sge      11147  9737 11147  9450  0 13:38 ?        00:00:00  \_ sge_shepherd-10 -bg
> > root     11149 11147 11149 11149  0 13:38 ?        00:00:00      \_ /opt/sge/utilbin/lx24-x86/rshd -l
> > vladimir 11151 11149 11151 11149  0 13:38 ?        00:00:00          \_ /opt/sge/utilbin/lx24-x86/qrsh_starter /opt/sge/default/spool/node10e/active_j
> > vladimir 11154 11151 11154 11149  0 13:38 ?        00:00:00              \_ bash -c cd /home/vladimir/Test1 ; env GMPI_MASTER=node12e GMPI_PORT=47081
> > vladimir 11155 11154 11155 11155 98 13:38 ?        00:01:57                  \_ /home/vladimir/Test1/./mpi_main -new 100.0
> > 
> > (sorry for the formatting).
> > When qdel is issued, process 11153 (mpi_main) is killed, while process
> > 11155 keeps running. I don't understand this because they appear to be
> > identical.
> > 
> > Here is the startup script, just in case:
> > 
> > #$ -N inst-nn-6
> > #$ -cwd
> > #$ -pe mpi 2-10
> > #$ -v MPIR_HOME
> > /opt/mpich-gm/bin/mpirun.ch_gm --gm-no-shmem -machinefile $TMPDIR/machines --gm-kill 15 -np $NSLOTS ./mpi_main -new 100.0
> > 
> > I see a few errors in the output file of the kind:
> > 
> > Warning: no access to tty (Bad file descriptor).
> > Thus no job control in this shell.
> > 
> > These are probably unrelated to the main issue (qdel), though.
> > 
> > Does anyone know how to fix the qdel problem?
> > 
> > 
> 
> -- 
> ####################################################################
> # Charu V. Chaubal              # Phone: (650) 786-7672 (x87672)   #
> # Grid Computing Technologist   # Fax:   (650) 786-4591            #
> # Sun Microsystems, Inc.        # Email: charu.chaubal at sun.com     #
> ####################################################################
> 
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
> For additional commands, e-mail: users-help at gridengine.sunsource.net
> 
> 
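
Charu's suggestion in the quoted message (a custom delete method that
looks for every process carrying the job's additional group ID) could be
sketched roughly as below. This is only an illustration under stated
assumptions, not something Grid Engine ships: it assumes a Linux /proc
where each /proc/<pid>/status has a "Groups:" line listing supplementary
GIDs, the function names pids_with_gid and kill_by_gid are made up for
this example, and how the job's additional group ID reaches the script
is left to the caller.

```shell
#!/bin/sh
# Sketch only: function names and the proc_root parameter are hypothetical.

# Print the PID of every process whose supplementary group list contains
# the given GID. proc_root defaults to /proc; it is a parameter only so
# the function can be exercised against a fake directory tree.
pids_with_gid() {
    gid=$1
    proc_root=${2:-/proc}
    [ -n "$gid" ] || return 1
    for status in "$proc_root"/[0-9]*/status; do
        [ -r "$status" ] || continue
        # On Linux the "Groups:" line holds the supplementary GIDs.
        # awk exits 0 only if one of them equals the GID we look for.
        if awk -v gid="$gid" '
            $1 == "Groups:" { for (i = 2; i <= NF; i++) if ($i == gid) found = 1 }
            END { exit !found }' "$status" 2>/dev/null; then
            pid=${status%/status}
            echo "${pid##*/}"
        fi
    done
}

# Forcefully kill everything that still carries the GID. Must run with
# enough privilege to signal the job owner's processes.
kill_by_gid() {
    for pid in $(pids_with_gid "$1"); do
        kill -KILL "$pid" 2>/dev/null
    done
}
```

Running pids_with_gid alone first, with the GID of a test job, shows
which processes would be hit before anything is killed. As Charu notes,
this only helps if the escaped processes have not shed the additional
group ID.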



		




