[GE users] Yet another qdel mpich problem (SGE 6.0u1)

Andreas Haas Andreas.Haas at Sun.COM
Wed Sep 8 12:54:20 BST 2004


Killing qrsh -inherit tasks is done based on the pid of the
process that was forked by 'qrsh_starter' using a

   kill(-pid, signal)

to hit the whole process group spawned by that process.
If processes remain alive after qdel with tight MPICH
integration, this is almost surely because a process in
the chain called

   setpgrp()
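
The effect can be reproduced outside SGE. What follows is a minimal
Python sketch, not SGE code: the names spawn, read_exact and demo are
made up for the illustration, and the three children merely stand in
for the processes of a job. One child becomes a process group leader,
one joins that group, and one joins it and then calls setpgrp(). A
kill(-pid, SIGTERM) aimed at the leader should then reach only the
first two.

```python
import os
import signal
import time


def spawn(setup, ready_w):
    """Fork a child that runs setup(), reports on ready_w, then idles."""
    pid = os.fork()
    if pid == 0:
        ok = True
        try:
            # Make sure SIGTERM has its default (terminating) action.
            signal.signal(signal.SIGTERM, signal.SIG_DFL)
            setup()
        except OSError:
            ok = False
        os.write(ready_w, b"x" if ok else b"!")
        while ok:
            time.sleep(60)
        os._exit(1)
    return pid


def read_exact(fd, n):
    """Read exactly n bytes from a pipe."""
    data = b""
    while len(data) < n:
        data += os.read(fd, n - len(data))
    return data


def demo():
    ready_r, ready_w = os.pipe()

    # Group leader: its pid becomes the process group id.
    leader = spawn(lambda: os.setpgid(0, 0), ready_w)
    if read_exact(ready_r, 1) != b"x":
        raise RuntimeError("leader could not create its process group")

    # Ordinary member of the leader's group.
    member = spawn(lambda: os.setpgid(0, leader), ready_w)

    # Joins the group first, then leaves it again with setpgrp().
    def join_then_escape():
        os.setpgid(0, leader)
        os.setpgrp()

    escapee = spawn(join_then_escape, ready_w)
    if read_exact(ready_r, 2) != b"xx":
        raise RuntimeError("a child could not set its process group")

    # Negative pid: signal every process in the group led by `leader`.
    os.kill(-leader, signal.SIGTERM)

    result = {}
    for name, pid in (("leader", leader), ("member", member)):
        _, status = os.waitpid(pid, 0)
        by_term = os.WIFSIGNALED(status) and os.WTERMSIG(status) == signal.SIGTERM
        result[name] = "killed" if by_term else "other"

    time.sleep(0.2)  # give a stray signal time to arrive, if there were one
    pid, _ = os.waitpid(escapee, os.WNOHANG)
    result["escapee"] = "survived" if pid == 0 else "died"
    if pid == 0:
        # Clean up the survivor explicitly.
        os.kill(escapee, signal.SIGKILL)
        os.waitpid(escapee, 0)

    os.close(ready_r)
    os.close(ready_w)
    return result


if __name__ == "__main__":
    print(demo())
```

On a POSIX system this should report the leader and the member as
killed and the third child as survived, which is the same pattern as a
computational process outliving qdel.
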

I recommend investigating this first.
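
One way to investigate is to compare process group IDs along the
process chain. The sketch below is only an illustration: parse_ps and
escaped_descendants are hypothetical helper names, and the sample
numbers are taken from the process listing quoted further down, read on
the assumption that its numeric columns are PID, PPID, PGID and SID (the
listing has no header line, so this is a guess). The helper reports
every descendant of a given process whose group ID differs from that
process's own group, i.e. every process a kill(-pid, signal) on that
group would miss.

```python
def parse_ps(text):
    """Parse lines of 'PID PPID PGID' into a list of int tuples."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) >= 3:
            rows.append(tuple(int(f) for f in fields[:3]))
    return rows


def escaped_descendants(rows, root_pid):
    """Return descendants of root_pid that are not in root_pid's group."""
    children = {}
    pgid_of = {}
    for pid, ppid, pgid in rows:
        children.setdefault(ppid, []).append(pid)
        pgid_of[pid] = pgid

    target = pgid_of[root_pid]
    escaped = []
    stack = list(children.get(root_pid, []))
    while stack:
        pid = stack.pop()
        if pgid_of[pid] != target:
            escaped.append(pid)
        stack.extend(children.get(pid, []))
    return sorted(escaped)


# PID PPID PGID, copied from the quoted listing (column meaning assumed).
SAMPLE = """\
11150 11148 11150
11152 11150 11152
11153 11152 11153
11151 11149 11151
11154 11151 11154
11155 11154 11155
"""

if __name__ == "__main__":
    rows = parse_ps(SAMPLE)
    for root in (11152, 11154):
        print(root, escaped_descendants(rows, root))
```

On a live node the same three columns can be obtained with
"ps -eo pid=,ppid=,pgid=". Under the column reading assumed here, both
mpi_main processes sit in a group of their own, which would fit the
setpgrp() explanation, although it does not by itself explain why one
of them was killed.
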

From an earlier, similar case I recall that there was an MPICH
compile-time option, --enable-processgroup=no, which might
play a role.

Regards,
Andreas

On Tue, 7 Sep 2004, Vladimir Florinski wrote:

> It appears the problem with the qdel command (inability to terminate the
> child processes) continues to haunt MPI users. I have studied reports
> of this problem in the mailing list archive (dealing with older versions
> of SGE), but was unable to find a working solution. I have recently
> installed the latest version of SGE on our Myrinet cluster of SMP
> machines (2 queue slots per node) and set up the "mpi" parallel
> environment according to the tight integration template. Jobs are
> started properly, but don't clean up correctly after a qdel. That only
> removes the parent processes (shepherds, bash, qrsh, etc.), but not the
> computational processes themselves on all nodes except one. On that
> remaining node one of the 2 processes is correctly terminated, but
> the other is left running. I think this behavior is different from what
> was reported previously.
>
> To provide some background, I am using mpich-gm version 1.2.5..12 (from
> Myricom). Parallel jobs run properly when started with mpirun.ch_gm.
> Output from qconf -sp mpi
>
> pe_name           mpi
> slots             128
> user_lists        NONE
> xuser_lists       NONE
> start_proc_args   /opt/sge/mpi/myrinet/startmpi.sh -catch_rsh $pe_hostfile /opt/mpich-gm/bin/mpirun.ch_gm
> stop_proc_args    /opt/sge/mpi/myrinet/stopmpi.sh
> allocation_rule   $fill_up
> control_slaves    TRUE
> job_is_first_task FALSE
> urgency_slots     min
>
> Typical process tree on a slave node (2 slots used):
>
> sge       9737     1  9737  9450  0 Sep05 ?        00:00:28 /opt/sge/bin/lx24-x86/sge_execd
> sge      11146  9737 11146  9450  0 13:38 ?        00:00:00  \_ sge_shepherd-10 -bg
> root     11148 11146 11148 11148  0 13:38 ?        00:00:00  |   \_ /opt/sge/utilbin/lx24-x86/rshd -l
> vladimir 11150 11148 11150 11148  0 13:38 ?        00:00:00  |       \_ /opt/sge/utilbin/lx24-x86/qrsh_starter /opt/sge/default/spool/node10e/active_j
> vladimir 11152 11150 11152 11148  0 13:38 ?        00:00:00  |           \_ bash -c cd /home/vladimir/Test1 ; env GMPI_MASTER=node12e GMPI_PORT=47081
> vladimir 11153 11152 11153 11153 99 13:38 ?        00:01:58  |               \_ /home/vladimir/Test1/./mpi_main -new 100.0
> sge      11147  9737 11147  9450  0 13:38 ?        00:00:00  \_ sge_shepherd-10 -bg
> root     11149 11147 11149 11149  0 13:38 ?        00:00:00      \_ /opt/sge/utilbin/lx24-x86/rshd -l
> vladimir 11151 11149 11151 11149  0 13:38 ?        00:00:00          \_ /opt/sge/utilbin/lx24-x86/qrsh_starter /opt/sge/default/spool/node10e/active_j
> vladimir 11154 11151 11154 11149  0 13:38 ?        00:00:00              \_ bash -c cd /home/vladimir/Test1 ; env GMPI_MASTER=node12e GMPI_PORT=47081
> vladimir 11155 11154 11155 11155 98 13:38 ?        00:01:57                  \_ /home/vladimir/Test1/./mpi_main -new 100.0
>
> (sorry for the formatting).
> When qdel is issued, process 11153 mpi_main is killed, while process
> 11155 keeps running. I don't understand this because they appear to be
> identical.
>
> Here is the startup script, just in case:
>
> #$ -N inst-nn-6
> #$ -cwd
> #$ -pe mpi 2-10
> #$ -v MPIR_HOME
> /opt/mpich-gm/bin/mpirun.ch_gm --gm-no-shmem -machinefile $TMPDIR/machines --gm-kill 15 -np $NSLOTS ./mpi_main -new 100.0
>
> I see a few errors in the output file of the kind:
>
> Warning: no access to tty (Bad file descriptor).
> Thus no job control in this shell.
>
> These are probably unrelated to the main issue (qdel), though.
>
> Does anyone know how to fix the qdel problem?
>
>
> --
> Vladimir Florinski
> Assistant Research Physicist
> Institute of Geophysics and Planetary Physics
> University of California
> Riverside, CA 92521
> phone: 1-909-787-3943
> fax: 1-909-787-4509
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
> For additional commands, e-mail: users-help at gridengine.sunsource.net
>
>
