[GE users] Qdel problem

Liang Ge liang.ge at gmail.com
Tue Oct 3 20:25:33 BST 2006



On 10/3/06, Reuti <reuti at staff.uni-marburg.de> wrote:
> Hi,
>
> On 03.10.2006 at 21:06, Liang Ge wrote:
>
> > Here is my script:
> > ---------------------------------------------------------
> > #!/bin/bash
> >
> > #$ -S /bin/bash
> > #$ -pe mpich 8
> > #$ -o temp
> >
> > cd $SGE_O_WORKDIR
> > export MPICH_PROCESS_GROUP=no
> >
> > /opt/mpich-mx.gcc/bin/mpirun -machinefile $TMPDIR/machines -np $NSLOTS \
> >     $SGE_O_WORKDIR/testt
> > ----------------------------------------------------
> >
> > I can successfully submit the job with qsub. But when I try to remove
> > it with qdel, only one process is killed and the remaining 7 processes
> > keep running.
> >
> > I tried solutions #2 and #3 as described on the web page by Reuti (I
> > couldn't follow #1), namely changing the rsh_wrapper and recompiling
> > mpich from patched source code. I still got the same result as
> > before: qdel only kills one process.
> >
>
> We have no Myrinet, but I heard that #1 is no longer necessary,
> as the scripts provided by Myrinet changed. One point to look at is
> how the slave processes are started: is it done by rsh or ssh? This
> might happen in one of the follow-up scripts called by your mpirun command.
>
> Can you check a running program with "ps -e f" to have a look at the
> process tree: are all processes bound to sge_shepherd on the slaves?

I think the answer is yes. Here is the output of "ps -e f":
 1565 ?        S      0:33 /opt/sge/bin/lx24-amd64/sge_execd
 5575 ?        S      0:00  \_ sge_shepherd-833 -bg
 5584 ?        Ss     0:00      \_ bash /opt/sge/default/spool/node0046/job_scripts/833
 5585 ?        S      0:00          \_ perl -S -w /opt/mpich-mx.gcc/bin/mpirun.ch_mx.pl --mx-kill 5 -np 8 -machinefile /opt/sge/tmp/833.1.all.q/machines /home
 5614 ?        S      0:00              \_ perl -S -w /opt/mpich-mx.gcc/bin/mpirun.ch_mx.pl --mx-kill 5 -np 8 -machinefile /opt/sge/tmp/833.1.all.q/machines /
 5615 ?        S      0:24              \_ rsh node0008 cd /home/lg65/JCP/DrivenCavity/Re1000/2D_33 && exec env MXMPI_MASTER=node0046 MXMPI_PORT=50349 MX_DIS
 5667 ?        Z      0:00              |   \_ [rsh] <defunct>
 5616 ?        S      0:00              \_ rsh node0008 -n cd /home/lg65/JCP/DrivenCavity/Re1000/2D_33 && exec env MXMPI_MASTER=node0046 MXMPI_PORT=50349 MX_
 5617 ?        S      0:00              \_ rsh node0008 -n cd /home/lg65/JCP/DrivenCavity/Re1000/2D_33 && exec env MXMPI_MASTER=node0046 MXMPI_PORT=50349 MX_
 5618 ?        S      0:00              \_ rsh node0008 -n cd /home/lg65/JCP/DrivenCavity/Re1000/2D_33 && exec env MXMPI_MASTER=node0046 MXMPI_PORT=50349 MX_
 5619 ?        S      0:00              \_ rsh node0046 -n cd /home/lg65/JCP/DrivenCavity/Re1000/2D_33 && exec env MXMPI_MASTER=node0046 MXMPI_PORT=50349 MX_
 5620 ?        S      0:00              \_ rsh node0046 -n cd /home/lg65/JCP/DrivenCavity/Re1000/2D_33 && exec env MXMPI_MASTER=node0046 MXMPI_PORT=50349 MX_
 5621 ?        S      0:00              \_ rsh node0046 -n cd /home/lg65/JCP/DrivenCavity/Re1000/2D_33 && exec env MXMPI_MASTER=node0046 MXMPI_PORT=50349 MX_
 5622 ?        S      0:00              \_ rsh node0046 -n cd /home/lg65/JCP/DrivenCavity/Re1000/2D_33 && exec env MXMPI_MASTER=node0046 MXMPI_PORT=50349 MX_
 1587 ?        Ss     0:00 /usr/sbin/sshd
 1602 ?        Ss     0:00 xinetd -stayalive -pidfile /var/run/xinetd.pid
 5623 ?        Ss     0:00  \_ in.rshd
 5627 ?        Rl    33:49  |   \_ /home/lg65/JCP/DrivenCavity/Re1000/2D_33/testt
 5624 ?        Ss     0:00  \_ in.rshd
 5636 ?        Rl    33:15  |   \_ /home/lg65/JCP/DrivenCavity/Re1000/2D_33/testt
 5625 ?        Ss     0:00  \_ in.rshd
 5628 ?        Rl    33:49  |   \_ /home/lg65/JCP/DrivenCavity/Re1000/2D_33/testt
 5626 ?        Ss     0:00  \_ in.rshd
 5629 ?        Rl    33:49  |   \_ /home/lg65/JCP/DrivenCavity/Re1000/2D_33/testt
 5792 ?        Ss     0:00  \_ in.rlogind
 5793 ?        Ss     0:00      \_ login -- lg65
 5794 pts/0    Ss     0:00          \_ -bash
 5853 pts/0    R+     0:00              \_ ps -e f
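
To answer the "are all bound to sge_shepherd" question without reading the tree by eye, something like the following rough sketch might help. It is only an illustration, not part of SGE or MPICH: the function name orphans is made up here, and it assumes its input is lines of "PID PPID COMMAND". It prints the PID of every process with the given name that has no ancestor whose command name starts with the given prefix.

```shell
#!/bin/bash
# orphans NAME ANCESTOR  (hypothetical helper, not an SGE tool)
# Reads "PID PPID COMMAND" lines on stdin and prints, in numeric order,
# the PID of every process named NAME whose parent chain contains no
# process whose command name starts with ANCESTOR.
orphans() {
    awk -v name="$1" -v anc="$2" '
        { ppid[$1] = $2; comm[$1] = $3 }     # remember parent and name per PID
        END {
            for (p in comm) {
                if (comm[p] != name) continue
                found = 0
                hops = 0
                q = ppid[p]
                # Walk up the parent chain until we leave the known PIDs;
                # the hop limit is only a guard against malformed input.
                while ((q in comm) && hops++ < 1000) {
                    if (index(comm[q], anc) == 1) { found = 1; break }
                    q = ppid[q]
                }
                if (!found) print p
            }
        }' | sort -n
}
```

On an execution host it could be fed with "ps -e -o pid=,ppid=,comm= | orphans testt sge_shepherd"; any PID it prints is a testt process that is not running underneath an sge_shepherd.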

>
> -- Reuti
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
> For additional commands, e-mail: users-help at gridengine.sunsource.net
>
>
