[GE users] Mvapich processes not killed on qdel with mpirun_ssh

Hari Prasad m.hariprasad at yahoo.co.in
Fri Jun 1 12:53:14 BST 2007



Hi,

We have SGE 6.0u8 on a Rocks 4.2.1 cluster with
InfiniBand and the Topspin roll (which includes
mvapich).

SGE is tightly integrated with mvapich as described in
the HOWTOs, but we are using mpirun_ssh from the Topspin
mpich (/usr/local/topspin/mpi/mpich/mpirun_ssh -np 8 xhpl).



qdel is not killing the child processes on the compute
nodes. We are getting a bunch of "connection refused"
errors with mpirun_rsh.

I have read a mail about "Mvapich processes not
killed on qdel", saying that SGE 6.1 will resolve this
issue even if I use mpirun_ssh.

Can I have some suggestions that work with SGE 6.0u8?

I have enclosed the output (submit script, master ps,
compute ps).

Any suggestions on what I should look for?


Thanks & Regards,
Hari 





    [ Part 2: "2189029774-topspin.doc.txt" ]

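#!/bin/bash
# Request 8 slots in the "mpich" parallel environment
#$ -pe mpich 8
# Run the job from the submission directory
#$ -cwd
#$ -N topspin
# Pass these variables into the job environment
#$ -v P4_GLOBMEMSIZE
#$ -v MPI_PROCESS_GROUP=no

# $TMPDIR/machines is the host list for this job
/usr/local/topspin/mpi/mpich/bin/mpirun_ssh -ssh -np 8 -machinefile $TMPDIR/machines /home/test/xhpl.mpich.ib.mkl.icc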


    [ Part 3: "3478744173-master.doc.txt" ]

c16-15:   PID  PPID  PGRP CMD
    1     0     0 init [3]                               
    2     1     0 [migration/0]
    3     1     0 [ksoftirqd/0]
    4     1     0 [migration/1]
    5     1     0 [ksoftirqd/1]
    6     1     0 [events/0]
    8     6     0  \_ [khelper]
    9     6     0  \_ [kacpid]
   38     6     0  \_ [kblockd/0]
   39     6     0  \_ [kblockd/1]
   63     6     0  \_ [pdflush]
   64     6     0  \_ [pdflush]
   66     6     0  \_ [aio/0]
   67     6     0  \_ [aio/1]
  320     6     0  \_ [ata/0]
  321     6     0  \_ [ata/1]
 2262     6     0  \_ [kmirrord]
    7     1     0 [events/1]
 2161     7     0  \_ [kauditd]
   40     1     1 [khubd]
   65     1     1 [kswapd0]
  211     1     1 [kseriod]
  325     1     1 [scsi_eh_0]
  342     1     1 [kjournald]
 1525     1  1525 udevd
 1583     1     1 [ts_poll]
 1610     1     1 [cleanup_thread]
 1625     1     1 [ts_ib_completio]
 1626     1     1 [ts_ib_async]
 1627     1     1 [ts_ib_mad]
 1688     1     1 [ts_fmr]
 2281     1     1 [kjournald]
 2282     1     1 [kjournald]
 2830     1  2830 cpuspeed -d -n
 2831  2830  2830  \_ [cpuspeed] <defunct>
 3183     1  3183 /opt/rocks/bin/python /opt/rocks/bin/greceptor
 3193     1  3193 syslogd -m 0
 3198     1  3198 klogd -x
 3208     1  3208 irqbalance
 3217     1  3217 portmap
 3236     1  3236 rpc.statd
 3264     1  3264 rpc.idmapd
 3312     1     1 [ts_fmr]
 3313     1     1 [ts_srp_dm]
 3314     1     1 [scsi_eh_1]
 3471     1  3471 /usr/sbin/automount --timeout=1200 /share file /etc/auto.share
 3511     1  3511 /usr/sbin/automount --timeout=1200 /home file /etc/auto.home
 3525     1  3524 /usr/sbin/smartd
 3534     1  3534 /usr/sbin/acpid
 3544     1  3543 /usr/sbin/snmpd -Lsd -Lf /dev/null -p /var/run/snmpd -a
 3587     1  3587 /usr/sbin/sshd
11692  3587 11692  \_ sshd: test [priv]
11696 11692 11692  |   \_ sshd: test@notty
11697 11696 11697  |       \_ bash -c cd /home/test; /usr/bin/env MPIRUN_MPD=0 MPIRUN_HOST=compute-16-15.local MPIRUN_PORT=32820 MPIRUN_PROCESSES='compute-16-15:compute-16-15:compute-16-16:compute-16-16:compute-16-12:compute-16-12:compute-16-5:compute-16-5:' MPIRUN_RANK=1 MPIRUN_NPROCS=8 MPIRUN_ID=11683      /home/test/xhpl.mpich.ib.mkl.icc  
11803 11697 11697  |           \_ /home/test/xhpl.mpich.ib.mkl.icc
11693  3587 11693  \_ sshd: test [priv]
11720 11693 11693  |   \_ sshd: test@notty
11721 11720 11721  |       \_ bash -c cd /home/test; /usr/bin/env MPIRUN_MPD=0 MPIRUN_HOST=compute-16-15.local MPIRUN_PORT=32820 MPIRUN_PROCESSES='compute-16-15:compute-16-15:compute-16-16:compute-16-16:compute-16-12:compute-16-12:compute-16-5:compute-16-5:' MPIRUN_RANK=0 MPIRUN_NPROCS=8 MPIRUN_ID=11683      /home/test/xhpl.mpich.ib.mkl.icc  
11825 11721 11721  |           \_ /home/test/xhpl.mpich.ib.mkl.icc
11985  3587 11985  \_ sshd: root@notty
11987 11985 11987      \_ ps -e f -o pid,ppid,pgrp,cmd
 3600     1  3600 xinetd -stayalive -pidfile /var/run/xinetd.pid
 3627     1  3626 /usr/sbin/gmond
 3681     1  3681 /usr/libexec/postfix/master
 3692  3681  3681  \_ qmgr -l -t fifo -u
11532  3681  3681  \_ pickup -l -t fifo -u
 3693     1  3693 /usr/sbin/httpd
 3737  3693  3693  \_ /usr/sbin/httpd
 3738  3693  3693  \_ /usr/sbin/httpd
 3739  3693  3693  \_ /usr/sbin/httpd
 3740  3693  3693  \_ /usr/sbin/httpd
 3741  3693  3693  \_ /usr/sbin/httpd
 3742  3693  3693  \_ /usr/sbin/httpd
 3743  3693  3693  \_ /usr/sbin/httpd
 3744  3693  3693  \_ /usr/sbin/httpd
 3702     1  3702 crond
 3719     1  3719 /usr/sbin/atd
 3728     1  3728 dbus-daemon-1 --system
 3745     1  3745 hald
 3848     1  3848 /opt/gridengine/bin/lx26-amd64/sge_execd
11576  3848 11576  \_ sge_shepherd-165 -bg
11623 11576 11623      \_ -csh /opt/gridengine/default/spool/compute-16-15/job_scripts/165
11683 11623 11623          \_ /usr/local/topspin/mpi/mpich/bin/mpirun_ssh -ssh -np 8 -machinefile /tmp/165.1.all.q/machines /home/test/xhpl.mpich.ib.mkl.icc
11684 11683 11623              \_ /usr/bin/ssh -q compute-16-15 cd /home/test; /usr/bin/env MPIRUN_MPD=0 MPIRUN_HOST=compute-16-15.local MPIRUN_PORT=32820 MPIRUN_PROCESSES='compute-16-15:compute-16-15:compute-16-16:compute-16-16:compute-16-12:compute-16-12:compute-16-5:compute-16-5:' MPIRUN_RANK=0 MPIRUN_NPROCS=8 MPIRUN_ID=11683      /home/test/xhpl.mpich.ib.mkl.icc  
11685 11683 11623              \_ /usr/bin/ssh -q compute-16-15 cd /home/test; /usr/bin/env MPIRUN_MPD=0 MPIRUN_HOST=compute-16-15.local MPIRUN_PORT=32820 MPIRUN_PROCESSES='compute-16-15:compute-16-15:compute-16-16:compute-16-16:compute-16-12:compute-16-12:compute-16-5:compute-16-5:' MPIRUN_RANK=1 MPIRUN_NPROCS=8 MPIRUN_ID=11683      /home/test/xhpl.mpich.ib.mkl.icc  
11686 11683 11623              \_ /usr/bin/ssh -q compute-16-16 cd /home/test; /usr/bin/env MPIRUN_MPD=0 MPIRUN_HOST=compute-16-15.local MPIRUN_PORT=32820 MPIRUN_PROCESSES='compute-16-15:compute-16-15:compute-16-16:compute-16-16:compute-16-12:compute-16-12:compute-16-5:compute-16-5:' MPIRUN_RANK=2 MPIRUN_NPROCS=8 MPIRUN_ID=11683      /home/test/xhpl.mpich.ib.mkl.icc  
11687 11683 11623              \_ /usr/bin/ssh -q compute-16-16 cd /home/test; /usr/bin/env MPIRUN_MPD=0 MPIRUN_HOST=compute-16-15.local MPIRUN_PORT=32820 MPIRUN_PROCESSES='compute-16-15:compute-16-15:compute-16-16:compute-16-16:compute-16-12:compute-16-12:compute-16-5:compute-16-5:' MPIRUN_RANK=3 MPIRUN_NPROCS=8 MPIRUN_ID=11683      /home/test/xhpl.mpich.ib.mkl.icc  
11688 11683 11623              \_ /usr/bin/ssh -q compute-16-12 cd /home/test; /usr/bin/env MPIRUN_MPD=0 MPIRUN_HOST=compute-16-15.local MPIRUN_PORT=32820 MPIRUN_PROCESSES='compute-16-15:compute-16-15:compute-16-16:compute-16-16:compute-16-12:compute-16-12:compute-16-5:compute-16-5:' MPIRUN_RANK=4 MPIRUN_NPROCS=8 MPIRUN_ID=11683      /home/test/xhpl.mpich.ib.mkl.icc  
11689 11683 11623              \_ /usr/bin/ssh -q compute-16-12 cd /home/test; /usr/bin/env MPIRUN_MPD=0 MPIRUN_HOST=compute-16-15.local MPIRUN_PORT=32820 MPIRUN_PROCESSES='compute-16-15:compute-16-15:compute-16-16:compute-16-16:compute-16-12:compute-16-12:compute-16-5:compute-16-5:' MPIRUN_RANK=5 MPIRUN_NPROCS=8 MPIRUN_ID=11683      /home/test/xhpl.mpich.ib.mkl.icc  
11690 11683 11623              \_ /usr/bin/ssh -q compute-16-5 cd /home/test; /usr/bin/env MPIRUN_MPD=0 MPIRUN_HOST=compute-16-15.local MPIRUN_PORT=32820 MPIRUN_PROCESSES='compute-16-15:compute-16-15:compute-16-16:compute-16-16:compute-16-12:compute-16-12:compute-16-5:compute-16-5:' MPIRUN_RANK=6 MPIRUN_NPROCS=8 MPIRUN_ID=11683      /home/test/xhpl.mpich.ib.mkl.icc  
11691 11683 11623              \_ /usr/bin/ssh -q compute-16-5 cd /home/test; /usr/bin/env MPIRUN_MPD=0 MPIRUN_HOST=compute-16-15.local MPIRUN_PORT=32820 MPIRUN_PROCESSES='compute-16-15:compute-16-15:compute-16-16:compute-16-16:compute-16-12:compute-16-12:compute-16-5:compute-16-5:' MPIRUN_RANK=7 MPIRUN_NPROCS=8 MPIRUN_ID=11683      /home/test/xhpl.mpich.ib.mkl.icc  
 3852     1  3852 /sbin/mingetty tty1
 3854     1  3854 /sbin/mingetty tty2
 3855     1  3855 /sbin/mingetty tty3
 3856     1  3856 /sbin/mingetty tty4
 3857     1  3857 /sbin/mingetty tty5
 3858     1  3858 /sbin/mingetty tty6
11580     1     1 [rpciod]
11581     1     1 [lockd]
11850     1 11850 ntpd -A -u ntp:ntp -p /var/run/ntpd.pid




    [ Part 4: "3803961722-compute-16-5.doc.txt" ]

c16-5:   PID  PPID  PGRP CMD
    1     0     0 init [3]                               
    2     1     0 [migration/0]
    3     1     0 [ksoftirqd/0]
    4     1     0 [migration/1]
    5     1     0 [ksoftirqd/1]
    6     1     0 [events/0]
    8     6     0  \_ [khelper]
    9     6     0  \_ [kacpid]
   38     6     0  \_ [kblockd/0]
   39     6     0  \_ [kblockd/1]
   63     6     0  \_ [pdflush]
   64     6     0  \_ [pdflush]
   66     6     0  \_ [aio/0]
   67     6     0  \_ [aio/1]
  320     6     0  \_ [ata/0]
  321     6     0  \_ [ata/1]
    7     1     0 [events/1]
 2141     7     0  \_ [kauditd]
 2259     7     0  \_ [kmirrord]
   40     1     1 [khubd]
   65     1     1 [kswapd0]
  211     1     1 [kseriod]
  325     1     1 [scsi_eh_0]
  342     1     1 [kjournald]
 1531     1  1531 udevd
 1583     1     1 [ts_poll]
 1610     1     1 [cleanup_thread]
 1625     1     1 [ts_ib_completio]
 1626     1     1 [ts_ib_async]
 1627     1     1 [ts_ib_mad]
 1688     1     1 [ts_fmr]
 2278     1     1 [kjournald]
 2279     1     1 [kjournald]
 2822     1  2822 cpuspeed -d -n
 2823  2822  2822  \_ [cpuspeed] <defunct>
 3175     1  3175 /opt/rocks/bin/python /opt/rocks/bin/greceptor
 3186     1  3186 syslogd -m 0
 3190     1  3190 klogd -x
 3200     1  3200 irqbalance
 3209     1  3209 portmap
 3228     1  3228 rpc.statd
 3256     1  3256 rpc.idmapd
 3304     1     1 [ts_fmr]
 3305     1     1 [ts_srp_dm]
 3306     1     1 [scsi_eh_1]
 3469     1  3469 /usr/sbin/automount --timeout=1200 /share file /etc/auto.share
 3503     1  3503 /usr/sbin/automount --timeout=1200 /home file /etc/auto.home
 3517     1  3516 /usr/sbin/smartd
 3526     1  3526 /usr/sbin/acpid
 3536     1  3535 /usr/sbin/snmpd -Lsd -Lf /dev/null -p /var/run/snmpd -a
 3579     1  3579 /usr/sbin/sshd
11361  3579 11361  \_ sshd: test [priv]
11370 11361 11361  |   \_ sshd: test@notty
11372 11370 11372  |       \_ bash -c cd /home/test; /usr/bin/env MPIRUN_MPD=0 MPIRUN_HOST=compute-16-15.local MPIRUN_PORT=32820 MPIRUN_PROCESSES='compute-16-15:compute-16-15:compute-16-16:compute-16-16:compute-16-12:compute-16-12:compute-16-5:compute-16-5:' MPIRUN_RANK=6 MPIRUN_NPROCS=8 MPIRUN_ID=11683      /home/test/xhpl.mpich.ib.mkl.icc  
11488 11372 11372  |           \_ /home/test/xhpl.mpich.ib.mkl.icc
11362  3579 11362  \_ sshd: test [priv]
11369 11362 11362  |   \_ sshd: test@notty
11371 11369 11371  |       \_ bash -c cd /home/test; /usr/bin/env MPIRUN_MPD=0 MPIRUN_HOST=compute-16-15.local MPIRUN_PORT=32820 MPIRUN_PROCESSES='compute-16-15:compute-16-15:compute-16-16:compute-16-16:compute-16-12:compute-16-12:compute-16-5:compute-16-5:' MPIRUN_RANK=7 MPIRUN_NPROCS=8 MPIRUN_ID=11683      /home/test/xhpl.mpich.ib.mkl.icc  
11498 11371 11371  |           \_ /home/test/xhpl.mpich.ib.mkl.icc
11658  3579 11658  \_ sshd: root@notty
11660 11658 11660      \_ ps -e f -o pid,ppid,pgrp,cmd
 3592     1  3592 xinetd -stayalive -pidfile /var/run/xinetd.pid
 3619     1  3618 /usr/sbin/gmond
 3673     1  3673 /usr/libexec/postfix/master
 3682  3673  3673  \_ qmgr -l -t fifo -u
11321  3673  3673  \_ pickup -l -t fifo -u
 3685     1  3685 /usr/sbin/httpd
 3720  3685  3685  \_ /usr/sbin/httpd
 3721  3685  3685  \_ /usr/sbin/httpd
 3722  3685  3685  \_ /usr/sbin/httpd
 3723  3685  3685  \_ /usr/sbin/httpd
 3724  3685  3685  \_ /usr/sbin/httpd
 3725  3685  3685  \_ /usr/sbin/httpd
 3726  3685  3685  \_ /usr/sbin/httpd
 3727  3685  3685  \_ /usr/sbin/httpd
 3694     1  3694 crond
 3711     1  3711 /usr/sbin/atd
 3728     1  3728 dbus-daemon-1 --system
 3737     1  3737 hald
 3826     1  3826 /opt/gridengine/bin/lx26-amd64/sge_execd
 3829     1  3829 /sbin/mingetty tty1
 3830     1  3830 /sbin/mingetty tty2
 3831     1  3831 /sbin/mingetty tty3
 3832     1  3832 /sbin/mingetty tty4
 3833     1  3833 /sbin/mingetty tty5
 3834     1  3834 /sbin/mingetty tty6
11367     1     1 [rpciod]
11368     1     1 [lockd]
11522     1 11522 ntpd -A -u ntp:ntp -p /var/run/ntpd.pid -g
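
In both listings the xhpl processes hang off /usr/sbin/sshd, while mpirun_ssh itself sits under sge_execd via sge_shepherd-165. That would be consistent with qdel not reaching the ranks, although the listings alone do not prove it. The following is only a rough sketch for checking this on other nodes: it reads ps output in the same format as above and prints, for each process whose command name matches a pattern, the daemon directly under init that it descends from. The function name top_ancestor and its pattern argument are made up for illustration and are not part of SGE or mvapich.

```shell
#!/bin/bash
# Hypothetical helper (name and interface are illustrative only).
# Reads `ps -e f -o pid,ppid,pgrp,cmd` output on stdin and prints
# "<pid> <top-level ancestor command>" for every process whose
# command name matches the regular expression given as $1.
top_ancestor() {
    awk -v pat="$1" '
        # Only lines that start with a numeric PID are process entries.
        $1 ~ /^[0-9]+$/ && NF >= 4 {
            i = 4
            # Skip the tree-drawing fields ("|" and "\_") added by ps f.
            while (i <= NF && ($i == "|" || $i == "\\_")) i++
            if (i <= NF) {
                ppid[$1] = $2
                cmd[$1]  = $i      # first word of the command line
            }
        }
        END {
            for (p in cmd) {
                if (cmd[p] !~ pat) continue
                a = p
                # Walk up the parent chain until the parent is init (PID 1)
                # or is missing from the listing.
                while ((ppid[a] in cmd) && ppid[a] + 0 > 1) a = ppid[a]
                print p, cmd[a]
            }
        }
    ' | sort -n
}

# Example use on a node (not executed here):
#   ps -e f -o pid,ppid,pgrp,cmd | top_ancestor xhpl
```

If the ranks were started under SGE's control, one would expect the second column to show sge_execd rather than sshd.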




