[GE users] qdel makes me mad

Xiaoling Yang xuy3 at psu.edu
Mon Aug 6 22:34:00 BST 2007



Hi,

My qdel only removes the job from the SGE queue; the job's processes keep running on the compute nodes.
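
Here is a minimal way to reproduce what I see (132 is the test job from the listing below):

    qdel 132                        # the job leaves the queue
    qstat -u yang                   # confirms it is no longer listed
    ps f -u yang -o pid,pgrp,cmd    # but the loopcode processes remain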

I tried the solution from Reuti's howto, http://gridengine.sunsource.net/howto/mpich-integration.html. Since my SGE came with RocksClusters V4.2, I wanted to follow the second option there: define MPICH_PROCESS_GROUP=no. Unfortunately, it does not seem to work. The "-V" option change to the rsh-wrapper (/opt/gridengine/mpi/rsh) was already done by RocksClusters, so all I should need to do is put MPICH_PROCESS_GROUP=no into the job's environment. I put "export MPICH_PROCESS_GROUP=no" in my script file; it doesn't work. I then tried "#$ -v MPICH_PROCESS_GROUP=no"; that doesn't work either. To make sure the "rsh" being called was really the rsh-wrapper, I also added "export PATH=$TMPDIR:$PATH", but that made no difference.
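
For reference, my submit script looks roughly like this (reconstructed from memory, so the PE name and some mpirun arguments may not be exact; loopcode is just a small test program):

    #!/bin/bash
    #$ -S /bin/bash
    #$ -cwd
    #$ -pe mpich 4
    #$ -v MPICH_PROCESS_GROUP=no     # second attempt: pass it via SGE
    export MPICH_PROCESS_GROUP=no    # first attempt: export in the script
    export PATH=$TMPDIR:$PATH        # make sure the rsh-wrapper is found first
    /opt/mpich/gnu/bin/mpirun -np 4 -machinefile $TMPDIR/machines \
        /home/yang/testdel/loopcode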

Using the command "ps f -eo pid,uid,gid,user,pgrp,command --cols=120", I can see:

  PID   UID   GID USER      PGRP COMMAND
  4856   400   400 sge       4856 /opt/gridengine/bin/lx26-x86/sge_execd
22954   400   400 sge      22954  \_ sge_shepherd-132 -bg
22986   500   100 yang     22986      \_ bash /opt/gridengine/default/spool/ecl/job_scripts/132
22987   500   100 yang     22986          \_ /bin/sh /opt/mpich/gnu/bin/mpirun -np 4 -machinefile /tmp/132.1.all.q/machi
23127   500   100 yang     22986              \_ /home/yang/testdel/loopcode -p4pg /tmp/132.1.all.q/PIlKtYC23080 -p4wd /
23128   500   100 yang     22986                  \_ /home/yang/testdel/loopcode -p4pg /tmp/132.1.all.q/PIlKtYC23080 -p4
23129   500   100 yang     22986                  \_ ssh ecl -l yang -n /home/yang/testdel/loopcode ecl 44726 \-p4amslav
23259   500   100 yang     22986                  \_ ssh pvfs2-compute-0-1 -l yang -n /home/yang/testdel/loopcode ecl 44
23261   500   100 yang     22986                  \_ ssh pvfs2-compute-0-0 -l yang -n /home/yang/testdel/loopcode ecl 44

16456     0     0 root     16456  \_ sshd: yang [priv]
16459   500   100 yang     16456  |   \_ sshd: yang at pts/2 
16460   500   100 yang     16460  |       \_ -bash
17397     0     0 root     17397  \_ sshd: root at pts/3 
17399     0     0 root     17399  |   \_ -bash
19315     0     0 root     19315  \_ sshd: qiuchu [priv]
19317   506   100 qiuchu   19315  |   \_ sshd: qiuchu at notty
19318   506   100 qiuchu   19318  |       \_ /usr/libexec/openssh/sftp-server
22524     0     0 root     22524  \_ sshd: yang [priv]
22526   500   100 yang     22524  |   \_ sshd: yang at pts/0 
22527   500   100 yang     22527  |       \_ -bash
23265   500   100 yang     23265  |           \_ ps f -eo pid,uid,gid,user,pgrp,command --cols=120
23130     0     0 root     23130  \_ sshd: yang [priv]
23132   500   100 yang     23130      \_ sshd: yang at notty 
23133   500   100 yang     23133          \_ /home/yang/testdel/loopcode ecl 44726   4amslave -p4yourname ecl -p4rmrank 
23260   500   100 yang     23133              \_ /home/yang/testdel/loopcode ecl 44726   4amslave -p4yourname ecl -p4rmr

The PGRP changes from 23130 to 23133 there. Is that the reason I cannot kill the running processes?
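
If I understand the kill mechanism correctly, qdel ends up signalling the job's process group, conceptually like this (PIDs taken from the listing above):

    # conceptually what happens on qdel: the shepherd signals the
    # job's process group, which reaches everything with PGRP 22986
    kill -TERM -- -22986
    # but 23133 and 23260 run under sshd in their own group (PGRP 23133),
    # outside the shepherd's tree, so they never see the signal;
    # only this would reach them:
    kill -TERM -- -23133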

By the way, does mpirun use rsh (the rsh-wrapper) or ssh to start the processes on each compute node?
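
One thing I still plan to try, assuming this is a ch_p4 MPICH build where the P4_RSHCOMMAND environment variable selects the remote shell:

    # see which remote shell the mpirun script was configured with
    grep -i rshcommand /opt/mpich/gnu/bin/mpirun
    # and/or point ch_p4 at the SGE rsh-wrapper instead of ssh
    export P4_RSHCOMMAND=$TMPDIR/rsh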

Thanks for any suggestions.

Bob


