[GE users] qdel makes me mad

Reuti reuti at staff.uni-marburg.de
Tue Aug 7 10:28:36 BST 2007


Hi,

On 06.08.2007, at 23:34, Xiaoling Yang wrote:

> My qdel only removes the job from the SGE queue, but the job keeps
> running on the compute nodes.
>
> I tried the solution from Reuti, http://gridengine.sunsource.net/howto/mpich-integration.html.
> Because my SGE came with RocksClusters V4.2, I wanted to follow the
> second option, defining MPICH_PROCESS_GROUP=no, to solve this problem.
> Unfortunately, it does not seem to work. The "-V" option change for
> the rsh-wrapper (/opt/gridengine/mpi/rsh) was already done by
> RocksClusters, so all I need to do is put MPICH_PROCESS_GROUP=no into
> my environment. I put "export MPICH_PROCESS_GROUP=no" in my script
> file, and it doesn't work. Then I put "#$ -v MPICH_PROCESS_GROUP=no",
> and it doesn't work either. To make sure the "rsh" being called was
> really the rsh-wrapper, I put export PATH=$TMPDIR:$PATH, but that is
> no use at all.
>
> Using the command "ps f -eo pid,uid,gid,user,pgrp,command --cols=120", I can see:
>
>   PID   UID   GID USER      PGRP COMMAND
>  4856   400   400 sge       4856 /opt/gridengine/bin/lx26-x86/sge_execd
> 22954   400   400 sge      22954  \_ sge_shepherd-132 -bg
> 22986   500   100 yang     22986      \_ bash /opt/gridengine/default/spool/ecl/job_scripts/132
> 22987   500   100 yang     22986          \_ /bin/sh /opt/mpich/gnu/bin/mpirun -np 4 -machinefile /tmp/132.1.all.q/machi
> 23127   500   100 yang     22986              \_ /home/yang/testdel/loopcode -p4pg /tmp/132.1.all.q/PIlKtYC23080 -p4wd /
> 23128   500   100 yang     22986                  \_ /home/yang/testdel/loopcode -p4pg /tmp/132.1.all.q/PIlKtYC23080 -p4
> 23129   500   100 yang     22986                  \_ ssh ecl -l yang -n /home/yang/testdel/loopcode ecl 44726 \-p4amslav
> 23259   500   100 yang     22986                  \_ ssh pvfs2-compute-0-1 -l yang -n /home/yang/testdel/loopcode ecl 44
> 23261   500   100 yang     22986                  \_ ssh pvfs2-compute-0-0 -l yang -n /home/yang/testdel/loopcode ecl 44
>
> 16456     0     0 root     16456  \_ sshd: yang [priv]
> 16459   500   100 yang     16456  |   \_ sshd: yang@pts/2
> 16460   500   100 yang     16460  |       \_ -bash
> 17397     0     0 root     17397  \_ sshd: root@pts/3
> 17399     0     0 root     17399  |   \_ -bash
> 19315     0     0 root     19315  \_ sshd: qiuchu [priv]
> 19317   506   100 qiuchu   19315  |   \_ sshd: qiuchu@notty
> 19318   506   100 qiuchu   19318  |       \_ /usr/libexec/openssh/sftp-server
> 22524     0     0 root     22524  \_ sshd: yang [priv]
> 22526   500   100 yang     22524  |   \_ sshd: yang@pts/0
> 22527   500   100 yang     22527  |       \_ -bash
> 23265   500   100 yang     23265  |           \_ ps f -eo pid,uid,gid,user,pgrp,command --cols=120
> 23130     0     0 root     23130  \_ sshd: yang [priv]
> 23132   500   100 yang     23130      \_ sshd: yang@notty
> 23133   500   100 yang     23133          \_ /home/yang/testdel/loopcode ecl 44726   4amslave -p4yourname ecl -p4rmrank
> 23260   500   100 yang     23133              \_ /home/yang/testdel/loopcode ecl 44726   4amslave -p4yourname ecl -p4rmr

You are ending up using the default (ssh) login daemon, and not the
one from SGE. Please try this in the job script:

export P4_RSHCOMMAND=rsh
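
Just as a sketch of how the whole job script could then look (the PE
name "mpich" is only an assumption; the binary and the MPICH path are
taken from your ps output; $NSLOTS and $TMPDIR/machines are provided
by SGE resp. the mpi PE's startmpi.sh):

#!/bin/sh
#$ -S /bin/sh
#$ -cwd
#$ -pe mpich 4
# make MPICH's ch_p4 startup call plain "rsh", which should then
# resolve to the rsh-wrapper that startmpi.sh copied into $TMPDIR
export P4_RSHCOMMAND=rsh
export MPICH_PROCESS_GROUP=no
/opt/mpich/gnu/bin/mpirun -np $NSLOTS -machinefile $TMPDIR/machines /home/yang/testdel/loopcode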

The other thing, which we discussed in PM: AFAIK ROCKS returns the
FQDN for the command `hostname`, hence the job distribution will be
wrong and you have to adjust, in PeHostfile2MachineFile() in
startmpi.sh,

echo $host

to read

echo $host.local


so that an entry like

pvfs2-compute-0-1

in the generated machine file becomes

pvfs2-compute-0-1.local
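
For reference, the relevant function in /opt/gridengine/mpi/startmpi.sh
would then look roughly like this (a sketch from memory, the details
in your copy may differ):

PeHostfile2MachineFile()
{
   cat $1 | while read line; do
      host=`echo $line | cut -f1 -d" " | cut -f1 -d"."`
      nslots=`echo $line | cut -f2 -d" "`
      i=1
      while [ $i -le $nslots ]; do
         # append the Rocks-internal domain so that the entry matches
         # what `hostname` returns on the node
         echo $host.local      # was: echo $host
         i=`expr $i + 1`
      done
   done
}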

> The PGRP was changed from 23130 to 23133. Is that the reason I can  
> not delete the running process?

Not directly; the point is that they should all be children of
sge_execd and sge_shepherd, whereas in your output the slave
processes hang below sshd, outside of SGE's control.
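
Until the integration works, you can of course get rid of such
leftover slaves by hand by killing their process group, e.g. (23133
taken from your listing; the leading minus addresses the whole group):

kill -TERM -- -23133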


> By the way, will mpirun execute rsh (the rsh-wrapper) or ssh to
> start the job on each compute node?

If you must use ssh (IMO it's unnecessary inside the private subnet
of a cluster), let MPICH still believe it is using rsh; the
rsh-wrapper of SGE will then take care of it. Then adjust SGE so that
the "qrsh -inherit ..." it issues uses ssh in the end:

http://gridengine.sunsource.net/howto/qrsh_qlogin_ssh.html
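
In short (the howto has the details), this means changing the cluster
configuration with `qconf -mconf` to point the rsh/rlogin facilities
to ssh, with paths depending on your installation, e.g.:

rsh_command                  /usr/bin/ssh
rsh_daemon                   /usr/sbin/sshd -i
rlogin_command               /usr/bin/ssh
rlogin_daemon                /usr/sbin/sshd -i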

-- Reuti




