[GE users] qdel makes me mad

Xiaoling Yang xuy3 at psu.edu
Tue Aug 7 13:50:01 BST 2007


Hi Reuti,

Yes, you are right. I traced the execution yesterday and found that the MPICH
builds from RocksClusters (not only the P4 one but also the Myrinet one) use
ssh by default.

I modified my qsub script and added the following lines:

export PATH=$TMPDIR:$PATH      # otherwise the system rsh is used instead of the rsh-wrapper created by startmpi.sh
export RSHCOMMAND=rsh          # for Myrinet
export P4_RSHCOMMAND=rsh       # for P4
export MPICH_PROCESS_GROUP=no

For Myrinet, just use mpirun.ch_mx or mpirun.ch_gm instead of mpirun in this
script; a rough sketch of the complete job script follows below.
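
For reference, here is roughly what the whole job script looks like now. Only
the exports and the mpirun line are the actual fix; the "#$" directives, the
PE name "mpich" and the slot count are just what happens to match my setup and
will differ elsewhere:

    #!/bin/bash
    #$ -S /bin/bash
    #$ -cwd
    #$ -pe mpich 4

    # pick up the rsh-wrapper that startmpi.sh puts into $TMPDIR
    export PATH=$TMPDIR:$PATH
    export RSHCOMMAND=rsh          # Myrinet (ch_gm/ch_mx)
    export P4_RSHCOMMAND=rsh       # P4
    export MPICH_PROCESS_GROUP=no

    # P4 device; startmpi.sh also writes the machine file into $TMPDIR
    /opt/mpich/gnu/bin/mpirun -np $NSLOTS -machinefile $TMPDIR/machines ./loopcode
    # for Myrinet, mpirun.ch_mx or mpirun.ch_gm would be called here instead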

With these changes, the problem is fixed.

Thanks, Reuti. I also heard that the new RocksClusters release (V4.3) fixes
this problem. Is that true?

Bob

----- Original Message ----- 
From: "Reuti" <reuti at staff.uni-marburg.de>
To: <users at gridengine.sunsource.net>
Sent: Tuesday, August 07, 2007 5:28 AM
Subject: Re: [GE users] qdel makes me mad


> Hi,
>
> On 06.08.2007 at 23:34, Xiaoling Yang wrote:
>
>> My qdel can only remove the job from the SGE queue, but the job is still
>> running on the computational nodes.
>>
>> I tried the solution from Reuti,
>> http://gridengine.sunsource.net/howto/mpich-integration.html. Because my
>> SGE came from RocksClusters V4.2, I just wanted to follow the second
>> option, defining MPICH_PROCESS_GROUP=no, to solve this problem.
>> Unfortunately, it does not seem to work. The "-V" option change for the
>> rsh-wrapper (/opt/gridengine/mpi/rsh) was already done by RocksClusters,
>> so all I need to do is put MPICH_PROCESS_GROUP=no in my environment
>> variables. I put "export MPICH_PROCESS_GROUP=no" in my script file; it
>> doesn't work. Then I put "#$ -v MPICH_PROCESS_GROUP=no"; it doesn't work
>> either. To make sure the "rsh" really was the rsh-wrapper, I also put
>> "export PATH=$TMPDIR:$PATH", but it is no use at all.
>>
>> Using the command "ps f -eo pid,uid,gid,user,pgrp,command --cols=120",
>> I can see:
>>
>>   PID   UID   GID USER      PGRP COMMAND
>>  4856   400   400 sge       4856 /opt/gridengine/bin/lx26-x86/sge_execd
>> 22954   400   400 sge      22954  \_ sge_shepherd-132 -bg
>> 22986   500   100 yang     22986      \_ bash /opt/gridengine/default/spool/ecl/job_scripts/132
>> 22987   500   100 yang     22986          \_ /bin/sh /opt/mpich/gnu/bin/mpirun -np 4 -machinefile /tmp/132.1.all.q/machi
>> 23127   500   100 yang     22986              \_ /home/yang/testdel/loopcode -p4pg /tmp/132.1.all.q/PIlKtYC23080 -p4wd /
>> 23128   500   100 yang     22986                  \_ /home/yang/testdel/loopcode -p4pg /tmp/132.1.all.q/PIlKtYC23080 -p4
>> 23129   500   100 yang     22986                  \_ ssh ecl -l yang -n /home/yang/testdel/loopcode ecl 44726 \-p4amslav
>> 23259   500   100 yang     22986                  \_ ssh pvfs2-compute-0-1 -l yang -n /home/yang/testdel/loopcode ecl 44
>> 23261   500   100 yang     22986                  \_ ssh pvfs2-compute-0-0 -l yang -n /home/yang/testdel/loopcode ecl 44
>>
>> 16456     0     0 root     16456  \_ sshd: yang [priv]
>> 16459   500   100 yang     16456  |   \_ sshd: yang@pts/2
>> 16460   500   100 yang     16460  |       \_ -bash
>> 17397     0     0 root     17397  \_ sshd: root@pts/3
>> 17399     0     0 root     17399  |   \_ -bash
>> 19315     0     0 root     19315  \_ sshd: qiuchu [priv]
>> 19317   506   100 qiuchu   19315  |   \_ sshd: qiuchu@notty
>> 19318   506   100 qiuchu   19318  |       \_ /usr/libexec/openssh/sftp-server
>> 22524     0     0 root     22524  \_ sshd: yang [priv]
>> 22526   500   100 yang     22524  |   \_ sshd: yang@pts/0
>> 22527   500   100 yang     22527  |       \_ -bash
>> 23265   500   100 yang     23265  |           \_ ps f -eo pid,uid,gid,user,pgrp,command --cols=120
>> 23130     0     0 root     23130  \_ sshd: yang [priv]
>> 23132   500   100 yang     23130      \_ sshd: yang@notty
>> 23133   500   100 yang     23133          \_ /home/yang/testdel/loopcode ecl 44726 -p4amslave -p4yourname ecl -p4rmrank
>> 23260   500   100 yang     23133              \_ /home/yang/testdel/loopcode ecl 44726 -p4amslave -p4yourname ecl -p4rmr
>
> You are ending up using the default (ssh) login daemon, and not the one
> from SGE. Please try this in the job script:
>
> export P4_RSHCOMMAND=rsh
>
> The other thing, which we discussed in PM: AFAIK ROCKS returns the FQDN
> with the command `hostname`, hence the job distribution will be wrong and
> you have to adjust, in startmpi.sh's PeHostfile2MachineFile(),
>
> echo $host
>
> to read
>
> echo $host.local
>
> to change
>
> pvfs2-compute-0-1
>
> to read
>
> pvfs2-compute-0-1.local
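>
> For orientation, roughly what that function looks like in the stock
> startmpi.sh (from memory, so details may differ in the Rocks copy); the
> only change needed is the echo line:
>
>    PeHostfile2MachineFile()
>    {
>       cat $1 | while read line; do
>          host=`echo $line | cut -f1 -d" " | cut -f1 -d"."`
>          nslots=`echo $line | cut -f2 -d" "`
>          i=1
>          while [ $i -le $nslots ]; do
>             # append the Rocks-internal suffix so MPICH starts on the right node
>             echo $host.local
>             i=`expr $i + 1`
>          done
>       done
>    }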
>
>> The PGRP changed from 23130 to 23133. Is that the reason I cannot delete
>> the running process?
>
> Not directly; they should all be children of the sge_execd and the
> sge_shepherd.
>
>
>> By the way, will mpirun execute rsh (the rsh-wrapper) or ssh to start
>> the job on each compute node?
>
> If you must use ssh (IMO it's unnecessary in a private subnet of a
> cluster), keep MPICH believing that it is using rsh; the rsh-wrapper of
> SGE will then take care of it. Then adjust SGE so that the "qrsh -inherit
> ..." calls use ssh in the end:
>
> http://gridengine.sunsource.net/howto/qrsh_qlogin_ssh.html
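>
> In short, that howto points SGE's remote startup at ssh in the cluster
> configuration, roughly like this (via qconf -mconf; the paths assume a
> standard OpenSSH install and may differ on your nodes):
>
>    rsh_command      /usr/bin/ssh
>    rsh_daemon       /usr/sbin/sshd -i
>
> Then the "qrsh -inherit ..." issued by the rsh-wrapper starts the remote
> processes under an sshd that is itself a child of the sge_shepherd, so
> qdel can still clean them up.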
>
> -- Reuti
>

---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
For additional commands, e-mail: users-help at gridengine.sunsource.net
