[GE users] qdel makes me mad

Xiaoling Yang xuy3 at psu.edu
Tue Aug 7 22:06:14 BST 2007


If you are using Rocks Clusters V4.1, one more step is needed for SGE.

You might see something that looks like a "permission" problem. If this
happens, run qconf -mconf and add

rsh_command                  /usr/bin/ssh
rlogin_command               /usr/bin/ssh

so that qrsh uses ssh instead of rsh.
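
For reference, this is roughly how to make and verify the change; the qconf
commands are standard SGE, but treat the exact paths above as an assumption
for a stock Rocks install:

# open the global configuration in an editor and add the two lines above
qconf -mconf global

# afterwards, check that the entries are in place
qconf -sconf global | grep -E 'rsh_command|rlogin_command'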

Bob

----- Original Message ----- 
From: "Xiaoling Yang" <xuy3 at psu.edu>
To: <users at gridengine.sunsource.net>
Sent: Tuesday, August 07, 2007 8:50 AM
Subject: Re: [GE users] qdel makes me mad


> Hi Reuti,
>
> Yes, you are right. I traced the execution yesterday and found that the 
> MPICH builds shipped with Rocks Clusters (not only for P4 but also for 
> Myrinet) use ssh by default.
>
> I modified my qsub script and added the following lines:
>
> export PATH=$TMPDIR:$PATH (otherwise the system rsh is used instead of the 
> rsh wrapper created by startmpi.sh)
> export RSHCOMMAND=rsh (for Myrinet)
> export P4_RSHCOMMAND=rsh (for P4)
> export MPICH_PROCESS_GROUP=no
>
> For Myrinet, just use mpirun.ch_mx or mpirun.ch_gm instead of mpirun in 
> this script.
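>
> For completeness, a minimal jobscript sketch along these lines; the PE name 
> "mpich" and the executable are just placeholders, and for Myrinet you would 
> call mpirun.ch_gm or mpirun.ch_mx instead:
>
> #!/bin/bash
> #$ -cwd
> #$ -pe mpich 4
> #$ -v MPICH_PROCESS_GROUP=no
>
> # prefer the rsh wrapper that startmpi.sh puts into $TMPDIR
> export PATH=$TMPDIR:$PATH
> export P4_RSHCOMMAND=rsh      # for the P4 device
> export RSHCOMMAND=rsh         # for the Myrinet device
>
> /opt/mpich/gnu/bin/mpirun -np $NSLOTS -machinefile $TMPDIR/machines ./loopcode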
>
> And now, the problem has been fixed.
>
> Thanks, Reuti. I also heard that the new Rocks Clusters release (V4.3) fixes 
> this problem. Is that true?
>
> Bob
>
> ----- Original Message ----- 
> From: "Reuti" <reuti at staff.uni-marburg.de>
> To: <users at gridengine.sunsource.net>
> Sent: Tuesday, August 07, 2007 5:28 AM
> Subject: Re: [GE users] qdel makes me mad
>
>
>> Hi,
>>
>> On 06.08.2007, at 23:34, Xiaoling Yang wrote:
>>
>>> My qdel can only remove the job from the SGE queue, but the job keeps 
>>> running on the compute nodes.
>>>
>>> I tried the solution from Reuti, 
>>> http://gridengine.sunsource.net/howto/mpich-integration.html. Because my 
>>> SGE came with Rocks Clusters V4.2, I just wanted to follow the second 
>>> option, defining MPICH_PROCESS_GROUP=no, to solve this problem. 
>>> Unfortunately, it doesn't seem to work. The "-V" option change for the 
>>> rsh wrapper (/opt/gridengine/mpi/rsh) was already done by Rocks Clusters, 
>>> so all I needed to do was put MPICH_PROCESS_GROUP=no into my environment 
>>> variables. I put "export MPICH_PROCESS_GROUP=no" in my script file; it 
>>> doesn't work. Then I put "#$ -v MPICH_PROCESS_GROUP=no"; it doesn't work 
>>> either. To make sure the "rsh" really was the rsh wrapper, I put 
>>> "export PATH=$TMPDIR:$PATH"; it was of no use at all.
>>>
>>> Using the command "ps f -eo pid,uid,gid,user,pgrp,command --cols=120", I can see:
>>>
>>>   PID   UID   GID USER      PGRP COMMAND
>>>  4856   400   400 sge       4856 /opt/gridengine/bin/lx26-x86/sge_execd
>>> 22954   400   400 sge      22954  \_ sge_shepherd-132 -bg
>>> 22986   500   100 yang     22986      \_ bash /opt/gridengine/default/spool/ecl/job_scripts/132
>>> 22987   500   100 yang     22986          \_ /bin/sh /opt/mpich/gnu/bin/mpirun -np 4 -machinefile /tmp/132.1.all.q/machi
>>> 23127   500   100 yang     22986              \_ /home/yang/testdel/loopcode -p4pg /tmp/132.1.all.q/PIlKtYC23080 -p4wd /
>>> 23128   500   100 yang     22986                  \_ /home/yang/testdel/loopcode -p4pg /tmp/132.1.all.q/PIlKtYC23080 -p4
>>> 23129   500   100 yang     22986                  \_ ssh ecl -l yang -n /home/yang/testdel/loopcode ecl 44726 \-p4amslav
>>> 23259   500   100 yang     22986                  \_ ssh pvfs2-compute-0-1 -l yang -n /home/yang/testdel/loopcode ecl 44
>>> 23261   500   100 yang     22986                  \_ ssh pvfs2-compute-0-0 -l yang -n /home/yang/testdel/loopcode ecl 44
>>>
>>> 16456     0     0 root     16456  \_ sshd: yang [priv]
>>> 16459   500   100 yang     16456  |   \_ sshd: yang@pts/2
>>> 16460   500   100 yang     16460  |       \_ -bash
>>> 17397     0     0 root     17397  \_ sshd: root@pts/3
>>> 17399     0     0 root     17399  |   \_ -bash
>>> 19315     0     0 root     19315  \_ sshd: qiuchu [priv]
>>> 19317   506   100 qiuchu   19315  |   \_ sshd: qiuchu@notty
>>> 19318   506   100 qiuchu   19318  |       \_ /usr/libexec/openssh/sftp-server
>>> 22524     0     0 root     22524  \_ sshd: yang [priv]
>>> 22526   500   100 yang     22524  |   \_ sshd: yang@pts/0
>>> 22527   500   100 yang     22527  |       \_ -bash
>>> 23265   500   100 yang     23265  |           \_ ps f -eo pid,uid,gid,user,pgrp,command --cols=120
>>> 23130     0     0 root     23130  \_ sshd: yang [priv]
>>> 23132   500   100 yang     23130      \_ sshd: yang@notty
>>> 23133   500   100 yang     23133          \_ /home/yang/testdel/loopcode ecl 44726   4amslave -p4yourname ecl -p4rmrank
>>> 23260   500   100 yang     23133              \_ /home/yang/testdel/loopcode ecl 44726   4amslave -p4yourname ecl -p4rmr
>>
>> you are ending up using the default ssh login daemon, and not the one 
>> from SGE. Please try this in the jobscript:
>>
>> export P4_RSHCOMMAND=rsh
>>
>> The other thing, which we discussed in PM: AFAIK Rocks returns the FQDN 
>> for the command `hostname`, hence the job distribution will be wrong and 
>> you have to adjust PeHostfile2MachineFile() in startmpi.sh, changing
>>
>> echo $host
>>
>> to read
>>
>> echo $host.local
>>
>>
>> to change
>>
>> pvfs2-compute-0-1
>>
>> to read
>>
>> pvfs2-compute-0-1.local
>>
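>> A rough sketch of what that function looks like after the change (the 
>> exact wording differs a bit between SGE/Rocks versions, so take it as an 
>> illustration only):
>>
>> PeHostfile2MachineFile()
>> {
>>    cat $1 | while read line; do
>>       # first field of the pe_hostfile is the host, second the slot count
>>       host=`echo $line|cut -f1 -d" "|cut -f1 -d"."`
>>       nslots=`echo $line|cut -f2 -d" "`
>>       i=1
>>       while [ $i -le $nslots ]; do
>>          # append .local so the entry matches the Rocks-internal host name
>>          echo $host.local
>>          i=`expr $i + 1`
>>       done
>>    done
>> }
>>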
>>> The PGRP was changed from 23130 to 23133. Is that the reason I cannot 
>>> delete the running process?
>>
>> Not directly; they should all be children of the sge_execd and the 
>> sge_shepherd.
>>
>>
>>> By the way, will mpirun execute rsh (the rsh wrapper) or ssh to start 
>>> the job on each compute node?
>>
>> If you must use ssh (IMO it's unnecessary in the private subnet of a 
>> cluster), let MPICH still believe it is using rsh; the rsh wrapper of SGE 
>> will then take care of it. Then adjust SGE so that the 
>> "qrsh -inherit ..." uses ssh in the end:
>>
>> http://gridengine.sunsource.net/howto/qrsh_qlogin_ssh.html
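>>
>> (For reference, on that page the change boils down to global configuration 
>> entries along these lines, assuming sshd lives in /usr/sbin; check the 
>> howto for the exact settings for your version:)
>>
>> rsh_command                  /usr/bin/ssh
>> rsh_daemon                   /usr/sbin/sshd -i
>> rlogin_command               /usr/bin/ssh
>> rlogin_daemon                /usr/sbin/sshd -i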
>>
>> -- Reuti
>>

---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
For additional commands, e-mail: users-help at gridengine.sunsource.net



