[GE users] tight integration problem

Jean-Paul Minet minet at cism.ucl.ac.be
Wed Jan 25 14:42:52 GMT 2006



Reuti,

I am totally lost with this tight integration...

1) As root, if I pass the -nolocal flag to mpirun, I end up with the
following processes on the "master node":

root      5349  5326  0 14:27 ?        00:00:00 bash /var/spool/sge/lmexec-121/job_scripts/2375
root      5352  5349  0 14:27 ?        00:00:00 /bin/sh /usr/local/mpich-eth-intel/bin/mpirun -nolocal -np 2 -machinefile /tmp/2375.1.all.q/machines abinip_eth
root      5446  5352  0 14:27 ?        00:00:00 /gridware/sge/bin/lx24-amd64/qrsh -inherit lmexec-62 /home/pan/minet/abinit/parallel_eth/abinip_eth -p4pg /home/
root      5454  5446  0 14:27 ?        00:00:00 /gridware/sge/utilbin/lx24-amd64/rsh -p 32816 lmexec-62 exec '/gridware/sge/utilbin/lx24-amd64/qrsh_starter' '/v
root      5455  5454  0 14:27 ?        00:00:00 [rsh] <defunct>

and on the slave node:

sgeadmin 14300  3464  0 14:27 ?        00:00:00 sge_shepherd-2375 -bg
root     14301 14300  0 14:27 ?        00:00:00 /gridware/sge/utilbin/lx24-amd64/rshd -l
root     14302 14301  0 14:27 ?        00:00:00 /gridware/sge/utilbin/lx24-amd64/qrsh_starter /var/spool/sge/lmexec-62/active_jobs/2375.1/1.lmexec-62
root     14303 14302 27 14:27 ?        00:00:31 /home/pan/minet/abinit/parallel_eth/abinip_eth -p4pg /home/pan/minet/abinit/parallel_eth/PI5352 -p4wd /home/pan/minet/abinit/paralle
root     14304 14303  0 14:27 ?        00:00:00 /home/pan/minet/abinit/parallel_eth/abinip_eth -p4pg /home/pan/minet/abinit/parallel_eth/PI5352 -p4wd /home/pan/minet/abinit/paralle
root     14305 14303  0 14:27 ?        00:00:00 /usr/bin/rsh lmexec-62 -l root -n /home/pan/minet/abinit/parallel_eth/abinip_eth lmexec-62 32819 \-p4amslave \-p4yourname lmexec-62
root     14306  3995  0 14:27 ?        00:00:00 in.rshd -aL
root     14307 14306 86 14:27 ?        00:01:38 /home/pan/minet/abinit/parallel_eth/abinip_eth lmexec-62 32819 -p4amslave -p4yourname lmexec-62 -p4rmrank 1
root     14357 14307  0 14:27 ?        00:00:00 /home/pan/minet/abinit/parallel_eth/abinip_eth lmexec-62 32819 -p4amslave -p4yourname lmexec-62 -p4rmrank 1

So I can see that, in a way, the SGE qrsh/rsh/qrsh_starter are coming into play
and that the sge_shepherd is initiating the remote process.  Nevertheless:

- as expected, there is no local instance of the program running on the master
node, which is not what we want.
- the slave node issues an rsh to itself; is that expected?
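
To make the second point easier to see, here is a minimal sketch that rebuilds the parent/child links from the slave-node listing above and reports which processes descend from sge_shepherd. The PID/PPID pairs are copied from that listing, with the commands abbreviated; the helper name `descendants` is only illustrative. It looks at parent links alone, which may not be how SGE itself attributes usage to a job.

```python
from collections import defaultdict

# (pid, ppid, abbreviated command) copied from the slave-node ps listing.
PROCS = [
    (14300, 3464, "sge_shepherd-2375 -bg"),
    (14301, 14300, "rshd -l"),
    (14302, 14301, "qrsh_starter"),
    (14303, 14302, "abinip_eth -p4pg"),
    (14304, 14303, "abinip_eth -p4pg"),
    (14305, 14303, "/usr/bin/rsh lmexec-62"),
    (14306, 3995, "in.rshd -aL"),
    (14307, 14306, "abinip_eth -p4amslave"),
    (14357, 14307, "abinip_eth -p4amslave"),
]


def descendants(procs, root_pid):
    """Return the set of PIDs reachable from root_pid via parent links."""
    children = defaultdict(list)
    for pid, ppid, _ in procs:
        children[ppid].append(pid)
    found, stack = set(), [root_pid]
    while stack:
        for child in children[stack.pop()]:
            if child not in found:
                found.add(child)
                stack.append(child)
    return found


SHEPHERD_PID = 14300
under_shepherd = descendants(PROCS, SHEPHERD_PID)
outside = sorted(
    pid for pid, _, _ in PROCS
    if pid != SHEPHERD_PID and pid not in under_shepherd
)

print("under sge_shepherd:", sorted(under_shepherd))
print("outside sge_shepherd:", outside)
```

In this listing the process with the most CPU time (14307, started by in.rshd) is not a descendant of the shepherd: it was launched by the plain /usr/bin/rsh call in 14305.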

Under these conditions, qstat -ext reports 0 usage (cpu/mem).

If I don't use this -nolocal flag, then the rsh/qrsh wrapper mechanism doesn't
seem to come into play, and the master node does a direct rsh to the slave node.
Under these conditions, qstat -ext reports cpu time (from a single process,
which is also expected since there is no SGE control in this case).

All in all, I don't see how this -nolocal flag can make the rsh wrapper appear 
to work or fail.
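
One thing that can be checked independently of -nolocal is which rsh a process would actually pick up. The sketch below assumes the wrapper is a file named rsh in a directory placed ahead of the system one on PATH (an assumption about the setup, not something confirmed above). It uses throwaway directories and dummy scripts, so nothing real is executed; `first_on_path` and `make_fake_executable` are hypothetical helpers.

```python
import os
import shutil
import stat
import tempfile


def first_on_path(command, path_dirs):
    """Return the first executable called `command` in path_dirs, or None."""
    return shutil.which(command, path=os.pathsep.join(path_dirs))


def make_fake_executable(directory, name):
    """Create a dummy executable script so the lookup has something to find."""
    path = os.path.join(directory, name)
    with open(path, "w") as fh:
        fh.write("#!/bin/sh\nexit 0\n")
    os.chmod(path, os.stat(path).st_mode | stat.S_IXUSR)
    return path


with tempfile.TemporaryDirectory() as wrapper_dir, \
        tempfile.TemporaryDirectory() as system_dir:
    wrapper = make_fake_executable(wrapper_dir, "rsh")  # stands in for the wrapper
    system = make_fake_executable(system_dir, "rsh")    # stands in for /usr/bin/rsh

    # A lookup by bare name finds whichever directory comes first on PATH.
    name_hits_wrapper = first_on_path("rsh", [wrapper_dir, system_dir]) == wrapper
    name_hits_system = first_on_path("rsh", [system_dir, wrapper_dir]) == system

    # A lookup by absolute path ignores PATH entirely.
    abs_bypasses_wrapper = shutil.which(system) == system

print("bare name, wrapper dir first:", name_hits_wrapper)
print("bare name, system dir first:", name_hits_system)
print("absolute path bypasses PATH:", abs_bypasses_wrapper)
```

This matters here because the slave-node listing shows the call as /usr/bin/rsh, that is, by absolute path; a PATH-based wrapper would not intercept a call made that way.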

2) As a non-root user, the first scenario doesn't work: I get an "error:rcmd:
permission denied".  The second scenario works as it does for root.

Quite a bit lost...

Jean-Paul

Reuti wrote:
> Hi Jean-Paul,
> 
> Am 23.01.2006 um 14:31 schrieb Jean-Paul Minet:
> 
>> Reuti,
>>
>>> for using qrsh the /etc/hosts.equiv isn't necessary. I set this  
>>> just  to reflect the login node on all exec nodes to allow  
>>> interactive qrsh/ qlogin sessions.
>>
>>
>> OK, got this.
>>
>>> As qrsh will use a chosen port: any firewall and/or etc/hosts. 
>>> (allow| deny) configured? - Reuti
>>
>>
>> No firewall nor hosts.xxx.  The problem was from wrong mode set on  
>> rsh/rlogin on exec nodes (I had played with those following some  
>> hints for qrsh problem solving on the SGE FAQ, which probably  messed 
>> up everything).
>>
>> MPI jobs can now run with qrsh... CPU time displayed by "qstat -ext"
>> is no longer 0... but it corresponds to a single cpu!
>>
>> 2218 0.24170 0.24169 Test_abini minet NA grppcpm r 0:00:12:09 205.41073 0.00000 74727 0 0 71428 3298 0.04 all.q@lmexec-88 2
>>
>> This job started about 12 minutes earlier, and runs on 2 cpus.   
>> Shouldn't the displayed "cpu" be the sum of all cpu times or is  this 
>> the correct behavior?
>>
>> Thks for your input
>>
> 
> is "qstat -j 2218" giving you more reasonable results in the "usage  1:" 
> line? As "qstat -g t -ext" will also display the CPU time for  slave 
> processes, these should be per process. - Reuti
> 
>> jp
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
>> For additional commands, e-mail: users-help at gridengine.sunsource.net
> 
> 

-- 
Jean-Paul Minet
Gestionnaire CISM - Institut de Calcul Intensif et de Stockage de Masse
Université Catholique de Louvain
Tel: (32) (0)10.47.35.67 - Fax: (32) (0)10.47.34.52
