[GE users] tight integration problem

Jean-Paul Minet minet at cism.ucl.ac.be
Thu Jan 26 08:47:41 GMT 2006


Reuti,

> In short: don't use -nolocal! It will exclude the starting node from
> all starts of an MPICH process and will lead to an uneven process
> distribution. Since SGE looks at the number of issued qrsh's, this
> might break the complete setup. And what you see is a slave node
> becoming the new master of the MPICH program. It looks okay, but is
> of course wrong for the Tight Integration.

The fact that -nolocal was included in the submission script is somewhat
historical.  It is indeed clear to me that the -nolocal switch is undesirable,
since we do want the MPICH "head-node" to do some work.  But during my
troubleshooting of the qrsh wrapping scheme, it so happened that, while the
-nolocal switch was there, I could see some qrsh issued (and could therefore
conclude that the rsh redirection through the /tmp/... directory and the PATH
fiddling was working).  I then wanted to remove this offending -nolocal
switch, but with it gone, any reference to qrsh disappears (see below for the
ps -ef output)!
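
For concreteness, the relevant part of the submission script (with -nolocal
now removed) looks along these lines.  This is a sketch: the PE name "mpich",
the slot count and the -cwd flag are illustrative; the MPICH path and the
abinip_eth binary are the ones from our setup:

#!/bin/sh
#$ -pe mpich 2
#$ -cwd
# The PE start script populates $TMPDIR with the machines file and the
# rsh -> $SGE_ROOT/mpi/rsh wrapper link; SGE puts $TMPDIR first in PATH,
# so a plain "rsh" should resolve to the wrapper.
/usr/local/mpich-eth-intel/bin/mpirun -np $NSLOTS \
    -machinefile $TMPDIR/machines abinip_eth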

> What do you mean by master-node? The master node of the cluster or the
> "head-node" of the parallel job? With MPICH, the "head-node" of the
> parallel job (which is actually a conventional exec node in the
> cluster) will also do some work (therefore don't use -nolocal). This
> is the one you see in the "qstat" output (or named "MASTER" with
> "qstat -g t").

I indeed meant the SGE MASTER, which is the MPICH head node.  There was no
intention to use -nolocal deliberately (it was there purely by accident).

> Can you please post the "ps -e f" output when not using -nolocal? Did
> you also check, in the default output file of the PE (.po), that the
> hostnames listed there are the ones you get from the command
> "hostname" on the nodes? Otherwise MPICH might fail to subtract one
> process on the "head-node" of the parallel job.
>
> Using the Tight Integration, a normal user can also just use qrsh.
> So, I'd suggest submitting a small script with the proper -pe request:
> 
> #!/bin/sh
> cat $TMPDIR/machines
> sleep 120
> 
> and checking on the head-node of the parallel job that the link to the
> rsh wrapper was created in the intended way in $TMPDIR. Is $SGE_ROOT
> mounted nosuid on the nodes? This might explain why only root can do it.
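
For the hostname check you suggest, I used a small variant of your test
script (a sketch along these lines):

#!/bin/sh
# the names in the machines file must match what each node itself
# reports, otherwise MPICH cannot subtract the local process
echo "head node reports: `hostname`"
echo "machinefile says:"
cat $TMPDIR/machines
sleep 120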

Before listing the ps -ef output, let me confirm that I checked that, on the
MPICH head-node, the /tmp/<job_id>.<task>.<queue>/rsh symbolic link is there,
pointing to /gridware/sge/mpi/rsh, and that, in the same temporary directory,
the machines file looks correct (also confirmed by the .po output file).  As
far as qrsh is concerned, root can use it interactively, but I just realized
that normal users get the same error.  SGE was installed (by Sun on delivery
of the cluster) locally on each node (no remote mount), which doesn't seem
right to me; I would prefer an NFS mount.  You mentioned earlier that no SUID
is required on specific SGE binaries, but the SGE HowTo
(http://gridengine.sunsource.net/howto/commonproblems.html#interactive)
mentions SUID for the utilbin rlogin and rsh binaries.  Anyway, I followed
that, and I still get the error:

qrsh -verbose -l mem_free=10M -l num_proc=2 -q all.q@lmexec-100 date
your job 2402 ("date") has been submitted
waiting for interactive job to be scheduled ...
Your interactive job 2402 has been successfully scheduled.
Establishing /gridware/sge/utilbin/lx24-amd64/rsh session to host lmexec-100 ...
rcmd: socket: Permission denied

on lmexec-100, I have:
-rwxr-xr-x  1 sgeadmin root  194380 Jul 22  2005 qrsh_starter
-r-sr-xr-x  1 root     root   32607 Jul 22  2005 rlogin
-r-sr-xr-x  1 root     root   22180 Jul 22  2005 rsh
-rwxr-xr-x  1 sgeadmin root  218778 Jul 22  2005 rshd
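
Since the SUID bits look correct in this listing, I will also check whether
something else is masking them.  A sketch of the checks, following your
nosuid hint (the paths are the ones from the session above):

# a nosuid mount would silently disable the setuid bits and give
# exactly "rcmd: socket: Permission denied" for non-root users
mount | grep nosuid
# confirm the binary qrsh actually hands the session to
ls -l /gridware/sge/utilbin/lx24-amd64/rsh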

Here is the ps -ef output on the head node for an MPICH job (np=2) when
-nolocal is not used:

sgeadmin  1181  3492  0 08:43 ?        00:00:00 sge_shepherd-2404 -bg
minet     1205  1181  0 08:43 ?        00:00:00 bash /var/spool/sge/lmexec-94/job_scripts/2404
minet     1208  1205  0 08:43 ?        00:00:00 /bin/sh /usr/local/mpich-eth-intel/bin/mpirun -np 2 -machinefile /tmp/2404.1.all.q/machines abinip_e
minet     1292  1208  0 08:43 ?        00:00:12 /home/pan/minet/abinit/parallel_eth/abinip_eth -p4pg /home/pan/minet/abinit/parallel_eth/PI1208 -p4w
minet     1293  1292  0 08:43 ?        00:00:00 /home/pan/minet/abinit/parallel_eth/abinip_eth -p4pg /home/pan/minet/abinit/parallel_eth/PI1208 -p4w
minet     1294  1292  0 08:43 ?        00:00:00 /usr/bin/rsh lmexec-72 -l minet -n /home/pan/minet/abinit/parallel_eth/abinip_eth lmexec-94 32858 \-

We indeed have the node doing some work... but the connection to the "slave"
is made through /usr/bin/rsh instead of qrsh.  On this head node, we also
have:

lmexec-94 /tmp/2404.1.all.q # ls -al
total 12
drwxr-xr-x  2 minet grppcpm 4096 Jan 26 08:42 .
drwxrwxrwt  9 root  root    4096 Jan 26 08:42 ..
-rw-r--r--  1 minet grppcpm   20 Jan 26 08:42 machines
lrwxrwxrwx  1 minet grppcpm   21 Jan 26 08:42 rsh -> /gridware/sge/mpi/rsh

and also:
lmexec-94 /tmp/2404.1.all.q # cat machines
lmexec-94
lmexec-72

but somehow, the rsh wrapping mechanism gets bypassed somewhere...
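
My suspicion is therefore that mpirun never looks in $TMPDIR when it starts
the slaves.  Two things I plan to verify from within the job script (a
sketch; P4_RSHCOMMAND is, as far as I understand MPICH 1.x ch_p4, an
environment override of the remote shell, whose default is otherwise
compiled into the mpirun scripts):

# $TMPDIR must precede /usr/bin in PATH for the wrapper link to win
echo $PATH
which rsh
# explicit override of the ch_p4 remote shell (my assumption)
export P4_RSHCOMMAND=rsh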

I hope I have been clear enough in my replies to help you help me ;-)

Thanks again for your support

Jean-Paul

> HTH - Reuti
> 
> 
> On 25.01.2006, at 15:42, Jean-Paul Minet wrote:
> 
>> Reuti,
>>
>> I am totally lost with this tight integration...
>>
>> 1) as root user, if I use the -nolocal flag as an mpirun argument, I
>> end up with the following processes on the "master node":
>>
>> root      5349  5326  0 14:27 ?        00:00:00 bash /var/spool/sge/lmexec-121/job_scripts/2375
>> root      5352  5349  0 14:27 ?        00:00:00 /bin/sh /usr/local/mpich-eth-intel/bin/mpirun -nolocal -np 2 -machinefile /tmp/2375.1.all.q/machines abinip_eth
>> root      5446  5352  0 14:27 ?        00:00:00 /gridware/sge/bin/lx24-amd64/qrsh -inherit lmexec-62 /home/pan/minet/abinit/parallel_eth/abinip_eth -p4pg /home/
>> root      5454  5446  0 14:27 ?        00:00:00 /gridware/sge/utilbin/lx24-amd64/rsh -p 32816 lmexec-62 exec '/gridware/sge/utilbin/lx24-amd64/qrsh_starter' '/v
>> root      5455  5454  0 14:27 ?        00:00:00 [rsh] <defunct>
>>
>> and on the slave node:
>>
>> sgeadmin 14300  3464  0 14:27 ?        00:00:00 sge_shepherd-2375 -bg
>> root     14301 14300  0 14:27 ?        00:00:00 /gridware/sge/utilbin/lx24-amd64/rshd -l
>> root     14302 14301  0 14:27 ?        00:00:00 /gridware/sge/utilbin/lx24-amd64/qrsh_starter /var/spool/sge/lmexec-62/active_jobs/2375.1/1.lmexec-62
>> root     14303 14302 27 14:27 ?        00:00:31 /home/pan/minet/abinit/parallel_eth/abinip_eth -p4pg /home/pan/minet/abinit/parallel_eth/PI5352 -p4wd /home/pan/minet/abinit/paralle
>> root     14304 14303  0 14:27 ?        00:00:00 /home/pan/minet/abinit/parallel_eth/abinip_eth -p4pg /home/pan/minet/abinit/parallel_eth/PI5352 -p4wd /home/pan/minet/abinit/paralle
>> root     14305 14303  0 14:27 ?        00:00:00 /usr/bin/rsh lmexec-62 -l root -n /home/pan/minet/abinit/parallel_eth/abinip_eth lmexec-62 32819 \-p4amslave \-p4yourname lmexec-62
>> root     14306  3995  0 14:27 ?        00:00:00 in.rshd -aL
>> root     14307 14306 86 14:27 ?        00:01:38 /home/pan/minet/abinit/parallel_eth/abinip_eth lmexec-62 32819 -p4amslave -p4yourname lmexec-62 -p4rmrank 1
>> root     14357 14307  0 14:27 ?        00:00:00 /home/pan/minet/abinit/parallel_eth/abinip_eth lmexec-62 32819 -p4amslave -p4yourname lmexec-62 -p4rmrank 1
>>
>> So, I can see that, in a way, the SGE qrsh/rsh/qrsh_starter are
>> coming into play and that the sge_shepherd is initiating the remote
>> process.  Nevertheless:
>>
>> - as expected, there is no local instance of the program running on
>> the master node, which is not what we want.
>> - the slave node issues an rsh onto itself; is that expected?
>>
>> Under these conditions, qstat -ext reports 0 usage (cpu/mem).
>>
>> If I don't use this -nolocal flag, then the rsh/qrsh wrapper
>> mechanism doesn't seem to come into play, and the master node does a
>> direct rsh to the slave node.  Under these conditions, qstat -ext
>> reports CPU time (from a single process, which is also expected,
>> since there is no SGE control in this case).
>>
>> All in all, I don't see how this -nolocal flag can make the rsh
>> wrapper appear to work or fail.
>>
>> 2) as a non-root user, the first scenario doesn't work, as I get an
>> "error: rcmd: permission denied".  The second scenario works as it
>> does for the root user.
>>
>> Quite a bit lost...
>>
>> Jean-Paul
>>
>> Reuti wrote:
>>
>>> Hi Jean-Paul,
>>> On 23.01.2006, at 14:31, Jean-Paul Minet wrote:
>>>
>>>> Reuti,
>>>>
>>>>> For using qrsh, the /etc/hosts.equiv isn't necessary. I set this
>>>>> just to reflect the login node on all exec nodes, to allow
>>>>> interactive qrsh/qlogin sessions.
>>>>
>>>>
>>>>
>>>> OK, got this.
>>>>
>>>>> As qrsh will use a chosen port: any firewall and/or
>>>>> /etc/hosts.(allow|deny) configured? - Reuti
>>>>
>>>>
>>>>
>>>> No firewall nor hosts.xxx.  The problem came from a wrong mode set
>>>> on rsh/rlogin on the exec nodes (I had played with those following
>>>> some hints for qrsh problem solving in the SGE FAQ, which probably
>>>> messed everything up).
>>>>
>>>> MPI jobs can now run with qrsh... The CPU time displayed by
>>>> "qstat -ext" is no longer 0... but it corresponds to a single CPU!
>>>>
>>>> 2218 0.24170 0.24169 Test_abini minet        NA                 grppcpm    r 0:00:12:09 205.41073 0.00000 74727     0     0  71428   3298 0.04  all.q@lmexec-88                    2
>>>>
>>>> This job started about 12 minutes earlier and runs on 2 CPUs.
>>>> Shouldn't the displayed "cpu" be the sum of all CPU times, or is
>>>> this the correct behavior?
>>>>
>>>> Thanks for your input
>>>>
>>> is "qstat -j 2218" giving you more reasonable results in the  "usage  
>>> 1:" line? As "qstat -g t -ext" will also display the CPU  time for  
>>> slave processes, these should be per process. - Reuti
>>>
>>>> jp
>>>>
>>
>>
>> -- 
>> Jean-Paul Minet
>> Gestionnaire CISM - Institut de Calcul Intensif et de Stockage de Masse
>> Université Catholique de Louvain
>> Tel: (32) (0)10.47.35.67 - Fax: (32) (0)10.47.34.52
>>

-- 
Jean-Paul Minet
Gestionnaire CISM - Institut de Calcul Intensif et de Stockage de Masse
Université Catholique de Louvain
Tel: (32) (0)10.47.35.67 - Fax: (32) (0)10.47.34.52
