[GE users] tight integration problem

Jean-Paul Minet minet at cism.ucl.ac.be
Thu Jan 26 09:27:37 GMT 2006



Reuti,

Just a correction to my last mail: the /usr/local/mpich stuff is indeed
NFS exported and not local to each node... sorry for the wrong info.

jp

Reuti wrote:
> Hi,
> 
> On 26.01.2006, at 09:47, Jean-Paul Minet wrote:
> 
>> Reuti,
>>
>>> In short: don't use -nolocal! This will exclude the starting node
>>> from all starts of an MPICH process and will lead to an uneven
>>> process distribution. With SGE looking at the number of issued
>>> qrsh's, this might break the complete setup. And what you see is
>>> that a slave node is becoming the new master of the MPICH program.
>>> Looks okay, but of course wrong for the Tight Integration.
>>
>>
>> The fact that -nolocal was included in the submission script is
>> somewhat historical. It's clear to me that the -nolocal switch is
>> not desirable, since the mpich "head-node" should indeed do some
>> work. But during my "troubleshooting" of the qrsh wrapping scheme,
>> it so happened that, while the -nolocal switch was there, I could
>> see some qrsh issued (and could therefore conclude that the rsh
>> redirection through the /tmp/... directory and the PATH fiddling was
>> working). I then wanted to remove this offending -nolocal switch,
>> but then any reference to qrsh disappears... (see below for the
>> ps -ef output)!
>>
>>> What do you mean by master-node? The master node of the cluster or
>>> the "head-node" of the parallel job? With MPICH, the "head-node" of
>>> the parallel job (which is actually a conventional exec node in the
>>> cluster) will also do some work (therefore don't use -nolocal). This
>>> is the one you see in the "qstat" output (or named "MASTER" with
>>> "qstat -g t").
>>
>>
>> I did indeed mean the SGE MASTER, which is the mpich head node. There
>> was no intention at all to use -nolocal deliberately (just by accident...)
>>
>>> Can you please post the "ps -e f" output when not using -nolocal?
>>> Did you also check in the default output file of the PE (.po) that
>>> the hostnames listed there are the ones you get from the command
>>> "hostname" on the nodes? Otherwise MPICH might fail to subtract one
>>> process on the "head-node" of the parallel job.
>>>
>>> With the Tight Integration, a normal user can also just use qrsh.
>>> So I'd suggest submitting a small script with the proper -pe request:
>>> #!/bin/sh
>>> cat $TMPDIR/machines
>>> sleep 120
>>> and checking on the head-node of the parallel job that the link to
>>> the rsh wrapper was created in the intended way in $TMPDIR. Is
>>> $SGE_ROOT mounted nosuid on the nodes? That might explain why only
>>> root can do it.
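>>>
>>> For example (just an illustrative sketch; the PE name "mpich" and the
>>> script name check_tmpdir.sh are placeholders for whatever your setup
>>> uses), save the three lines above as check_tmpdir.sh and submit it with:
>>>
>>> qsub -pe mpich 2 -cwd check_tmpdir.sh
>>>
>>> While the sleep is running, you can then look into the job's
>>> /tmp/<jobid>.1.<queue> directory on the head-node of the job and see
>>> whether the rsh link is there.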
>>
>>
>> Before listing the ps -ef output, let me confirm that I checked that,
>> on the mpich head-node, the /tmp/<job number>/rsh symbolic link is
>> there, pointing to /sge/gridware/mpi/bin/rsh, and that, in the same
>> temporary directory, the machine file seems correct (also confirmed by
>> the .po output file). As far as qrsh is concerned, root can use it
>> interactively, but I just realized now that normal users get the same
>> error. SGE was installed (by Sun on delivery of the cluster) on each
>> node (no remote mount), and this doesn't seem right to me. I would
>> prefer an NFS mount. You mentioned earlier that no SUID is required
>> on specific SGE binaries, but in the SGE HowTos (
> 
> 
> Correct, my remark was a question whether it was mounted by accident
> with nosuid, which would prevent the SUID bit from taking effect, so
> it wouldn't work. Sorry for the confusion.
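>
> If $SGE_ROOT were NFS-mounted, a quick check (just a sketch, assuming
> it lives under /gridware as in your listings) on an exec node would be:
>
> mount | grep gridware      # look for "nosuid" among the options
> grep gridware /etc/fstab   # and compare with what is configured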
> 
>> http://gridengine.sunsource.net/howto/commonproblems.html#interactive),
>> there is some mention of SUID for utilbin rlogin and rsh. Anyway, I
>> followed that, and I still get the error:
>>
>> qrsh -verbose -l mem_free=10M -l num_proc=2 -q all.q@lmexec-100 date
>> your job 2402 ("date") has been submitted
>> waiting for interactive job to be scheduled ...
>> Your interactive job 2402 has been successfully scheduled.
>> Establishing /gridware/sge/utilbin/lx24-amd64/rsh session to host lmexec-100 ...
>> rcmd: socket: Permission denied
>>
> 
> Mmh, okay, let's first look at a possibly wrong $PATH. Can you put an
> "echo $PATH" in your jobscript? Usually $TMPDIR is the first entry in
> the created $PATH. If you change the $PATH, $TMPDIR must again be put
> in first place so that the wrapper is found.
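>
> For example (only a sketch), if the jobscript adjusts the $PATH itself,
> it could do something like:
>
> # whatever PATH changes the job needs...
> export PATH=/usr/local/mpich-eth-intel/bin:$PATH
> # ...then put $TMPDIR back in front, so the rsh wrapper is found first
> export PATH=$TMPDIR:$PATH
> echo $PATH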
> 
> Another thing: which rsh command is compiled in?
> 
> strings abinip_eth | egrep "(rsh|ssh)"
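>
> If a plain "rsh" (or a full path like /usr/bin/rsh) shows up there, the
> ch_p4 device can usually be told at run time which remote shell to use,
> e.g. (a sketch, assuming your MPICH honours P4_RSHCOMMAND):
>
> export P4_RSHCOMMAND=rsh   # so the rsh found first in $PATH, i.e. $TMPDIR/rsh, is used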
> 
> -- Reuti
> 
> 
>> on lmexec-100, I have:
>> -rwxr-xr-x  1 sgeadmin root  194380 Jul 22  2005 qrsh_starter
>> -r-sr-xr-x  1 root     root   32607 Jul 22  2005 rlogin
>> -r-sr-xr-x  1 root     root   22180 Jul 22  2005 rsh
>> -rwxr-xr-x  1 sgeadmin root  218778 Jul 22  2005 rshd
>>
>> Here is the ps -ef output on the head node for an mpich job (np=2)
>> when -nolocal is not used:
>>
>> sgeadmin  1181  3492  0 08:43 ?        00:00:00 sge_shepherd-2404 -bg
>> minet     1205  1181  0 08:43 ?        00:00:00 bash /var/spool/sge/lmexec-94/job_scripts/2404
>> minet     1208  1205  0 08:43 ?        00:00:00 /bin/sh /usr/local/mpich-eth-intel/bin/mpirun -np 2 -machinefile /tmp/2404.1.all.q/machines abinip_e
>> minet     1292  1208  0 08:43 ?        00:00:12 /home/pan/minet/abinit/parallel_eth/abinip_eth -p4pg /home/pan/minet/abinit/parallel_eth/PI1208 -p4w
>> minet     1293  1292  0 08:43 ?        00:00:00 /home/pan/minet/abinit/parallel_eth/abinip_eth -p4pg /home/pan/minet/abinit/parallel_eth/PI1208 -p4w
>> minet     1294  1292  0 08:43 ?        00:00:00 /usr/bin/rsh lmexec-72 -l minet -n /home/pan/minet/abinit/parallel_eth/abinip_eth lmexec-94 32858 \-
>>
>> We indeed have the node doing some work... but the connection to the
>> "slave" is through /usr/bin/rsh instead of qrsh. On this head node,
>> we also have:
>>
>> lmexec-94 /tmp/2404.1.all.q # ls -al
>> total 12
>> drwxr-xr-x  2 minet grppcpm 4096 Jan 26 08:42 .
>> drwxrwxrwt  9 root  root    4096 Jan 26 08:42 ..
>> -rw-r--r--  1 minet grppcpm   20 Jan 26 08:42 machines
>> lrwxrwxrwx  1 minet grppcpm   21 Jan 26 08:42 rsh -> /gridware/sge/mpi/rsh
>>
>> and also:
>> lmexec-94 /tmp/2404.1.all.q # cat machines
>> lmexec-94
>> lmexec-72
>>
>> but somehow, the rsh wrapping mechanism gets bypassed somewhere...
>>
>> I hope I have been clear in my replies so that it helps you to help  
>> me ;-)
>>
>> Thanks again for your support
>>
>> Jean-Paul
>>
>>> HTH  -Reuti
>>> On 25.01.2006, at 15:42, Jean-Paul Minet wrote:
>>>
>>>> Reuti,
>>>>
>>>> I am totally lost with this tight integration...
>>>>
>>>> 1) As the root user, if I use the -nolocal flag as an mpirun argument,
>>>> I end up with the following processes on the "master node":
>>>>
>>>> root      5349  5326  0 14:27 ?        00:00:00 bash /var/spool/sge/lmexec-121/job_scripts/2375
>>>> root      5352  5349  0 14:27 ?        00:00:00 /bin/sh /usr/local/mpich-eth-intel/bin/mpirun -nolocal -np 2 -machinefile /tmp/2375.1.all.q/machines abinip_eth
>>>> root      5446  5352  0 14:27 ?        00:00:00 /gridware/sge/bin/lx24-amd64/qrsh -inherit lmexec-62 /home/pan/minet/abinit/parallel_eth/abinip_eth -p4pg /home/
>>>> root      5454  5446  0 14:27 ?        00:00:00 /gridware/sge/utilbin/lx24-amd64/rsh -p 32816 lmexec-62 exec '/gridware/sge/utilbin/lx24-amd64/qrsh_starter' '/v
>>>> root      5455  5454  0 14:27 ?        00:00:00 [rsh] <defunct>
>>>>
>>>> and on the slave node:
>>>>
>>>> sgeadmin 14300  3464  0 14:27 ?        00:00:00 sge_shepherd-2375 -bg
>>>> root     14301 14300  0 14:27 ?        00:00:00 /gridware/sge/utilbin/lx24-amd64/rshd -l
>>>> root     14302 14301  0 14:27 ?        00:00:00 /gridware/sge/utilbin/lx24-amd64/qrsh_starter /var/spool/sge/lmexec-62/active_jobs/2375.1/1.lmexec-62
>>>> root     14303 14302 27 14:27 ?        00:00:31 /home/pan/minet/abinit/parallel_eth/abinip_eth -p4pg /home/pan/minet/abinit/parallel_eth/PI5352 -p4wd /home/pan/minet/abinit/paralle
>>>> root     14304 14303  0 14:27 ?        00:00:00 /home/pan/minet/abinit/parallel_eth/abinip_eth -p4pg /home/pan/minet/abinit/parallel_eth/PI5352 -p4wd /home/pan/minet/abinit/paralle
>>>> root     14305 14303  0 14:27 ?        00:00:00 /usr/bin/rsh lmexec-62 -l root -n /home/pan/minet/abinit/parallel_eth/abinip_eth lmexec-62 32819 \-p4amslave \-p4yourname lmexec-62
>>>> root     14306  3995  0 14:27 ?        00:00:00 in.rshd -aL
>>>> root     14307 14306 86 14:27 ?        00:01:38 /home/pan/minet/abinit/parallel_eth/abinip_eth lmexec-62 32819 -p4amslave -p4yourname lmexec-62 -p4rmrank 1
>>>> root     14357 14307  0 14:27 ?        00:00:00 /home/pan/minet/abinit/parallel_eth/abinip_eth lmexec-62 32819 -p4amslave -p4yourname lmexec-62 -p4rmrank 1
>>>>
>>>> So I can see that, in a way, the SGE qrsh/rsh/qrsh_starter are
>>>> coming into play and that the sge_shepherd is initiating the remote
>>>> process. Nevertheless:
>>>>
>>>> - as expected, there is no local instance of the program running on
>>>> the master node, which is not what we want.
>>>> - the slave node issues an rsh onto itself; is that expected?
>>>>
>>>> Under these conditions, qstat -ext reports 0 usage (cpu/mem).
>>>>
>>>> If I don't use this -nolocal flag, then the rsh/qrsh wrapper
>>>> mechanism doesn't seem to come into play, and the master node does a
>>>> direct rsh to the slave node. Under these conditions, qstat -ext
>>>> reports cpu time (from a single process, which is also expected since
>>>> there is no SGE control in this case).
>>>>
>>>> All in all, I don't see how this -nolocal flag can make the rsh
>>>> wrapper appear to work or fail.
>>>>
>>>> 2) As a non-root user, the first scenario doesn't work, as I get an
>>>> "error: rcmd: permission denied". The second scenario works as it
>>>> does for the root user.
>>>>
>>>> Quite a bit lost...
>>>>
>>>> Jean-Paul
>>>>
>>>> Reuti wrote:
>>>>
>>>>> Hi Jean-Paul,
>>>>> On 23.01.2006, at 14:31, Jean-Paul Minet wrote:
>>>>>
>>>>>> Reuti,
>>>>>>
>>>>>>> For using qrsh, the /etc/hosts.equiv isn't necessary. I set this
>>>>>>> just to reflect the login node on all exec nodes, to allow
>>>>>>> interactive qrsh/qlogin sessions.
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> OK, got this.
>>>>>>
>>>>>>> As qrsh will use a chosen port: is there any firewall and/or
>>>>>>> /etc/hosts.(allow|deny) configured? - Reuti
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> No firewall nor hosts.xxx. The problem came from wrong modes set
>>>>>> on rsh/rlogin on the exec nodes (I had played with those, following
>>>>>> some hints for qrsh troubleshooting in the SGE FAQ, which probably
>>>>>> messed everything up).
>>>>>>
>>>>>> MPI jobs can now run with qrsh... CPU time displayed by "qstat -ext"
>>>>>> is no longer 0... but it corresponds to a single cpu!
>>>>>>
>>>>>> 2218 0.24170 0.24169 Test_abini minet NA grppcpm r 0:00:12:09 205.41073 0.00000 74727 0 0 71428 3298 0.04 all.q@lmexec-88 2
>>>>>>
>>>>>> This job started about 12 minutes earlier, and runs on 2 cpus.
>>>>>> Shouldn't the displayed "cpu" be the sum of all cpu times or is
>>>>>> this the correct behavior?
>>>>>>
>>>>>> Thanks for your input
>>>>>>
>>>>> is "qstat -j 2218" giving you more reasonable results in the   
>>>>> "usage  1:" line? As "qstat -g t -ext" will also display the  CPU  
>>>>> time for  slave processes, these should be per process. -  Reuti
>>>>>
>>>>>> jp
>>>>>>
>>>>
>>>>
>>>>
>>>>
>>>
>>
>>
>>
> 
> 

-- 
Jean-Paul Minet
CISM Administrator - Institut de Calcul Intensif et de Stockage de Masse
Université Catholique de Louvain
Tel: (32) (0)10.47.35.67 - Fax: (32) (0)10.47.34.52




