[GE users] What's the consequence if I removed these lines from sge_conf

reuti reuti at staff.uni-marburg.de
Wed Jan 6 20:20:23 GMT 2010



On 06.01.2010 at 20:36, igardais wrote:

> OK.
> For point 3: the *_command entries are all set to the default builtin
> setting.
> For point 2: we are using Intel MPI (mostly 3.1 and 3.2). Even if it
> is based on Open MPI, I don't know if the SGE integration has been
> ported.

No, Intel MPI is based on MPICH2. I thought you would use Open MPI,
as you used the PE orte. If the PE orte was set up by the admin just
for Open MPI, it won't work for Intel MPI.

For Intel MPI you can use this:
http://gridengine.sunsource.net/howto/mpich2-integration/mpich2-integration.html
and create a PE accordingly. It's the mpd startup mechanism.
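
A rough sketch of such a PE (only a sketch: the startmpich2.sh /
stopmpich2.sh helpers and their location, as well as the Intel MPI
path, are placeholders to be adjusted to your installation):

pe_name            mpich2_mpd
slots              999
user_lists         NONE
xuser_lists        NONE
start_proc_args    /usr/sge/mpich2_mpd/startmpich2.sh -catch_rsh $pe_hostfile /opt/intel/impi/3.2
stop_proc_args     /usr/sge/mpich2_mpd/stopmpich2.sh -catch_rsh /opt/intel/impi/3.2
allocation_rule    $round_robin
control_slaves     TRUE
job_is_first_task  FALSE
urgency_slots      min

Then request it with "-pe mpich2_mpd <slots>" in the jobscript.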


> For point 1: does --rsh need to be set? catch_rsh seems to prepend
> the 'rsh' link to the PATH variable, but according to point 2, I
> don't know whether the default --rsh command is just 'rsh' or its
> full path.

If it's unknown what's compiled into the binary, it's best to set it
explicitly in the jobscript, or in one of the sge_request files as a
default, to be on the safe side. -catch_rsh will only work when the
used call is a plain "rsh". If it's "ssh", you have to adjust the
link created in start_proc_args to be "ssh" and not "rsh". With any
absolute path it will never work, as the call can't be caught this
way.
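
E.g. (just a sketch; --rsh is the switch you already use, and the
wrapper path follows the stock $SGE_ROOT/mpi/startmpi.sh, so adjust
both to your setup):

# in the jobscript: pass a plain name, never an absolute path
mpirun --rsh=ssh ...

# in the script referenced by start_proc_args: create the link
# under exactly that name, pointing to SGE's rsh-wrapper
ln -s $SGE_ROOT/mpi/rsh $TMPDIR/ssh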

-- Reuti


>
> Ionel
>
>
> On 6 Jan 2010, at 18:52, reuti wrote:
>
>> Hi,
>>
>> On 06.01.2010 at 07:53, igardais wrote:
>>
>>> What about rsh interception when using "builtin" commands?
>>> All my mpi scripts specify "--rsh=/usr/bin/ssh" to use the classic
>>> key-based password-less login but with little control over the job.
>>
>> Three things.
>>
>> First: correct. This absolute path will bypass any job control
>> imposed by SGE. The idea behind the -catch_rsh in the PE definition
>> is:
>>
>> - SGE will create a link called "rsh" in $TMPDIR on the master node
>> of the parallel job which will point to SGE's rsh-wrapper. It's
>> important to realize that at this point the name "rsh" is just a
>> name and is not related to any startup mechanism at all. You can
>> even tell your application "--rsh=fubar" and create a link called
>> "fubar" in $TMPDIR. This is usually done in the start_proc_args
>> defined in the PE (see the sketch after this list).
>>
>> - Then SGE's rsh-wrapper will be called, which will use "qrsh -
>> inherit ..." to get to the other slave tasks.
>>
>> - The "qrsh -inherit ..." will use one of the 3 mentioned startup
>> mechanisms below. For rsh and ssh a dedicated daemon rshd/sshd will
>> be started by SGE on a dedicated port just for this one call. It's
>> not necessary to have sshd/rshd running all the time. This way you
>> can have a cluster where no user can login to a node but can still
>> use this way to start tasks between the nodes.
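>>
>> (Roughly, give or take the exact paths, -catch_rsh in the stock
>> startmpi.sh boils down to
>>
>>     ln -s $SGE_ROOT/mpi/rsh $TMPDIR/rsh
>>
>> and the wrapper in turn executes something like
>>
>>     qrsh -inherit <hostname> <command>
>>
>> for every slave task. Only the name of the link has to match what
>> the application will call.)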
>>
>> Second: Did you compile Open MPI with --with-sge? Then --rsh
>> shouldn't have any effect at all, as Open MPI will detect
>> automatically that it's running under SGE.
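>>
>> (Sketch, version and prefix are just examples:
>>
>>     ./configure --prefix=/opt/openmpi --with-sge && make install
>>
>> Afterwards "ompi_info | grep gridengine" should show the gridengine
>> component, and a plain "mpirun -np $NSLOTS ./app" in the jobscript
>> is enough, with no machinefile and no --rsh.)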
>>
>> Third: As said, the entries for rsh_command and rsh_daemon must
>> match. When only the *_commands are defined, the *_daemons will
>> keep their defaults. With a mismatch, an rsh might try to contact
>> an sshd, or the builtin mechanism an rshd. None of this will work.
>> Best is to include both entries of the pair.
>>
>> -- Reuti
>>
>>
>>> I'm considering rsh-interception but my first attempts (a few years
>>> back now) were unsuccessful.
>>>
>>> Any hints ?
>>>
>>> Thanks,
>>> Regards,
>>> Ionel
>>>
>>>
>>> From: reuti <reuti at staff.uni-marburg.de>
>>> To: users at gridengine.sunsource.net
>>> Sent: Wed, 6 January 2010, 01:56:40
>>> Subject: Re: [GE users] What's the consequence if I removed these
>>> lines from sge_conf
>>>
>>> On 06.01.2010 at 01:40, kdoman wrote:
>>>
>>>> What's the consequence of removing the lines below from sge_conf?
>>>> If I don't, we cannot submit any parallel jobs that request
>>>> "-pe orte" greater than 4.
>>>>
>>>> qrsh_command                /usr/bin/ssh
>>>> rsh_command                  /usr/bin/ssh
>>>> rlogin_command              /usr/bin/ssh
>>>
>>> The definition of the *_command must match the one of the
>>> *_daemon. It defines what mechanism will be used to start
>>> interactive jobs or slave tasks. You can have:
>>>
>>> Classic rsh startup (e.g. for x86):
>>>
>>> qlogin_command              /usr/bin/telnet
>>> qlogin_daemon                /usr/sbin/in.telnetd
>>> rlogin_command              /usr/sge/utilbin/lx24-x86/rlogin
>>> rlogin_daemon                /usr/sbin/in.rlogind
>>> rsh_command                  /usr/sge/utilbin/lx24-x86/rsh
>>> rsh_daemon                  /usr/sge/utilbin/lx24-x86/rshd -l
>>>
>>> All builtin:
>>>
>>> qlogin_command              builtin
>>> qlogin_daemon                builtin
>>> rlogin_command              builtin
>>> rlogin_daemon                builtin
>>> rsh_command                  builtin
>>> rsh_daemon                  builtin
>>>
>>> or ssh according to:
>>>
>>> http://gridengine.sunsource.net/howto/qrsh_qlogin_ssh.html
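>>>
>>> which boils down to entries roughly like the following (sshd's -i
>>> runs it in inetd mode for a single connection; qlogin_* needs an
>>> extra wrapper, see the howto):
>>>
>>> rlogin_command              /usr/bin/ssh
>>> rlogin_daemon                /usr/sbin/sshd -i
>>> rsh_command                  /usr/bin/ssh
>>> rsh_daemon                  /usr/sbin/sshd -i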
>>>
>>> The three pairs qlogin_*, rlogin_* and rsh_* must each be
>>> consistent, but of course the pairs can differ from one another.
>>>
>>> Also note that these entries can be overridden at the exec host
>>> level, i.e. in its local configuration: qconf -mconf <exechost>
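>>>
>>> For example, to switch only one host (hypothetical name) to the
>>> ssh pair while the rest of the cluster stays builtin:
>>>
>>> qconf -mconf node42
>>>
>>> and change the rsh_command/rsh_daemon entries in the editor that
>>> opens.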
>>>
>>> -- Reuti
>>>
>>>
>>>> Without the above modification, any job submission with -pe orte
>>>> greater than 4 would receive this error:
>>>>
>>>> error: error: ending connection before all data received
>>>> error:
>>>> error reading job context from "qlogin_starter"
>>>>
>>>> --------------------------------------------------------------------------
>>>> A daemon (pid 2160) died unexpectedly with status 1 while attempting
>>>> to launch so we are aborting.
>>>>
>>>> There may be more information reported by the environment (see above).
>>>>
>>>> This may be because the daemon was unable to find all the needed
>>>> shared libraries on the remote node. You may set your LD_LIBRARY_PATH
>>>> to have the location of the shared libraries on the remote nodes and
>>>> this will automatically be forwarded to the remote nodes.
>>>> --------------------------------------------------------------------------
>>>>
>>>> --------------------------------------------------------------------------
>>>> mpirun noticed that the job aborted, but has no info as to the process
>>>> that caused that situation.
>>>> --------------------------------------------------------------------------
>>>> mpirun: clean termination accomplished
>>>>
>>>>
>>>> Thanks.
>>>> K.
>>>>
