[GE users] What's the consequence if I removed these lines from sge_conf

igardais igardais at yahoo.fr
Wed Jan 6 21:12:26 GMT 2010



I'm not kdoman and do not use orted :)
I just jumped into the thread.

I'll try to find the right combination of options.

Ionel


On 6 Jan 2010 at 21:20, reuti wrote:

> On 6 Jan 2010 at 20:36, igardais wrote:
> 
>> OK.
>> For point 3: the *_command entries are all set to the default builtin setting.
>> For point 2: we are using Intel MPI (mostly 3.1 and 3.2). Even if it is
>> based on Open MPI, I don't know whether the SGE integration has been ported.
> 
> No, Intel MPI is based on MPICH2. I assumed you were using Open MPI,
> since you used the PE orted. If the admin defined the PE orted only for
> Open MPI, it won't work for Intel MPI.
> 
> For Intel MPI you can follow this howto and create a PE accordingly:
> http://gridengine.sunsource.net/howto/mpich2-integration/mpich2-integration.html
> It uses the mpd startup mechanism.
> 
> 
>> For point 1: does --rsh need to be set? catch_rsh seems to
>> prepend the 'rsh' link to the PATH variable, but given point
>> 2, I don't know whether the default --rsh command is just 'rsh' or its
>> full path.
> 
> If it's unknown what's compiled into the binary, it's safest to set it
> explicitly in the jobscript, or as a default in one of the sge_request
> files. -catch_rsh will only work when the call that is actually made is
> a plain "rsh". If it's "ssh", you have to adjust the link created in
> start_proc_args to be named "ssh" instead of "rsh". With any absolute
> path it will never work, as the call can't be caught this way.
> 
> -- Reuti
> 
> 
>> 
>> Ionel
>> 
>> 
>> On 6 Jan 2010 at 18:52, reuti wrote:
>> 
>>> Hi,
>>> 
>>> On 6 Jan 2010 at 07:53, igardais wrote:
>>> 
>>>> What about rsh interception when using "builtin" commands?
>>>> All my MPI scripts specify "--rsh=/usr/bin/ssh" to use the classic
>>>> key-based passwordless login, but with little control over the job.
>>> 
>>> Three things.
>>> 
>>> First: correct. This absolute path will bypass any job control
>>> imposed by SGE. The idea behind -catch_rsh in the PE definition
>>> is:
>>> 
>>> - SGE will create a link called "rsh" in $TMPDIR on the master node
>>> of the parallel job, which will point to SGE's rsh-wrapper. It's
>>> important to realize that at this point "rsh" is just a
>>> name and is not related to any startup mechanism at all. You can even
>>> tell your application "--rsh=fubar" and create a link called "fubar"
>>> in $TMPDIR. This is usually done in the start_proc_args defined in
>>> the PE.
>>> 
>>> - Then SGE's rsh-wrapper will be called, which will use
>>> "qrsh -inherit ..." to get to the other slave tasks.
>>> 
>>> - The "qrsh -inherit ..." will use one of the three startup
>>> mechanisms mentioned below. For rsh and ssh a dedicated daemon rshd/sshd will
>>> be started by SGE on a dedicated port just for this one call. It's
>>> not necessary to have sshd/rshd running all the time. This way you
>>> can have a cluster where no user can login to a node but can still
>>> use this way to start tasks between the nodes.
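
The lookup described in the list above can be tried outside SGE. What follows is a minimal Python sketch, not SGE code: the directory names, the file rsh_wrapper and both helper functions are stand-ins made up for illustration. It only shows that a bare command name is resolved through PATH, and so can be caught by a link placed in front, while an absolute path never is.

```python
import os
import shutil
import stat
import tempfile


def make_executable(path):
    """Create a tiny executable shell script at `path` (placeholder program)."""
    with open(path, "w") as handle:
        handle.write("#!/bin/sh\nexit 0\n")
    mode = os.stat(path).st_mode
    os.chmod(path, mode | stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH)
    return path


def caught_by_wrapper(link_name="rsh"):
    """Report which launcher spellings end up at the stand-in wrapper.

    link_name is the name of the link created in the stand-in $TMPDIR,
    mimicking what start_proc_args would do.
    """
    with tempfile.TemporaryDirectory() as root:
        tmpdir = os.path.join(root, "tmpdir")   # stands in for $TMPDIR
        sysbin = os.path.join(root, "sysbin")   # stands in for /usr/bin
        os.mkdir(tmpdir)
        os.mkdir(sysbin)

        # Hypothetical wrapper; in a real job this would be SGE's rsh-wrapper.
        wrapper = make_executable(os.path.join(root, "rsh_wrapper"))
        for name in ("rsh", "ssh"):
            make_executable(os.path.join(sysbin, name))
        os.symlink(wrapper, os.path.join(tmpdir, link_name))

        # The stand-in $TMPDIR is searched first, as when it is prepended to PATH.
        search_path = os.pathsep.join([tmpdir, sysbin])

        def is_caught(command):
            resolved = shutil.which(command, path=search_path)
            if resolved is None:
                return False
            return os.path.realpath(resolved) == os.path.realpath(wrapper)

        return {
            "rsh": is_caught("rsh"),
            "ssh": is_caught("ssh"),
            "absolute ssh": is_caught(os.path.join(sysbin, "ssh")),
            link_name: is_caught(link_name),
        }
```

With the default link name only the bare "rsh" is reported as caught. Passing "ssh" or "fubar" as the link name moves the catch to that name, and the absolute path is never caught in any of these cases.
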
>>> 
>>> Second: Did you compile Open MPI with --with-sge? Then --rsh
>>> shouldn't have any effect at all, as Open MPI will detect
>>> automatically that it's running under SGE.
>>> 
>>> Third: As said, the entries for rsh_command and rsh_daemon must
>>> match. When only the *_command entries are defined, the *_daemon entries
>>> will fall back to a default. When there is a mismatch, an rsh might try to
>>> contact an sshd, or the builtin mechanism an rshd. None of this will work.
>>> It is best to include both entries of each pair.
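
The third point can be turned into a quick sanity check. This is a rough Python sketch under stated assumptions: it works on configuration text pasted as "name value" lines, the function names are invented here, and the `family` classification (builtin, ssh-like, anything else) is only a heuristic for spotting obvious mismatches, not a rule taken from SGE.

```python
PAIRS = ("qlogin", "rlogin", "rsh")


def parse_conf(text):
    """Parse 'name value...' lines into a dict, skipping blanks and comments."""
    conf = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        parts = line.split(None, 1)
        conf[parts[0]] = parts[1].strip() if len(parts) > 1 else ""
    return conf


def family(value):
    """Heuristic classification of an entry: 'builtin', 'ssh' or 'other'."""
    if value == "builtin":
        return "builtin"
    words = value.split()
    program = words[0].rsplit("/", 1)[-1] if words else ""
    return "ssh" if "ssh" in program else "other"


def check_pairs(conf):
    """Return a list of human-readable problems for the three pairs."""
    problems = []
    for prefix in PAIRS:
        command = conf.get(prefix + "_command")
        daemon = conf.get(prefix + "_daemon")
        if command is None and daemon is None:
            continue  # pair not mentioned at all
        if command is None:
            problems.append(prefix + "_command is not set explicitly")
        elif daemon is None:
            problems.append(prefix + "_daemon is not set explicitly")
        elif family(command) != family(daemon):
            problems.append(
                prefix + "_command (" + command + ") does not match "
                + prefix + "_daemon (" + daemon + ")"
            )
    return problems
```

Fed the three lines from the original question, `check_pairs` flags rlogin_daemon and rsh_daemon as not set explicitly, which is the situation described above where the daemon side is left to a default.
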
>>> 
>>> -- Reuti
>>> 
>>> 
>>>> I'm considering rsh interception, but my first attempts (a few years
>>>> back now) were unsuccessful.
>>>> 
>>>> Any hints?
>>>> 
>>>> Thanks,
>>>> Regards,
>>>> Ionel
>>>> 
>>>> 
>>>> From: reuti <reuti at staff.uni-marburg.de>
>>>> To: users at gridengine.sunsource.net
>>>> Sent: Wed 6 January 2010, 01:56:40
>>>> Subject: Re: [GE users] What's the consequence if I removed these
>>>> lines from sge_conf
>>>> 
>>>> On 6 Jan 2010 at 01:40, kdoman wrote:
>>>> 
>>>>> What's the consequence of removing the lines below from sge conf? If I
>>>>> don't, we cannot submit any parallel jobs that request "-pe orte"
>>>>> greater than 4.
>>>>> 
>>>>> qrsh_command                 /usr/bin/ssh
>>>>> rsh_command                  /usr/bin/ssh
>>>>> rlogin_command               /usr/bin/ssh
>>>> 
>>>> The definition of each *_command must match that of the corresponding
>>>> *_daemon. Together they define which mechanism will be used to start
>>>> interactive jobs or slave tasks. You can have:
>>>> 
>>>> Classic rsh startup (e.g. for x86):
>>>> 
>>>> qlogin_command               /usr/bin/telnet
>>>> qlogin_daemon                /usr/sbin/in.telnetd
>>>> rlogin_command               /usr/sge/utilbin/lx24-x86/rlogin
>>>> rlogin_daemon                /usr/sbin/in.rlogind
>>>> rsh_command                  /usr/sge/utilbin/lx24-x86/rsh
>>>> rsh_daemon                   /usr/sge/utilbin/lx24-x86/rshd -l
>>>> 
>>>> All builtin:
>>>> 
>>>> qlogin_command               builtin
>>>> qlogin_daemon                builtin
>>>> rlogin_command               builtin
>>>> rlogin_daemon                builtin
>>>> rsh_command                  builtin
>>>> rsh_daemon                   builtin
>>>> 
>>>> or ssh according to:
>>>> 
>>>> http://gridengine.sunsource.net/howto/qrsh_qlogin_ssh.html
>>>> 
>>>> Each of the three pairs qlogin_*, rlogin_* and rsh_* must be consistent
>>>> in itself, but the pairs can of course differ from one another.
>>>> 
>>>> Also note that these entries can be overridden at the exechost
>>>> level, i.e. in its local configuration: qconf -mconf <exechost>
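
A toy model of that override, assuming a local entry simply replaces the global entry of the same name. Plain Python dicts stand in for the two configurations, the example values are hypothetical, and nothing here calls qconf.

```python
def effective_conf(global_conf, local_conf):
    """Return the settings one host would see: local entries win over global ones."""
    merged = dict(global_conf)
    merged.update(local_conf)
    return merged


# Hypothetical example: the cluster-wide setting is all builtin,
# but one exechost overrides only the command side of the rsh pair.
global_conf = {"rsh_command": "builtin", "rsh_daemon": "builtin"}
local_conf = {"rsh_command": "/usr/bin/ssh"}

on_host = effective_conf(global_conf, local_conf)
```

In this example `on_host` ends up with an ssh command paired with a builtin daemon, so a partial override on a single host is enough to create the kind of mismatch discussed in this thread. It is worth looking at the local configuration of the affected hosts as well as the global one.
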
>>>> 
>>>> -- Reuti
>>>> 
>>>> 
>>>>> Without the above modification, any job submission with -pe orte
>>>>> greater than 4 receives this error:
>>>>> 
>>>>> error: error: ending connection before all data received
>>>>> error:
>>>>> error reading job context from "qlogin_starter"
>>>>> 
>>>>> --------------------------------------------------------------------------
>>>>> A daemon (pid 2160) died unexpectedly with status 1 while attempting
>>>>> to launch so we are aborting.
>>>>> 
>>>>> There may be more information reported by the environment (see above).
>>>>> 
>>>>> This may be because the daemon was unable to find all the needed shared
>>>>> libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
>>>>> location of the shared libraries on the remote nodes and this will
>>>>> automatically be forwarded to the remote nodes.
>>>>> --------------------------------------------------------------------------
>>>>> --------------------------------------------------------------------------
>>>>> mpirun noticed that the job aborted, but has no info as to the process
>>>>> that caused that situation.
>>>>> --------------------------------------------------------------------------
>>>>> mpirun: clean termination accomplished
>>>>> 
>>>>> 
>>>>> Thanks.
>>>>> K.
>>>>> 
>>>>> ------------------------------------------------------
>>>>> http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=236695
>>>>> 
>>>>> To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].
>>>> 

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=236917

To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].


