[GE users] What's the consequence if I removed these lines from sge_conf

reuti reuti at staff.uni-marburg.de
Wed Jan 6 21:21:37 GMT 2010



On 06.01.2010 at 22:12, igardais wrote:

> I'm not kdoman and do not use orted :)
> I just jumped into the thread.

Aha, we should set up a rule for this year: everyone who hijacks a
thread and generates confusion will be banned from the list for one
week :-D

And one additional week for each additional offense.

-- Reuti


> I'll try to find the right combination of options.
>
> Ionel
>
>
> On 6 Jan 2010 at 21:20, reuti wrote:
>
>> On 06.01.2010 at 20:36, igardais wrote:
>>
>>> OK.
>>> For point 3 : *_command are all set to the default builtin setting
>>> For point 2 : we are using Intel MPI (mostly 3.1 and 3.2). Even if
>>> based on Open MPI, I don't know if SGE integration has been ported.
>>
>> No, Intel MPI is based on MPICH2. I thought you would use Open MPI,
>> as you used the PE orted. When the PE orted is just defined for Open
>> MPI by the admin, it won't work for Intel MPI.
>>
>> For Intel MPI you can use this:
>> http://gridengine.sunsource.net/howto/mpich2-integration/mpich2-integration.html
>> and create a PE accordingly. It's the mpd startup mechanism.
>>
>>
>>> For point 1 : does --rsh need to be set ? catch_rsh seems to
>>> prepend the 'rsh' link in the PATH variable but according to point
>>> 2, I don't know if the default --rsh command is just 'rsh' or its
>>> full path.
>>
>> If it's unknown what's compiled into the binary, it's best to set it
>> explicitly in the jobscript or in one of the sge_request files as a
>> default, to be on the safe side. -catch_rsh will only work when the
>> call used is a plain "rsh". If it's "ssh", you have to adjust the
>> link created in start_proc_args to be "ssh" and not "rsh". And with
>> any absolute path it will never work, as it can't be caught this way.
>>
>> -- Reuti
>>
>>
>>>
>>> Ionel
>>>
>>>
>>> On 6 Jan 2010 at 18:52, reuti wrote:
>>>
>>>> Hi,
>>>>
>>>> On 06.01.2010 at 07:53, igardais wrote:
>>>>
>>>>> What about rsh interception when using "builtin" commands?
>>>>> All my MPI scripts specify "--rsh=/usr/bin/ssh" to use the classic
>>>>> key-based password-less login, but with little control over the job.
>>>>
>>>> Three things.
>>>>
>>>> First: correct. This absolute path will bypass any job control
>>>> imposed by SGE. The idea behind -catch_rsh in the PE definition
>>>> is:
>>>>
>>>> - SGE will create a link called "rsh" in $TMPDIR on the master node
>>>> of the parallel job, which will point to SGE's rsh-wrapper. It's
>>>> important to realize that at this point "rsh" is just a name and is
>>>> not related to any startup mechanism at all. You can even tell your
>>>> application "--rsh=fubar" and create a link called "fubar" in
>>>> $TMPDIR. This is usually done in the start_proc_args defined in
>>>> the PE.
>>>>
>>>> - Then SGE's rsh-wrapper will be called, which will use
>>>> "qrsh -inherit ..." to get to the other slave tasks.
>>>>
>>>> - The "qrsh -inherit ..." will use one of the three startup
>>>> mechanisms mentioned below. For rsh and ssh, a dedicated daemon
>>>> (rshd/sshd) will be started by SGE on a dedicated port just for this
>>>> one call. It's not necessary to have sshd/rshd running all the time.
>>>> This way you can have a cluster where no user can log in to a node
>>>> but can still start tasks between the nodes this way.
>>>>
>>>> Second: Did you compile Open MPI with --with-sge? Then --rsh
>>>> shouldn't have any effect at all, as Open MPI will detect
>>>> automatically that it's running under SGE.
>>>>
>>>> Third: As said, the entries for rsh_command and rsh_daemon must
>>>> match. When only the *_commands are defined, the *_daemons will have
>>>> a default. When there is a mismatch, an rsh might try to contact an
>>>> sshd, or the -builtin- mechanism an rshd. None of this will work. It
>>>> is best to include both entries of each pair.
>>>>
>>>> -- Reuti
>>>>
>>>>
>>>>> I'm considering rsh interception, but my first attempts (a few years
>>>>> back now) were unsuccessful.
>>>>>
>>>>> Any hints?
>>>>>
>>>>> Thanks,
>>>>> Regards,
>>>>> Ionel
>>>>>
>>>>>
>>>>> From: reuti <reuti at staff.uni-marburg.de>
>>>>> To: users at gridengine.sunsource.net
>>>>> Sent: Wed 6 January 2010, 1:56:40
>>>>> Subject: Re: [GE users] What's the consequence if I removed these
>>>>> lines from sge_conf
>>>>>
>>>>> On 06.01.2010 at 01:40, kdoman wrote:
>>>>>
>>>>>> What's the consequence of removing the lines below from sge conf? If I
>>>>>> don't, we cannot submit any parallel jobs that request "-pe orte"
>>>>>> greater than 4.
>>>>>>
>>>>>> qrsh_command                /usr/bin/ssh
>>>>>> rsh_command                 /usr/bin/ssh
>>>>>> rlogin_command              /usr/bin/ssh
>>>>>
>>>>> The definition of the *_command entries must match that of the
>>>>> *_daemon entries. They define which mechanism will be used to start
>>>>> interactive jobs or slave tasks. You can have:
>>>>>
>>>>> Classic rsh startup (e.g. for x86):
>>>>>
>>>>> qlogin_command              /usr/bin/telnet
>>>>> qlogin_daemon               /usr/sbin/in.telnetd
>>>>> rlogin_command              /usr/sge/utilbin/lx24-x86/rlogin
>>>>> rlogin_daemon               /usr/sbin/in.rlogind
>>>>> rsh_command                 /usr/sge/utilbin/lx24-x86/rsh
>>>>> rsh_daemon                  /usr/sge/utilbin/lx24-x86/rshd -l
>>>>>
>>>>> All builtin:
>>>>>
>>>>> qlogin_command              builtin
>>>>> qlogin_daemon               builtin
>>>>> rlogin_command              builtin
>>>>> rlogin_daemon               builtin
>>>>> rsh_command                 builtin
>>>>> rsh_daemon                  builtin
>>>>>
>>>>> or ssh according to:
>>>>>
>>>>> http://gridengine.sunsource.net/howto/qrsh_qlogin_ssh.html
>>>>>
>>>>> Each of the three pairs qlogin_*, rlogin_* and rsh_* must be
>>>>> consistent within itself, but the pairs can of course differ from
>>>>> each other.
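
As a rough aid, the pairing rule can be checked mechanically. The
following Python sketch is an assumption-laden illustration and not part
of SGE: parse_conf and check_pairs are hypothetical helpers, the input is
assumed to be plain "name value" lines as shown in this thread, and the
only rule applied is a heuristic (warn if just one half of a pair is set,
or if "builtin" is mixed with an external program). It cannot tell
whether, say, an rsh client really matches the daemon it is paired with.

```python
def parse_conf(text):
    """Parse 'name value' lines into a dict.

    Everything after the first whitespace run is kept as the value,
    so daemon arguments such as '-l' are preserved.
    """
    conf = {}
    for line in text.splitlines():
        parts = line.split(None, 1)
        if len(parts) == 2:
            conf[parts[0]] = parts[1].strip()
    return conf


def check_pairs(conf):
    """Return a list of warnings for the qlogin/rlogin/rsh pairs."""
    warnings = []
    for prefix in ("qlogin", "rlogin", "rsh"):
        command = conf.get(prefix + "_command")
        daemon = conf.get(prefix + "_daemon")
        if command is None and daemon is None:
            continue  # pair not configured here at all
        if command is None or daemon is None:
            # The missing half falls back to a default that may not match.
            warnings.append(prefix + ": only one of command/daemon is set")
        elif (command == "builtin") != (daemon == "builtin"):
            warnings.append(prefix + ": builtin mixed with an external program")
    return warnings


if __name__ == "__main__":
    snippet = """
    rsh_command                 /usr/bin/ssh
    rlogin_command              /usr/bin/ssh
    """
    for warning in check_pairs(parse_conf(snippet)):
        print(warning)
```

Since the entries can also be set in a host's local configuration, a
check like this would have to be run on the settings that are actually in
effect on each exechost.
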
>>>>>
>>>>> Also note that these entries can be overridden at the exechost
>>>>> level, i.e. in its local configuration: qconf -mconf <exechost>
>>>>>
>>>>> -- Reuti
>>>>>
>>>>>
>>>>>> Without the above modification, any job submission with -pe orte
>>>>>> greater than 4 would receive this error:
>>>>>>
>>>>>> error: error: ending connection before all data received
>>>>>> error:
>>>>>> error reading job context from "qlogin_starter"
>>>>>> --------------------------------------------------------------------------
>>>>>> A daemon (pid 2160) died unexpectedly with status 1 while attempting
>>>>>> to launch so we are aborting.
>>>>>>
>>>>>> There may be more information reported by the environment (see above).
>>>>>>
>>>>>> This may be because the daemon was unable to find all the needed shared
>>>>>> libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
>>>>>> location of the shared libraries on the remote nodes and this will
>>>>>> automatically be forwarded to the remote nodes.
>>>>>> --------------------------------------------------------------------------
>>>>>> --------------------------------------------------------------------------
>>>>>> mpirun noticed that the job aborted, but has no info as to the process
>>>>>> that caused that situation.
>>>>>> --------------------------------------------------------------------------
>>>>>> mpirun: clean termination accomplished
>>>>>>
>>>>>>
>>>>>> Thanks.
>>>>>> K.
>>>>>>
>>>>>> ------------------------------------------------------
>>>>>> http://gridengine.sunsource.net/ds/viewMessage.do?
>>>>>> dsForumId=38&dsMessageId=236695
>>>>>>
>>>>>> To unsubscribe from this discussion, e-mail: [users-
>>>>>> unsubscribe at gridengine.sunsource.net].

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=236921

To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].


