[GE users] What's the consequence if I removed these lines from sge_conf

reuti reuti at staff.uni-marburg.de
Wed Jan 6 17:52:31 GMT 2010


Hi,

On 06.01.2010 at 07:53, igardais wrote:

> What about rsh interception when using "builtin" commands ?
> All my mpi scripts specify "--rsh=/usr/bin/ssh" to use the classic  
> key-based password-less login but with little control over the job.

Three things.

First: correct. This absolute path will bypass any job control
imposed by SGE. The idea behind the -catch_rsh in the PE definition is:

- SGE will create a link called "rsh" in $TMPDIR on the master node
of the parallel job, which will point to SGE's rsh-wrapper. It's
important to realize that at this point "rsh" is just a name and is
not tied to any startup mechanism at all. You can even tell your
application "--rsh=fubar" and create a link called "fubar" in $TMPDIR
instead. The link is usually created by the script defined in
start_proc_args in the PE.

- Then SGE's rsh-wrapper will be called, which will use
"qrsh -inherit ..." to get to the other slave tasks.

- The "qrsh -inherit ..." will use one of the three startup
mechanisms mentioned below (classic rsh, builtin or ssh). For rsh and
ssh a dedicated daemon rshd/sshd will be started by SGE on a dedicated
port just for this one call. It's not necessary to have sshd/rshd
running all the time. This way you can have a cluster where no user
can log in to a node but can still start tasks between the nodes.
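The link trick from the first point can be tried without SGE. The
following is only a sketch: the temporary directory stands in for
$TMPDIR, "fake-wrapper" stands in for SGE's rsh-wrapper and merely
prints a line, and the names demo_tmpdir, fake-wrapper and fubar are
made up for illustration. The exact arguments the real wrapper hands
to qrsh are not shown here.

```shell
#!/bin/sh
# Stand-in for the $TMPDIR that SGE provides to a real job (assumption).
demo_tmpdir=$(mktemp -d)

# Stand-in for SGE's rsh-wrapper: it only prints what it was asked to do.
cat > "$demo_tmpdir/fake-wrapper" <<'EOF'
#!/bin/sh
echo "would run: qrsh -inherit $*"
EOF
chmod +x "$demo_tmpdir/fake-wrapper"

# The link name is arbitrary; "rsh" or "fubar" behave the same way.
ln -s "$demo_tmpdir/fake-wrapper" "$demo_tmpdir/fubar"

# With the directory first in PATH, the bare name resolves to the link.
# An absolute path such as /usr/bin/ssh would never pass through it.
result=$(PATH="$demo_tmpdir:$PATH"; fubar node02 hostname)
echo "$result"

rm -rf "$demo_tmpdir"
```

This is why the application must be given a bare name (or nothing at
all) rather than an absolute path: only a name looked up via PATH can
end up at the link.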

Second: Did you compile Open MPI with --with-sge? Then --rsh  
shouldn't have any effect at all, as Open MPI will detect  
automatically that it's running under SGE.
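One way to check the second point is to look for a gridengine
component in the output of ompi_info. A small sketch, assuming
ompi_info from the Open MPI installation used by the job is in PATH;
the helper name has_sge_support is made up:

```shell
#!/bin/sh
# Succeeds if the text on stdin (ompi_info output) mentions a
# gridengine component. has_sge_support is a hypothetical helper name.
has_sge_support() {
    grep -q 'gridengine'
}

# Only query ompi_info if it is actually installed on this machine.
if command -v ompi_info >/dev/null 2>&1; then
    if ompi_info | has_sge_support; then
        echo "Open MPI reports a gridengine component"
    else
        echo "no gridengine component found"
    fi
fi
```

If no gridengine component shows up, Open MPI was most likely built
without SGE support and will fall back to its own rsh/ssh startup.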

Third: As said, the entries for rsh_command and rsh_daemon must
match. When only the *_command entries are defined, the *_daemon
entries will fall back to their defaults. When there is a mismatch,
an rsh might try to contact an sshd, or the -builtin- mechanism an
rshd. None of this will work. It is best to define both entries of
each pair.
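A rough way to spot the mismatch described in the third point is to
scan the configuration for incomplete or mixed pairs. The function
below is only a heuristic sketch: check_pairs is a made-up name, it
reads "name value" lines on stdin, and it only flags pairs where one
entry is missing or where builtin is mixed with an external program.
It cannot tell whether an rsh binary is paired with an sshd.

```shell
#!/bin/sh
# check_pairs (hypothetical helper): reads "name value" lines on stdin
# and reports qlogin/rlogin/rsh pairs that look inconsistent.
check_pairs() {
    awk '
        # Remember the first word of the value for each relevant entry.
        $1 ~ /^(qlogin|rlogin|rsh)_(command|daemon)$/ { val[$1] = $2 }
        END {
            n = split("qlogin rlogin rsh", kinds, " ")
            for (i = 1; i <= n; i++) {
                c = kinds[i] "_command"; d = kinds[i] "_daemon"
                if ((c in val) != (d in val))
                    print kinds[i] ": only one of the pair is defined"
                else if ((c in val) && (val[c] == "builtin") != (val[d] == "builtin"))
                    print kinds[i] ": builtin mixed with an external program"
            }
        }
    '
}

# Sample input: the rsh and rlogin lines from the original question.
check_pairs <<'EOF'
rsh_command                  /usr/bin/ssh
rlogin_command               /usr/bin/ssh
EOF
```

On a live system the input could come from "qconf -sconf", and from
"qconf -sconf <exechost>" for a host's local configuration.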

-- Reuti


> I'm considering rsh-interception but my first attempts (a few years
> back now) were unsuccessful.
>
> Any hints ?
>
> Thanks,
> Regards,
> Ionel
>
>
> From: reuti <reuti at staff.uni-marburg.de>
> To: users at gridengine.sunsource.net
> Sent: Wed 6 January 2010, 1:56:40
> Subject: Re: [GE users] What's the consequence if I removed these
> lines from sge_conf
>
> On 06.01.2010 at 01:40, kdoman wrote:
>
> > What's the consequence of removing the lines below from sge conf?
> > If I don't, we cannot submit any parallel jobs that request "-pe orte"
> > greater than 4.
> >
> > qrsh_command                /usr/bin/ssh
> > rsh_command                  /usr/bin/ssh
> > rlogin_command              /usr/bin/ssh
>
> The definition of the *_command must match that of the
> *_daemon. It defines what mechanism will be used to start interactive
> jobs or slave tasks. You can have:
>
> Classic rsh startup (e.g. for x86):
>
> qlogin_command              /usr/bin/telnet
> qlogin_daemon                /usr/sbin/in.telnetd
> rlogin_command              /usr/sge/utilbin/lx24-x86/rlogin
> rlogin_daemon                /usr/sbin/in.rlogind
> rsh_command                  /usr/sge/utilbin/lx24-x86/rsh
> rsh_daemon                  /usr/sge/utilbin/lx24-x86/rshd -l
>
> All builtin:
>
> qlogin_command              builtin
> qlogin_daemon                builtin
> rlogin_command              builtin
> rlogin_daemon                builtin
> rsh_command                  builtin
> rsh_daemon                  builtin
>
> or ssh according to:
>
> http://gridengine.sunsource.net/howto/qrsh_qlogin_ssh.html
>
> The three pairs qlogin_*, rlogin_* and rsh_* must each be consistent,
> but can of course differ from one pair to the next.
>
> Also note that these entries can be overridden on an exechost
> level, i.e. in its local configuration: qconf -mconf <exechost>
>
> -- Reuti
>
>
> > Without the above modification, any job submission with -pe orte
> > greater than 4 would receive this error:
> >
> > error: error: ending connection before all data received
> > error:
> > error reading job context from "qlogin_starter"
> > --------------------------------------------------------------------------
> > A daemon (pid 2160) died unexpectedly with status 1 while attempting
> > to launch so we are aborting.
> >
> > There may be more information reported by the environment (see above).
> >
> > This may be because the daemon was unable to find all the needed shared
> > libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
> > location of the shared libraries on the remote nodes and this will
> > automatically be forwarded to the remote nodes.
> > --------------------------------------------------------------------------
> > --------------------------------------------------------------------------
> > mpirun noticed that the job aborted, but has no info as to the process
> > that caused that situation.
> > --------------------------------------------------------------------------
> > mpirun: clean termination accomplished
> >
> >
> > Thanks.
> > K.
> >
> > ------------------------------------------------------
> > http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=236695
> >
> > To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].
>
> ------------------------------------------------------
> http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=236698

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=236874
