[GE users] Rmpi under SGE

arnuschky arne.brutschy at ulb.ac.be
Fri Dec 17 11:58:55 GMT 2010


Ah. My previous message was slightly premature: Rmpi jobs with more than
20 slaves still fail (even with Reuti's fixes):

        $ cat test-mpi-17942.e3480568
        error: got no connection within 60 seconds. "Timeout occured while waiting for connection"
        error: got no connection within 60 seconds. "Timeout occured while waiting for connection"
        --------------------------------------------------------------------------
        A daemon (pid 8473) died unexpectedly with status 1 while attempting
        to launch so we are aborting.
        
        There may be more information reported by the environment (see above).
        
        This may be because the daemon was unable to find all the needed shared
        libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
        location of the shared libraries on the remote nodes and this will
        automatically be forwarded to the remote nodes.
        --------------------------------------------------------------------------
        
The messages file in the qmaster spool directory lists:

    12/17/2010 12:53:19|worker|majorana|E|tightly integrated parallel task 3480568.1 task 2.compute-2-9 failed - killing job
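
For anyone digging through the same file, here is a minimal Python sketch that splits such a line into its pipe-separated fields. The field names (timestamp, thread, host, level, text) are assumptions read off this one line, not taken from the SGE documentation, and `parse_qmaster_line` is a hypothetical helper.

```python
from typing import NamedTuple


class QmasterMessage(NamedTuple):
    # Field names are guesses based on the single line shown above.
    timestamp: str
    thread: str
    host: str
    level: str
    text: str


def parse_qmaster_line(line: str) -> QmasterMessage:
    """Split one messages-file line on its first four '|' separators."""
    # maxsplit=4 keeps any '|' inside the free-text part intact.
    parts = line.strip().split("|", 4)
    if len(parts) != 5:
        raise ValueError("unexpected number of fields: %r" % (line,))
    return QmasterMessage(*parts)


line = ("12/17/2010 12:53:19|worker|majorana|E|tightly integrated parallel "
        "task 3480568.1 task 2.compute-2-9 failed - killing job")
msg = parse_qmaster_line(line)
print(msg.level, msg.host, msg.text)
```

Filtering on `msg.level == "E"` would then pick out the error entries for a given job.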

Any idea what's going wrong now? 60 seconds is quite a long timeout, so I
guess that this is not a network timeout...
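
Regarding the duplicate entries in `qconf -sconf` that Reuti asks about in the quoted mail below, a check for them could look roughly like this in Python. This is only a sketch: `duplicate_keys` is a hypothetical helper, the sample text is invented for illustration and is not our actual configuration, and the handling of backslash-continued lines is an assumption about how long values are wrapped.

```python
from collections import Counter
from typing import List


def duplicate_keys(conf_text: str) -> List[str]:
    """Return parameter names that occur more than once in the given text."""
    counts = Counter()
    continued = False  # True if the previous line ended with a backslash
    for raw in conf_text.splitlines():
        line = raw.strip()
        is_continuation = continued
        continued = line.endswith("\\")
        # Skip wrapped value lines, blank lines and comment/header lines.
        if is_continuation or not line or line.startswith("#"):
            continue
        counts[line.split(None, 1)[0]] += 1
    return sorted(key for key, n in counts.items() if n > 1)


# Invented sample, for illustration only.
SAMPLE = """\
#global:
execd_spool_dir   /opt/gridengine/default/spool
rsh_command       builtin
rsh_daemon        builtin
rsh_command       /usr/bin/ssh
rsh_daemon        /usr/sbin/sshd -i
"""

print(duplicate_keys(SAMPLE))
```

On a real cluster the text would come from the `qconf -sconf` output instead of `SAMPLE`, for example by capturing it with `subprocess.run`.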

Arne


On Fri, 2010-12-17 at 11:28 +0100, reuti wrote:
> Hi,
> 
> On 17.12.2010 at 11:04, arnuschky wrote:
> 
> > We're having massive problems using Rmpi with OpenMPI under SGE. OpenMPI
> > is tested and works fine. We're submitting one master Rscript, which
> > in turn spawns the required slaves using Rmpi. Unfortunately, this
> > fails:
> > 
> >        $ cat testRmpi.e3480556
> >        Warning: Permanently added 'compute-1-13.local' (RSA) to the list of known hosts.
> >        Warning: Permanently added 'compute-1-10.local' (RSA) to the list of known hosts.
> >        Warning: Permanently added 'compute-1-11.local' (RSA) to the list of known hosts.
> >        Warning: Permanently added 'compute-1-12.local' (RSA) to the list of known hosts.
> >        Warning: Permanently added 'compute-1-14.local' (RSA) to the list of known hosts.
> >        Permission denied, please try again.
> 
> When Open MPI is tightly integrated with SGE, I would assume SGE is configured to use "ssh". What is the output of `qconf -sconf`? There might be duplicate entries.
> 
> http://marc.info/?l=npaci-rocks-discussion&m=126411729709528
> 
> If you want or need to use ssh, you need either passphraseless ssh keys (deprecated) or hostbased authentication:
> 
> http://gridengine.sunsource.net/howto/hostbased-ssh.html
> 
> -- Reuti
> 
> 
> >        Permission denied, please try again.
> >        Permission denied (publickey,gssapi-with-mic,password).
> >        Permission denied, please try again.
> >        Permission denied, please try again.
> >        Permission denied (publickey,gssapi-with-mic,password).
> >        Permission denied, please try again.
> >        Permission denied, please try again.
> >        Permission denied, please try again.
> >        Permission denied (publickey,gssapi-with-mic,password).
> >        Permission denied, please try again.
> >        Permission denied (publickey,gssapi-with-mic,password).
> >        Permission denied, please try again.
> >        Permission denied, please try again.
> >        Permission denied (publickey,gssapi-with-mic,password).
> >        --------------------------------------------------------------------------
> >        A daemon (pid 26953) died unexpectedly with status 129 while attempting
> >        to launch so we are aborting.
> > 
> >        There may be more information reported by the environment (see above).
> > 
> >        This may be because the daemon was unable to find all the needed shared
> >        libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
> >        location of the shared libraries on the remote nodes and this will
> >        automatically be forwarded to the remote nodes.
> >        --------------------------------------------------------------------------
> >        mpirun: clean termination accomplished
> > 
> > We're using openmpi-1.3.3 (--with-sge) and SGE V62u4.
> > 
> > Any hints on what's going wrong here?
> > 
> > Cheers,
> > Arne
> > 
> > -- 
> > Arne Brutschy
> > Ph.D. Student                    Email    arne.brutschy(AT)ulb.ac.be
> > IRIDIA CP 194/6                  Web      iridia.ulb.ac.be/~abrutschy
> > Universite' Libre de Bruxelles   Tel      +32 2 650 2273
> > Avenue Franklin Roosevelt 50     Fax      +32 2 650 2715
> > 1050 Bruxelles, Belgium          (Fax at IRIDIA secretary)
> > 
> > ------------------------------------------------------
> > http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=306389
> > 
> > To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].




