[GE users] Rmpi under SGE

arnuschky arne.brutschy at ulb.ac.be
Fri Dec 17 17:01:00 GMT 2010


Thanks for the suggestion, Mat. I've tested it, but unfortunately the problem
persists after restarting the sgeexecd daemons.

Cheers,
Arne
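
One way to approach Reuti's question quoted below (does it happen on all
hosts?) is to tally the failing hosts from the qmaster messages file. The
following is only a minimal Python sketch. It assumes every failure is logged
in the same form as the single qmaster line quoted below, and that the task id
has the form "<number>.<hostname>". The function name failed_hosts and the
regular expression are illustrative and not part of SGE.

```python
import re
from collections import Counter

# Pattern modelled on the single qmaster "messages" line quoted in this
# thread; other SGE versions may word the message differently.
FAILED_TASK = re.compile(
    r"tightly integrated parallel task (?P<job>\d+)\.(?P<ja_task>\d+) "
    r"task (?P<pe_task>\S+) failed"
)


def failed_hosts(lines):
    """Count failed tightly integrated parallel tasks per host."""
    counts = Counter()
    for line in lines:
        match = FAILED_TASK.search(line)
        if match is None:
            continue
        pe_task = match.group("pe_task")
        # Assumption: the PE task id looks like "<number>.<hostname>".
        _, _, host = pe_task.partition(".")
        counts[host or pe_task] += 1
    return counts


# The one log line available from this thread, used as sample input.
sample = [
    "12/17/2010 12:53:19|worker|majorana|E|tightly integrated parallel task "
    "3480568.1 task 2.compute-2-9 failed - killing job",
]

for host, count in failed_hosts(sample).most_common():
    print(host, count)
```

In practice the lines would be read from the qmaster spool messages file,
whose location depends on the installation. If a few hosts dominate the count,
that would point to a node-specific problem rather than a limit on the number
of slaves.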

On Fri, 2010-12-17 at 15:01 +0000, matbradford wrote:
> Arne,
> 
> We are using OpenMPI 1.4.2 and we had a very similar problem, with exactly
> the same message, when using more than a single node.
> 
> I managed to get rid of the problem by restarting the execution daemons.
> Don't know why it fixed the problem, but it did.
> 
> The problem has returned a couple of times over a period of about a month,
> but restarting the sgeexecd daemons always seems to fix it.
> 
> No other messages in the log files indicate anything unusual.
> 
> Cheers,
> 
> Mat
> 
> 
> 
> -----Original Message-----
> From: reuti [mailto:reuti at staff.uni-marburg.de] 
> Sent: 17 December 2010 14:03
> To: users at gridengine.sunsource.net
> Subject: Re: [GE users] Rmpi under SGE
> 
> On 17.12.2010 at 12:58, arnuschky wrote:
> 
> > Ah. My previous message was slightly premature: Rmpi jobs with > 20 slaves
> > still fail (even with Reuti's fixes):
> > 
> >        $ cat test-mpi-17942.e3480568
> >        error: got no connection within 60 seconds. "Timeout occured while waiting for connection"
> >        error: got no connection within 60 seconds. "Timeout occured while waiting for connection"
> 
> You are now using the plain -builtin- startup method? Does it happen on all
> hosts for such a job?
> 
> Maybe it's something specific to some nodes, and on those hosts it would
> happen with fewer slots too.
> 
> -- Reuti
> 
> 
> > --------------------------------------------------------------------------
> >        A daemon (pid 8473) died unexpectedly with status 1 while attempting
> >        to launch so we are aborting.
> > 
> >        There may be more information reported by the environment (see above).
> > 
> >        This may be because the daemon was unable to find all the needed shared
> >        libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
> >        location of the shared libraries on the remote nodes and this will
> >        automatically be forwarded to the remote nodes.
> > --------------------------------------------------------------------------
> > 
> > Qmaster spool messages list:
> > 
> >    12/17/2010 12:53:19|worker|majorana|E|tightly integrated parallel task 3480568.1 task 2.compute-2-9 failed - killing job
> > 
> > Any idea what's going wrong now? 60 seconds is quite a long timeout, so I
> > guess that this is not a network timeout...
> > 
> > Arne
> 
> ------------------------------------------------------
> http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=306450
> 
> To unsubscribe from this discussion, e-mail:
> [users-unsubscribe at gridengine.sunsource.net].
> 
> ------------------------------------------------------
> http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=306469

-- 
Arne Brutschy
Ph.D. Student                    Email    arne.brutschy(AT)ulb.ac.be
IRIDIA CP 194/6                  Web      iridia.ulb.ac.be/~abrutschy
Université Libre de Bruxelles    Tel      +32 2 650 2273
Avenue Franklin Roosevelt 50     Fax      +32 2 650 2715
1050 Bruxelles, Belgium          (Fax at IRIDIA secretary)

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=306502
