[GE users] Rmpi under SGE
arnuschky
arne.brutschy at ulb.ac.be
Fri Dec 17 18:07:41 GMT 2010
> This can also be set during job submission time:
>
> $ qsub -V job.sh
I tried this as well, but it didn't change anything. Anyway, I think Mat
didn't start the sgeexecd daemons by hand; he just restarted them (via the
init script, I assume). Or am I missing something here?
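
For reference, a minimal sketch of a job script that makes it easy to check which environment a job actually sees on the execution host. The file name job.sh comes from the quoted command; the function name report_env and the choice of variables to print are assumptions made for illustration. The embedded `#$ -V` line requests the same environment export as `qsub -V`.

```shell
#!/bin/sh
# Hypothetical job.sh: a sketch for inspecting the job environment,
# not a fix for the timeout itself.

# SGE directives (treated as plain comments when run outside SGE):
# export the submission environment, and run in the submission directory.
#$ -V
#$ -cwd

# Print the variables that decide whether shared libraries and
# binaries can be found on the node where this runs.
report_env() {
    echo "host: $(uname -n)"
    echo "LD_LIBRARY_PATH: ${LD_LIBRARY_PATH:-<unset>}"
    echo "PATH: ${PATH:-<unset>}"
}

report_env
```

Comparing this output between a node where jobs start and one where they fail would show whether the environment differs between them.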
Cheers,
Arne
> >
> > The problem has returned a couple of times over a period of about a month,
> > but restarting the sgeexecd daemons always seems to fix it.
> >
> > No other messages in the log files indicate anything unusual.
> >
> > Cheers,
> >
> > Mat
> >
> >
> >
> > -----Original Message-----
> > From: reuti [mailto:reuti at staff.uni-marburg.de]
> > Sent: 17 December 2010 14:03
> > To: users at gridengine.sunsource.net
> > Subject: Re: [GE users] Rmpi under SGE
> >
> > On 17.12.2010 at 12:58, arnuschky wrote:
> >
> >> Ah. My previous message was slightly premature: Rmpi jobs with more
> >> than 20 slaves still fail (even with Reuti's fixes):
> >>
> >> $ cat test-mpi-17942.e3480568
> >> error: got no connection within 60 seconds. "Timeout occured while waiting for connection"
> >> error: got no connection within 60 seconds. "Timeout occured while waiting for connection"
> >
> > You are now using the plain -builtin- startup method? Does it happen on all
> > hosts for such a job?
> >
> > Maybe it's something specific to some nodes, and on those hosts it would
> > happen with fewer slots too.
> >
> > -- Reuti
> >
> >
> >>
> >> --------------------------------------------------------------------------
> >> A daemon (pid 8473) died unexpectedly with status 1 while attempting
> >> to launch so we are aborting.
> >>
> >> There may be more information reported by the environment (see above).
> >>
> >> This may be because the daemon was unable to find all the needed shared
> >> libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
> >> location of the shared libraries on the remote nodes and this will
> >> automatically be forwarded to the remote nodes.
> >>
> >> --------------------------------------------------------------------------
> >>
> >> Qmaster spool messages list:
> >>
> >> 12/17/2010 12:53:19|worker|majorana|E|tightly integrated parallel task 3480568.1 task 2.compute-2-9 failed - killing job
> >>
> >> Any idea what's going wrong now? 60 seconds is quite a long timeout, so
> >> I guess that this is not a network timeout...
> >>
> >> Arne
> >
> > ------------------------------------------------------
> > http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=306450
> >
> > To unsubscribe from this discussion, e-mail:
> > [users-unsubscribe at gridengine.sunsource.net].
--
Arne Brutschy
Ph.D. Student Email arne.brutschy(AT)ulb.ac.be
IRIDIA CP 194/6 Web iridia.ulb.ac.be/~abrutschy
Universite' Libre de Bruxelles Tel +32 2 650 2273
Avenue Franklin Roosevelt 50 Fax +32 2 650 2715
1050 Bruxelles, Belgium (Fax at IRIDIA secretary)
------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=306524