[GE users] Rmpi under SGE

arnuschky arne.brutschy at ulb.ac.be
Fri Dec 17 18:07:41 GMT 2010


> This can also be set during job submission time:
> 
> $ qsub -V job.sh

I tried this as well, but it didn't change anything. Anyway, I think Mat
didn't start the sgeexecd's by hand - he just restarted the daemon (via the
init script, I assume). Or am I missing something here?
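
For reference, the environment export requested by the quoted `qsub -V` can also be embedded in the job script itself. This is only a minimal sketch, assuming the script is the `job.sh` from the quoted command; the `report_lib_path` helper is hypothetical and just prints the LD_LIBRARY_PATH that the job actually sees, which the Open MPI error text quoted below suggests checking:

```shell
#!/bin/sh
# Embedded SGE options: lines starting with "#$" are read by qsub and are
# plain comments to the shell, so this file also runs outside the scheduler.
#$ -S /bin/sh
#$ -V
#$ -cwd

# Hypothetical helper (not from the thread): print the library path the
# job sees, or "unset" if the variable is empty or missing.
report_lib_path() {
    echo "LD_LIBRARY_PATH=${LD_LIBRARY_PATH:-unset}"
}

report_lib_path
```

Submitted with a plain `qsub job.sh`, the output file should then show whether the submission environment reached the job.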

Cheers,
Arne


> > 
> > The problem has returned a couple of times over a period of about a month,
> > but restarting the sgeexecd daemons always seems to fix it.
> > 
> > No other messages in the log files indicate anything unusual.
> > 
> > Cheers,
> > 
> > Mat
> > 
> > 
> > 
> > -----Original Message-----
> > From: reuti [mailto:reuti at staff.uni-marburg.de] 
> > Sent: 17 December 2010 14:03
> > To: users at gridengine.sunsource.net
> > Subject: Re: [GE users] Rmpi under SGE
> > 
> > On 17.12.2010 at 12:58, arnuschky wrote:
> > 
> >> Ah. My previous message was slightly premature: Rmpi jobs with > 20 slaves
> >> still fail (even with Reuti's fixes):
> >> 
> >>       $ cat test-mpi-17942.e3480568
> >>       error: got no connection within 60 seconds. "Timeout occured while waiting for connection"
> >>       error: got no connection within 60 seconds. "Timeout occured while waiting for connection"
> > 
> > You are now using the plain -builtin- startup method? Does it happen on all
> > hosts for such a job?
> > 
> > Maybe it's something specific to some nodes, and on those hosts it would
> > happen with fewer slots too.
> > 
> > -- Reuti
> > 
> > 
> >> --------------------------------------------------------------------------
> >>       A daemon (pid 8473) died unexpectedly with status 1 while attempting
> >>       to launch so we are aborting.
> >> 
> >>       There may be more information reported by the environment (see above).
> >> 
> >>       This may be because the daemon was unable to find all the needed shared
> >>       libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
> >>       location of the shared libraries on the remote nodes and this will
> >>       automatically be forwarded to the remote nodes.
> >> --------------------------------------------------------------------------
> >> 
> >> Qmaster spool messages list:
> >> 
> >>   12/17/2010 12:53:19|worker|majorana|E|tightly integrated parallel task 3480568.1 task 2.compute-2-9 failed - killing job
> >> 
> >> Any idea what's going wrong now? 60 seconds is quite a long timeout, I
> >> guess that this is not a network timeout...
> >> 
> >> Arne
> > 
> > ------------------------------------------------------
> > http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=306450
> > 
> > To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].
> > 

-- 
Arne Brutschy
Ph.D. Student                    Email    arne.brutschy(AT)ulb.ac.be
IRIDIA CP 194/6                  Web      iridia.ulb.ac.be/~abrutschy
Universite' Libre de Bruxelles   Tel      +32 2 650 2273
Avenue Franklin Roosevelt 50     Fax      +32 2 650 2715
1050 Bruxelles, Belgium          (Fax at IRIDIA secretary)

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=306524

To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].


