[GE users] Rmpi under SGE

reuti reuti at staff.uni-marburg.de
Fri Dec 17 14:02:39 GMT 2010


On 17.12.2010 at 12:58, arnuschky wrote:

> Ah. My previous message was slightly premature: Rmpi jobs with more than
> 20 slaves still fail (even with Reuti's fixes):
> 
>        $ cat test-mpi-17942.e3480568
>        error: got no connection within 60 seconds. "Timeout occured while waiting for connection"
>        error: got no connection within 60 seconds. "Timeout occured while waiting for connection"

Are you now using the plain -builtin- startup method? Does it happen on all hosts for such a job?

Maybe it's something specific to some nodes, and on those hosts it would happen with fewer slots too.
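
One way to check whether the failures cluster on particular nodes is to tally the failing hosts from the qmaster messages file. The following is only a minimal sketch: it assumes the message format of the "tightly integrated parallel task ... failed" line quoted further down, and the function name failed_hosts is made up for illustration.

```shell
# Tally failed tightly integrated tasks per host.
# Reads qmaster "messages" lines on stdin and prints "<host> <count>" per host.
# Assumed line format (taken from the quoted log line in this thread):
#   ...|E|tightly integrated parallel task <job>.<task> task <n>.<host> failed - killing job
failed_hosts() {
    # Extract the host name that follows "task <n>." on matching lines only.
    sed -n 's/.*tightly integrated parallel task [0-9.]* task [0-9]*\.\([^ ]*\) failed.*/\1/p' |
        sort | uniq -c | awk '{ print $2, $1 }'
}
```

It could be fed from the qmaster spool, for example `failed_hosts < "$SGE_ROOT/$SGE_CELL/spool/qmaster/messages"`; that path is an assumption about a default installation and may differ on your cluster. If only a few hosts show up repeatedly, the problem is more likely local to those nodes than related to the slot count.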

-- Reuti


>        --------------------------------------------------------------------------
>        A daemon (pid 8473) died unexpectedly with status 1 while attempting
>        to launch so we are aborting.
> 
>        There may be more information reported by the environment (see above).
> 
>        This may be because the daemon was unable to find all the needed shared
>        libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
>        location of the shared libraries on the remote nodes and this will
>        automatically be forwarded to the remote nodes.
>        --------------------------------------------------------------------------
> 
> Qmaster spool messages list:
> 
>    12/17/2010 12:53:19|worker|majorana|E|tightly integrated parallel task 3480568.1 task 2.compute-2-9 failed - killing job
> 
> Any idea what's going wrong now? 60 seconds is quite a long timeout, I
> guess that this is not a network timeout...
> 
> Arne

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=306450
