[GE users] Rmpi under SGE

matbradford matthew.bradford at hp.com
Fri Dec 17 15:01:28 GMT 2010


Arne,

We are using OpenMPI 1.4.2 and we had a very similar problem, with exactly
the same message, when using more than a single node.

I managed to get rid of the problem by restarting the execution daemons.
I don't know why that fixed it, but it did.

The problem has returned a couple of times over a period of about a month,
but restarting the sgeexecd daemons always seems to fix it.

No other messages in the log files indicate anything unusual.
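
If it helps to check whether the failures cluster on particular nodes, here
is a rough Python sketch (not something we run here) that tallies failed
tasks per host from qmaster messages lines. The '|'-separated layout and the
wording of the message are assumed from the single line Arne quoted below,
and the name failed_hosts is made up for illustration.

```python
import re
from collections import Counter

# Pattern assumed from the one qmaster message quoted in this thread;
# other Grid Engine versions may word it differently.
FAILED_TASK = re.compile(
    r"tightly integrated parallel task (\S+) task (\d+)\.(\S+) failed"
)


def failed_hosts(lines):
    """Count failed tightly integrated parallel tasks per execution host."""
    counts = Counter()
    for line in lines:
        # Assumed fields: "date time", thread, qmaster host, level, message.
        fields = line.rstrip("\n").split("|", 4)
        if len(fields) != 5 or fields[3] != "E":
            continue  # skip anything that is not an error-level entry
        match = FAILED_TASK.search(fields[4])
        if match:
            counts[match.group(3)] += 1  # group 3 is the execution host
    return counts


if __name__ == "__main__":
    sample = [
        "12/17/2010 12:53:19|worker|majorana|E|tightly integrated parallel "
        "task 3480568.1 task 2.compute-2-9 failed - killing job",
    ]
    for host, n in failed_hosts(sample).most_common():
        print(host, n)
```

Feeding it the whole messages file (for example via open(path)) would show
whether the same few hosts keep turning up.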

Cheers,

Mat



-----Original Message-----
From: reuti [mailto:reuti at staff.uni-marburg.de] 
Sent: 17 December 2010 14:03
To: users at gridengine.sunsource.net
Subject: Re: [GE users] Rmpi under SGE

On 17.12.2010 at 12:58, arnuschky wrote:

> Ah. My previous message was slightly premature, Rmpi jobs with > 20 slaves
> still fail (even with Reuti's fixes):
> 
>        $ cat test-mpi-17942.e3480568
>        error: got no connection within 60 seconds. "Timeout occured while waiting for connection"
>        error: got no connection within 60 seconds. "Timeout occured while waiting for connection"

You are now using the plain -builtin- startup method? Does it happen on all
hosts for such a job?

Maybe it's something specific to some nodes, and on those hosts it would
also happen with fewer slots.

-- Reuti


> --------------------------------------------------------------------------
>        A daemon (pid 8473) died unexpectedly with status 1 while attempting
>        to launch so we are aborting.
> 
>        There may be more information reported by the environment (see above).
> 
>        This may be because the daemon was unable to find all the needed shared
>        libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
>        location of the shared libraries on the remote nodes and this will
>        automatically be forwarded to the remote nodes.
> --------------------------------------------------------------------------
> 
> Qmaster spool messages list:
> 
>    12/17/2010 12:53:19|worker|majorana|E|tightly integrated parallel task 3480568.1 task 2.compute-2-9 failed - killing job
> 
> Any idea what's going wrong now? 60 seconds is quite a long timeout, I
> guess that this is not a network timeout...
> 
> Arne

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=306450

To unsubscribe from this discussion, e-mail:
[users-unsubscribe at gridengine.sunsource.net].

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=306469
