[GE users] Rmpi under SGE

reuti reuti at staff.uni-marburg.de
Fri Dec 17 15:24:38 GMT 2010


On 17.12.2010 at 16:01, matbradford wrote:

> Arne,
> 
> We are using OpenMPI 1.4.2 and we had a very similar problem, with exactly
> the same message, when using more than a single node.
> 
> I managed to get rid of the problem by restarting the execution daemons.
> Don't know why it fixed the problem, but it did.

When you start sgeexecd by hand from the root account, the environment may differ from the one at boot time, and it is inherited by the processes that SGE starts. This behavior can also be adjusted in SGE's configuration.
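
The configuration setting meant here is presumably the INHERIT_ENV entry of execd_params, documented in sge_conf(5); that reading is an assumption, since the parameter is not named in this message. A sketch of how the entry might look in the cluster configuration:

```
# displayed by `qconf -sconf`, edited with `qconf -mconf`
# default is INHERIT_ENV=true: jobs inherit the environment of the shell that started the execd
execd_params                 INHERIT_ENV=false
```

With INHERIT_ENV=false, a job should only see the variables that SGE passes on explicitly, so it no longer matters whether the daemon was started at boot time or by hand.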

But if some environment variables are missing (and are only set when the daemon is started by hand), this could perhaps also be fixed in the job script. For a tightly integrated job, -V is usually used to export the environment of the master task to all slaves, so it needs to be set only once in the job script. Here -V is appropriate.
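
A minimal sketch of such a job script, with -V embedded once as a directive. The parallel environment name `orte`, the slot count, and the `env_status` helper are assumptions for illustration; none of them come from this thread.

```shell
#!/bin/sh
# Hypothetical job.sh. Lines starting with "#$" are read by qsub as options.
#$ -S /bin/sh
#$ -cwd
# Export the submission environment to the job (set once, here):
#$ -V
# Assumption: PE name and slot count are placeholders for the local setup.
#$ -pe orte 24

# Hypothetical helper: print "NAME=value" or "NAME is unset" for each
# variable name given, so a missing variable shows up in the job output.
env_status() {
    for name in "$@"; do
        if eval "[ \"\${$name+set}\" = set ]"; then
            eval "printf '%s=%s\n' \"$name\" \"\$$name\""
        else
            printf '%s is unset\n' "$name"
        fi
    done
}

# Record what the job really received before starting MPI.
env_status PATH LD_LIBRARY_PATH

# The MPI start itself would follow here; its command line is not shown
# in the thread, so it is left out of this sketch.
```

Comparing this output between a failing and a working run should show whether a variable such as LD_LIBRARY_PATH is the difference.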

==

This can also be set at job submission time:

$ qsub -V job.sh

I usually advise against it, though, as I prefer self-contained scripts: otherwise a variable changed in the submitting shell can affect the job, which is hard to track down when an error occurs. The exception is when you name the variable explicitly:

$ qsub -v LD_LIBRARY_PATH job.sh
$ qsub -v LD_LIBRARY_PATH=/usr/local/lib job.sh
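
A self-contained script could instead set the variable itself, so the job does not depend on the submitting shell at all. This is only a sketch: `prepend_path` is a hypothetical helper, and /usr/local/lib is simply the directory from the qsub -v example above.

```shell
#!/bin/sh
# Hypothetical self-contained job.sh: no -V, the script sets what it needs.
#$ -S /bin/sh
#$ -cwd

# Hypothetical helper: print DIR alone if CURRENT is empty, else DIR:CURRENT.
prepend_path() {
    dir=$1
    current=$2
    if [ -z "$current" ]; then
        printf '%s\n' "$dir"
    else
        printf '%s:%s\n' "$dir" "$current"
    fi
}

# Assumption: the needed shared libraries live in /usr/local/lib.
LD_LIBRARY_PATH=$(prepend_path /usr/local/lib "${LD_LIBRARY_PATH:-}")
export LD_LIBRARY_PATH
```

This covers the master task only. Whether the value also reaches the slave tasks depends on the MPI startup; with Open MPI, `mpirun -x LD_LIBRARY_PATH` is one way to forward it.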

-- Reuti


> 
> The problem has returned a couple of times over a period of about a month,
> but restarting the sgeexecd daemons always seems to fix it.
> 
> No other messages in the log files indicate anything unusual.
> 
> Cheers,
> 
> Mat
> 
> 
> 
> -----Original Message-----
> From: reuti [mailto:reuti at staff.uni-marburg.de] 
> Sent: 17 December 2010 14:03
> To: users at gridengine.sunsource.net
> Subject: Re: [GE users] Rmpi under SGE
> 
> On 17.12.2010 at 12:58, arnuschky wrote:
> 
>> Ah. My previous message was slightly premature: Rmpi jobs with > 20 slaves
>> still fail (even with Reuti's fixes):
>> 
>>       $ cat test-mpi-17942.e3480568
>>       error: got no connection within 60 seconds. "Timeout occured while waiting for connection"
>>       error: got no connection within 60 seconds. "Timeout occured while waiting for connection"
> 
> You are now using the plain -builtin- startup method? Does it happen on all
> hosts for such a job?
> 
> Maybe it's something special on some nodes, and on some hosts it would happen
> with fewer slots too.
> 
> -- Reuti
> 
> 
>> --------------------------------------------------------------------------
>>       A daemon (pid 8473) died unexpectedly with status 1 while attempting
>>       to launch so we are aborting.
>> 
>>       There may be more information reported by the environment (see above).
>> 
>>       This may be because the daemon was unable to find all the needed shared
>>       libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
>>       location of the shared libraries on the remote nodes and this will
>>       automatically be forwarded to the remote nodes.
>> --------------------------------------------------------------------------
>> 
>> Qmaster spool messages list:
>> 
>>   12/17/2010 12:53:19|worker|majorana|E|tightly integrated parallel task 3480568.1 task 2.compute-2-9 failed - killing job
>> 
>> Any idea what's going wrong now? 60 seconds is quite a long timeout, so I
>> guess that this is not a network timeout...
>> 
>> Arne
> 
> ------------------------------------------------------
> http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=306450
> 
> To unsubscribe from this discussion, e-mail:
> [users-unsubscribe at gridengine.sunsource.net].
> 
> ------------------------------------------------------
> http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=306469
> 

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=306475
