[GE users] Rmpi under SGE

reuti reuti at staff.uni-marburg.de
Tue Dec 21 09:41:43 GMT 2010


On 21.12.2010, at 10:36, matbradford wrote:

>> -----Original Message-----
>> From: reuti [mailto:reuti at staff.uni-marburg.de]
>> Sent: 21 December 2010 09:30
>> To: users at gridengine.sunsource.net
>> Subject: Re: [GE users] Rmpi under SGE
>> 
>> On 17.12.2010, at 19:07, arnuschky wrote:
>> 
>>>> This can also be set during job submission time:
>>>> 
>>>> $ qsub -V job.sh
>>> 
>>> I tried this as well, but it didn't change anything. Anyway, I think Mat
>>> didn't start the sgeexecds by hand - he just restarted the daemon (by
>>> init script, I assume). Or am I missing something here?
>> 
>> Sure, he used the script. But when you log in as root, you most likely
>> have a different environment than the machine has when it boots and
>> starts the script automatically. You can check this in
>> /proc/<pid>/environ for the daemon processes, maybe for one that was
>> started automatically and one that was started by hand (using the
>> script).
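
(For example, a quick sketch of that comparison; the use of pgrep, and
the assumption that exactly one sge_execd runs per host, are mine:

    # dump a daemon's environment, one variable per line
    $ tr '\0' '\n' < /proc/$(pgrep sge_execd)/environ | sort > env-boot.txt

Capturing this once for an automatically started daemon and once for a
hand-started one, then diffing the two files, shows exactly what
differs.)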
> 
> We always start our SGE daemons by hand.
> 
> The SGE directories depend on the GPFS file system having started, so
> to prevent any issues GPFS gets started manually first, followed by
> SGE.
> 
> We just run a distributed ssh (xdsh) command across all the nodes that have
> been rebooted.
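
(As an illustration, a hedged sketch of that sequence; the xCAT node
range "compute" and the GPFS/init-script paths are assumptions:

    # bring up GPFS on the rebooted nodes first ...
    $ xdsh compute /usr/lpp/mmfs/bin/mmstartup
    # ... then start the execution daemons once GPFS is mounted
    $ xdsh compute /etc/init.d/sgeexecd start

sgeexecd has to come second because, as noted above, the SGE
directories live on GPFS.)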
> 
> Cheers,
> 
> Mat
>> 
>> -- Reuti
>> 
>> 
>>> 
>>> Cheers,
>>> Arne
>>> 
>>> 
>>>>> 
>>>>> The problem has returned a couple of times over a period of about a
>>>>> month, but restarting the sgeexecd daemons always seems to fix it.
>>>>> 
>>>>> No other messages in the log files indicate anything unusual.
>>>>> 
>>>>> Cheers,
>>>>> 
>>>>> Mat
>>>>> 
>>>>> 
>>>>> 
>>>>> -----Original Message-----
>>>>> From: reuti [mailto:reuti at staff.uni-marburg.de]
>>>>> Sent: 17 December 2010 14:03
>>>>> To: users at gridengine.sunsource.net
>>>>> Subject: Re: [GE users] Rmpi under SGE
>>>>> 
>>>>> On 17.12.2010, at 12:58, arnuschky wrote:
>>>>> 
>>>>>> Ah. My previous message was slightly premature: Rmpi jobs with > 20
>>>>>> slaves still fail (even with Reuti's fixes):
>>>>>> 
>>>>>>     $ cat test-mpi-17942.e3480568
>>>>>>     error: got no connection within 60 seconds. "Timeout occured while waiting for connection"
>>>>>>     error: got no connection within 60 seconds. "Timeout occured while waiting for connection"
>>>>> 
>>>>> You are now using the plain -builtin- startup method? Does it happen
>>>>> on all hosts for such a job?
>>>>> 
>>>>> Maybe it's something special on some nodes, and for some hosts it
>>>>> would then happen with fewer slots too.
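
(For reference, one way to check which startup method a cluster uses;
that these entries actually read "builtin" here is an assumption:

    # show the remote-startup settings in the global SGE configuration
    $ qconf -sconf | egrep 'rsh_command|rsh_daemon|rlogin|qlogin'

With the builtin method these entries are set to "builtin" rather than
to an external rsh/ssh binary.)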
>>>>> 
>>>>> -- Reuti
>>>>> 
>>>>> 
>>>>>> 
>>>>> --------------------------------------------------------------------
>> ------
>>>>>>     A daemon (pid 8473) died unexpectedly with status 1 while
>>>>> attempting
>>>>>>     to launch so we are aborting.
>>>>>> 
>>>>>>     There may be more information reported by the environment (see
>>>>> above).
>>>>>> 
>>>>>>     This may be because the daemon was unable to find all the
>> needed
>>>>> shared
>>>>>>     libraries on the remote node. You may set your LD_LIBRARY_PATH
>> to
>>>>> have the
>>>>>>     location of the shared libraries on the remote nodes and this
>> will
>>>>>>     automatically be forwarded to the remote nodes.
>>>>>> 
>>>>> --------------------------------------------------------------------
>> ------
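
(As an aside, a minimal sketch of passing such a setting to a job; the
library path shown is purely an assumption:

    # export one named variable to the job's environment at submission time
    $ qsub -v LD_LIBRARY_PATH=/opt/openmpi/lib job.sh

-v forwards selected variables, whereas -V, mentioned earlier in the
thread, exports the whole submission environment.)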
>>>>>> 
>>>>>> The qmaster spool messages file lists:
>>>>>> 
>>>>>> 12/17/2010 12:53:19|worker|majorana|E|tightly integrated parallel task 3480568.1 task 2.compute-2-9 failed - killing job
>>>>>> 
>>>>>> Any idea what's going wrong now? 60 seconds is quite a long timeout; I
>>>>>> guess that this is not a network timeout...
>>>>>> 
>>>>>> Arne
>>>>> 
>>> 
>>> --
>>> Arne Brutschy
>>> Ph.D. Student                    Email    arne.brutschy(AT)ulb.ac.be
>>> IRIDIA CP 194/6                  Web      iridia.ulb.ac.be/~abrutschy
>>> Universite' Libre de Bruxelles   Tel      +32 2 650 2273
>>> Avenue Franklin Roosevelt 50     Fax      +32 2 650 2715
>>> 1050 Bruxelles, Belgium          (Fax at IRIDIA secretary)
>>> 