[GE users] Rmpi under SGE

reuti reuti at staff.uni-marburg.de
Tue Dec 21 09:29:58 GMT 2010


On 17.12.2010 at 19:07, arnuschky wrote:

>> This can also be set during job submission time:
>> 
>> $ qsub -V job.sh
> 
> I tried this as well, but it didn't change anything. Anyway, I think Mat
> didn't start the sgeexecd's by hand - he just restarted the daemon (by
> init script, I assume). Or am I missing something here?

Sure, he used the script. But when you log in as root, your environment is most likely different from the one the machine has when it boots and runs the script automatically. You can check this in /proc/<pid>/environ for the daemon processes: compare one that was started automatically at boot with one that was started by hand (using the script).
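
A minimal sketch of that comparison in Python, assuming Linux and enough privileges (usually root) to read another process's /proc/<pid>/environ. The function names are only illustrative and are not part of SGE:

```python
from pathlib import Path


def parse_environ(raw: bytes) -> dict:
    """Parse the NUL-separated KEY=VALUE records found in /proc/<pid>/environ."""
    env = {}
    for record in raw.split(b"\0"):
        if not record:
            continue  # skip the empty record after the trailing NUL
        key, _, value = record.decode(errors="replace").partition("=")
        env[key] = value
    return env


def read_environ(pid: int) -> dict:
    """Read and parse the environment of a running process (Linux only)."""
    return parse_environ(Path(f"/proc/{pid}/environ").read_bytes())


def diff_environ(a: dict, b: dict) -> dict:
    """Return {name: (value_in_a, value_in_b)} for each variable that differs.

    A value of None means the variable is not set in that environment.
    """
    return {
        name: (a.get(name), b.get(name))
        for name in sorted(a.keys() | b.keys())
        if a.get(name) != b.get(name)
    }


def report(boot_pid: int, manual_pid: int) -> None:
    """Print every variable that differs between the two processes."""
    differences = diff_environ(read_environ(boot_pid), read_environ(manual_pid))
    for name, (at_boot, by_hand) in differences.items():
        print(f"{name}: boot={at_boot!r} manual={by_hand!r}")
```

Calling report() with the pid of a daemon started at boot and the pid of one started by hand lists the variables that differ; PATH and LD_LIBRARY_PATH would be the first ones to look at.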

-- Reuti


> 
> Cheers,
> Arne
> 
> 
>>> 
>>> The problem has returned a couple of times over a period of about a month,
>>> but restarting the sgeexecd daemons always seems to fix it.
>>> 
>>> No other messages in the log files indicate anything unusual.
>>> 
>>> Cheers,
>>> 
>>> Mat
>>> 
>>> 
>>> 
>>> -----Original Message-----
>>> From: reuti [mailto:reuti at staff.uni-marburg.de] 
>>> Sent: 17 December 2010 14:03
>>> To: users at gridengine.sunsource.net
>>> Subject: Re: [GE users] Rmpi under SGE
>>> 
>>> On 17.12.2010 at 12:58, arnuschky wrote:
>>> 
>>>> Ah. My previous message was slightly premature: Rmpi jobs with > 20 slaves
>>>> still fail (even with Reuti's fixes):
>>>> 
>>>>      $ cat test-mpi-17942.e3480568
>>>>      error: got no connection within 60 seconds. "Timeout occured while waiting for connection"
>>>>      error: got no connection within 60 seconds. "Timeout occured while waiting for connection"
>>> 
>>> You are now using the plain -builtin- startup method? Does it happen on all
>>> hosts for such a job?
>>> 
>>> Maybe it's something special on some nodes, and on some hosts it would
>>> happen with fewer slots too.
>>> 
>>> -- Reuti
>>> 
>>> 
>>>> 
>>>> --------------------------------------------------------------------------
>>>>      A daemon (pid 8473) died unexpectedly with status 1 while attempting
>>>>      to launch so we are aborting.
>>>> 
>>>>      There may be more information reported by the environment (see above).
>>>> 
>>>>      This may be because the daemon was unable to find all the needed shared
>>>>      libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
>>>>      location of the shared libraries on the remote nodes and this will
>>>>      automatically be forwarded to the remote nodes.
>>>> 
>>>> --------------------------------------------------------------------------
>>>> 
>>>> Qmaster spool messages list:
>>>> 
>>>>  12/17/2010 12:53:19|worker|majorana|E|tightly integrated parallel task 3480568.1 task 2.compute-2-9 failed - killing job
>>>> 
>>>> Any idea what's going wrong now? 60 seconds is quite a long timeout, I
>>>> guess that this is not a network timeout...
>>>> 
>>>> Arne
>>> 
>>> ------------------------------------------------------
>>> http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=306450
>>> 
>>> To unsubscribe from this discussion, e-mail:
>>> [users-unsubscribe at gridengine.sunsource.net].
> 
> -- 
> Arne Brutschy
> Ph.D. Student                    Email    arne.brutschy(AT)ulb.ac.be
> IRIDIA CP 194/6                  Web      iridia.ulb.ac.be/~abrutschy
> Universite' Libre de Bruxelles   Tel      +32 2 650 2273
> Avenue Franklin Roosevelt 50     Fax      +32 2 650 2715
> 1050 Bruxelles, Belgium          (Fax at IRIDIA secretary)

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=307780

To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].