[GE users] Rmpi under SGE

matbradford matthew.bradford at hp.com
Tue Dec 21 09:36:37 GMT 2010


>-----Original Message-----
>From: reuti [mailto:reuti at staff.uni-marburg.de]
>Sent: 21 December 2010 09:30
>To: users at gridengine.sunsource.net
>Subject: Re: [GE users] Rmpi under SGE
>
>On 17.12.2010 at 19:07, arnuschky wrote:
>
>>> This can also be set during job submission time:
>>>
>>> $ qsub -V job.sh
>>
>> I tried this as well, but it didn't change anything. Anyway, I think Mat
>> didn't start the sgeexecd's by hand - he just restarted the daemon (by
>> init script, I assume). Or am I missing something here?
>
>Sure, he used the script. But when you log in as root, you most likely
>have a different environment than the machine has when it boots and starts
>the script automatically. You can check this in /proc/<pid>/environ for
>the processes, e.g. for one that was started automatically and one that
>was started by hand (by using the script).
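
A minimal sketch of the comparison Reuti describes, assuming a Linux /proc and that it is run as root (or as the owner of both processes). The function names env_lines and env_diff, and the PIDs in the usage line, are illustrative and not from the thread.

```shell
#!/bin/sh
# Print a NUL-separated environ file (such as /proc/<pid>/environ)
# as sorted VAR=value lines, one per line.
env_lines() {
    tr '\0' '\n' < "$1" | sort
}

# Diff the environments of two processes, given their PIDs.
# "<" lines exist only in the first process, ">" lines only in the second.
env_diff() {
    a=$(mktemp) || return 1
    b=$(mktemp) || { rm -f "$a"; return 1; }
    env_lines "/proc/$1/environ" > "$a"
    env_lines "/proc/$2/environ" > "$b"
    diff "$a" "$b"
    status=$?
    rm -f "$a" "$b"
    return "$status"
}
```

For example, `env_diff 1234 5678` with placeholder PIDs. To compare daemons on two different nodes, the output of env_lines could be saved to a shared file on each node and diffed afterwards.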

We always start our SGE daemons by hand.

The SGE directories have a dependency on the GPFS file system having
started, and to prevent any issues, GPFS gets manually started, followed by
SGE.

We just run a distributed ssh (xdsh) command across all the nodes that have
been rebooted.
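
The same ordering could in principle be enforced per node by a small wrapper. A rough sketch, assuming that a file on the GPFS-hosted SGE tree can serve as a readiness marker; wait_for_path, the timeout values, and the paths in the usage comment are assumptions and not taken from this thread.

```shell
#!/bin/sh
# Poll once per second until the target path exists or the timeout
# (in seconds, default 60) has elapsed.
# Returns 0 if the path exists, 1 on timeout.
wait_for_path() {
    target=$1
    timeout=${2:-60}
    elapsed=0
    while [ ! -e "$target" ]; do
        if [ "$elapsed" -ge "$timeout" ]; then
            return 1
        fi
        sleep 1
        elapsed=$((elapsed + 1))
    done
    return 0
}

# Hypothetical usage (marker file and init script path are assumptions):
#   wait_for_path "$SGE_ROOT/default/common/settings.sh" 300 \
#       && /etc/init.d/sgeexecd start
```

Polling for a file rather than checking the GPFS daemon state keeps the sketch independent of any particular GPFS command.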

Cheers,

Mat
>
>-- Reuti
>
>
>>
>> Cheers,
>> Arne
>>
>>
>>>>
>>>> The problem has returned a couple of times over a period of about a month,
>>>> but restarting the sgeexecd daemons always seems to fix it.
>>>>
>>>> No other messages in the log files indicate anything unusual.
>>>>
>>>> Cheers,
>>>>
>>>> Mat
>>>>
>>>>
>>>>
>>>> -----Original Message-----
>>>> From: reuti [mailto:reuti at staff.uni-marburg.de]
>>>> Sent: 17 December 2010 14:03
>>>> To: users at gridengine.sunsource.net
>>>> Subject: Re: [GE users] Rmpi under SGE
>>>>
>>>> On 17.12.2010 at 12:58, arnuschky wrote:
>>>>
>>>>> Ah. My previous message was slightly premature: Rmpi jobs with > 20
>>>>> slaves still fail (even with Reuti's fixes):
>>>>>
>>>>>      $ cat test-mpi-17942.e3480568
>>>>>      error: got no connection within 60 seconds. "Timeout occured while waiting for connection"
>>>>>      error: got no connection within 60 seconds. "Timeout occured while waiting for connection"
>>>>
>>>> You are now using the plain -builtin- startup method? Does it happen on
>>>> all hosts for such a job?
>>>>
>>>> Maybe it's something specific to some nodes and would, for some hosts,
>>>> also happen with fewer slots.
>>>>
>>>> -- Reuti
>>>>
>>>>
>>>>> --------------------------------------------------------------------------
>>>>>      A daemon (pid 8473) died unexpectedly with status 1 while attempting
>>>>>      to launch so we are aborting.
>>>>>
>>>>>      There may be more information reported by the environment (see above).
>>>>>
>>>>>      This may be because the daemon was unable to find all the needed shared
>>>>>      libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
>>>>>      location of the shared libraries on the remote nodes and this will
>>>>>      automatically be forwarded to the remote nodes.
>>>>> --------------------------------------------------------------------------
>>>>>
>>>>> Qmaster spool messages list:
>>>>>
>>>>>  12/17/2010 12:53:19|worker|majorana|E|tightly integrated parallel task 3480568.1 task 2.compute-2-9 failed - killing job
>>>>>
>>>>> Any idea what's going wrong now? 60 seconds is quite a long timeout, I
>>>>> guess that this is not a network timeout...
>>>>>
>>>>> Arne
>>>>
>>>> ------------------------------------------------------
>>>>
>>>> http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=306450
>>>>
>>>> To unsubscribe from this discussion, e-mail:
>>>> [users-unsubscribe at gridengine.sunsource.net].
>>>>
>>
>> --
>> Arne Brutschy
>> Ph.D. Student                    Email    arne.brutschy(AT)ulb.ac.be
>> IRIDIA CP 194/6                  Web      iridia.ulb.ac.be/~abrutschy
>> Universite' Libre de Bruxelles   Tel      +32 2 650 2273
>> Avenue Franklin Roosevelt 50     Fax      +32 2 650 2715
>> 1050 Bruxelles, Belgium          (Fax at IRIDIA secretary)
>>
