[GE users] Rmpi under SGE

matbradford matthew.bradford at hp.com
Tue Dec 21 09:48:08 GMT 2010


>-----Original Message-----
>From: reuti [mailto:reuti at staff.uni-marburg.de]
>Sent: 21 December 2010 09:42
>To: users at gridengine.sunsource.net
>Subject: Re: [GE users] Rmpi under SGE
>
>Am 21.12.2010 um 10:36 schrieb matbradford:
>
>>> -----Original Message-----
>>> From: reuti [mailto:reuti at staff.uni-marburg.de]
>>> Sent: 21 December 2010 09:30
>>> To: users at gridengine.sunsource.net
>>> Subject: Re: [GE users] Rmpi under SGE
>>>
>>> Am 17.12.2010 um 19:07 schrieb arnuschky:
>>>
>>>>> This can also be set during job submission time:
>>>>>
>>>>> $ qsub -V job.sh
>>>>
>>>> I tried this as well, but it didn't change anything. Anyway, I think Mat
>>>> didn't start the sgeexecds by hand - he just restarted the daemon (by
>>>> init script, I assume). Or am I missing something here?
>>>
>>> Sure, he used the script. But when you log in as root, you most likely
>>> have a different environment than the machine has when it boots and starts
>>> the script automatically. You can check this in /proc/<pid>/environ for
>>> the processes, maybe for one where it was started automatically and one
>>> where it was started by hand (by using the script).
>>
>> We always start our SGE daemons by hand.

Sorry, I was not very precise here. When I say "by hand", I still mean that
we use the start-up script in init.d; we just run it manually rather than
automatically at boot time.
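
Reuti's suggestion of comparing /proc/<pid>/environ for a daemon started at
boot and one started from a root login could be sketched roughly as below.
This is only an illustration, not something we have run on the cluster: the
helper names (parse_environ, read_environ, diff_environ) are made up for the
example, it assumes a Linux /proc file system, and reading another process's
environ file normally requires root.

```python
import sys
from pathlib import Path


def parse_environ(raw):
    """Parse the NUL-separated KEY=VALUE records found in /proc/<pid>/environ."""
    env = {}
    for record in raw.split(b"\0"):
        if not record:
            continue  # skip the empty record after the trailing NUL
        key, _, value = record.decode(errors="replace").partition("=")
        env[key] = value
    return env


def read_environ(pid):
    """Read and parse the environment of a running process (Linux only)."""
    return parse_environ(Path(f"/proc/{pid}/environ").read_bytes())


def diff_environ(a, b):
    """Map each differing variable to (value_in_a, value_in_b); None means unset."""
    return {
        name: (a.get(name), b.get(name))
        for name in sorted(a.keys() | b.keys())
        if a.get(name) != b.get(name)
    }


# Hypothetical usage: python envdiff.py <pid_boot_started> <pid_hand_started>
if __name__ == "__main__" and len(sys.argv) == 3:
    first, second = (read_environ(int(arg)) for arg in sys.argv[1:3])
    for name, (left, right) in diff_environ(first, second).items():
        print(f"{name}: {left!r} != {right!r}")
```

Variables such as PATH or LD_LIBRARY_PATH showing up in the output would be
the first thing to look at, since the Open MPI error quoted further down
mentions missing shared libraries.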

>>
>> The SGE directories depend on the GPFS file system having started, so to
>> prevent any issues, GPFS is started manually, followed by SGE.
>>
>> We just run a distributed ssh (xdsh) command across all the nodes that have
>> been rebooted.
>>
>> Cheers,
>>
>> Mat
>>>
>>> -- Reuti
>>>
>>>
>>>>
>>>> Cheers,
>>>> Arne
>>>>
>>>>
>>>>>>
>>>>>> The problem has returned a couple of times over a period of about a month,
>>>>>> but restarting the sgeexecd daemons always seems to fix it.
>>>>>>
>>>>>> No other messages in the log files indicate anything unusual.
>>>>>>
>>>>>> Cheers,
>>>>>>
>>>>>> Mat
>>>>>>
>>>>>>
>>>>>>
>>>>>> -----Original Message-----
>>>>>> From: reuti [mailto:reuti at staff.uni-marburg.de]
>>>>>> Sent: 17 December 2010 14:03
>>>>>> To: users at gridengine.sunsource.net
>>>>>> Subject: Re: [GE users] Rmpi under SGE
>>>>>>
>>>>>> Am 17.12.2010 um 12:58 schrieb arnuschky:
>>>>>>
>>>>>>> Ah. My previous message was slightly premature: Rmpi jobs with > 20 slaves
>>>>>>> still fail (even with Reuti's fixes):
>>>>>>>
>>>>>>>     $ cat test-mpi-17942.e3480568
>>>>>>>     error: got no connection within 60 seconds. "Timeout occured while waiting for connection"
>>>>>>>     error: got no connection within 60 seconds. "Timeout occured while waiting for connection"
>>>>>>
>>>>>> You are now using the plain -builtin- startup method? Does it happen on all
>>>>>> hosts for such a job?
>>>>>>
>>>>>> Maybe it's something special on some nodes, and on some hosts it would
>>>>>> happen with fewer slots too.
>>>>>>
>>>>>> -- Reuti
>>>>>>
>>>>>>
>>>>>>>
>>>>>>> --------------------------------------------------------------------------
>>>>>>>     A daemon (pid 8473) died unexpectedly with status 1 while attempting
>>>>>>>     to launch so we are aborting.
>>>>>>>
>>>>>>>     There may be more information reported by the environment (see above).
>>>>>>>
>>>>>>>     This may be because the daemon was unable to find all the needed shared
>>>>>>>     libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
>>>>>>>     location of the shared libraries on the remote nodes and this will
>>>>>>>     automatically be forwarded to the remote nodes.
>>>>>>> --------------------------------------------------------------------------
>>>>>>>
>>>>>>> Qmaster spool messages list:
>>>>>>>
>>>>>>> 12/17/2010 12:53:19|worker|majorana|E|tightly integrated parallel task 3480568.1 task 2.compute-2-9 failed - killing job
>>>>>>>
>>>>>>> Any idea what's going wrong now? 60 seconds is quite a long timeout, I
>>>>>>> guess that this is not a network timeout...
>>>>>>>
>>>>>>> Arne
>>>>>>
>>>>>> ------------------------------------------------------
>>>>>> http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=306450
>>>>>>
>>>>>> To unsubscribe from this discussion, e-mail:
>>>>>> [users-unsubscribe at gridengine.sunsource.net].
>>>>>>
>>>>
>>>> --
>>>> Arne Brutschy
>>>> Ph.D. Student                    Email    arne.brutschy(AT)ulb.ac.be
>>>> IRIDIA CP 194/6                  Web      iridia.ulb.ac.be/~abrutschy
>>>> Universite' Libre de Bruxelles   Tel      +32 2 650 2273
>>>> Avenue Franklin Roosevelt 50     Fax      +32 2 650 2715
>>>> 1050 Bruxelles, Belgium          (Fax at IRIDIA secretary)
>>>>

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=307786

To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].
