[GE users] Rescheduled job causing a node to crash

reuti reuti at staff.uni-marburg.de
Tue Feb 10 11:56:28 GMT 2009


On 09.02.2009 at 22:43, hugo_hernandez wrote:

> Hello Reuti,
> Sorry, you are right, I didn't answer all the questions.
>
>>> Reuti,
>>> We have configured all our queues with rerun.
>>
>> Hugo,
>>
>> you didn't answer all questions.
>>
>> a) it's not working for you?
> [Hugo Hernandez-Mora]
> The SGE configuration works fine on our production cluster, which
> runs version 6.1u4, but not on our testing cluster, which runs
> version 6.2u1.
>
>> b) did the node crash and disappear also from "qhost", i.e. has
>> dashes or "au" in `qstat -f`?
> [Hugo Hernandez-Mora]
> On the production cluster I get an 'au' state once the qmaster
> detects lost communication with the exechost.  On the testing
> cluster, however, I have noticed that the state stays at 'r' until
> the node is rebooted/reinstalled; while booting, the status changes
> to 'au' and then to a normal state once the qmaster receives
> communication from the exechost.
>
>> I have seen situations where a node can no longer be accessed with
>> ssh but still appears to be working in qhost, because the execd
>> still sends information to the qmaster. Then SGE can't do anything.
> [Hugo Hernandez-Mora]
> This appears to be the situation on our testing cluster.  We may
> have the same issue in some form on our production cluster, but for
> now I am focusing on the testing cluster because we are planning to
> move to this version soon.

Aha, then the subject of the message is wrong, as the problem is not
that rescheduling crashes a node. It's rather that losing a node
doesn't reschedule the job. For me, rescheduling works in 6.2u1.

-- Reuti


> Thanks for your time,
> -Hugo
>
>>
>> -- Reuti
>>
>>
>>> -Hugo
>>>
>>> --
>>> Hugo R. Hernandez-Mora
>>> System Administrator
>>> Laboratory of Neuro Imaging, UCLA
>>> 635 Charles E. Young Drive South, Suite 225
>>> Los Angeles, CA 90095-7332
>>> Tel: 310.267.5076
>>> Fax: 310.206.5518
>>> hugo.hernandez at loni.ucla.edu
>>> --
>>>
>>> "If your efforts were met with indifference, do not be discouraged,
>>> for the sun puts on a marvelous show every morning
>>> while most people are still asleep"
>>>
>>>
>>>> -----Original Message-----
>>>> From: reuti [mailto:reuti at staff.uni-marburg.de]
>>>> Sent: Wednesday, February 04, 2009 7:45 AM
>>>> To: users at gridengine.sunsource.net
>>>> Subject: Re: [GE users] Rescheduled job causing a node to crash
>>>>
>>>> On 04.02.2009 at 02:45, hugo_hernandez wrote:
>>>>
>>>>> Reuti,
>>>>>
>>>>>> -----Original Message-----
>>>>>> From: reuti [mailto:reuti at staff.uni-marburg.de]
>>>>>> Sent: Friday, January 30, 2009 7:19 AM
>>>>>> To: users at gridengine.sunsource.net
>>>>>> Subject: Re: [GE users] Rescheduled job causing a node to crash
>>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> On 30.01.2009 at 01:58, hugo_hernandez wrote:
>>>>>>
>>>>>>> I have noticed that when a node crashes, for whatever reason,
>>>>>>> and a job is running on it, the job is rescheduled to run
>>>>>>> again only once the node is back online (after a reboot or
>>>>>>> reinstall).  Isn't SGE supposed to detect a problem with a
>>>>>>> job, so that if there is no communication between the execd
>>>>>>> daemon on the exechost and the qmaster, the job is rescheduled
>>>>>>> regardless of whether the compute node it was previously
>>>>>>> running on comes back online?  Am I doing something wrong in
>>>>>>> my configuration?
>>>>>>
>>>>>> there are some entries in SGE's configuration which might help:
>>>>>>
>>>>>> max_unheard                  00:05:00
>>>>>> reschedule_unknown           00:01:00
>>>>> [Hugo Hernandez-Mora]
>>>>> We have set our configuration to use the same value for
>>>>> max_unheard, but for reschedule_unknown we have set 00:02:00.
>>>>
>>>> This is fine. And it's not working for you? You have submitted the
>>>> jobs with "-r y" or configured the queue with "rerun TRUE"?
>>>>
>>>> -- Reuti
>>>>
>>>>
>>>>>
>>>>>>
>>>>>> (man sge_conf) You will need to submit the jobs with "-r y"
>>>>>> and/or set the queue's configuration "rerun TRUE" - Reuti
>>>>>
>>>>> -Hugo
>>>>>
>>>>> ------------------------------------------------------
>>>>> http://gridengine.sunsource.net/ds/viewMessage.do?
>>>>> dsForumId=38&dsMessageId=101815
>>>>>
>>>>> To unsubscribe from this discussion, e-mail: [users-
>>>>> unsubscribe at gridengine.sunsource.net].
>>>>>
>>>>
>>>
>>>
>>
>
>
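
As a rough illustration of the two settings quoted earlier in the thread (max_unheard and reschedule_unknown), the sketch below estimates how long a rerunnable job might wait before being rescheduled after its host stops reporting. It assumes the delay is simply the sum of the two values, and that a reschedule_unknown of 00:00:00 means rescheduling is disabled; both are assumptions based on a reading of sge_conf, not a statement of the qmaster's exact timing. The function names are hypothetical.

```python
from typing import Optional


def sge_time_to_seconds(value: str) -> int:
    """Convert an HH:MM:SS string, as used in the thread, to seconds."""
    hours, minutes, seconds = (int(part) for part in value.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def approx_reschedule_delay(max_unheard: str,
                            reschedule_unknown: str) -> Optional[int]:
    """Estimate seconds between losing contact with a host and rescheduling.

    Assumption: the host is marked unknown after max_unheard, and jobs are
    rescheduled reschedule_unknown later. Assumption: a reschedule_unknown
    of zero disables rescheduling, in which case None is returned.
    """
    unknown_wait = sge_time_to_seconds(reschedule_unknown)
    if unknown_wait == 0:
        return None
    return sge_time_to_seconds(max_unheard) + unknown_wait


if __name__ == "__main__":
    # Values suggested in the thread
    print(approx_reschedule_delay("00:05:00", "00:01:00"))
    # Values Hugo reports using
    print(approx_reschedule_delay("00:05:00", "00:02:00"))
```

With the values from the thread this gives an estimate of six to seven minutes, and only for jobs submitted with "-r y" or running in a queue with "rerun TRUE". If the execd keeps reporting to the qmaster, as in the situation described above, the host never becomes unknown and this delay never starts.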

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=103166

To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].


