[GE users] Rescheduled job causing a node to crash

reuti reuti at staff.uni-marburg.de
Fri Feb 6 11:54:57 GMT 2009


On 05.02.2009 at 22:14, hugo_hernandez wrote:

> Reuti,
> We have configured all our queues with rerun.

Hugo,

you didn't answer all of my questions:

a) Is it not working for you?

b) Did the node crash and also disappear from "qhost", i.e. does it show
dashes there, or the state "au" in `qstat -f`?

I have seen situations where a node can no longer be reached via ssh
but still appears to be working in qhost, because the execd is still
sending information to the qmaster. In that case SGE can't do anything,
since the host never enters the unknown state.
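
To make question b) easy to check, here is a minimal sketch of a filter that lists the queue instances `qstat -f` reports with a "u" (unknown) state. The function name `unknown_queues` is made up for illustration, and the sketch assumes that queue instance lines carry an "@" in the first column and that the state letters, when present, are the sixth whitespace-separated field. Please verify that against the output of your SGE version before relying on it.

```shell
# Hypothetical helper: read `qstat -f` output on stdin and print every
# queue instance whose state field contains "u" (unknown), e.g. "au".
# Assumption: queue instance lines have "@" in field 1 and the state
# letters, if any, in field 6; lines without a state have only 5 fields.
unknown_queues() {
    awk '$1 ~ /@/ && NF >= 6 && $6 ~ /u/ { print $1, $6 }'
}

# On a live cluster this would be used as:
#   qstat -f | unknown_queues
```

If the crashed node is missing from this list and still shows load values in qhost, then the qmaster still considers it alive, which is the situation described above.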

-- Reuti


> -Hugo
>
> --
> Hugo R. Hernandez-Mora
> System Administrator
> Laboratory of Neuro Imaging, UCLA
> 635 Charles E. Young Drive South, Suite 225
> Los Angeles, CA 90095-7332
> Tel: 310.267.5076
> Fax: 310.206.5518
> hugo.hernandez at loni.ucla.edu
> --
>
> "If your efforts were met with indifference, do not be discouraged,
> for the sun puts on a marvelous show every morning
> while most people are still asleep"
>
>
>> -----Original Message-----
>> From: reuti [mailto:reuti at staff.uni-marburg.de]
>> Sent: Wednesday, February 04, 2009 7:45 AM
>> To: users at gridengine.sunsource.net
>> Subject: Re: [GE users] Rescheduled job causing a node to crash
>>
>> On 04.02.2009 at 02:45, hugo_hernandez wrote:
>>
>>> Reuti,
>>>
>>>> -----Original Message-----
>>>> From: reuti [mailto:reuti at staff.uni-marburg.de]
>>>> Sent: Friday, January 30, 2009 7:19 AM
>>>> To: users at gridengine.sunsource.net
>>>> Subject: Re: [GE users] Rescheduled job causing a node to crash
>>>>
>>>> Hi,
>>>>
>>>> On 30.01.2009 at 01:58, hugo_hernandez wrote:
>>>>
>>>>> I have noticed that when a node crashes, for whatever reason, while
>>>>> a job is running on it, the job is rescheduled to run again
>>>>> once the node is back online (after a reboot or reinstall). Isn't
>>>>> SGE supposed to detect a problem with a job and, if there is no
>>>>> communication between the execd daemon on the exechost and the
>>>>> qmaster, reschedule the job regardless of whether the compute node
>>>>> it was previously running on comes back online or not? Am I doing
>>>>> something wrong in my configuration?
>>>>
>>>> there are some entries in SGE's configuration which might help:
>>>>
>>>> max_unheard                  00:05:00
>>>> reschedule_unknown           00:01:00
>>> [Hugo Hernandez-Mora]
>>> We have set our configuration to use the same value for max_unheard
>>> but for reschedule_unknown we have set 00:02:00.
>>
>> This is fine. And it's not working for you? Have you submitted the
>> jobs with "-r y" or configured the queue with "rerun TRUE"?
>>
>> -- Reuti
>>
>>
>>>
>>>>
>>>> (man sge_conf) You will need to submit the jobs with "-r y" and/or
>>>> set the queue's configuration "rerun TRUE" - Reuti
>>>
>>> -Hugo
>>>
>>> ------------------------------------------------------
>>> http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=101815
>>>
>>> To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].
>>>
>>
>> ------------------------------------------------------
>> http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=101929
>>
>> To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].
>
> ------------------------------------------------------
> http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=102250
>
> To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].
>

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=102380

To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].


