[GE users] Rescheduled job causing a node to crash

reuti reuti at staff.uni-marburg.de
Wed Feb 4 12:44:33 GMT 2009

On 04.02.2009 at 02:45, hugo_hernandez wrote:

> Reuti,
>> -----Original Message-----
>> From: reuti [mailto:reuti at staff.uni-marburg.de]
>> Sent: Friday, January 30, 2009 7:19 AM
>> To: users at gridengine.sunsource.net
>> Subject: Re: [GE users] Rescheduled job causing a node to crash
>> Hi,
>> On 30.01.2009 at 01:58, hugo_hernandez wrote:
>>> I have noticed that when a node crashes, whatever the cause, and
>>> there is a job running on it, the job is rescheduled to run again
>>> once the node is back online (after a reboot or reinstall).  Isn't
>>> SGE supposed to detect that there is a problem with a job, so that
>>> if there is no communication between the execd daemon on the exechost
>>> and the qmaster, the job is rescheduled regardless of whether
>>> the compute node on which it was previously running comes back online
>>> or not?  Am I doing something wrong in my configuration?
>> there are some entries in SGE's configuration which might help:
>> max_unheard                  00:05:00
>> reschedule_unknown           00:01:00
> [Hugo Hernandez-Mora]
> We have set our configuration to use the same value for max_unheard  
> but for reschedule_unknown we have set 00:02:00.

This is fine. And it's not working for you? Have you submitted the
jobs with "-r y" or configured the queue with "rerun TRUE"?
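
A quick way to check the settings discussed in this thread is sketched
below. It assumes the SGE client tools are on the PATH; the helper name
conf_value, the queue name all.q and the script name job.sh are only
placeholders for illustration, not anything taken from your setup.

```shell
#!/bin/sh
# Print the value of one parameter from "name value" lines on stdin,
# which is the layout printed by "qconf -sconf" and "qconf -sq".
# (conf_value is a hypothetical helper, not part of SGE.)
conf_value() {
    awk -v key="$1" '$1 == key { print $2 }'
}

# Only query SGE when its tools are actually installed.
if command -v qconf >/dev/null 2>&1; then
    echo "max_unheard:        $(qconf -sconf | conf_value max_unheard)"
    echo "reschedule_unknown: $(qconf -sconf | conf_value reschedule_unknown)"
    # "all.q" is a placeholder queue name.
    echo "rerun (all.q):      $(qconf -sq all.q | conf_value rerun)"
fi

# To mark a job as rerunnable at submit time (job.sh is a placeholder):
#   qsub -r y job.sh
```

If rerun shows FALSE for the queue and the job was submitted without
"-r y", the job would not be expected to be rescheduled, whatever
reschedule_unknown is set to.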

-- Reuti

>> (man sge_conf) You will need to submit the jobs with "-r y" and/or
>> set the queue's configuration "rerun TRUE" - Reuti
> -Hugo
> ------------------------------------------------------
> http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=101815
> To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].


