[GE users] Rescheduled job causing a node to crash

hugo_hernandez hugo.hernandez at loni.ucla.edu
Wed Feb 4 01:45:17 GMT 2009


Reuti,

> -----Original Message-----
> From: reuti [mailto:reuti at staff.uni-marburg.de]
> Sent: Friday, January 30, 2009 7:19 AM
> To: users at gridengine.sunsource.net
> Subject: Re: [GE users] Rescheduled job causing a node to crash
>
> Hi,
>
> On 30.01.2009 at 01:58, hugo_hernandez wrote:
>
> > I have noticed that when a node crashes, whatever the cause, and
> > there is a job running on it, the job is rescheduled to run again
> > once the node is back online (after a reboot or reinstall).  Isn't
> > SGE supposed to detect a problem with a job, so that if there is no
> > communication between the execd daemon on the exec host and the
> > qmaster, the job is rescheduled regardless of whether the compute
> > node on which it was previously running comes back online?  Am I
> > doing something wrong in my configuration?
>
> There are some entries in SGE's configuration which might help:
>
> max_unheard                  00:05:00
> reschedule_unknown           00:01:00
[Hugo Hernandez-Mora]
Our configuration uses the same value for max_unheard (00:05:00), but we have set reschedule_unknown to 00:02:00.
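For reference, here is a rough sketch of how these two timers appear to combine. It assumes the reading of sge_conf in which a host is marked unknown after max_unheard without contact, and rescheduling starts once the host has stayed unknown for reschedule_unknown, with 00:00:00 meaning rescheduling is disabled. The helper names parse_hms and reschedule_delay are made up for illustration and are not part of SGE.

```python
from datetime import timedelta


def parse_hms(value: str) -> timedelta:
    """Parse an SGE-style HH:MM:SS time value into a timedelta."""
    hours, minutes, seconds = (int(part) for part in value.split(":"))
    return timedelta(hours=hours, minutes=minutes, seconds=seconds)


def reschedule_delay(max_unheard: str, reschedule_unknown: str):
    """Approximate time from last execd contact until rescheduling starts.

    Assumption: reschedule_unknown of 00:00:00 disables rescheduling,
    in which case None is returned.
    """
    wait = parse_hms(reschedule_unknown)
    if wait == timedelta(0):
        return None
    return parse_hms(max_unheard) + wait


# Values from this thread: max_unheard 00:05:00, reschedule_unknown 00:02:00
delay = reschedule_delay("00:05:00", "00:02:00")
print(delay)  # 0:07:00
```

Under these assumptions, a rerunnable job on a crashed node would become eligible for rescheduling roughly seven minutes after the qmaster last heard from that node's execd, whether or not the node comes back.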

>
> (See man sge_conf.) You will also need to submit the jobs with "-r y"
> and/or set "rerun TRUE" in the queue's configuration. - Reuti

-Hugo

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=101815
