[GE users] Rescheduled job causing a node to crash

hugo_hernandez hugo.hernandez at loni.ucla.edu
Wed Feb 4 01:45:17 GMT 2009


> -----Original Message-----
> From: reuti [mailto:reuti at staff.uni-marburg.de]
> Sent: Friday, January 30, 2009 7:19 AM
> To: users at gridengine.sunsource.net
> Subject: Re: [GE users] Rescheduled job causing a node to crash
> Hi,
> On 30.01.2009 at 01:58, hugo_hernandez wrote:
> > I have noticed that when a node crashes, whatever the cause, and
> > there is a job running on it, the job is rescheduled to run again
> > once the node is back online (after a reboot or reinstall).  Isn't
> > SGE supposed to detect that there is a problem with a job? And if
> > there is no communication between the execd daemon on the exec host
> > and the qmaster, shouldn't the job be rescheduled regardless of
> > whether the compute node on which it was previously running comes
> > back online?  Am I doing something wrong in my configuration?
> there are some entries in SGE's configuration which might help:
> max_unheard                  00:05:00
> reschedule_unknown           00:01:00
[Hugo Hernandez-Mora]
Our configuration uses the same value for max_unheard (00:05:00), but reschedule_unknown is set to 00:02:00.
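
For reference, a rough sketch of how the two timers might add up. It assumes (my reading of sge_conf, not something confirmed in this thread) that the reschedule_unknown countdown only starts once the host has been marked unknown after max_unheard. The helper name to_seconds is purely illustrative, and the qconf call is skipped if the SGE tools are not on the PATH.

```shell
#!/bin/sh
# Convert an HH:MM:SS interval (the format used in sge_conf) to seconds.
to_seconds() {
    echo "$1" | awk -F: '{ print $1 * 3600 + $2 * 60 + $3 }'
}

max_unheard=$(to_seconds 00:05:00)         # value quoted in this thread
reschedule_unknown=$(to_seconds 00:02:00)  # value set at our site

# Assumption: the reschedule countdown begins only after the host
# has been flagged unknown, so the two intervals are added together.
total=$((max_unheard + reschedule_unknown))
echo "earliest reschedule roughly ${total}s after last contact"

# Show the values actually in effect, if the SGE tools are installed.
if command -v qconf >/dev/null 2>&1; then
    qconf -sconf | grep -E 'max_unheard|reschedule_unknown'
fi
```

If the assumption holds, a job on a dead node would not be a candidate for rescheduling until several minutes after the crash, so a quick reboot could bring the node back before that point.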

> (man sge_conf) You will need to submit the jobs with "-r y" and/or
> set the queue's configuration "rerun TRUE" - Reuti
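
A minimal sketch of how the rerun setting mentioned above could be checked and applied. The queue name all.q and the script name job.sh are placeholders, and rerun_enabled is a hypothetical helper that only inspects the text printed by "qconf -sq"; nothing runs against the cluster unless qconf is found on the PATH.

```shell
#!/bin/sh
# Placeholder names: adjust to your site.
QUEUE=all.q
JOB_SCRIPT=job.sh

# Succeeds if the queue configuration read on stdin contains "rerun TRUE".
rerun_enabled() {
    grep -Eiq '^rerun[[:space:]]+TRUE'
}

if command -v qconf >/dev/null 2>&1; then
    if qconf -sq "$QUEUE" | rerun_enabled; then
        echo "$QUEUE: rerun is TRUE"
    else
        echo "$QUEUE: rerun is not TRUE; submit with -r y or change the queue"
        # To change the queue default (needs manager rights):
        # qconf -mattr queue rerun TRUE "$QUEUE"
    fi

    # Mark this particular job as rerunnable, whatever the queue default is.
    if [ -f "$JOB_SCRIPT" ]; then
        qsub -r y "$JOB_SCRIPT"
    fi
fi
```

The per-job flag is the less intrusive choice when only some jobs are safe to restart from scratch; the queue-level setting changes the default for every job that lands in that queue.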


