[GE users] Rescheduled job causing a node to crash

hugo_hernandez hugo.hernandez at loni.ucla.edu
Thu Feb 5 21:14:37 GMT 2009



Reuti,
We have configured all our queues with "rerun TRUE".
-Hugo

--
Hugo R. Hernandez-Mora
System Administrator
Laboratory of Neuro Imaging, UCLA
635 Charles E. Young Drive South, Suite 225
Los Angeles, CA 90095-7332
Tel: 310.267.5076
Fax: 310.206.5518
hugo.hernandez at loni.ucla.edu
--

"If your efforts were met with indifference, do not be discouraged,
for the sun puts on a marvelous show every morning
while most people are still asleep"


> -----Original Message-----
> From: reuti [mailto:reuti at staff.uni-marburg.de]
> Sent: Wednesday, February 04, 2009 7:45 AM
> To: users at gridengine.sunsource.net
> Subject: Re: [GE users] Rescheduled job causing a node to crash
>
> Am 04.02.2009 um 02:45 schrieb hugo_hernandez:
>
> > Reuti,
> >
> >> -----Original Message-----
> >> From: reuti [mailto:reuti at staff.uni-marburg.de]
> >> Sent: Friday, January 30, 2009 7:19 AM
> >> To: users at gridengine.sunsource.net
> >> Subject: Re: [GE users] Rescheduled job causing a node to crash
> >>
> >> Hi,
> >>
> >> Am 30.01.2009 um 01:58 schrieb hugo_hernandez:
> >>
> >>> I have noticed that when a node crashes, for whatever reason, while
> >>> a job is running on it, the job is rescheduled to run again only
> >>> once the node is back online (after a reboot or reinstall).  Isn't
> >>> SGE supposed to detect a problem with a job and, if there is no
> >>> communication between the execd daemon on the exechost and the
> >>> qmaster, reschedule the job regardless of whether the compute node
> >>> on which it was previously running comes back online?  Am I doing
> >>> something wrong in my configuration?
> >>
> >> there are some entries in SGE's configuration which might help:
> >>
> >> max_unheard                  00:05:00
> >> reschedule_unknown           00:01:00
> > [Hugo Hernandez-Mora]
> > We have set our configuration to use the same value for max_unheard
> > but for reschedule_unknown we have set 00:02:00.
>
> This is fine. And it's not working for you? You have submitted the
> jobs with "-r y" or configured the queue with "rerun TRUE"?
>
> -- Reuti
>
>
> >
> >>
> >> (man sge_conf) You will need to submit the jobs with "-r y" and/or
> >> set the queue's configuration "rerun TRUE" - Reuti
> >
> > -Hugo
> >
> > ------------------------------------------------------
> > http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=101815
> >
> > To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].
> >
>
> ------------------------------------------------------
> http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=101929
>
> To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].
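
As a rough illustration of how the two settings quoted above interact, the sketch below adds them up. It assumes, following the description in sge_conf, that a host is marked unknown once it has been unheard for max_unheard, and that rerunnable jobs on it are rescheduled after the host has stayed unknown for reschedule_unknown. The function names are hypothetical helpers, not part of Grid Engine, and the result is only an estimate; it applies only to jobs submitted with "-r y" or running in a queue with "rerun TRUE".

```python
# Hypothetical helpers, not part of Grid Engine: estimate how long a
# rerunnable job may wait after its execution host stops responding.


def hms_to_seconds(value: str) -> int:
    """Convert an SGE-style HH:MM:SS time value to seconds."""
    hours, minutes, seconds = (int(part) for part in value.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def approx_reschedule_delay(max_unheard: str, reschedule_unknown: str) -> int:
    """Approximate seconds between losing contact with execd and rescheduling.

    Assumption: the host is marked unknown after max_unheard, and the job is
    rescheduled once the host has stayed unknown for reschedule_unknown.
    """
    return hms_to_seconds(max_unheard) + hms_to_seconds(reschedule_unknown)


# Values quoted in this thread: max_unheard 00:05:00, reschedule_unknown 00:02:00.
delay = approx_reschedule_delay("00:05:00", "00:02:00")
print(f"approximate delay before rescheduling: {delay} s")
```

If a job stays pending well beyond this estimate while its node is still down, the settings themselves are probably not the cause and the job's rerun flag is worth checking.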

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=102250

To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].


