[GE users] Rescheduled job causing a node to crash

hugo_hernandez hugo.hernandez at loni.ucla.edu
Mon Feb 9 21:43:54 GMT 2009


Hello Reuti,
Sorry, you are right, I didn't answer all the questions.

> > Reuti,
> > We have configured all our queues with rerun.
>
> Hugo,
>
> you didn't answer all questions.
>
> a) it's not working for you?
[Hugo Hernandez-Mora]
The SGE configuration works fine on our production cluster, which runs version 6.1u4, but it does not work on our testing cluster, which runs version 6.2u1.

> b) did the node crash and disappear also from "qhost", i.e. has
> dashes or "au" in `qstat -f`?
[Hugo Hernandez-Mora]
On the production cluster I see an 'au' state once the qmaster detects that it has lost communication with the exechost. On the testing cluster, however, I have noticed that the state stays at 'r' until the node is rebooted or reinstalled; while the node boots, the state changes to 'au' and then returns to normal once the qmaster hears from the exechost again.
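
For comparing the two clusters, a small script can list the queue instances that `qstat -f` reports as unknown. This is only a sketch: it assumes the usual column layout of `qstat -f` (queue name, type, slots, load average, architecture, and an optional states column), and the host names and job rows in the sample are hypothetical.

```python
from typing import List


def unknown_queue_instances(qstat_f_output: str) -> List[str]:
    """Return the queue instances whose states column contains 'u' (unknown)."""
    flagged = []
    for line in qstat_f_output.splitlines():
        fields = line.split()
        # Queue-instance rows start with "queue@host"; the header row,
        # dashed separators, and job rows do not, so they are skipped.
        if len(fields) < 5 or "@" not in fields[0]:
            continue
        # The states column is empty (absent) when the instance is healthy.
        states = fields[5] if len(fields) > 5 else ""
        if "u" in states:
            flagged.append(fields[0])
    return flagged


# Hypothetical sample, shaped like `qstat -f` output; not taken from a real cluster.
SAMPLE = """\
queuename                      qtype resv/used/tot. load_avg arch          states
---------------------------------------------------------------------------------
all.q@node01                   BIP   0/1/4          0.01     lx24-amd64
    101 0.55500 sleeper    hugo         r     02/09/2009 13:00:00     1
---------------------------------------------------------------------------------
all.q@node02                   BIP   0/1/4          -NA-     lx24-amd64    au
    102 0.55500 sleeper    hugo         r     02/09/2009 13:05:00     1
"""

if __name__ == "__main__":
    print(unknown_queue_instances(SAMPLE))
```

If a crashed node never shows up in this list, the qmaster has not marked it as unknown, which would match the behaviour described for the testing cluster.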

> I have seen situations where a node can no longer be accessed with ssh
> but still appears to be working in qhost, as the execd still sends
> information to the qmaster. Then SGE can't do anything.
[Hugo Hernandez-Mora]
That appears to be the situation on our testing cluster. It may also happen in some form on our production cluster, but for now I am focusing on the testing cluster because we plan to move to this version soon.

Thanks for your time,
-Hugo

>
> -- Reuti
>
>
> > -Hugo
> >
> > --
> > Hugo R. Hernandez-Mora
> > System Administrator
> > Laboratory of Neuro Imaging, UCLA
> > 635 Charles E. Young Drive South, Suite 225
> > Los Angeles, CA 90095-7332
> > Tel: 310.267.5076
> > Fax: 310.206.5518
> > hugo.hernandez at loni.ucla.edu
> > --
> >
> > "If your efforts were met with indifference, do not lose heart,
> > for the sun puts on a marvelous show every morning
> > while most people are still asleep."
> >
> >
> >> -----Original Message-----
> >> From: reuti [mailto:reuti at staff.uni-marburg.de]
> >> Sent: Wednesday, February 04, 2009 7:45 AM
> >> To: users at gridengine.sunsource.net
> >> Subject: Re: [GE users] Rescheduled job causing a node to crash
> >>
> >> On 04.02.2009 at 02:45, hugo_hernandez wrote:
> >>
> >>> Reuti,
> >>>
> >>>> -----Original Message-----
> >>>> From: reuti [mailto:reuti at staff.uni-marburg.de]
> >>>> Sent: Friday, January 30, 2009 7:19 AM
> >>>> To: users at gridengine.sunsource.net
> >>>> Subject: Re: [GE users] Rescheduled job causing a node to crash
> >>>>
> >>>> Hi,
> >>>>
> >>>> On 30.01.2009 at 01:58, hugo_hernandez wrote:
> >>>>
> >>>>> I have noticed that when a node crashes, for whatever reason, while
> >>>>> a job is running on it, the job is rescheduled only once the node
> >>>>> is back online (after a reboot or reinstall). Is SGE not supposed
> >>>>> to detect that there is a problem with a job, so that if there is
> >>>>> no communication between the execd daemon on the exechost and the
> >>>>> qmaster, the job is rescheduled regardless of whether the compute
> >>>>> node it was previously running on comes back online? Am I doing
> >>>>> something wrong in my configuration?
> >>>>
> >>>> there are some entries in SGE's configuration which might help:
> >>>>
> >>>> max_unheard                  00:05:00
> >>>> reschedule_unknown           00:01:00
> >>> [Hugo Hernandez-Mora]
> >>> We have set our configuration to use the same value for max_unheard
> >>> but for reschedule_unknown we have set 00:02:00.
> >>
> >> This is fine. And it's not working for you? You have submitted the
> >> jobs with "-r y" or configured the queue with "rerun TRUE"?
> >>
> >> -- Reuti
> >>
> >>
> >>>
> >>>>
> >>>> (man sge_conf) You will need to submit the jobs with "-r y" and/or
> >>>> set the queue's configuration "rerun TRUE" - Reuti
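
The settings discussed above interact roughly as follows: a host is marked unknown after `max_unheard` without contact from its execd, and jobs on it become candidates for rescheduling `reschedule_unknown` later, provided they are rerunnable. The sketch below only does that arithmetic. The function names are hypothetical, the precedence of a job-level "-r" over the queue's "rerun" is my reading of the man pages, and real timing also depends on when the qmaster next checks the host.

```python
from datetime import timedelta
from typing import Optional


def parse_hms(value: str) -> timedelta:
    """Parse an SGE-style HH:MM:SS interval such as '00:05:00'."""
    hours, minutes, seconds = (int(part) for part in value.split(":"))
    return timedelta(hours=hours, minutes=minutes, seconds=seconds)


def is_rerunnable(job_rerun: Optional[bool], queue_rerun: bool) -> bool:
    """Assumed precedence: an explicit job-level '-r y|n' wins;
    otherwise the queue's 'rerun' setting is used."""
    return queue_rerun if job_rerun is None else job_rerun


def earliest_reschedule(max_unheard: str,
                        reschedule_unknown: str) -> Optional[timedelta]:
    """Approximate delay between the last execd report and rescheduling.

    Returns None when reschedule_unknown is 00:00:00, the value that
    switches rescheduling off.
    """
    grace = parse_hms(reschedule_unknown)
    if grace == timedelta(0):
        return None
    return parse_hms(max_unheard) + grace


if __name__ == "__main__":
    # Values quoted in this thread: 00:01:00 (suggested) and 00:02:00 (Hugo's).
    print(earliest_reschedule("00:05:00", "00:01:00"))
    print(earliest_reschedule("00:05:00", "00:02:00"))
```

Note that the clock only starts once the execd stops reporting. If a crashed node's execd keeps sending load reports, the host never becomes unknown and neither setting takes effect.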
> >>>
> >>> -Hugo
> >>>
> >>> ------------------------------------------------------
> >>> http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=101815
> >>>
> >>> To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].
> >>>

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=103017

To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].


