[GE users] Restarting sge_execd does not clear hung job status

reuti reuti at staff.uni-marburg.de
Thu Oct 7 14:16:06 BST 2010

On 07.10.2010 at 15:10, coffman wrote:

> On Thu, Oct 7, 2010 at 3:45 AM, reuti <reuti at staff.uni-marburg.de> wrote:
> Hi,
> On 07.10.2010 at 00:08, coffman wrote:
> > I recently moved from 6.0u8 to 6.2u5 and am noticing a different behavior that I could use some help with. On the previous version of grid we would occasionally have a grid system hang in such a way that it needed to be rebooted. When this happened, the job info for the affected job would be cleared from the scheduler.
> >
> > Version 6.2u5 does not behave the same way. The system running a particular job has been rebooted, so the job is definitely no longer running. When the system comes back up, sge_execd is started on the exechost. A qstat still shows the job as running on the host that was rebooted. Any clues as to why it does not get cleaned up?
> Is the (local) spool directory of the node removed when the node is rebooted?
> Yes. All that is left is the following:
> ./cs511
> ./cs511/jobs
> ./cs511/job_scripts
> ./cs511/active_jobs
> ./cs511/messages
> ./cs511/execd.pid

Then it's no wonder. With its spool directory wiped, the execd can't check whether any job is missing, and hence won't tell the qmaster anything. The contents of ./cs511/active_jobs must survive the reboot; the qmaster will then remove the crashed jobs once the execd has informed it.
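
As a rough way to check this on a node, the sketch below lists whatever is still present under the execd's active_jobs directory. It is only an illustration: the function name is made up, the spool path has to be supplied by you, and it assumes each running job shows up as one entry (for example something like "4711.1") in that directory. It does not talk to the qmaster or use any Grid Engine API.

```python
import os


def surviving_active_jobs(execd_spool_dir):
    """Return the sorted entry names found in <execd_spool_dir>/active_jobs.

    execd_spool_dir is the per-host execd spool directory (the ./cs511
    directory in the listing above). An empty result after a reboot means
    the execd has no record of the jobs it was running before the reboot.
    """
    active_dir = os.path.join(execd_spool_dir, "active_jobs")
    if not os.path.isdir(active_dir):
        # The directory itself is gone, so nothing survived.
        return []
    return sorted(os.listdir(active_dir))
```

Calling surviving_active_jobs() with the node's spool directory once before and once after a reboot shows whether the entries are preserved. If the list is empty after the reboot while qstat still reports jobs on that host, the spool directory is being cleared.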

-- Reuti

> -- Reuti
> ------------------------------------------------------
> http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=286404
> To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].
> -- 
> -MichaelC


