[GE users] Restarting sge_execd does not clear hung job status

reuti reuti at staff.uni-marburg.de
Thu Oct 7 14:30:05 BST 2010


On 07.10.2010 at 15:18, coffman wrote:

> On Thu, Oct 7, 2010 at 7:16 AM, reuti <reuti at staff.uni-marburg.de> wrote:
> On 07.10.2010 at 15:10, coffman wrote:
> 
> > On Thu, Oct 7, 2010 at 3:45 AM, reuti <reuti at staff.uni-marburg.de> wrote:
> > Hi,
> >
> > On 07.10.2010 at 00:08, coffman wrote:
> >
> > > I recently moved from 6.0u8 to 6.2u5 and am noticing different behavior that I could use some help with. On the previous version, we would occasionally have a grid system hang in such a way that it needed to be rebooted. When this happened, the job info related to the job would be cleared from the scheduler.
> > >
> > > Version 6.2u5 does not behave the same way. The system running a particular job has been rebooted, so the job is definitely no longer running. When the system comes back up, sge_execd is started on the exechost. A qstat still shows the job as running on the host that was rebooted. Any clues as to why it does not get cleaned up?
> >
> > Is the (local) spool directory of the node removed when the node is rebooted?
> >
> >
> > Yes. All that is left is the following:
> >
> > ./cs511
> > ./cs511/jobs
> > ./cs511/job_scripts
> > ./cs511/active_jobs
> > ./cs511/messages
> > ./cs511/execd.pid
> 
> Then it's no wonder. The execd can't check whether any job is missing, and hence won't tell the qmaster anything. The contents of ./cs511/active_jobs must survive the reboot; then the qmaster will finally remove the crashed jobs once the execd has informed it.
> 
> 
> OK. But why would the directories be empty unless sge_execd cleaned them up?

Some installations I have seen are diskless and create the local spool directories fresh on each restart. What is in the messages file: anything about "job ... not found" or similar?
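
To make the two checks above concrete, here is a minimal Python sketch, not an official Grid Engine tool. It assumes the per-host spool layout shown earlier in the thread (an active_jobs directory and a messages file under the host's spool directory). The function name inspect_execd_spool is hypothetical, and the exact wording of the log entries is an assumption, so the match on "job" plus "not found" is deliberately loose.

```python
from pathlib import Path


def inspect_execd_spool(spool_dir):
    """Return (active_job_entries, matching_message_lines) for one host's spool dir.

    spool_dir is the per-host directory, e.g. the one containing
    active_jobs/ and messages (layout assumed from the listing in this thread).
    """
    spool = Path(spool_dir)

    # Whatever survives in active_jobs is all the execd can use to notice
    # that a job from before the reboot is missing.
    active_dir = spool / "active_jobs"
    if active_dir.is_dir():
        active = sorted(p.name for p in active_dir.iterdir())
    else:
        active = []

    # Loose scan of the messages file; the exact log wording is an assumption.
    hits = []
    messages = spool / "messages"
    if messages.is_file():
        with messages.open(errors="replace") as fh:
            for line in fh:
                lowered = line.lower()
                if "job" in lowered and "not found" in lowered:
                    hits.append(line.rstrip("\n"))

    return active, hits
```

Called with the host's spool directory (the actual path depends on the local execd_spool_dir setting), an empty first list after a reboot would match the situation described above: the execd has nothing to compare against, so it reports nothing to the qmaster.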

-- Reuti


> -- Reuti
> 
> 
> >
> > -- Reuti
> >
> > ------------------------------------------------------
> > http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=286404
> >
> > To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].
> >
> >
> >
> > --
> > -MichaelC
> 
> ------------------------------------------------------
> http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=286409
> 
> 
> 
> 
> -- 
> -MichaelC

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=286411



