[GE users] Restarting sge_execd does not clear hung job status

coffman michael.coffman at avagotech.com
Thu Oct 7 14:18:44 BST 2010





On Thu, Oct 7, 2010 at 7:16 AM, reuti <reuti at staff.uni-marburg.de> wrote:
On 07.10.2010 at 15:10, coffman wrote:

> On Thu, Oct 7, 2010 at 3:45 AM, reuti <reuti at staff.uni-marburg.de> wrote:
> Hi,
>
> On 07.10.2010 at 00:08, coffman wrote:
>
> > I recently moved from 6.0u8 to 6.2u5 and am noticing different behavior that I could use some help with. On the previous version, we would occasionally have a grid system hang in such a way that it needed to be rebooted. When this happened, the job info related to the job would be cleared from the scheduler.
> >
> > Version 6.2u5 does not behave the same way. The system running a particular job has been rebooted, so the job is definitely no longer running. When the system comes back up, sge_execd is started on the exechost. A qstat still shows the job as running on the host that was rebooted. Any clues as to why it does not get cleaned up?
>
> Is the (local) spool directory of the node removed when the node is rebooted?
>
>
> Yes. All that is left is the following:
>
> ./cs511
> ./cs511/jobs
> ./cs511/job_scripts
> ./cs511/active_jobs
> ./cs511/messages
> ./cs511/execd.pid

Then it's no wonder. With nothing left in the spool directory, the execd can't tell that any job is missing, and hence won't report anything to the qmaster. The contents of ./cs511/active_jobs must survive the reboot; only then can the execd inform the qmaster, which will finally remove the crashed jobs.
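
To illustrate the point, here is a minimal sketch (not SGE code) that checks whether a spool directory still holds any entries under active_jobs. The function names are hypothetical, and the only layout assumed is the one in the listing above: an active_jobs subdirectory inside the host's spool directory. What the execd actually stores in there is not covered by this thread.

```python
import os


def list_active_jobs(spool_dir):
    """Return the sorted entry names found under <spool_dir>/active_jobs.

    An empty list means nothing survived that could identify a lost job.
    The layout is an assumption taken from the listing in this thread.
    """
    active_dir = os.path.join(spool_dir, "active_jobs")
    if not os.path.isdir(active_dir):
        # No active_jobs directory at all: nothing to report.
        return []
    return sorted(os.listdir(active_dir))


def has_job_records(spool_dir):
    """True if at least one entry is left under active_jobs."""
    return len(list_active_jobs(spool_dir)) > 0
```

If active_jobs is as empty as the listing suggests, list_active_jobs("./cs511") returns an empty list, which matches the situation described: there is nothing left from which a missing job could be detected.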


OK. But why would the directories be empty unless sge_execd cleaned them up?

-- Reuti


>
> -- Reuti
>
> ------------------------------------------------------
> http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=286404
>
> To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].
>
>
>
> --
> -MichaelC




--
-MichaelC


