[GE users] Restarting sge_execd does not clear hung job status

coffman michael.coffman at avagotech.com
Thu Oct 7 15:01:41 BST 2010





On Thu, Oct 7, 2010 at 7:30 AM, reuti <reuti at staff.uni-marburg.de> wrote:
On 07.10.2010 at 15:18, coffman wrote:

> On Thu, Oct 7, 2010 at 7:16 AM, reuti <reuti at staff.uni-marburg.de> wrote:
> On 07.10.2010 at 15:10, coffman wrote:
>
> > > On Thu, Oct 7, 2010 at 3:45 AM, reuti <reuti at staff.uni-marburg.de> wrote:
> > Hi,
> >
> > On 07.10.2010 at 00:08, coffman wrote:
> >
> > > I recently moved from 6.0u8 to 6.2u5 and am noticing a different behavior that I could use some help with.  On the previous version of Grid Engine we would occasionally have a host hang in such a way that it needed to be rebooted.  When this happened, the information for the affected job would be cleared from the scheduler.
> > >
> > > Version 6.2u5 does not behave the same way.  The system running a particular job has been rebooted, so the job is definitely no longer running.  When the system comes back up, sge_execd is started on the exec host.  A qstat still shows the job as running on the host that was rebooted.  Any clues as to why it does not get cleaned up?
> >
> > is the (local) spool directory of the node removed when the node is rebooted?
> >
> >
> > Yes.   All that is left is the following:
> >
> > ./cs511
> > ./cs511/jobs
> > ./cs511/job_scripts
> > ./cs511/active_jobs
> > ./cs511/messages
> > ./cs511/execd.pid
>
> Then it's no wonder. The execd can't check whether any job is missing, and hence won't tell the qmaster anything. The contents of ./cs511/active_jobs must survive the reboot; then the qmaster will remove the crashed jobs once it has been informed by the execd.
>
>
> OK.   But why would the directories be empty unless sge_execd cleaned them up?

Some installations I have seen are diskless and create the local spool directories fresh with each restart. What is inside messages - anything like "job ... not found"?


The common directory is an NFS mount point, but the spool directory is local to the node.
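
(As a sanity check on where the execd is actually spooling - cs511 is just this node's name, and the last path below is a placeholder for whatever the first two commands report:)

    # host-local value, then the global fallback, for the execd spool directory
    qconf -sconf cs511 | grep execd_spool_dir
    qconf -sconf       | grep execd_spool_dir
    # check whether that path sits on tmpfs or something else recreated at boot
    df -h /path/reported/above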

10/07/2010 07:30:41|  main|cs408|E|fopen("/opt/grid-6.2u5/ftcrnd/common/act_qmaster") failed: No such file or directory
10/07/2010 07:31:03|  main|cs408|E|shepherd of job 825292.1 died through signal = 15
10/07/2010 07:31:03|  main|cs408|E|abnormal termination of shepherd for job 825292.1: "exit_status" file is empty
10/07/2010 07:31:03|  main|cs408|E|can't open usage file "active_jobs/825292.1/usage" for job 825292.1: No such file or directory
10/07/2010 07:31:03|  main|cs408|E|shepherd exited with exit status 19: before writing exit_status
10/07/2010 07:31:03|  main|cs408|I|controlled shutdown 6.2u5
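
If Reuti's explanation is right and the active_jobs contents are simply gone, my assumption is that the stale entry can only be cleared by hand with a forced delete at the qmaster (manager rights needed; 825292 is the job id from the log above):

    # force the qmaster to deregister the job even though the execd can no longer report on it
    qdel -f 825292
    # the job should then disappear from qstat
    qstat -j 825292

Longer term, presumably the fix is to point execd_spool_dir at a path that survives reboots, so the execd can report the lost job itself.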



-- Reuti


> -- Reuti
>
>
> >
> > -- Reuti
> >
> > --
> > -MichaelC
>
> --
> -MichaelC

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=286411

To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].



--
-MichaelC


