Opened 13 years ago

Last modified 11 years ago

#598 new defect

IZ2794: qmaster kills all running jobs because of job state 65536

Reported by: jlopez
Owned by:
Priority: high
Milestone:
Component: sge
Version: 6.1u3
Severity:
Keywords: Linux qmaster


[Imported from gridengine issuezilla]

        Issue #:      2794             Platform:     All      Reporter: jlopez (jlopez)
       Component:     gridengine          OS:        Linux
     Subcomponent:    qmaster          Version:      6.1u3       CC:    None defined
        Status:       NEW              Priority:     P2
      Resolution:                     Issue type:    DEFECT
                                   Target milestone: ---
      Assigned to:    ernst (ernst)
      QA Contact:     ernst
       * Summary:     qmaster kills all running jobs because of job state 65536
   Status whiteboard:

     Issue 2794 blocks:
   Votes for issue 2794:

   Opened: Wed Nov 19 10:09:00 -0700 2008 

The symptoms were that at 14:57 all running jobs suddenly appeared in state
65536 for the qmaster, and a few seconds after that the qmaster killed them all.

These are the messages that appear in the qmaster logs:

11/13/2008 14:56:57|qmaster|cn142|E|execd cn068.null reports running
state for job (691876.1/1.cn068) in queue "medium_queue@cn068.null"
while job is in state 65536

11/13/2008 14:58:07|qmaster|cn142|E|execd@cn035.null reports running job
(691876.1/1.cn035) in queue "medium_queue@cn035.null" that was not
supposed to be there - killing

These two messages are repeated for every job that was running at that time.
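
To see how many jobs were affected, the two message types can be counted
directly from the qmaster messages file. The following is only a rough
sketch: it assumes each log entry sits on a single line in the file (the
excerpts above are wrapped for readability) and that the field layout is
date time|daemon|host|level|text, as the excerpts suggest. The function name
summarize and the regular expressions are illustrative and are not part of
Grid Engine.

```python
import re

# Patterns derived from the two messages quoted in this report (assumed wording).
STATE_RE = re.compile(
    r"reports running state for job \((\d+)\.(\d+)/.*while job is in state (\d+)"
)
KILL_RE = re.compile(
    r"reports running job \((\d+)\.(\d+)/.*not supposed to be there - killing"
)


def summarize(lines):
    """Return ({job.task: reported state}, {job.task killed}) for error entries."""
    odd_state, killed = {}, set()
    for line in lines:
        # Assumed layout: date time|daemon|host|level|text
        fields = line.rstrip("\n").split("|", 4)
        if len(fields) != 5 or fields[3] != "E":
            continue  # skip anything that is not an error-level entry
        text = fields[4]
        m = STATE_RE.search(text)
        if m:
            odd_state[f"{m.group(1)}.{m.group(2)}"] = int(m.group(3))
            continue
        m = KILL_RE.search(text)
        if m:
            killed.add(f"{m.group(1)}.{m.group(2)}")
    return odd_state, killed


# The two entries from this report, joined back into single lines.
sample = [
    '11/13/2008 14:56:57|qmaster|cn142|E|execd cn068.null reports running '
    'state for job (691876.1/1.cn068) in queue "medium_queue@cn068.null" '
    'while job is in state 65536',
    '11/13/2008 14:58:07|qmaster|cn142|E|execd@cn035.null reports running job '
    '(691876.1/1.cn035) in queue "medium_queue@cn035.null" that was not '
    'supposed to be there - killing',
]

odd_state, killed = summarize(sample)
print(odd_state, killed)
```

Run against the real file, passing the open file object instead of sample,
this would list every job reported in an unexpected state and every job
that was subsequently killed.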

We are using the IA64 binaries under SLES10 SP1.

   ------- Additional comments from jlopez Mon Jan 12 04:48:22 -0700 2009 -------
Until now the issue has not appeared again.

   ------- Additional comments from jlopez Tue Feb 3 09:13:24 -0700 2009 -------
We think it could be related to the fact that we were using Berkeley DB over NFSv3.

It did not happen again.

