IZ2794: qmaster kills all running jobs because of job state 65536
Reported by: jlopez
Owned by:
[Imported from gridengine issuezilla http://gridengine.sunsource.net/issues/show_bug.cgi?id=2794]
Issue #: 2794
Platform: All
Reporter: jlopez (jlopez)
Component: gridengine
OS: Linux
Subcomponent: qmaster
Version: 6.1u3
CC: None defined
Status: NEW
Priority: P2
Resolution:
Issue type: DEFECT
Target milestone: ---
Assigned to: ernst (ernst)
QA Contact: ernst
URL:
Summary: qmaster kills all running jobs because of job state 65536
Status whiteboard:
Attachments:
Issue 2794 blocks:
Votes for issue 2794:

Opened: Wed Nov 19 10:09:00 -0700 2008
------------------------

The symptoms were that at 14:57 all running jobs suddenly appeared to the qmaster in state 65536, and a few seconds after that the qmaster killed them all.

These are the messages that appear in the qmaster logs:

  11/13/2008 14:56:57|qmaster|cn142|E|execd cn068.null reports running state for job (691876.1/1.cn068) in queue "firstname.lastname@example.org" while job is in state 65536
  11/13/2008 14:58:07|qmaster|cn142|Eemail@example.com reports running job (691876.1/1.cn035) in queue "firstname.lastname@example.org" that was not supposed to be there - killing

These two messages are repeated for every job that was running at that time.

We are using the IA64 binaries under SLES10 SP1.

------- Additional comments from jlopez Mon Jan 12 04:48:22 -0700 2009 -------

Until now the issue has not appeared again.

------- Additional comments from jlopez Tue Feb 3 09:13:24 -0700 2009 -------

We think it could be related to the fact that we were using Berkeley DB over NFSv3. It has not happened again.
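
For anyone checking whether their own qmaster log shows the same pattern, the following is a minimal sketch of a scan for the first message quoted above. It assumes the pipe-separated layout seen in that log line (timestamp, component, host, level, message); the function names `scan_state_mismatches` and `count_by_state` are made up for this illustration and are not part of Grid Engine.

```python
import re
from collections import Counter

# Assumed layout, inferred from the log line quoted in this ticket:
#   date time|component|host|level|message
STATE_RE = re.compile(
    r"reports running state for job \((?P<job>[^)]+)\) "
    r'in queue "(?P<queue>[^"]*)" while job is in state (?P<state>\d+)'
)


def scan_state_mismatches(lines):
    """Return (timestamp, job, state) for each state-mismatch error line."""
    hits = []
    for line in lines:
        fields = line.rstrip("\n").split("|", 4)
        if len(fields) != 5:
            continue  # skip lines that do not have the five expected fields
        timestamp, _component, _host, level, message = fields
        if level != "E":
            continue  # only error-level entries are of interest here
        match = STATE_RE.search(message)
        if match:
            hits.append(
                (timestamp, match.group("job"), int(match.group("state")))
            )
    return hits


def count_by_state(hits):
    """Count how many mismatch messages were logged per reported job state."""
    return Counter(state for _timestamp, _job, state in hits)
```

Passing the lines of the qmaster messages file (its location depends on the installation) to `scan_state_mismatches` and then `count_by_state` shows how many jobs were reported in state 65536. A large count within a single minute would match the symptom described here.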