Opened 16 years ago

Last modified 9 years ago

#164 new defect

IZ977: qmaster logs job ack errors on exec hosts if load report interval is too high

Reported by: sgrell Owned by:
Priority: lowest Milestone:
Component: sge Version: current
Severity: Keywords: qmaster
Cc:

Description

[Imported from gridengine issuezilla http://gridengine.sunsource.net/issues/show_bug.cgi?id=977]

   Issue #:            977
   Platform:           All
   Reporter:           sgrell (sgrell)
   Component:          gridengine
   OS:                 All
   Subcomponent:       qmaster
   Version:            current
   CC:                 None defined
   Status:             REOPENED
   Priority:           P5
   Resolution:
   Issue type:         DEFECT
   Target milestone:   ---
   Assigned to:        andreas (andreas)
   QA Contact:         ernst
   URL:
   Summary:            qmaster logs job ack errors on exec hosts if load report interval is too high
   Status whiteboard:
   Attachments:

   Issue 977 blocks:
   Votes for issue 977:


   Opened: Tue Apr 20 04:57:00 -0700 2004 
------------------------


Hi,

The qmaster logs the following error:
   ack event for unknown job 79367

for jobs that are dispatched to a heavily loaded
system and then deleted while still in the t
(transferring) state.

Explanation:

In my case the execd needed 4 minutes to send the
first acknowledgement. The qmaster resends a job
every minute, so the execd received the job 4 times.

The execd sends an acknowledgement back for each
resend, but the qmaster expects only one. The job
can be deleted between the first and the later
acknowledgements, so that the target for those
acknowledgements is gone. The qmaster reports an
error message in this case.
If the job is not deleted, no error message is
generated, but the qmaster does the work needed
for an acknowledgement multiple times.
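
The sequence can be illustrated with a small toy model. This is only a
sketch of the behaviour described above, not Grid Engine code: the class
ToyQmaster, its methods, and the simulate() helper are hypothetical names,
and the model assumes that the execd returns exactly one acknowledgement
per resend.

```python
# Toy model of the resend/acknowledge race described in this issue.
# NOT Grid Engine code: ToyQmaster and simulate() are hypothetical names.

class ToyQmaster:
    def __init__(self):
        self.jobs = {}   # job id -> number of acknowledgements processed
        self.log = []    # error messages written by the toy qmaster

    def dispatch(self, job_id):
        self.jobs[job_id] = 0

    def delete(self, job_id):
        self.jobs.pop(job_id, None)

    def handle_ack(self, job_id):
        if job_id not in self.jobs:
            # The target of the acknowledgement is already gone.
            self.log.append(f"ack event for unknown job {job_id}")
            return
        # The acknowledgement work is repeated for every duplicate.
        self.jobs[job_id] += 1


def simulate(resends, delete_after_acks=None):
    """Deliver one ack per resend; optionally delete the job after N acks."""
    qmaster = ToyQmaster()
    job_id = 79367
    qmaster.dispatch(job_id)
    for ack_number in range(1, resends + 1):
        qmaster.handle_ack(job_id)
        if ack_number == delete_after_acks:
            qmaster.delete(job_id)
    return qmaster


# Job deleted after the first acknowledgement: the later ones find no job.
deleted = simulate(resends=4, delete_after_acks=1)
print(deleted.log)

# Job not deleted: no error, but the ack work is done once per resend.
kept = simulate(resends=4)
print(kept.jobs)
```

With 4 resends, the first run logs the error once for each acknowledgement
that arrives after the deletion, and the second run logs nothing but counts
every duplicate acknowledgement as processed.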

Stephan

   ------- Additional comments from andreas Tue May 4 02:02:11 -0700 2004 -------
changed summary

   ------- Additional comments from andy Thu May 27 02:41:39 -0700 2004 -------
Reprioritizing - the logging is ok if the execd does not (or cannot)
send an acknowledgement.

Having the execd under an extremely high load is unusual and indicates
a problem. Logging of resulting communication problems is acceptable.

It should be investigated whether the logging can be caused by other
problems as well.

   ------- Additional comments from sgrell Tue Dec 6 04:06:01 -0700 2005 -------
Changed subcomponent.

This issue should be fixed by now. A validation is needed.

Stephan

   ------- Additional comments from ernst Mon Dec 12 06:35:56 -0700 2005 -------
Issue is not fixed.

Change History (0)
