Opened 16 years ago

Closed 6 years ago

#85 closed enhancement (fixed)

IZ483: reporting of reason for job abort

Reported by: fy
Owned by:
Priority: normal
Milestone:
Component: sge
Version: 5.3p2
Severity: minor
Keywords: execution
Cc:

Description

[Imported from gridengine issuezilla http://gridengine.sunsource.net/issues/show_bug.cgi?id=483]

        Issue #:      483              Platform:     All           Reporter: fy (fy)
       Component:     gridengine          OS:        All
     Subcomponent:    execution        Version:      5.3p2            CC:    None defined
        Status:       NEW              Priority:     P3
      Resolution:                     Issue type:    ENHANCEMENT
                                   Target milestone: ---
      Assigned to:    andreas (andreas)
      QA Contact:     pollinger
          URL:
       * Summary:     reporting of reason for job abort
   Status whiteboard:
      Attachments:

     Issue 483 blocks:
   Votes for issue 483:


   Opened: Wed Feb 5 08:11:00 -0700 2003 
------------------------


When a job exceeds one of its resource limits, say
h_rt, it is killed with SIGKILL, but there is no
clear indication of why it was killed. None of the
files in $SGE_JOB_SPOOL_DIR provide any additional
information, so writing an epilog does not help.
The best we can do is guess from the failed code
(=100) and end_time-start_time.
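
The guesswork described above can be sketched roughly as follows. This is
only an illustration of the heuristic, not code from Grid Engine: the
function name, its parameters and the slack tolerance are all assumptions,
and the caller is assumed to have already extracted the failed code, the
start and end times, and the requested h_rt.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper (not part of Grid Engine): guess whether a job was
 * probably killed for exceeding its h_rt limit.
 *
 *   failed      - the "failed" code reported for the job
 *   start_time  - job start, seconds since the epoch
 *   end_time    - job end, seconds since the epoch
 *   h_rt_limit  - requested h_rt in seconds
 *   slack       - assumed tolerance in seconds for the delay between the
 *                 limit being reached and the job actually ending
 */
static bool probably_hit_h_rt(int failed, int64_t start_time,
                              int64_t end_time, int64_t h_rt_limit,
                              int64_t slack)
{
    /* The report mentions failed code 100 for the killed job. */
    if (failed != 100) {
        return false;
    }
    /* Reject inconsistent input rather than guessing from it. */
    if (end_time < start_time || h_rt_limit <= 0 || slack < 0) {
        return false;
    }
    int64_t elapsed = end_time - start_time;

    /* Run time reached the limit and ended shortly after it. */
    return elapsed >= h_rt_limit && elapsed <= h_rt_limit + slack;
}
```

Even when this returns true the result is still a guess: a job that fails
for an unrelated reason close to its limit looks exactly the same, which is
the motivation for the request below.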

The enhancement being asked for is some kind of
reason as to why the job was killed, e.g.
(reason=limit_exceeded, limit=h_rt).

   ------- Additional comments from andreas Fri Apr 15 06:56:13 -0700 2005 -------
HOWTOFIX:
If a job was terminated because a limit was exceeded, a new SSTATE_* such as
SSTATE_LIMIT_EXCEEDED is needed. Each time sge_execd initiates job termination
a flag must be set for that job. This will allow sge_execd to report
SSTATE_LIMIT_EXCEEDED for those jobs/tasks.
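
A minimal sketch of the flag idea, under stated assumptions: every type,
constant and function name below is hypothetical (prefixed sketch_/SKETCH_)
and does not correspond to actual sge_execd code or to the real SSTATE_*
values. It only shows a flag being set when termination is initiated and
consulted later when the job state is reported.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical states; the real SSTATE_* constants and values differ. */
enum sketch_sstate {
    SKETCH_SSTATE_OK = 0,
    SKETCH_SSTATE_FAILURE,
    SKETCH_SSTATE_LIMIT_EXCEEDED
};

/* Hypothetical per-job record kept by the daemon. */
struct sketch_job {
    unsigned long job_id;
    bool killed_by_limit;    /* set when the daemon initiates termination */
    const char *limit_name;  /* e.g. "h_rt"; NULL if no limit was hit */
};

/* Called at the point where the daemon decides to terminate the job. */
static void sketch_terminate_for_limit(struct sketch_job *job,
                                       const char *limit_name)
{
    job->killed_by_limit = true;
    job->limit_name = limit_name;
    /* A real daemon would deliver the signal to the job here. */
}

/* Called when the final job state is reported. */
static enum sketch_sstate sketch_report_state(const struct sketch_job *job,
                                              bool exited_cleanly)
{
    if (job->killed_by_limit) {
        return SKETCH_SSTATE_LIMIT_EXCEEDED;
    }
    return exited_cleanly ? SKETCH_SSTATE_OK : SKETCH_SSTATE_FAILURE;
}
```

Recording the limit name next to the flag would also let the report carry
the (reason=limit_exceeded, limit=h_rt) pair asked for in the description.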

   ------- Additional comments from andreas Mon Jul 25 02:15:10 -0700 2005 -------
Changed execd logging in Maintrunk from Info to Warning to improve diagnostics.
Further improvements seem possible. Added related comments to
daemons/execd/execd_signal_queue.c

   ------- Additional comments from andreas Mon Aug 8 04:16:35 -0700 2005 -------
See issue 1743 for a reasonable 6.0-based workaround and an RFE on how to generally
improve job diagnostics based on job life-cycle information available in
reporting(5).

   ------- Additional comments from sgrell Mon Dec 12 03:04:05 -0700 2005 -------
Changed the subcomponent.

Stephan

Change History (1)

comment:1 Changed 6 years ago by dlove

  • Resolution set to fixed
  • Severity set to minor
  • Status changed from new to closed

AH-2005-07-25-0
