Opened 10 years ago

Closed 8 years ago

#795 closed enhancement (fixed)

IZ3257: execd 'job exceeds job hard limit' message should include task id as well as job id.

Reported by: ccaamad Owned by:
Priority: normal Milestone:
Component: sge Version: 6.2u5
Severity: minor Keywords: PC Linux execution
Cc:

Description

[Imported from gridengine issuezilla http://gridengine.sunsource.net/issues/show_bug.cgi?id=3257]

        Issue #:      3257             Platform:     PC            Reporter: ccaamad (ccaamad)
       Component:     gridengine          OS:        Linux
     Subcomponent:    execution        Version:      6.2u5            CC:    None defined
        Status:       NEW              Priority:     P3
      Resolution:                     Issue type:    ENHANCEMENT
                                   Target milestone: ---
      Assigned to:    pollinger (pollinger)
      QA Contact:     pollinger
          URL:
       * Summary:     execd 'job exceeds job hard limit' message should include task id as well as job id.
   Status whiteboard:
      Attachments:

     Issue 3257 blocks:
   Votes for issue 3257:


   Opened: Wed Mar 31 01:21:00 -0700 2010 
------------------------


Looking at the execd messages file is a valuable way to understand why a job has unexpectedly ended - in particular, messages similar to:

03/17/2010 17:02:37|  main|c3s0b11n0|W|job 10657 exceeds job hard limit "h_vmem" of queue "c3s0.q@c3s0b11n0.arc1.leeds.ac.uk" (4195127296.00000 > limit:4194304000.00000) - sending SIGKILL

However, these messages do not currently include the task id, which makes it difficult to track down what has happened to array jobs.
There may be several thousand tasks with the same job id, many of them running simultaneously on the same host, so being able to parse
the logs and see what happened to each task is rather useful!
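
To illustrate the parsing problem, here is a minimal sketch of a helper that pulls the ids out of one of these lines. It is not part of Grid Engine: the function name and return convention are made up for this example, and it assumes the task id, if it were added, would appear as "job <jobid>.<taskid>" in the same way as the wallclock messages.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper, not part of Grid Engine.
 * Extracts the job id (and the task id, if one is present) from an execd
 * "exceeds job hard/soft limit" log line.
 * Returns 2 if both ids were found, 1 if only the job id was found,
 * and 0 if the line is not one of these messages. */
int parse_limit_message(const char *line, unsigned long *job_id,
                        unsigned long *task_id)
{
    unsigned long job = 0, task = 0;
    int fields;

    /* The message text follows the last '|' of the log prefix. */
    const char *msg = strrchr(line, '|');
    msg = (msg != NULL) ? msg + 1 : line;

    /* Matches "job 10657" (1 field) or "job 10657.42" (2 fields). */
    fields = sscanf(msg, "job %lu.%lu", &job, &task);
    if (fields < 1 || strstr(msg, " exceeds job ") == NULL) {
        return 0;
    }

    *job_id = job;
    *task_id = (fields == 2) ? task : 0;
    return fields;
}
```

With the current message format this can only ever return 1 for a limit message, so a log scraper cannot tell which task of an array job was killed.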

Looking at the source, the following messages are defined in gridengine/source/daemons/execd/msg_execd.h, lines 217 and 218:

#define MSG_JOB_EXCEEDHLIM_USSFF      _MESSAGE(29126, _("job "sge_U32CFormat" exceeds job hard limit "SFQ" of queue "SFQ" (%8.5f > limit:%8.5f) - sending SIGKILL"))
#define MSG_JOB_EXCEEDSLIM_USSFF      _MESSAGE(29127, _("job "sge_U32CFormat" exceeds job soft limit "SFQ" of queue "SFQ" (%8.5f > limit:%8.5f) - sending SIGXCPU"))

And used in gridengine/source/daemons/execd/execd_ck_to_do.c lines 277-293.

At the point where these messages are generated, there's a "jataskid" variable in scope which looks like it might contain what's needed.
Could the messages be extended to include this information, please?
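
As a rough, untested illustration of the format being asked for (not a patch against the real source), the sketch below builds the hard limit message with a task id added. U32_FMT and SFQ_FMT are stand-ins invented here so that the example compiles on its own; the real sge_U32CFormat and SFQ definitions in the Grid Engine headers may differ. The macro and function names are likewise hypothetical.

```c
#include <stdio.h>

/* Stand-ins for Grid Engine's format macros, defined here only so the
 * sketch is self-contained. The real definitions may differ. */
#define U32_FMT "%lu"
#define SFQ_FMT "\"%s\""

/* Hypothetical extended message: "job <id>.<task>" instead of "job <id>",
 * mirroring the style of the existing wallclock messages. */
#define EXCEEDHLIM_WITH_TASK_FMT \
    "job " U32_FMT "." U32_FMT " exceeds job hard limit " SFQ_FMT \
    " of queue " SFQ_FMT " (%8.5f > limit:%8.5f) - sending SIGKILL"

/* Formats the extended message into buf; returns snprintf's result. */
int format_hard_limit_message(char *buf, size_t size,
                              unsigned long job_id, unsigned long ja_task_id,
                              const char *limit_name, const char *queue,
                              double usage, double limit)
{
    return snprintf(buf, size, EXCEEDHLIM_WITH_TASK_FMT,
                    job_id, ja_task_id, limit_name, queue, usage, limit);
}
```

For job 10657, task 42, this would give a message starting "job 10657.42 exceeds job hard limit", with the rest of the text unchanged.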

Thanks,

Mark

   ------- Additional comments from reuti Wed Mar 31 03:12:21 -0700 2010 -------
For s_rt/h_rt it's already working this way. Looks like this message is created elsewhere.

   ------- Additional comments from ccaamad Wed Mar 31 03:58:14 -0700 2010 -------
That's right. h_rt/s_rt messages include the task id and are defined by lines 219/220 of msg_execd.h:

#define MSG_EXECD_EXCEEDHWALLCLOCK_UU _MESSAGE(29128, _("job "sge_U32CFormat"."sge_U32CFormat" exceeded hard wallclock time - initiate terminate method"))
#define MSG_EXECD_EXCEEDSWALLCLOCK_UU _MESSAGE(29129, _("job "sge_U32CFormat"."sge_U32CFormat" exceeded soft wallclock time - initiate soft notify method"))

And used by lines 455 and 474 of execd_ck_to_do.c.

We just need the 'exceeds job (hard|soft) limit' messages to include the task id as well. I'd include a simple patch, but I'm not
yet set up to rebuild Grid Engine and I don't want to offer something that isn't tested.

This would really be a big help - some of our users submit task arrays where >95% of tasks need <1G of memory and <5% need >4G. It aids
throughput to ask them to request 1G for the job and then resubmit those tasks that fail. Changing the message would aid identification of
what tasks have failed and why.

Thanks,

Mark

Change History (1)

comment:1 Changed 8 years ago by dlove

  • Resolution set to fixed
  • Severity set to minor
  • Status changed from new to closed