Opened 16 years ago

Last modified 9 years ago

#167 new enhancement

IZ1014: QSUB: qsub -sync y -S blah exit.sh never returns

Reported by: templedf
Owned by:
Priority: normal
Milestone:
Component: sge
Version: 6.0beta2
Severity:
Keywords: Sun SunOS clients
Cc:

Description

[Imported from gridengine issuezilla http://gridengine.sunsource.net/issues/show_bug.cgi?id=1014]

        Issue #:      1014             Platform:     Sun           Reporter: templedf (templedf)
       Component:     gridengine          OS:        SunOS
     Subcomponent:    clients          Version:      6.0beta2         CC:    None defined
        Status:       NEW              Priority:     P3
      Resolution:                     Issue type:    ENHANCEMENT
                                   Target milestone: ---
      Assigned to:    andreas (andreas)
      QA Contact:     roland
          URL:
       * Summary:     QSUB: qsub -sync y -S blah exit.sh never returns
   Status whiteboard:
      Attachments:

     Issue 1014 blocks:
   Votes for issue 1014:


   Opened: Mon May 3 02:34:00 -0700 2004 
------------------------


When a job submitted with -sync y goes into the error
state, qsub hangs until interrupted.  The
reason is that there is no JOB_FINISH event sent
for a job that goes into the error state.

There are two possible solutions:
1) Send a JOB_FINISH event when a job goes into
the error state.  Does this make sense?  Is an
errored job technically "finished?"
2) Have JAPI react to the JERROR state of
JATASK_MOD events.  The problem is then how to
convey through JAPI to qsub or DRMAA that the job
has failed.  japi_wait() and japi_synchronize() may
already provide appropriate error codes for this case.

Which do we like?
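
To make the difference concrete, here is a minimal sketch in C of a client-side wait loop. It is not Grid Engine code: every type and function name below (sketch_event, sketch_wait, and so on) is hypothetical, and only the event and state names JOB_FINISH, JATASK_MOD and JERROR are taken from this issue. A finite event array stands in for the event stream, and WAIT_STILL_BLOCKED stands in for "qsub keeps hanging".

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical event kinds, named after the events mentioned in this issue. */
typedef enum { EV_JOB_FINISH, EV_JATASK_MOD } sketch_event_type;

/* Hypothetical task states; only the error state matters here. */
typedef enum { ST_RUNNING, ST_JERROR } sketch_task_state;

typedef struct {
    sketch_event_type type;
    sketch_task_state state; /* meaningful only for EV_JATASK_MOD */
} sketch_event;

typedef enum { WAIT_FINISHED, WAIT_ERRORED, WAIT_STILL_BLOCKED } sketch_wait_result;

/*
 * Walk a finite list of events the way a waiting client would.
 * react_to_jerror == 0: only EV_JOB_FINISH ends the wait (the reported behavior).
 * react_to_jerror != 0: a JATASK_MOD carrying JERROR also ends it (option 2).
 */
sketch_wait_result sketch_wait(const sketch_event *events, size_t n,
                               int react_to_jerror)
{
    assert(events != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (events[i].type == EV_JOB_FINISH) {
            return WAIT_FINISHED;
        }
        if (react_to_jerror && events[i].type == EV_JATASK_MOD &&
            events[i].state == ST_JERROR) {
            return WAIT_ERRORED;
        }
    }
    /* No terminating event was seen: the caller would still be waiting. */
    return WAIT_STILL_BLOCKED;
}
```

With an event list that ends in a JATASK_MOD carrying JERROR, the first mode never leaves the wait, while the second mode returns WAIT_ERRORED. How that result would then be reported to qsub or DRMAA is the open question in option 2.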

   ------- Additional comments from sgrell Mon May 3 05:37:12 -0700 2004 -------
The second solution. The job is not finished; it is in the error state.
If the user wants to kill the job because of the error state, that is his
decision, and he can use qdel. Or he can fix the problem and restart
the job without submitting it again.

Stephan

   ------- Additional comments from templedf Thu May 13 08:20:26 -0700 2004 -------
I agree that the second option is the correct one.  However,
japi_wait() is currently limited to the following error codes:

DRMAA_ERRNO_SUCCESS
   Job finished.
DRMAA_ERRNO_EXIT_TIMEOUT
   No job end within specified time.
DRMAA_ERRNO_INVALID_JOB
   The job id specified was invalid or DRMAA_JOB_IDS_SESSION_ANY has
been specified and all jobs of this session have already finished.
DRMAA_ERRNO_NO_ACTIVE_SESSION
   No active session.
DRMAA_ERRNO_DRM_COMMUNICATION_FAILURE
DRMAA_ERRNO_AUTH_FAILURE
DRMAA_ERRNO_NO_RUSAGE

None of these convey that the job has entered the error state.  This
is another example of where Issue #859 would come in handy.

   ------- Additional comments from andreas Fri May 14 06:05:43 -0700 2004 -------
It must be possible to handle this as a new job state
transition that can be monitored through japi_wait().

   ------- Additional comments from templedf Mon May 17 23:03:52 -0700 2004 -------
Implementing this as a monitorable state change is also viable.  It
has the advantage that it requires much less effort to implement.
The idea would be to allow the "event" parameter to japi_wait to
include JAPI_JOB_ERROR as a valid result.  qsub would then need to
handle this additional state when japi_wait() returns.
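
A rough sketch of what this could look like on the qsub side, written in C. Everything here is an assumption for illustration: the sketch_ names are hypothetical, sketch_wait() is a stand-in and does not have the real japi_wait() signature, and the exit status used for an errored job (SKETCH_EXIT_JOB_ERRORED) is an arbitrary choice, not something decided in this issue.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical event values; the ERROR value mirrors the proposed JAPI_JOB_ERROR. */
typedef enum { SKETCH_JAPI_JOB_FINISH, SKETCH_JAPI_JOB_ERROR } sketch_japi_event;

/* Assumed exit status for "job went into the error state"; arbitrary. */
#define SKETCH_EXIT_JOB_ERRORED 1

/* Stand-in for a wait call that reports what ended the wait via an out-parameter. */
void sketch_wait(int job_errored, sketch_japi_event *event)
{
    assert(event != NULL);
    *event = job_errored ? SKETCH_JAPI_JOB_ERROR : SKETCH_JAPI_JOB_FINISH;
}

/* What a "-sync y" client could do once the wait returns. */
int sketch_qsub_sync(int job_errored, int job_exit_status)
{
    sketch_japi_event event;

    sketch_wait(job_errored, &event);
    switch (event) {
    case SKETCH_JAPI_JOB_FINISH:
        /* Normal case: pass the job's own exit status through. */
        return job_exit_status;
    case SKETCH_JAPI_JOB_ERROR:
        /* New case: stop waiting and tell the user instead of hanging. */
        fprintf(stderr, "job entered the error state\n");
        return SKETCH_EXIT_JOB_ERRORED;
    }
    return SKETCH_EXIT_JOB_ERRORED; /* not reached; keeps compilers quiet */
}
```

The point of the sketch is only that the caller branches on the reported event rather than on an error code, which is why no new error code is needed.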

As a theoretical aside, japi_synchronize() suffers from the same
problem and is unable to use this fix.  japi_synchronize() would either
need to treat errored jobs as finished or use the error code fix or
have another parameter that it can use to return information on what
happened to either the jobs as a group (e.g.
JAPI_JOB_ONE_OR_MORE_ERRORED) or to individual jobs.
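
The "another parameter" variant could be sketched in C as follows. This is a hypothetical model, not the real japi_synchronize(): the sketch_ names, the job-state enum and the errored_count out-parameter are all made up for illustration, and only the idea of a group result like JAPI_JOB_ONE_OR_MORE_ERRORED comes from the paragraph above.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-job states as seen by the waiting client. */
typedef enum { SKETCH_JOB_DONE, SKETCH_JOB_ERRORED, SKETCH_JOB_RUNNING } sketch_job_state;

/* Hypothetical group results; the middle one mirrors JAPI_JOB_ONE_OR_MORE_ERRORED. */
typedef enum {
    SKETCH_SYNC_ALL_FINISHED,
    SKETCH_SYNC_ONE_OR_MORE_ERRORED,
    SKETCH_SYNC_PENDING /* at least one job is still running: keep waiting */
} sketch_sync_result;

/*
 * Summarize a set of jobs. If errored_count is not NULL it receives the
 * number of jobs in the error state, as a simple form of per-group detail.
 */
sketch_sync_result sketch_synchronize(const sketch_job_state *jobs, size_t n,
                                      size_t *errored_count)
{
    size_t errored = 0;
    size_t running = 0;

    assert(jobs != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (jobs[i] == SKETCH_JOB_ERRORED) {
            errored++;
        } else if (jobs[i] == SKETCH_JOB_RUNNING) {
            running++;
        }
    }
    if (errored_count != NULL) {
        *errored_count = errored;
    }
    if (running > 0) {
        return SKETCH_SYNC_PENDING;
    }
    return errored > 0 ? SKETCH_SYNC_ONE_OR_MORE_ERRORED
                       : SKETCH_SYNC_ALL_FINISHED;
}
```

In this model an errored job no longer blocks the call, but the caller can still tell "all finished" apart from "some errored", which is the distinction the error-code approach was meant to provide.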

Since we have an alternative to error codes, I have removed the
dependency on Issue #859.

   ------- Additional comments from andreas Mon May 24 08:31:35 -0700 2004 -------
Changed to RFE.

   ------- Additional comments from sgrell Mon Dec 12 02:55:31 -0700 2005 -------
Changed the Subcomponent.

Stephan

