Opened 17 years ago
Last modified 10 years ago
#167 new enhancement
IZ1014: QSUB: qsub -sync y -S blah exit.sh never returns
Reported by: | templedf | Owned by: | |
---|---|---|---|
Priority: | normal | Milestone: | |
Component: | sge | Version: | 6.0beta2 |
Severity: | | Keywords: | Sun SunOS clients |
Cc: |
Description
[Imported from gridengine issuezilla http://gridengine.sunsource.net/issues/show_bug.cgi?id=1014]
Issue #: 1014
Platform: Sun
Reporter: templedf (templedf)
Component: gridengine
OS: SunOS
Subcomponent: clients
Version: 6.0beta2
CC: None defined
Status: NEW
Priority: P3
Resolution:
Issue type: ENHANCEMENT
Target milestone: ---
Assigned to: andreas (andreas)
QA Contact: roland
URL:
* Summary: QSUB: qsub -sync y -S blah exit.sh never returns
Status whiteboard:
Attachments:
Issue 1014 blocks:
Votes for issue 1014:
Opened: Mon May 3 02:34:00 -0700 2004
------------------------
When a job submitted with -sync y goes into the error state, qsub hangs until interrupted. The reason is that there is no JOB_FINISH event sent for a job that goes into the error state. There are two possible solutions:

1) Send a JOB_FINISH event when a job goes into the error state. Does this make sense? Is an errored job technically "finished"?

2) Have JAPI react to the JERROR state of JATASK_MOD events. The problem is then how to convey through JAPI to qsub or DRMAA that the job has failed. japi_wait() and japi_synchronize() may already provide appropriate error codes for this case.

Which do we like?

------- Additional comments from sgrell Mon May 3 05:37:12 -0700 2004 -------
The second solution. The job is not finished; it is in the error state. If the user wants to kill the job because of the error state, that is his decision, and he can use qdel. Or he can fix the problem and restart the job without submitting it again.

Stephan

------- Additional comments from templedf Thu May 13 08:20:26 -0700 2004 -------
I agree that the second option is the correct one. However, japi_wait() is currently limited to the following error codes:

DRMAA_ERRNO_SUCCESS: Job finished.
DRMAA_ERRNO_EXIT_TIMEOUT: No job end within specified time.
DRMAA_ERRNO_INVALID_JOB: The job id specified was invalid, or DRMAA_JOB_IDS_SESSION_ANY has been specified and all jobs of this session have already finished.
DRMAA_ERRNO_NO_ACTIVE_SESSION: No active session.
DRMAA_ERRNO_DRM_COMMUNICATION_FAILURE
DRMAA_ERRNO_AUTH_FAILURE
DRMAA_ERRNO_NO_RUSAGE

None of these convey that the job has entered the error state. This is another example of where Issue #859 would come in handy.

------- Additional comments from andreas Fri May 14 06:05:43 -0700 2004 -------
It must be possible to handle this as a new job state transition that can be monitored through japi_wait().

------- Additional comments from templedf Mon May 17 23:03:52 -0700 2004 -------
Implementing this as a monitorable state change is also viable. It has the advantage that it requires much less effort to implement. The idea would be to allow the "event" parameter of japi_wait() to include JAPI_JOB_ERROR as a valid result. qsub would then need to handle this additional state when japi_wait() returns.

As a theoretical aside, japi_synchronize() suffers from the same problem and is unable to use this fix. japi_synchronize() would need to do one of the following: treat errored jobs as finished, use the error code fix, or take another parameter through which it can return information on what happened, either to the jobs as a group (e.g. JAPI_JOB_ONE_OR_MORE_ERRORED) or to individual jobs.

Since we have an alternative to error codes, I have removed the dependency on Issue #859.

------- Additional comments from andreas Mon May 24 08:31:35 -0700 2004 -------
Changed to RFE.

------- Additional comments from sgrell Mon Dec 12 02:55:31 -0700 2005 -------
Changed the Subcomponent.

Stephan
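The "monitorable state change" idea discussed in this ticket can be illustrated with a small standalone sketch. It is not the Grid Engine implementation and does not use the real japi_wait() signature. Every name prefixed with sketch_ or SKETCH_ is hypothetical, the event stream is modeled as a plain array, and the exit status values are arbitrary. It only shows the control flow: the wait returns as soon as it sees either a finish event or an error-state event, and the caller (standing in for qsub) handles the extra outcome instead of blocking forever.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical event kinds. SKETCH_JOB_ERROR stands in for the
 * JAPI_JOB_ERROR result proposed in the ticket; the others are
 * illustrative only. */
typedef enum {
    SKETCH_JOB_RUNNING, /* state change, job still active */
    SKETCH_JOB_FINISH,  /* stands in for a JOB_FINISH event */
    SKETCH_JOB_ERROR    /* job entered the error state */
} sketch_event_t;

/* Hypothetical outcome reported back to the caller of the wait. */
typedef enum {
    SKETCH_WAIT_TIMEOUT,  /* no finish or error event was seen */
    SKETCH_WAIT_FINISHED, /* job finished normally */
    SKETCH_WAIT_ERRORED   /* job is in the error state, not finished */
} sketch_wait_result_t;

/* Scan the events in order and stop at the first one that should
 * wake the waiter. Without the SKETCH_JOB_ERROR branch, an errored
 * job would never wake the waiter, which is the hang described in
 * the ticket. */
sketch_wait_result_t sketch_wait(const sketch_event_t *events, size_t n_events)
{
    assert(events != NULL || n_events == 0);

    for (size_t i = 0; i < n_events; i++) {
        if (events[i] == SKETCH_JOB_FINISH) {
            return SKETCH_WAIT_FINISHED;
        }
        if (events[i] == SKETCH_JOB_ERROR) {
            return SKETCH_WAIT_ERRORED;
        }
    }
    return SKETCH_WAIT_TIMEOUT;
}

/* Caller-side handling, standing in for what qsub -sync y would do
 * once the wait returns. The numeric values are arbitrary. */
int sketch_qsub_exit_status(sketch_wait_result_t result)
{
    switch (result) {
    case SKETCH_WAIT_FINISHED:
        return 0;
    case SKETCH_WAIT_ERRORED:
        return 1;
    default:
        return 2;
    }
}
```

In this sketch the errored job is reported as a distinct outcome rather than as "finished", which matches the preference expressed in the comments: the job stays in the error state, and the user decides whether to delete it with qdel or fix the problem and restart it.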