Opened 13 years ago

Last modified 11 years ago

#568 new defect

IZ2714: PVM start/stop scripts failure should not put queue in Error state

Reported by: guenter_herbert Owned by:
Priority: normal Milestone:
Component: sge Version: current
Severity: Keywords: execution


[Imported from gridengine issuezilla]

        Issue #:      2714             Platform:     All       Reporter: guenter_herbert (guenter_herbert)
       Component:     gridengine          OS:        All
     Subcomponent:    execution        Version:      current      CC:    reuti
        Status:       NEW              Priority:     P3
      Resolution:                     Issue type:    DEFECT
                                   Target milestone: ---
      Assigned to:    guenter_herbert (guenter_herbert)
      QA Contact:     pollinger
       * Summary:     PVM start/stop scripts failure should not put queue in Error state
   Status whiteboard:

     Issue 2714 blocks:
   Votes for issue 2714:

   Opened: Tue Sep 2 01:51:00 -0700 2008 

The template PVM start/stop scripts do not follow the generic PE start/stop
script API: they return exit code 1 in case of an error. Instead, they should
follow the protocol and return 100, which means: just exit the job, mark the
job as erred and (most important!) do not reschedule!
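A minimal sketch of what the requested behaviour could look like inside a start/stop script. The helper name `check_scratch`, the variable `PE_EXIT_JOB_ERROR` and the writability check are illustrative assumptions, not taken from the shipped templates; only the value 100 comes from this report.

```shell
#!/bin/sh
# Hypothetical helper for a PE start/stop script (names are illustrative only).
# 100 is the exit code this report asks for: job goes to error state, no reschedule.
# The current templates return 1 instead.
PE_EXIT_JOB_ERROR=100

# Return 0 if the given directory exists and is writable, otherwise 100.
check_scratch() {
    dir=$1
    if [ ! -d "$dir" ] || [ ! -w "$dir" ]; then
        echo "cannot write to $dir" >&2
        return "$PE_EXIT_JOB_ERROR"
    fi
    return 0
}

# A real script would then end with something like:
#   check_scratch "${TMPDIR:-/tmp}" || exit $?
```

The call is left commented out so the snippet can be sourced without terminating the calling shell.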

   ------- Additional comments from reuti Tue Sep 2 07:28:48 -0700 2008 -------
The PVM (like the MPI) templates, and also the scripts from my Howto, use 1 as the return code in case of an
error. As the start/stop scripts are prepared by the admins, and not the users, a return code of 1 means
something serious, e.g. a full filesystem in /tmp on the node, hence the return code will put the queue
on this node into error state and the job will be tried on another node. This is fine for the users.

If the jobs are put into error state instead, all jobs trying to run on that node will end up in error state, while
other jobs from the same submitted batch would succeed just because they happened to run on other nodes.

In `man sge_pe` I can't find the PE API stating that it must exit with 100, which is reserved for
application errors. If there should be a common error code, then it could even be extended:

110 start_proc error
112 stop_proc error
114 global prolog error
116 global epilog error
118 queue prolog error
120 queue epilog error

These default to rescheduling the job (put the node in error state); add 1 to each to disallow rescheduling of the job
(put the job in error state instead).
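The numbering suggested here can be sketched as a small lookup. This is only an illustration of the proposal, not an existing Grid Engine feature; the function name `pe_error_code` and the stage keywords are made up for the example.

```shell
#!/bin/sh
# Sketch of the proposed scheme: even base code = reschedule the job
# (node goes to error state), base + 1 = do not reschedule (job goes to error state).
pe_error_code() {
    stage=$1        # start_proc, stop_proc, global_prolog, global_epilog, queue_prolog, queue_epilog
    reschedule=$2   # "yes" (the proposed default) or "no"
    case $stage in
        start_proc)    base=110 ;;
        stop_proc)     base=112 ;;
        global_prolog) base=114 ;;
        global_epilog) base=116 ;;
        queue_prolog)  base=118 ;;
        queue_epilog)  base=120 ;;
        *) echo "unknown stage: $stage" >&2; return 2 ;;
    esac
    if [ "$reschedule" = "no" ]; then
        base=$((base + 1))
    fi
    echo "$base"
}
```

For instance, `pe_error_code stop_proc no` would print 113, i.e. a stop_proc failure that puts the job rather than the node in error state.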

