Opened 14 years ago

Last modified 9 years ago

#273 new defect

IZ1790: shepherd does not wait for terminate_method to complete

Reported by: charpold
Owned by:
Priority: normal
Milestone:
Component: sge
Version: 6.0u3
Severity:
Keywords: Sun execution
Cc:

Description

[Imported from gridengine issuezilla http://gridengine.sunsource.net/issues/show_bug.cgi?id=1790]

        Issue #:      1790             Platform:     Sun      Reporter: charpold (charpold)
       Component:     gridengine          OS:        All
     Subcomponent:    execution        Version:      6.0u3       CC:    None defined
        Status:       NEW              Priority:     P3
      Resolution:                     Issue type:    DEFECT
                                   Target milestone: ---
      Assigned to:    pollinger (pollinger)
      QA Contact:     pollinger
          URL:
       * Summary:     shepherd does not wait for terminate_method to complete
   Status whiteboard:
      Attachments:

     Issue 1790 blocks:
   Votes for issue 1790:  10


   Opened: Mon Sep 12 09:14:00 -0700 2005 
------------------------


We attempted to implement a user-specified signal on job termination, using the following script:

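        iad2mgt02% more user_sig.sh
        #!/bin/sh
        # terminate_method: send the user-chosen signal to the job's process
        # group, wait for a grace period, then SIGKILL whatever is left.
        # $1 is the job's PID, passed in as $job_pid by the queue config.
        PATH=/bin:/usr/bin:/sbin

        if [ -n "$SG_NOTIFY_SIGNAL" ] ; then
          # Notify the whole process group (note the leading "-" on the PID)
          kill -"$SG_NOTIFY_SIGNAL" -"$1"
          if [ -n "$SG_NOTIFY_SLEEP_TIME" ] && [ "$SG_NOTIFY_SLEEP_TIME" -gt 0 ] ; then
            sleep "$SG_NOTIFY_SLEEP_TIME"
          else
            # No positive grace period given: fall back to 10 seconds
            sleep 10
          fi
        fi
        # Force-kill anything still running in the job's process group
        kill -9 -"$1"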
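The nested conditionals in user_sig.sh come down to one decision: how long to wait between the notification signal and the final SIGKILL. The sketch below isolates that decision as a standalone function and assumes the same SG_NOTIFY_SLEEP_TIME variable. The name grace_period is hypothetical and is not part of Grid Engine or the original script.

```sh
#!/bin/sh
# Hypothetical helper (not part of Grid Engine): reproduces the grace-period
# choice made by user_sig.sh. Prints SG_NOTIFY_SLEEP_TIME when it is a
# positive integer, otherwise the script's fallback of 10 seconds.
grace_period() {
  case "$SG_NOTIFY_SLEEP_TIME" in
    ''|*[!0-9]*)
      # unset, empty, negative, or not a plain integer: use the default
      echo 10
      ;;
    *)
      if [ "$SG_NOTIFY_SLEEP_TIME" -gt 0 ]; then
        echo "$SG_NOTIFY_SLEEP_TIME"
      else
        # zero also falls back to the default
        echo 10
      fi
      ;;
  esac
}

SG_NOTIFY_SLEEP_TIME=30; grace_period     # prints 30
SG_NOTIFY_SLEEP_TIME=0; grace_period      # prints 10
unset SG_NOTIFY_SLEEP_TIME; grace_period  # prints 10
```

This grace period is how long the terminate method keeps running after it sends the notification signal.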

Then we ran qconf -mq all.q and set
        terminate_method=/home/sgeadmin/n1ge60/user_sig.sh $job_pid

Unfortunately, when we qdel a job, Grid Engine immediately reports it as
completed, so our clean-up process can start at any time. As a result, the
job output may be packed into a tarball and sent to the user well before
$SG_NOTIFY_SLEEP_TIME seconds have passed, and therefore before any activity
triggered by the signal has finished.

Is there a way to get Grid Engine to leave the job in the "dr" state until
the terminate_method script has completed? This should be handled the same
way as the migration and checkpoint methods.
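The request above comes down to the difference between starting the terminate method and waiting for it to exit before the job is reported as finished. The shell sketch below only illustrates those semantics. It is not how the shepherd (a C program) is implemented. fake_terminate stands in for user_sig.sh, and it, the marker file, and the two report_* functions are all hypothetical.

```sh
#!/bin/sh
# Hypothetical illustration (not Grid Engine source): contrasts launching a
# terminate method without waiting for it against waiting for it to exit.
marker=${TMPDIR:-/tmp}/terminate_done.$$

# Stand-in for user_sig.sh: "works" for 1 second, then leaves a marker file
fake_terminate() {
  sleep 1
  : > "$marker"   # signal-triggered activity has finished
}

# Behaviour described in the report: start the method, report at once
report_without_wait() {
  rm -f "$marker"
  fake_terminate &
  if [ -f "$marker" ]; then echo "done"; else echo "reported early"; fi
  wait            # only so the demo does not leave a stray background job
}

# Requested behaviour: hold the job in "dr" until the method exits
report_after_wait() {
  rm -f "$marker"
  fake_terminate &
  wait "$!"       # block until the terminate method completes
  if [ -f "$marker" ]; then echo "done"; else echo "reported early"; fi
}

report_without_wait   # checks the marker while fake_terminate is still sleeping
report_after_wait     # prints "done"
rm -f "$marker"
```

In the second case, the report is delayed by the full run time of the terminate method. With user_sig.sh, that run time includes the SG_NOTIFY_SLEEP_TIME grace period.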

   ------- Additional comments from charpold Mon Sep 12 09:16:51 -0700 2005 -------
Fixed misspelling.
