Opened 17 years ago

Last modified 7 years ago

#14 new patch

IZ146: Failed migrate command leaves job running

Reported by: mjagdis
Owned by:
Priority: normal
Milestone:
Component: sge
Version: current
Severity: minor
Keywords: execution


[Imported from gridengine issuezilla]

   Issue #:           146
   Platform:          All
   Reporter:          mjagdis (Mike Jagdis)
   Component:         gridengine
   OS:                All
   Subcomponent:      execution
   Version:           current
   CC:                None defined
   Status:            STARTED
   Priority:          P3
   Resolution:
   Issue type:        PATCH
   Target milestone:  not determined
   Assigned to:       ernst (ernst)
   QA Contact:        pollinger
   Summary:           Failed migrate command leaves job running
   Status whiteboard:
                      Date/filename:                    Description:                                 Submitted by:
                      Thu Feb 14 05:08:00 -0700 2002: x Shepherd fix for failed migrate (text/plain) Mike Jagdis

     Issue 146 blocks:
   Votes for issue 146:

   Opened: Thu Feb 14 05:08:00 -0700 2002 

If a migrate command fails, qmaster thinks the job is suspending, but the shepherd
has just given up and left the job running. The shepherd thinks the job is in
the middle of a checkpoint and thus won't send periodic checkpoint signals and
won't take any action if the qmaster sends further migrate signals.

The attached patch clears the checkpoint flag on exit of the migration command
so that future checkpoints and migrates will be attempted, and sends a SIGSTOP to
the (hopefully nonexistent!) job in the hope that if it is still around it can
at least be suspended.

I'd note that application type checkpoint migrates are prone to failure because
they rely on the job author doing The Right Thing :-(
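
For readers without the attachment to hand, the behaviour described above could be sketched roughly as follows. This is not the attached patch: ckpt_in_progress, handle_migrate_exit, signal_fn and record_signal are hypothetical names, and the sketch assumes the SIGSTOP is sent only when the migrate command exits non-zero, which the description does not state explicitly.

```c
#include <assert.h>
#include <signal.h>
#include <sys/types.h>

/* Hypothetical stand-in for the shepherd's "checkpoint in progress" flag. */
static int ckpt_in_progress = 0;

/* Signal sender with the same shape as kill(2), injectable for dry runs. */
typedef int (*signal_fn)(pid_t pid, int sig);

/* Dry-run sender: records the request instead of signalling anything. */
static pid_t last_pid = 0;
static int last_sig = 0;

static int record_signal(pid_t pid, int sig)
{
    last_pid = pid;
    last_sig = sig;
    return 0;
}

/*
 * Called once the migrate command has been reaped.
 * exit_status is the already decoded exit code (i.e. WEXITSTATUS(status)).
 * Returns 0 if the migrate command succeeded, -1 otherwise.
 */
static int handle_migrate_exit(int exit_status, pid_t job_pid,
                               signal_fn send_signal)
{
    assert(job_pid > 0);

    /* Always clear the flag so later checkpoints and migrates are attempted. */
    ckpt_in_progress = 0;

    if (exit_status == 0)
        return 0;

    /* Assumption: stop the job only on failure. A negative pid addresses
     * the whole process group, as with kill(2). */
    (void)send_signal(-job_pid, SIGSTOP);
    return -1;
}
```

Passing kill itself as send_signal would deliver the signal for real; record_signal only notes what would have been sent.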

   ------- Additional comments from Mike Jagdis Thu Feb 14 05:08:57 -0700 2002 -------
Created an attachment (id=4)
Shepherd fix for failed migrate

   ------- Additional comments from ernst Wed Mar 6 01:43:31 -0700 2002 -------
I will fix the problem.

   ------- Additional comments from ernst Wed Mar 13 06:05:35 -0700 2002 -------

Concerning this issue I made the following changes:

   - improved logging
   - shepherd internal state will be reset. Further migrate requests
     will be initiated

There are still some things to do (see mail from Andreas below).


Date: Fri, 15 Feb 2002 17:32:04 +0100 (MET)
From: Andreas Haas <Andreas.Haas@Sun.COM>
Subject: [GE dev] migrate command returns non-zero exit status

The patch indeed addresses a not yet adequately handled error condition
which results from a misconfiguration (see Mike's explanation below).
I'm curious, however, whether sending a SIGSTOP to the job following a failed
migrate command is the best response in that case, as it simply freezes the
job. The problem for me is that an error has occurred and is also detected,
but then the job remains in the queue and no error notification is sent
(administrator mail/user abort mail).

Releasing the error condition by enforcing the job's termination

   shepherd_signal_job(-pid, SIGKILL);

and then shutting down the shepherd, leaving a meaningful message in
the error file, appears to be more appropriate.

There are two variations in shutting down the shepherd:

(1) The orderly manner is initiated with

   shepherd_state = SSTATE_???;  /* new indicator about misconfigured ckpt environment */
   sprintf(err_str, "migrate command exited with non-zero exit status %d", WEXITSTATUS(status));
   shepherd_error_impl(err_str, 0);

   and then the code must somehow return from wait_my_child() after the job has
   been reaped to allow for execution of pe_stop/epilog.

(2) The less orderly manner

   shepherd_state = SSTATE_???; /* new indicator about misconfigured ckpt environment */
   sprintf(err_str, "migrate command exited with non-zero exit status %d", WEXITSTATUS(status));

   simply exits shepherd with the consequence that pe_stop/epilog can't be executed.

I'm still puzzling over which preliminary decision must be taken in execd based
on the error indicator left behind in the shepherd_state variable. The question
is: What must happen to the job and the queue (queue/job error state/no error
state, job rerun/no rerun)?
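
As a rough illustration of the kill-and-report alternative discussed in the mail above, the step common to both variations might look like the sketch below. It is an assumption-laden sketch, not shepherd code: kill_job_after_failed_migrate, signal_fn and record_kill are hypothetical names, snprintf is swapped in for the sprintf shown in the mail so the message cannot overrun the buffer, and setting shepherd_state is left out because the SSTATE value is still undecided.

```c
#include <assert.h>
#include <signal.h>
#include <stdio.h>
#include <sys/types.h>

/* Signal sender with the same shape as kill(2), injectable for dry runs. */
typedef int (*signal_fn)(pid_t pid, int sig);

/* Dry-run sender: records the request instead of signalling anything. */
static pid_t rec_pid = 0;
static int rec_sig = 0;

static int record_kill(pid_t pid, int sig)
{
    rec_pid = pid;
    rec_sig = sig;
    return 0;
}

/*
 * Terminates the job's process group and writes the error text into err_str.
 * exit_status is the decoded exit code of the migrate command.
 * Returns the message length, or -1 if the message did not fit.
 */
static int kill_job_after_failed_migrate(int exit_status, pid_t job_pid,
                                         signal_fn send_signal,
                                         char *err_str, size_t err_len)
{
    int n;

    assert(job_pid > 0 && err_str != NULL && err_len > 0);

    /* Negative pid: signal the whole process group, as with kill(2). */
    (void)send_signal(-job_pid, SIGKILL);

    n = snprintf(err_str, err_len,
                 "migrate command exited with non-zero exit status %d",
                 exit_status);
    if (n < 0 || (size_t)n >= err_len)
        return -1; /* output error or truncated message */
    return n;
}
```

What happens afterwards (returning from wait_my_child() so pe_stop/epilog can run, or exiting at once) is exactly the open question in the mail and is deliberately not modelled here.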

   ------- Additional comments from andreas Thu Jun 16 07:23:15 -0700 2005 -------
Changed to execution.

   ------- Additional comments from roland Mon Jun 20 06:30:43 -0700 2005 -------
chkconfig needs a line like this on top of the startup scripts:
# chkconfig: 35 91 02

   ------- Additional comments from roland Mon Jun 20 06:34:28 -0700 2005 -------
Sorry, the previous comment is not related to this issue.

Attachments (1)

4 (905 bytes) - added by dlove 9 years ago.


Change History (2)

Changed 9 years ago by dlove

  • Attachment 4 added

comment:1 Changed 7 years ago by dlove

  • Severity set to minor

EB-2002-03-13-0 says fixed partially
