Opened 18 years ago

Last modified 8 years ago

#4 new enhancement

IZ61: Enhancements for ckpt/reschedule facility

Reported by: ernst
Owned by:
Priority: normal
Milestone:
Component: sge
Version: current
Severity:
Keywords: qmaster
Cc:

Description

[Imported from gridengine issuezilla http://gridengine.sunsource.net/issues/show_bug.cgi?id=61]

          Issue #: 61
         Platform: All
         Reporter: ernst (ernst)
        Component: gridengine
               OS: All
     Subcomponent: qmaster
          Version: current
               CC: None defined
           Status: NEW
         Priority: P3
       Resolution:
       Issue type: ENHANCEMENT
 Target milestone: ---
      Assigned to: andreas (andreas)
       QA Contact: ernst
              URL:
          Summary: Enhancements for ckpt/reschedule facility
Status whiteboard:
      Attachments:

     Issue 61 blocks:
   Votes for issue 61:


   Opened: Mon Sep 17 05:01:00 -0700 2001 
------------------------


According to the discussion with Martin Klook and Ron Chen
(users@gridengine.sunsource.net; subject: reschedule facility),
the following additional information would be helpful for
checkpointing and (automatic) rescheduling:

- Number of rescheduling/checkpointing events
- Host where the job was first/previously executed
- In case of rescheduling: for ckpt-jobs the restart_command should be
  executed if ckpt_command was executed previously.

The RESTARTED environment variable which is set in the job environment
could provide the number of events.

FIRST_HOST and LAST_HOST may be set accordingly. We have to make sure
that the job was able to execute the ckpt_command successfully
before we expose a hostname through one of these variables.
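
From the job's side, reading these variables could look roughly like the
sketch below. It assumes RESTARTED carries an event count as proposed above
and that FIRST_HOST and LAST_HOST exist, which is only a suggestion in this
issue. The names RestartInfo and read_restart_info are hypothetical.

```python
import os
from typing import Mapping, NamedTuple, Optional


class RestartInfo(NamedTuple):
    restarts: int               # number of reschedule/checkpoint events
    first_host: Optional[str]   # host of the first execution, if exposed
    last_host: Optional[str]    # host of the previous execution, if exposed


def read_restart_info(env: Mapping[str, str] = os.environ) -> RestartInfo:
    """Read the proposed variables; missing or malformed values fall back to defaults."""
    try:
        restarts = int(env.get("RESTARTED", "0"))
    except ValueError:
        restarts = 0
    return RestartInfo(
        restarts=max(restarts, 0),
        # An unset or empty variable means no hostname was exposed.
        first_host=env.get("FIRST_HOST") or None,
        last_host=env.get("LAST_HOST") or None,
    )
```

A job script could then branch on `read_restart_info().restarts` to decide
whether it is starting fresh or resuming.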

The restart_command can only be executed in case of rescheduling, when
the master knows that ckpt_command was successfully executed. We
have to transfer this information (shepherd -> execd -> qmaster) during
the runtime of the job.
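
The bookkeeping described above might be modelled as in the following
sketch. It is not taken from the qmaster sources: CkptRecord, report_ckpt
and use_restart_command are hypothetical names, an exit status of 0 is
assumed to mean that ckpt_command succeeded, and a later failed checkpoint
is assumed not to invalidate an earlier successful one.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CkptRecord:
    """Per-job state the master would keep (hypothetical)."""
    ckpt_succeeded: bool = False
    first_host: Optional[str] = None
    last_host: Optional[str] = None


def report_ckpt(record: CkptRecord, host: str, exit_status: int) -> None:
    """Apply one ckpt_command result forwarded shepherd -> execd -> qmaster."""
    if exit_status != 0:
        return  # failed checkpoint: do not expose a hostname
    record.ckpt_succeeded = True
    if record.first_host is None:
        record.first_host = host
    record.last_host = host


def use_restart_command(record: CkptRecord) -> bool:
    """On rescheduling, run restart_command only after a successful checkpoint."""
    return record.ckpt_succeeded
```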

   ------- Additional comments from ernst Thu Jul 11 08:10:13 -0700 2002 -------
The facility defined in Issue #315 might be used to work around the missing
functionality. "qsub/qalter -ac ENV:variable=value" might be used
in the various checkpointing scripts to define variables such as
FIRST_HOST, LAST_HOST or RESTARTED.
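
A checkpointing script could assemble such a call as sketched below. The
"ENV:" prefix is copied from the syntax quoted above and depends on the
facility from Issue #315, so it is an assumption here. build_qalter_args is
a hypothetical helper that only builds the argument list and does not run
qalter.

```python
from typing import List, Mapping


def build_qalter_args(job_id: str, variables: Mapping[str, str]) -> List[str]:
    """Build a qalter argument list with one -ac option per variable."""
    args = ["qalter"]
    for name, value in variables.items():
        # "ENV:" prefix as quoted in this comment (assumed, see Issue #315).
        args += ["-ac", f"ENV:{name}={value}"]
    args.append(job_id)
    return args
```

For example, `build_qalter_args("42", {"LAST_HOST": "node1"})` yields
`["qalter", "-ac", "ENV:LAST_HOST=node1", "42"]`.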

   ------- Additional comments from sgrell Mon Dec 12 03:14:46 -0700 2005 -------
Changed subcomponent.

Stephan
