Opened 15 years ago

Last modified 9 years ago

#199 new enhancement

IZ1259: Enhancements of the checkpointing interface

Reported by: reuti Owned by:
Priority: normal Milestone:
Component: sge Version: 6.0
Severity: Keywords: qmaster
Cc:

Description

[Imported from gridengine issuezilla http://gridengine.sunsource.net/issues/show_bug.cgi?id=1259]

Issue #:              1259
Summary:              Enhancements of the checkpointing interface
Reporter:             reuti (reuti)
Component:            gridengine
Subcomponent:         qmaster
Version:              6.0
Platform:             Other
OS:                   other
Status:               NEW
Resolution:
Priority:             P3
Issue type:           ENHANCEMENT
Target milestone:     ---
Assigned to:          andreas (andreas)
QA Contact:           ernst
CC:                   None defined
URL:
Status whiteboard:
Attachments:
Issue 1259 blocks:
Votes for issue 1259:


   Opened: Mon Aug 30 12:49:00 -0700 2004 
------------------------


Suppose the cluster has two types of nodes, with a cluster
queue for each type. If a job script discovers during
execution that it needs new or more resources from the
other type of node, a command like this would be useful:

qckpt -m -l "new_resource_request" -q newqueue -n $JOB_ID

(-m for migrate; -n for now: don't run asynchronously,
return only when done). This way the existing
checkpointing setup can be used to migrate the job.

The same command could also be used from the command line
to trigger the creation of a checkpoint for a running job,
without suspending or migrating it at all:

qckpt -c <jobid>

(make a checkpoint of <jobid>).

========================================

Side feature:

A migrate-time limit, analogous to h_rt or h_cpu, could
move the job back to the pending state when it is reached,
so that the job has to wait again.

This way you can limit the CPU time to e.g. 24 hours, and
when a user occasionally needs more time, the job is
checkpointed and requeued for the next 24-hour time slot,
but has to wait again.

   ------- Additional comments from reuti Tue Aug 31 11:07:29 -0700 2004 -------
Although I can solve the issue for now with:

qalter -v STEP=COPY -q new_queue_request $JOB_ID
qmod -s $JOB_ID

and don't need any checkpoint command, the exec host must then also be a
submit host. As in the discussion about qdel inside a PE prolog, maybe a
small subset of commands should also work on an exec host that is not a
submit host, limited to the job's own ID. The job ID could then be
omitted, because it is already known. For qdel the name could be qabort.

qmigr -v STEP=COPY -q new_queue_request

could replace the sequence given above. A qckpt <jobid> command-line
tool would still be useful to trigger the creation of a checkpoint file
on demand (instead of at a fixed time interval).
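
For illustration, the two-step workaround above could be wrapped in a small shell function until something like qmigr exists. This is only a sketch: the name qmigr_sketch and the DRY_RUN switch are made up here and are not part of Grid Engine, and the host running it must still be a submit host. By default it only prints the commands it would run.

```shell
#!/bin/sh
# Hypothetical helper, not part of Grid Engine: wraps the qalter/qmod
# workaround from this comment into one call.
# DRY_RUN=1 (the default in this sketch) prints the commands instead of
# running them; set DRY_RUN=0 on a submit host to execute them.
qmigr_sketch() {
    new_queue=$1
    job_id=${2:-$JOB_ID}    # inside a job script, JOB_ID is set by Grid Engine

    if [ -z "$new_queue" ] || [ -z "$job_id" ]; then
        echo "usage: qmigr_sketch <queue> [job_id]" >&2
        return 1
    fi

    if [ "${DRY_RUN:-1}" = "1" ]; then
        run=echo
    else
        run=
    fi

    # Change the environment and the queue request first, then suspend the
    # job so that the checkpointing environment can migrate it.
    $run qalter -v STEP=COPY -q "$new_queue" "$job_id" &&
    $run qmod -s "$job_id"
}
```

Called as `qmigr_sketch new_queue_request` inside a job script, it picks up the job ID from $JOB_ID, which is the convenience the proposed qmigr would offer natively.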

   ------- Additional comments from reuti Sun Sep 5 11:41:53 -0700 2004 -------
The "x" value of the "when" entry in the definition of a checkpointing
interface should be split into q(ueue) and j(ob). This way you can ignore the
short suspension of a queue, but still have the ability to migrate the job by
hand with `qmod -sj`.
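
The intended behaviour can be sketched as a small decision function. This only models the proposal: the "q" and "j" flags do not exist, the event names queue_suspend and job_suspend are made up for this sketch, and "x" is assumed to react to both kinds of suspension, as described above.

```shell
#!/bin/sh
# Sketch of the proposed semantics only; "q" and "j" are the suggested new
# flags, not existing ones.
#   $1: "when" string of the checkpointing environment
#   $2: event, either queue_suspend or job_suspend (hypothetical names)
# Returns 0 if the event should trigger a migration, 1 otherwise.
migrates_on() {
    when=$1
    event=$2
    case "$event:$when" in
        queue_suspend:*[xq]*) return 0 ;;   # x (current) or q (proposed)
        job_suspend:*[xj]*)   return 0 ;;   # x (current) or j (proposed)
    esac
    return 1
}
```

With a "when" of "j", `migrates_on j queue_suspend` fails and `migrates_on j job_suspend` succeeds, so a short queue suspension would be ignored while a manual `qmod -sj` would still migrate the job.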

   ------- Additional comments from sgrell Mon Dec 12 02:44:09 -0700 2005 -------
Changed subcomponent.

Stephan
