Opened 13 years ago

Last modified 8 years ago

#363 new defect

IZ2068: checkpointing jobs will be rescheduled forever

Reported by: joga
Owned by:
Priority: normal
Milestone:
Component: sge
Version: current
Severity:
Keywords: execution
Cc:

Description

[Imported from gridengine issuezilla http://gridengine.sunsource.net/issues/show_bug.cgi?id=2068]

Issue #: 2068
Platform: All
Reporter: joga (joga)
Component: gridengine
OS: All
Subcomponent: execution
Version: current
CC: None defined
Status: NEW
Priority: P3
Resolution:
Issue type: DEFECT
Target milestone: ---
Assigned to: pollinger (pollinger)
QA Contact: pollinger
URL:
Summary: checkpointing jobs will be rescheduled forever
Status whiteboard:
Attachments:
Issue 2068 blocks:
Votes for issue 2068:


   Opened: Thu Jun 1 01:57:00 -0700 2006 
------------------------


In a maintrunk system (V60s2_BRANCH is not affected):

If I submit a job and specify a checkpointing environment, the job runs to
completion but is then rescheduled.
It then runs to completion a second time, exits, is rescheduled again, and so on.

Example:
qsub -ckpt testcheckpointobject $SGE_ROOT/examples/jobs/sleeper.sh

qstat shows the job as running; once it finishes, the job appears in state Rq, and so on.

% qconf -sckpt testcheckpointobject
ckpt_name          testcheckpointobject
interface          userdefined
ckpt_command       none
migr_command       none
restart_command    none
clean_command      none
ckpt_dir           /tmp
signal             none
when               xs

Tail of the execd messages file:
05/31/2006 15:24:05|execd|oin|I|using "0" for auto_user_oticket
05/31/2006 15:24:05|execd|oin|I|using "0" for auto_user_fshare
05/31/2006 15:24:05|execd|oin|I|using "none" for auto_user_default_project
05/31/2006 15:24:05|execd|oin|I|using "86400" for auto_user_delete_time
05/31/2006 15:24:05|execd|oin|I|using "false" for delegated_file_staging
06/01/2006 10:04:32|execd|oin|I|PTF_MAX_PRIORITY=0, PTF_MIN_PRIORITY=20
06/01/2006 10:05:35|execd|oin|E|shepherd of job 1981.1 exited with exit status = 11
06/01/2006 10:07:52|execd|oin|E|shepherd of job 1981.1 exited with exit status = 11
06/01/2006 10:30:55|execd|oin|E|shepherd of job 1982.1 exited with exit status = 11
06/01/2006 10:39:32|execd|oin|E|shepherd of job 1982.1 exited with exit status = 11

Tail of the qmaster messages file:
06/01/2006 10:35:13|qmaster|oin|W|job 1982.1 failed on host nori migrating
because: 06/01/2006 10:35:13 [115090:1911]: cant close file checkpointed: No
such file or directory
06/01/2006 10:35:13|qmaster|oin|W|rescheduling job 1982.1
06/01/2006 10:37:22|qmaster|oin|W|job 1982.1 failed on host gimli migrating
because: 06/01/2006 10:37:21 [115090:16839]: cant close file checkpointed: No
such file or directory
06/01/2006 10:37:22|qmaster|oin|W|rescheduling job 1982.1
06/01/2006 10:39:32|qmaster|oin|W|job 1982.1 failed on host oin migrating
because: 06/01/2006 10:39:32 [115090:1108]: cant close file checkpointed: No
such file or directory
06/01/2006 10:39:32|qmaster|oin|W|rescheduling job 1982.1
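
To spot the loop without watching qstat, the "rescheduling job" warnings in the qmaster messages file can be counted per job. The following is only a minimal sketch: it assumes the line layout shown in the excerpt above, and the names count_reschedules and suspicious_jobs, as well as the threshold of 3, are hypothetical choices made for illustration, not part of Grid Engine.

```python
import re
from collections import Counter

# Matches qmaster warnings of the form seen in the excerpt above, e.g.
#   06/01/2006 10:35:13|qmaster|oin|W|rescheduling job 1982.1
# The layout is assumed from that excerpt only.
RESCHEDULE_RE = re.compile(
    r"\|qmaster\|[^|]*\|W\|rescheduling job (\d+\.\d+)\s*$"
)


def count_reschedules(lines):
    """Return a Counter mapping 'job.task' to its number of reschedule warnings."""
    counts = Counter()
    for line in lines:
        match = RESCHEDULE_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


def suspicious_jobs(counts, threshold=3):
    """Return the sorted job ids rescheduled at least `threshold` times.

    The default threshold is an arbitrary, hypothetical cut-off.
    """
    return sorted(job for job, n in counts.items() if n >= threshold)


# Sample input copied from the qmaster messages excerpt in this report.
sample = """\
06/01/2006 10:35:13|qmaster|oin|W|job 1982.1 failed on host nori migrating
because: 06/01/2006 10:35:13 [115090:1911]: cant close file checkpointed: No
such file or directory
06/01/2006 10:35:13|qmaster|oin|W|rescheduling job 1982.1
06/01/2006 10:37:22|qmaster|oin|W|job 1982.1 failed on host gimli migrating
because: 06/01/2006 10:37:21 [115090:16839]: cant close file checkpointed: No
such file or directory
06/01/2006 10:37:22|qmaster|oin|W|rescheduling job 1982.1
06/01/2006 10:39:32|qmaster|oin|W|job 1982.1 failed on host oin migrating
because: 06/01/2006 10:39:32 [115090:1108]: cant close file checkpointed: No
such file or directory
06/01/2006 10:39:32|qmaster|oin|W|rescheduling job 1982.1
"""

counts = count_reschedules(sample.splitlines())
print(dict(counts))
print(suspicious_jobs(counts))
```

Against a live system, the same function could be fed the lines of the qmaster messages file instead of the sample string; the location of that file depends on the installation.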
