[GE users] Checkpoint and resubmit
Taylor, David (D.)
dtaylo56 at ford.com
Thu Jun 17 10:42:40 BST 2004
I am trying to implement checkpointing and resubmission on queue
suspension for one of the applications we use.
I have worked out the steps that need to be taken to do this:
1) Modify one of the files of the running job so that the job halts
and writes out a restart file
2) Modify the main run file to use the restart file
3) Get the restart file and the modified run file to the new run host
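As a rough illustration of the three steps above (this is not the migration script referred to in this post), they could be sketched in Python as follows. Everything application-specific here is an assumption made up for the example: the HALT_AND_WRITE_RESTART directive, the RESTART_FROM= key in the run file, and the idea that the new run host can read a shared staging directory.

```python
import shutil
import time
from pathlib import Path


def request_halt(control_file: Path) -> None:
    """Step 1: append a halt directive to a file the running job is assumed to poll.

    The directive text is hypothetical; a real application defines its own.
    """
    with control_file.open("a") as fh:
        fh.write("HALT_AND_WRITE_RESTART\n")


def wait_for_restart_file(restart_file: Path, timeout_s: float = 60.0,
                          poll_s: float = 0.5) -> bool:
    """Poll until the application has written its restart file, or time out."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if restart_file.exists():
            return True
        time.sleep(poll_s)
    return restart_file.exists()


def enable_restart(run_file: Path, restart_file: Path) -> None:
    """Step 2: rewrite the main run file so the next run starts from the restart file.

    RESTART_FROM= is a hypothetical key; any earlier setting is replaced.
    """
    lines = [ln for ln in run_file.read_text().splitlines()
             if not ln.startswith("RESTART_FROM=")]
    lines.append(f"RESTART_FROM={restart_file.name}")
    run_file.write_text("\n".join(lines) + "\n")


def stage_files(files, dest_dir: Path) -> list:
    """Step 3: copy the files to a directory the new run host is assumed to see."""
    dest_dir.mkdir(parents=True, exist_ok=True)
    return [Path(shutil.copy2(f, dest_dir)) for f in files]
```

The timeout in wait_for_restart_file matters in practice: if the scheduler only allows the migration step a limited time before killing the job, the wait has to finish well inside that window.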
I have written a simple migration script to do these tasks, but the
migration fails with the error below. The failure seems to occur
before the migrate script is even executed.
Job 4474 (BL_Test.job) Migrates Exit Status=137 Signal=KILL
failed migrating because:
job 4474.1 died through signal KILL (9)
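For reference, the exit status of 137 in the message above is consistent with the common POSIX shell convention of reporting 128 plus the signal number for a process killed by a signal, so 137 - 128 = 9, which is KILL. The helper below is a minimal sketch of that arithmetic; the function name is made up for this example, and it only decodes the number. It says nothing about what sent the signal.

```python
import signal


def describe_exit_status(status: int) -> str:
    """Decode an exit status using the 128 + signal-number convention."""
    if status > 128:
        signum = status - 128
        try:
            name = signal.Signals(signum).name
        except ValueError:
            # Number does not correspond to a signal known on this platform.
            name = "unknown signal"
        return f"terminated by signal {signum} ({name})"
    return f"exited with status {status}"


print(describe_exit_status(137))
```

On a POSIX system this reports signal 9 (SIGKILL) for the status shown in the error.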
I also have a couple of questions:
Should these tasks all be done in the migration script, or should some
of them be done in the job script?
When a job is restarted does the resubmitted job get the job files from
the original location or are the files carried over from the previous