[GE users] Checkpoint and resubmit

Taylor, David (D.) dtaylo56 at ford.com
Thu Jun 17 10:42:40 BST 2004


Hello
 
I am trying to implement checkpointing and resubmission when the queue is
suspended, for one of the applications we use.
 
I have worked out the steps that need to be taken to complete this:
 
1) Modify one of the files of the running job so that the job halts and
writes out a restart file.
2) Modify the main run file to use the restart file.
3) Get the restart file and the modified run file to the new run host.
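
As a rough illustration only, the three steps above could be sketched in
Python as follows. This is not the actual migration script. The control
file with a STOP keyword, the RESTART=no line in the run file, and the
shared staging directory are all hypothetical placeholders, since the real
application's file formats are not described here.

```python
import shutil
import time
from pathlib import Path


def request_halt(control_file: Path) -> None:
    """Step 1: ask the running job to halt and write a restart file.

    Assumption: the application polls a control file for a STOP keyword.
    """
    control_file.write_text("STOP\n")


def wait_for_restart(restart_file: Path, timeout_s: float = 60.0,
                     poll_s: float = 0.5) -> bool:
    """Poll until the restart file appears; return False on timeout."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if restart_file.exists():
            return True
        time.sleep(poll_s)
    return restart_file.exists()


def enable_restart(run_file: Path) -> None:
    """Step 2: switch the main run file over to use the restart file.

    Assumption: the run file contains a line 'RESTART=no'.
    """
    text = run_file.read_text()
    run_file.write_text(text.replace("RESTART=no", "RESTART=yes"))


def stage_files(files, shared_dir: Path) -> None:
    """Step 3: copy files to a directory the new run host can see.

    Assumption: shared_dir is on a file system mounted on every host.
    """
    shared_dir.mkdir(parents=True, exist_ok=True)
    for f in files:
        shutil.copy2(f, shared_dir / Path(f).name)
```

Keeping each step in its own function makes it easier to see which step
fails, and to move a step from the migration script into the job script
if that turns out to be the better place for it.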
 
I have written a simple migration script to do these tasks, but the
migration fails with the following error.
 
The error seems to occur before the migration script is even executed.
 
Job 4474 (BL_Test.job) Migrates Exit Status=137 Signal=KILL

failed migrating because:

job 4474.1 died through signal KILL (9)
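
The exit status of 137 is consistent with the common convention of
reporting 128 plus the signal number, here 128 + 9 for KILL. A small
Python sketch of that arithmetic, assuming a POSIX system and assuming
this convention applies to the status shown above:

```python
import signal


def decode_exit_status(status: int) -> str:
    """Describe an exit status using the common 128+N convention."""
    if status > 128:
        signum = status - 128
        try:
            name = signal.Signals(signum).name
        except ValueError:
            # Signal number not known on this platform.
            name = "unknown signal"
        return f"killed by signal {signum} ({name})"
    return f"exited normally with code {status}"


print(decode_exit_status(137))
```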

I also have a couple of questions:

Should these tasks all be done in the migration script, or should some of
them be done in the job script?

When a job is restarted, does the resubmitted job get the job files from
the original location, or are the files carried over from the previous
job?

Regards,
David Taylor 




