[GE users] integrate BLCR with SGE62u5
ashley.macon at colorado.edu
Mon Nov 8 23:45:44 GMT 2010
I set up BLCR with SGE 6.2u5 successfully. Note that what I describe is for non-MPI jobs. BLCR can also be made to work with OpenMPI, but I have not yet done that.
I used these resources as guides:
For our implementation, I first created two scripts which are quite similar to those found in Appendix A of the "N1GE6 Checkpointing and Berkeley Lab Checkpoint/Restart" document:
Then I set up my checkpoint configuration:
$ qconf -sckpt blcr
ckpt_command /usr/local/bin/blcr_checkpoint $job_id $job_pid $ckpt_dir
migr_command /usr/local/bin/blcr_migrate $job_id $job_pid $ckpt_dir
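For readers without the PDF at hand, here is a minimal sketch of what a script like blcr_checkpoint might look like. It is not the script from Appendix A or the one I deployed. The per-job directory layout (ckpt.<job_id>/context) and the CR_CHECKPOINT override variable are assumptions made for illustration. cr_checkpoint with -f and --tree is the standard BLCR command for writing a process tree to a context file.

```shell
#!/bin/sh
# Hypothetical sketch of /usr/local/bin/blcr_checkpoint (not the original script).
# Arguments follow the ckpt_command line: $job_id $job_pid $ckpt_dir.

# CR_CHECKPOINT is an assumed override hook so the logic can be dry-run
# on a host without BLCR installed (e.g. CR_CHECKPOINT=echo).
CR_CHECKPOINT=${CR_CHECKPOINT:-cr_checkpoint}

blcr_checkpoint() {
    job_id=$1
    job_pid=$2
    ckpt_dir=$3

    # Refuse to run with missing arguments rather than write to "/".
    [ -n "$job_id" ] && [ -n "$job_pid" ] && [ -n "$ckpt_dir" ] || return 2

    # Assumed layout: one subdirectory per job under the checkpoint directory.
    job_dir="$ckpt_dir/ckpt.$job_id"
    mkdir -p "$job_dir" || return 1

    # Checkpoint the job's whole process tree into a single context file.
    "$CR_CHECKPOINT" -f "$job_dir/context" --tree "$job_pid"
}

# Run only when called with the three arguments SGE passes in.
if [ $# -ge 3 ]; then
    blcr_checkpoint "$@"
fi
```

A blcr_migrate script would presumably do the same and then terminate the job so that SGE can re-queue it; check the Appendix A scripts for the exact behavior.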
Finally, I created a general submission script named "checkpoint", which again is similar to the submission script in the PDF. The key logic in this script is this test:
if [ $RESTARTED -ne "0" -a -e "$ckpt_tmpdir" ]; then
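To show how that test is used, here is a rough sketch of the branch around it. This is an illustration, not the actual "checkpoint" script. The function name run_or_restart, the context file location ($ckpt_tmpdir/context), and the CR_RUN / CR_RESTART override variables are all assumptions. RESTARTED is set by SGE in the job environment, and cr_run / cr_restart are the standard BLCR commands.

```shell
#!/bin/sh
# Hypothetical sketch of the run-or-restart branch (not the original script).

# Assumed override hooks so the branch can be exercised without BLCR.
CR_RUN=${CR_RUN:-cr_run}
CR_RESTART=${CR_RESTART:-cr_restart}

run_or_restart() {
    # First argument: directory that holds the saved context (assumed layout).
    ckpt_tmpdir=$1
    shift

    if [ "${RESTARTED:-0}" -ne 0 ] && [ -e "$ckpt_tmpdir" ]; then
        # SGE restarted the job and a checkpoint exists: resume from it.
        "$CR_RESTART" "$ckpt_tmpdir/context"
    else
        # First run (or nothing saved yet): start the program under BLCR
        # so that it can be checkpointed later.
        "$CR_RUN" "$@"
    fi
}
```

The point of the branch is that the same script handles both the first start and every later restart, which is why jobs have to go through it rather than being submitted directly.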
Users can submit a checkpoint job with something like:
$ qsub -ckpt blcr -b y /usr/local/bin/checkpoint <prog> <prog_args>
Lastly, you may need to adjust min_cpu_interval in your queue configuration(s), as this determines how often checkpoints normally occur. For testing, I set it low (every 1 minute), then changed it to what our users needed (every 4 hours).
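As a sketch of how the interval might be changed from the command line: the helper name set_ckpt_interval, the QCONF override variable, and the queue name all.q are assumptions for illustration. qconf -mattr is the standard way to modify a single queue attribute, and min_cpu_interval takes an hh:mm:ss value.

```shell
#!/bin/sh
# Hypothetical helper for changing min_cpu_interval on one queue.

# QCONF is an assumed override hook for dry runs (e.g. QCONF=echo).
QCONF=${QCONF:-qconf}

set_ckpt_interval() {
    queue=$1      # e.g. all.q (assumed queue name)
    interval=$2   # hh:mm:ss, e.g. 00:01:00 for testing, 04:00:00 for production
    "$QCONF" -mattr queue min_cpu_interval "$interval" "$queue"
}

# Example calls (commented out so nothing changes by accident):
# set_ckpt_interval all.q 00:01:00
# set_ckpt_interval all.q 04:00:00
```

The current value can be checked with "qconf -sq <queue>" before and after the change.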
In my experience, if jobs were simply suspended (S state) but not re-queued (Rq), it was because the job either was not submitted through the "checkpoint" script or did not request the ckpt environment.
> Hi guys, I read the document on integrating SGE with BLCR, but I couldn't use it. How can I use the 4 scripts in the document, especially the submission script on page 7? I copied the 4 scripts and then configured a checkpoint environment named blcr, but I don't understand how to use the submission script! Without using the submission script, I wrote a script and submitted it with qsub using the options "-ckpt blcr" and "-c smx"; it executes a compiled hell.c. When I suspend that job, the program continues running (as shown by the top command), while the submitted script goes into suspended mode.
> Why? What should I do to use checkpointing and BLCR with SGE?
> I found the BLCR roll at http://ircs.seas.harvard.edu/display/CLUSTERS/BLCR+Roll but I didn't find any link to download it. Has anyone been able to download the BLCR roll?