[GE users] integrate BLCR with SGE62u5

macona ashley.macon at colorado.edu
Mon Nov 8 23:45:44 GMT 2010



I set up BLCR with SGE 6.2u5 successfully. Note that what I describe is for non-MPI jobs. BLCR can also be made to work with OpenMPI, but I have not done that yet.

I used these resources as guides:
http://gridengine.sunsource.net/howto/APSTC-TB-2004-005.pdf

http://gridengine.sunsource.net/howto/checkpointing.html

http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=267527

For our implementation, I first created two scripts, which are quite similar to those found in Appendix A of the "N1GE6 Checkpointing and Berkeley Lab Checkpoint/Restart" document:
blcr_checkpoint
blcr_migrate
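
For readers without the PDF at hand, a checkpoint script along these lines might look roughly like the sketch below. It is not the script from Appendix A. The ckpt.<job_id> directory, the context.<job_id> file name and both function names are assumptions made for illustration; only cr_checkpoint and its -f option come from BLCR. Which PID actually has to be checkpointed (the wrapper's or its child's) depends on how the program is launched, so $job_pid is simply passed through here.

```shell
#!/bin/bash
# Hypothetical sketch of /usr/local/bin/blcr_checkpoint.
# SGE calls it as: blcr_checkpoint $job_id $job_pid $ckpt_dir

# Build the context file path (layout is an assumption, not from the PDF).
ckpt_file_for() {
    local job_id=$1 ckpt_dir=$2
    printf '%s/ckpt.%s/context.%s\n' "$ckpt_dir" "$job_id" "$job_id"
}

do_checkpoint() {
    local job_id=$1 job_pid=$2 ckpt_dir=$3
    local ckptfile
    ckptfile=$(ckpt_file_for "$job_id" "$ckpt_dir")
    mkdir -p "$(dirname "$ckptfile")" || return 1
    # Write the process image to the context file.
    cr_checkpoint -f "$ckptfile" "$job_pid"
}

# Act only when invoked with the three arguments SGE passes.
if [ "$#" -eq 3 ]; then
    do_checkpoint "$@"
fi
```

A blcr_migrate script could follow the same pattern and additionally end the job after the checkpoint has been written, so that SGE can re-queue it.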

Then I set up my checkpoint configuration:
$ qconf -sckpt blcr

ckpt_name          blcr
interface          application-level
ckpt_command       /usr/local/bin/blcr_checkpoint $job_id $job_pid $ckpt_dir
migr_command       /usr/local/bin/blcr_migrate $job_id $job_pid $ckpt_dir
restart_command    none
clean_command      none
ckpt_dir           /data/tmp/checkpoint
signal             none
when               xsmr


Finally, I created a general submission script named "checkpoint", which again is similar to the submission script in the PDF. The key logic in this script is:

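# BASH
# SGE sets RESTARTED to 0 on the first run and to 1 after a restart or migration.
# ckpt_tmpdir and ckptfile are set earlier in the script.
if [ "$RESTARTED" -ne 0 ] && [ -e "$ckpt_tmpdir" ]; then
     # Resume from the saved context file.
     /usr/local/bin/cr_restart "$ckptfile"
else
     # First run: start the program under BLCR, keeping its arguments intact.
     /usr/local/bin/cr_run "$@"
fi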
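
To show where that fragment sits, here is a minimal sketch of a complete wrapper. It is not the script from the PDF. JOB_ID, RESTARTED and SGE_CKPT_DIR are set by SGE in the job environment; the ckpt.<job_id>/context.<job_id> naming and the function names are assumptions and must match whatever your checkpoint script writes.

```shell
#!/bin/bash
# Hypothetical sketch of /usr/local/bin/checkpoint (the submission wrapper).

# Print "restart" if the job was restarted and a checkpoint directory
# exists, otherwise "run".
choose_action() {
    local restarted=$1 ckpt_tmpdir=$2
    if [ "$restarted" -ne 0 ] && [ -e "$ckpt_tmpdir" ]; then
        echo restart
    else
        echo run
    fi
}

main() {
    # Assumed layout: one directory and one context file per job.
    local ckpt_tmpdir="$SGE_CKPT_DIR/ckpt.$JOB_ID"
    local ckptfile="$ckpt_tmpdir/context.$JOB_ID"
    if [ "$(choose_action "${RESTARTED:-0}" "$ckpt_tmpdir")" = restart ]; then
        /usr/local/bin/cr_restart "$ckptfile"
    else
        /usr/local/bin/cr_run "$@"
    fi
}

# Act only inside an SGE job and only when given a program to run.
if [ -n "${JOB_ID:-}" ] && [ "$#" -gt 0 ]; then
    main "$@"
fi
```

Keeping the decision in its own function makes it easy to check outside the cluster, for example "choose_action 1 /tmp" prints "restart" because /tmp exists.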


Users can submit a checkpoint job with something like:

$ qsub -ckpt blcr -b y /usr/local/bin/checkpoint <prog> <prog_args>

Lastly, you may need to adjust min_cpu_interval in your queue configuration(s), as this determines how often periodic checkpoints occur. For testing, I set it low (every 1 minute), then changed it to what our users needed (every 4 hours).
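
As a rough sketch, the interval can be changed from the command line with "qconf -mattr". The two function names and the queue name all.q are assumptions made for illustration; min_cpu_interval takes an HH:MM:SS value.

```shell
#!/bin/bash
# Format a number of minutes as the HH:MM:SS value used by min_cpu_interval.
minutes_to_interval() {
    printf '%02d:%02d:00\n' $(( $1 / 60 )) $(( $1 % 60 ))
}

# Set min_cpu_interval (given in minutes) on one queue.
set_ckpt_interval() {
    local minutes=$1 queue=$2
    qconf -mattr queue min_cpu_interval "$(minutes_to_interval "$minutes")" "$queue"
}

# Examples, to be run on an SGE admin host (queue name is an assumption):
#   set_ckpt_interval 1 all.q      # testing: every minute
#   set_ckpt_interval 240 all.q    # production: every 4 hours
```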

In my experience, if jobs were simply suspended (S state) but not re-queued (Rq), it was because the user had either not used the "checkpoint" script as the job submission script or not requested the ckpt environment.
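
One way to catch the second mistake early is a small guard at the top of the wrapper script. This is only a sketch: the function name is hypothetical, and it relies on SGE setting SGE_CKPT_ENV to the name of the requested checkpoint environment for checkpointing jobs.

```shell
#!/bin/bash
# Warn when the job was not submitted with "-ckpt blcr".
check_ckpt_env() {
    if [ "${SGE_CKPT_ENV:-}" != "blcr" ]; then
        echo "warning: job was not submitted with '-ckpt blcr';" \
             "it may only be suspended, not re-queued" >&2
        return 1
    fi
}
```

Calling "check_ckpt_env || exit 1" before cr_run would make such a job fail immediately instead of running without checkpointing.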

-- 
macona


> Hi guys, I read the document on integrating SGE with BLCR, but I couldn't get it to work. How can I use the 4 scripts in the document, especially the submission script on page 7? I copied the 4 scripts and then configured a checkpoint environment named blcr, but I don't understand how to use the submission script. Without using the submission script, I wrote a script and submitted it with qsub using the options "-ckpt blcr" and "-c smx"; the script executes a compiled hell.c. When I suspend that job, the program continues running (as seen with the top command), while the submitted script goes into the suspended state.
> Why? What should I do to use checkpointing and BLCR with SGE?
> I found a BLCR roll at http://ircs.seas.harvard.edu/display/CLUSTERS/BLCR+Roll but I didn't find any link to download it. Has anyone been able to download the BLCR roll?
> Thanks

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=294094
