[GE users] checkpointing environment trouble

Jason Crane Jason.Crane at mrsc.ucsf.edu
Wed Jul 21 22:56:00 BST 2004


Hi,

I'm trying to checkpoint an application with the Condor libraries, as
described on the following page (under SGEEE 5.3p5):

http://gridengine.sunsource.net/howto/condorckpt.html

If I configure my checkpointing environment with the "userdefined"
interface, I see no evidence that the checkpoint, migration, restart, or
clean commands are executed when I suspend the queues running the job.
Nor do I see the checkpoint signal (USR2 or TSTP) being sent to the job
to create a checkpoint.  The only way I have been able to get any of the
commands in my checkpoint environment to run on job suspension is to
switch the interface from "userdefined" to "application-level", and even
then only the migration command appears to execute; the checkpoint
signal still isn't sent to the job.

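Job script:

#!/bin/csh
# Use the rrc_condor checkpointing environment shown below.
#$ -ckpt rrc_condor
# Checkpoint when the job is suspended.
#$ -c x
#$ -l arch=solaris64
# -notify is disabled by the extra "#".
##$ -notify
# Run from the current working directory.
#$ -cwd
condor_test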
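
One way to check whether the checkpoint signal reaches the job at all is
to submit a stand-in job that does nothing but trap and log signals. The
following is a minimal sketch, not part of my actual setup: the SIGLOG
path and the on_signal function are hypothetical names, and the
self-delivered "kill" only shows that the trap works. In a real test job
that line would be replaced by a long-running workload.

```shell
#!/bin/sh
# Sketch only: SIGLOG and on_signal are hypothetical names.
SIGLOG="${SIGLOG:-/tmp/ckpt_signal_test.log}"

# Start with an empty log so old entries do not confuse the result.
: > "$SIGLOG"

on_signal() {
    # Append the name of the signal that arrived to the log.
    echo "received SIG$1" >> "$SIGLOG"
}

# Log the two candidate checkpoint signals instead of acting on them.
trap 'on_signal USR2' USR2
trap 'on_signal TSTP' TSTP

# Self-test: deliver USR2 to this shell to show that the trap fires.
# In a real job, replace this line with the workload.
kill -USR2 $$

cat "$SIGLOG"
```

If the log stays empty after the queue is suspended, the signal never
reached the job script's shell.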


Checkpointing environment:

ckpt_name          rrc_condor
interface          APPLICATION-LEVEL
ckpt_command       /netopt/sge/ckpt/ct.sh
migr_command       /netopt/sge/ckpt/mt.sh
restart_command    /netopt/sge/ckpt/rt.sh
clean_command      /netopt/sge/ckpt/clt.sh
ckpt_dir           /data/testing/grid/ckpt
queue_list         a.q b.q c.q etc....
signal             TSTP
when               xsr
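
To tell which of the four commands actually get invoked, each of them
could temporarily be replaced by a stub that only records the call. This
is a rough sketch under assumptions: CKPT_LOG and log_call are
hypothetical names, the log path is arbitrary, and the real contents of
ct.sh, mt.sh, rt.sh and clt.sh are not shown here.

```shell
#!/bin/sh
# Hypothetical stand-in for ct.sh, mt.sh, rt.sh and clt.sh.
# CKPT_LOG is an assumed path; Grid Engine does not define it.
CKPT_LOG="${CKPT_LOG:-/tmp/ckpt_commands.log}"

log_call() {
    # $1 is a label for the hook; the remaining arguments are whatever
    # the hook was called with.
    label=$1
    shift
    echo "$label pid=$$ args=$*" >> "$CKPT_LOG"
}

# Use the script's own file name as the label, so one stub can be
# copied to all four command paths.
hook=`basename "$0"`
log_call "$hook" "$@"
```

Copying the same stub to all four paths and suspending the queue should
then leave one line in the log per command that was run.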


As an aside, does anyone have any experience with, or recommendations
for, checkpointing MPI jobs?

Thanks,
Jason