[SGE-discuss] Condor checkpointing in SGE 6.2

Martin Koehler koehlerm at mpi-magdeburg.mpg.de
Mon Jan 9 15:09:31 GMT 2012


-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Hi,
I set up checkpointing with SGE 6.2 and Condor 7.6.5 on our cluster,
following http://arc.liv.ac.uk/SGE/howto/checkpointing.html. The
problem is that Grid Engine does not send the USR2 signal to the
process. I wrote a small counting program, compiled it with
condor_compile, and ran it using the script described in the howto,
adding the -_condor_D_ALL flag to see what happens:

 User Job - $CondorPlatform: X86_64-CentOS_5.5 $
 User Job - $CondorVersion: 7.6.5 Jan 09 2012 BuildID: UW_development $
 Condor: Notice: Will checkpoint to /home/checkpoints/8029/checkpoint
 Condor: Notice: Remote system calls disabled.
 computing position: 0
 computing position: 1
 computing position: 2
 computing position: 3
 computing position: 4
 computing position: 5
 .......
If I use qmod -s 8029 to reschedule the job, it starts again on another
node without creating a checkpoint and without any indication of what
happened.

If I send the signal manually, the checkpoint is created and I get:

 computing position: 657
 Got SIGUSR2
 Saved signal state.
 About to save file state
 CondorFileTable::checkpoint
 .........
 Done restoring file state
 About to restore signal state
 About to return to user code
 computing position: 658

So the checkpoint mechanism itself works. Why does Grid Engine not send
the signal to the process?
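
The manual test described above can be reproduced without Condor along
these lines. This is only a sketch: standin_job and send_ckpt_signal
are hypothetical names, and the stand-in merely logs the signal, whereas
the real condor_compile'd binary writes a checkpoint when it receives
SIGUSR2.

```shell
# Hypothetical stand-in for the checkpointable job: it counts and logs a
# line when SIGUSR2 arrives. It does NOT write a checkpoint.
standin_job() {
    job_log="$1"
    i=0
    # The handler runs once the current sleep has returned.
    trap 'echo "Got SIGUSR2" >> "$job_log"' USR2
    while [ "$i" -lt 20 ]; do
        echo "computing position: $i" >> "$job_log"
        i=$((i + 1))
        sleep 0.1
    done
}

# Send the checkpoint signal by hand to the given PID.
send_ckpt_signal() {
    kill -USR2 "$1"
}
```

Running standin_job in the background and calling send_ckpt_signal with
its PID should leave a "Got SIGUSR2" line in the log between two
"computing position" lines.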

The checkpointing environment is configured as follows:
qconf -sckpt condor
ckpt_name          condor
interface          TRANSPARENT
ckpt_command       NONE
migr_command       NONE
restart_command    NONE
clean_command      NONE
ckpt_dir           /home/checkpoints
signal             usr2
when               xsmr
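
For readers unfamiliar with the "when" field, the letters can be spelled
out with a small helper such as the one below. The descriptions are a
paraphrase of the checkpoint(5) man page as I understand it, not
authoritative wording, and describe_when is a hypothetical name.

```shell
# Print one line per letter of a checkpoint "when" value.
# Meanings paraphrased from checkpoint(5); treat them as an assumption.
describe_when() {
    flags="$1"
    while [ -n "$flags" ]; do
        rest="${flags#?}"            # everything after the first letter
        letter="${flags%"$rest"}"    # the first letter itself
        case "$letter" in
            s) echo "s: checkpoint and migrate when sge_execd is shut down" ;;
            m) echo "m: checkpoint periodically (queue min_cpu_interval)" ;;
            x) echo "x: checkpoint and migrate when the job is suspended" ;;
            r) echo "r: reschedule without checkpoint when the host becomes unknown" ;;
            *) echo "$letter: unknown flag" ;;
        esac
        flags="$rest"
    done
}
```

Calling describe_when xsmr prints four lines, one per letter. Under this
reading, the "x" flag is what makes a suspend via qmod -s trigger a
migration.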

Regards,
Martin

PS: It worked once, but I cannot see what was different that time.




- -- 
Dipl.-Math. Martin Köhler
Max Planck Institute for
Dynamics of Complex Technical Systems
Sandtorstr. 1
39106 Magdeburg
Germany


phone: +49 (0)391 6110 445
email: koehlerm at mpi-magdeburg.mpg.de
www: http://www.mpi-magdeburg.mpg.de/mpcsc/koehlerm/
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.10 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/

iEYEARECAAYFAk8LAycACgkQXeVvfIKK/EgBvgCfWqtjhZeKYbn8frepdyh7EeWn
84gAnjHEdEIL/KfjqQZyrla9HIwoyTki
=weRz
-----END PGP SIGNATURE-----

