[SGE-discuss] Condor checkpointing in SGE 6.2

Reuti reuti at staff.uni-marburg.de
Mon Jan 9 15:21:50 GMT 2012


-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Hi Martin,

On 09.01.2012 at 16:09, Martin Koehler wrote:

> I set up checkpointing using SGE 6.2 and Condor 7.6.5 on our
> cluster. I used http://arc.liv.ac.uk/SGE/howto/checkpointing.html to
> configure it. But now I have the problem that Grid Engine does
> not send the usr2 signal to the process. I created a small counting
> program, compiled it using condor_compile, and ran it using the
> described script, with the additional -_condor_D_ALL flag to see what
> happens:
> 
> User Job - $CondorPlatform: X86_64-CentOS_5.5 $
> User Job - $CondorVersion: 7.6.5 Jan 09 2012 BuildID: UW_development $
> Condor: Notice: Will checkpoint to /home/checkpoints/8029/checkpoint
> Condor: Notice: Remote system calls disabled.
> computing position: 0
> computing position: 1
> computing position: 2
> computing position: 3
> computing position: 4
> computing position: 5
> .......
> If I use qmod -s 8029 to reschedule the job, it starts again on another
> node without creating a checkpoint or giving any information about what
> happens.

In a way this is the intended behavior; it is the documentation that is wrong:

https://arc.liv.ac.uk/trac/SGE/ticket/346

You have to wait one cycle of "min_cpu_interval" (set in the queue definition) for a checkpoint file to be created. At the time of suspension, no signal is sent.
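
For illustration, the relevant entry in the queue definition looks roughly like the excerpt below, as shown by "qconf -sq <queue>". The queue name "all.q" and the value are assumptions, not taken from your setup, and the "#" line is only my annotation, not part of the real output. As far as I read checkpoint(5), this interval is what the "m" in your "when xsmr" refers to.

```
# excerpt of the queue definition (queue name and value are assumed)
qname                 all.q
min_cpu_interval      00:05:00
```

The value can be changed with "qconf -mq <queue>"; a shorter interval should make a checkpoint appear sooner, at the cost of checkpointing more often.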

The state diagrams in http://arc.liv.ac.uk/SGE/howto/APSTC-TB-2004-005.pdf explain this quite nicely.

- -- Reuti


> If I send the signal manually, the checkpoint is created and I get
> 
> computing position: 657
> Got SIGUSR2
> Saved signal state.
> About to save file state
> CondorFileTable::checkpoint
> .........
> Done restoring file state
> About to restore signal state
> About to return to user code
> computing position: 658
> 
> And the checkpoint is created. So why does Grid Engine not send
> the signal to the process?
> 
> The checkpointing is configured as follows:
> qconf -sckpt condor
> ckpt_name          condor
> interface          TRANSPARENT
> ckpt_command       NONE
> migr_command       NONE
> restart_command    NONE
> clean_command      NONE
> ckpt_dir           /home/checkpoints
> signal             usr2
> when               xsmr
> 
> regards
> Martin
> 
> PS: It worked one time but I do not see any difference.
> 
> 
> 
> 
> - -- 
> Dipl.-Math. Martin Köhler
> Max Planck Institute for
> Dynamics of Complex Technical Systems
> Sandtorstr. 1
> 39106 Magdeburg
> Germany
> 
> 
> phone: +49 (0)391 6110 445
> email: koehlerm at mpi-magdeburg.mpg.de
> www: http://www.mpi-magdeburg.mpg.de/mpcsc/koehlerm/

-----BEGIN PGP SIGNATURE-----
Version: GnuPG/MacGPG2 v2.0.16 (Darwin)

iEYEARECAAYFAk8LBhkACgkQo/GbGkBRnRoRYQCgiEpNlqf0xN9ufq3RD1KmQ7oL
t14AnRISdi4GBRL5zQrqZZi6Kt9dUrbX
=FhtK
-----END PGP SIGNATURE-----

