[GE users] writing user program defined "checkpoint files" when soft wall clock is reached (fwd)

reuti reuti at staff.uni-marburg.de
Mon Sep 7 11:52:13 BST 2009


On 07.09.2009 at 11:27, l_heck wrote:

> I have been working through the examples which reuti posted on the web
> some years ago to understand checkpointing. They are very clear. Thank
> you, reuti, for posting them.
> I would like to implement the following:
> On our cluster, big parallel applications run which write their own
> "restart" or checkpointing files at user-defined intervals, so they are
> reasonably safe. However, the demand has arrived for "finite" queues,
> and I would like to "stop" these jobs in a clean fashion. There is a
> mechanism for doing so: these programs look for a "stop" file, and if
> that file has been written into a predefined location they will write
> a set of restart files and exit cleanly.
> I would like to trigger the writing of the "stop" file when the soft
> wall clock time is reached. When that wall clock time has been reached,
> a signal should be sent to a trap which then creates the stop file.
> Once the program reaches the step where it again checks for the
> presence of that file, it can write its restart files and then exit.

The job will get a SIGUSR1 when s_rt is reached. Is a:

# Note: two single quotes follow the trap command inside the parentheses
trap "touch my_stop_file" USR1
(trap '' USR1; exec my_binary)

in the job script already doing what you need? (The idea is that the
job script touches the necessary stop file, but the binary of your
application shouldn't get the signal; therefore a sub-shell is created
in which the signal is ignored.)
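
One caveat with running the binary in the foreground: a POSIX shell runs a
trap only after the command it is currently waiting for has finished, so the
stop file could be touched too late. Below is a minimal sketch of a variant
that starts the application in the background and waits with "wait", which a
trapped signal does interrupt. The function my_binary is only a stand-in for
the real application, and the self-sent SIGUSR1 merely simulates Grid Engine
reaching s_rt so that the sketch runs on its own; both are assumptions made
for illustration.

```shell
#!/bin/sh
# Sketch of a job script: create a stop file on SIGUSR1 while the
# application keeps running. Names are illustrative assumptions.

STOP_FILE=my_stop_file
rm -f "$STOP_FILE"            # remove a stale stop file from an earlier run

# Stand-in for the real application: polls for the stop file, then
# "writes restart files" and exits. In a real job script this would be
# replaced by: ( trap '' USR1; exec my_binary ) &
my_binary() {
    while [ ! -e "$STOP_FILE" ]; do
        sleep 1
    done
    echo "stop file found: writing restart files and exiting"
}

# The job script itself reacts to SIGUSR1 by creating the stop file.
trap 'touch "$STOP_FILE"' USR1

# Run the application in the background with SIGUSR1 ignored, so that
# only the job script handles the signal.
( trap '' USR1; my_binary ) &
app_pid=$!

# Simulation only: send SIGUSR1 to this script after two seconds, as
# Grid Engine would do when s_rt is reached. Remove in a real job script.
( sleep 2; kill -USR1 $$ ) &

# "wait" returns early (status > 128) when a trapped signal arrives, so
# keep waiting until the application has really exited.
while kill -0 "$app_pid" 2>/dev/null; do
    wait "$app_pid" || true
done
```

The loop around "wait" matters: without it the job script would end right
after handling the signal, before the application has written its restart
files.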

Another option would be to suspend the job with the trap command
above (i.e. the job suspends itself), but to define the checkpointing
environment to migrate on suspend. This way everything can be done in
the migr_command in the case of application-level checkpointing.
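
As a rough sketch of what such a migr_command could do for an application
that watches for a stop file: create the file, then wait until the
application has exited. The function name request_stop and its three
arguments are hypothetical; how the stop-file path and the process ID reach
the script depends on how the checkpointing environment is defined at your
site.

```shell
#!/bin/sh
# Hypothetical helper for a migr_command script (all names are assumptions).
# Usage: request_stop STOP_FILE APP_PID [TIMEOUT_SECONDS]
# Creates STOP_FILE, then polls until process APP_PID is gone.
# Returns 0 once the process has exited, 1 on error or timeout.

request_stop() {
    rs_stop_file=$1
    rs_app_pid=$2
    rs_timeout=${3:-600}      # default: give the application 10 minutes

    # Ask the application to write its restart files and exit.
    touch "$rs_stop_file" || return 1

    rs_waited=0
    while kill -0 "$rs_app_pid" 2>/dev/null; do
        if [ "$rs_waited" -ge "$rs_timeout" ]; then
            echo "process $rs_app_pid still running after ${rs_timeout}s" >&2
            return 1
        fi
        sleep 1
        rs_waited=$((rs_waited + 1))
    done
    return 0
}
```

The timeout is a design choice of this sketch: it keeps the migration from
hanging forever if the application never reaches the step where it checks
for the stop file.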

-- Reuti

> Has anybody out there thought about this, and would it be possible to
> catch this signal in a type of checkpointing mechanism?
> This could be used for our large parallel applications and would be
> very beneficial for implementing finite queues.
> Lydia
> ------------------------------------------
> Dr E L  Heck
> University of Durham
> Institute for Computational Cosmology
> Ogden Centre
> Department of Physics
> South Road
> United Kingdom
> e-mail: lydia.heck at durham.ac.uk
> Tel.: + 44 191 - 334 3628
> Fax.: + 44 191 - 334 3645
> ___________________________________________
> ------------------------------------------------------
> http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=216216
> To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].

