[GE users] writing user program defined "checkpoint files" when soft wall clock is reached (fwd)

l_heck lydia.heck at durham.ac.uk
Mon Sep 7 10:27:02 BST 2009

This is the third attempt to send this query to this list.

I have been working through the examples which Reuti posted on the web some
years ago to understand checkpointing. They are very clear. Thank you, Reuti,
for posting them.

I would like to implement the following:

Big parallel applications run on our cluster, and they write their own "restart"
or checkpoint files at user-defined intervals. So they are reasonably safe.

However, demand has arisen for "finite" queues, and I would like to "stop"
these jobs in a clean fashion. There is a mechanism for doing so: these programs
look for a "stop" file, and if that file has been written into a predefined
location they write a set of restart files and exit cleanly.
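
For illustration, the stop-file check inside such an application could look
roughly like the sketch below. It is not taken from any of the actual programs;
the names run_steps, stop_file and write_restart are hypothetical, and a real
code would do its simulation work where the placeholder comment sits.

```python
import os


def run_steps(stop_file, max_steps, write_restart):
    """Run up to max_steps steps, stopping early if stop_file exists.

    write_restart is a hypothetical callback that dumps the restart files
    for the given step. Returns the step at which the loop stopped.
    """
    for step in range(max_steps):
        # ... one unit of simulation work would happen here ...

        # Check for the stop file once per step; if it is there, write the
        # restart files and leave the loop cleanly.
        if os.path.exists(stop_file):
            write_restart(step)
            return step
    return max_steps
```

The check happens only once per step, so the delay between the stop file
appearing and the job exiting is at most one step plus the time needed to write
the restart files.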

I would like to trigger the writing of the "stop" file when a soft wall clock
limit is reached. When that limit is reached, the signal that is sent should be
caught by a trap, which then creates the stop file. Once the program reaches the
step at which it next checks for the presence of that file, it can write its
restart files and then exit.
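
A minimal sketch of such a trap, written as a Python helper for a job wrapper,
is given below. It assumes the soft limit is the queue's s_rt, for which
queue_conf(5) documents that the job is notified with SIGUSR1. The function
name install_stop_trap and the stop-file path passed to it are hypothetical,
and whether the signal actually reaches the wrapper for a parallel job would
need to be checked on the cluster.

```python
import signal
from pathlib import Path


def install_stop_trap(stop_file, signum=signal.SIGUSR1):
    """Create stop_file when signum arrives (assumed: SIGUSR1 from s_rt)."""

    def handler(received_signum, frame):
        # Only create the stop file here; the application notices it at its
        # next check and writes its own restart files before exiting.
        Path(stop_file).touch()

    signal.signal(signum, handler)
```

A wrapper would call install_stop_trap with the location the application
watches, then start the application and wait for it. The hard limit (h_rt)
would have to be set sufficiently later than the soft limit so that the
application has time to reach its next check and write the restart files.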

Has anybody out there thought about this, and would it be possible to catch
this signal in some kind of checkpointing mechanism?

This could be used for our large parallel applications and would be very
beneficial for implementing finite queues.


Dr E L  Heck

University of Durham
Institute for Computational Cosmology
Ogden Centre
Department of Physics
South Road

United Kingdom

e-mail: lydia.heck at durham.ac.uk

Tel.: + 44 191 - 334 3628
Fax.: + 44 191 - 334 3645

