[GE users] writing user program defined "checkpoint files" when soft wall clock is reached (fwd)

l_heck lydia.heck at durham.ac.uk
Mon Sep 7 14:25:04 BST 2009


Hi,

I have not tried to implement the trap for sigusr1. I will run some tests
and will post the outcome.

Thank you.

Lydia

On Mon, 7 Sep 2009, reuti wrote:

> Hi,
>
> Am 07.09.2009 um 11:27 schrieb l_heck:
>
> > I have been working through the examples which reuti posted on the web
> > some years ago to understand checkpointing. They are very clear. Thank
> > you, reuti, for posting them.
> >
> > I would like to implement the following:
> >
> > On our cluster, big parallel applications run which write their own
> > "restart" or checkpointing files at user-defined intervals, so they are
> > reasonably safe.
> >
> > However, the demand has arrived for "finite" queues, and I would like to
> > "stop" these jobs in a clean fashion. There is a mechanism for doing so:
> > these programs look for a "stop" file, and if that file has been written
> > into a predefined location they will write a set of restart files and
> > exit cleanly.
> >
> > I would like to trigger the writing of this "stop" file when the soft
> > wall clock time is reached. Reaching that wall clock time should send
> > a signal to a trap, which then creates the stop file. Once the program
> > reaches the step at which it next checks for the presence of that file,
> > it can write its restart files and then exit.
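
The stop-file mechanism described above can be sketched in shell as follows. This is only an illustration of the polling pattern, not the real applications' code: the function name run_steps, its arguments, and the "step=N" restart file content are all hypothetical.

```shell
#!/bin/sh
# run_steps STOP_FILE RESTART_FILE MAX_STEPS
# Hypothetical stand-in for an application main loop: after each step it
# checks for the stop file and, if present, writes a restart file and
# stops early.
run_steps() {
    stop_file=$1
    restart_file=$2
    max_steps=$3
    step=0
    while [ "$step" -lt "$max_steps" ]; do
        step=$((step + 1))
        # ... one unit of real work would go here ...
        if [ -e "$stop_file" ]; then
            # Record where to resume (placeholder for real restart data).
            echo "step=$step" > "$restart_file"
            return 0
        fi
    done
    return 0
}
```

The job only stops at the next check, so the time between checks has to fit into the gap between the soft and the hard limit.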
>
> The job will get a SIGUSR1 when s_rt is reached. Does the following in
> the job script already do what you need?
>
> trap "touch my_stop_file" USR1
> # Two single quotes after the trap command below:
> (trap '' USR1; exec my_binary)
>
> The idea is that the job script touches the necessary stop file, but the
> binary of your application shouldn't get the signal; therefore a
> sub-shell is created.
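
A slightly fuller sketch of the job script idea above, in POSIX shell. One point it has to work around: a shell runs a trap only after the foreground command it is waiting for has finished. The sketch therefore starts the application in the background and uses wait, which returns early when a trapped signal arrives. The function name run_with_stop_file and the variable names are hypothetical and not part of Grid Engine.

```shell
#!/bin/sh
# run_with_stop_file STOP_FILE COMMAND [ARGS...]
# Hypothetical helper: runs COMMAND with SIGUSR1 ignored and touches
# STOP_FILE when this shell receives SIGUSR1 (sent when s_rt is reached).
run_with_stop_file() {
    STOP_FILE=$1
    shift

    # The job script itself reacts to SIGUSR1 by creating the stop file.
    trap 'touch "$STOP_FILE"' USR1

    # The application ignores SIGUSR1 (ignored signals stay ignored across
    # exec). It runs in the background so the trap can fire while it runs.
    (trap '' USR1; exec "$@") &
    app_pid=$!

    # wait returns with a status above 128 when a trapped signal arrives,
    # so keep waiting while the application is still alive.
    wait "$app_pid"
    app_status=$?
    while [ "$app_status" -gt 128 ] && kill -0 "$app_pid" 2>/dev/null; do
        wait "$app_pid"
        app_status=$?
    done

    trap - USR1
    return "$app_status"
}
```

In a job script this could be called as: run_with_stop_file my_stop_file ./my_binary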
>
> Another option would be to have the trap command above suspend the job
> (i.e. the job suspends itself), and to define the checkpointing
> environment to migrate on suspend. This way everything can be done in
> the migr_command in the case of application-level checkpointing.
>
> -- Reuti
>
>
> > Has anybody out there thought about this, and would it be possible to
> > catch this signal in some type of checkpointing mechanism?
> >
> > This could be used for our large parallel applications and would be very
> > beneficial for implementing finite queues.
> >
> > Lydia
> >
> >
> > ------------------------------------------
> > Dr E L  Heck
> >
> > University of Durham
> > Institute for Computational Cosmology
> > Ogden Centre
> > Department of Physics
> > South Road
> >
> > DURHAM, DH1 3LE
> > United Kingdom
> >
> > e-mail: lydia.heck at durham.ac.uk
> >
> > Tel.: + 44 191 - 334 3628
> > Fax.: + 44 191 - 334 3645
> > ___________________________________________
> >
> > ------------------------------------------------------
> > http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=216216
> >
> > To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].
>
> ------------------------------------------------------
> http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=216227
>

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=216255
