[GE users] Clarification required on checkpointing
dag at sonsorol.org
Sat Mar 18 11:55:37 GMT 2006
Grid Engine is not magic. What you describe does not (and cannot)
happen easily, transparently or automatically.
It takes a lot of work on the part of the cluster and SGE administrator
to get checkpointing working. In most cases the techniques differ on
an application-by-application basis.
Without using checkpointing features you can submit a job to Grid
Engine with a request that it be "rerunnable". This attribute means
that if the problem you describe happens, the job will automatically
be re-queued and dispatched onto a different execution host. Of
course, the job will start over from the beginning of its run.
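As a rough illustration of the "rerunnable" request, here is a minimal sketch that builds a qsub command line with the "-r y" option. The helper name build_rerunnable_qsub and the script name my_job.sh are made up for this example; only the qsub options -r and -N come from SGE itself.

```python
import shlex


def build_rerunnable_qsub(script_path, job_name=None):
    """Return the argv list for submitting script_path as a rerunnable job.

    Hypothetical helper: it only assembles the command, it does not run it.
    """
    argv = ["qsub", "-r", "y"]  # "-r y" asks SGE to treat the job as rerunnable
    if job_name:
        argv += ["-N", job_name]  # optional job name
    argv.append(script_path)
    return argv


if __name__ == "__main__":
    # Print the command so it can be inspected before it is run by hand.
    argv = build_rerunnable_qsub("my_job.sh", job_name="demo")
    print(" ".join(shlex.quote(a) for a in argv))
```

To actually submit, the list could be passed to subprocess.run on a host where qsub is installed.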
... except if you have checkpointing enabled. In this case, your job
can be restarted from the point at which the last checkpoint was taken.
There are 2 main types of checkpointing:
- operating system level
- user level
Operating systems with native application checkpointing capabilities
are rare. The SGE docs refer over and over again to SGI IRIX as a
good example of an OS that can do this.
If you don't have an OS that natively supports checkpointing (most do
not) then you are responsible for setting up the conditions for
checkpointing and recovery. This tends to be application specific.
SGE can initiate checkpoints in several different ways -- on a periodic
(time) basis, whenever a job gets suspended or whenever the sge_execd
on a compute node dies or gets shut down.
The checkpoint is initiated by launching a custom script or by
sending a Unix signal to the running application. It is your
responsibility (or that of the vendor of the running code) to write the
checkpoint script or to make the binary aware of the custom checkpoint/
restart Unix signals.
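To make the signal-driven case concrete, here is a minimal user-level sketch of an application that writes its state to disk when asked to, and resumes from that file when it is started again. Everything in it is an assumption for illustration: the class name CheckpointedSum, the file name state.ckpt, the max_steps parameter, and the choice of SIGUSR1 as the checkpoint signal (the actual signal depends on how the checkpointing environment is configured). The signal handler only sets a flag, and the state is saved at the end of a loop iteration so that the file never captures a half-finished step.

```python
import json
import os
import signal
import tempfile


class CheckpointedSum:
    """Toy job: sums 0..total_steps-1 and can resume from a checkpoint file."""

    def __init__(self, path, total_steps):
        self.path = path
        self.total_steps = total_steps
        self.step = 0
        self.acc = 0
        self.checkpoint_requested = False

    def load(self):
        # Resume from the last checkpoint if one exists.
        if os.path.exists(self.path):
            with open(self.path) as f:
                state = json.load(f)
            self.step = state["step"]
            self.acc = state["acc"]

    def save(self):
        # Write to a temporary file, then rename, so a crash during the
        # write cannot leave a truncated checkpoint behind.
        directory = os.path.dirname(os.path.abspath(self.path))
        fd, tmp = tempfile.mkstemp(dir=directory)
        with os.fdopen(fd, "w") as f:
            json.dump({"step": self.step, "acc": self.acc}, f)
        os.replace(tmp, self.path)

    def request_checkpoint(self, signum, frame):
        # Signal handler: only record the request; the main loop saves.
        self.checkpoint_requested = True

    def run(self, max_steps=None):
        # max_steps (hypothetical) limits the work done in this call, which
        # makes it easy to imitate a job that dies part way through.
        self.load()
        done = 0
        while self.step < self.total_steps:
            if max_steps is not None and done >= max_steps:
                break
            self.acc += self.step
            self.step += 1
            done += 1
            if self.checkpoint_requested:
                self.save()
                self.checkpoint_requested = False
        return self.acc


def main():
    job = CheckpointedSum("state.ckpt", total_steps=1000)
    # Assumption: the checkpoint request arrives as SIGUSR1.
    signal.signal(signal.SIGUSR1, job.request_checkpoint)
    print(job.run())


if __name__ == "__main__":
    main()
```

If the job is killed and started again with the same checkpoint path, load() picks up the saved step and accumulator, so only the work done after the last checkpoint is repeated.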
More information is in the SGE man pages:
On Mar 18, 2006, at 5:31 AM, Srikanth wrote:
> If we are running the job on 10 systems and if one of the systems
> went-off due to some hardware failure, what will happen to that
> job? And how
> can we migrate that job to another node online without the
> interruption to
> the particular Job.
> Please clarify my query and provide the solution to above problem.