[GE users] Clarification required on checkpointing

Chris Dagdigian dag at sonsorol.org
Sat Mar 18 11:55:37 GMT 2006


Hello,

Grid Engine is not magic.  What you describe does not (and cannot)
happen easily, transparently or automatically.

It takes a lot of work on the part of the cluster and SGE administrator
to get checkpointing working. In most cases the techniques differ on
an application-by-application basis.

Without using checkpointing features you can submit a job to Grid  
Engine with a request that it be "rerunnable". This attribute means  
that if the problem you describe happens, the job will automatically  
be re-queued and dispatched onto a different execution host. Of  
course, the job will start over from the beginning of its run.

... except if you have checkpointing enabled. In this case, your job  
can be restarted from the point at which the last checkpoint  
operation occurred.
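
As a rough sketch of the two submission styles described above, the
helper below builds a qsub command line. The function name and its
parameters are made up for illustration; "-r y" is the qsub option that
marks a job as rerunnable and "-ckpt" selects a checkpointing
environment. The environment name is a placeholder for whatever your
administrator has configured.

```python
def build_qsub_args(script, rerunnable=True, ckpt_env=None):
    """Build a qsub argument list (illustrative helper, not part of SGE)."""
    args = ["qsub"]
    # -r y asks Grid Engine to re-queue the job if its execution host
    # fails; without checkpointing the job starts over from the beginning.
    args += ["-r", "y" if rerunnable else "n"]
    if ckpt_env is not None:
        # -ckpt names a checkpointing environment that the administrator
        # must already have defined; the name passed in is an assumption.
        args += ["-ckpt", ckpt_env]
    args.append(script)
    return args
```

On a submit host the resulting list could be handed to
subprocess.run(args, check=True).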

There are two main types of checkpointing:

  - operating system level
  - user level

Operating systems with native application checkpointing capabilities  
are rare. The SGE docs refer over and over again to SGI IRIX as a  
good example of an OS that can do this.

If you don't have an OS that natively supports checkpointing (most do  
not) then you are responsible for setting up the conditions for  
checkpointing and recovery. This tends to be application specific.   
SGE can initiate checkpoints in several different ways: on a periodic
(time) basis, whenever a job gets suspended, or whenever the sge_execd
on a compute node dies or gets shut down.
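
The occasions listed above can be requested at submission time with the
"-c" option of qsub. The sketch below maps them to the occasion letters
as I understand them from qsub(1) (m = minimum CPU interval, s = execd
shutdown, x = job suspension); please verify the letters against the
man page of your SGE version. The dictionary keys and the function name
are invented for this example.

```python
# Occasion letters as I recall them from qsub(1); verify locally.
OCCASIONS = {
    "min_cpu_interval": "m",  # periodic, at the queue's minimum CPU interval
    "execd_shutdown": "s",    # when sge_execd on the node is shut down
    "suspend": "x",           # when the job gets suspended
}

def checkpoint_occasion(*events):
    """Combine the requested events into a single -c occasion string."""
    return "".join(sorted(OCCASIONS[e] for e in events))

def qsub_with_checkpoint(script, ckpt_env, *events):
    """Build a qsub argument list; ckpt_env must already be configured."""
    return ["qsub", "-ckpt", ckpt_env,
            "-c", checkpoint_occasion(*events), script]
```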

The checkpoint is initiated by launching a custom script or by
sending a Unix signal to the running application. It is your
responsibility (or that of the vendor of the running code) to write the
checkpoint script or make the binary aware of custom checkpoint/restart
Unix signals.
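
To make the signal-based approach concrete, here is a minimal sketch of
a user-level checkpoint/restart loop. It is not taken from SGE; the
class, the JSON state file and the choice of SIGUSR1 are all
assumptions, and the signal actually delivered depends on how the
checkpointing environment is configured. The handler only sets a flag,
and the main loop writes the state at the next safe point.

```python
import json
import os
import signal
import tempfile

class Checkpointer:
    """Saves and restores job state in a JSON file (hypothetical format)."""

    def __init__(self, path):
        self.path = path
        self.requested = False

    def request(self, signum=None, frame=None):
        # Signal handler: just record the request, do the I/O later.
        self.requested = True

    def load(self):
        # Resume from the last checkpoint if one exists, else start fresh.
        try:
            with open(self.path) as f:
                return json.load(f)
        except FileNotFoundError:
            return {"step": 0, "total": 0}

    def save(self, state):
        # Write to a temporary file and rename it into place, so a crash
        # during the write cannot leave a half-written checkpoint.
        directory = os.path.dirname(os.path.abspath(self.path))
        fd, tmp = tempfile.mkstemp(dir=directory)
        with os.fdopen(fd, "w") as f:
            json.dump(state, f)
        os.replace(tmp, self.path)
        self.requested = False

def install_handler(ckpt, signum=signal.SIGUSR1):
    # SIGUSR1 is an assumption; use the signal your setup really sends.
    signal.signal(signum, ckpt.request)

def run(ckpt, n_steps):
    """Toy workload: sum the integers below n_steps, resumable."""
    state = ckpt.load()
    while state["step"] < n_steps:
        state["total"] += state["step"]
        state["step"] += 1
        if ckpt.requested:
            ckpt.save(state)
    ckpt.save(state)
    return state
```

A job script would create a Checkpointer, call install_handler(ckpt)
and then run(ckpt, n_steps). If the job is restarted on another host
with the state file on shared storage, load() picks up from the last
saved step instead of from zero.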

More information is in the SGE man pages:

http://gridengine.sunsource.net/nonav/source/browse/~checkout~/gridengine/doc/htmlman//htmlman1/sge_ckpt.html

http://gridengine.sunsource.net/nonav/source/browse/~checkout~/gridengine/doc/htmlman//htmlman5/checkpoint.html

Regards,
Chris






On Mar 18, 2006, at 5:31 AM, Srikanth wrote:

> Hi,
>
>
> If we are running a job on 10 systems and one of the systems suddenly
> goes down due to a hardware failure, what will happen to that job? And
> how can we migrate that job to another node online, without
> interrupting the job?
>
> Please clarify my query and provide the solution to above problem.
>
> Regards,
> M.Srikanth
