[GE users] Clarification required on checkpointing

Reuti reuti at staff.uni-marburg.de
Sat Mar 18 13:14:46 GMT 2006



Quoting Chris Dagdigian <dag at sonsorol.org>:

>
> Hello,
>
> Grid Engine is not magic. What you describe does not (and cannot)
> happen easily, transparently or automatically.
>
> It takes a lot of work on behalf of the cluster and SGE administrator
> to get checkpointing working. In most cases the techniques differ on
> an application-by-application basis.
>
> Without using checkpointing features you can submit a job to Grid  
> Engine with a request that it be "rerunnable". This attribute means  
> that if the problem you describe happens, the job will automatically  
> be re-queued and dispatched onto a different execution host. Of  
> course, the job will start over from the beginning of its run.
>
> ... except if you have checkpointing enabled. In this case, your job  
> can be restarted from the point at which the last checkpoint  
> operation occurred.
>
> There are 2 main types of checkpointing:
>
>  - operating system level
>  - user level
>
> Operating systems with native application checkpointing capabilities  
> are rare. The SGE docs refer over and over again to SGI IRIX as a  
> good example of an OS that can do this.
>
> If you don't have an OS that natively supports checkpointing (most do
> not) then you are responsible for setting up the conditions for
> checkpointing and recovery. This tends to be application specific.
> SGE can initiate checkpoints in several different ways: on a periodic
> (time) basis, whenever a job gets suspended, or whenever the sge_execd
> on a compute node dies or gets shut down.
>
> The checkpoint is initiated by launching a custom script or by
> sending a Unix signal to the running application. It is your
> responsibility (or that of the vendor of the running code) to write
> the checkpoint script or make the binary aware of custom
> checkpoint/restart Unix signals.
>
> More information is in the SGE man pages:
>
> http://gridengine.sunsource.net/nonav/source/browse/~checkout~/gridengine/doc/htmlman//htmlman1/sge_ckpt.html
>
> http://gridengine.sunsource.net/nonav/source/browse/~checkout~/gridengine/doc/htmlman//htmlman5/checkpoint.html
>

Just to note that there are also two Howtos:

http://gridengine.sunsource.net/howto/checkpointing.html

http://gridengine.sunsource.net/howto/APSTC-TB-2004-005.pdf
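
To make the user-level approach described above a bit more concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not something taken from the Howtos: the application is assumed to receive SIGUSR1 as its checkpoint request (which signal actually arrives depends on how the checkpointing environment is set up), and the function names, the JSON state layout and the toy workload are all hypothetical. The handler only sets a flag; the main loop writes the checkpoint at a safe point and, when restarted, resumes from the last saved state instead of from step 0.

```python
import json
import os
import signal

# Set by the signal handler; the main loop polls it at a safe point.
checkpoint_requested = False


def request_checkpoint(signum, frame):
    """Signal handler: only record the request, do no I/O here."""
    global checkpoint_requested
    checkpoint_requested = True


def install_handler(signum=signal.SIGUSR1):
    # SIGUSR1 is an assumption; use whatever signal your setup sends.
    signal.signal(signum, request_checkpoint)


def save_state(path, state):
    """Write the checkpoint atomically so a crash never leaves half a file."""
    tmp = path + ".tmp"
    with open(tmp, "w") as fh:
        json.dump(state, fh)
    os.replace(tmp, path)


def load_state(path):
    """Resume from an existing checkpoint, or start from scratch."""
    try:
        with open(path) as fh:
            return json.load(fh)
    except FileNotFoundError:
        return {"step": 0, "total": 0}


def run(path, n_steps):
    """Toy workload: sum 0..n_steps-1, checkpointing when requested."""
    global checkpoint_requested
    state = load_state(path)
    while state["step"] < n_steps:
        state["total"] += state["step"]
        state["step"] += 1
        if checkpoint_requested:
            save_state(path, state)
            checkpoint_requested = False
    return state
```

A job built this way would call install_handler() once at startup and then run() with a checkpoint path on storage that every execution host can reach, so that the restarted job on another host finds the saved state.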

-- Reuti


> Regards,
> Chris
>
> On Mar 18, 2006, at 5:31 AM, Srikanth wrote:
>
>> Hi,
>>
>>
>> If we are running a job on 10 systems and one of the systems suddenly
>> goes down due to a hardware failure, what will happen to that job?
>> And how can we migrate the job to another node online, without
>> interrupting it?
>>
>> Please clarify my query and provide a solution to the above problem.
>>
>> Regards,
>> M.Srikanth
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
> For additional commands, e-mail: users-help at gridengine.sunsource.net
>






