[GE users] checkpointing with chpox

Reuti reuti at staff.uni-marburg.de
Tue Nov 1 23:12:58 GMT 2005


Hi Richard,

On 30.10.2005 at 18:10, Richard Menedetter wrote:

> Hi
>
> 24 Sep 2005, Reuti <reuti at staff.uni-marburg.de> wrote:
>
> First of all, sorry for the very late answer
> (the mail slipped past me).
>
>>> I wanted to know if somebody tried the checkpointing feature of
>>> gridengine 6 with chpox.
>>> http://www.cluster.kiev.ua/tasks/chpx_eng.html
>>>
>>> I have tried it with the following commands, but have not
>>> succeeded.
>
>  R> The restart command is only used with the kernel-level
>  R> checkpointing interfaces, which are specific to some OSs.
>
> Can't I use the kernel-level checkpointing if the checkpointing
> software supports restart by itself?

If I understand the source correctly, some of the SGE kernel-level
support routines are only compiled on the target platform they are
intended for, e.g. cpr on IRIX/SGI. It would be possible to extend
the source in these places for chpox as well.

>  R> Did you have a look at the man pages of "sge_ckpt" and
>  R> "checkpoint"? These two Howtos might also be helpful:
>  R> http://gridengine.sunsource.net/howto/checkpointing.html
>  R> http://gridengine.sunsource.net/howto/APSTC-TB-2004-005.pdf
>
> Thanks ... I have read the second link.
> But at the moment I have limited time for gridengine :(
>
> I will try again later, and report back.
>
>  R> In your case I'd go for the application-level interface.
>  R> The restart of the process then has to be put in the jobscript.
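
To illustrate what "putting the restart in the jobscript" could look like, here is a minimal sketch, not a tested chpox setup. It relies on SGE exporting RESTARTED=1 into the job environment when a checkpointing job is started again. The function name choose_command and both command lines are placeholders made up for this example; they are not real chpox invocations.

```python
#!/usr/bin/env python3
"""Sketch of a job script that decides between a fresh start and a restart."""
import os


def choose_command(environ, fresh_cmd, resume_cmd):
    """Return resume_cmd if SGE marked this run as a restart, else fresh_cmd."""
    # SGE sets RESTARTED to 1 when a checkpointing job is started again
    # (after a migration or a crash) and to 0 otherwise. Outside of SGE
    # the variable is absent, which is treated like a fresh start here.
    restarted = environ.get("RESTARTED", "0") == "1"
    return resume_cmd if restarted else fresh_cmd


if __name__ == "__main__":
    # Both command lines are hypothetical placeholders.
    fresh = ["./my_simulation", "--input", "job.in"]
    resume = ["./restore_from_checkpoint", "job.ckpt"]
    cmd = choose_command(os.environ, fresh, resume)
    # A real job script would hand over to the program at this point,
    # e.g. with os.execvp(cmd[0], cmd); the sketch only prints the choice.
    print("would run:", " ".join(cmd))
```

The drawback is the one raised in this thread: every job script has to carry this logic itself.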
>
> this is exactly what I did not want to do.
> I wanted to go for unmodified job scripts, so that the user simply
> enables checkpointing when submitting the job, and that would be it.

Until there is official kernel-level checkpointing in the Linux
kernel, I still prefer the other choices. One option might be the
virtual machine solution from Meiosys, where your program runs in a
virtual machine and the virtual machine is checkpointed and perhaps
moved to a different node (http://www.meiosys.com). But they were
bought by IBM, and I haven't heard anything about a new release.

Cheers - Reuti

>  R> For the migrate command, I'd suggest including a test for the
>  R> flags supplied by chpox, to wait until the checkpoint has been
>  R> created successfully. One second may be too short, as the
>  R> checkpointing directory is most likely located on a shared volume.
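
One way such a wait could look, instead of a fixed one-second sleep, is sketched below. It assumes the checkpoint ends up as a single file and treats it as complete once its size has stopped changing. That is only a heuristic chosen for this example, not something chpox documents, and if chpox reports completion itself that signal would be preferable. The function name wait_for_checkpoint and all parameters are hypothetical.

```python
import os
import time


def wait_for_checkpoint(path, timeout=60.0, poll=0.5, stable_polls=3):
    """Wait until the file at `path` looks completely written.

    Returns True once the file exists, is non-empty, and its size has
    matched the previous poll `stable_polls` times in a row.
    Returns False if that does not happen within `timeout` seconds.
    """
    deadline = time.monotonic() + timeout
    last_size = -1
    unchanged = 0
    while time.monotonic() < deadline:
        try:
            size = os.stat(path).st_size
        except FileNotFoundError:
            size = -1  # not visible yet, e.g. still propagating on the share
        if size > 0 and size == last_size:
            unchanged += 1
            if unchanged >= stable_polls:
                return True
        else:
            unchanged = 0  # file missing, empty, or still growing
        last_size = size
        time.sleep(poll)
    return False
```

A migrate script could call this with the checkpoint path and a generous timeout, and only proceed (or report failure) based on the return value.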
>
> thanks for the hint.
> I will try that.
>
> is it possible to use the kernel level checkpointing with chpox?
>
>  R> Cheers - Reuti
>
> CU, Ricsi
>
> -- 
> |~)o _ _o  Richard Menedetter <ricsi at gmx.at> {ICQ: 7659421} (PGP)
> |~\|(__\|  -=> When all is said and done more will be said than done <=-
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
> For additional commands, e-mail: users-help at gridengine.sunsource.net

