[GE users] Help: Checkpoint Problem

Reuti reuti at staff.uni-marburg.de
Thu Apr 3 10:53:20 BST 2008


On 01.04.2008 at 15:00, Lee Amy wrote:
> 2008/4/1, Reuti <reuti at staff.uni-marburg.de>:
>
> Hi,
>
> On 01.04.2008 at 13:51, Lee Amy wrote:
>
> > Hello,
> >
> > I use MPICH 1.2.6 on my cluster and I want to set up checkpointing
> > with SGE. Is there a step-by-step way to do this?
>
>
> AFAIK it's not possible with MPICH unless you program it on your own
> at an application level. LAM/MPI has it built-in since 7.1, and for
> Open MPI it's scheduled for the upcoming 1.3 release.
>
> There are Howtos for checkpointing serial applications with SGE,
> but be aware that SGE will only, let's say, trigger an already
> existing checkpointing facility of the application (which already has
> to work without SGE).
>
> -- Reuti
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
> For additional commands, e-mail: users-help at gridengine.sunsource.net
>
> Thanks for your reply. Could you tell me more about what 'at an
> application level' means? And if I can only use MPICH 1.2.6, what
> can I do about checkpointing?

How to write checkpointable applications is far beyond the scope of
this list. Once your program is checkpointable without a queuing system,
we can start to integrate it into SGE. In addition, it would be
good if your application could be moved between different sets of
nodes, so take care where and how you store any temporary information
and avoid recording node-specific dependencies.
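
To give a rough idea of what "application level" means here, the following is a minimal sketch in Python of a serial loop that saves its own state and resumes from it. It is not MPICH-specific and not an SGE interface. All names (save_checkpoint, load_checkpoint, run, max_steps_this_run) are made up for illustration, and it assumes the checkpoint path lives on a filesystem every node can see rather than in a node-local directory. A parallel program would additionally have to coordinate its ranks before writing, which is not shown.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write state atomically: temp file in the same directory, then rename."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory, prefix=".ckpt-")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(state, f)
            f.flush()
            os.fsync(f.fileno())
        # os.replace swaps the file in one step, so a reader never
        # sees a half-written checkpoint.
        os.replace(tmp, path)
    except BaseException:
        if os.path.exists(tmp):
            os.unlink(tmp)
        raise


def load_checkpoint(path):
    """Return the saved state, or a fresh one if no checkpoint exists."""
    try:
        with open(path) as f:
            return json.load(f)
    except FileNotFoundError:
        return {"step": 0, "total": 0}


def run(path, n_steps, every=10, max_steps_this_run=None):
    """Sum 0..n_steps-1, checkpointing every `every` steps.

    max_steps_this_run is a hypothetical knob that simulates the job
    being killed partway through; work since the last save is lost.
    """
    state = load_checkpoint(path)
    done_now = 0
    while state["step"] < n_steps:
        if max_steps_this_run is not None and done_now >= max_steps_this_run:
            return state  # simulated interruption
        state["total"] += state["step"]
        state["step"] += 1
        done_now += 1
        if state["step"] % every == 0 or state["step"] == n_steps:
            save_checkpoint(path, state)
    return state
```

Note that the state contains only the step counter and the partial result, no hostname or node-local path, so a restart on a different set of nodes can pick it up unchanged.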

You may try Google with keywords like "Checkpointing Strategy Parallel".

-- Reuti
