[GE users] checkpointing and SGE

Jerry Mersel jerry.mersel at weizmann.ac.il
Wed Jun 27 09:22:10 BST 2007



Yes, I know that SGE does not do checkpointing for me, thank you.
But since there is some documentation about it on the SGE website and a
lot of users have experience with this, I was hoping to pick your brains
and get more info.

In particular, I am asking about MATLAB with BLCR, which doesn't seem to
work. After generating a context file, MATLAB does not restart; I get the
error "resource not available" (or something similar).

                                                      Thanks,
                                                          Jerry

Chris Dagdigian wrote:

> Hi Jerry,
>
> Grid Engine can't magically checkpoint your application for migration  
> to another node -- all it really does is play nicely with either  
> applications or Operating Systems that themselves are checkpoint-aware.
>
> Either the code itself needs to be able to checkpoint locally or you
> need to be running Grid Engine on an operating system that can do
> system-level checkpointing. To my knowledge, Linux and the standard
> Linux kernel do not have this sort of capability. I could not tell
> from your messages what OS and kernel you are talking about.
>
> Most people I know who seriously use checkpointing in production  
> environments are doing it at the application level these days.
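
As an illustration of the application-level approach mentioned above, here is a minimal sketch in Python. It is not taken from the Grid Engine or BLCR documentation; the function names, the JSON state layout, and the `stop_after` parameter (which only simulates the job being killed) are all hypothetical. The idea is that the program saves its own progress to a file at intervals and, when started again, resumes from the last saved state. If that file sits on a filesystem shared by the nodes, the restart does not have to happen on the same node.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write state atomically: temp file in the same directory, then rename."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as handle:
        json.dump(state, handle)
        handle.flush()
        os.fsync(handle.fileno())
    # os.replace swaps the file in one step, so a crash never leaves a
    # half-written checkpoint behind.
    os.replace(tmp_path, path)


def load_checkpoint(path):
    """Return the saved state, or a fresh one if no checkpoint exists."""
    if os.path.exists(path):
        with open(path) as handle:
            return json.load(handle)
    return {"next_step": 0, "total": 0}


def run(path, steps, checkpoint_every=10, stop_after=None):
    """Sum 0..steps-1, checkpointing as it goes.

    stop_after is a hypothetical knob that simulates the job being killed
    before it reaches that step; work since the last checkpoint is lost.
    """
    state = load_checkpoint(path)
    for step in range(state["next_step"], steps):
        if stop_after is not None and step >= stop_after:
            return None  # simulated interruption, nothing saved here
        state["total"] += step
        state["next_step"] = step + 1
        if state["next_step"] % checkpoint_every == 0:
            save_checkpoint(path, state)
    save_checkpoint(path, state)
    return state["total"]
```

Calling `run(path, 100, stop_after=25)` and then `run(path, 100)` with the same path resumes from step 20, the last saved checkpoint, and redoes only the few steps that were lost.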
>
> Regards,
> Chris
>
>
>
>
> On Jun 24, 2007, at 5:49 AM, Jerry Mersel wrote:
>
>> In addition, does the kernel have to be the same across all the nodes?
>>
>> It seems that the "N1GE6 Checkpointing and Berkeley Lab
>> Checkpoint/Restart" doc contradicts itself on whether a process can
>> migrate across nodes.
>>
>>                                                           Regards,
>>                                                               Jerry
>>
>> Jerry Mersel wrote:
>>
>>> Hi:
>>>
>>>  I have to checkpoint a process and then restart it on another node.
>>>  I also have to use kernel-level checkpointing because I don't always
>>>  have access to the code that is being run.
>>>
>>>  I read the documentation, "N1GE6 Checkpointing and Berkeley Lab
>>>  Checkpoint/Restart", and it seemed to say that a checkpointed process
>>>  can't migrate to other nodes.
>>>  Am I reading this correctly? Can someone recommend another method?
>>>
>>>
>>>                                                           Regards,
>>>                                                               Jerry
>>
>
> -- 
> Chris Dagdigian  <dag at sonsorol.org>
> Current coordinates: Boston-area, USA
> GPS: http://bioteam.net/dagbin/gps?42.385693+N+71.115535+W
>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
> For additional commands, e-mail: users-help at gridengine.sunsource.net
>




