[GE users] lam and blcr checkpointing

reuti reuti at staff.uni-marburg.de
Fri Nov 21 12:11:36 GMT 2008


Hi Jerry,

On 19.11.2008 at 15:16, Jerry Mersel wrote:

> Hi Reuti:
>
>   Sorry, I was very unclear in my last email. I'll try to improve on
>   it here.
>
>   First, thanks for your response. Second, I am familiar with that
>   very well written and helpful HOW-TO.
>
>   I succeeded in getting checkpointing and GE working together for
>   serial applications.

Great. As the Howto states, LAM-MPI isn't covered, so someone has to
look into implementing it and adjust the scripts in the Howto.

The big difference when using LAM-MPI and BLCR together is that
outside of SGE the LAM daemons are running all the time, while in a
tight SGE integration each job gets its own set of daemons, which adds
a level of complexity to the checkpointing process. Maybe a reference
to the name of the daemon (which depends on the job number) is stored
in the checkpoint file; a restart outside that job would then look for
a daemon that no longer exists.

Maybe you can try a Loose Integration to see whether that works.
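
In a Loose Integration the job script itself would boot LAM on the
hosts SGE granted, instead of SGE starting per-job daemons. A minimal
sketch of the first step, assuming the usual $PE_HOSTFILE layout (host,
slots, queue, processor range per line) and a LAM boot schema of the
form "host cpu=N". The function name is hypothetical, and hezi-2 is a
made-up second host for illustration:

```python
def pe_hostfile_to_lam_schema(pe_hostfile_text):
    """Turn the text of SGE's $PE_HOSTFILE into LAM boot schema lines.

    Assumed input line:  <host> <slots> <queue> <processor range>
    Assumed output line: <host> cpu=<slots>
    """
    schema_lines = []
    for raw in pe_hostfile_text.splitlines():
        fields = raw.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        host, slots = fields[0], int(fields[1])
        schema_lines.append(f"{host} cpu={slots}")
    return "".join(line + "\n" for line in schema_lines)


if __name__ == "__main__":
    example = (
        "hezi-1 2 all.q@hezi-1 UNDEFINED\n"
        "hezi-2 4 all.q@hezi-2 UNDEFINED\n"
    )
    print(pe_hostfile_to_lam_schema(example), end="")
```

The job script could write the result to a file, pass that file to
lamboot before starting the application, and call lamhalt at the end.
Whether checkpoints taken this way restart cleanly still has to be
tested.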

-- Reuti


>   I now need to do the same for parallel applications.
>   Since I am using BLCR to do checkpointing, and LAM has integrated
>   BLCR I decided to try LAM (7.1.4).
>
>   I managed to get things working from the command line, but when I
>   take the checkpoint from GE, the checkpoint files can't restart
>   the application, neither from GE nor from the command line.
>
>   I am getting "kernel: Skipping a socket" in the log.
>
>   I'd appreciate any ideas.
>
>                                Regards,
>                                  Jerry
>
>
>
>
>
>> Hi,
>>
>> On 19.11.2008 at 11:05, Jerry Mersel wrote:
>>
>>> Hi:
>>>
>>>   I got lam, with tight_integration, working with GE. I also have it
>>>   working with blcr checkpointing outside of GE. Using GE, however,
>>>   the checkpointing does not work properly.
>>>
>>>   I see in /var/log/messages:
>>>
>>> Nov 19 11:34:09 hezi-1 kernel: Retry on -CR_ENOSUPPORT
>>> Nov 19 11:34:19 hezi-1 kernel: Skipping a socket.
>>>
>>>
>>> I'm using lam 7.1.4, blcr 0.6.4 and GE 6.1U4.
>>
>> Did you follow the Howto:
>> http://gridengine.sunsource.net/howto/APSTC-TB-2004-005.pdf ?
>>
>> -- Reuti
>>
>>> The checkpointed files appear not to be good.
>>>
>>> Has anyone succeeded with this?
>>>
>>> ------------------------------------------------------
>>> http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=89048
>>>
>>> To unsubscribe from this discussion, e-mail:
>>> [users-unsubscribe at gridengine.sunsource.net].
>>>

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=89343

To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].


