[GE users] lam and blcr checkpointing

Jerry Mersel jerry.mersel at weizmann.ac.il
Sun Nov 23 06:56:06 GMT 2008


Hi Reuti:

Thanks for replying.
I also tried loose integration (with rsh), but the results were the same.


                                Regards,
                                  Jerry


> Hi Jerry,
>
> On 19.11.2008 at 15:16, Jerry Mersel wrote:
>
>> Hi Reuti:
>>
>>   Sorry I was very unclear in my last email. I'll try to improve it
>> here.
>>
>>   First, thanks for your response. Second, I am familiar with that
>>   very well-written and helpful HOW-TO.
>>
>>   I succeeded in getting checkpointing and GE working together for
>>   serial applications.
>
> Great. As the Howto states, LAM-MPI isn't covered, so someone has to
> look into implementing it and adjust the scripts in the Howto.
>
> The big difference when using LAM-MPI and BLCR together is that
> outside of SGE the daemons are running all the time, while in a tight
> SGE integration each job gets its own set of daemons, which adds a
> level of complexity to the checkpointing process. Maybe a reference
> to the name of the daemon (which depends on the job number) is stored
> in the checkpoint file.
>
> Maybe you can try a Loose Integration to see whether this works.
>
> -- Reuti
>
>
>>   I now need to do the same for parallel applications.
>>   Since I am using BLCR to do checkpointing, and LAM has integrated
>>   BLCR I decided to try LAM (7.1.4).
>>
>>   I managed to get things working from the command line, but when I
>>   checkpoint from GE, the checkpoint files can't restart the
>>   application, neither from GE nor from the command line.
>>
>>   I am getting  kernel: Skipping a socket
>>
>>   I'd appreciate any ideas.
>>
>>                                Regards,
>>                                  Jerry
>>
>>
>>
>>
>>
>>> Hi,
>>>
>>> On 19.11.2008 at 11:05, Jerry Mersel wrote:
>>>
>>>> Hi:
>>>>
>>>>   I got lam, with tight_integration, working with GE. I also have it
>>>>   working with blcr checkpointing outside of GE. Using GE, however,
>>>>   the checkpointing does not work properly.
>>>>
>>>>   I see in /var/log/messages:
>>>>
>>>> Nov 19 11:34:09 hezi-1 kernel: Retry on -CR_ENOSUPPORT
>>>> Nov 19 11:34:19 hezi-1 kernel: Skipping a socket.
>>>>
>>>>
>>>> I'm using lam 7.1.4, blcr 0.6.4 and GE 6.1U4.
>>>
>>> you followed the Howto:
>>> http://gridengine.sunsource.net/howto/APSTC-TB-2004-005.pdf ?
>>>
>>> -- Reuti
>>>
>>>> The checkpointed files appear not to be good.
>>>>
>>>> Has anyone succeeded with this?
>>>>
>>>> ------------------------------------------------------
>>>> http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=89048
>>>>
>>>> To unsubscribe from this discussion, e-mail:
>>>> [users-unsubscribe at gridengine.sunsource.net].
>>>>

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=89573

To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].


